We place great emphasis on model explanations being faithful to model behavior. Ideally, feature importance explanations should surface and appropriately quantify all and only those factors that are causally responsible for the prediction. This matters especially when explanations need to be legally compliant and actionable.
How do we differentiate between features that are correlated with the outcome and those that cause the outcome? In other words, how do we think about the causality of a feature to a model output, or to a real-world task?
Causality in Models Is Hard
When explaining a model prediction, we want to quantify the contribution of each causal feature. In a credit risk model, we might want to know how important income or zip code is to the prediction.
Note that zip code may be causal to a model's prediction (changing zip code may change the model prediction) even though it may not be causal to the underlying task (changing zip code may not change whether a loan should be granted). However, these two things become related if the model's output is used in real-world decisions.
The good news is that since we have input-output access to the model, we can probe it with arbitrary inputs. This lets us examine counterfactuals: inputs that differ from the one whose prediction we are explaining. These counterfactuals might exist elsewhere in the dataset, or they might not.
Shapley values offer an elegant, axiomatic approach to quantify feature contributions. One challenge is that they rely on probing with an exponentially large set of counterfactuals. Several approaches approximate Shapley values, especially for specific model types.
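To make the approximation point concrete, here is a minimal Monte Carlo sketch of a Shapley estimate for a single prediction. The names model, x, and baseline are hypothetical stand-ins rather than anything defined above, the baseline value function is only one of several possible choices, and production libraries implement far more refined estimators.

```python
import numpy as np

def shapley_estimate(model, x, baseline, n_samples=1000, seed=None):
    """Monte Carlo estimate of Shapley values for one prediction.

    model    : callable mapping a 2D feature array to a 1D array of predictions
    x        : 1D array, the instance being explained
    baseline : 1D array, a reference input used to represent "absent" features
    """
    rng = np.random.default_rng(seed)
    n_features = len(x)
    phi = np.zeros(n_features)

    for _ in range(n_samples):
        order = rng.permutation(n_features)    # random feature ordering
        current = baseline.copy()
        prev_pred = model(current[None, :])[0]
        for j in order:
            current[j] = x[j]                  # "switch on" feature j
            pred = model(current[None, :])[0]
            phi[j] += pred - prev_pred         # marginal contribution of j
            prev_pred = pred

    return phi / n_samples
```

The number of model probes grows with the number of sampled permutations times the number of features, rather than with the full exponential set of feature subsets.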
The Correlation Trap
A more fundamental challenge emerges when features are correlated: changing one feature in isolation can produce counterfactuals that are unrealistic, in the sense that they would rarely or never occur in the real data. There is no clear consensus on how explanations should handle such counterfactuals.
It is tempting to rely on observational data: using observed data to define counterfactuals for Shapley values, or fitting an interpretable model to mimic the main model's predictions. But this can be dangerous.
Consider a credit risk model with features including income and zip code. Say the model internally relies only on zip code (it redlines applicants). Explanations based on observational data might show that income, by virtue of being correlated with zip code, is just as predictive of the model's output as zip code itself. This may mislead us into explaining the model's output in terms of income.
A naive explanation algorithm will split attributions equally between two perfectly correlated features. The explanation looks reasonable but is fundamentally misleading.
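A small simulation with fabricated data illustrates the trap. The black_box below depends only on a zip-code risk score, yet a regularized linear surrogate fitted to mimic its predictions on the correlated, observational data spreads its weight across both features. All names and numbers are made up for illustration; this is a sketch, not a recipe.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 10_000

# Perfectly correlated features: income carries the same signal as zip_risk.
zip_risk = rng.normal(size=n)
income = zip_risk.copy()
X = np.column_stack([income, zip_risk])

# The "black box" redlines: its output depends on zip_risk only.
def black_box(X):
    return X[:, 1]

# Observational surrogate: a regularized linear model fitted to mimic the
# black box on the correlated data, as one might do for a post-hoc explanation.
surrogate = Ridge(alpha=1.0).fit(X, black_box(X))
print("surrogate coefficients [income, zip_risk]:", surrogate.coef_)
# Prints roughly equal weights (about 0.5 each): the surrogate attributes half
# of the behaviour to income, a feature the black box never uses.
```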
Intervention Reveals Causation
To learn more, we can intervene on features. One counterfactual that changes zip code but not income reveals that zip code causes the prediction to change. A second counterfactual that changes income but not zip code reveals that income does not affect the prediction.
These two together allow us to conclude that zip code is causal to the model's prediction, and income is not.
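Continuing the same toy setup (black_box and the [income, zip_risk] encoding are made-up conveniences from the previous sketch, not a real model), the two interventional probes look like this:

```python
import numpy as np

# Toy black box from the previous sketch: output depends on zip_risk only.
def black_box(X):
    return X[:, 1]

x = np.array([[1.2, 0.8]])       # one applicant, encoded as [income, zip_risk]

# Counterfactual 1: intervene on zip_risk, hold income fixed.
x_zip = x.copy()
x_zip[0, 1] = -0.5
print("effect of changing zip_risk:", black_box(x) - black_box(x_zip))  # non-zero

# Counterfactual 2: intervene on income, hold zip_risk fixed.
x_inc = x.copy()
x_inc[0, 0] = 5.0
print("effect of changing income:  ", black_box(x) - black_box(x_inc))  # zero
```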
Explaining causality requires the right counterfactuals.
Causality in the Real World Is Harder
Above we outlined a method for explaining causality in models: study what happens when features change. To do the same in the real world, you have to apply interventions and compare outcomes. The standard design for this is the randomized controlled trial (or A/B test when there are two variants).
You divide a population into groups at random and apply a different intervention to each group. Randomization ensures that, on average, the only systematic difference among the groups is the intervention each one received. Therefore, you can conclude that your intervention causes any measurable differences in outcomes.
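As a minimal sketch with fabricated numbers, the simulation below randomly assigns units to control and treatment, applies a treatment effect we chose ourselves, and recovers it from the difference in group means. A real trial would also involve a pre-specified analysis plan, a power calculation, and a proper significance test.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000

# Each unit has an unobserved baseline outcome.
baseline = rng.normal(loc=10.0, scale=2.0, size=n)

# Randomly assign units to control (0) or treatment (1).
assignment = rng.integers(0, 2, size=n)

# Apply the intervention: treatment adds a (made-up) effect of +0.5.
outcome = baseline + 0.5 * assignment + rng.normal(scale=1.0, size=n)

treated = outcome[assignment == 1]
control = outcome[assignment == 0]

# Because assignment was random, the difference in means is an unbiased
# estimate of the causal effect of the intervention.
effect = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
print(f"estimated effect: {effect:.2f} +/- {1.96 * se:.2f} (true effect 0.5)")
```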
The challenge is that not all interventions are feasible. You cannot ethically ask someone to take up smoking. In the real world, you may not be able to get the data needed to properly examine causality.
We can probe models as we wish, but not people.
Natural experiments, used widely in epidemiology and economics, can provide opportunities to study situations where we would not normally intervene. However, they offer a limited toolkit, leaving many questions up for debate.
Implications for AI Practice
Understanding the distinction between correlation and causation has practical implications for responsible AI.
For model development, be wary of explanations that attribute importance to correlated features. Test whether attribution shifts when you intervene on individual features.
For adverse action notices, explanations should point to factors the applicant can actually act on. If income appears important but the model actually relies on zip code, telling applicants to earn more is misleading and potentially harmful.
For bias detection, correlations with protected characteristics may indicate fairness problems even when protected attributes are not explicit inputs. Causal analysis can help distinguish genuine risk factors from proxy discrimination; a simple proxy check is sketched below.
For AI governance, documentation should distinguish between features that are correlated with outcomes and features that causally drive model predictions. This distinction matters for regulatory compliance and risk management.
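Returning to the bias-detection point above: a common first diagnostic, sketched here with entirely synthetic data and hypothetical feature names, is to check how well the model's inputs can reconstruct a protected attribute. High predictability means the attribute can still influence the model even though it is not an explicit input. This is a screening check, not a causal analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 5_000

# Synthetic example: a protected attribute and two model inputs,
# one of which (zip_risk) is strongly associated with the attribute.
protected = rng.integers(0, 2, size=n)
zip_risk = protected + 0.3 * rng.normal(size=n)   # strong proxy
income = rng.normal(size=n)                        # unrelated

model_inputs = np.column_stack([income, zip_risk])

# Proxy check: how well do the model's inputs predict the protected attribute?
auc = cross_val_score(LogisticRegression(), model_inputs, protected,
                      scoring="roc_auc", cv=5).mean()
print(f"protected attribute recoverable from inputs, AUC = {auc:.2f}")
# An AUC far above 0.5 means the inputs encode the protected attribute, so
# excluding it from the feature list does not guarantee it has no effect.
```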
The Limits of Explainability
Explaining causality in models is hard. Explaining causality in the real world is even harder. These limitations do not mean explainability is worthless. They mean practitioners need humility about what explanations can and cannot tell us.
Feature importance methods provide valuable insight into model behavior. But they describe what the model has learned, not necessarily what is true about the world. A model may have learned spurious correlations that produce accurate predictions on training data but fail in deployment or cause harm to specific groups.
Effective explainability practice acknowledges these limitations while still extracting genuine value from the explanations we can compute.
