As machine learning models spread to high-stakes domains like lending, hiring, and healthcare, the need to explain their predictions grows from regulatory, operational, and societal perspectives. Two families of explainability methods have emerged: attribution-based explanations and counterfactual explanations.
Both approaches examine counterfactual inputs, yet they yield very different explanations. Understanding this difference is essential for practitioners implementing explainability in production systems.
What Are Counterfactual Explanations
Counterfactual explanations answer the question: how should the input change to obtain a different, more favorable prediction?
For instance, one could explain a credit rejection by saying: "Had you earned $5,000 more, your request for credit would have been approved." Counterfactual explanations are attractive because they are easy to comprehend and can offer a path of recourse to people receiving unfavorable decisions. Some researchers suggest counterfactual explanations may better serve the intended purpose of adverse action notices.
Obtaining a counterfactual explanation involves identifying the closest point to the current input that results in a different prediction. While this sounds simple, several challenges emerge.
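As a rough sketch, the search can be posed as: among perturbed inputs whose prediction differs from the original, pick the one closest to the original. The snippet below assumes a hypothetical black-box `predict` callable and a plain Euclidean distance; both are placeholders, not choices any particular method prescribes.

```python
import numpy as np

def find_counterfactual(predict, x, n_samples=5000, max_radius=3.0, seed=0):
    """Random-search sketch: find a nearby input whose prediction differs from predict(x).

    predict: black-box callable mapping a 1-D numpy array to a class label.
    x: the input to explain, as a 1-D numpy array of numeric features.
    """
    rng = np.random.default_rng(seed)
    original = predict(x)
    best, best_dist = None, np.inf
    for _ in range(n_samples):
        # Sample a candidate around x; the radius is drawn at random so both
        # small and large perturbations get probed.
        candidate = x + rng.normal(scale=rng.uniform(0.1, max_radius), size=x.shape)
        if predict(candidate) != original:
            dist = np.linalg.norm(candidate - x)  # Euclidean "cost" of the change
            if dist < best_dist:
                best, best_dist = candidate, dist
    return best, best_dist  # best is None if no prediction flip was found
```

Random search here is only a stand-in; practical methods use gradient-based optimization, integer programming, or model-specific solvers, and the next sections explain why the naive version above falls short.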
The Distance Problem
Defining "closest" is tricky because features vary on different scales, and costs may vary non-linearly with feature values. Income may vary in tens of thousands while FICO scores vary in hundreds. Some approaches measure cost in terms of shifts over a data distribution. Others rely on domain experts to supply distance functions.
Computational Complexity
Solving this optimization problem is computationally challenging, especially with categorical features, which turn it into a combinatorial optimization problem. Even for linear models, finding optimal solutions requires integer programming. For tree ensembles, even deciding whether any perturbation achieves a desired outcome is NP-complete. For black-box models, where the mathematical relationship between prediction and input is hidden, one may only be able to afford approximate solutions.
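To see why the general problem is harder than the simplest case: for a linear scorer with an unconstrained Euclidean distance, the nearest prediction-flipping point is just the orthogonal projection onto the decision boundary, as in the sketch below. As soon as features must stay integer-valued, categorical, or within realistic ranges, that shortcut disappears.

```python
import numpy as np

def linear_counterfactual(w, b, x, margin=1e-6):
    """Closest point in Euclidean distance that crosses the boundary w.x + b = 0
    of a linear scorer: the orthogonal projection onto the hyperplane, nudged past it."""
    w, x = np.asarray(w, float), np.asarray(x, float)
    score = np.dot(w, x) + b
    step = (score / np.dot(w, w)) * w   # component of x's offset along w
    return x - step * (1 + margin)      # project onto the boundary, then step slightly beyond
```

None of that closed-form structure survives realistic feature constraints, which is where the integer-programming and NP-completeness results above come in.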
Feasibility Constraints
For suggested recourse to be practical, the perturbation must be feasible. Some approaches address this by modeling the data manifold and restricting perturbations to lie on it, but staying on the manifold alone may still be insufficient.
Practical recourse must also respect real-world constraints and causal dependencies between features. Acting on a suggested recourse of "increase your income" almost always changes job tenure as well: waiting for a raise increases tenure, while taking a new job resets it. That unforeseen change may adversely affect the prediction even when the suggested recourse is followed.
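The simpler feasibility rules, though not the causal dependencies, can at least be encoded as a filter on candidate perturbations. The feature names and rules below are hypothetical and purely illustrative.

```python
IMMUTABLE = {"num_past_delinquencies"}        # the applicant cannot change these
MONOTONE_UP = {"age", "job_tenure_years"}     # these can realistically only increase

def is_feasible(original: dict, candidate: dict) -> bool:
    """Reject candidate counterfactuals that demand impossible or backwards changes."""
    for feature, new_value in candidate.items():
        old_value = original[feature]
        if feature in IMMUTABLE and new_value != old_value:
            return False
        if feature in MONOTONE_UP and new_value < old_value:
            return False
    return True
```

A filter like this captures only static rules; the income/tenure dependency described above requires a causal model of how features move together.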
What Are Attribution-Based Explanations
Attribution-based explanations answer the question: which features contributed most to this prediction?
The explanation quantifies the impact (called attribution) of each feature on the prediction. For a lending model, the explanation might note that a rejection was due to income being low and number of past delinquencies being high. SHAP, LIME, and Integrated Gradients are popular attribution methods.
Similar to counterfactual methods, most attribution methods compare the input at hand to counterfactual inputs (often called reference points or baselines). However, the role of counterfactuals here is to tease apart relative importance of features rather than identify new instances with favorable predictions.
SHAP, based on Shapley values from game theory, considers counterfactuals that "turn off" features and measures the marginal effect on the prediction. Integrated Gradients examines gradients at counterfactual points interpolating between the input and an "all off" baseline input. LIME examines counterfactuals that randomly perturb features in the vicinity of the input.
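As a concrete illustration of the interpolation idea, here is a minimal Integrated Gradients sketch for a scalar-valued model, using finite-difference gradients so it stays self-contained; a production implementation would rely on the framework's automatic differentiation.

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Approximate Integrated Gradients of a scalar-valued model f at x relative to a baseline."""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    total_grad = np.zeros_like(x)
    for alpha in np.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)   # walk the straight-line path baseline -> x
        grad = np.array([
            (f(point + eps * e) - f(point - eps * e)) / (2 * eps)   # central finite difference
            for e in np.eye(len(x))
        ])
        total_grad += grad
    # Attribution = (x - baseline) times the average gradient seen along the path.
    return (x - baseline) * total_grad / steps
```

For `f = lambda v: v[0] * v[1]` with `x = [2.0, 3.0]` and a zero baseline, this yields attributions of roughly 3 and 3, which sum to the model output as the method intends.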
Attribution Challenges
Defining counterfactuals that "turn off" a feature is tricky. What does it mean to turn off the income feature? Setting it to zero creates an atypical counterfactual. Some approaches use the median of the training distribution; others average over samples drawn from a background distribution.
Attributions are highly sensitive to the choice of counterfactuals. Various Shapley-based methods choose different distributions, leading to drastically different, sometimes misleading, attributions.
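The sensitivity is easy to reproduce even with exact Shapley values on a two-feature toy model: switching the reference point from zeros to a made-up "training median" changes both the magnitudes and the sign of the attributions. The model and numbers below are invented for illustration.

```python
def shapley_two_features(f, x, reference):
    """Exact Shapley values for a two-feature model, where 'turning off' a feature
    means replacing it with the corresponding reference value."""
    r = reference
    phi_1 = 0.5 * ((f([x[0], r[1]]) - f(r)) + (f(x) - f([r[0], x[1]])))
    phi_2 = 0.5 * ((f([r[0], x[1]]) - f(r)) + (f(x) - f([x[0], r[1]])))
    return phi_1, phi_2

# Hypothetical score: income (in $10k) interacting with credit utilization.
score = lambda v: v[0] * (1.0 - v[1])
applicant = [8.0, 0.2]

print(shapley_two_features(score, applicant, reference=[0.0, 0.0]))  # zero reference -> approx (7.2, -0.8)
print(shapley_two_features(score, applicant, reference=[5.0, 0.5]))  # "median" reference -> approx (1.95, 1.95)
```

Both runs are faithful Shapley computations; the second feature's attribution flips sign purely because the reference changed.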
Attribution methods are also known to be sensitive to perturbations of the input: a small perturbation that leaves the prediction unchanged may still alter the attributions.
The Fundamental Difference
The two explanations are fundamentally different and complementary.
Attributions quantify the importance of features for the current prediction, while counterfactual explanations show how features should change to obtain a different prediction.
A feature highlighted by a counterfactual explanation may not receive a large attribution. If most candidates in the accepted class have zero capital gains income, then a candidate with zero capital gains income will see most of the attribution fall on other features. Yet increasing capital gains income may still be a valid recourse toward a favorable prediction.
Similarly, a feature with a large attribution may not be highlighted by a counterfactual explanation. Many features, such as the number of past delinquencies, are immutable and therefore unusable for recourse. And when features interact, changing a single feature may not suffice: a model requiring both a credit score above 650 and an income above $20,000 will not change its prediction when these features are perturbed one at a time.
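The interaction case can be checked directly with a toy rule: perturbing either feature alone never flips the decision, so any useful counterfactual must move both at once. The thresholds below are the hypothetical ones from the example above.

```python
def approve(credit_score: float, income: float) -> bool:
    """Toy model: approval requires BOTH a score above 650 AND income above $20,000."""
    return credit_score > 650 and income > 20_000

applicant = dict(credit_score=600, income=15_000)

print(approve(**applicant))                          # False
print(approve(credit_score=700, income=15_000))      # still False: raising the score alone is not enough
print(approve(credit_score=600, income=30_000))      # still False: raising the income alone is not enough
print(approve(credit_score=700, income=30_000))      # True: recourse requires changing both features
```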
Using Both Approaches
Given their complementary nature, responsible AI practice should support both kinds of explanations. Together they offer a more complete picture of the model-data relationship. Attributions provide visibility into the decision-making process. Counterfactuals give actionable insight that can help with recourse.
Organizations that implement AI governance effectively recognize that no single explanation method tells the whole story. Different stakeholders need different kinds of explanations. Regulators may need attribution-based evidence that decisions are justified. Consumers may need counterfactual guidance on what they can do to get a different outcome.
Building comprehensive explainability infrastructure means implementing both approaches, understanding their limitations, and communicating appropriately to different audiences.
