Model Interpretability - Why Your Model Said That // Megha Bose

Helpful context:

A bank denies a loan. The applicant asks why. “Our model said so” is not an acceptable answer - in many jurisdictions it is illegal. A hospital uses ML to flag high-risk patients. A doctor needs to know which factors drove the risk score before acting on it. A self-driving system brakes unexpectedly. An engineer needs to understand what the model saw.

Interpretability is not a nice-to-have. It is how you debug models, build trust, comply with regulation, and catch models that learned the wrong thing. A model that achieves 94% accuracy because it learned to use a spurious feature (patient zip code, transaction device type) can fail in deployment in ways that pure accuracy metrics never reveal.

The Spectrum of Interpretability

Some models are interpretable by design:

A linear regression with 5 features has coefficients that directly say “increasing age by 1 unit changes the prediction by X.”
A decision tree of depth 3 can be drawn on a whiteboard and followed by a human.

The problem is that these models are often less accurate than more complex alternatives on real-world problems. The tension between accuracy and interpretability has driven a whole field of post-hoc explanation methods - techniques that look at a trained black-box model and explain its behavior after the fact.

Feature Importance: The Global Picture

Before diving into per-prediction explanations, start with global feature importance. Which features does the model rely on most?

Permutation importance: for each feature, shuffle its values randomly across the dataset (breaking its correlation with the target), recompute predictions, and measure how much worse the model gets. A feature that matters causes a large accuracy drop when shuffled. A feature that is redundant or irrelevant causes little change.

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
# result.importances_mean[i] = mean drop in score when feature i is shuffled
sorted_idx = result.importances_mean.argsort()[::-1]

Permutation importance is model-agnostic and measures actual impact on held-out data, not training data. The main risk: if two features are correlated, shuffling one still leaves part of the information in the other, so correlated features can both appear unimportant even when the information they carry is critical.
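
To make the shuffle-and-rescore procedure concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements it; the toy data, `toy_model`, and the helper names are all made up for illustration. The label depends only on feature 0, so shuffling feature 0 should hurt accuracy and shuffling feature 1 should not.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (hypothetical): the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(42)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y_toy = [int(row[0] > 0.5) for row in X_toy]

def toy_model(row):
    """Stand-in for a trained model: thresholds feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)

importances = permutation_importance_sketch(toy_model, X_toy, y_toy)
print(importances)  # large drop for feature 0, no drop for feature 1
```

Because `toy_model` never reads feature 1, shuffling it leaves every prediction unchanged, so its importance is exactly zero.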

Tree-based feature importance: tree models (random forests, XGBoost, LightGBM) compute built-in importance as the total gain from splits on each feature across all trees. This is fast but biased toward high-cardinality features (continuous features with many split points) and reflects training-set importance rather than generalization.

Partial Dependence Plots: How a Feature Affects Predictions

Permutation importance tells you which features matter. Partial Dependence Plots (PDPs) tell you how - the shape of the relationship between a feature and the model’s output.

For feature $j$, a PDP shows the average prediction across the dataset as we vary feature $j$’s value while keeping all other features at their observed values.

$$\text{PDP}_j(v) = \frac{1}{n} \sum_{i=1}^n f(x_i^{(-j)}, v)$$

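where $x_i^{(-j)}$ denotes the $i$-th example’s values for every feature except $j$, so $f(x_i^{(-j)}, v)$ is the model’s prediction for example $i$ with feature $j$ set to $v$. You sweep $v$ across its range and plot the average prediction.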

from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    model, X_train, features=[0, 2, (0, 2)],  # single features + interaction
    kind='average'
)

A PDP that rises steeply over some range tells you the model believes that region matters most. A flat PDP means the feature has little effect. Two-feature PDPs show interaction effects.

Limitation: PDPs show averages. If the relationship is heterogeneous - the feature matters a lot for some subgroups and not at all for others - the average can be misleading. Individual Conditional Expectation (ICE) plots show per-example curves instead of the average.
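
The averaging problem is easy to reproduce by hand. The sketch below computes ICE curves and their average (the PDP) for a made-up model in which feature 0 raises the prediction for one subgroup and lowers it for the other. `toy_model`, `X_toy`, and the grid are hypothetical, and the helpers are illustrative rather than scikit-learn's implementation.

```python
def ice_curves(predict, X, j, grid):
    """One curve per example: predictions as feature j is swept over the grid."""
    curves = []
    for row in X:
        curves.append([predict(row[:j] + [v] + row[j + 1:]) for v in grid])
    return curves

def partial_dependence(predict, X, j, grid):
    """PDP_j(v): the ICE curves averaged over examples at each grid value."""
    curves = ice_curves(predict, X, j, grid)
    n = len(curves)
    return [sum(curve[k] for curve in curves) / n for k in range(len(grid))]

def toy_model(row):
    """Hypothetical model: feature 0 helps when feature 1 is 1, hurts when it is 0."""
    x0, x1 = row
    return x0 if x1 == 1 else -x0

X_toy = [[0.2, 0], [0.5, 0], [0.3, 1], [0.9, 1]]
grid = [0.0, 0.5, 1.0]

pdp = partial_dependence(toy_model, X_toy, 0, grid)
ice = ice_curves(toy_model, X_toy, 0, grid)
print(pdp)     # flat: the two subgroups cancel out in the average
print(ice[0])  # falling curve for an example with feature 1 == 0
print(ice[3])  # rising curve for an example with feature 1 == 1
```

The PDP is flat even though feature 0 changes every individual prediction, which is the case where ICE curves are worth plotting.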

SHAP: Shapley Values From Game Theory

SHAP (SHapley Additive exPlanations, Lundberg and Lee, 2017) is widely treated as the standard method for feature attribution. It gives each feature a numeric contribution to a specific prediction - not just globally but locally, for each individual example.

The mathematics comes from cooperative game theory. Imagine the features as players in a coalition game. The model’s prediction is the “payout.” Each feature’s Shapley value is its fair share of the payout, averaging its marginal contribution over all possible orderings in which features could join the prediction.

Formally, the Shapley value of feature $j$ for prediction $f(x)$ is:

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!(|F|-|S|-1)!}{|F|!} \left[ f(x_{S \cup \{j\}}) - f(x_S) \right]$$

where $F$ is the full set of features, $S$ ranges over all subsets not containing $j$, and $f(x_S)$ is the model’s prediction using only the features in $S$ (with remaining features averaged out). The factorial weight is the probability that, in a uniformly random ordering of the features, exactly the features in $S$ come before $j$.

This satisfies four properties that make it uniquely “fair”:

Efficiency: SHAP values sum to the difference between the prediction and the baseline (average prediction): $\sum_j \phi_j = f(x) - \mathbb{E}[f(X)]$.
Symmetry: if two features always contribute equally, they get equal Shapley values.
Dummy: if a feature never changes any prediction, its Shapley value is 0.
Additivity: for a model that is the sum of two sub-models, each feature’s Shapley value equals the sum of its Shapley values in the two sub-models.

These properties are what distinguish Shapley values from simpler attribution schemes. Many intuitive approaches (like simply using gradients or split counts) violate one or more of these, leading to counterintuitive attributions.

The problem with naive Shapley: computing it exactly requires evaluating the model on $2^{|F|}$ feature subsets. With 100 features, that is $2^{100}$ evaluations - completely intractable.
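
For a handful of features the brute-force computation is still feasible, and writing it out shows where the exponential cost comes from. The sketch below assumes that "averaging out" a missing feature means replacing it with values from a small background set. That is one common choice, not the only one. `toy_model`, `background`, and `x` are invented for illustration.

```python
from itertools import combinations
from math import factorial

def value(model, x, background, subset):
    """f(x_S): features in `subset` come from x, the rest are averaged over background rows."""
    total = 0.0
    for b in background:
        z = [x[k] if k in subset else b[k] for k in range(len(x))]
        total += model(z)
    return total / len(background)

def shapley_values(model, x, background):
    """Exact Shapley values by enumerating every subset S of the other features."""
    n = len(x)
    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        phi_j = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! from the formula above.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                with_j = value(model, x, background, set(S) | {j})
                without_j = value(model, x, background, set(S))
                phi_j += weight * (with_j - without_j)
        phi.append(phi_j)
    return phi

def toy_model(z):
    """Hypothetical model with an interaction term and one unused feature."""
    return 2.0 * z[0] + z[1] * z[2] + 0.0 * z[3]

background = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 1.0, 5.0], [2.0, 1.0, 3.0, 1.0]]
x = [3.0, 2.0, 2.0, 7.0]

phi = shapley_values(toy_model, x, background)
baseline = value(toy_model, x, background, set())
print(phi)
print(sum(phi), toy_model(x) - baseline)  # efficiency: these two agree
```

The inner loops visit every subset of the other features for every feature, so the work roughly doubles with each feature added. The unused feature 3 gets a value of zero (the dummy property), and the values sum to the prediction minus the baseline (efficiency).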

TreeSHAP: Exact and Fast for Tree Models

Lundberg et al. developed TreeSHAP (2018), an algorithm that computes exact Shapley values for tree models (XGBoost, LightGBM, random forests) in $O(T L D^2)$ time, where $T$ is the number of trees, $L$ is the maximum number of leaves in any tree, and $D$ is the maximum depth. This is polynomial, not exponential.

The key insight: in a tree, when feature $j$ is not in subset $S$, its contribution is computed by the weighted average across the left and right branches at any split on feature $j$ (weighted by the fraction of training examples in each branch). TreeSHAP exploits the tree structure to compute this average analytically during a single tree traversal.
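
The branch-averaging rule can be sketched on a tiny hand-built tree. This is only the building block, the expected output given a subset of known features. It is not the TreeSHAP algorithm itself, which avoids enumerating subsets. The dictionary layout (`feature`, `threshold`, `cover`, `value`) is a made-up format for illustration, with `cover` standing for the number of training examples that reached a node.

```python
def expected_value(node, x, subset):
    """Tree output for x when only the features in `subset` are known."""
    if "value" in node:  # leaf
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in subset:
        # Feature is known: follow the branch x actually takes.
        child = left if x[node["feature"]] <= node["threshold"] else right
        return expected_value(child, x, subset)
    # Feature is unknown: average both branches, weighted by training cover.
    total = left["cover"] + right["cover"]
    return (left["cover"] * expected_value(left, x, subset)
            + right["cover"] * expected_value(right, x, subset)) / total

# Hypothetical depth-2 tree: split on feature 0, then on feature 1.
tree = {
    "feature": 0, "threshold": 0.5, "cover": 100,
    "left": {"value": 10, "cover": 60},
    "right": {
        "feature": 1, "threshold": 2.0, "cover": 40,
        "left": {"value": 20, "cover": 10},
        "right": {"value": 40, "cover": 30},
    },
}
x = [0.8, 3.0]

v_empty = expected_value(tree, x, set())
v_0 = expected_value(tree, x, {0})
v_1 = expected_value(tree, x, {1})
v_01 = expected_value(tree, x, {0, 1})

# With two features there are only two orderings, so each gets weight 1/2.
phi_0 = 0.5 * ((v_0 - v_empty) + (v_01 - v_1))
phi_1 = 0.5 * ((v_1 - v_empty) + (v_01 - v_0))
print(v_empty, v_0, v_1, v_01)  # 20.0 35.0 22.0 40
print(phi_0, phi_1)             # 16.5 3.5
```

The two attributions sum to 20, the gap between the prediction (40) and the cover-weighted average output of the tree (20).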

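import shap

explainer = shap.TreeExplainer(model)  # works for XGBoost, LightGBM, sklearn trees
shap_values = explainer(X_test)  # returns a shap.Explanation, the object the shap.plots.* functions expect
# shap_values.values[i, j] = contribution of feature j to prediction for example i
# shap_values.base_values[i] = baseline (expected model output) for example i
# (single-output model assumed; multi-output models add an extra output axis)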

For neural networks and other models, shap.DeepExplainer (uses the DeepLIFT algorithm) and shap.KernelExplainer (model-agnostic, slower) are available.

Reading SHAP Output

Local explanation (waterfall plot): for a single prediction, shows each feature’s contribution as a bar from the baseline (average prediction) to the final prediction. Features that pushed the prediction up are shown in one direction; features that pushed it down in the other. This is how you explain a single loan denial to a customer.

# shap_values must be a shap.Explanation (e.g. from explainer(X_test)), not a raw NumPy array
shap.plots.waterfall(shap_values[0])

Global explanation (beeswarm plot): for the entire dataset, shows the distribution of SHAP values for each feature. Every dot is one example - its horizontal position is its SHAP value, and its color represents the feature value (high vs low). This simultaneously shows which features matter globally (large spread) and the direction of their effect: if the dots with high feature values sit at positive SHAP values, higher values of that feature push the prediction up.

shap.plots.beeswarm(shap_values)

Dependence plot: similar to a PDP but uses SHAP values instead of raw predictions. Plots one feature’s SHAP values on the y-axis against its raw values on the x-axis. Coloring by a second feature reveals interactions.

shap.plots.scatter(shap_values[:, "age"], color=shap_values[:, "income"])

LIME: Local Explanations via Approximation

LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al., 2016) takes a different approach: instead of tracing through the model’s internals, it fits a simple interpretable model locally around the prediction you want to explain.

For a prediction $f(x)$:

Perturb $x$ to create a neighborhood of similar examples.
Get the black-box model’s predictions on all of them.
Fit a weighted linear regression to those predictions (weighted by proximity to $x$).
The linear model’s coefficients are the explanation.
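
The four steps above can be sketched in plain Python. This is a simplified illustration, not the `lime` library's algorithm: it perturbs with Gaussian noise and uses an exponential proximity kernel, and `scale`, `kernel_width`, `toy_model`, and the helper names are all assumptions chosen for the example.

```python
import math
import random

def solve(A, b):
    """Solve A @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (M[r][n] - known) / M[r][r]
    return beta

def lime_sketch(predict, x, n_samples=2000, scale=0.1, kernel_width=0.25, seed=0):
    """Local linear surrogate around x; returns [intercept, coef_0, coef_1, ...]."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]            # 1. perturb x
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + [zi - xi for zi, xi in zip(z, x)])   # offsets from x
        targets.append(predict(z))                                # 2. query the black box
        weights.append(math.exp(-dist2 / kernel_width ** 2))      # 3. proximity weight
    # Weighted least squares via the normal equations (A^T W A) beta = A^T W y.
    k = len(x) + 1
    AtWA = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
            for i in range(k)]
    AtWy = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
            for i in range(k)]
    return solve(AtWA, AtWy)                                      # 4. coefficients

def toy_model(z):
    """Hypothetical black box: non-linear in feature 0, linear in feature 1."""
    return z[0] ** 2 + 3.0 * z[1]

coefs = lime_sketch(toy_model, [1.0, 0.0])
print(coefs)  # intercept near 1, slopes near 2 and 3 (the local gradient at x)
```

Around `x = [1.0, 0.0]` the surrogate recovers slopes close to the model's local gradient. Changing `seed` shifts the coefficients slightly, because the neighborhood is sampled at random.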

import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train, feature_names=feature_names, class_names=['no', 'yes'], mode='classification'
)
exp = explainer.explain_instance(X_test[0], model.predict_proba, num_features=10)
exp.show_in_notebook()

LIME’s trade-off: it is model-agnostic (works for any black-box, including images and text). But the linear approximation can be unstable - running LIME twice on the same example with different random seeds can give meaningfully different explanations, because the perturbation neighborhood is sampled randomly. SHAP has better theoretical guarantees and is more reproducible. Use LIME when you cannot use TreeSHAP (non-tree models) and KernelSHAP is too slow.

Choosing the Right Tool

Tool	When to use	Limitation
Permutation importance	Quick global overview; understand which features matter	Breaks with correlated features
PDP / ICE	Understand the shape of feature effects; functional form	Averages hide heterogeneity
TreeSHAP	Tree models; exact Shapley values; fast	Trees only
KernelSHAP	Any model; SHAP guarantees	Slow on large datasets
LIME	Any model including images/text; quick local explanations	Unstable; local linear approximation can be poor

In practice: use SHAP for tree models (it is exact, fast, and theoretically sound). Use LIME for other modalities where SHAP is expensive. Always pair global importance (beeswarm) with local explanations (waterfall) for individual cases.

Concept	Key point
Permutation importance	Shuffle feature; measure accuracy drop; model-agnostic global importance
PDP	Average prediction as feature varies; shows functional relationship
Shapley value	Feature’s marginal contribution averaged over all orderings; uniquely fair
Efficiency	SHAP values sum to prediction minus baseline
TreeSHAP	Exact Shapley for trees in polynomial time; preferred for XGBoost/LightGBM
Beeswarm plot	Global SHAP: distribution of values per feature across dataset
Waterfall plot	Local SHAP: each feature’s push on one prediction
LIME	Local linear approximation around one prediction; model-agnostic but unstable
