Imbalanced Classification - When 99% Accuracy Means the Model Is Useless // Megha Bose


Imagine you build a cancer screening model. It achieves 99% accuracy. You deploy it proudly. Then you realize: cancer affects 1% of the population. Your model predicts “healthy” for everyone and still gets 99% right. It has never correctly identified a single cancer case. Accuracy told you nothing useful.

This is the central trap of imbalanced classification. When one class is rare - fraud in transactions, failures in manufacturing, diseases in screening - the naive model learns to ignore it. The cost of this mistake can be catastrophic, which is exactly why these are the problems where getting the ML right matters most.

Why Accuracy Fails

Accuracy = (correct predictions) / (all predictions). In a dataset that is 99% class 0 and 1% class 1, a model that always predicts 0 achieves 99% accuracy while having 0% recall on the minority class.
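The trap is easy to reproduce in a few lines. The sketch below uses a hypothetical 10,000-row dataset with a 99:1 split and a "model" that ignores its input entirely; the data and variable names are made up for illustration.

```python
# Hypothetical dataset: 9,900 negatives (class 0) and 100 positives (class 1).
y_true = [0] * 9900 + [1] * 100

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
minority_recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2%}")                # accuracy = 99.00%
print(f"minority recall = {minority_recall:.2%}")  # minority recall = 0.00%
```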

The fix is to use metrics that do not let the majority class dominate:

Precision: of all the examples the model labeled positive, what fraction actually were positive? $\text{Precision} = TP / (TP + FP)$. Low precision means many false alarms.

Recall (Sensitivity): of all actual positive examples, what fraction did the model find? $\text{Recall} = TP / (TP + FN)$. Low recall means missing real cases.

F1 score: harmonic mean of precision and recall: $F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$. Useful single number, but still hides which tradeoff you made.
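As a minimal sketch, the three formulas above translate directly into code. The counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions dictate.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: the model flags 200 cases, 80 of them real,
# and misses 20 real positives.
p, r, f1 = precision_recall_f1(tp=80, fp=120, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
# precision=0.40 recall=0.80 f1=0.533
```

Swapping the two error counts (fp=20, fn=120) gives precision 0.80 and recall 0.40 with the same F1, which is why F1 alone does not tell you which tradeoff was made.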

AUC-ROC: the area under the Receiver Operating Characteristic curve, which plots true positive rate vs false positive rate at every threshold. Measures the model’s ability to rank positives above negatives. A score of 0.5 is random; 1.0 is perfect. AUC-ROC tells you about ranking quality without committing to a threshold.

AUC-PR (Precision-Recall curve): plots precision vs recall at every threshold. More informative than AUC-ROC for severe class imbalance, because AUC-ROC can look high even when precision is low. A model with AUC-ROC of 0.95 can still have very low precision at high recall, which the PR curve reveals.

For highly imbalanced problems, prefer AUC-PR over AUC-ROC. It tells you whether the model can find real positives without drowning you in false alarms.
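To see the gap between the two metrics, here is a rough sketch on synthetic scores. The Gaussian score distributions and the helper names `roc_auc` and `average_precision` are assumptions made for this illustration; in practice scikit-learn's `roc_auc_score` and `average_precision_score` compute the same quantities. Average precision is used here as the usual summary of the PR curve.

```python
import random

random.seed(0)

# Synthetic scores (an assumption for illustration): 10,000 negatives ~ N(0, 1)
# and 100 positives ~ N(2, 1), i.e. roughly a 1% positive rate.
neg_scores = [random.gauss(0.0, 1.0) for _ in range(10_000)]
pos_scores = [random.gauss(2.0, 1.0) for _ in range(100)]

def roc_auc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / len(pos)

auc = roc_auc(pos_scores, neg_scores)
ap = average_precision(pos_scores, neg_scores)
print(f"AUC-ROC = {auc:.3f}, average precision = {ap:.3f}")
```

On data like this the ranking looks strong by AUC-ROC, while average precision comes out much lower, because even a small false positive rate applied to 10,000 negatives swamps 100 positives.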

Threshold Tuning: The Cheapest Fix

Most classifiers output a probability (or a score), and you apply a threshold to decide the label. The default is 0.5 - predict positive if $P(\text{positive}) > 0.5$. But 0.5 is arbitrary, and for imbalanced classes the right threshold is rarely 0.5.

Lowering the threshold increases recall (you catch more real positives) at the cost of precision (more false alarms). Raising it does the opposite.
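A toy example makes the tradeoff concrete. The ten probabilities and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Hypothetical predicted probabilities and true labels for ten examples.
probs  = [0.05, 0.10, 0.15, 0.25, 0.30, 0.35, 0.45, 0.55, 0.60, 0.90]
labels = [0,    0,    0,    1,    0,    0,    1,    0,    1,    1]

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for prob > threshold."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.5, 0.2):
    precision, recall = precision_recall_at(t)
    print(f"threshold={t}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.2: precision=0.57, recall=1.00
```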

The right threshold depends on the costs in your specific domain:

In fraud detection, a false negative (missed fraud) might cost \$500. A false positive (blocked legitimate transaction) might cost \$2 in customer service time. You should weight the two errors accordingly.
In cancer screening, a false negative (missed cancer) might cost a life. A false positive leads to a follow-up test. Recall must be very high even at the cost of precision.

To choose a threshold systematically: plot the PR curve, look at the operating points, and pick the threshold that matches your domain’s cost tradeoff. Do not let the default 0.5 make this decision for you.
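When the costs can be written down, the choice can also be made numerically. The sketch below reuses the 500-to-2 cost figures from the fraud example and sweeps candidate thresholds on a small, invented validation set; `total_cost` is a hypothetical helper and correct predictions are assumed to cost nothing.

```python
COST_FN = 500.0  # missed fraud (figure from the fraud example)
COST_FP = 2.0    # blocked legitimate transaction

# Hypothetical validation-set probabilities and labels.
probs  = [0.01, 0.02, 0.03, 0.04, 0.08, 0.12, 0.30, 0.45, 0.70, 0.95]
labels = [0,    0,    0,    1,    0,    0,    1,    0,    1,    1]

def total_cost(threshold):
    """Sum of misclassification costs when predicting positive for prob > threshold."""
    cost = 0.0
    for p, y in zip(probs, labels):
        predicted_positive = p > threshold
        if y == 1 and not predicted_positive:
            cost += COST_FN
        elif y == 0 and predicted_positive:
            cost += COST_FP
    return cost

# Candidate thresholds: 0 plus every observed score.
candidates = [0.0] + sorted(set(probs))
best = min(candidates, key=total_cost)

print(f"best threshold = {best}, cost = {total_cost(best)}")  # best threshold = 0.03, cost = 6.0
print(f"cost at default 0.5 = {total_cost(0.5)}")             # cost at default 0.5 = 1000.0
```

With misses 250 times more expensive than false alarms, the cheapest operating point sits far below 0.5. On real data the sweep should run on a held-out validation set, not on the training data.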

Class Weights: Rebalancing the Loss

The simplest algorithmic fix is class weighting. Standard training treats every example equally. With class weights, you tell the model that minority class errors are more costly.

For a dataset with 99% class 0 and 1% class 1, you might use class weight 1 for class 0 and 99 for class 1. A single minority class misclassification contributes as much to the loss as 99 majority class misclassifications.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Class weights inversely proportional to class frequency
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
# 'balanced' automatically sets weights = n_samples / (n_classes * np.bincount(y))

# Or specify explicitly
clf = LogisticRegression(class_weight={0: 1, 1: 99})

Class weighting is low-cost and often surprisingly effective. Try it before any resampling strategy.
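To make the 'balanced' formula from the comment above concrete, here is a small stand-alone sketch that applies it to a hypothetical 99:1 label vector. The helper `balanced_class_weights` is written for this illustration and is not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights n_samples / (n_classes * count(class)), the 'balanced' heuristic."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y_train = [0] * 9900 + [1] * 100   # hypothetical 99:1 labels
weights = balanced_class_weights(y_train)

print(weights)                   # class 0 gets about 0.505, class 1 gets 50.0
print(weights[1] / weights[0])   # ratio of about 99
```

The ratio between the two weights matches the explicit {0: 1, 1: 99} setting; only the absolute scale differs, which may interact with regularization strength in some models.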

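For XGBoost and LightGBM, the equivalent setting is scale_pos_weight (shown here for LightGBM; XGBoost's XGBClassifier accepts a parameter of the same name):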

import lightgbm as lgb

# scale_pos_weight = count(negative) / count(positive)
clf = lgb.LGBMClassifier(scale_pos_weight=99)

Resampling: Fixing the Data Instead of the Model

If class weights are not enough, you can change the training data distribution directly.

Undersampling

Remove majority class examples until the classes are balanced (or less imbalanced).

Random undersampling: randomly remove majority examples. Simple, but you throw away real data that the model could learn from. A model trained on 2,000 samples (1,000 per class) after undersampling from 99,000:1,000 has discarded 98,000 useful majority class examples.
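A minimal sketch of random undersampling, assuming a binary problem where label 0 is the majority. The function `random_undersample` is a hypothetical helper written for this example; imbalanced-learn ships a `RandomUnderSampler` that serves the same purpose.

```python
import random

def random_undersample(X, y, majority_label=0, seed=42):
    """Keep every minority row and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(kept_majority + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical data: 990 majority rows, 10 minority rows, one feature each.
X = [[float(i)] for i in range(1000)]
y = [0] * 990 + [1] * 10

X_small, y_small = random_undersample(X, y)
print(len(y_small), sum(y_small))  # 20 10
```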

Tomek links: a Tomek link is a pair of examples from opposite classes that are each other's nearest neighbors. Remove the majority example from each pair. Unlike random removal, this cleans the decision boundary: you remove the majority examples that sit closest to minority examples and are most “confusing” to the classifier.
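The definition can be sketched with a brute-force nearest-neighbor search. This is an O(n²) illustration on made-up 1-D data, with label 0 assumed to be the majority; `tomek_links` is a hypothetical helper, and imbalanced-learn provides a `TomekLinks` sampler for real use.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j) with different labels that are mutual nearest neighbors."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Hypothetical 1-D data: the majority point at 2.9 sits right next to the minority point at 3.0.
X = [[0.0], [1.0], [2.5], [2.9], [3.0], [5.0]]
y = [0,     0,     0,     0,     1,     1]

links = tomek_links(X, y)
majority_to_drop = [i if y[i] == 0 else j for i, j in links]

print(links)             # [(3, 4)]
print(majority_to_drop)  # [3]
```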

Oversampling

Generate new minority class examples instead of removing majority ones.

Random oversampling: duplicate minority examples. This helps but risks overfitting - you are literally copying the same points multiple times, and the model can memorize them.

SMOTE (Synthetic Minority Oversampling Technique): the standard approach. For each minority example, find its $k$ nearest minority neighbors. Randomly pick one of those neighbors. Generate a new synthetic point somewhere along the line between the original and the neighbor.

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.1, k_neighbors=5, random_state=42)
# sampling_strategy=0.1 means minority/majority ratio becomes 0.1 after oversampling
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

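SMOTE creates plausible synthetic examples in the feature space rather than duplicates. It fills in the region between existing minority points rather than just repeating them, although it cannot create examples outside the range the minority class already covers.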
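The interpolation step at the heart of SMOTE fits in a few lines. This is a simplified sketch of that single step, not imbalanced-learn's implementation; the function name `smote_like_sample` and the toy points are assumptions for illustration.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbors."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [4.0, 4.0]]

rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a segment between two real minority points, so it stays inside the region the minority class already occupies.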

Warning: always apply resampling only to the training set, never to validation or test sets. Resampling is a training-time intervention; evaluation must reflect the true class distribution.

Combining Both

imblearn provides SMOTETomek and SMOTEENN, which oversample the minority class with SMOTE and then clean the boundary with Tomek links or Edited Nearest Neighbors, respectively. These tend to produce cleaner boundaries than either method alone.

Ensemble Methods Designed for Imbalance

Standard random forests can still underfit the minority class even with class weights. There are ensemble variants specifically designed for imbalance.

BalancedRandomForest: instead of drawing an ordinary bootstrap sample, each tree is trained on a balanced sample made up of minority examples plus a random subsample of majority examples of the same size. Every tree in the forest sees a balanced training set, which avoids the majority-class dominance that emerges in standard bootstrap samples. (Whether the minority examples are used as-is or bootstrapped depends on the implementation and its settings.)

from imblearn.ensemble import BalancedRandomForestClassifier

clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

EasyEnsemble: trains multiple AdaBoost classifiers, each on a different balanced subsample of the data, and combines their predictions into a single ensemble output. Competitive with BalancedRandomForest and sometimes better on highly skewed datasets.

Framing the Problem Correctly

Before reaching for SMOTE or class weights, check whether you are framing the problem correctly.

If you need to find rare events, anomaly detection might be more appropriate than classification. When you have almost no positive examples (5 fraud cases out of 1 million), fitting a classifier to those 5 examples is unreliable. Anomaly detection models what “normal” looks like and flags deviations, using all the majority class data to define normality.

If the cost structure is clear, frame the problem with business costs directly. A fraud model that blocks legitimate transactions has a different cost than one that misses fraud. $\text{Expected cost} = P(\text{FP}) \times \text{cost}_{\text{FP}} + P(\text{FN}) \times \text{cost}_{\text{FN}}$. Optimize threshold and model for expected cost, not F1.
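A short derivation shows where the cost-minimizing threshold sits. It rests on two assumptions that the text does not state: the model's output $p$ is a calibrated probability of the positive class, and correct decisions cost nothing. Flagging an example costs $(1 - p) \times \text{cost}_{\text{FP}}$ in expectation, and not flagging it costs $p \times \text{cost}_{\text{FN}}$, so flagging is the cheaper choice when:

$$(1 - p) \times \text{cost}_{\text{FP}} < p \times \text{cost}_{\text{FN}} \iff p > \frac{\text{cost}_{\text{FP}}}{\text{cost}_{\text{FP}} + \text{cost}_{\text{FN}}}$$

With the illustrative fraud figures from earlier (500 for a miss, 2 for a false alarm), the threshold is $2 / (2 + 500) \approx 0.004$, far below the default 0.5. If the probabilities are not calibrated, an empirical sweep over thresholds on validation data is the safer route.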

Practical Workflow

Establish your evaluation metric first - AUC-PR for severe imbalance, F1 for moderate. Never accuracy.
Train a baseline model with class_weight='balanced'. Check performance.
Tune the decision threshold based on precision-recall tradeoffs for your domain.
If still insufficient, try SMOTE oversampling on the training set.
Try BalancedRandomForest as an ensemble alternative.
If the positive class is extremely rare ($< 0.1\%$), reconsider whether anomaly detection is a better frame.

| Concept | Key point |
| --- | --- |
| Accuracy trap | 99% accuracy on 99:1 imbalance = always-predict-majority; meaningless |
| Precision | Of predicted positives, how many were real; governs false alarm rate |
| Recall | Of real positives, how many were found; governs miss rate |
| AUC-PR | Better than AUC-ROC for severe imbalance; precision-recall tradeoff at all thresholds |
| Threshold tuning | Default 0.5 is rarely optimal; choose based on domain cost structure |
| Class weights | Cheapest fix; penalize minority misclassification more in the loss |
| SMOTE | Synthesize new minority examples between real ones; avoids duplication |
| BalancedRandomForest | Balanced bootstrap per tree; built-in imbalance handling for ensembles |
