Summary: Selecting the right metric is as important as the model itself. This post explains common metrics, their pros/cons, and where they are most appropriate — with examples and a handy comparison table you can use in interviews or documentation.
Quick Comparison Table
| Category | Metric | Best For | When to use / Notes |
|---|---|---|---|
| Classification | Accuracy | Balanced classes | Easy to understand; misleading for imbalanced data |
| Classification | Precision | When FP are costly | Use when false alarms are expensive (spam, fraud investigation cost) |
| Classification | Recall (Sensitivity) | When FN are costly | Use when missing positives is bad (medical, fraud) |
| Classification | F1 Score | Imbalanced datasets | Balances precision & recall; single-number summary |
| Classification | ROC-AUC | Model comparison | Threshold-independent; can be optimistic on imbalanced data |
| Classification | PR-AUC | Rare-event detection | Preferred for imbalanced datasets (fraud) |
| Classification | Log Loss | Probabilistic outputs | Punishes confident wrong predictions; good for calibration |
| Regression | MAE | Average absolute error | Robust to outliers; easy to interpret |
| Regression | MSE / RMSE | When large errors matter | MSE penalizes large errors; RMSE is in the same units as the target |
| Regression | R² | Variance explained | Good for model comparison; can be misleading |
| Ranking | MAP / NDCG | Search & recommendation | Evaluate ranking quality; top results weighted more |
| Business | Precision@K / Cost-based | Alerting / financial impact | Aligns metrics with business costs and capacity |
In-Depth: Classification Metrics
Accuracy
Definition: (TP + TN) / Total
Use when classes are balanced and both types of errors are similarly costly. Avoid for rare-event problems.
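A minimal sketch with a made-up 1%-positive dataset shows why accuracy alone can mislead: a model that never predicts the positive class still scores 99%.
import numpy as np
from sklearn.metrics import accuracy_score
# Hypothetical imbalanced labels: 10 positives out of 1,000 examples
y_true = np.array([1] * 10 + [0] * 990)
# A useless model that always predicts the negative class
y_pred = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred))  # 0.99, despite catching zero positives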
Precision
Definition: TP / (TP + FP)
Prioritize when false positives have high operational or financial cost — e.g., each flagged fraud requires a manual investigation.
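A small sketch with hypothetical spam labels: of the three messages flagged, two were actually spam, so precision is 2/3.
from sklearn.metrics import precision_score
y_true = [1, 1, 0, 0, 0, 1]   # hypothetical labels: 1 = spam
y_pred = [1, 0, 1, 0, 0, 1]   # 3 messages flagged: 2 TP, 1 FP
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67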
Recall (Sensitivity)
Definition: TP / (TP + FN)
Use when missing a positive is very costly: fraud slipping through or a disease missed by a diagnostic test.
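Using the same hypothetical labels, recall asks how many of the three true positives were caught; one slipped through, so recall is also 2/3.
from sklearn.metrics import recall_score
y_true = [1, 1, 0, 0, 0, 1]   # three true positives
y_pred = [1, 0, 1, 0, 0, 1]   # one positive was missed (2 TP, 1 FN)
print(recall_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67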
F1 Score
Definition: Harmonic mean of Precision & Recall
Good single-number summary for imbalanced classification; use alongside precision/recall curves.
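A quick sketch with made-up labels showing that f1_score matches the harmonic-mean formula computed by hand.
from sklearn.metrics import f1_score, precision_score, recall_score
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p = precision_score(y_true, y_pred)   # 2/3
r = recall_score(y_true, y_pred)      # 2/4
print(2 * p * r / (p + r))            # harmonic mean ≈ 0.57
print(f1_score(y_true, y_pred))       # same value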
ROC-AUC
Definition: Area under Receiver Operating Characteristic curve
Plots the true positive rate against the false positive rate across all thresholds; the area equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. Useful for overall ranking power, but prefer PR-AUC for imbalanced tasks.
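A minimal sketch using made-up probabilities; note that roc_auc_score takes scores or probabilities, not hard labels.
from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # hypothetical positive-class probabilities
print(roc_auc_score(y_true, y_prob))  # 0.75: one negative outranks one positive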
PR-AUC
Definition: Area under Precision-Recall curve
More informative than ROC-AUC when positive class is rare — common in fraud, anomaly detection, medical screening.
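Average precision (average_precision_score) is scikit-learn's usual single-number PR-AUC summary; the scores below are the same made-up ones as above.
from sklearn.metrics import average_precision_score
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # same hypothetical scores as above
print(average_precision_score(y_true, y_prob))  # ≈ 0.83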
Log Loss
Definition: Cross-entropy loss for probabilistic outputs
Penalizes confident wrong predictions; use when calibrated probabilities matter (e.g., risk scoring).
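A tiny sketch with made-up probabilities showing how sharply log loss punishes a confident mistake compared with a confident correct call.
from sklearn.metrics import log_loss
y_true = [1, 0]
print(log_loss(y_true, [0.9, 0.1]))  # confident and correct: ≈ 0.11
print(log_loss(y_true, [0.1, 0.9]))  # confident and wrong: ≈ 2.30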
Regression Metrics
Mean Absolute Error (MAE)
Definition: average |y - y_hat|
Intuitive and robust to outliers. Use when average absolute deviation matters.
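A short sketch with toy numbers; MAE is just the mean of the absolute errors.
from sklearn.metrics import mean_absolute_error
y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical targets
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 1.5 + 1) / 4 = 0.75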
MSE / RMSE
Definition: mean squared error; RMSE = sqrt(MSE)
Penalizes large errors more strongly. Use when large deviations are particularly costly.
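The same toy numbers with squared errors: the 1.5 miss now dominates, and taking the square root puts RMSE back in the target's units.
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))  # 0.875 and RMSE ≈ 0.94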
R² (Coefficient of Determination)
Definition: proportion of variance explained
Good for quick model comparison; it can be negative when a model is worse than simply predicting the mean, so interpret with caution on nonlinear or small datasets.
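Continuing the toy example: R² compares the model's squared errors against a baseline that always predicts the mean of y_true.
from sklearn.metrics import r2_score
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(r2_score(y_true, y_pred))  # ≈ 0.72; 1.0 is perfect, 0 means no better than the mean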
Ranking & Business Metrics
MAP / NDCG
Definition: ranking-aware metrics for top-k relevance
Use for search and recommendations where order matters; NDCG weights top positions more.
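A minimal NDCG sketch with made-up graded relevances; the score ordering that puts the relevant items first gets the higher NDCG.
from sklearn.metrics import ndcg_score
true_relevance = [[3, 2, 3, 0, 1]]            # hypothetical graded relevance per item
good_ranking   = [[0.9, 0.8, 0.7, 0.2, 0.1]]  # scores that rank relevant items first
bad_ranking    = [[0.1, 0.2, 0.7, 0.8, 0.9]]  # scores that bury them at the bottom
print(ndcg_score(true_relevance, good_ranking))  # higher
print(ndcg_score(true_relevance, bad_ranking))   # lower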
Precision@K / Cost-based Metrics
Definition: precision among top-K predictions; or metrics weighted by financial costs
Use when only the top alerts are reviewed (fraud teams have limited daily capacity) or when FP/FN have asymmetric monetary costs.
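A hand-rolled precision@K sketch using NumPy on hypothetical alert scores, with K = 3 (i.e., only the top 3 alerts get reviewed).
import numpy as np
y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])                  # hypothetical fraud labels
scores = np.array([0.2, 0.9, 0.1, 0.8, 0.4, 0.3, 0.7, 0.6])  # hypothetical model scores
k = 3
top_k = np.argsort(scores)[::-1][:k]   # indices of the K highest-scoring alerts
print(y_true[top_k].mean())            # 2 of the top 3 are true frauds: ≈ 0.67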
Practical Guidance & Examples
- Fraud detection: prioritize Recall and PR-AUC, but tune with Precision@K to control investigator workload.
- Medical screening: high Recall (sensitivity) is often critical; use F1 and the confusion matrix to understand the trade-offs.
- Recommendation engines: use NDCG or MAP since ranking matters.
- Regression forecasting (demand/pricing): RMSE when big misses matter, MAE when average error matters.
Quick Code: Calculate Multiple Metrics (sklearn)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score
# Assumes y_true (true labels), y_pred (hard class predictions) and y_prob (predicted probability of the positive class) are already defined
print('Accuracy', accuracy_score(y_true, y_pred))
print('Precision', precision_score(y_true, y_pred))
print('Recall', recall_score(y_true, y_pred))
print('F1', f1_score(y_true, y_pred))
print('ROC-AUC', roc_auc_score(y_true, y_prob))
print('PR-AUC', average_precision_score(y_true, y_prob))