Model Performance Metrics — How to Choose the Right One

A practical guide to evaluation metrics for classification, regression, ranking and business use cases.

Summary: Selecting the right metric is as important as the model itself. This post explains common metrics, their pros/cons, and where they are most appropriate — with examples and a handy comparison table you can use in interviews or documentation.

Quick Comparison Table

| Category       | Metric                   | Best For                     | When to use / Notes                                                   |
|----------------|--------------------------|------------------------------|-----------------------------------------------------------------------|
| Classification | Accuracy                 | Balanced classes             | Easy to understand; misleading for imbalanced data                    |
| Classification | Precision                | When FP costly               | Use when false alarms are expensive (spam, fraud investigation cost)  |
| Classification | Recall (Sensitivity)     | When FN costly               | Use when missing positives is bad (medical, fraud)                    |
| Classification | F1 Score                 | Imbalanced datasets          | Balances precision & recall; single-number summary                    |
| Classification | ROC-AUC                  | Model comparison             | Threshold-independent; can be optimistic on imbalanced data           |
| Classification | PR-AUC                   | Rare-event detection         | Preferred for imbalanced datasets (fraud)                             |
| Classification | Log Loss                 | Probabilistic outputs        | Punishes confident wrong predictions; good for calibration            |
| Regression     | MAE                      | Average absolute error       | Robust to outliers; easy to interpret                                 |
| Regression     | MSE / RMSE               | When large errors matter     | MSE penalizes large errors; RMSE in same units as target              |
| Regression     | R²                       | Variance explained           | Good for model comparison; can be misleading                          |
| Ranking        | MAP / NDCG               | Search & recommendation      | Evaluate ranking quality; top results weighted more                   |
| Business       | Precision@K / Cost-based | Alerting / financial impact  | Aligns metrics with business costs and capacity                       |

In-Depth: Classification Metrics

Accuracy

Definition: (TP + TN) / Total

Use when classes are balanced and both types of errors are similarly costly. Avoid for rare-event problems.
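
A minimal sketch of the classic pitfall, using made-up labels: a model that never predicts the positive class still scores 95% accuracy on a 95/5 split.

import numpy as np
from sklearn.metrics import accuracy_score

# Toy imbalanced problem: 95 negatives, 5 positives (made-up data).
y_true = np.array([0] * 95 + [1] * 5)
y_pred_all_negative = np.zeros(100, dtype=int)  # a "model" that never flags anything

print(accuracy_score(y_true, y_pred_all_negative))  # 0.95, despite missing every positive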

Precision

Definition: TP / (TP + FP)

Prioritize when false positives have high operational or financial cost — e.g., each flagged fraud requires a manual investigation.
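
As a rough illustration with toy counts (not real data), precision reads straight off TP and FP, or from hard predictions via sklearn:

from sklearn.metrics import precision_score

# Made-up review queue: 1 = fraud, 0 = legitimate.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]  # items flagged for manual investigation

# TP = 3 (flagged and truly fraud), FP = 2 (flagged but legitimate),
# so precision = TP / (TP + FP) = 3 / 5 = 0.6: 40% of investigations are wasted.
print(precision_score(y_true, y_pred))  # 0.6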

Recall (Sensitivity)

Definition: TP / (TP + FN)

Use when missing a positive is very costly: fraud slipping through or a disease missed by a diagnostic test.
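
A small sketch with assumed screening probabilities: lowering the decision threshold catches more true positives (recall rises), at the cost of more false alarms.

import numpy as np
from sklearn.metrics import recall_score

# Made-up probabilities from a screening model; 1 = disease present.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.7, 0.3, 0.2, 0.55, 0.65, 0.1])

for threshold in (0.5, 0.35):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold, recall_score(y_true, y_pred))  # recall rises as the threshold drops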

F1 Score

Definition: Harmonic mean of Precision & Recall

Good single-number summary for imbalanced classification; use alongside precision/recall curves.
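
Because F1 is a harmonic mean, it is dragged toward the weaker of its two components; a quick check with assumed values makes the point.

# Assumed precision/recall pair with a large imbalance between the two.
precision, recall = 0.9, 0.1

f1 = 2 * precision * recall / (precision + recall)
print(f1)                        # ~0.18, dominated by the weak recall
print((precision + recall) / 2)  # 0.5, an arithmetic mean would hide the problem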

ROC-AUC

Definition: Area under Receiver Operating Characteristic curve

Compares true positive rate vs false positive rate across thresholds. Useful for overall ranking power, but prefer PR-AUC for imbalanced tasks.
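
A minimal sketch with made-up scores, sweeping thresholds with roc_curve and summarizing the curve with roc_auc_score:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # TPR and FPR at each threshold
print(roc_auc_score(y_true, y_prob))              # area under that curve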

PR-AUC

Definition: Area under Precision-Recall curve

More informative than ROC-AUC when positive class is rare — common in fraud, anomaly detection, medical screening.
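
A sketch on synthetic rare-event data (illustrative only): ROC-AUC can look healthy while the precision-recall view exposes how hard the rare class actually is.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic data with roughly 2% positives and noisy, partially informative scores.
y_true = (rng.random(5000) < 0.02).astype(int)
y_prob = 0.3 * y_true + 0.7 * rng.random(5000)

precision, recall, _ = precision_recall_curve(y_true, y_prob)
print('ROC-AUC', roc_auc_score(y_true, y_prob))
print('PR-AUC ', average_precision_score(y_true, y_prob))  # much lower, reflecting the rare positives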

Log Loss

Definition: Cross-entropy loss for probabilistic outputs

Penalizes confident wrong predictions; use when calibrated probabilities matter (e.g., risk scoring).
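
A quick illustration with made-up probabilities: the same hard predictions, but one confidently wrong probability dominates the loss.

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]

print(log_loss(y_true, [0.1, 0.9, 0.8, 0.6]))   # mildly wrong on the last case
print(log_loss(y_true, [0.1, 0.9, 0.8, 0.99]))  # confidently wrong there: the loss jumps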

Regression Metrics

Mean Absolute Error (MAE)

Definition: average |y - y_hat|

Intuitive and robust to outliers. Use when average absolute deviation matters.
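
A minimal check on made-up targets: MAE is just the average absolute deviation, in the target's own units.

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 240.0])

print(np.mean(np.abs(y_true - y_pred)))     # 8.75
print(mean_absolute_error(y_true, y_pred))  # same value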

MSE / RMSE

Definition: mean squared error; RMSE = sqrt(MSE)

Penalizes large errors more strongly. Use when large deviations are particularly costly.
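
Extending the made-up data above with one large miss shows the difference: the outlier inflates RMSE far more than MAE.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 205.0, 240.0, 400.0])  # one 100-unit error

print('MAE ', mean_absolute_error(y_true, y_pred))          # 27.0
print('RMSE', np.sqrt(mean_squared_error(y_true, y_pred)))  # ~45.4, pulled up by the outlier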

R² (Coefficient of Determination)

Definition: proportion of variance explained

Good for quick model comparison; interpret with caution on nonlinear or small datasets.
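
A quick sanity check with assumed values: predictions close to the targets give an R² near 1, predicting the mean of y gives exactly 0, and a model worse than that goes negative.

from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]

print(r2_score(y_true, [2.8, 5.1, 7.2, 8.9]))  # close to 1: most variance explained
print(r2_score(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: no better than predicting the mean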

Ranking & Business Metrics

MAP / NDCG

Definition: ranking-aware metrics for top-k relevance

Use for search and recommendations where order matters; NDCG weights top positions more.
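
A small NDCG sketch with made-up relevance grades for a single query; sklearn's ndcg_score compares the model's ordering against the ideal one, here truncated to the top 3.

import numpy as np
from sklearn.metrics import ndcg_score

# One query, five documents; higher grade = more relevant (made-up values).
true_relevance = np.array([[3, 2, 0, 0, 1]])
model_scores   = np.array([[0.9, 0.7, 0.8, 0.1, 0.2]])  # ranks an irrelevant doc second

print(ndcg_score(true_relevance, model_scores, k=3))  # < 1 because the top-3 order is imperfect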

Precision@K / Cost-based Metrics

Definition: precision among top-K predictions; or metrics weighted by financial costs

Use when only top alerts are reviewed (fraud teams limited daily capacity) or when FP/FN have asymmetric monetary costs.
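
A minimal sketch with a hypothetical precision_at_k helper and made-up alert scores, assuming the team only reviews the top 3 alerts per day.

import numpy as np

def precision_at_k(y_true, y_score, k):
    """Fraction of true positives among the k highest-scoring predictions (illustrative helper)."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(np.array(y_true)[top_k]))

# Made-up alert labels (1 = real fraud) and model scores.
y_true  = [1, 0, 1, 0, 0, 1, 0]
y_score = [0.95, 0.90, 0.85, 0.40, 0.30, 0.20, 0.10]

print(precision_at_k(y_true, y_score, k=3))  # 2 of the top 3 alerts are real fraud -> ~0.67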

Practical Guidance & Examples

Quick Code: Calculate Multiple Metrics (sklearn)

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

# y_true: ground-truth labels, y_pred: hard class predictions,
# y_prob: predicted probability of the positive class
print('Accuracy ', accuracy_score(y_true, y_pred))
print('Precision', precision_score(y_true, y_pred))
print('Recall   ', recall_score(y_true, y_pred))
print('F1       ', f1_score(y_true, y_pred))
print('ROC-AUC  ', roc_auc_score(y_true, y_prob))
print('PR-AUC   ', average_precision_score(y_true, y_prob))