Summary: Selecting the right metric is as important as the model itself. This post explains common metrics, their pros/cons, and where they are most appropriate — with examples and a handy comparison table you can use in interviews or documentation.
Quick Comparison Table
| Category | Metric | Best For | When to use / Notes |
|---|---|---|---|
| Classification | Accuracy | Balanced classes | Easy to understand; misleading for imbalanced data |
| Classification | Precision | When FP are costly | Use when false alarms are expensive (spam, fraud investigation cost) |
| Classification | Recall (Sensitivity) | When FN are costly | Use when missing positives is bad (medical, fraud) |
| Classification | F1 Score | Imbalanced datasets | Balances precision & recall; single-number summary |
| Classification | ROC-AUC | Model comparison | Threshold-independent; can be optimistic on imbalanced data |
| Classification | PR-AUC | Rare-event detection | Preferred for imbalanced datasets (fraud) |
| Classification | Log Loss | Probabilistic outputs | Punishes confident wrong predictions; good for calibration |
| Regression | MAE | Average absolute error | Robust to outliers; easy to interpret |
| Regression | MSE / RMSE | When large errors matter | MSE penalizes large errors; RMSE is in the same units as the target |
| Regression | R² | Variance explained | Good for model comparison; can be misleading |
| Ranking | MAP / NDCG | Search & recommendation | Evaluate ranking quality; top results weighted more |
| Business | Precision@K / Cost-based | Alerting / financial impact | Aligns metrics with business costs and capacity |
In-Depth: Classification Metrics
Accuracy
Definition: (TP + TN) / Total
Use when classes are balanced and both types of errors are similarly costly. Avoid for rare-event problems.
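A minimal sketch with a made-up 1%-positive dataset shows why accuracy alone can mislead: a model that never predicts the positive class still scores 99%.
import numpy as np
from sklearn.metrics import accuracy_score
# Hypothetical imbalanced labels: 10 positives out of 1,000 examples
y_true = np.array([1] * 10 + [0] * 990)
# A useless model that always predicts the negative class
y_pred = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred))  # 0.99, despite catching zero positives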
Precision
Definition: TP / (TP + FP)
Prioritize when false positives have high operational or financial cost — e.g., each flagged fraud requires a manual investigation.
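A small sketch with hypothetical spam labels: of the three messages flagged, two were actually spam, so precision is 2/3.
from sklearn.metrics import precision_score
y_true = [1, 1, 0, 0, 0, 1]   # hypothetical labels: 1 = spam
y_pred = [1, 0, 1, 0, 0, 1]   # 3 messages flagged: 2 TP, 1 FP
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67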
Recall (Sensitivity)
Definition: TP / (TP + FN)
Use when missing a positive is very costly: fraud slipping through or a disease missed by a diagnostic test.
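Using the same hypothetical labels, recall asks how many of the three true positives were caught; one slipped through, so recall is also 2/3.
from sklearn.metrics import recall_score
y_true = [1, 1, 0, 0, 0, 1]   # three true positives
y_pred = [1, 0, 1, 0, 0, 1]   # one positive was missed (2 TP, 1 FN)
print(recall_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67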
F1 Score
Definition: Harmonic mean of Precision & Recall
Good single-number summary for imbalanced classification; use alongside precision/recall curves.
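A quick sketch with made-up labels showing that f1_score matches the harmonic-mean formula computed by hand.
from sklearn.metrics import f1_score, precision_score, recall_score
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p = precision_score(y_true, y_pred)   # 2/3
r = recall_score(y_true, y_pred)      # 2/4
print(2 * p * r / (p + r))            # harmonic mean ≈ 0.57
print(f1_score(y_true, y_pred))       # same value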
ROC-AUC
Definition: Area under Receiver Operating Characteristic curve
Plots the true positive rate against the false positive rate across all thresholds; the area equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. Useful for overall ranking power, but prefer PR-AUC for imbalanced tasks.
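A minimal sketch using made-up probabilities; note that roc_auc_score takes scores or probabilities, not hard labels.
from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # hypothetical positive-class probabilities
print(roc_auc_score(y_true, y_prob))  # 0.75: one negative outranks one positive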
PR-AUC
Definition: Area under Precision-Recall curve
More informative than ROC-AUC when positive class is rare — common in fraud, anomaly detection, medical screening.
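Average precision (average_precision_score) is scikit-learn's usual single-number PR-AUC summary; the scores below are the same made-up ones as above.
from sklearn.metrics import average_precision_score
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # same hypothetical scores as above
print(average_precision_score(y_true, y_prob))  # ≈ 0.83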
Log Loss
Definition: Cross-entropy loss for probabilistic outputs
Penalizes confident wrong predictions; use when calibrated probabilities matter (e.g., risk scoring).
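A tiny sketch with made-up probabilities showing how sharply log loss punishes a confident mistake compared with a confident correct call.
from sklearn.metrics import log_loss
y_true = [1, 0]
print(log_loss(y_true, [0.9, 0.1]))  # confident and correct: ≈ 0.11
print(log_loss(y_true, [0.1, 0.9]))  # confident and wrong: ≈ 2.30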
Regression Metrics
Mean Absolute Error (MAE)
Definition: average |y - y_hat|
Intuitive and robust to outliers. Use when average absolute deviation matters.
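A short sketch with toy numbers; MAE is just the mean of the absolute errors.
from sklearn.metrics import mean_absolute_error
y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical targets
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 1.5 + 1) / 4 = 0.75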
MSE / RMSE
Definition: mean squared error; RMSE = sqrt(MSE)
Penalizes large errors more strongly. Use when large deviations are particularly costly.
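The same toy numbers with squared errors: the 1.5 miss now dominates, and taking the square root puts RMSE back in the target's units.
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))  # 0.875 and RMSE ≈ 0.94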
R² (Coefficient of Determination)
Definition: proportion of variance explained
Good for quick model comparison; it can be negative when a model is worse than simply predicting the mean, so interpret with caution on nonlinear or small datasets.
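Continuing the toy example: R² compares the model's squared errors against a baseline that always predicts the mean of y_true.
from sklearn.metrics import r2_score
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(r2_score(y_true, y_pred))  # ≈ 0.72; 1.0 is perfect, 0 means no better than the mean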
Ranking & Business Metrics
MAP / NDCG
Definition: ranking-aware metrics for top-k relevance
Use for search and recommendations where order matters; NDCG weights top positions more.
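A minimal NDCG sketch with made-up graded relevances; the score ordering that puts the relevant items first gets the higher NDCG.
from sklearn.metrics import ndcg_score
true_relevance = [[3, 2, 3, 0, 1]]            # hypothetical graded relevance per item
good_ranking   = [[0.9, 0.8, 0.7, 0.2, 0.1]]  # scores that rank relevant items first
bad_ranking    = [[0.1, 0.2, 0.7, 0.8, 0.9]]  # scores that bury them at the bottom
print(ndcg_score(true_relevance, good_ranking))  # higher
print(ndcg_score(true_relevance, bad_ranking))   # lower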
Precision@K / Cost-based Metrics
Definition: precision among top-K predictions; or metrics weighted by financial costs
Use when only the top alerts are reviewed (fraud teams have limited daily capacity) or when FP/FN have asymmetric monetary costs.
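A hand-rolled precision@K sketch using NumPy on hypothetical alert scores, with K = 3 (i.e., only the top 3 alerts get reviewed).
import numpy as np
y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])                  # hypothetical fraud labels
scores = np.array([0.2, 0.9, 0.1, 0.8, 0.4, 0.3, 0.7, 0.6])  # hypothetical model scores
k = 3
top_k = np.argsort(scores)[::-1][:k]   # indices of the K highest-scoring alerts
print(y_true[top_k].mean())            # 2 of the top 3 are true frauds: ≈ 0.67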
Practical Guidance & Examples
- Fraud detection: prioritize Recall and PR-AUC, but tune with Precision@K to control investigator workload.
- Medical screening: high Recall (sensitivity) is often critical; use F1 and the confusion matrix to understand the trade-offs.
- Recommendation engines: use NDCG or MAP since ranking matters.
- Regression forecasting (demand/pricing): RMSE when big misses matter, MAE when average error matters.
Quick Code: Calculate Multiple Metrics (sklearn)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score
# Assumes y_true (true labels), y_pred (hard class predictions) and y_prob (predicted probability of the positive class) are already defined
print('Accuracy', accuracy_score(y_true, y_pred))
print('Precision', precision_score(y_true, y_pred))
print('Recall', recall_score(y_true, y_pred))
print('F1', f1_score(y_true, y_pred))
print('ROC-AUC', roc_auc_score(y_true, y_prob))
print('PR-AUC', average_precision_score(y_true, y_prob))