The Evaluation Problem: Why Benchmarks Can Be Misleading
A model achieves 95 percent accuracy on a benchmark. That sounds great. But on an imbalanced dataset where 95 percent of examples belong to the negative class, 95 percent accuracy is meaningless: a model that always predicts negative achieves the same score.
Traditional accuracy metrics mislead when class distributions are unbalanced, when false positives and false negatives have different costs, or when the evaluation dataset doesn't match real-world data. Comprehensive evaluation requires multiple metrics, careful dataset selection, and measurement of business outcomes.
Classification Metrics and When to Use Them
Accuracy
Percentage of correct predictions. Works for balanced datasets but misleads on imbalanced data. A fraud detection model that always predicts "no fraud" achieves 99 percent accuracy when only 1 percent of transactions are fraudulent, yet it catches nothing.
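A minimal sketch of the failure mode, using made-up labels: an always-negative classifier on a dataset with 1 percent positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 1 fraudulent transaction (1) among 99 legitimate ones (0)
y_true = [1] + [0] * 99
always_negative = [0] * 100  # a "model" that never flags fraud

print(accuracy(y_true, always_negative))  # 0.99, despite catching zero fraud
```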
Precision and Recall
Precision: the percentage of predicted positives that are actually positive. High precision minimizes false positives (false alarms). Recall: the percentage of actual positives that the model finds. High recall minimizes false negatives (missed cases).
Different applications prioritize differently. Medical diagnosis prioritizes recall (don't miss sick patients even if some healthy patients get false alarms). Spam detection prioritizes precision (don't send legitimate email to spam even if some spam slips through).
F1 Score
Harmonic mean of precision and recall. Provides a single number that balances both. Useful when you don't know which to prioritize.
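All three metrics fall out of the same confusion counts. A sketch with illustrative labels (the zero-division guards handle degenerate inputs):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class, from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # hits
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 actual positives; the model finds 2 of them and raises 1 false alarm
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```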
ROC-AUC
Measures the area under the curve of true positive rate versus false positive rate across all decision thresholds. Less sensitive to class imbalance than accuracy. Useful for comparing models independently of any chosen threshold.
| Metric | Best For | Handles Imbalance | Interpretation |
|---|---|---|---|
| Accuracy | Balanced classification | No | Percentage correct |
| Precision | Minimize false positives | Yes | Trust positive predictions |
| Recall | Minimize false negatives | Yes | Find all positives |
| F1 Score | Balance precision-recall | Yes | Harmonic mean |
| ROC-AUC | Compare models broadly | Yes | 0.5 to 1.0 score |
LLM and Generative Model Evaluation
Traditional metrics don't work for generative tasks: there are infinitely many valid responses. BLEU (which compares output to reference responses by n-gram overlap) is flawed because it rewards exact word matches over semantic correctness.
ROUGE and METEOR
Improvements over BLEU. ROUGE measures n-gram overlap with references; METEOR also considers synonyms and stemming. Both remain imperfect because many valid responses don't match the reference exactly.
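A simplified ROUGE-1 F1 sketch (unigram overlap with whitespace tokenization; real packages add stemming and longer n-grams) makes the limitation concrete: a correct paraphrase that uses different words scores poorly.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and one reference string."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
exact_prefix = rouge1_f1("the cat sat", reference)       # high overlap
paraphrase = rouge1_f1("a feline was seated", reference)  # valid, scores 0.0
```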
BERTScore
Uses contextual embeddings to compare semantic similarity. Correlates better with human judgment than BLEU but still doesn't capture all aspects of quality.
Human Evaluation
The gold standard for generative tasks. Have humans rate responses on quality, coherence, factuality, and helpfulness. Time-consuming but most accurate.
LLM-as-Judge
Use a powerful LLM to evaluate other models' outputs. Cheaper than human evaluation but introduces bias (the judge LLM's biases transfer to evaluations).
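One common pattern is a fixed rubric prompt asking the judge for structured scores. A hedged sketch: the rubric dimensions are illustrative, and `call_llm` stands in for whatever provider API you use (it is not a real library function).

```python
import json

# Illustrative rubric; adapt the dimensions to your task.
JUDGE_RUBRIC = """Rate the RESPONSE to the QUESTION on a 1-5 scale for each of:
- coherence
- factuality
- helpfulness
Return only JSON like {"coherence": 4, "factuality": 5, "helpfulness": 3}."""

def build_judge_prompt(question, response):
    """Assemble the rubric plus the item under evaluation."""
    return f"{JUDGE_RUBRIC}\n\nQUESTION: {question}\n\nRESPONSE: {response}"

# prompt = build_judge_prompt("What causes tides?", model_output)
# scores = json.loads(call_llm(prompt))  # call_llm: your provider's API client
```

Averaging multiple judge calls and spot-checking a sample against human ratings helps quantify how much judge bias is leaking into the scores.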
Hallucination Detection
For knowledge-heavy tasks, detect when models make up facts. Automated hallucination detection is unreliable; human review of sampled outputs or grounding outputs in retrieved facts works better.
Beyond Accuracy: Operational Metrics
Latency
Time from request to response. Critical for real-time applications. Measure the 50th, 95th, and 99th percentile latencies, not just the average. A model with a 10 ms average but a 5-second 99th percentile is unreliable.
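A sketch with made-up latency samples showing how the mean hides tail pain (nearest-rank is one of several percentile conventions):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# 98 fast requests, 2 pathological ones (hypothetical numbers)
latencies_ms = [10] * 98 + [5000, 5000]

print(sum(latencies_ms) / len(latencies_ms))  # mean: 109.8 ms, looks fine
print(percentile(latencies_ms, 50))           # p50: 10 ms
print(percentile(latencies_ms, 99))           # p99: 5000 ms, the real story
```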
Throughput
Requests processed per second. Determines infrastructure costs: a model processing 100 requests/sec needs roughly a tenth of the hardware of one processing 10 requests/sec to serve the same load.
Cost Per Prediction
Total cost divided by the number of predictions. Includes compute, storage, and API costs. A lower-accuracy model that's 100x faster, and therefore cheaper per prediction, might have better business economics.
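The arithmetic is simple but worth writing down per model. All figures below are hypothetical monthly numbers:

```python
def cost_per_prediction(compute, storage, api, n_predictions):
    """Total monthly cost divided by predictions served that month."""
    return (compute + storage + api) / n_predictions

# Two candidate models serving the same 1M predictions/month
big = cost_per_prediction(9000, 500, 500, 1_000_000)  # $0.01 per prediction
small = cost_per_prediction(80, 10, 10, 1_000_000)    # $0.0001 per prediction
```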
Fairness and Bias
Model performance across demographic groups. A model with 95 percent accuracy overall but 70 percent accuracy for a minority group is biased. Measure accuracy separately for each group.
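Per-group measurement is a small extension of the overall metric. A sketch with illustrative labels and group tags:

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each demographic group."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += (t == p)
    return {g: correct[g] / total[g] for g in total}

# Overall accuracy is 5/6, but group "b" fares much worse than "a"
per_group = accuracy_by_group(
    y_true=[1, 1, 1, 1, 1, 1],
    y_pred=[1, 1, 1, 1, 1, 0],
    groups=["a", "a", "a", "a", "b", "b"],
)
```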
Building Your Evaluation Pipeline
Step 1: Define Your Success Metrics
What matters for your specific problem? Accuracy? Speed? Fairness? Cost? List metrics in priority order.
Step 2: Create Representative Test Data
Collect test data matching real deployment conditions. If deploying globally, test on data from all regions. If handling multiple user demographics, include all of them in the test set.
Step 3: Evaluate Multiple Models
Don't pick the first model that works. Evaluate alternatives. Compare on all success metrics, not just accuracy.
Step 4: Consider Operational Factors
Will the model run fast enough? Is the cost within budget? Is it fair across groups? Does it scale? These practical factors often matter more than small accuracy differences.
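Steps 3 and 4 can be sketched as a weighted comparison across the prioritized metrics from step 1. The metric names, weights, and candidate numbers below are illustrative assumptions; the point is that the smaller model wins once speed and fairness count.

```python
def rank_models(candidates, weights):
    """Rank candidates by a weighted sum of already-normalized metrics."""
    def score(metrics):
        return sum(weights[m] * metrics[m] for m in weights)
    return sorted(candidates, key=lambda c: score(c["metrics"]), reverse=True)

candidates = [
    {"name": "large", "metrics": {"accuracy": 0.95, "speed": 0.2, "fairness": 0.8}},
    {"name": "small", "metrics": {"accuracy": 0.92, "speed": 0.9, "fairness": 0.85}},
]
weights = {"accuracy": 0.5, "speed": 0.3, "fairness": 0.2}  # from step 1

best = rank_models(candidates, weights)[0]["name"]  # "small" edges out "large"
```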
Step 5: Monitor Post-Deployment
Metrics on test data don't guarantee real-world performance. Monitor the deployed model on live traffic. If performance degrades, investigate and retrain.