Compare LLM performance across benchmarks
Benchmark LLM accuracy and translation quality across languages
Evaluate AI-generated results for accuracy
Display model benchmark results
Browse and submit LLM evaluations
Generate leaderboard comparing DNA models
Display leaderboard for earthquake intent classification models
Display leaderboard of language model evaluations
Measure over-refusal in LLMs using OR-Bench
Find recent, highly liked Hugging Face models
Demo of the new, massively multilingual leaderboard
Request model evaluation on COCO val 2017 dataset
Benchmark AI models by comparison
Goodharts Law On Benchmarks applies Goodhart's Law, the principle that "when a measure becomes a target, it ceases to be a good measure," to the benchmarking of large language models (LLMs). It highlights the risk that models get optimized to score well on specific benchmarks, overfitting to or gaming them rather than genuinely improving. This tool analyzes and compares LLM performance across multiple benchmarks to surface such biases and support more robust evaluations.
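As a minimal sketch of the kind of cross-benchmark check this implies (not the tool's actual method): flag any model whose score on one benchmark sits far above its scores on the rest, using a leave-one-out z-score. The model names, benchmark keys, and the flag_benchmark_outliers helper are all hypothetical, and scores are assumed to be pre-normalized to a common 0-100 scale.

```python
from statistics import mean, stdev

def flag_benchmark_outliers(scores, z_threshold=2.0):
    """Flag (model, benchmark) pairs whose score sits far above the model's
    results on its other benchmarks, via a leave-one-out z-score. A large
    positive z is a rough Goodhart signal: the model may have been tuned to
    that one benchmark rather than improved in general."""
    flags = []
    for model, per_bench in scores.items():
        if len(per_bench) < 3:
            continue  # too few benchmarks to estimate a spread
        for bench, score in per_bench.items():
            others = [s for b, s in per_bench.items() if b != bench]
            mu, sigma = mean(others), stdev(others)
            if sigma == 0:
                continue  # identical scores elsewhere, z is undefined
            z = (score - mu) / sigma
            if z > z_threshold:
                flags.append((model, bench, round(z, 1)))
    return flags

# Hypothetical scores, already normalized to a common 0-100 scale.
scores = {
    "model-a": {"mmlu": 71.0, "gsm8k": 70.5, "hellaswag": 72.0, "arc": 98.0},
    "model-b": {"mmlu": 69.0, "gsm8k": 68.0, "hellaswag": 70.0, "arc": 69.5},
}
print(flag_benchmark_outliers(scores))
# [('model-a', 'arc', 35.1)]
```

A large positive z only says the score is out of line with the model's other results; it can also reflect genuine task-specific strength, so it is a prompt for scrutiny (for example, checking for training-data contamination), not proof of gaming.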
What is Goodhart's Law?
Goodhart's Law is an observation that once a measure is used as a target, it loses its effectiveness as a measure. In AI, this means models may optimize for benchmark scores rather than true performance.
How can I avoid over-optimization?
Evaluate on a diverse set of benchmarks, refresh or rotate evaluation sets over time, and check whether gains on one benchmark transfer to the others, so models cannot overfit to any single task; one way to aggregate diverse benchmarks is sketched below.
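One hedged reading of "use diverse benchmarks," reusing the hypothetical benchmark names above with an assumed category grouping: macro-average within capability categories before averaging across them, so stacking several near-duplicate benchmarks cannot move the headline number much.

```python
from statistics import mean

# Hypothetical grouping of benchmarks into capability categories.
CATEGORIES = {
    "knowledge": ["mmlu"],
    "reasoning": ["gsm8k", "arc"],
    "language":  ["hellaswag"],
}

def macro_score(per_bench):
    """Average within each category first, then across categories, so
    stacking many near-duplicate benchmarks cannot inflate the total."""
    cat_means = [
        mean(per_bench[b] for b in benches if b in per_bench)
        for benches in CATEGORIES.values()
        if any(b in per_bench for b in benches)
    ]
    return mean(cat_means)

print(round(macro_score(
    {"mmlu": 71.0, "gsm8k": 70.5, "hellaswag": 72.0, "arc": 98.0}), 2))
# 75.75 (the raw mean would be 77.88)
```

Under this weighting, model-a's inflated arc score counts for half of one category instead of a quarter of the total, which is why the macro score (75.75) lands below the raw mean (77.88).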
When should I apply Goodharts Law On Benchmarks?
Apply this tool whenever you evaluate LLMs on multiple benchmarks to ensure balanced and unbiased performance assessments.