Compare LLM performance across benchmarks
Convert and upload model files for Stable Diffusion
Display and submit language model evaluations
Submit deepfake detection models for evaluation
Benchmark AI models by comparison
Run benchmarks on prediction models
Pergel: A Unified Benchmark for Evaluating Turkish LLMs
Optimize and train foundation models using IBM's FMS
View and submit LLM benchmark evaluations
Display LLM benchmark leaderboard and info
View and submit language model evaluations
View LLM Performance Leaderboard
Persian Text Embedding Benchmark
Goodharts Law On Benchmarks takes its name from Goodhart's Law, the principle that "when a measure becomes a target, it ceases to be a good measure." In AI and machine learning, the law applies to benchmarking large language models (LLMs): once benchmark scores become the target, models risk being optimized to score well on specific benchmarks, overfitting to them or gaming them rather than genuinely improving. This tool analyzes and compares LLM performance across multiple benchmarks to surface such biases and support more robust evaluations.
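As an illustration of this kind of cross-benchmark comparison, here is a minimal Python sketch. The scores, model names, and z-score threshold are all hypothetical, not data from the tool itself; the idea is simply to flag benchmarks where a model's score deviates sharply from its own cross-benchmark average, a crude signal of benchmark-specific optimization.

```python
from statistics import mean, pstdev

# Hypothetical normalized scores (0-100) per model per benchmark.
scores = {
    "model-a": {"mmlu": 71, "gsm8k": 93, "arc": 69, "hellaswag": 72},
    "model-b": {"mmlu": 74, "gsm8k": 75, "arc": 73, "hellaswag": 76},
}

def flag_suspect_benchmarks(per_benchmark, z_threshold=1.5):
    """Return (benchmark, z-score) pairs where a score deviates strongly
    from the model's own mean across all benchmarks."""
    values = list(per_benchmark.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # identical scores everywhere: nothing to flag
        return []
    return [
        (name, round((score - mu) / sigma, 2))
        for name, score in per_benchmark.items()
        if abs(score - mu) / sigma >= z_threshold
    ]

for model, per_benchmark in scores.items():
    print(model, flag_suspect_benchmarks(per_benchmark))
```

With these made-up numbers, model-a's gsm8k score stands out (z ≈ 1.72) while model-b's scores are uniform, which is the pattern a Goodhart-style analysis would want to investigate further.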
What is Goodhart's Law?
Goodhart's Law is an observation that once a measure is used as a target, it loses its effectiveness as a measure. In AI, this means models may optimize for benchmark scores rather than true performance.
How can I avoid over-optimization?
Use diverse benchmarks and continuously update evaluation metrics to prevent models from overfitting to specific tasks.
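One simple check along these lines is to rotate in held-out benchmarks and compare aggregate scores. The Python sketch below uses hypothetical scores, benchmark names, and a hypothetical gap threshold; a large gap between the tuning-set average and the held-out average suggests the model is optimizing for the benchmarks it was developed against.

```python
from statistics import mean

# Hypothetical scores: benchmarks seen during development vs. a rotating held-out set.
tuning_benchmarks = {"mmlu": 82.0, "gsm8k": 88.0}
held_out_benchmarks = {"arc": 64.0, "winogrande": 61.0}

tuning_avg = mean(tuning_benchmarks.values())
held_out_avg = mean(held_out_benchmarks.values())
gap = tuning_avg - held_out_avg

print(f"tuning avg: {tuning_avg:.1f}, held-out avg: {held_out_avg:.1f}, gap: {gap:.1f}")
if gap > 10:  # hypothetical threshold; calibrate against peer models
    print("Warning: possible overfitting to the tuning benchmarks.")
```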
When should I apply Goodharts Law On Benchmarks?
Apply this tool whenever you evaluate LLMs on multiple benchmarks to ensure balanced and unbiased performance assessments.