Compare LLM performance across benchmarks
Goodharts Law On Benchmarks takes its name from Goodhart's Law, the principle that "when a measure becomes a target, it ceases to be a good measure." In AI and machine learning, the law applies directly to benchmarking large language models (LLMs): a model can be optimized to score well on a specific benchmark, overfitting to or gaming the test rather than genuinely improving. This tool analyzes and compares LLM performance across multiple benchmarks to surface such biases and support more robust evaluations.
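As a concrete illustration of the kind of cross-benchmark comparison described above, here is a minimal Python sketch. The model names, benchmark names, scores, and the 10-point flag threshold are all hypothetical, and the "best-vs-median gap" heuristic is just one crude overfitting signal, not the tool's actual method:

```python
from statistics import median

# Hypothetical per-benchmark scores on a 0-100 scale; the model and
# benchmark names below are placeholders, not results from the tool.
scores = {
    "model-a": {"mmlu": 71.2, "gsm8k": 89.5, "hellaswag": 70.8},
    "model-b": {"mmlu": 74.0, "gsm8k": 75.1, "hellaswag": 73.2},
}

def goodhart_gap(per_benchmark):
    """Gap between a model's best single benchmark and its median score.

    A large gap is one crude signal that the model may have been tuned
    to the benchmark it scores best on, rather than improved overall.
    """
    values = list(per_benchmark.values())
    return max(values) - median(values)

for model, per_benchmark in scores.items():
    gap = goodhart_gap(per_benchmark)
    note = "  <- possible benchmark-specific tuning" if gap > 10 else ""
    print(f"{model}: median={median(per_benchmark.values()):.1f}, gap={gap:.1f}{note}")
```

Running this flags model-a (an 18.3-point gap between its gsm8k score and its median) while leaving model-b unflagged, which is the shape of signal a multi-benchmark comparison is meant to surface.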
What is Goodhart's Law?
Goodhart's Law is an observation that once a measure is used as a target, it loses its effectiveness as a measure. In AI, this means models may optimize for benchmark scores rather than true performance.
How can I avoid over-optimization?
Use diverse benchmarks and continuously update evaluation metrics to prevent models from overfitting to specific tasks.
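One way to check that a benchmark suite is actually diverse is to measure how similarly its benchmarks rank the same set of models: a rank correlation near 1.0 suggests the second benchmark adds little independent signal. A minimal sketch, assuming tie-free scores and using hypothetical numbers:

```python
def ranks(values):
    """Map each value to its rank (0 = lowest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = float(rank)
    return out

def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five models on two benchmarks.
bench_a = [61.0, 67.5, 70.2, 74.8, 80.1]
bench_b = [58.3, 69.9, 66.0, 73.5, 79.4]

# A value near 1.0 means benchmark B largely duplicates A's ranking,
# so adding it does little to diversify the evaluation suite.
print(f"rank correlation: {spearman(bench_a, bench_b):.2f}")
```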
When should I apply Goodharts Law On Benchmarks?
Apply this tool whenever you evaluate LLMs on multiple benchmarks to ensure balanced and unbiased performance assessments.