Goodhart's Law On Benchmarks takes its name from the principle that "when a measure becomes a target, it ceases to be a good measure." In AI and machine learning, this applies directly to benchmarking large language models (LLMs): when models are optimized to score well on specific benchmarks, they risk gaming the metric or overfitting to the test set rather than genuinely improving. This tool analyzes and compares LLM performance across multiple benchmarks to surface such biases and support more robust evaluations.
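As a rough illustration of this kind of cross-benchmark comparison, the sketch below ranks models on each benchmark and flags models whose ranks vary widely across benchmarks. The model names, benchmark names, and scores are hypothetical, and pandas is assumed to be available; this is not the tool's actual implementation.

```python
import pandas as pd

# Hypothetical scores for three models on three benchmarks (0-100 scale).
scores = pd.DataFrame(
    {
        "MMLU": [70.1, 68.4, 55.2],
        "GSM8K": [81.0, 52.3, 57.9],
        "HellaSwag": [85.5, 84.0, 60.1],
    },
    index=["model-a", "model-b", "model-c"],
)

# Rank models on each benchmark separately (rank 1 = best score).
ranks = scores.rank(ascending=False)

# A model whose rank swings widely from one benchmark to another may have
# been tuned to a specific benchmark rather than broadly improved.
rank_spread = ranks.max(axis=1) - ranks.min(axis=1)

print(ranks)
print("Rank spread per model:")
print(rank_spread.sort_values(ascending=False))
```

A large rank spread is only a cue, not proof of gaming; it should prompt a closer look at how the model was trained and what each benchmark actually measures.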
What is Goodhart's Law?
Goodhart's Law is the observation that once a measure becomes a target, it ceases to be an effective measure. In AI, this means models may be optimized for benchmark scores rather than genuine capability.
How can I avoid over-optimization?
Use diverse benchmarks and continuously update evaluation metrics to prevent models from overfitting to specific tasks.
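One way to operationalize this advice is to check how consistently different benchmarks rank the same models: a low rank correlation suggests the benchmarks measure different things, or that some models have been tuned to one of them. A minimal sketch using SciPy's Spearman correlation follows; the score lists are made up for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same five models on two benchmarks.
benchmark_a = [71.2, 65.0, 80.4, 58.9, 74.3]
benchmark_b = [69.8, 83.1, 78.0, 55.2, 60.7]

# Spearman's rho compares the model *rankings* induced by each benchmark,
# ignoring the absolute score scales.
rho, p_value = spearmanr(benchmark_a, benchmark_b)
print(f"rank correlation: {rho:.2f} (p={p_value:.3f})")

# rho near 1.0: the benchmarks largely agree on model ordering.
# rho near 0 or negative: the evaluations diverge -- a cue to check
# whether a model has over-optimized for one of the benchmarks.
```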
When should I apply Goodhart's Law On Benchmarks?
Apply this tool whenever you evaluate LLMs on multiple benchmarks to ensure balanced and unbiased performance assessments.