Compare LLM performance across benchmarks
Retrain models on new data at edge devices
Browse and submit model evaluations for LLM benchmarks
Evaluate model predictions with TruLens
Teach, test, and evaluate language models with MTEB Arena
Browse and submit LLM evaluations
Persian Text Embedding Benchmark
Calculate VRAM requirements for LLMs
Leaderboard of information retrieval models in French
Upload ML model to Hugging Face Hub
View and submit machine learning model evaluations
View and submit LLM benchmark evaluations
Create demo spaces for models on Hugging Face
Goodhart's Law On Benchmarks applies the principle that "when a measure becomes a target, it ceases to be a good measure" to AI and machine learning, specifically to benchmarking large language models (LLMs). It highlights the risk that models are optimized to score well on specific benchmarks, overfitting to or gaming the metric rather than genuinely improving. This tool analyzes and compares LLM performance across multiple benchmarks to surface such biases and support more robust evaluation.
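To make the cross-benchmark comparison concrete, here is a minimal Python sketch of one such check, using hypothetical per-benchmark accuracy scores (benchmark_a through benchmark_d and the 1.5 z-score cutoff are illustrative placeholders, not values from any real leaderboard):

```python
from statistics import mean, stdev

# Hypothetical per-benchmark accuracy scores for one model; in practice
# these would come from a leaderboard or your own evaluation harness.
scores = {
    "benchmark_a": 0.71,
    "benchmark_b": 0.69,
    "benchmark_c": 0.93,  # suspiciously high relative to the others
    "benchmark_d": 0.68,
}

mu = mean(scores.values())
sigma = stdev(scores.values())

# Flag benchmarks where the score sits far above the model's own average:
# a large positive z-score can indicate the model was tuned to that benchmark.
for name, score in scores.items():
    z = (score - mu) / sigma
    flag = " -- possible benchmark overfitting" if z > 1.5 else ""
    print(f"{name}: score {score:.2f} (z = {z:+.2f}){flag}")
```

A single outlier is only a hint, not proof; the point is that scores far out of line with a model's overall profile deserve a closer look.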
What is Goodhart's Law?
Goodhart's Law is an observation that once a measure is used as a target, it loses its effectiveness as a measure. In AI, this means models may optimize for benchmark scores rather than true performance.
How can I avoid over-optimization?
Use diverse benchmarks and continuously update evaluation metrics to prevent models from overfitting to specific tasks.
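One way to operationalize this advice is to compare each model's score on a public benchmark against a freshly collected held-out set for the same task; a large gap suggests the model was tuned to the public benchmark. The sketch below uses invented model names, scores, and an arbitrary gap threshold purely for illustration:

```python
# Hypothetical scores on a public benchmark vs. a fresh held-out set.
# A public score far above the held-out score is Goodhart's Law in action.
results = {
    "model_x": {"public": 0.88, "held_out": 0.71},
    "model_y": {"public": 0.74, "held_out": 0.72},
    "model_z": {"public": 0.81, "held_out": 0.79},
}

GAP_THRESHOLD = 0.10  # arbitrary cutoff for this illustration

for model, r in results.items():
    gap = r["public"] - r["held_out"]
    verdict = "possible benchmark overfitting" if gap > GAP_THRESHOLD else "ok"
    print(f"{model}: public={r['public']:.2f} held_out={r['held_out']:.2f} "
          f"gap={gap:+.2f} -> {verdict}")
```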
When should I apply Goodhart's Law On Benchmarks?
Apply this tool whenever you evaluate LLMs on multiple benchmarks to ensure balanced and unbiased performance assessments.