View and submit LLM benchmark evaluations
ContextualBench-Leaderboard is a benchmarking tool for evaluating and comparing language models. It provides a platform to view and submit model evaluations, letting users assess performance across a range of tasks and datasets. By highlighting top-performing models and their benchmark results, the leaderboard promotes transparency and competition in AI research.
What is the purpose of ContextualBench-Leaderboard?
ContextualBench-Leaderboard is designed to provide a transparent and centralized platform for evaluating and comparing language models. It helps researchers and developers identify top-performing models for specific tasks.
How are the benchmark results calculated?
Results are computed from predefined metrics on fixed evaluation datasets. Each model is scored on its performance across tasks, with metrics such as accuracy, inference speed, and memory usage tracked for comparison.
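As a rough illustration of how a single metric might be computed, the sketch below scores a model's outputs against reference answers with plain exact-match accuracy. The toy model, the example data, and the metric choice are assumptions for illustration, not the leaderboard's actual evaluation pipeline.

```python
# Minimal sketch: exact-match accuracy over a small labeled set.
# The toy_model function and the example data are hypothetical;
# the real leaderboard's evaluation pipeline may differ.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference answer exactly."""
    assert len(predictions) == len(references)
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical evaluation data: (question, gold answer) pairs.
eval_set = [
    ("What is the capital of France?", "Paris"),
    ("2 + 2 = ?", "4"),
]

# Stand-in for a real model call.
def toy_model(question):
    return "Paris" if "France" in question else "5"

preds = [toy_model(q) for q, _ in eval_set]
golds = [answer for _, answer in eval_set]
print(f"accuracy = {exact_match_accuracy(preds, golds):.2f}")  # accuracy = 0.50
```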
Can I submit my own language model for evaluation?
Yes, ContextualBench-Leaderboard allows users to submit their own models for evaluation. Follow the submission guidelines on the platform to ensure your model meets the required criteria.
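For illustration only, a submission could look like the following sketch, which sends model metadata to a hypothetical REST endpoint. The URL, the payload fields, and the response format are all assumptions; defer to the platform's actual submission guidelines.

```python
import requests

# Hypothetical submission payload -- the exact fields required by
# ContextualBench-Leaderboard are defined in its submission guidelines.
payload = {
    "model_name": "my-org/my-llm-7b",   # Hugging Face-style model identifier
    "revision": "main",                  # git revision to evaluate
    "precision": "bfloat16",             # inference precision
    "contact_email": "me@example.com",
}

# Hypothetical endpoint; replace with the leaderboard's real submission route.
SUBMIT_URL = "https://example.com/contextualbench/api/submit"

response = requests.post(SUBMIT_URL, json=payload, timeout=30)
response.raise_for_status()
print("Submission accepted:", response.json())
```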
Why don’t I see my model on the leaderboard?
If your model is not appearing on the leaderboard, first confirm it was submitted correctly and meets all evaluation criteria. Also check whether the leaderboard updates in real time or on a fixed schedule; your results may not appear until the next refresh.
How do I interpret the metrics and visualizations?
Metrics like accuracy and speed indicate how well a model performs relative to others. Visualizations help identify trends and patterns in model performance across different tasks and configurations.
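As one way to explore exported leaderboard scores locally, the sketch below loads a small results table into pandas and draws a grouped bar chart of per-task accuracy per model. The column names and values are made up for illustration and do not reflect the leaderboard's actual export format.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical exported leaderboard rows: one score per (model, task) pair.
rows = [
    {"model": "model-a", "task": "qa",        "accuracy": 0.81},
    {"model": "model-a", "task": "reasoning", "accuracy": 0.64},
    {"model": "model-b", "task": "qa",        "accuracy": 0.77},
    {"model": "model-b", "task": "reasoning", "accuracy": 0.71},
]
df = pd.DataFrame(rows)

# Pivot to models x tasks so per-task strengths and weaknesses are easy to compare.
pivot = df.pivot(index="model", columns="task", values="accuracy")
print(pivot)

# Grouped bar chart: each model gets one bar per task.
pivot.plot(kind="bar", ylabel="accuracy", title="Per-task accuracy by model")
plt.tight_layout()
plt.show()
```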