View and submit language model evaluations
ContextualBench-Leaderboard is a benchmarking tool for evaluating and comparing language models. It provides a platform to view and submit evaluations, letting users assess model performance across a range of tasks and datasets. By highlighting top-performing models and their benchmark results, the leaderboard promotes transparency and competition in AI research.
What is the purpose of ContextualBench-Leaderboard?
ContextualBench-Leaderboard is designed to provide a transparent and centralized platform for evaluating and comparing language models. It helps researchers and developers identify top-performing models for specific tasks.
How are the benchmark results calculated?
Results are computed from predefined metrics on fixed evaluation datasets. Each model is scored on its performance across tasks, with metrics such as accuracy, speed, and memory usage tracked for comparison.
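As a minimal sketch of what an accuracy-style score involves, the snippet below compares predictions against reference answers; the function name and inputs are illustrative and are not the leaderboard's actual evaluation pipeline.

```python
# Minimal sketch of an accuracy-style metric, not the leaderboard's actual pipeline.
from typing import Sequence

def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Example: 2 of 3 answers match, so the score is ~0.67.
print(accuracy(["Paris", "4", "blue"], ["Paris", "4", "red"]))
```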
Can I submit my own language model for evaluation?
Yes, ContextualBench-Leaderboard allows users to submit their own models for evaluation. Follow the submission guidelines on the platform to ensure your model meets the required criteria.
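Before submitting, it can help to verify that your checkpoint loads and generates text with standard tooling. The sketch below uses the Hugging Face transformers library for that sanity check; the repository id is a placeholder, and the leaderboard's actual submission criteria are the ones documented on the platform.

```python
# Hypothetical pre-submission sanity check: confirm the checkpoint loads and generates.
# "your-org/your-model" is a placeholder repo id, not a real submission requirement.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/your-model"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```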
Why don’t I see my model on the leaderboard?
If your model does not appear on the leaderboard, confirm that it was submitted correctly and meets all evaluation criteria. Also check whether the leaderboard updates in real time or on a fixed schedule, since recent submissions may not show up immediately.
How do I interpret the metrics and visualizations?
Metrics such as accuracy and speed show how a model performs on each task and how it compares to other models. Visualizations help reveal trends and patterns in performance across different tasks and configurations.
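If you export leaderboard results for your own analysis, a simple ranking can be built with pandas. This is a hypothetical example: the file name and column names ("model", "task", "accuracy") are assumptions, not the leaderboard's actual export format.

```python
# Hypothetical example: rank models from an exported results file.
# The file name and column names are assumptions about the export format.
import pandas as pd

results = pd.read_csv("leaderboard_results.csv")

# Average accuracy per model across tasks, highest first.
ranking = (
    results.groupby("model")["accuracy"]
    .mean()
    .sort_values(ascending=False)
)
print(ranking.head(10))
```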