View and submit language model evaluations
Convert Hugging Face model repo to Safetensors
Run benchmarks on prediction models
Explore GenAI model efficiency on ML.ENERGY leaderboard
Compare model weights and visualize differences
Evaluate adversarial robustness using generative models
Calculate memory needed to train AI models
Submit models for evaluation and view leaderboard
Convert PaddleOCR models to ONNX format
Evaluate LLM over-refusal rates with OR-Bench
Generate leaderboard comparing DNA models
Demo of the new, massively multilingual leaderboard
Merge LoRA adapters with a base model
ContextualBench-Leaderboard is a benchmarking tool for evaluating and comparing language models. It provides a platform to view and submit evaluations, enabling users to assess model performance across a range of tasks and datasets. By highlighting top-performing models and their benchmark results, the leaderboard promotes transparency and healthy competition in AI research.
What is the purpose of ContextualBench-Leaderboard?
ContextualBench-Leaderboard is designed to provide a transparent and centralized platform for evaluating and comparing language models. It helps researchers and developers identify top-performing models for specific tasks.
How are the benchmark results calculated?
Results are calculated from predefined metrics and datasets. Models are evaluated on their performance across tasks, and metrics such as accuracy, speed, and memory usage are tracked.
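For intuition, here is a minimal sketch of how per-task results might be combined into a single leaderboard score. The task names, scores, and unweighted macro-averaging below are illustrative assumptions, not the exact formula used by ContextualBench-Leaderboard.

# Sketch: aggregate per-task accuracy into one leaderboard score.
# Task names, values, and the averaging scheme are assumptions for illustration.
from statistics import mean

def aggregate_score(per_task_accuracy: dict) -> float:
    """Unweighted macro average of per-task accuracy."""
    return mean(per_task_accuracy.values())

# Hypothetical per-task results for one model
results = {"hotpotqa": 0.61, "triviaqa": 0.78, "naturalquestions": 0.55}
print(f"Aggregate accuracy: {aggregate_score(results):.3f}")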
Can I submit my own language model for evaluation?
Yes, ContextualBench-Leaderboard allows users to submit their own models for evaluation. Follow the submission guidelines on the platform to ensure your model meets the required criteria.
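Before submitting, it can help to verify that your model repository is publicly reachable and that its weights are available. The snippet below is a sketch of such a pre-submission check using huggingface_hub, not the official submission flow; the safetensors requirement is an assumption about the evaluation criteria.

# Pre-submission sanity check (a sketch, not the official submission flow).
from huggingface_hub import HfApi

def check_submission_ready(repo_id: str) -> None:
    api = HfApi()
    info = api.model_info(repo_id)  # raises an error if the repo is private or missing
    files = [s.rfilename for s in info.siblings]
    has_safetensors = any(f.endswith(".safetensors") for f in files)
    print(f"{repo_id}: reachable; safetensors weights found: {has_safetensors}")

check_submission_ready("your-org/your-model")  # replace with your own repo id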
Why don’t I see my model on the leaderboard?
If your model is not appearing on the leaderboard, ensure it has been properly submitted and meets all evaluation criteria. Also check whether the leaderboard updates in real time or on a fixed schedule; a recently submitted model may not appear until the next refresh.
How do I interpret the metrics and visualizations?
Metrics like accuracy and speed indicate how well a model performs relative to others. Visualizations help identify trends and patterns in model performance across different tasks and configurations.
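As a rough illustration of reading such comparisons, the sketch below plots aggregate accuracy for a few models. The model names and scores are placeholder values, not real leaderboard results.

# Sketch of the kind of comparison plot a leaderboard view shows.
# Model names and accuracies are placeholder values for illustration only.
import matplotlib.pyplot as plt

models = ["model-a", "model-b", "model-c"]
accuracy = [0.71, 0.64, 0.58]  # hypothetical aggregate accuracies

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(models, accuracy)
ax.set_ylabel("Aggregate accuracy")
ax.set_ylim(0, 1)
ax.set_title("Hypothetical leaderboard comparison")
plt.tight_layout()
plt.show()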