Explore and submit models using the LLM Leaderboard
OPEN-MOE-LLM-LEADERBOARD is a platform for exploring and submitting large language models (LLMs). It serves as a centralized hub where users can compare and evaluate LLMs across a range of benchmarks and metrics. The leaderboard provides transparent, comprehensive insight into model performance, helping researchers and developers make informed decisions.
• Model Benchmarking: Compare LLMs across multiple tasks and datasets to understand their strengths and weaknesses.
• Model Submission: Submit your own LLM for evaluation and inclusion in the leaderboard.
• Interactive Visualization: Explore detailed metrics and charts to gain deeper insight into model performance.
• Community-Driven: Open for contributions and feedback from the AI research community.
1. What is the purpose of OPEN-MOE-LLM-LEADERBOARD?
The leaderboard aims to provide a transparent and standardized platform for comparing and evaluating large language models. It helps users identify the best models for their specific needs.
2. How do I submit my own model to the leaderboard?
To submit your model, follow the submission guidelines provided on the platform. This typically involves providing model weights, configuration details, and benchmarking results.
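As one illustration only (the leaderboard's own submission form and guidelines define the actual process), model weights and configuration files can be published to the Hugging Face Hub with the `huggingface_hub` library; the repository name and local folder path below are placeholders.

```python
# Hypothetical sketch: publishing model weights and config to the Hugging Face Hub
# so they can be referenced in a leaderboard submission. The repo id and local
# folder are placeholders; the platform's submission guidelines govern the actual flow.
from huggingface_hub import HfApi

api = HfApi()

# Create the model repository if it does not exist yet.
api.create_repo(repo_id="your-username/your-moe-model", repo_type="model", exist_ok=True)

# Upload the local folder containing weights, tokenizer files, and config.json.
api.upload_folder(
    folder_path="./your-moe-model",
    repo_id="your-username/your-moe-model",
    repo_type="model",
)
```

The resulting repository id can then be referenced in whatever submission form or configuration the platform asks for.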
3. What metrics are used to evaluate models on the leaderboard?
Models are evaluated based on a variety of metrics, including accuracy, inference speed, parameter efficiency, and performance on specific benchmarks. The exact metrics may vary depending on the task or dataset.
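As a hedged sketch (the leaderboard does not necessarily run this exact tooling), accuracy-style benchmark scores of the kind listed above can be computed with the open-source lm-evaluation-harness (`lm_eval`) library; the model id and task names below are placeholders.

```python
# Hedged sketch: computing accuracy-style benchmark scores with lm-evaluation-harness.
# The model id and task list are placeholders, not the leaderboard's official pipeline.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=your-username/your-moe-model,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag"],          # example benchmark tasks
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) are reported under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```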