Evaluate code generation with diverse feedback types
Browse and submit LLM evaluations
Create and upload a Hugging Face model card
Calculate memory needed to train AI models
Demo of the new, massively multilingual leaderboard
Teach, test, evaluate language models with MTEB Arena
Evaluate LLM over-refusal rates with OR-Bench
Compare and rank LLMs using benchmark scores
Text-To-Speech (TTS) evaluation using objective metrics
Display and submit language model evaluations
Evaluate RAG systems with visual analytics
SolidityBench Leaderboard
Open Persian LLM Leaderboard
ConvCodeWorld is a benchmarking tool for evaluating and comparing code generation models. It assesses models through diverse feedback types, offering a comprehensive platform for understanding and improving code generation capabilities.
• Multiple Feedback Types: Supports various feedback mechanisms, including user ratings, pairwise comparisons, and error detection tasks (see the sketch after this list).
• Customizable Benchmarks: Allows users to define custom benchmarks tailored to specific use cases or programming languages.
• Detailed Metrics: Provides in-depth performance metrics, including correctness, efficiency, and user satisfaction scores.
• Model Agnostic: Compatible with a wide range of code generation models, ensuring versatility in evaluation.
• Version Tracking: Enables longitudinal analysis of model improvements over time.
• Collaborative Interface: Offers a shared workspace for teams to review and discuss model performance.
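To picture how feedback-driven evaluation of a generated solution might work, here is a minimal, purely illustrative sketch: a candidate program is first checked for compilation feedback, then for execution feedback against test cases. The names `FeedbackType`, `FeedbackResult`, and `evaluate_candidate` are hypothetical and do not correspond to ConvCodeWorld's actual API; richer channels such as user ratings or pairwise comparisons would plug in the same way.

```python
# Hypothetical sketch of feedback-driven evaluation.
# Names are illustrative only, not ConvCodeWorld's real API.
from dataclasses import dataclass
from enum import Enum, auto


class FeedbackType(Enum):
    COMPILATION = auto()  # does the generated code load without errors?
    EXECUTION = auto()    # does it pass the task's test cases?


@dataclass
class FeedbackResult:
    feedback_type: FeedbackType
    passed: bool
    detail: str


def evaluate_candidate(code: str, tests: list[tuple[str, object]]) -> list[FeedbackResult]:
    """Run one generated solution through several feedback channels."""
    results: list[FeedbackResult] = []

    # 1) Compilation feedback: try to load the source (a sandbox is assumed).
    namespace: dict = {}
    try:
        exec(code, namespace)
        results.append(FeedbackResult(FeedbackType.COMPILATION, True, "loaded"))
    except Exception as err:
        results.append(FeedbackResult(FeedbackType.COMPILATION, False, str(err)))
        return results  # no point running tests if the code does not load

    # 2) Execution feedback: check each (expression, expected value) pair.
    for expr, expected in tests:
        try:
            passed = eval(expr, namespace) == expected
        except Exception:
            passed = False
        results.append(FeedbackResult(FeedbackType.EXECUTION, passed, expr))

    return results


if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    report = evaluate_candidate(candidate, [("add(2, 3)", 5), ("add(-1, 1)", 0)])
    for r in report:
        print(r.feedback_type.name, "PASS" if r.passed else "FAIL", r.detail)
```

In practice, each feedback channel would contribute to the correctness, efficiency, and satisfaction metrics described above rather than a simple pass/fail flag.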
What makes ConvCodeWorld unique?
ConvCodeWorld stands out due to its diverse feedback mechanisms, which provide a holistic view of model performance beyond traditional metrics.
Which programming languages does ConvCodeWorld support?
ConvCodeWorld supports a wide range of programming languages, including Python, Java, C++, and JavaScript, with more languages being added regularly.
How long does it take to run a benchmark?
The time required to run a benchmark depends on the size of the test set and the complexity of the tasks. Small benchmarks can complete in minutes, while larger ones may take several hours.