Evaluate code generation with diverse feedback types
ConvCodeWorld is a benchmarking tool for evaluating and comparing code generation models. By assessing models through diverse feedback types, it offers a comprehensive platform for understanding and improving code generation capabilities.
• Multiple Feedback Types: Supports various feedback mechanisms, including user ratings, pairwise comparisons, and error detection tasks.
• Customizable Benchmarks: Allows users to define custom benchmarks tailored to specific use cases or programming languages.
• Detailed Metrics: Provides in-depth performance metrics, including correctness, efficiency, and user satisfaction scores.
• Model Agnostic: Compatible with a wide range of code generation models, ensuring versatility in evaluation.
• Version Tracking: Records model versions to enable longitudinal analysis of improvements over time.
• Collaborative Interface: Offers a shared workspace for teams to review and discuss model performance.
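To make the feedback-driven evaluation concrete, here is a minimal sketch of an execution-based feedback loop, one of the feedback mechanisms described above. All names in this snippet (`Task`, `run_with_feedback`) are hypothetical illustrations, not ConvCodeWorld's actual API:

```python
# Hypothetical sketch of an execution-feedback evaluation step.
# These names are illustrative only, not ConvCodeWorld's real API.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str     # natural-language task description
    candidate: str  # code produced by the model under evaluation
    test: str       # assertion-based test for the candidate


def run_with_feedback(task: Task) -> dict:
    """Execute a candidate solution and record simple feedback signals."""
    feedback = {"compiled": False, "passed": False, "error": None}
    try:
        scope: dict = {}
        exec(task.candidate, scope)  # definition step ("does it even run?")
        feedback["compiled"] = True
        exec(task.test, scope)       # execution-feedback step
        feedback["passed"] = True
    except Exception as exc:         # capture error detail as feedback
        feedback["error"] = repr(exc)
    return feedback


# Usage: one correct and one buggy candidate for the same task.
ok = Task("add two numbers",
          "def add(a, b):\n    return a + b",
          "assert add(2, 3) == 5")
bad = Task("add two numbers",
           "def add(a, b):\n    return a - b",
           "assert add(2, 3) == 5")

print(run_with_feedback(ok))   # passed: True
print(run_with_feedback(bad))  # passed: False, error records the AssertionError
```

In a full benchmark, the error detail captured here would be fed back to the model for a repair attempt, which is what distinguishes conversational evaluation from single-shot scoring.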
What makes ConvCodeWorld unique?
ConvCodeWorld stands out due to its diverse feedback mechanisms, which provide a holistic view of model performance beyond traditional metrics.
Which programming languages does ConvCodeWorld support?
ConvCodeWorld supports a wide range of programming languages, including Python, Java, C++, and JavaScript, with more languages being added regularly.
How long does it take to run a benchmark?
The time required to run a benchmark depends on the size of the test set and the complexity of the tasks. Small benchmarks can complete in minutes, while larger ones may take several hours.
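As a rough back-of-the-envelope check, total runtime scales with the number of tasks times the per-task cost. The numbers below are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope benchmark runtime estimate (illustrative numbers only).
def estimated_runtime_minutes(num_tasks: int, seconds_per_task: float) -> float:
    """Estimate total wall-clock time in minutes for a sequential run."""
    return num_tasks * seconds_per_task / 60

print(estimated_runtime_minutes(100, 5))    # small run: ~8.3 minutes
print(estimated_runtime_minutes(5000, 10))  # large run: ~833 minutes (~14 hours)
```

Parallel execution across workers divides these figures roughly by the worker count, which is why larger test sets remain practical.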