Evaluate code generation with diverse feedback types
Leaderboard of information retrieval models in French
Evaluate LLM over-refusal rates with OR-Bench
Compare code model performance on benchmarks
Submit models for evaluation and view leaderboard
Explore and manage STM32 ML models with the STM32AI Model Zoo dashboard
Browse and submit language model benchmarks
Track, rank and evaluate open LLMs and chatbots
Pergel: A Unified Benchmark for Evaluating Turkish LLMs
Browse and submit LLM evaluations
Merge machine learning models using a YAML configuration file
Calculate GPU requirements for running LLMs
Generate and view leaderboards for LLM evaluations
ConvCodeWorld is a benchmarking tool for evaluating and comparing code generation models. It assesses models through diverse feedback types, giving a fuller picture of a model's code generation capabilities than a single aggregate score.
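For intuition, here is a minimal sketch of feedback-driven evaluation: the model produces code, a harness executes it against a test, and on failure the captured error is fed back for another attempt. This illustrates the general idea only, not ConvCodeWorld's actual API; `generate` (a prompt-to-code callable), `run_candidate`, and the prompt format are all assumptions.

```python
import subprocess
import tempfile


def run_candidate(code: str, test: str) -> tuple[bool, str]:
    # Write the candidate solution plus its test to a temp file and run it.
    # The captured stderr serves as execution feedback on failure.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return result.returncode == 0, result.stderr


def evaluate_with_feedback(generate, problem: str, test: str, max_turns: int = 3) -> int:
    # Multi-turn loop: on failure, append the execution feedback to the
    # prompt and ask the model to retry. Returns the turn on which the
    # test first passed, or -1 if it never did.
    prompt = problem
    for turn in range(1, max_turns + 1):
        passed, feedback = run_candidate(generate(prompt), test)
        if passed:
            return turn
        prompt += f"\n\nYour previous attempt failed with:\n{feedback}\nPlease fix it."
    return -1
```

A turn-indexed score like this rewards models that can act on feedback, not just those that succeed on the first try.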
• Multiple Feedback Types: Supports various feedback mechanisms, including user ratings, pairwise comparisons, and error detection tasks.
• Customizable Benchmarks: Allows users to define custom benchmarks tailored to specific use cases or programming languages (see the configuration sketch after this list).
• Detailed Metrics: Provides in-depth performance metrics, including correctness, efficiency, and user satisfaction scores.
• Model Agnostic: Compatible with a wide range of code generation models, ensuring versatility in evaluation.
• Version Tracking: Enables longitudinal analysis of model improvements over time.
• Collaborative Interface: Offers a shared workspace for teams to review and discuss model performance.
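As a concrete illustration of what a custom benchmark definition might look like, the object below bundles tasks, feedback settings, and metrics into one declarative structure. ConvCodeWorld's actual configuration schema is not documented here, so every key name is an assumption.

```python
# Hypothetical benchmark definition; the key names are illustrative,
# not ConvCodeWorld's actual schema.
custom_benchmark = {
    "name": "python-string-tasks",
    "language": "python",
    "feedback_types": ["execution", "error_detection", "pairwise"],
    "metrics": ["correctness", "efficiency"],
    "max_feedback_turns": 3,
    "tasks": [
        {
            "id": "reverse-words",
            "prompt": "Write reverse_words(s) that reverses the word order in s.",
            "test": "assert reverse_words('a b c') == 'c b a'",
        },
    ],
}
```

Keeping the whole benchmark in one declarative object makes runs reproducible and easy to version and share alongside results.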
What makes ConvCodeWorld unique?
ConvCodeWorld stands out due to its diverse feedback mechanisms, which give a fuller view of model performance than a single accuracy-style score.
Which programming languages does ConvCodeWorld support?
ConvCodeWorld supports a wide range of programming languages, including Python, Java, C++, and JavaScript, with more languages being added regularly.
How long does it take to run a benchmark?
The time required to run a benchmark depends on the size of the test set and the complexity of the tasks. Small benchmarks can complete in minutes, while larger ones may take several hours.