Measure over-refusal in LLMs using OR-Bench
Compare code model performance on benchmarks
Calculate memory usage for LLMs (see the rough estimation sketch after this list)
Measure BERT model performance using WASM and WebGPU
Visualize model performance on function calling tasks
Explore and manage STM32 ML models with the STM32AI Model Zoo dashboard
Submit models for evaluation and view leaderboard
Push an ML model to the Hugging Face Hub
Explore GenAI model efficiency on the ML.ENERGY leaderboard
Convert and upload model files for Stable Diffusion
Track, rank and evaluate open LLMs and chatbots
Evaluate Text-To-Speech (TTS) output using objective metrics
Explore and visualize diverse models
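As a rough illustration of the memory-calculation idea mentioned above, the sketch below estimates how much memory an LLM's weights need from its parameter count and datatype. The byte-size table, function name, and 1.2x overhead factor are assumptions made for illustration, not the calculator tool's actual method.

```python
# Back-of-the-envelope estimate of LLM memory use (illustrative sketch only).
# Assumes memory is dominated by the weights (parameters x bytes per parameter)
# plus a flat overhead factor for activations, buffers, and framework state.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}

def estimate_weights_gb(num_params: float, dtype: str = "float16", overhead: float = 1.2) -> float:
    """Return an approximate figure, in GB, for loading the model weights."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype] * overhead
    return total_bytes / 1024**3

# Example: a 7B-parameter model in float16 is about 7e9 * 2 bytes ~= 13 GB of
# weights before overhead; the estimate below adds the 1.2x fudge factor.
print(f"{estimate_weights_gb(7e9, 'float16'):.1f} GB")
```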
The OR-Bench Leaderboard is a tool for measuring and comparing over-refusal (OR) behavior in large language models (LLMs). It provides a standardized framework for evaluating how often a model declines prompts it could safely and meaningfully answer, so that benchmarking stays consistent and fair across models. The leaderboard helps researchers and developers understand the limitations and capabilities of LLMs when deciding whether to answer or refuse a query.
What is over-refusal in LLMs?
Over-refusal is when a model refuses to respond to a query even though it could provide a safe and meaningful answer. For example, a model might decline to explain how to "kill" a Linux process because the wording superficially resembles a harmful request.
Why is benchmarking over-refusal important?
Benchmarking helps identify models that may excessively refuse to answer, potentially limiting their utility in real-world applications.
How do I interpret the results from OR-Bench Leaderboard?
Results show how often and in what contexts models refuse to respond, enabling comparisons of refusal behavior across different models.
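To make the "how often" part concrete, here is a minimal sketch of how an over-refusal rate could be computed over a set of safe prompts. The phrase-matching heuristic, marker list, and `generate` callable are illustrative assumptions, not OR-Bench's actual prompt set or judging method.

```python
# Minimal sketch of computing an over-refusal rate (assumptions, not OR-Bench's code):
# refusals are detected with a naive phrase match, and `generate` is a hypothetical
# callable that returns the model's reply to a prompt.

REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm sorry, but i can't",
    "i am unable to",
)

def looks_like_refusal(reply: str) -> bool:
    """Crude heuristic: does the reply contain a stock refusal phrase?"""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def over_refusal_rate(prompts: list[str], generate) -> float:
    """Fraction of safe prompts the model refuses to answer (lower is better)."""
    refusals = sum(looks_like_refusal(generate(p)) for p in prompts)
    return refusals / len(prompts)

# Usage with a toy 'model' that refuses anything mentioning the word "kill":
toy_model = lambda p: "I can't help with that." if "kill" in p.lower() else "Sure: ..."
safe_prompts = ["How do I kill a Python process?", "What is a firewall?"]
print(over_refusal_rate(safe_prompts, toy_model))  # 0.5
```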