Measure over-refusal in LLMs using OR-Bench
Evaluate open LLMs in the languages of LATAM and Spain.
Visualize model performance on function calling tasks
Explore and manage STM32 ML models with the STM32AI Model Zoo dashboard
Calculate memory usage for LLMs
Convert Stable Diffusion checkpoint to Diffusers and open a PR
Evaluate adversarial robustness using generative models
View and submit LLM benchmark evaluations
Text-to-Speech (TTS) evaluation using objective metrics
Export Hugging Face models to ONNX
Track, rank and evaluate open LLMs and chatbots
Demo of the new, massively multilingual leaderboard
Explore GenAI model efficiency on ML.ENERGY leaderboard
The OR-Bench Leaderboard is a tool for measuring and comparing over-refusal (OR) behavior in large language models (LLMs). It provides a standardized framework for evaluating how models respond to prompts that look sensitive but can be answered safely, ensuring consistent and fair benchmarking across models. The leaderboard helps researchers and developers understand how well different LLMs balance caution with helpfulness.
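To make the idea of measuring over-refusal concrete, here is a minimal sketch of an evaluation loop. It is not the leaderboard's actual pipeline: the `generate` callable, the toy model, the example prompts, and the keyword heuristic for spotting refusals are all illustrative assumptions, and a real benchmark would use a far larger prompt set and a more careful refusal judge.

```python
from typing import Callable, Sequence

# Stock phrases that often signal a refusal (an assumption for this sketch).
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm sorry, but",
    "i am unable to",
)


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: flag responses containing a stock refusal phrase."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def over_refusal_rate(generate: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Fraction of prompts (all assumed safe to answer) that the model refuses."""
    refusals = sum(looks_like_refusal(generate(p)) for p in prompts)
    return refusals / len(prompts)


if __name__ == "__main__":
    # Toy stand-in for a real model: it refuses anything mentioning "security".
    def toy_model(prompt: str) -> str:
        if "security" in prompt.lower():
            return "I'm sorry, but I can't help with that."
        return "Sure, here is an explanation..."

    prompts = [
        "How do I improve the security of my home Wi-Fi?",
        "Explain how vaccines train the immune system.",
    ]
    print(f"Over-refusal rate: {over_refusal_rate(toy_model, prompts):.2f}")
```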
What is over-refusal in LLMs?
Over-refusal occurs when a model declines to answer a query even though it could safely provide a meaningful response, for example refusing to explain basic chemistry because the topic merely sounds hazardous.
Why is benchmarking over-refusal important?
Benchmarking helps identify models that refuse too often, which limits their usefulness in real-world applications.
How do I interpret the results from OR-Bench Leaderboard?
Results show how often and in what contexts models refuse to respond, enabling comparisons of refusal behavior across different models.
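As a rough illustration of that kind of comparison, the sketch below aggregates per-prompt outcomes into refusal rates per model and prompt category. The model names, categories, and outcomes are invented for illustration and do not reflect actual leaderboard data or its export format.

```python
from collections import defaultdict

# Hypothetical per-prompt results: (model, prompt category, refused?).
results = [
    ("model-a", "self-harm adjacent", True),
    ("model-a", "chemistry", False),
    ("model-a", "chemistry", False),
    ("model-b", "self-harm adjacent", False),
    ("model-b", "chemistry", True),
    ("model-b", "chemistry", False),
]

# Aggregate into refusal rates per (model, category) pair.
counts = defaultdict(lambda: [0, 0])  # [refusals, total prompts]
for model, category, refused in results:
    counts[(model, category)][0] += int(refused)
    counts[(model, category)][1] += 1

# Print one line per (model, category) so refusal behavior can be compared.
for (model, category), (refused, total) in sorted(counts.items()):
    print(f"{model:10s} {category:20s} {refused / total:.0%} refusal rate")
```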