Measure over-refusal in LLMs using OR-Bench
Pergel: A Unified Benchmark for Evaluating Turkish LLMs
Convert PaddleOCR models to ONNX format
Benchmark AI models by comparison
Evaluate and submit AI model results for Frugal AI Challenge
Create and upload a Hugging Face model card
Evaluate open LLMs in the languages of LATAM and Spain.
Leaderboard of information retrieval models in French
Calculate memory usage for LLM models
Browse and submit model evaluations in LLM benchmarks
Display model benchmark results
Browse and evaluate language models
Request model evaluation on COCO val 2017 dataset
OR-Bench Leaderboard is a tool designed to measure and compare over-refusal (OR) behavior in large language models (LLMs). It provides a standardized framework to evaluate how models respond to refusal scenarios, ensuring consistent and fair benchmarking across different models. The leaderboard helps researchers and developers understand the limitations and capabilities of LLMs in handling refusal tasks.
What is over-refusal in LLMs?
Over-refusal refers to when a model refuses to respond to a query, even when it could provide a meaningful answer.
Why is benchmarking over-refusal important?
Benchmarking helps identify models that may excessively refuse to answer, potentially limiting their utility in real-world applications.
How do I interpret the results from OR-Bench Leaderboard?
Results show how often and in what contexts models refuse to respond, enabling comparisons of refusal behavior across different models.