Evaluate LLM over-refusal rates with OR-Bench
Measure over-refusal in LLMs using OR-Bench
OR-Bench Leaderboard is a benchmarking platform designed to evaluate Large Language Models (LLMs) on their over-refusal rates: how often a model declines to answer prompts that merely appear unsafe but are in fact benign. It provides a framework for assessing this behavior across models, offering insight into their reliability and responsiveness. This tool is particularly useful for researchers and developers aiming to optimize LLM performance and transparency.
• Benchmarking of LLMs: Comprehensive evaluation of models based on their refusal rates.
• Performance Metrics: Detailed metrics on refusal rates across diverse scenarios and prompts.
• Model Comparisons: Side-by-side comparisons to identify top-performing models.
• Scenario Support: Testing models against a wide range of scenarios.
• Transparency: Open and accessible results for community review.
• Community-Driven: Continuously updated with new models and data.
What does the OR-Bench Leaderboard measure?
The leaderboard measures the over-refusal rates of LLMs, indicating how often models refuse to respond to prompts that are safe but superficially resemble harmful requests.
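As a rough illustration, here is a minimal sketch of that metric, assuming model responses to benign but seemingly unsafe prompts have already been collected. The keyword heuristic and the helper names (`is_refusal`, `over_refusal_rate`) are assumptions for this sketch, not the leaderboard's actual classifier.

```python
# Minimal sketch of an over-refusal metric, assuming responses to benign but
# seemingly unsafe prompts are already collected. The keyword heuristic below
# is a naive stand-in for a real refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(response: str) -> bool:
    """Naive check: does the response open with a typical refusal phrase?"""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def over_refusal_rate(responses: list[str]) -> float:
    """Fraction of prompts the model declined to answer."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

# Example: 2 of the 4 responses below are refusals, so the rate is 0.5.
print(over_refusal_rate([
    "I can't help with that request.",
    "Sure, here is a short overview of how door locks work...",
    "I'm sorry, but I cannot assist with this.",
    "Chlorine is commonly used to disinfect swimming pools because...",
]))
```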
How are the models evaluated?
Models are evaluated using a standardized set of scenarios designed to test their responsiveness and reliability.
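To make the scenario-based evaluation concrete, the hypothetical sketch below aggregates refusal decisions per scenario category. The category names and the `per_category_rates` helper are illustrative assumptions, not the leaderboard's actual pipeline.

```python
from collections import defaultdict

# Hypothetical sketch of per-scenario aggregation: each record pairs a
# scenario category with a boolean refusal flag (e.g. produced by the
# is_refusal() helper above). Category names are illustrative only.
def per_category_rates(records: list[tuple[str, bool]]) -> dict[str, float]:
    totals: defaultdict[str, int] = defaultdict(int)
    refusals: defaultdict[str, int] = defaultdict(int)
    for category, refused in records:
        totals[category] += 1
        refusals[category] += int(refused)
    return {category: refusals[category] / totals[category] for category in totals}

print(per_category_rates([
    ("privacy", True),
    ("privacy", False),
    ("violence", True),
]))
# -> {'privacy': 0.5, 'violence': 1.0}
```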
Can I contribute to the leaderboard?
Yes, contributions are welcome. Submit your model or scenario suggestions through the platform's community portal.