View and submit LLM evaluations
Calculate memory needed to train AI models
Upload ML model to Hugging Face Hub
Rank machines based on LLaMA 7B v2 benchmark results
Benchmark LLMs in accuracy and translation across languages
Evaluate Text-to-Speech (TTS) systems using objective metrics
Run benchmarks on prediction models
Download a TriplaneGaussian model checkpoint
Measure execution times of BERT models using WebGPU and WASM
Explore and manage STM32 ML models with the STM32AI Model Zoo dashboard
Evaluate reward models for math reasoning
Submit models for evaluation and view leaderboard
Display leaderboard of language model evaluations
The Hallucinations Leaderboard is a tool for evaluating and benchmarking large language models (LLMs). It provides a platform to view and submit evaluations of model performance, with a focus on understanding and mitigating hallucinations: cases where a model produces inaccurate or non-factual output.
• Leaderboard System: Compare the performance of different LLMs on hallucination metrics.
• Benchmarking Tools: Access standardized tests and evaluations for assessing model accuracy.
• Customizable Metrics: Define and apply specific criteria for measuring hallucinations.
• Model Comparison: Compare multiple models side by side.
• Submission Interface: Submit your own evaluations for inclusion in the leaderboard.
• Filtering and Sorting: Narrow results by model size, architecture, or performance thresholds (see the sketch after this list).
• Real-Time Updates: Stay current with the latest evaluations and benchmarks.
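As a rough illustration of the filtering and sorting workflow, here is a minimal Python sketch. The file name and column names (hallucination_leaderboard.csv, model, params_b, hallucination_rate) are hypothetical placeholders, not the leaderboard's actual export format:

```python
# Minimal sketch of filtering and sorting exported leaderboard results.
# The CSV file and its columns are hypothetical placeholders; adapt them
# to whatever export the leaderboard actually provides.
import pandas as pd

df = pd.read_csv("hallucination_leaderboard.csv")

# Keep models at or below 13B parameters with a hallucination rate under
# 20%, then rank them from least to most hallucination-prone.
filtered = (
    df[(df["params_b"] <= 13) & (df["hallucination_rate"] < 0.20)]
    .sort_values("hallucination_rate")
)

print(filtered[["model", "hallucination_rate"]].head(10))
```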
What is the purpose of the Hallucinations Leaderboard?
The purpose is to provide a centralized platform for evaluating and comparing LLMs, with a focus on reducing hallucinations and improving model accuracy.
How do I submit my own evaluations?
To submit evaluations, use the submission interface on the platform. Ensure your results align with the defined metrics and criteria.
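Assuming submissions reference models hosted on the Hugging Face Hub, it can help to confirm your model is publicly reachable before using the submission interface, since a leaderboard can only evaluate models it can download. Below is a minimal pre-flight check using the huggingface_hub client; the repo id is a placeholder:

```python
# Pre-flight check: verify the model repo is visible on the Hugging Face
# Hub before submitting it through the leaderboard's submission form.
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

api = HfApi()
repo_id = "your-username/your-model"  # hypothetical repo id

try:
    info = api.model_info(repo_id)
    print(f"Found {info.id} at revision {info.sha}; ready to submit.")
except RepositoryNotFoundError:
    print(f"{repo_id} is not visible on the Hub; make it public first.")
```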
Why is tracking hallucinations important?
Hallucinations can lead to misinformation. Tracking them helps improve model reliability and trustworthiness in real-world applications.