Browse and submit evaluations for CaselawQA benchmarks
CaselawQA leaderboard (WIP) is a tool for browsing and submitting evaluations for the CaselawQA benchmarks. It serves as a platform to track and compare the performance of different models on legal question-answering tasks. The leaderboard is currently a work in progress, with ongoing updates to improve functionality and user experience.
• Benchmark Browse: Explore and view performance metrics for various models on CaselawQA benchmarks.
• Submission Portal: Easily submit your model's results for evaluation.
• Comparison Tools: Compare model performance across different metrics and tasks (see the sketch after this list).
• Filtering Options: Narrow down results by specific criteria such as model type or benchmark version.
• Version Tracking: Track changes in model performance over time.
• Community Sharing: Share insights and discuss results with other users.
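As a rough illustration of how results might be compared and filtered programmatically, here is a minimal sketch assuming the leaderboard publishes aggregated results as a CSV in a Hugging Face dataset repository. The repository id, file name, and column names below are hypothetical, not documented behavior of the leaderboard.

```python
# Hypothetical example: the results location, file format, and column names
# are assumptions for illustration, not the leaderboard's actual layout.
import pandas as pd
from huggingface_hub import hf_hub_download

RESULTS_REPO = "caselawqa/leaderboard-results"  # hypothetical dataset repo id
RESULTS_FILE = "results.csv"                    # hypothetical results file

# Download the results file from the Hub and load it into a DataFrame.
path = hf_hub_download(repo_id=RESULTS_REPO, filename=RESULTS_FILE, repo_type="dataset")
df = pd.read_csv(path)

# Filter to one (hypothetical) benchmark version and rank models by accuracy.
filtered = df[df["benchmark_version"] == "v1"].sort_values("accuracy", ascending=False)
print(filtered[["model", "accuracy"]].head(10))
```

The same filtering and sorting the web interface exposes could be reproduced this way for custom analyses, provided the results are published in a machine-readable form.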
What is the purpose of the CaselawQA leaderboard?
The leaderboard is designed to facilitate model evaluation and comparison for legal question-answering tasks, helping researchers and developers track progress in the field.
Do I need specific expertise to use the leaderboard?
While some technical knowledge is helpful, the platform is designed to be accessible to both experts and newcomers. Detailed instructions and guidelines are provided for submissions.
How are submissions evaluated?
Submissions are evaluated against predefined metrics for the CaselawQA benchmarks, ensuring consistency and fairness in comparisons. Results are updated periodically.
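To make "predefined metrics" concrete, here is a minimal sketch of an accuracy-style score, assuming submissions are judged by exact match between a predicted answer and a gold label. The data format (question-id to answer mappings) is an assumption for illustration; the leaderboard's actual metrics and scoring pipeline may differ.

```python
# Sketch of an exact-match accuracy metric; the prediction/gold format here
# is assumed for illustration and is not the leaderboard's documented schema.
from typing import Dict


def exact_match_accuracy(predictions: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold questions whose predicted answer matches exactly (case-insensitive)."""
    if not gold:
        return 0.0
    correct = sum(
        1
        for qid, answer in gold.items()
        if predictions.get(qid, "").strip().lower() == answer.strip().lower()
    )
    return correct / len(gold)


# Toy usage example.
preds = {"q1": "Yes", "q2": "No"}
labels = {"q1": "yes", "q2": "yes"}
print(f"accuracy = {exact_match_accuracy(preds, labels):.2f}")  # accuracy = 0.50
```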