AI Model Benchmarks

A collection of AI model benchmarks, organized by category.

Intelligence

MMLU Pro
An expanded, more difficult version of MMLU that evaluates language models with reasoning-focused multiple-choice questions across a broad range of academic domains.
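To make concrete what evaluating a model on a multiple-choice benchmark like this involves, here is a minimal sketch in Python. The Hugging Face dataset ID (TIGER-Lab/MMLU-Pro), the field names, and the query_model helper are assumptions for illustration, not part of any official evaluation harness.

```python
# Minimal sketch of scoring a model on a multiple-choice benchmark.
# The dataset ID, field names, and query_model() are assumptions made
# for illustration; they are not an official harness.
from datasets import load_dataset

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    raise NotImplementedError

def format_prompt(example: dict) -> str:
    # Each example is assumed to have a question and a list of answer options.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(example["options"])
    )
    return (
        f"{example['question']}\n{options}\n"
        "Answer with the letter of the correct option."
    )

def evaluate(split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("TIGER-Lab/MMLU-Pro", split=split)  # dataset ID assumed
    correct = 0
    for example in ds.select(range(limit)):
        reply = query_model(format_prompt(example)).strip()
        # The reference answer is assumed to be stored as a letter, e.g. "C".
        if reply[:1].upper() == example["answer"]:
            correct += 1
    return correct / limit

if __name__ == "__main__":
    print(f"accuracy: {evaluate():.3f}")
```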
Humanity's Last Exam
A deliberately difficult benchmark of expert-written questions spanning many academic subjects, designed to test the limits of what AI models know and can reason about.

Coding

Aider Polyglot
Evaluates how well AI models can understand and edit code across several programming languages, using Exercism-style coding exercises that the model must solve by modifying source files until the tests pass.

Other

Berkeley Function-Calling Leaderboard
Measures how accurately AI models can call functions (tools): given structured function definitions and a user request, the model must choose the right function and supply correct arguments across a variety of scenarios.
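As an illustration of what a function-calling evaluation checks, here is a minimal sketch in Python: one function definition in the common OpenAI-style JSON schema shape, and a check that the model's structured call matches the expected one. The schema layout, the example function, and the is_correct_call helper are assumptions for illustration; the leaderboard's actual formats and scoring are more involved.

```python
# Minimal sketch of a function-calling check: the model sees a function
# definition plus a user request, and its structured call is compared
# against the expected call. Schema shape follows the common OpenAI-style
# tool format; the leaderboard's exact format may differ (assumption).
import json

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def is_correct_call(model_output: str, expected: dict) -> bool:
    """Compare the model's JSON function call against the expected one."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )

# Example: the user asked "What's the weather in Paris in celsius?"
expected_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
model_reply = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(is_correct_call(model_reply, expected_call))  # True
```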