AI Model Benchmarks

A collection of AI model benchmarks, organized by category.

Intelligence

MMLU Pro
An expanded, more difficult version of MMLU that evaluates language models with reasoning-focused multiple-choice questions across a broad range of academic domains.
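To make concrete what evaluating a model on a multiple-choice benchmark like this involves, here is a minimal sketch in Python. The Hugging Face dataset ID (TIGER-Lab/MMLU-Pro), the field names, and the query_model helper are assumptions for illustration, not part of any official evaluation harness.

```python
# Minimal sketch of scoring a model on a multiple-choice benchmark.
# The dataset ID, field names, and query_model() are assumptions made
# for illustration; they are not an official harness.
from datasets import load_dataset

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    raise NotImplementedError

def format_prompt(example: dict) -> str:
    # Each example is assumed to have a question and a list of answer options.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(example["options"])
    )
    return (
        f"{example['question']}\n{options}\n"
        "Answer with the letter of the correct option."
    )

def evaluate(split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("TIGER-Lab/MMLU-Pro", split=split)  # dataset ID assumed
    correct = 0
    for example in ds.select(range(limit)):
        reply = query_model(format_prompt(example)).strip()
        # The reference answer is assumed to be stored as a letter, e.g. "C".
        if reply[:1].upper() == example["answer"]:
            correct += 1
    return correct / limit

if __name__ == "__main__":
    print(f"accuracy: {evaluate():.3f}")
```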
Humanity's Last Exam
A deliberately difficult benchmark of expert-written questions spanning many academic subjects, designed to test the limits of what AI models know and can reason about.

Coding

Aider Polyglot
Evaluates how well AI models can understand and edit code across several programming languages, using Exercism-style coding exercises that the model must solve by modifying source files until the tests pass.

Other

Berkeley Function-Calling Leaderboard
Measures how accurately AI models can call functions (tools): given structured function definitions and a user request, the model must choose the right function and supply correct arguments across a variety of scenarios.
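As an illustration of what a function-calling evaluation checks, here is a minimal sketch in Python: one function definition in the common OpenAI-style JSON schema shape, and a check that the model's structured call matches the expected one. The schema layout, the example function, and the is_correct_call helper are assumptions for illustration; the leaderboard's actual formats and scoring are more involved.

```python
# Minimal sketch of a function-calling check: the model sees a function
# definition plus a user request, and its structured call is compared
# against the expected call. Schema shape follows the common OpenAI-style
# tool format; the leaderboard's exact format may differ (assumption).
import json

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def is_correct_call(model_output: str, expected: dict) -> bool:
    """Compare the model's JSON function call against the expected one."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )

# Example: the user asked "What's the weather in Paris in celsius?"
expected_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
model_reply = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(is_correct_call(model_reply, expected_call))  # True
```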