Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 62.3%
HLE: Not evaluated
LiveCodeBench: Not evaluated
SciCode: Not evaluated
TerminalBench Hard: Not evaluated
MATH-500: 92.1%
AIME: 77.0%
AIME 2025: Not evaluated
IFBench: Not evaluated
Long Context Recall: Not evaluated
Tau2: Not evaluated