Benchmarks
MMLU-Pro
80.9%
GPQA Diamond
70.1%
HLE
7.7%
LiveCodeBench
65.6%
SciCode
33.2%
TerminalBench Hard
2.3%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
76.7%
IFBench
49.3%
Long Context Recall
9.0%
Tau2
86.5%