Benchmarks
MMLU-Pro
81.0%
GPQA Diamond
69.5%
HLE
5.4%
LiveCodeBench
Not evaluated
SciCode
27.0%
TerminalBench Hard
6.8%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
44.0%
IFBench
39.6%
Long Context Recall
47.0%
Tau2
59.1%