Benchmarks
MMLU-Pro
Not evaluated
GPQA Diamond
71.3%
HLE
9.5%
LiveCodeBench
Not evaluated
SciCode
33.0%
TerminalBench Hard
6.8%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
Not evaluated
IFBench
62.8%
Long Context Recall
52.7%
Tau2
25.4%