Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 67.6%
HLE: 6.6%
LiveCodeBench: Not evaluated
SciCode: 27.2%
TerminalBench Hard: 18.2%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 36.7%
Long Context Recall: 11.7%
Tau2: 93.0%