Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 60.1%
HLE: 6.1%
LiveCodeBench: Not evaluated
SciCode: 17.4%
TerminalBench Hard: 0.8%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 54.6%
Long Context Recall: 11.0%
Tau2: 81.0%