Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 53.8%
HLE: 5.7%
LiveCodeBench: Not evaluated
SciCode: 17.8%
TerminalBench Hard: 2.3%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 47.1%
Long Context Recall: 14.7%
Tau2: 93.3%