Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 65.7%
HLE: 9.2%
LiveCodeBench: Not evaluated
SciCode: 26.9%
TerminalBench Hard: 2.3%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 57.7%
Long Context Recall: 36.0%
Tau2: 48.2%