Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 75.3%
HLE: 14.6%
LiveCodeBench: Not evaluated
SciCode: 38.2%
TerminalBench Hard: 18.2%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 73.5%
Long Context Recall: 55.3%
Tau2: 34.8%