Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 27.8%
HLE: 6.5%
LiveCodeBench: Not evaluated
SciCode: 4.4%
TerminalBench Hard: 0.0%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 49.3%
Long Context Recall: 3.7%
Tau2: 81.0%