Benchmarks
MMLU-Pro
72.5%
GPQA Diamond
66.1%
HLE
10.8%
LiveCodeBench
72.4%
SciCode
24.9%
TerminalBench Hard
2.3%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
80.0%
IFBench
54.4%
Long Context Recall
8.7%
Tau2
27.8%