Benchmarks
MMLU-Pro
78.5%
GPQA Diamond
61.5%
HLE
5.5%
LiveCodeBench
62.9%
SciCode
28.4%
TerminalBench Hard
12.1%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
59.0%
IFBench
37.9%
Long Context Recall
11.7%
Tau2
87.4%