Benchmarks
MMLU-Pro
85.4%
GPQA Diamond
76.4%
HLE
13.3%
LiveCodeBench
76.6%
SciCode
40.7%
TerminalBench Hard
26.5%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
79.3%
IFBench
51.4%
Long Context Recall
65.3%
Tau2
58.2%