Benchmarks
MMLU-Pro
83.8%
GPQA Diamond
78.3%
HLE
13.1%
LiveCodeBench
76.8%
SciCode
35.6%
TerminalBench Hard
22.7%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
90.3%
IFBench
64.7%
Long Context Recall
55.7%
Tau2
74.3%