Benchmarks
MMLU-Pro
78.6%
GPQA Diamond
68.1%
HLE
9.8%
LiveCodeBench
69.4%
SciCode
28.6%
TerminalBench Hard
9.8%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
78.3%
IFBench
60.1%
Long Context Recall
33.3%
Tau2
27.8%