Benchmarks
MMLU-Pro: 80.9%
GPQA Diamond: 70.1%
HLE: 7.7%
LiveCodeBench: 65.6%
SciCode: 33.2%
TerminalBench Hard: 2.3%
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: 76.7%
IFBench: 49.3%
Long Context Recall: 9.0%
Tau2: 86.5%