Benchmarks
MMLU-Pro
79.6%
GPQA Diamond
69.5%
HLE
8.2%
LiveCodeBench
65.1%
SciCode
28.2%
TerminalBench Hard
3.8%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
80.3%
IFBench
57.0%
Long Context Recall
13.0%
Tau2
46.5%