Benchmarks
MMLU-Pro
71.3%
GPQA Diamond
54.1%
HLE
3.9%
LiveCodeBench
39.3%
SciCode
22.3%
TerminalBench Hard
4.5%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
35.3%
IFBench
41.0%
Long Context Recall
19.0%
Tau2
20.8%