Benchmarks
MMLU-Pro
76.1%
GPQA Diamond
59.8%
HLE
4.4%
LiveCodeBench
54.1%
SciCode
25.2%
TerminalBench Hard
8.3%
MATH-500
Not evaluated
AIME
Not evaluated
AIME 2025
64.7%
IFBench
55.1%
Long Context Recall
28.0%
Tau2
24.9%