Benchmarks
MMLU-Pro: Not evaluated
GPQA Diamond: 66.9%
HLE: 10.2%
LiveCodeBench: Not evaluated
SciCode: 34.3%
TerminalBench Hard: Not evaluated
MATH-500: Not evaluated
AIME: Not evaluated
AIME 2025: Not evaluated
IFBench: 59.5%
Long Context Recall: 14.3%
Tau2: Not evaluated