We just open-sourced a tiny GPT-style cognitive core built in pure Rust. See our repository

Terminal-Bench 2.1

One harness to unlock the potential of all models

Compare Ante runs across models on the same Terminal-Bench 2.1 task set, using consistent parameters and verified benchmark results.

Best Accuracy: 85.8%

Leading Model: Grok 4.5

Task Set: 89 tasks

Trials: 380 passed / 445 trials

Benchmark principles

We benchmark what we ship. Every eval uses a pinned public . No eval-only branches or benchmark-specific prompts.

The runs are auditable. Every result links its raw Harbor run, so anyone can inspect the trials behind the number.

The constraints are official. All trials follow the official Terminal-Bench parameters: 89 tasks, 5 trials per task, strict timeouts, and hardware limits.

TB 2.1 · Same parameters: 89 tasks · 5 trials/task · Updated Jul 10, 2026

#	Model	Accuracy	Cost	Time	Same-model	Agent	Source	Run date
1	Grok 4.5	85.8% ±1.66 SE	$242.57	8.4 min	#1 same-model	Ante 0.preview.56	Harbor link	Jul 10, 2026
2	GLM 5.2	74.6% ±2.06 SE	$260.11	11.3 min	#1 same-model	Ante 0.preview.43	Harbor link	Jun 20, 2026
3	DeepSeek V4 Pro	69.1% ±2.24 SE	$26.34	48.4 min	#1 same-model	Ante 0.preview.54	Harbor link	Jul 7, 2026
4	DeepSeek V4 Flash	66.4% ±2.27 SE	$49.98	41.4 min	#1 same-model	Ante 0.preview.53	Harbor link	Jul 5, 2026
5	MiMo V2.5	65.8% ±2.30 SE	$73.76	12.5 min	#1 same-model	Ante 20260625-0824-e383a92	Harbor link	Jun 25, 2026
6	MiniMax M3	62.1% ±2.33 SE	$121.00	13.9 min	#1 same-model	Ante 20260623-0825-ff174ee	Harbor link	Jun 23, 2026
7	Qwen3.6 27B	56.2% ±2.36 SE	Local	61.6 min	No public rows	Ante 20260701-0837-70d2aac	Harbor link	Jul 3, 2026
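
The page does not say how the ±SE values in the table above are computed. As a rough sanity check only, the sketch below assumes every trial is an independent pass/fail draw and applies the plain binomial standard error, sqrt(p(1 - p) / n), with n = 89 tasks × 5 trials. The function names are hypothetical, and the independence assumption is ours, not the benchmark's. Under that assumption the result lands close to, but not exactly on, the reported figures, so the published numbers likely use a somewhat different estimator or trial count.

```python
import math

TASKS = 89           # from the stated benchmark parameters
TRIALS_PER_TASK = 5  # from the stated benchmark parameters


def total_trials(tasks: int = TASKS, trials_per_task: int = TRIALS_PER_TASK) -> int:
    """Number of trials in one full run: tasks times trials per task."""
    return tasks * trials_per_task


def binomial_se(accuracy: float, n_trials: int) -> float:
    """Standard error of a pass rate, ASSUMING independent pass/fail trials.

    This is an illustrative approximation, not the leaderboard's documented method.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_trials)


if __name__ == "__main__":
    n = total_trials()
    # Accuracies copied from the first two rows of the table above.
    for model, acc in [("Grok 4.5", 0.858), ("GLM 5.2", 0.746)]:
        print(f"{model}: {acc:.1%} ± {binomial_se(acc, n):.2%} SE (n={n})")
```

Because trials on the same task are unlikely to be fully independent, a per-task or clustered estimator would be the more careful choice if the raw Harbor runs are available.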

Terminal-Bench Reference

Verified Public Leaderboard

Same parameters: 89 tasks · 5 trials/task · Updated Jul 17, 2026 · Source: Terminal-Bench official verified rows

For how different models perform on TB 2.1, see Vals AI's Terminal-Bench 2.1 benchmark.

17 official rows

#	Agent	Model	Accuracy	Run date
1	Claude Code (Anthropic)	Fable 5	83.8% ±1.16 SE	Jun 7, 2026
2	Codex (OpenAI)	GPT-5.5	83.2% ±1.13 SE	May 1, 2026
3	Terminus 2 (Terminal-Bench)	Fable 5	80.5% ±1.16 SE	Jun 5, 2026
4	Cursor CLI (Cursor)	Grok 4.5	79.3% ±1.46 SE	Jul 9, 2026
5	Claude Code (Anthropic)	Opus 4.8	78.9% ±1.31 SE	Jul 9, 2026
6	Codex (OpenAI)	GPT-5.6 Terra	78.4% ±1.25 SE	Jul 11, 2026
7	Terminus 2 (Terminal-Bench)	GPT-5.5	78.0% ±1.22 SE	May 1, 2026
8	mini-SWE-agent (Princeton)	Muse Spark 1.1	76.2% ±1.23 SE	Jul 9, 2026
9	Codex (OpenAI)	GPT-5.6 Luna	75.7% ±1.32 SE	Jul 11, 2026
10	Claude Code (Anthropic)	Sonnet 5	74.6% ±1.64 SE	Jul 9, 2026
11	Terminus 2 (Terminal-Bench)	Gemini 3 Pro	73.9% ±1.29 SE	May 1, 2026
12	Claude Code (Anthropic)	Opus 4.7	68.9% ±1.41 SE	May 1, 2026
13	Terminus 2 (Terminal-Bench)	Opus 4.7	66.1% ±1.37 SE	May 1, 2026
14	Gemini CLI (Google)	Gemini 3 Pro	65.8% ±1.38 SE	May 1, 2026
15	Gemini CLI (Google)	Gemini 3.1 Pro	65.8% ±1.67 SE	May 5, 2026
16	Terminus 2 (Terminal-Bench)	Gemini 3.1 Pro	65.6% ±1.65 SE	May 5, 2026
17	Claude Code (Anthropic)	GLM-5.1	58.6% ±1.24 SE	May 1, 2026