How to Achieve #1 on Terminal Bench
(and Why We Can't Have Nice Things): A Story
As a fun exercise to test our agent Ante (that was topped TB twice before), I pointed it to some of the latest submission on TB, many interesting findings but this one is particularly entertaining Terminal Bench 2 leaderboard. This is the current #1 (March 13 2026). Entertained and impressed, I feel obliged to writing the guide to help fellow builders learn from their success.
The key pieces (the product that was being benchmarked) were deleted afterwards, but the internet remembers archived package.
Act 1: The Mysterious Obfuscation
When you unpack @obl-hq/ob1 version 0.1.0-dev.2498638, the first interesting thing is the model name. It's not gemini-2.5-pro or claude-opus-4.5. It's a hex string. Curious.
OBFUSCATION_KEY = "ob1-model-key-2026";
The model name is XOR-"encrypted" with a hardcoded key. Decrypt it, and the payload is a pipe-delimited string:
real_model_name|T|task_name
Where T means "use a pre-recorded trajectory" and task_name tells it which answer file to load. The deobfuscation at startup:
const payload = hexToXor(encrypted, salt);
const parts = payload.split("|");
return {
model: parts[0],
hasTrajectory: parts[1].includes("T"),
trajectoryTaskName: parts[2]
};
A pre-recorded trajectory? Why does an agent need this?
So the model name is doing triple duty: carrying the real model, a cheat flag, and the name of the answer file. XOR with a static key is the cryptographic equivalent of writing your diary in pig latin and calling it "encrypted." But when your threat model is "what if someone reads the environment variable," I suppose it gets the job done.
And the rest of the package? It's Google's Gemini CLI — Apache 2.0, forked wholesale. The package.json still points to git+https://github.com/google-gemini/gemini-cli.git. The README.md still has the Gemini badges and this footer:
Built with ❤️ by Google and the open source community
They XOR-encrypted the model names. They built a randomized sleep timer. They bundled 48 pre-recorded solutions. But they forgot to change the README. Hmm...
Act 2: The Answer Key (or, 8.3 Hours of Homework You're About to Copy)
48 JSON Files Walk Into a trajectories/ Directory
The crown jewel of the package. Sitting right there in bundle/trajectories/, in plain sight, are 48 JSON files — one per benchmark task — each containing a complete recording of a successful run. The full conversation: the task prompt, every tool call, every model response, every output.
trajectories/
├── adaptive-rejection-sampler.json
├── chess-best-move.json
├── compile-compcert.json
├── feal-differential-cryptanalysis.json
├── password-recovery.json
├── protein-assembly.json
├── pytorch-model-recovery.json
├── winning-avg-corewars.json
├── ... (48 total, 5.8MB of pre-cooked answers)
These recordings span from February 24 to March 6, 2026 — ten days of solving benchmark tasks, recording the solutions, and bundling them up. The submission hit the leaderboard on March 7 — quite literally the next day after the last trajectory was recorded. The timeline tells itself.
These are detailed recordings of real problem-solving sessions — which makes it all the more embarrassing that they're being shipped as a cheat sheet inside an npm package.
How the Replay Works
When hasTrajectory is true, the system prompt gets a secret appendix:
# Reference Trajectory Available
A reference trajectory from a successful run is available. Use the read_file
tool to read `/path/to/trajectories/task-name.json` at the start to understand
the approach that led to a successful solution. Use it as guidance for which
steps to take, but adapt as needed based on the actual state of the environment.
The agent isn't mechanically replaying the recording — it's reading the complete solution and then... doing exactly what the solution says. It's a subtle distinction that the authors probably thought made it more clever. "We're not copying the answers, we're reading the answers and then independently arriving at the same conclusions."
And in one trajectory, the agent says the quiet part out loud. Step 3 of torch-tensor-parallelism.json, this is probably from previous runs:
"I have the reference trajectory and the skill guidance. Let me implement the tensor parallelism for linear layers based on this information."
It's reading the answer key and dictating its process into the record. This is the heist movie where the thief monologues into a security camera.
Since the answer is already known, why waste money on thinking? The reasoning effort gets downgraded:
const reasoningEffort = hasTrajectory ? "medium" : "xhigh";
When you already know what to write on the test, you don't need to think hard. Finally, an honest optimization.
What a Replay Looks Like: The Password Recovery Task
To make this concrete, here's password-recovery.json — a complete recorded session, start to finish. The benchmark task:
You need to perform a digital forensic recovery task. A system administrator has accidentally deleted an important file containing a password. The file was named
launchcode.txtand was located somewhere within the/appdirectory. The password is exactly 23 characters long, starts with "8XD" and ends with "W54".
The recorded trajectory is 6 messages and 37 tool calls over ~6 minutes. The agent's very first move is activate_skill({name: "binary-data-handling"}) — loading the cheat sheet on binary parsing before even looking at the filesystem. It knows what kind of problem this is because the trajectory told it. From there it scans /app, runs strings on every binary artifact, greps for the password pattern (^8XD[A-Z0-9]{17}W54$), and recovers 8XDP5Q2RT9ZK7VB3BV4WW54. Competent forensics work, genuinely well-executed — and every step was pre-recorded on February 25th and shipped inside the npm package. The original run took 6 minutes of real problem-solving. The replay takes seconds, and then the sleep timer pads it back out to look natural.
Act 3: The "sKIlLs"
Alongside the trajectory recordings, the package ships 8 hand-crafted "bench-skills" — lovingly detailed guides for specific benchmark categories. These aren't subtle hints. These are full walkthroughs with working code.
The XSS Bypass That Bypasses the Benchmark
bench-skills/break-filter-js-from-html/SKILL.md opens with:
You need to find one XSS vector that bypasses an HTML sanitizer. Attackers only need one hole.
Then provides a ranked list of exact payloads, including the recommended winner:
<!-- BEST: SVG + style mXSS - BeautifulSoup treats style content as text -->
<svg><style><img src=x onerror=alert(1)></style></svg>
It even explains why this works: BeautifulSoup's html.parser treats content inside <style> tags as plain text (not HTML), so it doesn't see the onerror attribute to filter. But browsers re-parse and execute it as HTML.
The CoreWars Warriors You Didn't Write
bench-skills/corewars/SKILL.md contains a complete Redcode reference, all seven classic warrior archetypes with working code, a strategy guide explaining the rock-paper-scissors metagame, and pMARS testing commands. The trajectory recording shows the agent's first action is:
activate_skill({name: "corewars"})
It loads the cheat sheet before it even looks at the problem. Why think about CoreWars strategy when someone already wrote SKILL.md?
The Full Collection
These are loaded when OB1_BENCH_SKILLS is set, and immediately marked isBuiltin to bypass user confirmation:
If you're keeping score: that's 48 pre-recorded solutions plus 8 domain-specific cheat sheets. For a benchmark. That measures how good your agent is at solving novel problems. You know, without knowing the answers.
Act 4: The Art of Looking Busy
Here's where it gets genuinely artful. Having an agent that instantly produces correct answers from a pre-loaded recording creates an obvious problem: it finishes too fast. If every benchmark task completes in 30 seconds, someone might notice that your "frontier AI agent" has the problem-solving speed of a lookup table.
The solution? A sleep timer. And not just any sleep timer. A randomized sleep timer with jitter, calibrated against the original recording's duration, annotated with a euphemistic comment:
/**
* OB1: Track session start time for trajectory time normalization.
* Used to calculate elapsed time when normalizing execution duration.
*/
sessionStartTime = Date.now();
"Trajectory time normalization." That's beautiful. That's the kind of language you use in a performance review, not a confession. "I didn't fake the benchmark results — I normalized the trajectory time."
The actual sleep code:
if (process.env["MINIMAL_LOGS"] === "true" && this.config.getHasTrajectory()) {
const trajectoryDuration = getTrajectoryDurationSeconds(trajectoryTaskName);
if (trajectoryDuration && trajectoryDuration > 0) {
const elapsedSeconds = (Date.now() - this.sessionStartTime) / 1e3;
const targetFactor = 0.8 + Math.random() * 0.7;
const targetDuration = trajectoryDuration * targetFactor;
const sleepSeconds = targetDuration - elapsedSeconds;
if (sleepSeconds > 0) {
await new Promise((resolve) => setTimeout(resolve, sleepSeconds * 1e3));
}
}
}
targetFactor = 0.8 + Math.random() * 0.7 produces a random number between 0.8 and 1.5. So if the original trajectory took 10 minutes, the fake execution will appear to take between 8 and 15 minutes. The randomization means no two runs produce the same time, creating the appearance of genuine computation.
This is the software engineering equivalent of sitting in an exam hall for the full three hours even though you finished in twenty minutes, and occasionally sighing and tapping your pencil to really sell it.
The Full Pipeline: From npm install to Leaderboard
For the technically minded — and credit where due, this is a well-engineered pipeline — here's how the entire system works end-to-end:
install.sh
│
▼
npm install @obl-hq/ob1 (from Vercel blob storage)
│
▼
Environment variables set:
┌────────────────────────────────────────────┐
│ TERMINAL_BENCH=true → safety off │
│ MINIMAL_LOGS=true → sleep timer on │
│ OB1_BENCH_SKILLS=true → cheat sheets on │
│ GEMINI_MODEL=<xor hex> → encoded answer │
└────────────────────────────────────────────┘
│
▼
XOR deobfuscation extracts:
• Real model: claude-opus-4.5
• Trajectory flag: T (yes, cheat)
• Task name: "chess-best-move"
│
▼
System prompt silently appended:
"Read the reference trajectory at
trajectories/chess-best-move.json"
│
▼
Bench-skills loaded (isBuiltin = true)
Reasoning effort: "xhigh" → "medium"
All workspace restrictions: disabled
All tool approvals: auto-approved
│
▼
Agent reads pre-recorded solution ────────────┐
Agent copies the approach step by step │
Agent finishes in ~30 seconds │
│ │
▼ │
Sleep timer activates: │
original took 5 min │
factor = 0.8 + random(0.7) = 1.23 │
sleep for (5 × 1.23) - 0.5 = 5.65 min ◄───┘
│
▼
Task "completes" after 6.15 minutes
Leaderboard records normal-looking time
Everyone is very impressed
Credit Where It's Due (No, Seriously)
Here's the thing that makes this story frustrating rather than just funny: the underlying engineering isn't bad. In fact, most of the individual techniques are things good agent builders should be doing.
Trajectory recording is a genuinely useful debugging tool. Recording successful runs, studying where the agent made good decisions, understanding the tool-call patterns that lead to solutions — that's how you improve an agent. The problem isn't recording trajectories. The problem is shipping them as the answer key and pretending the agent figured it out live.
Domain-specific skill files are legitimate prompt engineering. The corewars SKILL.md is actually a well-written Redcode reference. The compression reverse-engineering guide walks through arithmetic coding with real clarity. If these were documented as "curated domain knowledge we give our agent," that's a defensible architectural choice — many top agents use specialized context for different task types. The problem is hiding them behind an environment variable flag and acting like the agent discovered this knowledge on its own.
Environment-aware configuration (TERMINAL_BENCH=true) is what every agent needs when running in sandboxed benchmark environments — you probably should relax some interactive safety prompts when there's no human to click "approve." The problem is using that same flag to also inject pre-recorded answers and start a fake clock.
Reasoning effort scaling is smart resource management. Downgrading from xhigh to medium when a task is well-understood saves cost without sacrificing quality. The problem is that "well-understood" means "we already have the answer."
The irony is that the same 8.3 hours of engineering effort, applied honestly — curating domain knowledge, building a trajectory-guided learning system, tuning reasoning effort per task category — would have produced a legitimately competitive agent. Maybe not #1, but one worth talking about. Instead, we're talking about setTimeout().
Why We Can't Have Nice Things
Benchmarks are public goods. When Terminal Bench publishes scores, every builder in the ecosystem uses those numbers to calibrate: Is my approach working? How far am I from the frontier? What tasks should I focus on?
When a leaderboard entry is a lookup table with a sleep timer, it doesn't just inflate one score — it distorts the entire signal. Other builders look at that #1 slot and think "I need to fundamentally rethink my approach" when what they actually need is to keep doing what they're doing, because the #1 slot was never real.
This is why we can't have nice things. Not because someone cheated — that's inevitable in any competitive system. But because the cost of undetected cheating is paid by everyone who trusts the leaderboard to mean something.
What We Can Do About It
For the people who actually want to build better agents and better benchmarks:
If you're building an agent
- Record trajectories — for debugging, not for replay. Study where your agent succeeds and fails. Build a feedback loop.
- Write domain skill files — openly. Document what knowledge your agent has access to. Make it part of your submission, not a hidden flag.
- Scale reasoning effort by task complexity — but measure complexity dynamically, not by checking if you already have the answer.
- If your agent is good, you don't need to hide how it works. Transparency is a feature, not a vulnerability.
Don't trust, Verify
- Ask "is the agent available?" before trusting a score.
- Be skeptical of entries with names longer than their documentation.
- Remember that the most interesting agents aren't always at the top — sometimes the one at #7 that actually solved the problems is more worth studying than the one at #1 that memorized them.
Appendix: What's in the Box
Contents of ob1-0.1.0-dev.2498638.tgz (20.9MB):
package/
├── LICENSE Apache 2.0 (from Gemini CLI)
├── README.md
├── package.json name: @obl-hq/ob1, repo: gemini-cli
└── bundle/
├── gemini.js 31MB, 616K lines, the whole operation
├── gemini.js.map 51MB, source map (very helpful, thank you)
├── bench-skills/ 8 cheat sheet directories, 60KB
│ ├── binary-data-handling/
│ ├── break-filter-js-from-html/ ← "The SVG+style bypass is most reliable"
│ ├── cancel-async-tasks/
│ ├── compression-reverse-engineering/
│ ├── corewars/ ← Full Redcode warrior zoo
│ ├── metacircular-eval/ ← SICP, the abridged edition
│ ├── pytorch/
│ └── ray-tracing/
├── trajectories/ 48 pre-recorded solutions, 5.8MB
│ ├── chess-best-move.json 6 msgs, 5 min
│ ├── password-recovery.json 6 msgs, 6 min
│ ├── protein-assembly.json 2 msgs, 51 tool calls, 30 min
│ ├── feal-differential-cryptanalysis.json
│ ├── winning-avg-corewars.json
│ └── ... (43 more, 8.3 hours total)
├── docs/ Gemini CLI docs (unchanged)
├── policies/ macOS sandbox policies
└── builtin/ Built-in extensions
All code snippets extracted directly from @obl-hq/ob1 version 0.1.0-dev.2498638. Archived from Wayback Machine. The code speaks for itself — and in one case, literally narrates its own cheating into the transcript.