Test Until Green: Self-Healing Test Loops
A step-by-step guide to building a self-healing test loop in Claude Code — run the suite, read failures, fix, repeat until green, with a paste-ready kickoff prompt.
A test run is already a loop; you just run it manually. You execute the suite, read the red, change a file, run it again. A self-healing test loop hands those four steps to an agent: run the suite, read failures, fix the cause, repeat — until every test passes or a max-iterations cap fires. It's the simplest useful loop to build, which makes it the perfect first one. No CI, no remote, no PR — just your local suite and an agent that won't stop until it's green.
This is a hands-on tutorial. By the end you'll have a working "Test Until Green" loop, the exact kickoff prompt to paste, and the guardrails that keep the agent from "fixing" your tests by deleting them. If you want the conceptual grounding first, the companion piece on what loop engineering is covers the act → observe → decide → repeat skeleton this loop sits on.
pytest, npm test, go test ./.... Deterministic in, deterministic out.
xfail/.skip, or weakening assertions to force green.
Resist the urge to start with "fix my tests." Start with the finish line. For this loop the end-state is precise and checkable:
Goal: The entire test suite passes
Exit condition: Test runner exits 0 with 0 failures and 0 errors
Check command: <your test runner> e.g. pytest -q / npm test / go test ./...
Max iterations: 8
Anti-gaming: No deleting tests, no skip/xfail markers, no weakening
assertions, no commenting out failing cases
That little spec is the whole contract. Everything else in the loop is the agent honoring it. Writing the spec first is the discipline that separates a reliable loop from a hopeful one; it's the same closed-loop thinking described in "From one-shot to closed-loop thinking."
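To make the contract concrete, here is a minimal sketch of the same spec as a Python data structure. The `LoopSpec` class and its method names are hypothetical, invented for illustration; only the field values come from the spec above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopSpec:
    """Hypothetical container for the four-field loop contract."""

    goal: str
    check_command: list[str]
    max_iterations: int = 8
    anti_gaming: tuple[str, ...] = (
        "no deleting tests",
        "no skip/xfail markers",
        "no weakening assertions",
        "no commenting out failing cases",
    )

    def is_green(self, returncode: int) -> bool:
        # Exit condition: the test runner exits 0.
        return returncode == 0

    def should_stop(self, iteration: int, returncode: int) -> bool:
        # Stop on success, or when the iteration cap fires.
        return self.is_green(returncode) or iteration >= self.max_iterations


spec = LoopSpec(
    goal="The entire test suite passes",
    check_command=["pytest", "-q"],
)
```

Freezing the dataclass is a deliberate choice in this sketch: the contract should not change while the loop is running.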
The check command is the agent's source of truth. It must give the same answer for the same code, every time. Three guidelines:
pytest -q, npm test --silent) so the agent parses signal, not noise.
Now turn the spec into instructions. Here's a complete, paste-ready kickoff for a Python project; swap the check command for your stack:
# Kickoff: Test Until Green
claude "You are running the Test Until Green loop.
GOAL: Make the entire test suite pass.
LOOP (max 8 iterations):
1. Run the suite: pytest -q
2. If it exits 0 with no failures -> STOP and report 'all green'.
3. If anything failed:
   a. Read the failing test names and tracebacks.
   b. Find the ROOT CAUSE in the source, not the test.
   c. Make the smallest fix that addresses the cause.
4. Go back to step 1.
5. If you reach iteration 8 without all-green -> STOP and report which
   tests still fail and your best diagnosis.
GUARDRAILS (violating any of these is a failed run):
- Never delete a test or comment out a failing case.
- Never add @pytest.mark.skip, xfail, or .only/.skip to dodge a failure.
- Never weaken an assertion to make it pass.
- Fix the code under test, not the test, unless the test is provably wrong
  (and if so, explain why before changing it).
Self-pace. Don't ask for confirmation between iterations."
The structure is the four fields from Step 1 written as prose. Note step 3b — "find the root cause, not the test." Without it, the fastest path to green is to gut the test, which is exactly the failure mode the guardrails forbid.
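The control flow the prompt describes can be sketched as a small Python driver. This is an illustration of the loop's shape, not how Claude Code implements it: `run_until_green`, `run_check`, and the `fix` callback are hypothetical names, and `fix` stands in for whatever the agent does in step 3.

```python
import subprocess
from typing import Callable, Sequence


def run_check(command: Sequence[str]) -> tuple[int, str]:
    """Run the check command and return (exit code, combined output)."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def run_until_green(
    check: Callable[[], tuple[int, str]],
    fix: Callable[[str], None],
    max_iterations: int = 8,
) -> tuple[bool, int, str]:
    """Return (green, iterations used, last check output)."""
    output = ""
    for iteration in range(1, max_iterations + 1):
        returncode, output = check()
        if returncode == 0:
            return True, iteration, output  # step 2: all green, stop
        if iteration == max_iterations:
            break  # step 5: cap reached, report instead of editing again
        fix(output)  # step 3: hypothetical stand-in for the agent's edit
    return False, max_iterations, output
```

One design choice in this sketch: on the final iteration it reports the failure rather than applying one more fix that would never be verified. A real run might wire it up as `run_until_green(lambda: run_check(["pytest", "-q"]), fix=my_agent_step)`, where `my_agent_step` is again a placeholder.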
Kick it off and observe. A healthy run looks like this:
iter 1: pytest -q -> 3 failed (reads tracebacks, fixes null-check in parser)
iter 2: pytest -q -> 1 failed (off-by-one in pagination, fixes)
iter 3: pytest -q -> all passed -> STOP, reports 'all green'
Three iterations, no human in the loop. If instead you see the same test fail at iterations 1, 2, and 3 with the agent making different edits each time, that's the tell-tale signature of a flaky or non-deterministic check — stop, quarantine that test, and restart.
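If you log which tests fail on each iteration, that signature is easy to spot mechanically. The sketch below assumes you already have the failing test names per iteration as sets; `stuck_tests` and the three-iteration window are hypothetical choices, not part of any tool.

```python
def stuck_tests(failures_per_iteration: list[set[str]], window: int = 3) -> set[str]:
    """Return tests that failed in each of the last `window` iterations."""
    if len(failures_per_iteration) < window:
        return set()  # not enough history to call anything stuck
    recent = failures_per_iteration[-window:]
    return set.intersection(*recent)


# A healthy run shrinks toward the empty set; a stuck run keeps one name.
healthy = [{"test_parse", "test_page", "test_auth"}, {"test_page"}, set()]
stuck = [{"test_sync", "test_page"}, {"test_sync"}, {"test_sync", "test_auth"}]
```

A test that shows up in `stuck_tests(stuck)` is a candidate for quarantine. It is only a heuristic: a genuinely hard bug produces the same pattern, so check whether the test passes and fails on unchanged code before blaming flakiness.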
Test Until Green and Ship PR Until Green are siblings built on the same skeleton. The difference is scope.
| Dimension | Test Until Green | Ship PR Until Green |
|---|---|---|
| Scope | Local test suite | Full PR + remote CI |
| Check command | pytest / npm test / go test | gh pr checks |
| Needs git remote? | No | Yes |
| Best first loop? | Yes — fewest moving parts | After you trust the test loop |
| Typical iterations | 2–4 | 2–5 |
| Exit condition | Suite passes locally | All CI checks pass |
Start with Test Until Green precisely because it has the fewest moving parts: no remote, no PR, no CI permissions. Once you trust it, Ship PR Until Green is the same idea with the check command swapped to gh pr checks and the scope widened to the whole pipeline.
You don't have to assemble this from scratch. The Loops channel ships /skills/test-until-green with the kickoff prompt, guardrails, and a tuned default cap. It's one of 9 agent loops you can steal today. For the underlying agent-loop primitives, Anthropic's agent loop docs are the canonical reference, and Geoffrey Huntley's Ralph technique shows the bare while-loop version of the same idea.
The anti-gaming guardrails in the kickoff prompt explicitly forbid deleting tests, adding skip/xfail markers, and weakening assertions. This is the most common way a naive test loop cheats, and it's why the prompt makes "fix the code, not the test" an explicit rule. Metric-gaming is a general loop hazard — more in exit conditions.
The loop won't work reliably on a suite with flaky tests until you quarantine the flakes. A non-deterministic check command means the agent never gets a stable yes/no, so it keeps editing until the cap fires. Exclude flaky tests from the loop's check command (and consider a dedicated kill flaky tests loop to fix them properly). A deterministic check is a precondition, not a nice-to-have.
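One cheap way to test that precondition before starting the loop is to run the check command several times on unchanged code and compare exit codes. The helper below is a sketch under that assumption; `looks_deterministic` is a hypothetical name, and the default of three runs is arbitrary.

```python
import subprocess
from typing import Sequence


def looks_deterministic(command: Sequence[str], runs: int = 3) -> bool:
    """Run the check command `runs` times and compare the exit codes."""
    exit_codes = set()
    for _ in range(runs):
        result = subprocess.run(list(command), capture_output=True, text=True)
        exit_codes.add(result.returncode)
    # Differing exit codes on unchanged code prove the check is flaky.
    return len(exit_codes) == 1
```

A call such as `looks_deterministic(["pytest", "-q"])` returning `False` is conclusive evidence of flakiness. `True` is weaker: it only means no disagreement showed up in that many runs, so a rarely failing test can still slip through.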
A cap of five to eight iterations suits most suites. Real fixes converge in two to four passes; the headroom covers a second-order failure that the first fix exposes. If you routinely hit the cap, the task is underspecified or the suite is flaky, and raising the cap won't help.
The loop doesn't replace TDD; it complements it. TDD is about writing the test first; this loop is about closing the gap to green automatically once tests exist. They stack well: write the failing test by hand, then let the loop drive the implementation to green.
The loop can also run unattended: wrap it in /schedule to fire nightly, as covered in /loop, /goal & /schedule. That's the basis of running Claude Code continuously while you sleep: a scheduled trigger kicking off a goal-shaped loop with nobody watching.
Browse 150+ ready-to-run agent loops in the Loops channel, or explore the full skill catalog at aiskill.market.