A test run is already a loop; you just run it manually. You execute the suite, read the red, change a file, run it again. A self-healing test loop hands those four steps to an agent: run the suite, read failures, fix the cause, repeat — until every test passes or a max-iterations cap fires. It's the simplest useful loop to build, which makes it the perfect first one. No CI, no remote, no PR — just your local suite and an agent that won't stop until it's green.

This is a hands-on tutorial. By the end you'll have a working "Test Until Green" loop, the exact kickoff prompt to paste, and the guardrails that keep the agent from "fixing" your tests by deleting them. If you want the conceptual grounding first, the article on what loop engineering is covers the act → observe → decide → repeat skeleton this loop sits on.

Key Takeaways

A test loop is goal-shaped: the end-state is "the full suite passes," which a command can verify on every pass. That makes it a textbook exit condition.
The check command is your test runner — pytest, npm test, go test ./.... Deterministic in, deterministic out.
Anti-gaming guardrails are non-negotiable: forbid deleting tests, skipping with xfail/.skip, or weakening assertions to force green.
Cap iterations at 5–8. Genuine fixes converge fast; thrashing past the cap is a signal to take over.
The ready-made version lives at /skills/test-until-green with the kickoff prompt below.


Step 1: Define the End-State Before You Write a Prompt

Resist the urge to start with "fix my tests." Start with the finish line. For this loop the end-state is precise and checkable:

Goal:            The entire test suite passes
Exit condition:  Test runner exits 0 with 0 failures and 0 errors
Check command:   <your test runner>   e.g. pytest -q  /  npm test  /  go test ./...
Max iterations:  8
Anti-gaming:     No deleting tests, no skip/xfail markers, no weakening
                 assertions, no commenting out failing cases

That little spec is the whole contract. Everything else in the loop is the agent honoring it. Writing the spec first is the discipline that separates a reliable loop from a hopeful one; it's the same closed-loop thinking covered in the article "from one-shot to closed-loop thinking."
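If you drive the loop from a script rather than a pasted prompt, the spec above can be held in a small data structure. This is a minimal sketch; `LoopSpec` and its method names are hypothetical, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopSpec:
    """Hypothetical container for the five fields of the spec above."""

    goal: str
    check_command: list[str]  # argv for the test runner
    max_iterations: int = 8
    forbidden: tuple[str, ...] = (
        "delete tests",
        "add skip/xfail markers",
        "weaken assertions",
        "comment out failing cases",
    )

    def is_done(self, returncode: int) -> bool:
        # Exit condition: the runner exited 0. pytest returns 0 only when
        # all collected tests passed.
        return returncode == 0

    def out_of_budget(self, iteration: int) -> bool:
        # `iteration` is 1-based; the cap fires once it has been exceeded.
        return iteration > self.max_iterations


spec = LoopSpec(
    goal="The entire test suite passes",
    check_command=["pytest", "-q"],
)
```

Keeping the exit condition and the cap as methods on the spec means the loop driver never has to re-decide what "done" means.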

Step 2: Pick a Deterministic Check Command

The check command is the agent's source of truth. It must give the same answer for the same code, every time. Three guidelines:

Run the whole suite, not a subset. Fixing one test can break another; a partial check hides regressions.
Use quiet, machine-friendly output (pytest -q, npm test --silent) so the agent parses signal, not noise.
Quarantine known-flaky tests out of the loop's check. A non-deterministic check is the single biggest cause of loops that never stop (the article on why your agent loops forever explains the mechanism). If a test flakes, the agent will keep "fixing" code that's already correct.
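Before handing a check command to an agent, it can help to run it a few times on unchanged code and confirm the answer does not move. The sketch below assumes the exit code is the signal you care about; `run_check` and `is_deterministic` are hypothetical helper names. Matching exit codes over a few runs is only weak evidence of determinism, so treat a pass as "not obviously flaky" rather than a guarantee.

```python
import subprocess


def run_check(command: list[str], timeout: float = 600.0) -> tuple[int, str]:
    """Run the check command once; return (exit code, combined output)."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout for one stream
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def is_deterministic(command: list[str], runs: int = 3) -> bool:
    """True if `runs` back-to-back executions all return the same exit code."""
    codes = {run_check(command)[0] for _ in range(runs)}
    return len(codes) == 1


# Usage (not executed here): is_deterministic(["pytest", "-q"])
```

If this returns False on code nobody has touched, fix or quarantine the offending tests before starting the loop.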

Step 3: Write the Kickoff Prompt

Now turn the spec into instructions. Here's a complete, paste-ready kickoff for a Python project — swap the check command for your stack:

# Kickoff: Test Until Green
claude "You are running the Test Until Green loop.

GOAL: Make the entire test suite pass.

LOOP (max 8 iterations):
  1. Run the suite:  pytest -q
  2. If it exits 0 with no failures -> STOP and report 'all green'.
  3. If anything failed:
       a. Read the failing test names and tracebacks.
       b. Find the ROOT CAUSE in the source, not the test.
       c. Make the smallest fix that addresses the cause.
  4. Go back to step 1.
  5. If you reach iteration 8 without all-green -> STOP and report which
     tests still fail and your best diagnosis.

GUARDRAILS (violating any of these is a failed run):
  - Never delete a test or comment out a failing case.
  - Never add @pytest.mark.skip, xfail, or .only/.skip to dodge a failure.
  - Never weaken an assertion to make it pass.
  - Fix the code under test, not the test, unless the test is provably wrong
    (and if so, explain why before changing it).

Self-pace. Don't ask for confirmation between iterations."

The structure is the five fields from Step 1 written as prose. Note step 3b: "Find the ROOT CAUSE in the source, not the test." Without it, the fastest path to green is to gut the test, which is exactly the failure mode the guardrails forbid.

Step 4: Run It and Watch the Loop Breathe

Kick it off and observe. A healthy run looks like this:

iter 1: pytest -q -> 3 failed   (reads tracebacks, fixes null-check in parser)
iter 2: pytest -q -> 1 failed   (off-by-one in pagination, fixes)
iter 3: pytest -q -> all passed -> STOP, reports 'all green'

Three iterations, no human in the loop. If instead you see the same test fail at iterations 1, 2, and 3 with the agent making different edits each time, the agent is thrashing. That usually points to a flaky or non-deterministic check: stop, quarantine that test, and restart.
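The "same test keeps failing" pattern can be spotted mechanically by recording the set of failing test names per iteration. This sketch assumes pytest prints short summary lines of the form `FAILED <test id> - <reason>`, which is its usual default but depends on your version and flags; both function names are hypothetical, and test ids containing spaces would be truncated by this simple parser.

```python
def failed_names(pytest_output: str) -> set[str]:
    """Collect test ids from 'FAILED <id> - <reason>' summary lines."""
    return {
        line.split()[1]
        for line in pytest_output.splitlines()
        if line.startswith("FAILED ") and len(line.split()) > 1
    }


def stuck_tests(history: list[set[str]], window: int = 3) -> set[str]:
    """Tests that failed in every one of the last `window` iterations."""
    if window < 1 or len(history) < window:
        return set()
    return set.intersection(*history[-window:])
```

Append `failed_names(output)` to `history` after each run; a non-empty result from `stuck_tests(history)` is a reasonable trigger to stop the loop and look at those tests by hand.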

How Does This Compare to Ship PR Until Green?

Test Until Green and Ship PR Until Green are siblings built on the same skeleton. The difference is scope.

| Dimension | Test Until Green | Ship PR Until Green |
| --- | --- | --- |
| Scope | Local test suite | Full PR + remote CI |
| Check command | `pytest` / `npm test` / `go test` | `gh pr checks` |
| Needs git remote? | No | Yes |
| Best first loop? | Yes, fewest moving parts | After you trust the test loop |
| Typical iterations | 2–4 | 2–5 |
| Exit condition | Suite passes locally | All CI checks pass |

Start with Test Until Green precisely because it has the fewest moving parts: no remote, no PR, no CI permissions. Once you trust it, Ship PR Until Green is the same idea with the check command swapped to gh pr checks and the scope widened to the whole pipeline.

Step 5: Use the Ready-Made Recipe

You don't have to assemble this from scratch. The Loops channel ships /skills/test-until-green with the kickoff prompt, guardrails, and a tuned default cap. It's one of 9 agent loops you can steal today. For the underlying agent-loop primitives, Anthropic's agent loop docs are the canonical reference, and Geoffrey Huntley's Ralph technique shows the bare while-loop version of the same idea.
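For orientation, here is what a bare capped loop might look like when written out by hand. This is a sketch under stated assumptions, not the contents of /skills/test-until-green and not Huntley's script: `check` is any callable returning an exit code and output, and `fix` is a hypothetical hook where you would hand the failure output to your agent.

```python
from typing import Callable

CheckFn = Callable[[], tuple[int, str]]  # returns (exit code, output)
FixFn = Callable[[str], None]            # receives the failure output


def run_until_green(
    check: CheckFn, fix: FixFn, max_iterations: int = 8
) -> tuple[bool, int]:
    """Loop check -> fix until green or the cap fires.

    Returns (went_green, number_of_fix_attempts).
    """
    fixes = 0
    while True:
        returncode, output = check()
        if returncode == 0:
            return True, fixes       # exit condition met
        if fixes == max_iterations:
            return False, fixes      # cap fired; hand back to a human
        fix(output)                  # hypothetical agent hook
        fixes += 1
```

The check always runs after the last fix, so the loop never reports green on the strength of an edit it has not verified.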

Frequently Asked Questions

What stops the agent from deleting a failing test to "pass"?

The anti-gaming guardrails in the kickoff prompt explicitly forbid deleting tests, adding skip/xfail markers, and weakening assertions. This is the most common way a naive test loop cheats, and it's why the prompt makes "fix the code, not the test" an explicit rule. Metric-gaming is a general loop hazard; the article on exit conditions covers it in more depth.

My suite has flaky tests. Will this work?

Not reliably until you quarantine the flakes. A non-deterministic check command means the agent never gets a stable yes/no, so it keeps editing until the cap fires. Mark flaky tests out of the loop's check command (and consider a dedicated "kill flaky tests" loop to fix them properly). A deterministic check is a precondition, not a nice-to-have.
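With pytest, one way to keep flaky tests out of the check is a custom marker deselected with `-m`. The sketch below builds such a command; the marker name `quarantine` and the helper name are assumptions for illustration, and you would still need to decorate the flaky tests with `@pytest.mark.quarantine` and register the marker in your pytest configuration.

```python
from typing import Optional


def build_check_command(quarantine_marker: Optional[str] = "quarantine") -> list[str]:
    """Return a pytest argv that skips tests carrying the given marker."""
    command = ["pytest", "-q"]
    if quarantine_marker:
        # -m takes a marker expression; "not <marker>" deselects marked tests.
        command += ["-m", f"not {quarantine_marker}"]
    return command
```

Deselecting through the check command keeps the quarantine visible in one place, and it avoids adding skip markers, which the guardrails forbid the agent from doing on its own.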

How many iterations should I allow?

Five to eight for most suites. Real fixes converge in two to four passes; the headroom covers a second-order failure that the first fix exposes. If you routinely hit the cap, the task is underspecified or the suite is flaky — not a reason to raise the cap.

Does this replace test-driven development?

No, it complements it. TDD is about writing the test first; this loop is about closing the gap to green automatically once tests exist. They stack well — write the failing test by hand, then let the loop drive the implementation to green.

Can I run this on a schedule overnight?

Yes — wrap it in /schedule to fire nightly, as covered in /loop, /goal & /schedule. That's the basis of running continuous Claude Code while you sleep: a scheduled trigger kicking off a goal-shaped loop unattended.

Browse 150+ ready-to-run agent loops in the Loops channel, or explore the full skill catalog at aiskill.market.
