Flaky tests are the worst kind of bug because they hide from the one thing you do to find bugs: running the test once. A test that passes 19 times and fails on the 20th will sail through your local run, sail through the PR, and then redden CI at the worst possible moment — usually for someone else, on an unrelated change. The only reliable way to catch a flaky test is to run the suite enough times to make the ghost appear, and the only reliable way to do that without losing an afternoon is an agent loop.

This tutorial walks through the "kill flaky tests" loop: run the suite about 20 times, collect every intermittent failure, fix or quarantine the offenders, and refuse to stop until you see five consecutive fully-green runs. It is a textbook example of loop engineering — a verifiable exit condition, a max-iterations cap, and a check command that grounds every decision in observed reality.

Key Takeaways

Flakiness is a statistical property, so you find it statistically: run the suite ~20 times rather than once, because a single run will usually miss an intermittent failure.
The exit condition is "5 consecutive fully-green runs," not "it passed once." A streak of green runs is far stronger evidence than a single pass that the flake is gone, not just hiding.
Quarantine when the cause is not obvious. Skipping or tagging a known-flaky test with a tracking reference unblocks CI immediately while you diagnose the root cause without pressure.
This is a self-healing test-until-green loop with anti-gaming guardrails so the agent can't "win" by deleting the failing test.
The ready-made recipe lives at /skills/kill-flaky-tests with a paste-ready kickoff prompt, exit condition, and iteration cap.

The flaky-test loop exits only after several consecutive fully-green runs

Why does running the suite once never catch a flake?

A flaky test does not fail; it fails sometimes. If a test has a 5% failure rate, a single run has a 95% chance of looking perfectly healthy. That is why flakes survive code review and local testing: one green run tells you almost nothing about the underlying instability.

The fix is to change the unit of observation from "one run" to "many runs." Run the suite 20 times and, assuming the runs are independent, that same 5%-flaky test fails at least once with about 64% probability (1 - 0.95^20 ≈ 0.64): high enough to surface most of the time, and a reminder that even 20 runs can miss it. This is the observe step of the act-observe-decide-repeat cycle: you cannot make a good decision about a flake until you have collected enough evidence that it exists.
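The 64% figure can be checked with a few lines of Python. This is a minimal sketch that assumes every run is independent and that the test has a fixed per-run failure rate; real flakes are often correlated with load or ordering, so treat the result as a rough guide. The function name is illustrative.

```python
def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test fails at least once in `runs` independent runs.

    Assumes a constant per-run failure rate, which is a simplification.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # P(at least one failure) = 1 - P(every run passes)
    return 1.0 - (1.0 - failure_rate) ** runs
```

Calling `detection_probability(0.05, 1)` gives the 5% chance a single run has of catching the flake, and `detection_probability(0.05, 20)` gives roughly 0.64.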

What does the kill-flaky-tests loop look like?

The loop has a tight, repeatable structure. Each phase maps onto a stage of the cycle:

Detection (observe). Run the full suite ~20 times. Record every test that fails in any run, even once. Those are your suspects.
Triage (decide). For each flaky test, decide: quarantine now and fix later, or fix immediately if the cause is obvious (a hard-coded timeout, an unmocked clock, a shared-state leak between tests).
Action (act). Quarantine by tagging or skipping with a tracking reference, or apply the fix.
Verification (observe again). Re-run the suite. Keep going until you hit the exit condition.
Exit. Stop only after five consecutive fully-green runs. Anything less and the loop continues.
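The detection phase above can be sketched as a simple tally. The snippet assumes a hypothetical `run_suite` callable that executes the suite once and returns the names of the tests that failed; how you obtain those names depends on your test runner and is not shown here.

```python
from collections import Counter
from typing import Callable, Iterable


def collect_suspects(run_suite: Callable[[], Iterable[str]], runs: int = 20) -> Counter:
    """Run the suite `runs` times and count how often each test failed.

    `run_suite` is a placeholder for whatever invokes your runner and
    parses its report into a list of failing test names.
    """
    failures: Counter = Counter()
    for _ in range(runs):
        # A test counts at most once per run, even if reported twice.
        failures.update(set(run_suite()))
    return failures
```

Any test with a count between 1 and `runs - 1` failed intermittently and is a suspect; a test that failed in every run is more likely plain broken than flaky.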

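The "five consecutive" rule is the heart of it. One green run after a fix means little, because the flake could simply be hiding again. Five green runs in a row is much stronger evidence that the instability is resolved, though not proof: a test that still fails 5% of the time passes five runs in a row about 77% of the time, so lengthen the streak when you suspect a rare flake.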
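The exit rule and the iteration cap fit in one small function. This is a sketch under assumptions: `check` is a hypothetical callable that runs the suite once and returns `True` when it is fully green, and the default cap of 50 iterations is an illustrative value, not one taken from the recipe.

```python
from typing import Callable


def run_until_stable(
    check: Callable[[], bool],
    required_streak: int = 5,
    max_iterations: int = 50,  # assumed cap; pick one that fits your suite
) -> bool:
    """Return True once `check` passes `required_streak` times in a row.

    Returns False if the cap is reached first, so the loop cannot spin forever.
    """
    streak = 0
    for _ in range(max_iterations):
        if check():
            streak += 1
            if streak >= required_streak:
                return True
        else:
            # A single red run resets the count: the greens must be consecutive.
            streak = 0
    return False
```

Resetting the streak on any failure is what separates "five consecutive green runs" from "five green runs in total."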

What are the most common causes worth fixing on the spot?

Most flakes trace back to a small set of root causes. When the loop surfaces a suspect, check these first:

Cause	Tell-tale sign	Quick fix
Timing / race	Fails under load or in parallel	`await` the real condition, not a sleep
Shared state	Fails only when run after another test	Isolate or reset state per test
Unmocked clock	Fails near midnight or month boundaries	Inject and freeze time
Network / external	Fails offline or under latency	Mock the dependency
Test ordering	Fails only in a specific sequence	Remove inter-test dependencies

If the cause is one of these and obvious, fix it in the action step. If it is murky, quarantine first — a parked flaky test that no longer reddens CI buys you the time to diagnose properly.
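Two of the quick fixes in the table can be illustrated in Python. The first helper is a synchronous stand-in for "await the real condition, not a sleep": it polls a condition until a deadline instead of sleeping for a fixed time. The second shows clock injection, where the code under test receives its clock as a parameter so a test can freeze it. All names here (`wait_until`, `report_date`) are hypothetical examples, not part of any library.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 2.0, interval: float = 0.01) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check so a condition that became true at the deadline still counts.
    return condition()


def report_date(now: Callable[[], datetime]) -> str:
    """Format today's date using an injected clock rather than the system clock."""
    return now().strftime("%Y-%m-%d")


def test_report_date_at_month_boundary() -> None:
    # The clock is frozen one second before midnight on the last day of the month.
    frozen = datetime(2024, 1, 31, 23, 59, 59, tzinfo=timezone.utc)
    assert report_date(lambda: frozen) == "2024-01-31"
```

Because the test supplies its own clock, it produces the same result whether it runs at noon or one second before midnight.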

How do you stop the agent from cheating the loop?

Here is the failure mode that ruins naive flaky-test loops: the fastest way to make a failing test pass is to delete it. An unguarded agent told "get to green" will absolutely do that, and you will have "fixed" the flake by destroying the coverage. This is exactly the anti-gaming problem at the center of every exit condition.

Guardrails that keep the loop honest:

Protected test files the agent cannot delete or empty.
Quarantine requires a tracking marker (a skip with an issue reference), not silent removal.
A coverage floor so the suite can't shrink its way to green.
A diff review gate before merge, so a human sees what was skipped versus fixed.

With those in place, the only paths to five consecutive green runs are genuinely stabilizing the tests or quarantining them in a way a reviewer can see.

How do you run it in Claude Code?

The /skills/kill-flaky-tests recipe ships everything the loop needs: a goal, an exit condition (5 consecutive green runs), a max-iterations cap, a check command (your test runner invoked in a 20× loop), and a paste-ready kickoff prompt. You can drive it with Claude Code's native /loop primitive, the same way you'd run a test-until-green loop, pointing the check command at your suite.
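A check command that invokes the test runner repeatedly could look like the following. This is a generic sketch, not the script the recipe ships: `run_check` is a hypothetical name, and the command you pass in (for example your usual test invocation) is up to you.

```python
import subprocess
from typing import Sequence


def run_check(command: Sequence[str], runs: int = 20) -> int:
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failed = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # output is discarded in this sketch
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed += 1
    return failed
```

A wrapper script can exit with a non-zero status whenever `run_check` returns more than zero, which gives the loop a single pass/fail signal to act on. In practice you would keep the runner's output rather than discard it, since the failing test names are what triage needs.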

For the underlying loop mechanics, the Agent SDK agent-loop guide documents how the agent observes the check result and decides whether to continue. The recipe is adapted from the awesome-agent-loops collection (CC-BY) and lives alongside 150+ others in the Loops channel.

Frequently Asked Questions

Why 20 runs and not 5 or 100?

Twenty is a practical balance: enough repetitions to surface most flakes with reasonable probability, few enough to run in a sensible time. Rare flakes (well under 5% failure rate) may need more runs, so treat 20 as a floor you can raise.
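To decide how far to raise that floor, you can invert the detection formula. The sketch below assumes independent runs with a constant failure rate, and the 95% default target is an illustrative choice rather than a value from the recipe.

```python
import math


def runs_needed(failure_rate: float, target_probability: float = 0.95) -> int:
    """Smallest number of runs that surfaces the flake with the target probability.

    Solves 1 - (1 - p)^n >= target for n, assuming independent runs.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_probability) / math.log(1.0 - failure_rate))
```

Under these assumptions a 5% flake needs 59 runs for a 95% chance of showing up at least once, and a 1% flake needs 299, which is why rare flakes call for far more than 20 runs.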

Why five consecutive green runs to exit, not just one?

A single green run after a fix can be the flake hiding rather than being gone. Five consecutive green runs is much stronger evidence that the instability is resolved, which is why the loop uses it as the exit condition. Treat five as a minimum and raise it for flakes that fail rarely.

Is quarantining a test the same as ignoring the problem?

No — quarantine with a tracking marker keeps CI honest while preserving a record that the test needs a real fix. The danger is silent deletion, which the loop's guardrails prevent.

Can this run in CI instead of locally?

Yes. The loop is just a check command run repeatedly, so it fits a scheduled CI job. Many teams run a nightly flaky-test sweep and quarantine offenders before the morning push.

What if the flake only appears in CI, never locally?

Run the loop in an environment that matches CI — same parallelism, same resource limits. Flakes driven by timing or shared state often only appear under the contention CI creates.

Browse 150+ ready-to-run agent loops in the Loops channel, or explore the full skill catalog at aiskill.market.
