Kill Flaky Tests With One Agent Loop
A repeatable agent loop that runs your test suite 20 times, isolates intermittent failures, quarantines or fixes them, and won't stop until 5 consecutive green runs.
Flaky tests are the worst kind of bug because they hide from the one thing you do to find bugs: running the test once. A test that passes 19 times and fails on the 20th will sail through your local run, sail through the PR, and then redden CI at the worst possible moment — usually for someone else, on an unrelated change. The only reliable way to catch a flaky test is to run the suite enough times to make the ghost appear, and the only reliable way to do that without losing an afternoon is an agent loop.
This tutorial walks through the "kill flaky tests" loop: run the suite about 20 times, collect every intermittent failure, fix or quarantine the offenders, and refuse to stop until you see five consecutive fully-green runs. It is a textbook example of loop engineering — a verifiable exit condition, a max-iterations cap, and a check command that grounds every decision in observed reality.
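The "collect every intermittent failure" step can be sketched as a small aggregation over repeated runs. This is a minimal illustration, not the recipe's own code: the `classify_tests` name and the input shape (one dict of test name to pass/fail per run) are assumptions made for the example.

```python
from collections import defaultdict


def classify_tests(runs):
    """Bucket tests by how they behaved across repeated suite runs.

    `runs` is a list with one dict per suite run, mapping a test name to
    True (passed) or False (failed). This input shape is an assumption.
    """
    outcomes = defaultdict(list)
    for run in runs:
        for name, passed in run.items():
            outcomes[name].append(passed)

    report = {"stable_pass": [], "stable_fail": [], "flaky": []}
    for name, results in sorted(outcomes.items()):
        if all(results):
            report["stable_pass"].append(name)
        elif not any(results):
            # Fails every time: a real, deterministic bug rather than a flake.
            report["stable_fail"].append(name)
        else:
            # Mixed results across identical runs: an intermittent failure.
            report["flaky"].append(name)
    return report
```

Separating "fails every time" from "fails sometimes" matters because only the second group is a candidate for quarantine; a test that always fails is an ordinary bug.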
A flaky test does not fail — it fails sometimes. If a test has a 5% failure rate, a single run has a 95% chance of looking perfectly healthy. That is why flakes survive code review and local testing: the dishonesty of one green run masks the underlying instability.
The fix is to change the unit of observation from "one run" to "many runs." Run the suite 20 times and that same 5%-flaky test fails at least once with about 64% probability (1 − 0.95^20 ≈ 0.64), against 5% for a single run. This is the observe step of the act-observe-decide-repeat cycle: you cannot make a good decision about a flake until you have collected enough evidence that it exists.
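The arithmetic behind that figure is short enough to check directly. The sketch below assumes each run fails independently with the same probability, which real flakes only approximate; `detection_probability` is a name chosen for this example.

```python
def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test with the given per-run failure rate fails at
    least once in `runs` independent runs: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, round(detection_probability(0.05, n), 2))
```

For a 5% flake this gives roughly 0.05 for one run and 0.64 for twenty.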
The loop has a tight, repeatable structure. Each phase maps onto a stage of the cycle:
The "five consecutive" rule is the heart of it. One green run after a fix means nothing — the flake could simply be hiding again. Five green runs in a row is strong evidence the instability is genuinely resolved.
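A minimal sketch of the exit condition, assuming the check is any callable that returns True for a fully green run. The names `run_suite` and `loop_until_stable` and the default cap of 50 iterations are illustrative assumptions, not values taken from the recipe.

```python
import subprocess
from typing import Callable


def run_suite(command: list[str]) -> bool:
    """Run the test command once; True means the run was fully green."""
    return subprocess.run(command, capture_output=True).returncode == 0


def loop_until_stable(
    check: Callable[[], bool],
    required_green: int = 5,
    max_iterations: int = 50,  # assumed cap; choose one that fits your budget
) -> bool:
    """Return True once `required_green` consecutive checks pass,
    or False if the iteration cap is reached first."""
    streak = 0
    for _ in range(max_iterations):
        if check():
            streak += 1
            if streak >= required_green:
                return True
        else:
            # Any red run resets the streak. This is the point where the
            # agent would diagnose, then fix or quarantine the offender.
            streak = 0
    return False
```

It could be driven with something like `loop_until_stable(lambda: run_suite(["pytest", "-q"]))`, where the pytest command stands in for whatever test runner you use. The reset on failure is what makes the greens consecutive rather than cumulative.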
Most flakes trace back to a small set of root causes. When the loop surfaces a suspect, check these first:
| Cause | Tell-tale sign | Quick fix |
|---|---|---|
| Timing / race | Fails under load or in parallel | Await the real condition, not a sleep |
| Shared state | Fails only when run after another test | Isolate or reset state per test |
| Unmocked clock | Fails near midnight or month boundaries | Inject and freeze time |
| Network / external | Fails offline or under latency | Mock the dependency |
| Test ordering | Fails only in a specific sequence | Remove inter-test dependencies |
If the cause is one of these and obvious, fix it in the action step. If it is murky, quarantine first — a parked flaky test that no longer reddens CI buys you the time to diagnose properly.
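As one illustration of the "inject and freeze time" fix from the table, the sketch below passes the clock in as a parameter so a test can pin it to a fixed instant. The function `token_is_expired` and its scenario are hypothetical, invented for this example.

```python
from datetime import datetime, timedelta
from typing import Callable


def token_is_expired(
    issued_at: datetime,
    ttl: timedelta,
    now: Callable[[], datetime] = datetime.now,  # injectable clock
) -> bool:
    """True once at least `ttl` has elapsed since `issued_at`."""
    return now() - issued_at >= ttl


def test_token_expiry_with_frozen_clock():
    # Freeze time one second before a month boundary, the kind of moment
    # where tests that read the real clock tend to misbehave.
    frozen = datetime(2024, 1, 31, 23, 59, 59)
    issued = frozen - timedelta(hours=1)
    assert token_is_expired(issued, timedelta(hours=1), now=lambda: frozen)
    assert not token_is_expired(
        issued, timedelta(hours=1, seconds=1), now=lambda: frozen
    )
```

Because the test supplies its own `now`, the result no longer depends on when the suite happens to run.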
Here is the failure mode that ruins naive flaky-test loops: the fastest way to make a failing test pass is to delete it. An unguarded agent told "get to green" will absolutely do that, and you will have "fixed" the flake by destroying the coverage. This is exactly the anti-gaming problem at the center of every exit condition.
Guardrails that keep the loop honest:
With those in place, the only path to five-consecutive-green is genuinely stabilizing the tests.
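One way to guard against the delete-the-test shortcut is to compare the set of collected test names before and after the agent acts, and to reject any green result that was reached by removing tests. This is a sketch of that idea under assumed names (`removed_tests`, `guardrail_passes`); how you collect the test names depends on your runner.

```python
def removed_tests(before, after):
    """Names of tests that existed before the agent's change but not after."""
    return sorted(set(before) - set(after))


def guardrail_passes(before, after, suite_green: bool) -> bool:
    """Accept a green run only if no test disappeared along the way.

    Quarantined tests are expected to stay in the collected set (skipped,
    not deleted), so they do not trip this check.
    """
    return suite_green and not removed_tests(before, after)
```

A green run that fails this check should count as red for the purpose of the five-consecutive rule.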
The /skills/kill-flaky-tests recipe ships everything the loop needs: a goal, an exit condition (5 consecutive green runs), a max-iterations cap, a check command (your test runner invoked in a 20× loop), and a paste-ready kickoff prompt. You can drive it with Claude Code's native /loop primitive, the same way you'd run a test-until-green loop, pointing the check command at your suite.
For the underlying loop mechanics, the Agent SDK agent-loop guide documents how the agent observes the check result and decides whether to continue. The recipe is adapted from the awesome-agent-loops collection (CC-BY) and lives alongside 150+ others in the Loops channel.
Twenty is a practical balance: enough repetitions to surface most flakes with reasonable probability, few enough to run in a sensible time. Rare flakes (well under 5% failure rate) may need more runs, so treat 20 as a floor you can raise.
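To see how far 20 runs goes for a given flake, the detection formula can be inverted to get the number of runs needed for a target probability. The sketch assumes independent runs with a constant failure rate; `runs_needed` and the 95% default target are choices made for this example.

```python
import math


def runs_needed(failure_rate: float, target_probability: float = 0.95) -> int:
    """Smallest n with 1 - (1 - p)^n >= target, assuming independent runs."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - failure_rate)
    )


if __name__ == "__main__":
    print(runs_needed(0.05))  # 59 runs for a 95% chance of catching a 5% flake
    print(runs_needed(0.01))  # 299 runs for a 1% flake
```

Under these assumptions a 1% flake needs about five times as many runs as a 5% flake to reach the same confidence, which is why rare flakes call for a higher run count.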
A single green run after a fix can be the flake hiding rather than being gone. Five consecutive green runs is strong statistical evidence the instability is resolved, which is why the loop uses it as the exit condition.
Quarantining a test is not the same as ignoring it: a quarantine with a tracking marker keeps CI honest while preserving a record that the test needs a real fix. The danger is silent deletion, which the loop's guardrails prevent.
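A quarantine with a tracking marker might look like the following sketch, which uses the standard library's `unittest.skip`. The `quarantine` helper, the test class, and the ticket id `FLAKY-123` are hypothetical, invented for illustration.

```python
import unittest


def quarantine(tracking_ref: str):
    """Skip a flaky test while recording where the real fix is tracked."""
    if not tracking_ref:
        raise ValueError("a quarantined test must reference a tracking issue")
    return unittest.skip(f"quarantined (flaky): {tracking_ref}")


class CheckoutTests(unittest.TestCase):
    @quarantine("FLAKY-123")  # hypothetical ticket id
    def test_payment_webhook_arrives(self):
        self.fail("intermittent in CI")

    def test_cart_total(self):
        self.assertEqual(2 + 3, 5)
```

The quarantined test still appears in the report as skipped, with its reason attached, so it stays visible instead of vanishing from the suite.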
The loop also works unattended in CI. It is just a check command run repeatedly, so it fits a scheduled CI job. Many teams run a nightly flaky-test sweep and quarantine offenders before the morning push.
Run the loop in an environment that matches CI — same parallelism, same resource limits. Flakes driven by timing or shared state often only appear under the contention CI creates.