Agents are expensive when they are wrong. A conversation that gets into a loop and retries the same failing command 80 times can burn through real money before you notice. Hermes exposes three levers specifically to prevent this — a hard turn cap, a hard dollar cap, and a fallback model chain. This post is about using all three, and how to think about the cost/quality tradeoff they let you make.

Key Takeaways

--max-turns N caps a conversation's round-trips. Prevents runaway loops where an agent argues with itself forever.
--max-budget-usd X is a hard USD ceiling per session. Hermes counts tokens as it goes; when the budget is hit, the session ends.
Fallback model chain degrades gracefully: configure claude-sonnet-4-6 → claude-haiku-4-5 → gpt-4o-mini so quota or failure on the primary triggers automatic downgrade.
Use cheap models for triage and expensive models for final output to get quality where it matters without paying for it where it does not.
All three levers are composable and should be set on every scheduled task without exception.
Budget is the most important of the three — set it even when you think it is unnecessary.

The Three Levers

Lever One: --max-turns

A "turn" is one round-trip — you prompt, the agent responds (possibly calling tools in the process). --max-turns 20 means the conversation will end after at most 20 turns regardless of how compelling the agent finds the remaining work.

hermes chat --provider anthropic --model claude-sonnet-4-6 --max-turns 20

Why it matters: the most common agent failure mode is not wrong output, it is infinite loops. An agent tries to run a command, gets an error, decides to try something else, gets another error, loops back. Without a turn cap, this can continue until your context window or your wallet runs out. A sensible cap for most tasks is 10-30 turns. For very focused tasks, 5 is enough.
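The mechanics of a turn cap can be sketched in a few lines. This is a minimal illustration, not Hermes's implementation: run_with_turn_cap, stuck_agent, and the exit-reason strings are hypothetical names invented for this example.

```python
from typing import Callable, Tuple


def run_with_turn_cap(
    step: Callable[[int], Tuple[str, bool]], max_turns: int
) -> Tuple[str, str]:
    """Run `step` until it reports completion or the turn cap is reached.

    `step(turn)` is a hypothetical stand-in for one agent round-trip.
    It returns (latest_output, done).
    """
    output = ""
    for turn in range(1, max_turns + 1):
        output, done = step(turn)
        if done:
            return output, "completed"
    # The cap was reached: return what we have, plus the reason.
    return output, "max_turns_exhausted"


def stuck_agent(turn: int) -> Tuple[str, bool]:
    # Simulates an agent that retries the same failing command forever.
    return f"retry {turn}: command failed", False


result, reason = run_with_turn_cap(stuck_agent, max_turns=20)
print(reason, "|", result)
```

The loop never needs to understand why the agent is stuck. It only counts round-trips, which is why the cap works even against failure modes nobody anticipated.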

Lever Two: --max-budget-usd

This is the hard stop. The agent tracks token usage in real time, converts to USD at the current model's pricing, and halts the session when the cap is hit.

hermes chat --provider anthropic --model claude-sonnet-4-6 \
  --max-turns 20 \
  --max-budget-usd 2.00

For most interactive sessions, $1-$5 is plenty. For scheduled tasks, $0.25-$1.00 is typically enough. The point is a bounded worst case: even if everything else goes wrong, spend is capped at the number you set. Of the three levers, this is the strongest protection.
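To make "tracks tokens, converts to USD, halts at the cap" concrete, here is a rough sketch of that accounting. BudgetTracker is a hypothetical class, and the per-million-token prices are placeholders chosen for easy arithmetic, not real list prices. How Hermes does this internally is not shown in this post.

```python
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    max_budget_usd: float
    input_usd_per_mtok: float   # placeholder price per million input tokens
    output_usd_per_mtok: float  # placeholder price per million output tokens
    spent_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add the cost of one API call to the running total."""
        self.spent_usd += (
            input_tokens * self.input_usd_per_mtok
            + output_tokens * self.output_usd_per_mtok
        ) / 1_000_000

    def exhausted(self) -> bool:
        return self.spent_usd >= self.max_budget_usd


tracker = BudgetTracker(
    max_budget_usd=2.00, input_usd_per_mtok=3.0, output_usd_per_mtok=15.0
)
calls = 0
while not tracker.exhausted():  # check the cap before every call
    tracker.record(input_tokens=40_000, output_tokens=2_000)
    calls += 1

print(f"stopped after {calls} calls, spent ${tracker.spent_usd:.2f}")
```

In this sketch the cap is checked before each call, so the total can overshoot by at most the cost of one call. If a cap must never be exceeded, leave a small margin below the true limit.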

Lever Three: Fallback Model Chain

The fallback chain specifies a list of models to try in order. If the primary fails (quota exceeded, rate limited, transient error), Hermes automatically downgrades to the next model in the chain. The conversation continues on the fallback.

Conceptual config (see docs for exact field names):

models:
  primary: claude-sonnet-4-6
  fallback:
    - claude-haiku-4-5
    - gpt-4o-mini
  fallback_triggers:
    - quota_exceeded
    - rate_limited
    - timeout

What this buys you: resilience. A 3am cron job that hits an exhausted Anthropic quota no longer fails silently. It finishes, possibly with lower-quality output, and emails you the result. For scheduled tasks this is essential. For interactive sessions it is a nice-to-have.
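The "try each model in order" behavior can be sketched as follows. This is an illustration only: ModelUnavailable, call_with_fallback, and fake_call are hypothetical names, and the fake call simulates a primary model whose quota is exhausted.

```python
from typing import Callable, Optional, Sequence, Tuple


class ModelUnavailable(Exception):
    """Hypothetical error covering quota_exceeded, rate_limited, and timeout."""


def call_with_fallback(
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
    prompt: str,
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success."""
    last_error: Optional[Exception] = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as err:
            last_error = err  # remember why this model failed, then move on
    raise RuntimeError("every model in the chain failed") from last_error


def fake_call(model: str, prompt: str) -> str:
    # Simulated provider: the primary model is out of quota.
    if model == "claude-sonnet-4-6":
        raise ModelUnavailable("quota_exceeded")
    return f"[{model}] digest for: {prompt}"


used_model, reply = call_with_fallback(
    ["claude-sonnet-4-6", "claude-haiku-4-5", "gpt-4o-mini"],
    fake_call,
    "yesterday's deploys",
)
print(used_model)
```

Returning the model that actually answered is a deliberate choice here: a digest produced by a fallback may deserve a note in the log or in the email itself.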

Combining The Three

A minimal production-grade launch for a scheduled task:

hermes run morning-digest \
  --max-turns 15 \
  --max-budget-usd 0.50 \
  --fallback-model claude-haiku-4-5 \
  --fallback-model gpt-4o-mini

With this configuration, the worst case is: 15 turns spent on a cheaper fallback model, capped at 50 cents, and the job either completes or is terminated cleanly. No scenario burns real money. No scenario runs forever.

For the scheduler-based equivalent, the same caps can be set per-schedule in YAML:

schedules:
  - name: "morning-digest"
    cron: "0 8 * * 1-5"
    prompt: "Summarize yesterday's deploys..."
    max_turns: 15
    max_budget_usd: 0.50
    models:
      primary: claude-sonnet-4-6
      fallback:
        - claude-haiku-4-5
        - gpt-4o-mini

See scheduling-claude-agents-hermes-cron-daily-reports for the full scheduler config.

Model Pricing Mental Model

To use these levers well, a rough pricing hierarchy helps. Prices change; the order is the point:

Claude Opus / GPT-4 / Claude Sonnet 4.6: premium. Use for synthesis, hard reasoning, final output.
Claude Haiku 4.5 / GPT-4o-mini: cheap-but-capable. Use for triage, summarization, routing.
Tiny local models (via Ollama/vLLM): effectively free after the hardware. Use for bulk classification, simple extraction.

A smart agent setup uses the cheap tier to triage (is this PR worth reading in detail?), the premium tier to reason (yes it is — here is what to do about it), and the local tier for bulk stuff.
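One way to encode that tiering is a plain lookup table. The sketch below is illustrative: the task names, the pick_model helper, and the "local-model" placeholder are assumptions made for this example, not Hermes configuration.

```python
# Map each kind of work to a pricing tier, following the hierarchy above.
TIER_FOR_TASK = {
    "triage": "cheap",
    "summarization": "cheap",
    "routing": "cheap",
    "synthesis": "premium",
    "hard_reasoning": "premium",
    "final_output": "premium",
    "bulk_classification": "local",
    "simple_extraction": "local",
}

# One representative model per tier. "local-model" is a placeholder name.
MODEL_FOR_TIER = {
    "cheap": "claude-haiku-4-5",
    "premium": "claude-sonnet-4-6",
    "local": "local-model",
}


def pick_model(task_kind: str) -> str:
    """Return the model for a task kind; unknown kinds go to the premium tier."""
    tier = TIER_FOR_TASK.get(task_kind, "premium")
    return MODEL_FOR_TIER[tier]


print(pick_model("triage"), pick_model("final_output"))
```

Defaulting unknown work to the premium tier favors quality over cost. With a budget cap in place, defaulting to the cheap tier instead is just as defensible.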

Per-Subagent Budget Caps

The budget controls also matter for the parallel pattern (see parallel-workstreams-multiple-claude-from-hermes). When the orchestrator dispatches five subagents, each should have its own max_budget_usd. Otherwise a single misbehaving subagent could consume the budget meant for all five.

# per-subagent caps, illustrative
delegate_task(
    prompt="...",
    model="claude-sonnet-4-6",
    max_turns=15,
    max_budget_usd=0.30,
)

Orchestrator total budget = sum of subagent budgets + orchestrator's own reasoning budget.
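As a worked example of that sum, take five subagents capped at $0.30 each, as in the snippet above, plus an assumed $0.50 for the orchestrator's own reasoning. The $0.50 figure is illustrative. Amounts are kept in integer cents to avoid floating-point rounding.

```python
# Five subagents, each capped at 30 cents.
subagent_budgets_cents = [30] * 5

# Assumed allowance for the orchestrator's own reasoning (illustrative).
orchestrator_budget_cents = 50

total_budget_cents = sum(subagent_budgets_cents) + orchestrator_budget_cents
print(f"total budget: ${total_budget_cents / 100:.2f}")
```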

What Happens When Caps Hit

All three levers fail cleanly by design. Hitting --max-turns returns whatever the agent has produced so far, with a note that turns were exhausted. Hitting --max-budget-usd halts the next API call and returns with a budget-exceeded status. A fallback trigger switches models and the session continues. Your code (or scheduled task log) sees a clear terminal state and can act on it.

What you want to avoid: silent failures. Always log the exit reason. A scheduled task that "completed" without producing output because it hit a turn cap should raise an alert, not pass silently.
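A small wrapper makes the "log the exit reason, alert on empty output" rule hard to forget. This is a sketch under assumptions: RunResult and the exit-reason strings are hypothetical, since the exact status values Hermes reports are not listed in this post.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("scheduled-agent")


@dataclass
class RunResult:
    exit_reason: str  # assumed values: "completed", "max_turns_exhausted", "budget_exceeded"
    output: str


def needs_alert(result: RunResult) -> bool:
    """Alert on any non-clean exit, and on a clean exit that produced nothing."""
    if result.exit_reason != "completed":
        return True
    return not result.output.strip()


def handle(result: RunResult) -> bool:
    # Always record why the run ended, even when everything went fine.
    logger.info(
        "exit_reason=%s output_chars=%d", result.exit_reason, len(result.output)
    )
    alert = needs_alert(result)
    if alert:
        logger.warning("run needs attention: exit_reason=%s", result.exit_reason)
    return alert


print(handle(RunResult("max_turns_exhausted", "")))
```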

Cost Strategy: Triage Cheap, Finalize Expensive

The highest-leverage pattern for production: use Haiku or GPT-4o-mini for the first 80% of a workflow, then promote to Sonnet for the final synthesis. You pay premium prices only for the part where quality actually matters.

Concrete example: summarizing 50 PRs. Let Haiku read each PR and produce a one-line summary. That is cheap. Then Sonnet takes the 50 summaries and produces the final categorized digest. That is where quality matters, and the context is now small enough that Sonnet runs quickly.

Net cost is a fraction of "Sonnet reads all 50 PRs in detail" with negligible quality loss on the final artifact.
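A back-of-the-envelope comparison shows where the saving comes from. Every number below is an assumption picked for round arithmetic: token counts per PR, summary length, and per-million-token prices are placeholders, and only input tokens are counted.

```python
PR_COUNT = 50
TOKENS_PER_PR = 8_000        # assumed average size of one PR
SUMMARY_TOKENS = 40          # assumed size of a one-line summary
CHEAP_USD_PER_MTOK = 1.0     # placeholder input price, cheap tier
PREMIUM_USD_PER_MTOK = 3.0   # placeholder input price, premium tier


def input_cost_usd(tokens: int, usd_per_mtok: float) -> float:
    """Cost of reading `tokens` input tokens at a per-million-token price."""
    return tokens * usd_per_mtok / 1_000_000


# Baseline: the premium model reads every PR in full.
premium_only = input_cost_usd(PR_COUNT * TOKENS_PER_PR, PREMIUM_USD_PER_MTOK)

# Two-tier: the cheap model reads every PR, the premium model reads only summaries.
triage = input_cost_usd(PR_COUNT * TOKENS_PER_PR, CHEAP_USD_PER_MTOK)
final = input_cost_usd(PR_COUNT * SUMMARY_TOKENS, PREMIUM_USD_PER_MTOK)
two_tier = triage + final

print(f"premium only: ${premium_only:.3f}")
print(f"two tier:     ${two_tier:.3f}")
```

Under these made-up numbers the two-tier run costs roughly a third of the premium-only run, and almost all of what remains is the cheap model's reading pass. The real ratio depends on the actual price gap between the tiers.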

Closing Thought

The first time you leave an agent running overnight without these caps, you are playing the lottery. The numbers come up fine most nights. On the wrong night, the numbers come up $400. Set the caps. Set them low. Raise them only if you hit them and the work was genuinely worth the money.

Sources

Hermes GitHub: github.com/NousResearch/hermes-agent
Hermes docs: hermes-agent.nousresearch.com/docs/
Anthropic pricing: anthropic.com/pricing
Series: scheduling-claude-agents-hermes-cron-daily-reports, parallel-workstreams-multiple-claude-from-hermes, running-hermes-on-claude-sonnet-4-6
Anthropic API docs: docs.anthropic.com

Cost Control on Hermes: --max-turns, --max-budget-usd, and Claude Fallback Models
