Cost Control on Hermes: --max-turns, --max-budget-usd, and Claude Fallback Models
The three cost levers that keep Hermes agents from burning your API budget: turn caps, hard USD limits, and a fallback model chain that downgrades on failure.
Agents are expensive when they are wrong. A conversation that gets into a loop and retries the same failing command 80 times can burn through real money before you notice. Hermes exposes three levers specifically to prevent this — a hard turn cap, a hard dollar cap, and a fallback model chain. This post is about using all three, and how to think about the cost/quality tradeoff they let you make.
--max-turns N caps a conversation's round-trips. It prevents runaway loops where an agent argues with itself forever.
--max-budget-usd X is a hard USD ceiling per session. Claude counts tokens as it goes; when the budget is hit, the session ends.
A fallback model chain such as claude-sonnet-4-6 → claude-haiku-4-5 → gpt-4o-mini means that quota exhaustion or failure on the primary triggers an automatic downgrade.
A "turn" is one round-trip: you prompt, the agent responds (possibly calling tools in the process). --max-turns 20 means the conversation ends after at most 20 turns, regardless of how compelling the agent finds the remaining work.
hermes chat --provider anthropic --model claude-sonnet-4-6 --max-turns 20
Why it matters: the most common agent failure mode is not wrong output, it is infinite loops. An agent tries to run a command, gets an error, decides to try something else, gets another error, loops back. Without a turn cap, this can continue until your context window or your wallet runs out. A sensible cap for most tasks is 10-30 turns. For very focused tasks, 5 is enough.
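To make the turn cap concrete, here is a minimal Python sketch of the kind of loop a cap interrupts. It is not how Hermes implements --max-turns. The names run_with_turn_cap, agent_step, and stuck_agent, and the exit-reason strings, are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Tuple


def run_with_turn_cap(
    agent_step: Callable[[int], Tuple[str, bool]],
    max_turns: int,
) -> Tuple[List[str], str]:
    """Run agent_step until it reports completion or max_turns is reached.

    agent_step(turn) returns (output, done). It is a hypothetical stand-in
    for one agent round-trip, not a real Hermes API.
    """
    outputs: List[str] = []
    for turn in range(1, max_turns + 1):
        output, done = agent_step(turn)
        outputs.append(output)
        if done:
            return outputs, "completed"
    # Cap reached: hand back the partial work with an explicit reason.
    return outputs, "max_turns_exhausted"


def stuck_agent(turn: int) -> Tuple[str, bool]:
    # Simulates an agent that keeps retrying a failing command and never finishes.
    return f"turn {turn}: command failed, retrying", False


outputs, reason = run_with_turn_cap(stuck_agent, max_turns=20)
print(len(outputs), reason)
```

Without the cap, stuck_agent would loop indefinitely. With it, the caller gets the partial output plus a reason it can log.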
This is the hard stop. The agent tracks token usage in real time, converts to USD at the current model's pricing, and halts the session when the cap is hit.
hermes chat --provider anthropic --model claude-sonnet-4-6 \
  --max-turns 20 \
  --max-budget-usd 2.00
For most interactive sessions, $1-$5 is plenty. For scheduled tasks, $0.25-$1.00 is typically enough. The key property is that your exposure is bounded: even if everything else goes wrong, you cannot spend more than the number you set. Of the three levers, this is the strongest protection.
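The track-convert-halt behavior described for the budget cap can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hermes's accounting code: BudgetTracker and BudgetExceeded are hypothetical names, and the per-million-token prices are placeholders rather than current rates.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when accumulated spend reaches the session cap."""


@dataclass
class BudgetTracker:
    max_budget_usd: float
    input_usd_per_mtok: float   # placeholder price per million input tokens
    output_usd_per_mtok: float  # placeholder price per million output tokens
    spent_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one API call's cost and raise once the cap is reached."""
        cost = (
            input_tokens * self.input_usd_per_mtok
            + output_tokens * self.output_usd_per_mtok
        ) / 1_000_000
        self.spent_usd += cost
        if self.spent_usd >= self.max_budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.max_budget_usd:.2f}"
            )
        return self.spent_usd


# Placeholder prices: $3 input and $15 output per million tokens (assumed).
tracker = BudgetTracker(max_budget_usd=2.00, input_usd_per_mtok=3.0, output_usd_per_mtok=15.0)
calls = 0
try:
    while True:
        tracker.record(input_tokens=20_000, output_tokens=2_000)
        calls += 1
except BudgetExceeded as exc:
    print(f"halted after {calls} completed calls: {exc}")
```

In this sketch the check runs after each call is costed, so the final call can push spend slightly past the cap. A stricter variant would estimate the next call's cost and refuse to start it.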
The fallback chain specifies a list of models to try in order. If the primary fails (quota exceeded, rate limited, transient error), Hermes automatically downgrades to the next model in the chain. The conversation continues on the fallback.
Conceptual config (see docs for exact field names):
models:
  primary: claude-sonnet-4-6
  fallback:
    - claude-haiku-4-5
    - gpt-4o-mini
  fallback_triggers:
    - quota_exceeded
    - rate_limited
    - timeout
What this buys you: resilience. A 3am cron job on quota-exhausted Anthropic no longer fails silently — it finishes, possibly with lower-quality output, and emails you the result. For scheduled tasks this is essential. For interactive sessions it is a nice-to-have.
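The downgrade logic amounts to "try each model in order, and move on only for the listed triggers". A minimal Python sketch of that idea follows. It assumes a hypothetical call_model function and a hypothetical ModelError exception carrying a reason string; neither is a real Hermes or provider API, and the trigger names simply mirror the conceptual config above.

```python
from typing import Callable, Optional, Sequence, Tuple

# Trigger names mirror the conceptual config; the exact names are assumptions.
FALLBACK_TRIGGERS = {"quota_exceeded", "rate_limited", "timeout"}


class ModelError(Exception):
    """Hypothetical provider error that carries a machine-readable reason."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


def call_with_fallback(
    call_model: Callable[[str, str], str],
    chain: Sequence[str],
    prompt: str,
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, reply)."""
    last_error: Optional[ModelError] = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelError as exc:
            if exc.reason not in FALLBACK_TRIGGERS:
                raise  # not a fallback trigger, so surface the error
            last_error = exc
    raise RuntimeError("every model in the chain failed") from last_error


def fake_call_model(model: str, prompt: str) -> str:
    # Simulates an exhausted quota on the primary model.
    if model == "claude-sonnet-4-6":
        raise ModelError("quota_exceeded")
    return f"[{model}] summary of: {prompt}"


model_used, reply = call_with_fallback(
    fake_call_model,
    ["claude-sonnet-4-6", "claude-haiku-4-5", "gpt-4o-mini"],
    "yesterday's deploys",
)
print(model_used)
```

Returning the model that actually answered lets the caller record when output came from a fallback, which is worth knowing when the result looks weaker than usual.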
A minimal production-grade launch for a scheduled task:
hermes run morning-digest \
  --max-turns 15 \
  --max-budget-usd 0.50 \
  --fallback-model claude-haiku-4-5 \
  --fallback-model gpt-4o-mini
With this configuration, the worst case is: 15 turns spent on a cheaper fallback model, capped at 50 cents, and the job either completes or is terminated cleanly. No scenario burns real money. No scenario runs forever.
For the scheduler-based equivalent, the same caps can be set per-schedule in YAML:
schedules:
  - name: "morning-digest"
    cron: "0 8 * * 1-5"
    prompt: "Summarize yesterday's deploys..."
    max_turns: 15
    max_budget_usd: 0.50
    models:
      primary: claude-sonnet-4-6
      fallback:
        - claude-haiku-4-5
See scheduling-claude-agents-hermes-cron-daily-reports for the full scheduler config.
To use these levers well, a rough pricing hierarchy helps. Prices change; the order is the point:
A smart agent setup uses the cheap tier to triage (is this PR worth reading in detail?), the premium tier to reason (yes it is — here is what to do about it), and the local tier for bulk stuff.
The budget controls also matter for the parallel pattern (see parallel-workstreams-multiple-claude-from-hermes). When the orchestrator dispatches five subagents, each should have its own max_budget_usd. Otherwise a single misbehaving subagent could consume the budget meant for all five.
# per-subagent caps, illustrative
delegate_task(
    prompt="...",
    model="claude-sonnet-4-6",
    max_turns=15,
    max_budget_usd=0.30,
)
Orchestrator total budget = sum of subagent budgets + orchestrator's own reasoning budget.
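As a worked instance of that sum, the Python snippet below uses five subagents at the $0.30 cap from the example, plus an assumed $0.50 for the orchestrator's own reasoning. The $0.50 figure is illustrative, not a recommendation from Hermes.

```python
# Five subagents, each capped at $0.30 as in the illustrative call.
subagent_budgets_usd = [0.30] * 5

# Assumed allowance for the orchestrator's own reasoning (hypothetical figure).
orchestrator_budget_usd = 0.50

total_budget_usd = sum(subagent_budgets_usd) + orchestrator_budget_usd
print(f"worst-case spend for the whole run: ${total_budget_usd:.2f}")
```

The total is the figure to compare against whatever you are willing to lose on a single run.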
All three levers fail cleanly by design. --max-turns returns whatever the agent has produced so far, with a note that turns were exhausted. --max-budget-usd halts the next API call and returns with a budget-exceeded status. Fallback triggers a model switch and continues. Your code (or scheduled task log) sees a clear terminal state and can act on it.
What you want to avoid: silent failures. Always log the exit reason. A scheduled task that "completed" without producing output because it hit a turn cap should raise an alert, not pass silently.
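One way to enforce this is a small check that runs after every scheduled task. The Python sketch below assumes the runner exposes an exit reason string and the task output. The reason names ("max_turns_exhausted", "budget_exceeded", "completed") and the function needs_alert are hypothetical, so map them to whatever your setup actually reports.

```python
import logging

logger = logging.getLogger("scheduled_tasks")

# Hypothetical exit reasons that mean the task was cut short.
ALERT_REASONS = {"max_turns_exhausted", "budget_exceeded"}


def needs_alert(exit_reason: str, output: str) -> bool:
    """Log the exit reason and decide whether a human should be notified."""
    logger.info("task exit_reason=%s output_chars=%d", exit_reason, len(output))
    if exit_reason in ALERT_REASONS:
        return True
    # A "completed" task with empty output is also suspicious.
    return not output.strip()


print(needs_alert("completed", "3 deploys, all green"))
print(needs_alert("max_turns_exhausted", "partial digest"))
print(needs_alert("completed", "   "))
```

The third case is the one that tends to slip through: a run that reports success but produced nothing.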
The highest-leverage pattern for production: use Haiku or GPT-4o-mini for the first 80% of a workflow, then promote to Sonnet for the final synthesis. You pay premium prices only for the part where quality actually matters.
Concrete example: summarizing 50 PRs. Let Haiku read each PR and produce a one-line summary. That is cheap. Then Sonnet takes the 50 summaries and produces the final categorized digest. That is where quality matters, and the context is now small enough that Sonnet runs quickly.
Net cost is a fraction of "Sonnet reads all 50 PRs in detail" with negligible quality loss on the final artifact.
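A back-of-the-envelope calculation shows where the saving comes from. Every number in this Python sketch is an assumption chosen for illustration: the token counts per PR, the summary and digest lengths, and the per-million-token prices are placeholders, not measured values or current rates.

```python
# All figures below are illustrative placeholders, not real prices or measurements.
PR_COUNT = 50
TOKENS_PER_PR = 8_000          # assumed input size of one PR
SUMMARY_TOKENS_PER_PR = 60     # assumed length of a one-line summary
DIGEST_TOKENS = 1_500          # assumed length of the final digest

CHEAP_IN, CHEAP_OUT = 1.0, 5.0        # assumed USD per million tokens, cheap tier
PREMIUM_IN, PREMIUM_OUT = 3.0, 15.0   # assumed USD per million tokens, premium tier


def cost_usd(input_tokens: int, output_tokens: int, price_in: float, price_out: float) -> float:
    """Cost of one workload at the given per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Option A: the premium model reads every PR in full and writes the digest.
premium_only = cost_usd(PR_COUNT * TOKENS_PER_PR, DIGEST_TOKENS, PREMIUM_IN, PREMIUM_OUT)

# Option B: the cheap model summarizes each PR, the premium model reads only the summaries.
summary_tokens = PR_COUNT * SUMMARY_TOKENS_PER_PR
two_stage = (
    cost_usd(PR_COUNT * TOKENS_PER_PR, summary_tokens, CHEAP_IN, CHEAP_OUT)
    + cost_usd(summary_tokens, DIGEST_TOKENS, PREMIUM_IN, PREMIUM_OUT)
)

print(f"premium only: ${premium_only:.4f}")
print(f"two stage:    ${two_stage:.4f}")
print(f"ratio:        {two_stage / premium_only:.2f}")
```

The ratio depends almost entirely on the price gap between tiers, because the bulk of the tokens are PR text read once. Swap in current prices before relying on the result.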
The first time you leave an agent running overnight without these caps, you are playing the lottery. The numbers come up fine most nights. On the wrong night, the numbers come up $400. Set the caps. Set them low. Raise them only if you hit them and the work was genuinely worth the money.