Hermes for DevOps: Using the Bundled DevOps and MLOps Skills with Claude

DevOps work is the kind of agent task that benefits most from a persistent runtime. Alerts fire when you are asleep. Log correlations take minutes of reading, not seconds of clever prompting. Rollback decisions want traceability. Hermes Agent ships devops/ and mlops/ skill categories designed for this, and because Hermes is a long-lived daemon with memory, it fits the problem better than a per-session CLI does.

This post tours representative skills in both categories and walks through a realistic incident chain: an alert fires, Hermes reads the log, correlates with recent deploys, proposes a rollback, and delegates the rollback command to Claude Code as a subagent.

Key Takeaways

Hermes bundles devops/ and mlops/ skill categories alongside software-development and others.
DevOps skills cover CI debugging, log triage, alert handling, incident response, and rollback workflows.
MLOps skills cover model deployment, drift detection, eval scheduling, and training-job triage.
Hermes's persistent daemon is a natural fit for on-call responsibilities — it is always up and always listening.
Skills chain: alert-triage → log-correlation → deploy-history → rollback-proposal → hand off to Claude Code.
The claude-code subagent skill lets Hermes delegate heavy filesystem or repo work to a Claude Code process without ever leaving the daemon.

What Ships in `devops/`

Representative skills (exact names evolve; this is the shape of the category):

alert-triage/ — parse an alert payload (PagerDuty, Grafana, Prometheus Alertmanager) and extract the affected service, severity, and suspected cause.
log-correlation/ — pull logs across services around the alert timestamp and surface the likely signal.
ci-debug/ — read CI failure output, identify the failing step, propose a fix.
deploy-history/ — query recent deploys (git, Vercel, Fly, Kubernetes) and highlight the diffs that correlate with the incident window.
rollback-proposal/ — given a suspect deploy, draft a rollback plan with the exact command and the rollback-safety checks to run first.
incident-report/ — post-incident writeup template with timeline, blast radius, root cause, and follow-ups.
runbook-executor/ — step through a YAML-defined runbook, pausing for operator confirmation at each step.

What Ships in `mlops/`

model-deploy/ — promote a model version from staging to production with the canary pattern configured per-environment.
drift-detection/ — run a drift check against a baseline dataset and flag features out of tolerance.
eval-scheduler/ — schedule evals to run against production traffic snapshots (pairs naturally with Hermes cron).
training-job-triage/ — given a failed or slow training job, pull the logs and propose a diagnosis.
artifact-audit/ — verify model artifacts (hashes, sizes, metadata) before a deploy.

Why Persistent Runtime Is the Right Shape for This Work

A session-based CLI agent is wrong for on-call. You do not want to start a new session for every alert; you want an agent that has been watching. The Hermes daemon:

Listens for webhooks (PagerDuty, Alertmanager, GitHub Actions).
Runs cron jobs (hourly drift checks, nightly backups).
Keeps memory across the whole week (so yesterday's incident context informs today's triage).
Calls Claude Sonnet 4.6 for reasoning-heavy steps.
Delegates to Claude Code when it needs to actually change files in a repo.

Claude Code is still in the loop — Hermes is not trying to replace it. Hermes orchestrates, Claude Code edits.

Incident Chain: Alert → Rollback

Here is the full chain for a typical "latest deploy broke production" incident. Trigger: a Grafana alert hits the Hermes webhook endpoint.

Step 1: Alert Triage

Hermes receives the webhook payload and runs the alert-triage skill:

Alert payload parsed. Service: checkout-api. Severity: high. Signal: p99 latency 2.4s (baseline 120ms). Time window: last 8 minutes. Likely cause category: recent deploy.
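To make the triage step concrete, here is a minimal Python sketch of the parsing half, assuming an Alertmanager-style webhook body (an `alerts` list whose entries carry `labels`, `annotations`, and `startsAt`). The `triage_alert` function, the `TriageResult` fields, and the `service` and `severity` label names are illustrative assumptions, not the skill's actual interface.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TriageResult:
    service: str
    severity: str
    summary: str
    started_at: datetime


def triage_alert(payload: dict) -> list[TriageResult]:
    """Return one TriageResult per firing alert in an Alertmanager-style payload."""
    results = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no triage
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        results.append(
            TriageResult(
                # "service" and "severity" are assumed label conventions
                service=labels.get("service", "unknown"),
                severity=labels.get("severity", "unknown"),
                summary=annotations.get("summary", ""),
                # assumes second-precision UTC timestamps ending in "Z"
                started_at=datetime.fromisoformat(
                    alert["startsAt"].replace("Z", "+00:00")
                ),
            )
        )
    return results


# Hypothetical payload mirroring the incident in this post.
example_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighP99Latency",
                "service": "checkout-api",
                "severity": "high",
            },
            "annotations": {"summary": "p99 latency 2.4s (baseline 120ms)"},
            "startsAt": "2026-04-22T10:06:00Z",
        },
        {
            "status": "resolved",
            "labels": {"service": "search-api"},
            "annotations": {},
            "startsAt": "2026-04-22T09:00:00Z",
        },
    ],
}

triaged = triage_alert(example_payload)
```

The "likely cause category" in the output above is the reasoning-heavy part and is left to the model. The sketch covers only the deterministic extraction that feeds it.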

Step 2: Log Correlation

The log-correlation skill reads logs from the last 30 minutes:

# example config ~/.hermes/skills/devops/log-correlation/SKILL.md
services:
  checkout-api:
    logs:
      - source: loki
        query: '{service="checkout-api"} | json'
window_minutes: 30

Hermes pulls logs, finds a surge of timeout errors from a downstream dependency that started exactly at the deploy time.
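The "started at the deploy time" judgment can be approximated mechanically. The sketch below is one way to do it, assuming log lines have already been fetched as `(timestamp, message)` pairs. The helper names, the `timeout` keyword, the per-minute threshold, and the two-minute tolerance are all illustrative assumptions rather than anything the skill defines.

```python
from collections import Counter
from datetime import datetime, timedelta


def _floor_minute(ts: datetime) -> datetime:
    return ts.replace(second=0, microsecond=0)


def error_onset(events, needle="timeout", min_per_minute=5):
    """Return the first minute in which at least `min_per_minute` messages
    contain `needle`, or None if no minute reaches that count."""
    per_minute = Counter(
        _floor_minute(ts) for ts, message in events if needle in message.lower()
    )
    for minute in sorted(per_minute):
        if per_minute[minute] >= min_per_minute:
            return minute
    return None


def correlates_with_deploy(onset, deploy_time, tolerance=timedelta(minutes=2)):
    """True when the surge begins in the deploy's minute or within `tolerance` after it."""
    if onset is None:
        return False
    return timedelta(0) <= onset - _floor_minute(deploy_time) <= tolerance


# Hypothetical data: background noise before the deploy, a surge just after it.
deploy = datetime(2026, 4, 22, 10, 0, 30)
events = [(datetime(2026, 4, 22, 9, 55, s), "upstream timeout") for s in (5, 40)]
events += [
    (datetime(2026, 4, 22, 10, 1, s), "payment-client timeout")
    for s in range(0, 60, 10)
]
events.append((datetime(2026, 4, 22, 10, 1, 15), "request ok"))

onset = error_onset(events)
```

Here the two stray timeouts at 09:55 stay under the threshold, so the onset lands on the 10:01 bucket, one minute after the deploy.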

Step 3: Deploy History

The deploy-history skill runs:

Last 3 deploys to checkout-api: 14 min ago (commit abc123), 2 days ago (commit def456), 5 days ago. The 14-min-ago deploy is suspect. Diff touches the payment-client timeout configuration.

Hermes has now narrowed the scope. It has the suspect commit, the changed files, and the correlating log signature.
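The narrowing itself is a simple window filter. A minimal sketch, assuming deploy records have already been collected from whatever source the skill queries. The `Deploy` dataclass, the one-hour lookback, and the changed-file path are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    commit: str
    deployed_at: datetime
    changed_files: tuple[str, ...] = ()


def suspect_deploys(deploys, incident_start, lookback=timedelta(hours=1)):
    """Deploys that landed within `lookback` before the incident began, newest first."""
    window_start = incident_start - lookback
    hits = [d for d in deploys if window_start <= d.deployed_at <= incident_start]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Hypothetical history matching the example output above.
now = datetime(2026, 4, 22, 10, 14)
history = [
    Deploy("def456", now - timedelta(days=2)),
    # the file path below is made up for illustration
    Deploy("abc123", now - timedelta(minutes=14), ("config/payment_client.yaml",)),
]
incident_start = now - timedelta(minutes=8)

suspects = suspect_deploys(history, incident_start)
```

Only `abc123` falls inside the window, which matches the skill's conclusion. In a busier service the list could hold several candidates, and the diff review would rank them.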

Step 4: Rollback Proposal

The rollback-proposal skill produces:

Proposed action: revert commit abc123 and redeploy. Safety checks: ensure no DB migration in the diff (confirmed: none), ensure staging is healthy on prior version (confirmed), ensure rollback target matches deploy manifest (confirmed). Command draft: gh workflow run deploy.yml --ref v2026.04.21.

Hermes does not execute this. The skill posts a message to the on-call Slack channel via the Slack messaging gateway, tagging the on-call engineer, with the proposal and a one-click confirmation link.
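The "propose, do not execute" rule is easier to enforce when the proposal is plain data. The sketch below shows one possible shape, with hypothetical `SafetyCheck` and `RollbackProposal` types. It only renders a message for the operator and never runs the command.

```python
from dataclasses import dataclass, field


@dataclass
class SafetyCheck:
    name: str
    passed: bool


@dataclass
class RollbackProposal:
    suspect_commit: str
    command: str  # drafted for the operator; this sketch never runs it
    checks: list[SafetyCheck] = field(default_factory=list)

    def failed_checks(self) -> list[str]:
        return [c.name for c in self.checks if not c.passed]

    def ready_for_operator(self) -> bool:
        # A proposal with no checks at all is treated as not ready.
        return bool(self.checks) and not self.failed_checks()

    def render(self) -> str:
        """Build the text that would be posted to the on-call channel."""
        lines = [f"Proposed action: revert commit {self.suspect_commit} and redeploy."]
        for check in self.checks:
            status = "confirmed" if check.passed else "FAILED"
            lines.append(f"- {check.name}: {status}")
        lines.append(f"Command draft: {self.command}")
        return "\n".join(lines)


proposal = RollbackProposal(
    suspect_commit="abc123",
    command="gh workflow run deploy.yml --ref v2026.04.21",
    checks=[
        SafetyCheck("no DB migration in the diff", True),
        SafetyCheck("staging healthy on prior version", True),
        SafetyCheck("rollback target matches deploy manifest", True),
    ],
)

message = proposal.render()
```

Keeping `ready_for_operator()` separate from `render()` means a proposal with a failed check can still be shown to a human, but the calling code can refuse to attach a confirmation link to it.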

Step 5: Handoff to Claude Code

Once the operator confirms, Hermes delegates the actual repo work to Claude Code using the bundled claude-code subagent skill:

# pseudocode of the subagent call
spawn: claude-code
task: |
  Revert commit abc123 on main, open a PR with title
  "revert: payment-client timeout regression (incident 2026-04-22-01)",
  link the incident memory file, and wait for CI.

Claude Code runs in a sandboxed workspace, does the revert, opens the PR, and reports back. Hermes watches the PR, waits for CI to pass, and then triggers the deploy via the confirmed workflow run.

Throughout all of this, the memory store in ~/.hermes/ accumulates the incident file: who paged, when, what logs were pulled, what decisions were made. That becomes the raw material for the post-incident report via the incident-report skill.

MLOps Chain: Drift Detected, Retrain Proposed

A shorter MLOps example. The drift-detection skill runs hourly on cron:

# ~/.hermes/cron.yaml
- name: drift-hourly
  schedule: "0 * * * *"
  skill: mlops/drift-detection
  args:
    baseline: s3://models/fraud-v12/baseline-features.parquet
    current_window: 1h

When a feature drifts, the skill:

Logs the drift to the Hermes memory store.
If the drift exceeds a configured threshold, opens a Slack thread with the drift plot and proposes running mlops/training-job-triage to check recent training data freshness.
Optionally schedules a retrain via the model-deploy skill's staging pathway (never production without explicit approval).
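The post does not say which statistic the drift-detection skill uses. As a stand-in, the sketch below computes the Population Stability Index (PSI) with equal-width bins in plain Python. The function names, the bin count, and the 0.2 threshold are illustrative assumptions, and a real check would load the baseline from the parquet file rather than from in-memory lists.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `baseline` (equal-width bins)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # values outside the baseline range are clamped into the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def drifted_features(baseline_by_feature, current_by_feature, threshold=0.2):
    """Names of features whose PSI exceeds `threshold`, in sorted order."""
    return sorted(
        name
        for name, base in baseline_by_feature.items()
        if psi(base, current_by_feature[name]) > threshold
    )


# Hypothetical features: "amount" shifts upward, "age" stays put.
base_values = [i / 100 for i in range(100)]
baseline = {"amount": base_values, "age": base_values}
current = {"amount": [v + 0.5 for v in base_values], "age": list(base_values)}

flagged = drifted_features(baseline, current)
```

An identical distribution scores exactly zero, so the unshifted feature is never flagged. The right threshold depends on the feature and on how noisy an hourly window is.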

Why This Stack Is Appealing

The combination that makes this compelling:

Hermes is always on (daemon), so alerts are not missed.
The bundled skills are auditable markdown, so you know what the agent will do.
Claude Sonnet 4.6 is the reasoning engine; it is good at log triage and correlation.
Claude Code handles the actual code changes, which is what it is best at.
Memory persists, so the incident-report skill starts from a recorded timeline instead of a blank page.

For more on the Claude Code subagent pattern, see spawning Claude Code as a Hermes subagent. For scheduled work specifically, see scheduling Claude agents: Hermes cron for daily reports. For the messaging gateways used in the handoff, see Hermes messaging gateway: Telegram and Discord.

Guardrails Worth Setting

A few defaults I would not run production on-call without:

Max-turns cap per incident session, to prevent runaway loops.
Budget cap per day via the Hermes cost-control config.
Required operator confirmation before any kubectl delete, gh workflow run ... --ref prod, or rollback that touches a paid-tier service.
Read-only credentials for the log sources wherever possible.
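Two of these guardrails are simple enough to sketch in code. The snippet below is not Hermes configuration. It is a hypothetical Python illustration of a confirmation gate and a max-turns cap, with made-up names (`CONFIRM_PATTERNS`, `needs_confirmation`, `TurnBudget`) and deliberately narrow regular expressions.

```python
import re

# Commands matching any of these patterns must wait for a human.
CONFIRM_PATTERNS = [
    re.compile(r"^kubectl\s+delete\b"),
    re.compile(r"^gh\s+workflow\s+run\b.*--ref[ =]prod\b"),
]


def needs_confirmation(command: str) -> bool:
    """True when the command matches a pattern that requires operator sign-off."""
    stripped = command.strip()
    return any(pattern.search(stripped) for pattern in CONFIRM_PATTERNS)


class TurnBudget:
    """Counts agent turns and refuses to continue past a hard cap."""

    def __init__(self, max_turns: int) -> None:
        self.max_turns = max_turns
        self.used = 0

    def spend(self) -> int:
        """Record one turn and return how many remain; raise once the cap is hit."""
        if self.used >= self.max_turns:
            raise RuntimeError(f"max-turns cap of {self.max_turns} reached")
        self.used += 1
        return self.max_turns - self.used


budget = TurnBudget(max_turns=2)
remaining_after_first = budget.spend()
```

A deny-by-pattern gate like this is easy to bypass with an unusual command shape, so it works best alongside read-only credentials rather than instead of them.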

See cost control: Hermes max-turns, budget, fallback for the concrete knobs.

Sources

GitHub: NousResearch/hermes-agent — see skills/devops/ and skills/mlops/
Hermes docs: hermes-agent.nousresearch.com/docs/
Related: Spawning Claude Code as a Hermes subagent
Related: Scheduling Claude agents: Hermes cron for daily reports
Related: Cost control: Hermes max-turns, budget, fallback
