hermes-atropos-environments
Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, and wandb logging.
Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework.
```
Atropos BaseEnv (atroposlib/envs/base.py)
└── HermesAgentBaseEnv (environments/hermes_base_env.py)
    ├── Handles agent loop orchestration
    ├── Handles tool resolution per group
    ├── Handles ToolContext for reward verification
    └── YOUR ENVIRONMENT (environments/your_env.py)
        Only implements: setup, get_next_item, format_prompt,
        compute_reward, evaluate, wandb_log
```
Hermes environments are special because they run a multi-turn agent loop with tool calling — not just single-turn completions. The base env handles the loop; you implement the task and scoring.
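The division of labor can be sketched conceptually. This is a simplified illustration of what a multi-turn tool-calling loop does, not the repo's actual API; all names here are stand-ins:

```python
# Conceptual sketch of the multi-turn agent loop the base env runs.
# `llm`, `tools`, and the message shapes are simplified assumptions.
def run_agent_loop(llm, tools, messages, max_turns):
    """Call the model, execute any tool calls, feed results back, repeat."""
    for turn in range(max_turns):
        reply = llm(messages, tools)            # one LLM call per turn
        messages.append(reply)
        if not reply.get("tool_calls"):
            return messages, turn + 1, True     # model stopped voluntarily
        for call in reply["tool_calls"]:
            output = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": str(output)})
    return messages, max_turns, False           # hit the turn cap
```

The tuple mirrors the `AgentResult` fields described below: the transcript, the number of turns used, and whether the model finished naturally.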
| File | Purpose |
|---|---|
| `environments/hermes_base_env.py` | Base class with agent loop + tool resolution |
| `environments/agent_loop.py` | `HermesAgentLoop` + `AgentResult` dataclass |
| `environments/tool_context.py` | `ToolContext` for reward verification |
| | Phase 2 tool call parsers (hermes, mistral, etc.) |
| `environments/your_env.py` | Your environment implementation |
IMPORTANT: Before running any test, evaluation, or data generation command, always ask the user how they want to handle inference. Do NOT assume OpenRouter or any specific endpoint. Present these options:
- **OpenRouter** — model names like `anthropic/claude-sonnet-4.5`, `google/gemini-2.5-pro`, `meta-llama/llama-3.3-70b-instruct`, etc. Requires `OPENROUTER_API_KEY` in the environment.
- **Self-hosted VLLM** — ask for the endpoint URL (e.g. `http://localhost:8000/v1`) and model name. Set `--openai.server_type vllm`.
- **Other OpenAI-compatible API** — set `--openai.server_type openai` and `--openai.health_check false`.
- **Local Atropos training server** — `serve` mode with a live training loop. Default `http://localhost:8000/v1`.

Once the user tells you their setup, use those values in all CLI commands for that session. Example prompt:
"Before I run this, how would you like to handle inference?
- OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5)
- A self-hosted VLLM endpoint (give me the URL and model name)
- Another OpenAI-compatible API (give me the URL, model, and any auth details)
- Local Atropos training server (serve mode)"
| Provider | base_url | server_type | API key |
|---|---|---|---|
| OpenRouter | `https://openrouter.ai/api/v1` | `openai` | `OPENROUTER_API_KEY` |
| VLLM (self-hosted) | user-provided | `vllm` (default) | (not needed) |
| Other OpenAI-compatible | user-provided | `openai` | As needed |
| Local Atropos | `http://localhost:8000/v1` (default) | `vllm` (default) | (not needed) |
**`setup()` — Load dataset and initialize state**

```python
async def setup(self) -> None:
    """Called once at startup. Load datasets, initialize state."""
    # Try HuggingFace first, fall back to built-in samples
    try:
        from datasets import load_dataset

        ds = load_dataset("your/dataset", split="test")
        self._items = [...]
    except Exception:
        self._items = BUILTIN_SAMPLES
    # Always split into train/eval
    random.shuffle(self._items)
    eval_size = max(20, int(len(self._items) * 0.1))
    self._eval_items = self._items[:eval_size]
    self._items = self._items[eval_size:]
```
**`get_next_item()` — Return next training item**

```python
async def get_next_item(self) -> dict:
    """Return next item, cycling through dataset."""
    item = self._items[self._index % len(self._items)]
    self._index += 1
    return item
```
**`format_prompt(item)` — Convert item to user message**

```python
def format_prompt(self, item: dict) -> str:
    """Convert a dataset item into the user-facing prompt."""
    return f"Research this question: {item['question']}"
```
**`compute_reward(item, result, ctx)` — Score the rollout**

CRITICAL: `result` is an `AgentResult`, NOT a dict. It has these attributes:

- `result.messages` — List of message dicts (OpenAI format)
- `result.turns_used` — Number of LLM calls made
- `result.finished_naturally` — True if the model stopped voluntarily
- `result.tool_errors` — List of `ToolError` objects

`AgentResult` does NOT have `final_response`, `tool_calls`, or `tools_used`. You must extract these from `result.messages`:
```python
async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
    # Extract final response (last assistant message with content)
    final_response = ""
    tools_used = []
    for msg in reversed(result.messages):
        if msg.get("role") == "assistant" and msg.get("content") and not final_response:
            final_response = msg["content"]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.append(name)
    # Score using LLM judge, heuristic, or ToolContext verification
    correctness = await self._llm_judge(item, final_response)
    return correctness
```
ctx (ToolContext) gives you terminal/file access to the agent's sandbox for verification:
```python
# Run tests in the agent's sandbox
result = ctx.terminal("pytest /workspace/test.py")
return 1.0 if result["exit_code"] == 0 else 0.0
```
**`evaluate()` — Periodic evaluation with full agent loop**

MUST use the full agent loop with tools, not single-turn `chat_completion`. The whole point of hermes-agent environments is agentic evaluation:
```python
async def evaluate(self, *args, **kwargs) -> None:
    import time, uuid

    from environments.agent_loop import HermesAgentLoop
    from environments.tool_context import ToolContext

    start_time = time.time()
    tools, valid_names = self._resolve_tools_for_group()
    samples = []
    for item in self._eval_items[:self.config.eval_size]:
        task_id = str(uuid.uuid4())
        messages = []
        if self.config.system_prompt:
            messages.append({"role": "system", "content": self.config.system_prompt})
        messages.append({"role": "user", "content": self.format_prompt(item)})
        agent = HermesAgentLoop(
            server=self.server,
            tool_schemas=tools,
            valid_tool_names=valid_names,
            max_turns=self.config.max_agent_turns,
            task_id=task_id,
            temperature=0.0,  # Deterministic for eval
            max_tokens=self.config.max_token_length,
            extra_body=self.config.extra_body,
        )
        result = await agent.run(messages)
        ctx = ToolContext(task_id)
        try:
            reward = await self.compute_reward(item, result, ctx)
        finally:
            ctx.cleanup()
        samples.append({"prompt": ..., "response": ..., "reward": reward})
    eval_metrics = {"eval/mean_reward": ...}
    await self.evaluate_log(
        metrics=eval_metrics,
        samples=samples,
        start_time=start_time,
        end_time=time.time(),
    )
```
**`wandb_log()` — Custom metrics logging**

Always call `super().wandb_log()` at the end:
```python
async def wandb_log(self, wandb_metrics=None):
    if wandb_metrics is None:
        wandb_metrics = {}
    if self._reward_buffer:
        n = len(self._reward_buffer)
        wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
        self._reward_buffer.clear()
    await super().wandb_log(wandb_metrics)  # MUST call super
```
Pitfall: `compute_reward` appends to metric buffers. During eval, this pollutes training metrics. Roll back buffer entries added during eval.
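One way to implement that rollback is a snapshot-and-restore context manager. This is a sketch under the assumption that the buffers are plain Python lists (as in the `_reward_buffer` example above); the class name is illustrative, not a repo API:

```python
class BufferRollback:
    """Snapshot metric buffers on entry, truncate back to that length on exit.

    Assumes list-based buffers (e.g. a hypothetical `self._reward_buffer`)
    that compute_reward appends to during both training and eval.
    """

    def __init__(self, *buffers):
        self._buffers = buffers
        self._lengths = []

    def __enter__(self):
        # Remember how long each buffer was before eval started
        self._lengths = [len(b) for b in self._buffers]
        return self

    def __exit__(self, exc_type, exc, tb):
        # Drop anything appended during eval, even if an exception occurred
        for buf, n in zip(self._buffers, self._lengths):
            del buf[n:]
        return False
```

Inside `evaluate()` you would wrap the scoring calls: `with BufferRollback(self._reward_buffer): reward = await self.compute_reward(...)`.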
Always create a custom config subclass with Pydantic Field descriptors. Key inherited fields you can tune:
enabled_toolsets, max_agent_turns, agent_temperature, system_prompt, terminal_backend, group_size, steps_per_eval, total_steps.
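A minimal sketch of such a subclass, using Pydantic `Field` descriptors. The base class here is a `BaseModel` stand-in because the repo's actual config class name isn't shown in this guide; in hermes-agent you would subclass the base env's config class instead, and the field names beyond the inherited ones listed above are illustrative:

```python
from pydantic import BaseModel, Field

# Stand-in for the inherited env config class; subclass the real one in the repo.
class MyEnvConfig(BaseModel):
    dataset_name: str = Field(
        "your/dataset", description="HuggingFace dataset loaded in setup()."
    )
    eval_size: int = Field(20, description="Items scored per evaluate() run.")
    max_agent_turns: int = Field(
        8, description="Cap on LLM calls per rollout (inherited-style field)."
    )
    agent_temperature: float = Field(
        0.7, description="Sampling temperature for training rollouts."
    )
```

`Field` defaults become the `config_init()` baseline, and each field is overridable from the CLI (e.g. `--env.eval_size 50`) per the config-priority order described below.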
`config_init()` — Classmethod returning `(YourEnvConfig, [APIServerConfig(...)])`. Set `server_type` to `"openai"` for OpenRouter/external APIs. Load the API key from an environment variable.
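A sketch of what that classmethod body might look like. `APIServerConfig` is defined here as a stand-in dataclass because only its usage, not its definition, appears in this guide; its field names are assumptions matching the `--openai.*` CLI flags used elsewhere:

```python
import os
from dataclasses import dataclass

# Stand-in for atroposlib's APIServerConfig (field names are assumptions
# mirroring the --openai.* CLI flags shown in this guide).
@dataclass
class APIServerConfig:
    base_url: str
    model_name: str
    server_type: str = "openai"
    api_key: str = ""
    health_check: bool = False

def config_init():
    """Sketch of a config_init body: default env config + one server config."""
    server = APIServerConfig(
        base_url="https://openrouter.ai/api/v1",
        model_name="anthropic/claude-sonnet-4.5",
        server_type="openai",  # external API, not a local VLLM server
        api_key=os.environ.get("OPENROUTER_API_KEY", ""),
        health_check=False,    # OpenRouter has no /health endpoint
    )
    env_config = {"group_size": 8, "steps_per_eval": 100, "total_steps": 1000}
    return env_config, [server]
```

In the real environment this would be a `@classmethod` returning an instance of your config subclass rather than a dict.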
```bash
# SERVE — Full training loop (connects to Atropos API server)
python environments/my_env.py serve --openai.base_url http://localhost:8000/v1

# PROCESS — Offline data generation (saves JSONL)
python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
  --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
  --openai.base_url "<USER_BASE_URL>" \
  --openai.model_name "<USER_MODEL>" \
  --openai.server_type <USER_SERVER_TYPE> --openai.health_check false

# EVALUATE — Standalone eval (runs setup + evaluate only)
python environments/my_env.py evaluate --env.eval_size 20 \
  --env.data_dir_to_save_evals /tmp/eval_results \
  --openai.base_url "<USER_BASE_URL>" \
  --openai.model_name "<USER_MODEL>" \
  --openai.server_type <USER_SERVER_TYPE> --openai.health_check false
```
Config priority: CLI args > YAML file > config_init() defaults.
AgentResult has .messages, not .final_response — Extract the final response by iterating reversed(result.messages) looking for the last assistant message with content.
evaluate() must use HermesAgentLoop, not chat_completion — Single-turn chat_completion has no tools. The whole point of hermes-agent benchmarks is agentic evaluation with tool use.
Don't call _llm_judge twice — If compute_reward already calls it, extract the score from the buffer instead of calling judge separately in evaluate().
Eval pollutes training buffers — compute_reward appends to metric buffers. During eval, roll back buffer entries to keep training metrics clean.
Always set health_check=false for OpenRouter — OpenRouter has no /health endpoint.
Set data_dir_to_save_evals in evaluate mode — Without it, results aren't saved.
default_toolsets class variable vs enabled_toolsets config — The class variable is a hint; the config field is what actually controls tool resolution.
Tool call parsing in messages — Tool calls are dicts with `{"function": {"name": ..., "arguments": ...}}`. Always check `isinstance(tc, dict)`.
ToolContext.cleanup() — Always call in a finally block to release sandbox resources.
server_type must be "openai" for external APIs — Without it, Atropos assumes a local VLLM server.
Always ask the user for their inference setup — Never hardcode or assume a specific provider/model. See the "Inference Setup" section above.
Use `self.server.chat_completion()` with a scoring prompt. Parse the JSON response for a score float. Always include a heuristic fallback (keyword overlap) for when the judge call fails.
Use `ctx.terminal("pytest test.py -q")` to run tests in the agent's sandbox. Return 1.0 for pass, 0.0 for fail.
Weight correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp to [0, 1].
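A minimal sketch of that weighting, assuming the efficiency term rewards finishing in fewer turns relative to `max_agent_turns` (the exact efficiency formula is an illustrative choice, not prescribed by this guide):

```python
def combine_reward(correctness: float, used_expected_tools: bool,
                   turns_used: int, max_turns: int, bonus: float = 0.0) -> float:
    """Weighted reward: 0.6 correctness + 0.2 tool usage + 0.2 efficiency + bonus."""
    tool_score = 1.0 if used_expected_tools else 0.0
    # One turn is maximally efficient; hitting the turn cap scores 0
    efficiency = 1.0 - (turns_used - 1) / max(max_turns - 1, 1)
    efficiency = min(max(efficiency, 0.0), 1.0)
    total = 0.6 * correctness + 0.2 * tool_score + 0.2 * efficiency + bonus
    return min(max(total, 0.0), 1.0)  # clamp to [0, 1]
```

Called from `compute_reward` with `result.turns_used` and the `tools_used` list extracted from `result.messages`.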
Smoke-test the import:

```bash
python -c "from environments.my_env import MyEnv; print('OK')"
```

Minimal environment skeleton:

```python
class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls): ...  # Default server + env config

    async def setup(self): ...  # Load dataset + train/eval split
    async def get_next_item(self): ...  # Cycle through training items
    def format_prompt(self, item): ...  # Item → user message string
    async def compute_reward(self, item, result, ctx): ...  # Score rollout
    async def evaluate(self, *args, **kwargs): ...  # Full agent loop eval
    async def wandb_log(self, metrics=None): ...  # Custom metrics + super()

if __name__ == "__main__":
    MyEnv.cli()
```