slime-rl-training
Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM inte
slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.
Choose slime when you need:
Consider alternatives when:
    ┌─────────────────────────────────────────────────────────┐
    │                       Data Buffer                       │
    │ - Prompt initialization and management                  │
    │ - Custom data generation and filtering                  │
    │ - Rollout sample storage                                │
    └─────────────┬───────────────────────────┬───────────────┘
                  │                           │
    ┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
    │ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
    │ - Actor model training  │ │ - Response generation       │
    │ - Critic (optional)     │ │ - Reward/verifier output    │
    │ - Weight sync to rollout│ │ - Multi-turn support        │
    └─────────────────────────┘ └─────────────────────────────┘
    # Recommended: Docker
    docker pull slimerl/slime:latest
    docker run --rm --gpus all --ipc=host --shm-size=16g \
      -it slimerl/slime:latest /bin/bash

    # Inside container
    cd /root/slime && pip install -e . --no-deps
    git clone https://github.com/THUDM/slime.git
    cd slime
    pip install -r requirements.txt
    pip install -e .
    # Source model configuration
    source scripts/models/qwen3-4B.sh

    # Launch training
    python train.py \
      --actor-num-nodes 1 \
      --actor-num-gpus-per-node 4 \
      --rollout-num-gpus 4 \
      --advantage-estimator grpo \
      --use-kl-loss --kl-loss-coef 0.001 \
      --rollout-batch-size 32 \
      --n-samples-per-prompt 8 \
      --global-batch-size 256 \
      --num-rollout 3000 \
      --prompt-data /path/to/data.jsonl \
      ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
Use this workflow for training reasoning models with group-relative advantages.
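To make "group-relative" concrete, here is a minimal sketch of the usual GRPO-style normalization: each response's reward is compared against the mean and standard deviation of the other responses sampled for the same prompt. The function name and the `eps` term are illustrative assumptions, and slime's own `grpo` estimator may differ in details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Illustrative only: not slime's implementation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 responses sampled from the same prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest get a negative one. A group where all rewards are equal contributes no learning signal.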
    # data.jsonl format
    {"prompt": "What is 2 + 2?", "label": "4"}
    {"prompt": "Solve: 3x = 12", "label": "x = 4"}
Or with chat format:
    {
      "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
      ],
      "label": "42"
    }
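A small sketch of how such a file could be produced and sanity-checked before training. The helper names `write_jsonl` and `validate_jsonl` are hypothetical and not part of slime. The key names mirror the `prompt` and `label` fields shown above.

```python
import json
import tempfile
from pathlib import Path

# The two plain-text rows from the example above
records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": "Solve: 3x = 12", "label": "x = 4"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def validate_jsonl(path, input_key="prompt", label_key="label"):
    """Return the number of rows; raise if a row lacks a required key."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            row = json.loads(line)
            for key in (input_key, label_key):
                if key not in row:
                    raise ValueError(f"line {line_no}: missing key {key!r}")
            count += 1
    return count


out_path = Path(tempfile.gettempdir()) / "data.jsonl"
write_jsonl(out_path, records)
print(validate_jsonl(out_path))
```

Chat-format rows can be written the same way. `json.dumps` keeps each record on a single line, which is what JSONL requires.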
Choose a pre-configured model script:
    # List available models
    ls scripts/models/
    # glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

    # Source your model
    source scripts/models/qwen3-4B.sh
    python train.py \
      --actor-num-nodes 1 \
      --actor-num-gpus-per-node 8 \
      --rollout-num-gpus 8 \
      --advantage-estimator grpo \
      --use-kl-loss \
      --kl-loss-coef 0.001 \
      --prompt-data /path/to/train.jsonl \
      --input-key prompt \
      --label-key label \
      --apply-chat-template \
      --rollout-batch-size 32 \
      --n-samples-per-prompt 8 \
      --global-batch-size 256 \
      --num-rollout 3000 \
      --save-interval 100 \
      --eval-interval 50 \
      ${MODEL_ARGS[@]}
    tensorboard --logdir outputs/

Use async mode for higher throughput by overlapping rollout and training.
    python train_async.py \
      --actor-num-nodes 1 \
      --actor-num-gpus-per-node 8 \
      --rollout-num-gpus 8 \
      --advantage-estimator grpo \
      --async-buffer-size 4 \
      --prompt-data /path/to/train.jsonl \
      ${MODEL_ARGS[@]}
    --async-buffer-size 4        # Number of rollouts to buffer
    --update-weights-interval 2  # Sync weights every N rollouts
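The overlap that async mode provides can be pictured as a bounded producer/consumer queue. The toy below is a conceptual sketch only: `rollout_worker`, `trainer`, and the sleep calls are stand-ins invented for illustration, and slime's real scheduling is not shown here.

```python
import asyncio


async def rollout_worker(queue, num_rollouts):
    """Stand-in for rollout generation: push batches into the buffer."""
    for rollout_id in range(num_rollouts):
        await asyncio.sleep(0.01)    # pretend to generate responses
        await queue.put(rollout_id)  # waits while the buffer is full
    await queue.put(None)            # sentinel: no more rollouts


async def trainer(queue, trained):
    """Stand-in for training: consume batches as they become available."""
    while True:
        rollout_id = await queue.get()
        if rollout_id is None:
            break
        await asyncio.sleep(0.01)    # pretend to run a training step
        trained.append(rollout_id)


async def main(buffer_size=4, num_rollouts=8):
    # maxsize plays the role of the buffer size in this toy model
    queue = asyncio.Queue(maxsize=buffer_size)
    trained = []
    await asyncio.gather(rollout_worker(queue, num_rollouts), trainer(queue, trained))
    return trained


trained_ids = asyncio.run(main())
print(trained_ids)
```

A larger buffer lets generation run further ahead of training. The cost is that buffered samples come from older policy weights, which is why the weight-sync interval matters.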
Use this workflow for training agents with tool use or multi-step reasoning.
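    # custom_generate.py
    # generate_single, extract_tool_call, execute_tool, and compute_reward
    # are placeholders that are not defined in this snippet.
    async def custom_generate(args, samples, evaluation=False):
        """Multi-turn generation with tool calling."""
        for sample in samples:
            # Assumes a chat-format prompt (a list of messages); copy it so
            # the stored prompt is not mutated across turns
            conversation = list(sample.prompt)
            response = ""  # stays defined even if max_turns is 0
            for turn in range(args.max_turns):
                # Generate response
                response = await generate_single(conversation)

                # Check for tool call; stop when the model does not request one
                tool_call = extract_tool_call(response)
                if not tool_call:
                    break
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})

            sample.response = response
            sample.reward = compute_reward(sample)
        return samples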
    python train.py \
      --custom-generate-function-path custom_generate.py \
      --max-turns 5 \
      --prompt-data /path/to/agent_data.jsonl \
      ${MODEL_ARGS[@]}
See `examples/search-r1/` for a complete multi-turn search example.
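The generation function above calls `extract_tool_call` and `execute_tool` without defining them. One possible shape for these helpers is sketched below. It assumes the model emits tool calls as JSON wrapped in `<tool_call>...</tool_call>` tags, a format chosen purely for illustration. The `tools` registry argument is also an addition not present in the original call.

```python
import json
import re

# Assumed tag format; adjust to whatever your model actually emits
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_call(response):
    """Return the first tool call as a dict, or None if absent or malformed."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


def execute_tool(tool_call, tools):
    """Dispatch to a registered tool and return its result as a string."""
    func = tools.get(tool_call["name"])
    if func is None:
        return f"error: unknown tool {tool_call['name']!r}"
    return str(func(**tool_call.get("arguments", {})))


# Hypothetical tool registry and model reply
tools = {"add": lambda a, b: a + b}
reply = 'Let me compute. <tool_call>{"name": "add", "arguments": {"a": 15, "b": 27}}</tool_call>'
call = extract_tool_call(reply)
print(execute_tool(call, tools))
```

Returning `None` for malformed calls ends the turn loop cleanly instead of raising in the middle of a rollout.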
slime uses three types of arguments:
1. Megatron Arguments (passed directly):
    --tensor-model-parallel-size 2
    --pipeline-model-parallel-size 1
    --num-layers 32
    --hidden-size 4096
2. SGLang Arguments (prefixed with `--sglang-`):
    --sglang-mem-fraction-static 0.8
    --sglang-context-length 8192
    --sglang-log-level INFO
3. slime Arguments:
    # Resource allocation
    --actor-num-nodes 1
    --actor-num-gpus-per-node 8
    --rollout-num-gpus 8
    --colocate  # Share GPUs between training/inference

    # Data
    --prompt-data /path/to/data.jsonl
    --input-key prompt
    --label-key label

    # Training loop
    --num-rollout 3000
    --rollout-batch-size 32
    --n-samples-per-prompt 8
    --global-batch-size 256

    # Algorithm
    --advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
    --use-kl-loss
    --kl-loss-coef 0.001
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
Example: 32 × 8 = 256 × 1
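The relation above can be checked before launching a run. This is a minimal sketch with a hypothetical helper name (`steps_per_rollout`). It only restates the arithmetic and does not call into slime.

```python
def steps_per_rollout(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return num_steps_per_rollout, or raise if the sizes do not divide evenly."""
    total_samples = rollout_batch_size * n_samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError(
            f"{total_samples} samples per rollout is not a multiple of "
            f"global_batch_size={global_batch_size}"
        )
    return total_samples // global_batch_size


print(steps_per_rollout(32, 8, 256))  # 1
```

Doubling `rollout_batch_size` to 64 with the other values unchanged would give two optimizer steps per rollout.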
slime's data buffer enables flexible data management:
    class RolloutDataSource:
        def get_samples(self, num_samples):
            """Fetch prompts from dataset."""
            return self.dataset.sample(num_samples)

        def add_samples(self, samples):
            """Called after generation (no-op by default)."""
            pass
    class RolloutDataSourceWithBuffer(RolloutDataSource):
        def __init__(self):
            super().__init__()
            self.buffer = []

        def add_samples(self, samples):
            """Store generated samples for reuse."""
            self.buffer.extend(samples)

        def buffer_filter(self, args, buffer, num_samples):
            """Custom selection logic (prioritized, stratified, etc.)."""
            # select_best is a placeholder for your own selection function
            return select_best(buffer, num_samples)
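As one illustration of "custom selection logic", the sketch below picks the highest-reward samples and removes them from the buffer. The `Sample` dataclass is a hypothetical stand-in for slime's sample type. Removing selected items from the buffer is an assumption about the desired behaviour, not something the snippet above specifies.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Sample:
    """Hypothetical stand-in for a rollout sample."""
    prompt: str
    response: str
    reward: float


def buffer_filter(args, buffer, num_samples):
    """Pop the num_samples highest-reward samples out of the buffer."""
    ranked = sorted(buffer, key=lambda s: s.reward, reverse=True)
    selected = ranked[:num_samples]
    selected_ids = {id(s) for s in selected}
    # Mutate in place so the caller's buffer no longer holds the selected items
    buffer[:] = [s for s in buffer if id(s) not in selected_ids]
    return selected


buffer = [
    Sample("q1", "a1", 0.1),
    Sample("q2", "a2", 0.9),
    Sample("q3", "a3", 0.5),
]
picked = buffer_filter(SimpleNamespace(), buffer, num_samples=2)
print([s.reward for s in picked], len(buffer))
```

Sorting by reward is only one policy. The same signature could implement recency-based or stratified selection.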
Symptoms: Inference engine dies mid-training
Solutions:
    # Enable fault tolerance
    --use-fault-tolerance

    # Increase memory allocation
    --sglang-mem-fraction-static 0.85

    # Reduce batch size
    --rollout-batch-size 16
Symptoms: Training hangs after rollout
Solutions:
    # Increase sync interval
    --update-weights-interval 5

    # Use colocated mode (no network transfer)
    --colocate
Symptoms: CUDA OOM in backward pass
Solutions:
    # Enable gradient checkpointing
    --recompute-activations

    # Reduce micro-batch size
    --micro-batch-size 1

    # Enable sequence parallelism
    --sequence-parallel
Symptoms: GPU idle during data fetch
Solutions:
    # Increase data workers
    --num-data-workers 4

    # Use streaming dataset
    --streaming-data
| Model Family | Configurations |
|---|---|
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |
Each model has pre-configured scripts in `scripts/models/`.
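Run training and inference on the same GPUs to reduce the number of GPUs required. Because both workloads then compete for memory on each device, lower the SGLang static memory fraction: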
    python train.py \
      --colocate \
      --actor-num-gpus-per-node 8 \
      --sglang-mem-fraction-static 0.4 \
      ${MODEL_ARGS[@]}
    # custom_rm.py
    # load_model and self.tokenize are placeholders not defined in this snippet.
    class CustomRewardModel:
        def __init__(self, model_path):
            self.model = load_model(model_path)

        def compute_reward(self, prompts, responses):
            inputs = self.tokenize(prompts, responses)
            scores = self.model(inputs)
            return scores.tolist()
--custom-rm-path custom_rm.py
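For tasks with a checkable answer, a learned reward model is not always necessary. The sketch below scores a response by comparing the last number it contains with the last number in the label. The extra `labels` argument and the "last number" heuristic are assumptions made for illustration, not slime's reward interface.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None."""
    matches = NUMBER_RE.findall(text)
    return float(matches[-1]) if matches else None


def compute_reward(prompts, responses, labels):
    """Return 1.0 when a response's final number matches its label, else 0.0."""
    rewards = []
    for response, label in zip(responses, labels):
        predicted = last_number(response)
        expected = last_number(label)
        correct = predicted is not None and predicted == expected
        rewards.append(1.0 if correct else 0.0)
    return rewards


rewards = compute_reward(
    prompts=["What is 15 + 27?", "Solve: 3x = 12"],
    responses=["15 + 27 = 42", "The answer is x = 5"],
    labels=["42", "x = 4"],
)
print(rewards)
```

A verifier like this is cheap and deterministic, but it rewards any response that happens to end on the right number. Real setups usually require a stricter answer format.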
    --eval-prompt-data aime /path/to/aime.jsonl \
    --eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
    --n-samples-per-eval-prompt 16
See the `examples/` directory for 14+ worked examples.

License: MIT
    mkdir -p ~/.hermes/skills/mlops/slime && curl -o ~/.hermes/skills/mlops/slime/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/slime/SKILL.md