simpo-training
Simple Preference Optimization (SimPO) for LLM alignment: a reference-free alternative to DPO with better performance (+6.4 points on AlpacaEval 2.0). No reference model is needed, making it more memory- and compute-efficient than DPO.
SimPO is a reference-free preference optimization method: it removes DPO's reference model from training entirely while still outperforming it on preference-alignment benchmarks.
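At its core, SimPO scores each response by its average per-token log-probability under the policy, scales the chosen-vs-rejected difference by `beta`, and requires the chosen response to win by a target margin (expressed as `gamma_beta_ratio`, i.e. gamma/beta, in the configs below). A minimal PyTorch sketch of the objective, not the exact trainer code:

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_avg_logps: torch.Tensor,
               rejected_avg_logps: torch.Tensor,
               beta: float = 2.0,
               gamma_beta_ratio: float = 0.5,
               loss_type: str = "sigmoid") -> torch.Tensor:
    """Sketch of the SimPO objective. No reference model is involved.

    Inputs are average per-token log-probabilities of the chosen and
    rejected responses under the policy being trained.
    """
    # Implicit reward difference, minus the target margin gamma/beta
    logits = beta * (chosen_avg_logps - rejected_avg_logps - gamma_beta_ratio)
    if loss_type == "sigmoid":
        return -F.logsigmoid(logits).mean()
    if loss_type == "hinge":
        return torch.relu(1.0 - logits).mean()
    raise ValueError(f"unknown loss_type: {loss_type}")
```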
Installation:
```bash
# Create environment
conda create -n simpo python=3.10 && conda activate simpo

# Install PyTorch 2.2.2
# Visit: https://pytorch.org/get-started/locally/

# Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .

# Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation
```
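After installing, a quick sanity check (a sketch; it only verifies the imports and versions) confirms that PyTorch sees CUDA and Flash Attention 2 actually built:

```python
import torch
import flash_attn  # fails here if flash-attn did not build

print(torch.__version__)          # expect 2.2.2
print(torch.cuda.is_available())  # must be True for training
print(flash_attn.__version__)
```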
Training (Mistral 7B):
```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py \
  training_configs/mistral-7b-base-simpo.yaml
```
Config (`mistral-7b-base-simpo.yaml`):
```yaml
# Model
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16

# Dataset
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs

# SimPO hyperparameters
beta: 2.0               # Reward scaling (2.0-10.0)
gamma_beta_ratio: 0.5   # Target margin (0-1)
loss_type: sigmoid      # sigmoid or hinge
sft_weight: 0.0         # Optional SFT regularization

# Training
learning_rate: 5e-7     # Critical: 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8

# Output
output_dir: ./outputs/mistral-7b-simpo
```
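For sizing runs, the effective batch size is the per-device batch size times gradient accumulation steps times the number of GPUs. A quick check for the config above (the GPU count is an assumption; it depends on your machine and accelerate config):

```python
# Effective batch size for the Mistral config above
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 8  # assumption

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 64
```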
Launch training:
```bash
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
```
Config (`llama3-8b-instruct-simpo.yaml`):
```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
  argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1   # Add SFT loss to preserve capabilities
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo
```
Launch:
```bash
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml
```
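After training, `output_dir` holds a standard Hugging Face checkpoint. A minimal inference sketch (the path comes from the config above; the prompt and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./outputs/llama3-8b-simpo"  # output_dir from the config above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain SimPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```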
For math/code tasks:
```yaml
model_name_or_path: deepseek-ai/deepseek-math-7b-base
dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0
beta: 5.0               # Higher for stronger signal
gamma_beta_ratio: 0.7   # Larger margin
learning_rate: 3e-7     # Lower LR for reasoning
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```
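All of these configs consume binarized preference data, i.e. paired chosen/rejected responses per prompt. A quick way to inspect a dataset before training (field names vary by dataset, so treat the printed keys as a sketch; see references/datasets.md for formats):

```python
from datasets import load_dataset

# Split names match dataset_splits in the Mistral config above
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
example = ds[0]
print(example.keys())      # expect fields like "prompt", "chosen", "rejected"
print(example["chosen"])   # preferred response (here, a list of chat messages)
print(example["rejected"]) # dispreferred response
```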
Algorithm selection:

Use SimPO when:
- You want preference optimization without holding a reference model in GPU memory (lower cost than DPO for the same policy size).
- You are training on binarized preference pairs (chosen/rejected), as in the configs above.

Use alternatives instead:
- DPO (or other reference-based methods) if you want an explicit anchor to a reference policy; SimPO has no such anchor and relies on `sft_weight` and conservative learning rates to limit drift (see troubleshooting below).
Issue: Loss divergence

Reduce learning rate:

```yaml
learning_rate: 3e-7   # Reduce from 5e-7
```

Reduce beta:

```yaml
beta: 1.0   # Reduce from 2.0
```
Issue: Model forgets capabilities

Add SFT regularization:

```yaml
sft_weight: 0.1   # Add SFT loss component
```
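Conceptually, a nonzero `sft_weight` adds a standard next-token (NLL) loss on the chosen responses to the preference loss, anchoring the model to supervised behavior. A sketch, reusing the `simpo_loss` function from the earlier snippet (`chosen_nll` is a hypothetical precomputed cross-entropy over the chosen responses):

```python
def total_loss(chosen_avg_logps, rejected_avg_logps, chosen_nll,
               sft_weight=0.1):
    """Sketch: preference loss plus an NLL anchor on chosen responses."""
    pref = simpo_loss(chosen_avg_logps, rejected_avg_logps,
                      beta=2.0, gamma_beta_ratio=0.5)
    return pref + sft_weight * chosen_nll
```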
Issue: Poor preference separation

Increase beta and margin:

```yaml
beta: 5.0               # Increase from 2.0
gamma_beta_ratio: 0.8   # Increase from 0.5
```
Issue: OOM during training

Reduce batch size:

```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 16   # Maintain effective batch
```

Enable gradient checkpointing:

```yaml
gradient_checkpointing: true
```
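If you script training directly instead of going through the YAML config, the same switch is available on the model object via the standard `transformers` API:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model.gradient_checkpointing_enable()  # trade compute for activation memory
model.config.use_cache = False         # KV cache is incompatible with checkpointing
```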
Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.
Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.
Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.
Memory optimization: See the OOM fixes above (smaller per-device batch size with higher gradient accumulation, plus gradient checkpointing).
License: MIT
Install this skill:

```bash
mkdir -p ~/.hermes/skills/mlops/simpo && curl -o ~/.hermes/skills/mlops/simpo/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/simpo/SKILL.md
```