lambda-labs-gpu-cloud
Reserved and on-demand GPU cloud instances for ML training and inference. Use when you need dedicated GPU instances with simple SSH access, persistent filesystems, or high-performance multi-node clust
Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.
Use Lambda Labs when:
Key features:
Use alternatives instead:

    # Get the instance IP from the console
    ssh ubuntu@<INSTANCE-IP>

    # Or with a specific key
    ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>

| GPU | VRAM | Price/GPU/hr | Best For |
|---|---|---|---|
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |

    8x GPU: Best for distributed training (DDP, FSDP)
    4x GPU: Large models, multi-GPU training
    2x GPU: Medium workloads
    1x GPU: Fine-tuning, inference, development

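The per-GPU prices and instance sizes above combine into a simple cost estimate. The following is a minimal sketch, not an official calculator: the price dictionary is copied from the table above (using the upper bound where the table gives a range), and `estimate_cost` is a hypothetical helper name. Actual billing may differ, so check current pricing before relying on the numbers.

```python
# Hourly per-GPU prices in USD, copied from the table above.
# Where the table gives a range (H100 SXM), the upper bound is used.
PRICE_PER_GPU_HOUR = {
    "B200 SXM6": 4.99,
    "H100 SXM": 3.29,
    "H100 PCIe": 2.49,
    "GH200": 1.49,
    "A100 80GB": 1.79,
    "A100 40GB": 1.29,
    "A10": 0.75,
    "A6000": 0.80,
    "V100": 0.55,
}


def estimate_cost(gpu: str, num_gpus: int, hours: float) -> float:
    """Return the estimated on-demand cost in USD, rounded to cents."""
    if num_gpus <= 0 or hours < 0:
        raise ValueError("num_gpus must be positive and hours non-negative")
    return round(PRICE_PER_GPU_HOUR[gpu] * num_gpus * hours, 2)


if __name__ == "__main__":
    # Example: an 8x A100 80GB instance for a 24-hour training run
    print(estimate_cost("A100 80GB", 8, 24))
```

Comparing the result for an 8x instance against a longer run on a 4x instance is a quick way to see whether the larger instance is worth it for a given job.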
All instances come with Lambda Stack pre-installed:

    # Included software
    - Ubuntu 22.04 LTS
    - NVIDIA drivers (latest)
    - CUDA 12.x
    - cuDNN 8.x
    - NCCL (for multi-GPU)
    - PyTorch (latest)
    - TensorFlow (latest)
    - JAX
    - JupyterLab

    # Check GPU
    nvidia-smi

    # Check PyTorch
    python -c "import torch; print(torch.cuda.is_available())"

    # Check CUDA version
    nvcc --version

pip install lambda-cloud-client

    import os
    import lambda_cloud_client

    # Configure with API key
    configuration = lambda_cloud_client.Configuration(
        host="https://cloud.lambdalabs.com/api/v1",
        access_token=os.environ["LAMBDA_API_KEY"]
    )

    with lambda_cloud_client.ApiClient(configuration) as api_client:
        api = lambda_cloud_client.DefaultApi(api_client)

        # Get available instance types
        types = api.instance_types()
        for name, info in types.data.items():
            print(f"{name}: {info.instance_type.description}")


    from lambda_cloud_client.models import LaunchInstanceRequest

    request = LaunchInstanceRequest(
        region_name="us-west-1",
        instance_type_name="gpu_1x_h100_sxm5",
        ssh_key_names=["my-ssh-key"],
        file_system_names=["my-filesystem"],  # Optional
        name="training-job"
    )
    response = api.launch_instance(request)
    instance_id = response.data.instance_ids[0]
    print(f"Launched: {instance_id}")


    instances = api.list_instances()
    for instance in instances.data:
        print(f"{instance.name}: {instance.ip} ({instance.status})")

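A freshly launched instance is not usable until it finishes initializing, so scripts usually poll the status before connecting. The sketch below is a generic polling loop under stated assumptions: `get_status` is a hypothetical callable you supply (for example, one that looks up `instance.status` from the listing above), and the status string `"active"` is an assumption about what a ready instance reports. The 15-minute default timeout mirrors the upper end of the initialization window given in the troubleshooting table.

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "active" or the timeout elapses.

    Returns True if the instance became active, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "active":  # assumed ready-state string
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; in normal use the default `time.sleep` is fine.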

    from lambda_cloud_client.models import TerminateInstanceRequest

    request = TerminateInstanceRequest(
        instance_ids=[instance_id]
    )
    api.terminate_instance(request)

    from lambda_cloud_client.models import AddSshKeyRequest

    # Add SSH key
    request = AddSshKeyRequest(
        name="my-key",
        public_key="ssh-rsa AAAA..."
    )
    api.add_ssh_key(request)

    # List keys
    keys = api.list_ssh_keys()

    # Delete key (key_id is the ID of one of the keys listed above)
    api.delete_ssh_key(key_id)


    curl -u $LAMBDA_API_KEY: \
      https://cloud.lambdalabs.com/api/v1/instance-types | jq

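The JSON returned by this endpoint can be filtered down to the instance types that currently have capacity. The sketch below assumes each entry under `data` carries a `regions_with_capacity_available` list of objects with a `name` field; treat that field name as an assumption about the response shape and confirm it against the live `jq` output. The `sample` dictionary is made up for illustration.

```python
def types_with_capacity(instance_types: dict) -> dict:
    """Map instance type name -> list of region names that report capacity."""
    available = {}
    for name, info in instance_types.items():
        # Assumed field name; missing or empty lists mean no capacity.
        regions = [r["name"] for r in info.get("regions_with_capacity_available", [])]
        if regions:
            available[name] = regions
    return available


# Hypothetical sample shaped like the "data" field of the JSON response.
sample = {
    "gpu_1x_a10": {"regions_with_capacity_available": [{"name": "us-west-1"}]},
    "gpu_8x_h100_sxm5": {"regions_with_capacity_available": []},
}
print(types_with_capacity(sample))
```

Running this kind of filter before a launch request avoids submitting a request for a type or region that has nothing available.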

    curl -u $LAMBDA_API_KEY: \
      -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
      -H "Content-Type: application/json" \
      -d '{
        "region_name": "us-west-1",
        "instance_type_name": "gpu_1x_h100_sxm5",
        "ssh_key_names": ["my-key"]
      }' | jq

    curl -u $LAMBDA_API_KEY: \
      -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
      -H "Content-Type: application/json" \
      -d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq

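For scripts that avoid both the client package and `curl`, the same launch call can be assembled with the Python standard library. This is a sketch that only builds the request and does not send it: `build_launch_request` is a hypothetical helper, `"example-key"` is a placeholder, and the Basic auth header reproduces what `curl -u "$KEY:"` does (the key as the username with an empty password).

```python
import base64
import json
import urllib.request

API_BASE = "https://cloud.lambdalabs.com/api/v1"


def build_launch_request(api_key, region_name, instance_type_name, ssh_key_names):
    """Build (but do not send) the POST request that the curl launch command issues."""
    payload = {
        "region_name": region_name,
        "instance_type_name": instance_type_name,
        "ssh_key_names": list(ssh_key_names),
    }
    # Equivalent of curl -u "<key>:": HTTP Basic auth with an empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        f"{API_BASE}/instance-operations/launch",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_launch_request("example-key", "us-west-1", "gpu_1x_h100_sxm5", ["my-key"])
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would actually launch a billable instance, which is why the sketch stops short of sending it.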
Filesystems persist data across instance restarts:

    # Mount location
    /lambda/nfs/<FILESYSTEM_NAME>

    # Example: save checkpoints
    python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints

Filesystems must be attached at instance launch time:
Set `file_system_names` in the launch request.

    # Store on filesystem (persists)
    /lambda/nfs/storage/
    ├── datasets/
    ├── checkpoints/
    ├── models/
    └── outputs/

    # Local SSD (faster, ephemeral)
    /home/ubuntu/
    └── working/  # Temporary files

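The persistent layout above can be created once at the start of a job. The following is a minimal sketch: `ensure_layout` is a hypothetical helper, and the subdirectory names are the ones shown in the tree. It refuses to create the root itself, since a missing mount point usually means the filesystem was not attached at launch and anything written there would land on ephemeral disk.

```python
from pathlib import Path

SUBDIRS = ("datasets", "checkpoints", "models", "outputs")


def ensure_layout(root) -> dict:
    """Create the standard subdirectories under root and return their paths."""
    root = Path(root)
    if not root.is_dir():
        # Do not create the mount point: its absence signals a missing filesystem.
        raise FileNotFoundError(f"{root} does not exist; is the filesystem attached?")
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(exist_ok=True)
    return paths
```

On an instance, this would be called as `ensure_layout("/lambda/nfs/storage")`, with `storage` replaced by your filesystem name.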

    # Generate key locally
    ssh-keygen -t ed25519 -f ~/.ssh/lambda_key

    # Add public key to Lambda console
    # Or via API

    # On instance, add more keys
    echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys

    # On instance
    ssh-import-id gh:username


    # Forward Jupyter
    ssh -L 8888:localhost:8888 ubuntu@<IP>

    # Forward TensorBoard
    ssh -L 6006:localhost:6006 ubuntu@<IP>

    # Multiple ports
    ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>

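When the set of forwarded ports changes between projects, the `ssh -L` command can be generated rather than typed. This sketch only builds the argument list; `ssh_tunnel_command` is a hypothetical helper and `203.0.113.10` is a documentation-range placeholder address, not a real instance.

```python
import shlex


def ssh_tunnel_command(host, ports, user="ubuntu", key_path=None):
    """Return the ssh argv that forwards each local port to the same remote port."""
    argv = ["ssh"]
    if key_path:
        argv += ["-i", key_path]
    for port in ports:
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(f"{user}@{host}")
    return argv


cmd = ssh_tunnel_command("203.0.113.10", [8888, 6006])
print(shlex.join(cmd))
# ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@203.0.113.10
```

The returned list can be passed directly to `subprocess.run(cmd)` to open the tunnel.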

    # On instance
    jupyter lab --ip=0.0.0.0 --port=8888

    # From local machine with tunnel
    ssh -L 8888:localhost:8888 ubuntu@<IP>
    # Open http://localhost:8888

    # SSH to instance
    ssh ubuntu@<IP>

    # Clone repo
    git clone https://github.com/user/project
    cd project

    # Install dependencies
    pip install -r requirements.txt

    # Train
    python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints


    # train_ddp.py
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        dist.init_process_group("nccl")
        # torchrun sets LOCAL_RANK to this process's GPU index on the node
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # MyModel is a placeholder for your own nn.Module
        model = MyModel().to(local_rank)
        model = DDP(model, device_ids=[local_rank])

        # Training loop...

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

    # Launch with torchrun (8 GPUs)
    torchrun --nproc_per_node=8 train_ddp.py


    import os

    import torch

    checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Save checkpoint (epoch, model, optimizer, and loss come from your training loop)
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f"{checkpoint_dir}/checkpoint_{epoch}.pt")

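Saving checkpoints to the persistent filesystem pays off when a job restarts and needs to find the most recent one. The sketch below assumes the `checkpoint_{epoch}.pt` naming used above; `latest_checkpoint` is a hypothetical helper. It compares epoch numbers numerically, because a plain string sort would rank `checkpoint_9.pt` after `checkpoint_10.pt`.

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"checkpoint_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return the Path of the checkpoint with the highest epoch, or None if none exist."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).glob("checkpoint_*.pt"):
        match = CHECKPOINT_RE.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

In a training script, a non-`None` result would typically be loaded with `torch.load(path, map_location="cpu")` and its `model_state_dict` and `optimizer_state_dict` entries restored before resuming from the saved `epoch`.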
High-performance Slurm clusters with:
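
    # On Slurm cluster: start one torchrun launcher per node; each spawns 8 workers.
    # MASTER_ADDR must hold the hostname of the first node in the allocation.
    srun --nodes=4 --ntasks-per-node=1 --gpus-per-node=8 \
      torchrun --nnodes=4 --nproc_per_node=8 \
      --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
      train.py
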
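The node and process counts in the command above determine the world size and rank numbering the training script sees. This small sketch shows the usual arithmetic, assuming every node runs the same number of workers; the function names are illustrative, and with elastic rendezvous the node ordering itself is assigned at startup rather than fixed in advance.

```python
def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of training processes across the cluster."""
    return num_nodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a worker, given its node index and its GPU index on that node."""
    return node_rank * gpus_per_node + local_rank


# 4 nodes with 8 GPUs each, as in the srun example
print(world_size(4, 8))        # 32
print(global_rank(2, 3, 8))    # 19
```

This is useful when sizing the global batch: with a per-GPU batch of 16, the setup above processes 16 × 32 = 512 samples per step.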

    # Find private IP
    ip addr show | grep 'inet '


    # 1. Launch 8x H100 instance with filesystem

    # 2. SSH and setup
    ssh ubuntu@<IP>
    pip install transformers accelerate peft

    # 3. Download model to filesystem
    python -c "
    from transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
    model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
    "

    # 4. Fine-tune with checkpoints on filesystem
    accelerate launch --num_processes 8 train.py \
      --model_path /lambda/nfs/storage/models/llama-2-7b \
      --output_dir /lambda/nfs/storage/outputs \
      --checkpoint_dir /lambda/nfs/storage/checkpoints

    # 1. Launch A10 instance (cost-effective for inference)

    # 2. Run inference
    python inference.py \
      --model /lambda/nfs/storage/models/fine-tuned \
      --input /lambda/nfs/storage/data/inputs.jsonl \
      --output /lambda/nfs/storage/data/outputs.jsonl

| Task | Recommended GPU |
|---|---|
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |
| Issue | Solution |
|---|---|
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |
MIT

    mkdir -p ~/.hermes/skills/mlops/lambda-labs && curl -o ~/.hermes/skills/mlops/lambda-labs/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/lambda-labs/SKILL.md
