modal-serverless-gpu
Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, deploying ML models as APIs, or running batch jobs with automatic scaling.
Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.
Use Modal when:
- You need on-demand GPU access without managing infrastructure
- You are deploying ML models as APIs
- You are running batch jobs that need automatic scaling

Key features:
- Serverless GPU functions, from T4 up to B200, with automatic fallbacks
- Container images defined in Python (`modal.Image`)
- Persistent volumes and secure secrets
- Web endpoints, cron schedules, dynamic batching, and parallel `.map` fan-out
Use alternatives instead:
```bash
pip install modal
modal setup  # Opens browser for authentication
```
```python
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())
```
Run:
```bash
modal run hello_gpu.py
```
```python
import modal

app = modal.App("text-generation")

image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
```
| Component | Purpose |
|---|---|
| `modal.App` | Container for functions and resources |
| `@app.function` | Serverless function with compute specs |
| `@app.cls` | Class-based functions with lifecycle hooks |
| `modal.Image` | Container image definition |
| `modal.Volume` | Persistent storage for models/data |
| `modal.Secret` | Secure credential storage |
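All six components appear in the examples that follow; as a minimal sketch of how they compose (the app name and function body are illustrative, and the secret name reuses the `huggingface` example from the Secrets section below):

```python
import modal

app = modal.App("composed-example")                                  # App: holds everything
image = modal.Image.debian_slim().pip_install("torch")               # Image: environment
volume = modal.Volume.from_name("weights", create_if_missing=True)   # Volume: persistent storage

@app.function(                                                       # Function: serverless compute
    gpu="T4",
    image=image,
    volumes={"/weights": volume},
    secrets=[modal.Secret.from_name("huggingface")],                 # Secret: env-var injection
)
def show_environment():
    import os
    print(sorted(os.listdir("/weights")), "HF_TOKEN" in os.environ)
```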
| Command | Description |
|---|---|
| `modal run` | Execute and exit |
| `modal serve` | Development with live reload |
| `modal deploy` | Persistent cloud deployment |
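All three take the same entrypoint file, e.g. the quickstart script above:

```bash
modal run hello_gpu.py     # execute main() once, then tear down
modal serve hello_gpu.py   # dev mode: redeploys on every file save
modal deploy hello_gpu.py  # persistent deployment that outlives the CLI session
```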
| GPU | VRAM | Best For |
|---|---|---|
| T4 | 16GB | Budget inference, small models |
| L4 | 24GB | Inference, Ada Lovelace arch |
| A10G | 24GB | Training/inference, 3.3x faster than T4 |
| L40S | 48GB | Recommended for inference (best cost/perf) |
| A100 | 40GB | Large model training |
| A100-80GB | 80GB | Very large models |
| H100 | 80GB | Fastest, FP8 + Transformer Engine |
| H200 | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
| B200 | 192GB | Latest, Blackwell architecture |
```python
# Single GPU
@app.function(gpu="A100")

# Specific memory variant
@app.function(gpu="A100-80GB")

# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")

# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])

# Any available GPU
@app.function(gpu="any")
```
```python
# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)

# From CUDA base
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04", add_python="3.11"
).pip_install("torch", "transformers")

# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
```
```python
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)
```
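Commits made in one container are not automatically visible in containers that already have the volume mounted; they must reload it first. A short sketch (the function name is illustrative):

```python
@app.function(volumes={"/models": volume})
def list_models():
    import os
    volume.reload()  # Fetch commits made by other containers
    return os.listdir("/models")
```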
```python
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}
```
```python
from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
```
| Decorator | Use Case |
|---|---|
| `@modal.fastapi_endpoint` | Simple function → API |
| `@modal.asgi_app` | Full FastAPI/Starlette apps |
| `@modal.wsgi_app` | Django/Flask apps |
| `@modal.web_server` | Arbitrary HTTP servers |
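For the last row, `@modal.web_server` exposes a port that your own process listens on; a minimal sketch assuming Python's built-in static file server:

```python
@app.function()
@modal.web_server(port=8000)
def static_site():
    import subprocess
    # Start any HTTP server bound to the declared port; Modal proxies traffic to it
    subprocess.Popen(["python", "-m", "http.server", "8000"])
```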
```python
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs automatically batched
    return model.batch_predict(inputs)
```
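Callers submit single inputs and Modal groups them server-side; a usage sketch assuming the `batch_predict` function above:

```python
@app.local_entrypoint()
async def main():
    import asyncio
    prompts = [f"prompt {i}" for i in range(100)]
    # Each call submits one input; Modal packs up to 32 together per container invocation
    results = await asyncio.gather(*(batch_predict.remote.aio(p) for p in prompts))
    print(results[:3])
```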
```bash
# Create secret
modal secret create huggingface HF_TOKEN=hf_xxx
```
```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]  # Injected from the secret
```
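For quick local experiments, a secret can also be built inline from a dict rather than created via the CLI (the token value here is a placeholder):

```python
@app.function(secrets=[modal.Secret.from_dict({"HF_TOKEN": "hf_xxx"})])
def quick_test():
    import os
    print(os.environ["HF_TOKEN"])  # Placeholder value, not a real token
```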
```python
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
```
```python
@app.function(
    container_idle_timeout=300,   # Keep warm 5 min
    allow_concurrent_inputs=10,   # Handle concurrent requests
)
def inference():
    pass
```
```python
@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Runs once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up

    @modal.method()
    def predict(self, x):
        return self.model(x)
```
```python
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
```
```python
@app.function(
    gpu="A100",
    memory=32768,                 # 32GB RAM
    cpu=4,                        # 4 CPU cores
    timeout=3600,                 # 1 hour max
    container_idle_timeout=120,   # Keep warm 2 min
    retries=3,                    # Retry on failure
    concurrency_limit=10,         # Max concurrent containers
)
def my_function():
    pass
```
```python
# Test locally
if __name__ == "__main__":
    result = my_function.local()

# View logs:
#   modal app logs my-app
```
| Issue | Solution |
|---|---|
| Cold start latency | Increase `container_idle_timeout`, use `keep_warm` |
| GPU OOM | Use a larger GPU (e.g. `A100-80GB`), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase `timeout`, add checkpointing |
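For the cold-start row, a configuration sketch combining the parameters above (`keep_warm` trades idle cost for latency; the function is illustrative):

```python
@app.function(
    gpu="A10G",
    container_idle_timeout=600,  # Keep idle containers alive for 10 minutes
    keep_warm=1,                 # Always keep one container provisioned
)
def low_latency_inference(x):
    ...
```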
License: MIT
```bash
mkdir -p ~/.hermes/skills/mlops/modal && curl -o ~/.hermes/skills/mlops/modal/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/modal/SKILL.md
```