tensorrt-llm
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100) when you need 10-100x faster inference than PyTorch.
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
Use TensorRT-LLM when:
- Deploying to production on NVIDIA datacenter GPUs (A100/H100) where throughput and latency matter most
- You can invest in an ahead-of-time engine build for a fixed model and GPU target

Use vLLM instead when:
- You want fast iteration and minimal setup with no compile step
- You need broader out-of-the-box model coverage across a wider range of GPUs

Use llama.cpp instead when:
- Running on CPU, Apple Silicon, or edge/consumer hardware
- You rely on GGUF quantized models and a small deployment footprint
```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
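After installing, a quick import confirms the wheel and the CUDA/TensorRT runtime line up. This sketch assumes only that the package exposes the standard `__version__` attribute:

```python
# Sanity check: the import fails fast if the installed CUDA/TensorRT
# runtime doesn't match the wheel.
import tensorrt_llm

print(tensorrt_llm.__version__)
```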
```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each result carries one or more completions; take the first.
    print(output.outputs[0].text)
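For interactive use you can stream tokens as they are produced. A minimal sketch, assuming your release exposes the async entry point `generate_async(..., streaming=True)`; check the LLM API reference for your installed version:

```python
import asyncio

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")

async def stream(prompt: str) -> None:
    # Assumption: with streaming=True, generate_async yields partial
    # results as tokens arrive rather than one final result.
    async for chunk in llm.generate_async(
        prompt, SamplingParams(max_tokens=100), streaming=True
    ):
        print(chunk.outputs[0].text)

asyncio.run(stream("Explain quantum computing"))
```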
```bash
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
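Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client works. The snippet below points the official `openai` Python SDK at the local endpoint; the placeholder API key is an assumption, as the local server does not check one:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(resp.choices[0].message.content)
```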
```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2x faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```
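FP8 needs hardware support (compute capability 8.9 on Ada, 9.0+ on Hopper). A quick guard before requesting FP8, sketched with torch, which TensorRT-LLM already pulls in as a dependency:

```python
import torch

# FP8 kernels require compute capability 8.9 (Ada) or 9.0+ (Hopper).
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) < (8, 9):
    raise RuntimeError(f"FP8 unsupported on compute capability {major}.{minor}")
```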
```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8",
)
```
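When one node's GPUs are not enough, tensor parallelism can be combined with pipeline parallelism. A sketch assuming the `pipeline_parallel_size` constructor argument available in recent LLM API releases; verify against your installed version:

```python
# 8-way tensor parallel x 2-way pipeline parallel = 16 GPUs total.
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,  # assumption: supported in your release
    dtype="fp8",
)
```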
```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)
# Automatic in-flight batching for maximum throughput
```
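To see what in-flight batching buys you, wrap the call above in a wall-clock measurement (this reuses `llm` and `prompts` from the previous block). The `token_ids` attribute on each completion is an assumption; adjust to your version's output objects:

```python
import time

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params=SamplingParams(max_tokens=200))
elapsed = time.perf_counter() - start

# token_ids per completion -- rename if your version differs.
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.1f}s ({total_tokens / elapsed:.0f} tok/s)")
```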
Meta Llama 3-8B (H100 GPU):
Llama 3-70B (8× A100 80GB):
MIT
```bash
mkdir -p ~/.hermes/skills/mlops/tensorrt-llm && curl -o ~/.hermes/skills/mlops/tensorrt-llm/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/tensorrt-llm/SKILL.md
```