whisper
OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params).
OpenAI's multilingual speech recognition model.
```bash
# Requires Python 3.8-3.11
pip install -U openai-whisper

# Requires ffmpeg
# macOS:   brew install ffmpeg
# Ubuntu:  sudo apt install ffmpeg
# Windows: choco install ffmpeg
```
```python
import whisper

# Load model
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("audio.mp3")

# Print text
print(result["text"])

# Access segments
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```
```python
# Available models
models = ["tiny", "base", "small", "medium", "large", "turbo"]

# Load specific model
model = whisper.load_model("turbo")  # Fast, with quality close to large
```
| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|---|---|---|---|---|---|
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
Recommendation: Use `turbo` for the best speed/quality trade-off, `base` for prototyping.
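The VRAM column above can drive a simple model picker. A minimal sketch; the thresholds mirror the table, and `pick_model` is a hypothetical helper, not part of the whisper API:

```python
def pick_model(vram_gb: float) -> str:
    """Return the largest multilingual model that fits in the given VRAM.

    Thresholds follow the VRAM column of the model table.
    """
    # (model, approximate VRAM needed in GB), largest first
    candidates = [
        ("large", 10),
        ("turbo", 6),
        ("medium", 5),
        ("small", 2),
        ("base", 1),
        ("tiny", 1),
    ]
    for name, needed in candidates:
        if vram_gb >= needed:
            return name
    return "tiny"  # CPU-only fallback; slow but always works

print(pick_model(8))    # turbo
print(pick_model(1.5))  # base
```

With 8 GB free, `large` (~10 GB) is skipped and `turbo` (~6 GB) is chosen; below 1 GB the helper falls back to `tiny`.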
```python
# Auto-detect language
result = model.transcribe("audio.mp3")

# Specify language (faster)
result = model.transcribe("audio.mp3", language="en")

# Supported: en, es, fr, de, it, pt, ru, ja, ko, zh, and 89 more
```
```python
# Transcription (default)
result = model.transcribe("audio.mp3", task="transcribe")

# Translation to English
result = model.transcribe("spanish.mp3", task="translate")
# Input: Spanish audio → Output: English text
```
```python
# Improve accuracy with context
result = model.transcribe(
    "audio.mp3",
    initial_prompt="This is a technical podcast about machine learning and AI.",
)
# Helps with:
# - Technical terms
# - Proper nouns
# - Domain-specific vocabulary
```
```python
# Word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
```
```python
# Retry with different temperatures if confidence is low
result = model.transcribe(
    "audio.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)
```
```bash
# Basic transcription
whisper audio.mp3

# Specify model
whisper audio.mp3 --model turbo

# Output formats
whisper audio.mp3 --output_format txt   # Plain text
whisper audio.mp3 --output_format srt   # Subtitles
whisper audio.mp3 --output_format vtt   # WebVTT
whisper audio.mp3 --output_format json  # JSON with timestamps

# Language
whisper audio.mp3 --language Spanish

# Translation
whisper spanish.mp3 --task translate
```
```python
import whisper

model = whisper.load_model("base")
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]

for audio_file in audio_files:
    print(f"Transcribing {audio_file}...")
    result = model.transcribe(audio_file)

    # Save to file
    output_file = audio_file.replace(".mp3", ".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])
```
```python
# For streaming audio, use faster-whisper
# pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cuda", compute_type="float16")

# Transcribe with streaming
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
```python
import whisper

# Automatically uses GPU if available
model = whisper.load_model("turbo")

# Force CPU
model = whisper.load_model("turbo", device="cpu")

# Force GPU
model = whisper.load_model("turbo", device="cuda")  # 10-20× faster on GPU
```
```bash
# Generate SRT subtitles
whisper video.mp4 --output_format srt --language English
# Output: video.srt
```
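If you want to post-process segments before writing subtitles, you can build the SRT text yourself: each cue is an index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, and the text. A minimal sketch; `segments_to_srt` and `srt_timestamp` are illustrative helpers, not part of whisper:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Convert whisper-style segments (dicts with start/end/text) to SRT."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

# Example with a hand-written segment (normally result["segments"]):
print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Hello world."}]))
```

Writing the returned string to `video.srt` gives the same cue layout the CLI's `--output_format srt` produces.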
```python
from langchain.document_loaders import WhisperTranscriptionLoader

loader = WhisperTranscriptionLoader(file_path="audio.mp3")
docs = loader.load()

# Use transcription in RAG
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
```
```bash
# Use ffmpeg to extract audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav

# Then transcribe
whisper audio.wav
```
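The same extraction can be scripted from Python. A sketch that only builds the argument list, so it can be inspected without ffmpeg installed; `extract_audio_cmd` is a hypothetical helper:

```python
def extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build the ffmpeg command: drop the video stream, write 16-bit PCM WAV."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                   # no video stream
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        audio_path,
    ]

cmd = extract_audio_cmd("video.mp4", "audio.wav")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with file names that contain spaces.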
| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|---|---|---|
| tiny | ~0.32 | ~0.01 |
| base | ~0.16 | ~0.01 |
| turbo | ~0.08 | ~0.01 |
| large | ~1.0 | ~0.05 |
Real-time factor: processing time ÷ audio duration, so 0.1 = 10× faster than real time.
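The table entries translate directly into expected wall-clock times. A quick illustration of the arithmetic (pure Python, no whisper required; `real_time_factor` is an illustrative helper):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# A 10-minute file transcribed in 48 seconds:
rtf = real_time_factor(48.0, 600.0)
print(rtf)                 # 0.08 -> roughly the table's turbo-on-CPU figure
print(round(1 / rtf, 1))   # speedup relative to real time
```

Conversely, an expected RTF from the table times the audio duration estimates how long a transcription will take.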
Top-supported languages include English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, and Chinese; 99 languages are supported in total.
License: MIT
```bash
mkdir -p ~/.hermes/skills/mlops/whisper && curl -o ~/.hermes/skills/mlops/whisper/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/whisper/SKILL.md
```