huggingface-tokenizers
Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignme
Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignme
Real data. Real impact.
Emerging
Developers
Per week
Excellent
Skills give you superpowers. Install in 30 seconds.
Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
Use HuggingFace Tokenizers when:
Performance:
Use alternatives instead:
# Install tokenizers pip install tokenizers # With transformers integration pip install tokenizers transformers
from tokenizers import Tokenizer # Load from HuggingFace Hub tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # Encode text output = tokenizer.encode("Hello, how are you?") print(output.tokens) # ['hello', ',', 'how', 'are', 'you', '?'] print(output.ids) # [7592, 1010, 2129, 2024, 2017, 1029] # Decode back text = tokenizer.decode(output.ids) print(text) # "hello, how are you?"
from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import Whitespace # Initialize tokenizer with BPE model tokenizer = Tokenizer(BPE(unk_token="[UNK]")) tokenizer.pre_tokenizer = Whitespace() # Configure trainer trainer = BpeTrainer( vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], min_frequency=2 ) # Train on files files = ["train.txt", "validation.txt"] tokenizer.train(files, trainer) # Save tokenizer.save("my-tokenizer.json")
Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
# Enable padding tokenizer.enable_padding(pad_id=3, pad_token="[PAD]") # Encode batch texts = ["Hello world", "This is a longer sentence"] encodings = tokenizer.encode_batch(texts) for encoding in encodings: print(encoding.ids) # [101, 7592, 2088, 102, 3, 3, 3] # [101, 2023, 2003, 1037, 2936, 6251, 102]
How it works:
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import ByteLevel tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>")) tokenizer.pre_tokenizer = ByteLevel() trainer = BpeTrainer( vocab_size=50257, special_tokens=["<|endoftext|>"], min_frequency=2 ) tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
Trade-offs:
How it works:
frequency(pair) / (frequency(first) × frequency(second))Used by: BERT, DistilBERT, MobileBERT
from tokenizers import Tokenizer from tokenizers.models import WordPiece from tokenizers.trainers import WordPieceTrainer from tokenizers.pre_tokenizers import Whitespace from tokenizers.normalizers import BertNormalizer tokenizer = Tokenizer(WordPiece(unk_token="[UNK]")) tokenizer.normalizer = BertNormalizer(lowercase=True) tokenizer.pre_tokenizer = Whitespace() trainer = WordPieceTrainer( vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], continuing_subword_prefix="##" ) tokenizer.train(files=["corpus.txt"], trainer=trainer)
Advantages:
Trade-offs:
[UNK] if no subword matchHow it works:
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
from tokenizers import Tokenizer from tokenizers.models import Unigram from tokenizers.trainers import UnigramTrainer tokenizer = Tokenizer(Unigram()) trainer = UnigramTrainer( vocab_size=8000, special_tokens=["<unk>", "<s>", "</s>"], unk_token="<unk>" ) tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
Trade-offs:
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
Clean and standardize text:
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence tokenizer.normalizer = Sequence([ NFD(), # Unicode normalization (decompose) Lowercase(), # Convert to lowercase StripAccents() # Remove accents ]) # Input: "Héllo WORLD" # After normalization: "hello world"
Common normalizers:
NFD, NFC, NFKD, NFKC - Unicode normalization formsLowercase() - Convert to lowercaseStripAccents() - Remove accents (é → e)Strip() - Remove whitespaceReplace(pattern, content) - Regex replacementSplit text into word-like units:
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel # Split on whitespace and punctuation tokenizer.pre_tokenizer = Sequence([ Whitespace(), Punctuation() ]) # Input: "Hello, world!" # After pre-tokenization: ["Hello", ",", "world", "!"]
Common pre-tokenizers:
Whitespace() - Split on spaces, tabs, newlinesByteLevel() - GPT-2 style byte-level splittingPunctuation() - Isolate punctuationDigits(individual_digits=True) - Split digits individuallyMetaspace() - Replace spaces with ▁ (SentencePiece style)Add special tokens for model input:
from tokenizers.processors import TemplateProcessing # BERT-style: [CLS] sentence [SEP] tokenizer.post_processor = TemplateProcessing( single="[CLS] $A [SEP]", pair="[CLS] $A [SEP] $B [SEP]", special_tokens=[ ("[CLS]", 1), ("[SEP]", 2), ], )
Common patterns:
# GPT-2: sentence <|endoftext|> TemplateProcessing( single="$A <|endoftext|>", special_tokens=[("<|endoftext|>", 50256)] ) # RoBERTa: <s> sentence </s> TemplateProcessing( single="<s> $A </s>", pair="<s> $A </s> </s> $B </s>", special_tokens=[("<s>", 0), ("</s>", 2)] )
Track token positions in original text:
output = tokenizer.encode("Hello, world!") # Get token offsets for token, offset in zip(output.tokens, output.offsets): start, end = offset print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}") # Output: # hello → [ 0, 5): 'Hello' # , → [ 5, 6): ',' # world → [ 7, 12): 'world' # ! → [12, 13): '!'
Use cases:
from transformers import AutoTokenizer # AutoTokenizer automatically uses fast tokenizers tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # Check if using fast tokenizer print(tokenizer.is_fast) # True # Access underlying tokenizers.Tokenizer fast_tokenizer = tokenizer.backend_tokenizer print(type(fast_tokenizer)) # <class 'tokenizers.Tokenizer'>
from tokenizers import Tokenizer from transformers import PreTrainedTokenizerFast # Train custom tokenizer tokenizer = Tokenizer(BPE()) # ... train tokenizer ... tokenizer.save("my-tokenizer.json") # Wrap for transformers transformers_tokenizer = PreTrainedTokenizerFast( tokenizer_file="my-tokenizer.json", unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]" ) # Use like any transformers tokenizer outputs = transformers_tokenizer( "Hello world", padding=True, truncation=True, max_length=512, return_tensors="pt" )
from datasets import load_dataset # Load dataset dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train") # Create batch iterator def batch_iterator(batch_size=1000): for i in range(0, len(dataset), batch_size): yield dataset[i:i + batch_size]["text"] # Train tokenizer tokenizer.train_from_iterator( batch_iterator(), trainer=trainer, length=len(dataset) # For progress bar )
Performance: Processes 1GB in ~10-20 minutes
# Enable truncation tokenizer.enable_truncation(max_length=512) # Enable padding tokenizer.enable_padding( pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=512 # Fixed length, or None for batch max ) # Encode with both output = tokenizer.encode("This is a long sentence that will be truncated...") print(len(output.ids)) # 512
from tokenizers import Tokenizer from multiprocessing import Pool # Load tokenizer tokenizer = Tokenizer.from_file("tokenizer.json") def encode_batch(texts): return tokenizer.encode_batch(texts) # Process large corpus in parallel with Pool(8) as pool: # Split corpus into chunks chunk_size = 1000 chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)] # Encode in parallel results = pool.map(encode_batch, chunks)
Speedup: 5-8× with 8 cores
| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |
Hardware: 16-core CPU, tested on English Wikipedia
| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |
Test: English text, average sentence length 20 words
| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |
Pre-trained tokenizers available via
from_pretrained():
BERT family:
bert-base-uncased, bert-large-caseddistilbert-base-uncasedroberta-base, roberta-largeGPT family:
gpt2, gpt2-medium, gpt2-largedistilgpt2T5 family:
t5-small, t5-base, t5-largegoogle/flan-t5-xxlOther:
facebook/bart-base, facebook/mbart-large-cc25albert-base-v2, albert-xlarge-v2xlm-roberta-base, xlm-roberta-largeBrowse all: https://huggingface.co/models?library=tokenizers
MIT
mkdir -p ~/.hermes/skills/mlops/huggingface-tokenizers && curl -o ~/.hermes/skills/mlops/huggingface-tokenizers/SKILL.md https://raw.githubusercontent.com/NousResearch/hermes-agent/main/optional-skills/mlops/huggingface-tokenizers/SKILL.md1,500+ AI skills, agents & workflows. Install in 30 seconds. Part of the Torly.ai family.
© 2026 Torly.ai. All rights reserved.