Tokenization, Stemming, and Lemmatization
Before any model touches a sentence, the sentence must be broken into pieces. Tokenization is the process of splitting raw text into the units a model consumes: words, subwords, or characters. Stemming and lemmatization are classical techniques that further normalize those tokens. This notebook walks through all three hands-on, along with the modern subword tokenizers that LLMs actually use, so you finish knowing exactly which token IDs flow into a model.
pip install nltk==3.9.1 spacy==3.7.5 \
transformers==4.46.0 sentencepiece==0.2.0
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt_tab')"
1. What Tokenization Actually Does
Tokenization splits a string into a sequence of tokens. For the word "tokenization" alone, the answer depends on the tokenizer:
| Tokenizer | Output |
|---|---|
| Whitespace (naive) | ['tokenization'] |
| NLTK word tokenizer | ['tokenization'] |
| spaCy | ['tokenization'] |
| BERT WordPiece | ['token', '##ization'] |
| GPT-2/3/4 BPE | ['token', 'ization'] |
| SentencePiece (T5, LLaMA) | ['▁token', 'ization'] |
| Character | ['t', 'o', 'k', 'e', 'n', ...] |
| Byte-level | raw bytes — handles any Unicode |
Modern transformers use subword tokenization (WordPiece, BPE, SentencePiece), a compromise between word-level tokenization (a very large vocabulary that still cannot represent unseen words) and character-level tokenization (a tiny vocabulary that handles anything but produces very long sequences).
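To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first segmentation in the style of WordPiece. The vocabulary `TOY_VOCAB` and the function `wordpiece_split` are made up for illustration; a real tokenizer learns a vocabulary of tens of thousands of pieces from a corpus and may differ in detail.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation (WordPiece style, simplified)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"token", "##ization", "##s", "un", "##forget", "##table"}

print(wordpiece_split("tokenization", TOY_VOCAB))   # ['token', '##ization']
print(wordpiece_split("tokens", TOY_VOCAB))         # ['token', '##s']
print(wordpiece_split("unforgettable", TOY_VOCAB))  # ['un', '##forget', '##table']
print(wordpiece_split("zebra", TOY_VOCAB))          # ['[UNK]']
```

Six pieces cover three different words here, which is the trade-off in miniature: a small vocabulary, short sequences, and reasonable coverage of words that never appeared whole.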
2. Whitespace and Word Tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Dr. Smith said NLP is fun. It's harder than it looks!"
# Naive
print(text.split())
# ['Dr.', 'Smith', 'said', 'NLP', 'is', 'fun.', "It's", 'harder', 'than', 'it', 'looks!']
# NLTK
print(word_tokenize(text))
# ['Dr.', 'Smith', 'said', 'NLP', 'is', 'fun', '.', 'It', "'s", 'harder', 'than', 'it', 'looks', '!']
# Sentences
print(sent_tokenize(text))
# ['Dr. Smith said NLP is fun.', "It's harder than it looks!"]
Naive whitespace splitting gets two things wrong: punctuation stays attached ("fun." remains one token) and contractions stay unsplit ("It's" remains one token). NLTK handles both. spaCy's tokenizer adds rule-based exceptions for abbreviations (it keeps "U.S.A." as one token) and is typically faster.
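The two fixes can be approximated with a single regular expression. This is only a sketch: `TOKEN_PATTERN` and `regex_tokenize` are hypothetical names, the abbreviation list is deliberately tiny, and NLTK's actual rules are far more extensive.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"""
    (?:Dr|Mr|Mrs|Ms)\.   # a few known abbreviations keep their period
    | \w+                # runs of letters, digits, underscores
    | '\w+               # clitics such as 's, split off the host word
    | [^\w\s]            # any other single punctuation mark
    """,
    re.VERBOSE,
)


def regex_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


text = "Dr. Smith said NLP is fun. It's harder than it looks!"
print(regex_tokenize(text))
# ['Dr.', 'Smith', 'said', 'NLP', 'is', 'fun', '.', 'It', "'s", 'harder', 'than', 'it', 'looks', '!']
```

On this one sentence the result matches the `word_tokenize` output shown earlier, but the pattern would need many more exceptions (numbers, URLs, other abbreviations) before it could stand in for a real tokenizer.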
3. spaCy: The Production-Grade Pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown foxes were jumping over the lazy dogs.")
for tok in doc:
    print(f"{tok.text:<12} lemma={tok.lemma_:<10} pos={tok.pos_:<6} is_stop={tok.is_stop}")
spaCy gives you tokens, lemmas, part-of-speech tags, stop-word flags, a dependency parse, and named entities from a single call to `nlp(...)`. Throughput is on the order of 5,000 tokens/second on CPU, depending on hardware and on which pipeline components are enabled. It is a sensible default for any preprocessing pipeline that doesn't involve transformer models.
4. Stemming vs Lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk; nltk.download("wordnet")
words = ["running", "runs", "ran", "easily", "fairly", "studies"]
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# ['run', 'run', 'ran', 'easili', 'fairli', 'studi']
lem = WordNetLemmatizer()
print([lem.lemmatize(w, pos="v") for w in words])
# ['run', 'run', 'run', 'easily', 'fairly', 'study']
| Property | Stemming | Lemmatization |
|---|---|---|
| How | Heuristic suffix stripping | Dictionary lookup + POS-aware |
| Output | Often not a real word ("studi", "fairli") | Always a real word |
| Speed | Fast | Slower (roughly 10×) |
| Quality | Aggressive; collapses too much | Conservative; preserves meaning |
| When | Search indexes that prioritize recall | Most other classical NLP work |
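The "How" row is the heart of the difference, and a toy version makes it visible. The sketch below is not the Porter algorithm or WordNet: `SUFFIXES`, `TOY_LEMMAS`, `toy_stem`, and `toy_lemmatize` are invented for illustration, with a four-rule suffix list and a four-entry dictionary.

```python
# Hypothetical suffix rules, checked in order. The real Porter stemmer
# applies several ordered phases of rules with extra conditions.
SUFFIXES = ("ies", "ing", "ly", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical (word, part-of-speech) -> lemma table standing in for WordNet.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("ran", "v"): "run",
    ("studies", "v"): "study",
}


def toy_lemmatize(word, pos):
    """Look the word up in the dictionary; unknown words pass through."""
    return TOY_LEMMAS.get((word, pos), word)


words = ["running", "runs", "ran", "easily", "fairly", "studies"]
print([toy_stem(w) for w in words])
# ['runn', 'run', 'ran', 'easi', 'fair', 'stud']
print([toy_lemmatize(w, "v") for w in words])
# ['run', 'run', 'run', 'easily', 'fairly', 'study']
```

The stemmer needs no data but produces non-words and misses the irregular form "ran"; the lemmatizer gets both right, but only for entries its dictionary contains.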
5. Stop Words
from nltk.corpus import stopwords
nltk.download("stopwords")
stops = set(stopwords.words("english"))
text = "the quick brown fox jumps over the lazy dog"
filtered = [w for w in text.split() if w.lower() not in stops]
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Removing high-frequency words (the, is, of, ...) reduces feature dimension and noise for classical models. This step is for classical pipelines only: transformers learn the value of these words from data, so do not strip stop words before a subword tokenizer.
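The effect on feature dimension is easy to measure on a small bag-of-words example. This is a sketch under stated assumptions: `TOY_STOPS` is a hypothetical five-word list (NLTK's English list is much longer), and `docs` is a made-up two-document corpus.

```python
from collections import Counter

TOY_STOPS = {"the", "is", "of", "over", "a"}  # hypothetical, deliberately tiny
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog is a friend of the fox",
]


def bag_of_words(doc, stops=frozenset()):
    """Count lowercase whitespace tokens, skipping any in `stops`."""
    return Counter(w for w in doc.lower().split() if w not in stops)


# The vocabulary is the set of distinct tokens across all documents.
vocab_all = set().union(*(bag_of_words(d) for d in docs))
vocab_filtered = set().union(*(bag_of_words(d, TOY_STOPS) for d in docs))

print(len(vocab_all), len(vocab_filtered))   # 12 7
print(bag_of_words(docs[0]).most_common(1))  # [('the', 2)]
```

Five stop words account for five of the twelve features, and the most frequent token in the first document carries no topical signal, which is the noise a TF-IDF or Naive Bayes model is better off without.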
6. Subword Tokenizers (the Modern Default)
from transformers import AutoTokenizer
# WordPiece (BERT family)
tok_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok_bert.tokenize("The tokenization workflow is unforgettable!"))
# ['the', 'token', '##ization', 'workflow', 'is', 'un', '##forget', '##table', '!']
# BPE (GPT family)
tok_gpt = AutoTokenizer.from_pretrained("gpt2")
print(tok_gpt.tokenize("The tokenization workflow is unforgettable!"))
# ['The', 'Ġtoken', 'ization', 'Ġworkflow', 'Ġis', 'Ġunfor', 'gettable', '!']
# SentencePiece (T5, LLaMA, Mistral)
tok_t5 = AutoTokenizer.from_pretrained("t5-small")
print(tok_t5.tokenize("The tokenization workflow is unforgettable!"))
# ['▁The', '▁token', 'ization', '▁workflow', '▁is', '▁un', 'for', 'gettable', '!']
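Each algorithm marks token boundaries differently: `##` for WordPiece continuation, `Ġ` for a leading space in GPT-2's BPE, `▁` for SentencePiece word-start. They all do the same job, which is to break unfamiliar words into known subword pieces so that an unknown token is rarely needed. Only byte-level vocabularies such as GPT-2's BPE rule it out entirely; WordPiece still falls back to `[UNK]` for characters missing from its vocabulary.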
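Whether a tokenizer can avoid an unknown token altogether depends on its base alphabet. The sketch below contrasts a closed character vocabulary with a byte-level fallback; `encode_with_unk`, `encode_bytes`, `decode_bytes`, and `ASCII_VOCAB` are hypothetical helpers, and real byte-level tokenizers go on to merge bytes into larger learned pieces.

```python
ASCII_VOCAB = set("abcdefghijklmnopqrstuvwxyz ")  # hypothetical closed vocabulary


def encode_with_unk(text, vocab):
    """Character-level encoding: anything outside the vocabulary becomes [UNK]."""
    return [ch if ch in vocab else "[UNK]" for ch in text]


def encode_bytes(text):
    """Byte-level encoding: every string maps to IDs in the range 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    """Invert encode_bytes."""
    return bytes(ids).decode("utf-8")


text = "na\u00efve \U0001F642"  # "naive" with a diaeresis, then a smiley emoji

closed = encode_with_unk(text, ASCII_VOCAB)
print(closed.count("[UNK]"))         # 2 (the accented letter and the emoji are lost)

byte_ids = encode_bytes(text)
print(len(byte_ids))                 # 11 (seven characters become eleven bytes)
print(decode_bytes(byte_ids) == text)  # True (nothing is lost)
```

The byte-level route never loses information, at the cost of longer sequences for non-ASCII text; that cost is what the learned merges in byte-level BPE are meant to recover.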
7. Encoding to IDs
encoded = tok_bert("The tokenization workflow.", return_tensors="pt")
print(encoded.input_ids)
# tensor([[ 101, 1996, 19204, 3989, 8487, 1012, 102]])
print(tok_bert.convert_ids_to_tokens(encoded.input_ids[0]))
# ['[CLS]', 'the', 'token', '##ization', 'workflow', '.', '[SEP]']
Real model inputs are integer IDs into a vocabulary, not strings. The tokenizer also adds the special tokens (CLS, SEP, PAD, BOS, EOS) the model expects. Always use the model's matching tokenizer: a mismatched tokenizer maps text to IDs the model was never trained to associate with those pieces, which silently ruins a fine-tune.
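The mechanics of ID lookup, special tokens, and padding fit in a few lines, and the same toy setup shows why a mismatched vocabulary is harmful. Everything here is hypothetical (`SPECIALS`, `build_vocab`, `encode`, and both vocabularies); the IDs are unrelated to BERT's real ones.

```python
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[UNK]"]


def build_vocab(tokens):
    """Map each token to an integer ID; special tokens take the lowest IDs."""
    return {tok: i for i, tok in enumerate(SPECIALS + sorted(set(tokens)))}


def encode(tokens, vocab, max_length):
    """Wrap tokens in [CLS]/[SEP], map them to IDs, then pad to max_length."""
    body = tokens[: max_length - 2]  # leave room for the two special tokens
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(t, vocab["[UNK]"]) for t in body]
    ids.append(vocab["[SEP]"])
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 0 marks padding positions
    return ids + [vocab["[PAD]"]] * n_pad, attention_mask


tokens = ["the", "token", "##ization", "workflow", "."]
vocab_a = build_vocab(tokens)
vocab_b = build_vocab(tokens + ["a"])  # a second, slightly different vocabulary

ids_a, mask_a = encode(tokens, vocab_a, max_length=10)
ids_b, _ = encode(tokens, vocab_b, max_length=10)
print(ids_a)   # [1, 6, 7, 4, 8, 5, 2, 0, 0, 0]
print(mask_a)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(ids_b)   # [1, 7, 8, 4, 9, 5, 2, 0, 0, 0]

# Read vocabulary B's IDs through vocabulary A's table, as a mismatched model would.
id_to_token_a = {i: t for t, i in vocab_a.items()}
print([id_to_token_a.get(i, "[UNK]") for i in ids_b][:7])
# ['[CLS]', 'token', 'workflow', '##ization', '[UNK]', '.', '[SEP]']
```

One extra vocabulary entry shifts several IDs, so the "model" reads a scrambled sentence without raising any error, which is why the failure is silent.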
8. Tokenization Pitfalls
9. Putting It Together
def classical_preprocess(text, lemmatizer, stops):
    # Lowercase, tokenize, remove stops, lemmatize. For TF-IDF / Naive Bayes.
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stops]

def transformer_preprocess(text, tokenizer, max_length=128):
    # For BERT / GPT / T5: just tokenize. No lower / stop / lemma.
    return tokenizer(text, padding="max_length",
                     truncation=True, max_length=max_length,
                     return_tensors="pt")
Two preprocessing pipelines, two different worldviews. Use the first for any classical model in this course's Sections 2-3; use the second for everything from Section 4 onward.