Tokenization, Stemming, and Lemmatization

45 min read · notebook · Text Preprocessing and Representation

Lesson 2 of 32 · Natural Language Processing

Before any model touches a sentence, the sentence must be broken into pieces. Tokenization is the process of splitting raw text into the units a model consumes: words, subwords, or characters. Stemming and lemmatization are classical techniques that further normalize those tokens. This notebook walks through all three hands-on, along with the subword tokenizers that modern LLMs actually use, so you finish knowing exactly which tokens and IDs flow into a model.

code

# torch is required for return_tensors="pt" in the encoding examples
pip install nltk==3.9.1 spacy==3.7.5 \
            transformers==4.46.0 sentencepiece==0.2.0 torch
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt_tab')"

1. What Tokenization Actually Does

Tokenization splits a string into a sequence of tokens. For the word "tokenization" alone, the answer depends on the tokenizer:

Tokenizer	Output
Whitespace (naive)	['tokenization']
NLTK word tokenizer	['tokenization']
spaCy	['tokenization']
BERT WordPiece	['token', '##ization']
GPT-2/3/4 BPE	['token', 'ization']
SentencePiece (T5, LLaMA)	['▁token', 'ization']
Character	['t', 'o', 'k', 'e', 'n', ...]
Byte-level	raw bytes — handles any Unicode

Modern transformers use subword tokenization (WordPiece, BPE, SentencePiece), a compromise between word-level tokenization (very large vocabulary, and unseen words still map to an unknown token) and character-level tokenization (tiny vocabulary that handles anything, but produces very long sequences).
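
To make the subword idea concrete, here is a minimal sketch of how BPE learns its vocabulary: start from single characters and repeatedly fuse the most frequent adjacent pair. The function name `learn_bpe_merges` and the toy corpus are illustrative assumptions, and the sketch omits details that real implementations add (end-of-word markers, byte-level input, pre-tokenization).

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer (hypothetical helper): returns the merges and the final corpus."""
    # Each distinct word starts as a tuple of single characters with a count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Rewrite each word, fusing every occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe_merges(toy_words, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the frequent suffix "est" is already a single symbol, which is how common word parts end up in the vocabulary while rare words stay split into smaller pieces.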

2. Whitespace and Word Tokenization

code

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Dr. Smith said NLP is fun. It's harder than it looks!"

# Naive
print(text.split())
# ['Dr.', 'Smith', 'said', 'NLP', 'is', 'fun.', "It's", 'harder', 'than', 'it', 'looks!']

# NLTK
print(word_tokenize(text))
# ['Dr.', 'Smith', 'said', 'NLP', 'is', 'fun', '.', 'It', "'s", 'harder', 'than', 'it', 'looks', '!']

# Sentences
print(sent_tokenize(text))
# ['Dr. Smith said NLP is fun.', "It's harder than it looks!"]

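Naive whitespace splitting gets two things wrong: punctuation stays attached ("fun.") and contractions stay whole ("It's"). NLTK handles both, and its sentence tokenizer knows that the period in "Dr." does not end a sentence. spaCy's tokenizer applies similar rules plus an exception list for abbreviations such as "U.S.A.", which it keeps as one token, and it is designed for speed.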
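
The two whitespace failures can be partly repaired with a single regular expression. The sketch below uses a hypothetical helper, `simple_tokenize`, whose pattern is an assumption for illustration and is far simpler than NLTK's actual rules.

```python
import re

# Three alternatives, tried in order at each position:
#   \w+      a run of letters, digits, or underscores
#   '\w+     an apostrophe clitic such as 's or 'll
#   [^\w\s]  any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+|'\w+|[^\w\s]")

def simple_tokenize(text):
    """Split punctuation and apostrophe clitics away from words (toy rule set)."""
    return TOKEN_PATTERN.findall(text)

text = "Dr. Smith said NLP is fun. It's harder than it looks!"
print(simple_tokenize(text))
# ['Dr', '.', 'Smith', 'said', 'NLP', 'is', 'fun', '.', 'It', "'s", 'harder', 'than', 'it', 'looks', '!']
```

Note that the pattern splits "Dr." into "Dr" and ".", which NLTK avoids. Handling abbreviations correctly is the part that needs more than a regular expression.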

3. spaCy: The Production-Grade Pipeline

code

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown foxes were jumping over the lazy dogs.")
for tok in doc:
    print(f"{tok.text:<12} lemma={tok.lemma_:<10} pos={tok.pos_:<6} is_stop={tok.is_stop}")

spaCy gives you tokens, lemmas, part-of-speech tags, stop-word flags, a dependency parse, and named entities from a single call to nlp(). Throughput is roughly 5,000 tokens/second on CPU, though the exact figure depends on hardware and on which pipeline components are enabled. It is the modern default for any preprocessing pipeline that doesn't involve transformer models.

4. Stemming vs Lemmatization

code

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk; nltk.download("wordnet")

words = ["running", "runs", "ran", "easily", "fairly", "studies"]

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# ['run', 'run', 'ran', 'easili', 'fairli', 'studi']

lem = WordNetLemmatizer()
print([lem.lemmatize(w, pos="v") for w in words])
# ['run', 'run', 'run', 'easily', 'fairly', 'study']

Property	Stemming	Lemmatization
How	Heuristic suffix stripping	Dictionary lookup + POS-aware
Output	Often not a real word ("studi", "fairli")	Always a real word
Speed	Fast	Slower (roughly 10×)
Quality	Aggressive; collapses too much	Conservative; preserves meaning
When	Search indexes that prioritize recall	Most other classical NLP work
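
The "How" row of the table can be illustrated with two toy functions. Both are sketches under stated assumptions: the suffix list and the lookup table are invented for this example and are much smaller than Porter's rules or WordNet.

```python
# Hypothetical suffix rules, checked in order; a real stemmer has many more.
SUFFIXES = ("ing", "ies", "ly", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Look the word up under its part of speech; unknown words pass through."""
    return LEMMA_TABLE.get((word, pos), word)

print([toy_stem(w) for w in ["running", "runs", "ran", "easily", "studies"]])
# ['runn', 'run', 'ran', 'easi', 'stud']
print(toy_lemmatize("ran", "v"), toy_lemmatize("better", "a"), toy_lemmatize("better", "v"))
# run good better
```

The stemmer never touches "ran" because no suffix rule applies, while the lemmatizer resolves it through the dictionary. The last call shows why the part of speech matters: the same string maps to different lemmas depending on its tag.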

5. Stop Words

code

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

stops = set(stopwords.words("english"))
text = "the quick brown fox jumps over the lazy dog"
filtered = [w for w in text.split() if w.lower() not in stops]
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Removing high-frequency words (the, is, of, ...) reduces feature dimensionality and noise for classical models. It is not useful for transformers, which learn how much weight to give these words from data.
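
The effect on feature dimension is easy to see by counting vocabulary entries before and after filtering. This is a minimal sketch: the three documents and the hand-picked `STOPS` set are assumptions for illustration, not NLTK's English list.

```python
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
STOPS = {"the", "on", "a", "and"}  # hypothetical, hand-picked stop list

def vocabulary(documents, stops=frozenset()):
    """Return the sorted set of distinct words, skipping any stop words."""
    return sorted({w for d in documents for w in d.split() if w not in stops})

full = vocabulary(docs)
reduced = vocabulary(docs, STOPS)
print(len(full), full)
# 9 ['a', 'and', 'cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(len(reduced), reduced)
# 5 ['cat', 'dog', 'log', 'mat', 'sat']
```

Each vocabulary entry becomes one column in a bag-of-words matrix, so here the filter removes four of nine features.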

6. Subword Tokenizers (the Modern Default)

code

from transformers import AutoTokenizer

# WordPiece (BERT family)
tok_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok_bert.tokenize("The tokenization workflow is unforgettable!"))
# ['the', 'token', '##ization', 'workflow', 'is', 'un', '##forget', '##table', '!']

# BPE (GPT family)
tok_gpt = AutoTokenizer.from_pretrained("gpt2")
print(tok_gpt.tokenize("The tokenization workflow is unforgettable!"))
# ['The', 'Ġtoken', 'ization', 'Ġworkflow', 'Ġis', 'Ġunfor', 'gettable', '!']

# SentencePiece (T5, LLaMA, Mistral)
tok_t5 = AutoTokenizer.from_pretrained("t5-small")
print(tok_t5.tokenize("The tokenization workflow is unforgettable!"))
# ['▁The', '▁token', 'ization', '▁workflow', '▁is', '▁un', 'for', 'gettable', '!']

Each tokenizer marks word boundaries differently: `##` marks a WordPiece continuation piece, `Ġ` encodes the leading space in GPT-2's byte-level BPE, and `▁` marks a SentencePiece word start. They all do the same job: break rare or unseen words into known subword pieces, so the tokenizer rarely needs an "unknown" token (byte-level BPE never does).
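
How a trained vocabulary is applied to a new word can be sketched with WordPiece's greedy longest-match-first rule. The helper name `wordpiece_tokenize` and the eight-entry `TOY_PIECES` vocabulary are assumptions for illustration. The real BERT vocabulary has about 30,000 entries, including single characters, which is why the whole-word fallback to `[UNK]` is rare in practice.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

TOY_PIECES = {"token", "##ization", "##ize", "un", "##forget", "##table", "work", "##flow"}

print(wordpiece_tokenize("tokenization", TOY_PIECES))   # ['token', '##ization']
print(wordpiece_tokenize("unforgettable", TOY_PIECES))  # ['un', '##forget', '##table']
print(wordpiece_tokenize("tokenizer", TOY_PIECES))      # ['[UNK]']
```

The last call fails only because the toy vocabulary has no piece for the trailing "r". BPE and SentencePiece apply their vocabularies differently (learned merges and a unigram model, respectively), so this sketch covers WordPiece only.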

7. Encoding to IDs

code

encoded = tok_bert("The tokenization workflow.", return_tensors="pt")
print(encoded.input_ids)
# tensor([[ 101, 1996, 19204, 3989, 8487, 1012,  102]])
print(tok_bert.convert_ids_to_tokens(encoded.input_ids[0]))
# ['[CLS]', 'the', 'token', '##ization', 'workflow', '.', '[SEP]']

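Real model inputs are integer IDs into a vocabulary, not strings. The tokenizer also adds the special tokens the model expects (for example [CLS] and [SEP] for BERT; other models use PAD, BOS, or EOS markers). Always use the tokenizer that matches the model: a mismatched tokenizer produces IDs that point at the wrong embedding rows, which silently ruins a fine-tune.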
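
What the tokenizer does at this step can be sketched in plain Python: wrap the pieces in special tokens, look up each ID, truncate, and pad. The `TOY_IDS` vocabulary and the `encode` helper are assumptions for illustration; the ID values are invented and do not match BERT's.

```python
# Hypothetical nine-entry vocabulary; real IDs come from the model's vocab file.
TOY_IDS = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
           "the": 4, "token": 5, "##ization": 6, "workflow": 7, ".": 8}

def encode(tokens, vocab, max_length):
    """Map tokens to IDs with [CLS]/[SEP], truncation, and right-padding."""
    body = tokens[: max_length - 2]           # leave room for [CLS] and [SEP]
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(t, vocab["[UNK]"]) for t in body]
    ids.append(vocab["[SEP]"])
    attention_mask = [1] * len(ids)           # 1 = real token
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding
    attention_mask += [0] * padding           # 0 = padding, ignored by the model
    return ids, attention_mask

ids, mask = encode(["the", "token", "##ization", "workflow", "."], TOY_IDS, max_length=8)
print(ids)   # [2, 4, 5, 6, 7, 8, 3, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask is what lets a model batch sequences of different lengths: padded positions are present in the tensor but masked out of attention.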

8. Tokenization Pitfalls

9. Putting It Together

code

def classical_preprocess(text, lemmatizer, stops):
    # Lowercase, tokenize, remove stops, lemmatize. For TF-IDF / Naive Bayes.
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stops]

def transformer_preprocess(text, tokenizer, max_length=128):
    # For BERT / GPT / T5: just tokenize. No lower / stop / lemma.
    return tokenizer(text, padding="max_length",
                     truncation=True, max_length=max_length,
                     return_tensors="pt")

Two preprocessing pipelines, two different worldviews. Use the first for any classical model in this course's Sections 2-3; use the second for everything from Section 4 onward.

10. Exercises
