Bag-of-Words and TF-IDF Representations
Once text is tokenized, it has to become numbers. The two oldest representations, bag-of-words (BoW) and TF-IDF, are still the production default for retrieval, high-volume text classification, and any pipeline where latency or cost rules out neural approaches. This lesson covers what they actually compute, why TF-IDF usually beats BoW, and where both still beat transformers in 2026.
1. The Bag-of-Words Idea
Represent a document as a vector of word counts, ignoring word order. Three documents:
D1: "the cat sat on the mat"
D2: "the dog sat on the rug"
D3: "the cat and the dog"
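Vocabulary across all three: {the, cat, sat, on, mat, dog, rug, and} (8 words). Each document becomes a count vector of length 8:
| | the | cat | sat | on | mat | dog | rug | and |
|---|---|---|---|---|---|---|---|---|
| D1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| D2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| D3 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |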
Two assumptions are baked in: word order is ignored ("dog bites man" = "man bites dog"), and only identical surface forms match, so "ran" and "running" count as unrelated words. Stemming and lemmatization (Lesson 2) address the second; n-grams (Section 5) recover part of the first.
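The count table above can be reproduced in a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and a vocabulary built in first-seen order; the helper name `bow_vector` is made up for illustration.

```python
from collections import Counter

docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the rug",
    "D3": "the cat and the dog",
}

# Build the vocabulary in first-seen order so the columns match the table.
vocab = []
for text in docs.values():
    for token in text.split():
        if token not in vocab:
            vocab.append(token)

def bow_vector(text, vocab):
    """Return the count of each vocabulary word in `text` (hypothetical helper)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = {name: bow_vector(text, vocab) for name, text in docs.items()}

print(vocab)
# ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'rug', 'and']
for name, vec in vectors.items():
    print(name, vec)
# D1 [2, 1, 1, 1, 1, 0, 0, 0]
# D2 [2, 0, 1, 1, 0, 1, 1, 0]
# D3 [2, 1, 0, 0, 0, 1, 0, 1]
```

Shuffling the words inside any document leaves its vector unchanged, which is the order-blindness described above.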
2. The Limitation BoW Hits Immediately
Common words ("the", "is", "of") dominate count vectors but carry little information. A document about cats and a document about dogs both have lots of "the" — the model can't easily distinguish them on count alone. Two fixes:
- Stop-word removal — drop high-frequency words altogether. Crude.
- TF-IDF weighting — down-weight common words proportional to how common they are across the corpus. Better.
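To see how much a single common word distorts similarity, the sketch below compares D1 and D2 by cosine similarity on raw counts, once with "the" and once without. It assumes whitespace tokenization; the `cosine` helper is illustrative and not from any library.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = "the cat sat on the mat".split()
d2 = "the dog sat on the rug".split()

raw = cosine(Counter(d1), Counter(d2))
no_the = cosine(
    Counter(t for t in d1 if t != "the"),
    Counter(t for t in d2 if t != "the"),
)

print(round(raw, 2), round(no_the, 2))  # 0.75 0.5
```

Two of the six shared count units that drive the 0.75 score are "sat" and "on"; the other four come from "the" alone, a word that says nothing about cats or dogs.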
3. TF-IDF in One Equation
tfidf(t, d, D) = tf(t, d) · idf(t, D)
tf(t, d) = count(t in d) / |d| # term frequency
idf(t, D) = log( |D| / (1 + |{d : t ∈ d}|) ) # inverse doc frequency
Two pieces:
- TF rewards words that appear often in the document — they're probably about the topic.
- IDF penalizes words that appear in many documents — "the" appears everywhere; "phylogenetic" doesn't. The "+1" in the denominator avoids division by zero for unseen-in-corpus terms.
The product gives words that are locally common and globally rare a high score — exactly the signal a topic / sentiment / spam classifier wants.
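The three formulas translate almost line for line into Python. The sketch below assumes whitespace tokenization and the natural logarithm (the lesson's formula does not fix a base), and applies them to D1 from Section 1.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat and the dog".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency with the +1 in the denominator."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ("the", "cat", "mat"):
    print(term, round(tfidf(term, docs[0], docs), 3))
# the -0.096
# cat 0.0
# mat 0.068
```

On a corpus this small the +1 pushes very common terms to zero or slightly below: "the" appears in all three documents, so its idf is log(3/4) < 0. On realistic corpora the effect is negligible, and some implementations use a smoothed variant that keeps every weight positive.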
4. TF-IDF in Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer
docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat and the dog",
]
vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 1))
X = vec.fit_transform(docs) # sparse matrix (3 docs x V terms)
print(vec.get_feature_names_out())
# ['cat' 'dog' 'mat' 'rug' 'sat']
print(X.toarray().round(2))
# [[0.52 0.   0.68 0.   0.52]
#  [0.   0.52 0.   0.68 0.52]
#  [0.71 0.71 0.   0.   0.  ]]
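With stop_words="english" passed explicitly (it is not a default), the vectorizer here lowercases, drops stop words, and returns L2-normalized TF-IDF rows. Note that sklearn's defaults differ from the Section 3 formula: tf is the raw count and idf is smoothed as ln((1 + n) / (1 + df)) + 1, so no term gets a zero or negative weight. The output is sparse: most documents touch only a tiny fraction of the vocabulary, and storing zeros wastes memory. Keep the matrix sparse throughout downstream pipelines.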
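The weights `TfidfVectorizer` assigns to the three toy documents can be worked out by hand. The sketch below uses only the standard library and assumes sklearn's documented defaults (raw-count tf, smoothed idf ln((1 + n) / (1 + df)) + 1, L2 row normalization). The stop-word step is hard-coded rather than computed, and the helper names are hypothetical.

```python
import math

# Tokens left after dropping "the", "on", and "and",
# which are all in sklearn's English stop-word list.
docs = [["cat", "sat", "mat"], ["dog", "sat", "rug"], ["cat", "dog"]]
vocab = sorted({t for d in docs for t in d})
n = len(docs)

def smooth_idf(term):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in d for d in docs)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed idf, then L2-normalize the row."""
    raw = [doc.count(t) * smooth_idf(t) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [round(x / norm, 2) for x in raw]

rows = [tfidf_row(d) for d in docs]
print(vocab)
# ['cat', 'dog', 'mat', 'rug', 'sat']
for row in rows:
    print(row)
# [0.52, 0.0, 0.68, 0.0, 0.52]
# [0.0, 0.52, 0.0, 0.68, 0.52]
# [0.71, 0.71, 0.0, 0.0, 0.0]
```

"mat" and "rug" each occur in one document, so they outweigh "cat", "dog", and "sat", which each occur in two.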
5. N-Grams: Cheap Word-Order Recovery
vec = TfidfVectorizer(ngram_range=(1, 2)) # unigrams + bigrams
X = vec.fit_transform(docs)
print(vec.get_feature_names_out()[:10].tolist())
# ['and', 'and the', 'cat', 'cat and', 'cat sat', 'dog', 'dog sat', 'mat', 'on', 'on the']
Bigrams ("not great" vs "great") capture some local word order at the cost of vocabulary growth (up to V × V possible bigrams in the worst case, though real corpora contain far fewer). In practice unigrams + bigrams is the usual sweet spot; trigrams tend to add little accuracy while inflating the feature count considerably.
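N-gram extraction is just a sliding window over the token list. The sketch below is an illustrative stand-in for what `ngram_range` does, assuming whitespace tokens; the `ngrams` function is hypothetical, not sklearn's internal implementation.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams("not great".split()))
# ['not', 'great', 'not great']

# Distinct features from a single six-word sentence as n_max grows.
tokens = "the cat sat on the mat".split()
sizes = {n_max: len(set(ngrams(tokens, 1, n_max))) for n_max in (1, 2, 3)}
print(sizes)
# {1: 5, 2: 10, 3: 14}
```

Even on one sentence the feature count nearly triples from unigrams to trigrams; across a large corpus the growth is much steeper, because most longer n-grams occur only once.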
6. The Production Sweet Spot
vec = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    min_df=5,             # drop terms in < 5 docs (typos, rare)
    max_df=0.9,           # drop terms in > 90% of docs (junk)
    sublinear_tf=True,    # 1 + log(tf) instead of raw tf
    max_features=50_000,  # cap vocab
)
Seven parameters cover most TF-IDF tuning. sublinear_tf often helps because a term that appears 20 times is rarely 20 times as informative as one that appears once; 1 + log(tf) dampens that growth. min_df is frequently skipped and is often one of the larger sources of accuracy and memory gains: a 1M-doc corpus with no min_df can end up with millions of typos and one-off tokens as features.
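The effect of `sublinear_tf`, `min_df`, and `max_df` can be sketched without sklearn. The helpers below are hypothetical. They assume an integer `min_df` is an absolute document count and a float `max_df` is a fraction of the corpus, which is how the comments in the configuration above read them.

```python
import math
from collections import Counter

def sublinear_tf(count):
    """1 + ln(count) for count > 0, else 0: the damping sublinear_tf=True applies."""
    return 1 + math.log(count) if count > 0 else 0.0

def prune_vocab(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms found in at least `min_df` docs and at most a `max_df` fraction of docs."""
    n = len(tokenized_docs)
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df)

damped = [round(sublinear_tf(c), 2) for c in (1, 10, 100)]
print(damped)  # [1.0, 3.3, 5.61]

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat and the dog".split(),
]
kept = prune_vocab(docs, min_df=2, max_df=0.9)
print(kept)  # ['cat', 'dog', 'on', 'sat']
```

A hundredfold jump in raw count becomes less than a sixfold jump in weight. The pruning step drops "the" (it appears in every document) along with the one-off terms "mat", "rug", and "and".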
7. TF-IDF + Linear Model: The 5-Minute Baseline
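from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
# X_train_text / X_test_text: lists of raw strings; y_train / y_test: labels
pipe.fit(X_train_text, y_train)
print(pipe.score(X_test_text, y_test))  # mean accuracy on the test split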
A handful of lines of code; on many text-classification benchmarks (sentiment, topic, spam) this baseline often lands within roughly 2-5 percentage points of fine-tuned BERT while running orders of magnitude faster at inference. Build this baseline before reaching for anything heavier.
8. BM25: TF-IDF's Production Cousin
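TF-IDF for classification; BM25 for retrieval. BM25 (Robertson and colleagues, developed for the Okapi system in the 1990s on top of the 1976 Robertson-Spärck Jones probabilistic relevance model; still the standard lexical baseline) saturates the term frequency contribution and length-normalizes more carefully: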
BM25(q, d) = Σ_{t ∈ q} idf(t) · (tf · (k₁+1)) / (tf + k₁(1 − b + b·|d|/avgdl))
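Here tf is the count of t in d, |d| is the document length, avgdl is the average document length in the corpus, k₁ (commonly around 1.2) controls how quickly term frequency saturates, and b (commonly 0.75) controls the strength of length normalization. BM25 is the default or a built-in lexical ranking function in Elasticsearch, OpenSearch, Solr, and Vespa. In hybrid retrieval (Section 5 of the LLM-GenAI course), BM25 is the lexical half paired with dense embeddings.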
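The scoring formula is short enough to implement directly. The sketch below is a brute-force version for illustration only; production engines compute the same quantity over an inverted index. The lesson's formula leaves idf(t) unspecified, so the code assumes the non-negative variant ln(1 + (N − df + 0.5) / (df + 0.5)), with k₁ = 1.2 and b = 0.75 as assumed defaults. The function name `bm25_scores` is hypothetical.

```python
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            df = sum(1 for d in docs if term in d)
            # Assumed idf variant; it stays non-negative even for very common terms.
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            # Longer-than-average documents get a larger denominator.
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat and the dog".split(),
]
scores = bm25_scores("cat mat".split(), docs)
print([round(s, 2) for s in scores])  # approximately [1.42, 0.0, 0.49]
```

D1 matches both query terms and ranks first, D3 matches only "cat", and D2 matches neither. Raising `tf` for a term gives diminishing returns because the fraction is bounded above by k₁ + 1, which is the saturation behavior described above.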
9. Where TF-IDF Still Beats Transformers in 2026
| Use case | Why TF-IDF wins |
|---|---|
| High-volume spam detection | ~1ms inference; cost; auditable rules |
| Lexical retrieval at scale | BM25 over inverted index; sub-millisecond |
| Rare-term queries | Dense models can miss exact technical terms |
| Low-resource languages | Often no adequate pretrained model exists |
| Regulated environments | Coefficients directly inspectable |
| Bootstrapping new datasets | Trains in seconds; iterate on data instead of model |