Bag-of-Words and TF-IDF Representations

Once text is tokenized, it has to become numbers. The two oldest representations, bag-of-words (BoW) and TF-IDF, are still the production default for retrieval, high-volume text classification, and any pipeline where latency or cost rules out neural approaches. This lesson covers what they actually compute, why TF-IDF usually beats BoW, and where both still beat transformers in 2026.

1. The Bag-of-Words Idea

Represent a document as a vector of word counts, ignoring word order. Three documents:

code

D1: "the cat sat on the mat"
D2: "the dog sat on the rug"
D3: "the cat and the dog"

Vocabulary across all three: {the, cat, sat, on, mat, dog, rug, and} (8 words). Each document becomes a count vector of length 8:

code

      the cat sat on mat dog rug and
D1:    2   1   1   1  1   0   0   0
D2:    2   0   1   1  0   1   1   0
D3:    2   1   0   0  0   1   0   1

Two assumptions are baked in: word order is ignored ("dog bites man" = "man bites dog"), and only identical surface forms match ("ran" ≠ "running"). Stemming / lemmatization (Lesson 2) addresses the second; n-grams (Section 5) recover part of the first.
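The count matrix above can be reproduced in a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and a vocabulary kept in first-seen order; the helper name bow_vector is illustrative, not a library function.

```python
from collections import Counter

docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the rug",
    "D3": "the cat and the dog",
}

# Build the vocabulary in first-seen order so the columns match the table.
vocab = []
for text in docs.values():
    for token in text.split():
        if token not in vocab:
            vocab.append(token)

def bow_vector(text, vocab):
    """Count how often each vocabulary term occurs in `text`."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

bow = {name: bow_vector(text, vocab) for name, text in docs.items()}
for name, vector in bow.items():
    print(name, vector)
# D1 [2, 1, 1, 1, 1, 0, 0, 0]
# D2 [2, 0, 1, 1, 0, 1, 1, 0]
# D3 [2, 1, 0, 0, 0, 1, 0, 1]

# Word order is lost: both sentences map to the same vector.
order_vocab = ["dog", "bites", "man"]
same = bow_vector("dog bites man", order_vocab) == bow_vector("man bites dog", order_vocab)
print(same)  # True
```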

2. The Limitation BoW Hits Immediately

Common words ("the", "is", "of") dominate count vectors but carry little information. A document about cats and a document about dogs both have lots of "the" — the model can't easily distinguish them on count alone. Two fixes:

Stop-word removal — drop high-frequency words altogether. Crude.
TF-IDF weighting — down-weight common words proportional to how common they are across the corpus. Better.
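To see how much the common words distort similarity, the sketch below compares the cat document and the dog document with and without a stop list. The three-word stop list is a hand-picked assumption for this toy corpus, not a standard list.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

cat_doc = "the cat sat on the mat".split()
dog_doc = "the dog sat on the rug".split()

# Hand-picked toy stop list, only for this illustration.
STOP_WORDS = {"the", "on", "and"}

raw = cosine(Counter(cat_doc), Counter(dog_doc))
filtered = cosine(
    Counter(t for t in cat_doc if t not in STOP_WORDS),
    Counter(t for t in dog_doc if t not in STOP_WORDS),
)
print(round(raw, 2), round(filtered, 2))  # 0.75 0.33
```

On raw counts the two documents look 75% similar, mostly because both contain "the" twice; once the common words are gone, only "sat" is shared.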

3. TF-IDF in One Equation

code

tfidf(t, d, D) = tf(t, d) · idf(t, D)

tf(t, d)  = count(t in d) / |d|             # term frequency
idf(t, D) = log( |D| / (1 + |{d : t ∈ d}|) ) # inverse doc frequency

Two pieces:

TF rewards words that appear often in the document — they're probably about the topic.
IDF penalizes words that appear in many documents — "the" appears everywhere; "phylogenetic" doesn't. The "+1" in the denominator avoids division by zero for unseen-in-corpus terms.

The product gives words that are locally common and globally rare a high score — exactly the signal a topic / sentiment / spam classifier wants.
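As a worked example, here is the formula exactly as written above, applied to the three toy documents from Section 1. The sketch assumes the natural logarithm (the equation does not fix a base) and whitespace tokenization.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat and the dog".split(),
]

def tf(term, doc):
    """Term frequency: count of the term divided by document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Inverse document frequency with +1 in the denominator."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ("the", "cat", "mat"):
    print(term, round(tfidf(term, corpus[0], corpus), 4))
# the -0.0959
# cat 0.0
# mat 0.0676
```

On a corpus this small the "+1" has a visible side effect: a term found in all but one document ("cat") scores exactly zero, and a term found in every document ("the") goes slightly negative. On realistic corpus sizes the effect is negligible for all but the most common terms.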

4. TF-IDF in Scikit-Learn

code

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat and the dog",
]

vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 1))
X = vec.fit_transform(docs)        # sparse matrix (3 docs x V terms)

print(vec.get_feature_names_out())
# ['cat' 'dog' 'mat' 'rug' 'sat']
print(X.toarray().round(2))
# [[0.52 0.   0.68 0.   0.52]
#  [0.   0.52 0.   0.68 0.52]
#  [0.71 0.71 0.   0.   0.  ]]

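This configuration combines lowercasing (on by default), English stop-word removal (opt-in; the default is stop_words=None), and L2-normalized TF-IDF (the default norm="l2"). Note that sklearn's weighting differs from the formula in Section 3: tf is the raw count, and with the default smooth_idf=True the idf is ln((1 + n) / (1 + df(t))) + 1. The output is sparse: most documents touch only a tiny fraction of the vocabulary, and storing zeros wastes memory. Keep the matrix sparse throughout downstream pipelines.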
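To see where the weights come from, the following pure-Python sketch recomputes them by hand. It assumes sklearn's documented defaults (raw-count tf, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization per row) and starts from the tokens that survive English stop-word removal in the example above.

```python
import math
from collections import Counter

# Tokens left after stop-word removal ("the", "on", "and" are dropped).
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "rug"],
    ["cat", "dog"],
]
vocab = sorted({t for d in docs for t in d})   # ['cat', 'dog', 'mat', 'rug', 'sat']
n_docs = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def smooth_idf(term):
    """Smoothed idf as documented for TfidfVectorizer(smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    """Raw count times idf, then L2-normalize the row."""
    counts = Counter(doc)
    weights = [counts[t] * smooth_idf(t) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [round(w / norm, 2) for w in weights]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print(row)
# [0.52, 0.0, 0.68, 0.0, 0.52]
# [0.0, 0.52, 0.0, 0.68, 0.52]
# [0.71, 0.71, 0.0, 0.0, 0.0]
```

"mat" and "rug" each appear in one document, so they outweigh "cat", "dog", and "sat", which each appear in two.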

5. N-Grams: Cheap Word-Order Recovery

code

vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
X = vec.fit_transform(docs)
print(vec.get_feature_names_out()[:10])
# ['and', 'and the', 'cat', 'cat and', 'cat sat', 'dog', 'dog sat', 'mat', 'on', 'on the']

Bigrams ("not great" vs "great") capture some local word order at the cost of vocabulary explosion (up to V × V distinct bigrams in the worst case, although real corpora realize only a fraction of them). In practice unigrams + bigrams is the default sweet spot; trigrams usually add little accuracy while inflating the feature count much further.
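A short sketch of what n-gram extraction does, assuming whitespace-tokenized input. The function name word_ngrams is illustrative; it mimics the effect of ngram_range rather than reproducing sklearn's implementation.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

a = word_ngrams("dog bites man".split())
b = word_ngrams("man bites dog".split())

print(a)  # ['dog', 'bites', 'man', 'dog bites', 'bites man']
print(sorted(set(a) - set(b)))  # ['bites man', 'dog bites']
print(sorted(set(b) - set(a)))  # ['bites dog', 'man bites']
```

The two sentences share every unigram, so a unigram-only model cannot tell them apart; the bigrams are what separate them.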

6. The Production Sweet Spot

code

vec = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    min_df=5,                    # drop terms in < 5 docs (typos, rare)
    max_df=0.9,                  # drop terms in > 90% of docs (junk)
    sublinear_tf=True,           # 1 + log(tf) instead of raw tf
    max_features=50_000,         # cap vocab
)

Seven parameters cover most TF-IDF tuning. sublinear_tf often helps because a term's importance does not grow linearly with its count: the twentieth occurrence adds far less evidence than the first, and log scaling damps it. min_df is the most-skipped parameter and often one of the largest sources of accuracy lift: a 1M-doc corpus with no min_df ends up with millions of typos and one-off tokens as features.
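To make the pruning and scaling knobs concrete, here is a minimal pure-Python sketch of the behavior behind min_df, max_df, and sublinear_tf. The helper names are hypothetical and this is not sklearn's implementation; it only follows the documented semantics (an integer min_df is an absolute document count, a float max_df is a proportion of documents).

```python
import math
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df
    )

def sublinear_tf(count):
    """Log-damped term frequency: 1 + ln(count) for count > 0."""
    return 1 + math.log(count) if count > 0 else 0.0

toy = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat and the dog".split(),
]
kept = prune_vocabulary(toy, min_df=2, max_df=0.9)
print(kept)  # ['cat', 'dog', 'on', 'sat']
print(round(sublinear_tf(1), 2), round(sublinear_tf(20), 2))  # 1.0 4.0
```

"mat", "rug", and "and" fall below min_df, "the" exceeds max_df, and a term repeated 20 times ends up weighted about 4 times a single occurrence instead of 20 times.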

7. TF-IDF + Linear Model: The 5-Minute Baseline

code

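from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf",   LogisticRegression(max_iter=1000)),
])

# X_train_text / X_test_text: lists of raw strings; y_train / y_test: labels
pipe.fit(X_train_text, y_train)
print(pipe.score(X_test_text, y_test))   # mean accuracy on the test split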

A handful of lines of code; on many text-classification benchmarks (sentiment, topic, spam) this baseline typically lands within roughly 2-5 percentage points of fine-tuned BERT, with inference that can be orders of magnitude faster. Build this baseline before reaching for anything heavier.

8. BM25: TF-IDF's Production Cousin

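TF-IDF for classification; BM25 for retrieval. BM25 (developed by Robertson and colleagues for the Okapi system in the 1990s, building on the Robertson and Spärck Jones relevance weighting of 1976; still the standard) saturates the term-frequency contribution and length-normalizes more carefully: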

code

BM25(q, d) = Σ_{t ∈ q} idf(t) · (tf · (k₁+1)) / (tf + k₁(1 − b + b·|d|/avgdl))

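Here tf is the count of t in d, |d| is the document length, avgdl is the average document length in the corpus, and k₁ and b are tuning constants. BM25 is the default ranking function in the Lucene-based engines (Elasticsearch, OpenSearch, Solr) and is available in Vespa, which makes it the de facto standard wherever search does not run purely on neural embeddings. In hybrid retrieval (Section 5 of the LLM-GenAI course), BM25 is the lexical half paired with dense embeddings.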
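The scoring formula can be sketched in pure Python as follows. Two things here are assumptions, since the equation leaves them open: the idf variant (the commonly used ln(1 + (N − df + 0.5) / (df + 0.5)) form) and the constants k1=1.2, b=0.75, which are common defaults. The function name bm25_scores is illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query with BM25."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    df = Counter()
    for d in tokenized_docs:
        df.update(set(d))  # document frequency per term

    scores = []
    for d in tokenized_docs:
        counts = Counter(d)
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for t in query_tokens:
            tf = counts[t]
            if tf == 0:
                continue  # absent terms contribute nothing
            # Assumed idf variant; always non-negative.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat and the dog".split(),
]
scores = bm25_scores("cat mat".split(), corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

D1 matches both query terms and ranks first, D3 matches only "cat", and D2 matches neither and scores zero. Because tf appears in both numerator and denominator, repeating a term many times pushes its contribution toward a ceiling of idf · (k1 + 1) rather than growing without bound.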

9. Where TF-IDF Still Beats Transformers in 2026

Use case	Why TF-IDF wins
High-volume spam detection	~1ms inference; cost; auditable rules
Lexical retrieval at scale	BM25 over inverted index; sub-millisecond
Rare-term queries	Dense models miss exact technical terms
Low-resource languages	Pretrained models are often missing or weak
Regulated environments	Coefficients directly inspectable
Bootstrapping new datasets	Trains in seconds; iterate on data instead of model

10. The Mental Model
