Introduction to NLP and Text Data


Natural Language Processing (NLP) is the discipline of building systems that work with human language: text, speech, dialog, documents. It is one of the oldest applied AI fields (the 1954 Georgetown-IBM machine translation experiment predates the term "machine learning" itself) and arguably the most-changed field of the last decade, moving from rule-heavy classical pipelines to transformer-based models that perform tasks no one explicitly programmed them to do. This course is the practitioner's path through the pieces of NLP that still matter in 2026: preprocessing, classical models, embeddings, sequence models, transformers, and the applications you'll actually build.

1. What NLP Actually Covers

Family	Examples
Classification	Spam detection, sentiment analysis, topic labeling
Sequence labeling	Named-entity recognition, part-of-speech tagging
Generation	Translation, summarization, question answering, chat
Retrieval	Semantic search, RAG, document deduplication
Structured extraction	Form filling, contract analysis, knowledge-graph construction
Speech / multimodal	ASR, TTS, vision-language models

This course covers the first four families end-to-end and briefly tours the fifth. Speech and multimodal get their own dedicated courses.

2. Why Language Is Hard

Natural language is the messiest data type ML touches. Five reasons it stays hard even after decades of work:

Ambiguity — "I saw the man with the telescope" has two parses; a model has to disambiguate from context.
Compositionality — "not great" and "great" have opposite meanings from a one-word change; bag-of-words throws this away.
Vocabulary size — English has ~170,000 word forms in current use; a single Reddit thread can introduce ten new tokens. Open vocabulary is the default.
Cultural and contextual meaning — "this is sick" is praise on Twitter and a complaint at the doctor's office.
Multilingual reality — there are ~7,000 living languages; the web has data for ~100; most NLP research has been English-centric.
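The compositionality point can be made concrete with a minimal sketch using only the Python standard library. The two example sentences and the whitespace tokenizer are assumptions chosen for illustration, not a recommended preprocessing step. The sketch shows that a bag-of-words count cannot tell the sentences apart, while counting adjacent word pairs (bigrams) recovers some of the lost order.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


def bag_of_bigrams(text: str) -> Counter:
    """Count adjacent token pairs, which keeps a little local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


# Illustrative sentences with opposite sentiment but identical word counts.
positive = "the food was good not bad"
negative = "the food was bad not good"

same_unigrams = bag_of_words(positive) == bag_of_words(negative)
same_bigrams = bag_of_bigrams(positive) == bag_of_bigrams(negative)

print(same_unigrams)  # True: the unigram representation is identical
print(same_bigrams)   # False: pairs such as ("was", "good") differ
```

Bigrams help here, but they also multiply the vocabulary size, which is one reason later sections move to learned representations.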

3. The Four Eras of NLP

Era	Years	Approach
Rule-based	1950s-1990s	Hand-written grammars; expert systems
Statistical	1990s-2010s	Naive Bayes, HMMs, CRFs, n-grams; bag-of-words
Neural	2013-2018	Word embeddings, RNNs/LSTMs, seq2seq with attention
Transformer / LLM	2018-present	BERT, GPT, in-context learning, foundation models

No era fully replaced the one before it: production NLP in 2026 still uses TF-IDF for low-latency retrieval, CRFs for structured extraction, and rule-based systems where the specification is precise. The right choice depends on the problem.
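As a rough illustration of why TF-IDF retrieval remains cheap and fast, here is a minimal standard-library sketch. The three documents, the whitespace tokenizer, and the `log(N / df) + 1` weighting are assumptions made for this example; real libraries use slightly different smoothing and normalization, so treat the scores as illustrative only.

```python
import math
from collections import Counter

# Hypothetical help-center snippets used as the searchable corpus.
DOCS = [
    "reset your account password",
    "track the status of your order",
    "update the billing address on your account",
]


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency: rarer terms get larger weights."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse vector of term count times IDF; terms unseen in the corpus are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs, idf):
    """Return the document whose TF-IDF vector is closest to the query."""
    q = tfidf_vector(query, idf)
    return max(docs, key=lambda d: cosine(q, tfidf_vector(d, idf)))


IDF = build_idf(DOCS)
print(search("change my password", DOCS, IDF))
print(search("order status", DOCS, IDF))
```

Everything here is dictionary lookups and a few multiplications, which is why this style of retrieval runs comfortably on a single CPU.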

4. The Deep-Learning Inflection Points

2013 — Word2Vec (Mikolov et al.): "king − man + woman ≈ queen". Dense, semantic word embeddings.
2014 — Sequence-to-Sequence (Sutskever et al.): encoder-decoder LSTMs for translation.
2015 — Attention (Bahdanau et al.): removed the seq2seq information bottleneck.
2017 — Transformer (Vaswani et al.): replaced recurrence with attention. The architectural shift that made everything else possible.
2018 — BERT & GPT-1: pretraining + fine-tuning became the standard recipe.
2020 — GPT-3: in-context learning; prompting replaces fine-tuning for many tasks.
2022-23 — ChatGPT, GPT-4, Claude, LLaMA: LLMs are general-purpose tools; NLP merges with "applied AI".
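The Word2Vec analogy above is plain vector arithmetic followed by a nearest-neighbor lookup. The sketch below uses hand-made two-dimensional vectors (one axis loosely standing for "royalty", the other for "gender"); these numbers are invented for illustration and are not real Word2Vec embeddings, which typically have hundreds of learned dimensions.

```python
import math

# Toy 2-D "embeddings", invented for this example: (royalty, gender).
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.0),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates follows the usual convention for this kind of analogy test; without it, the nearest neighbor is often one of the inputs.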

5. The Production NLP Stack in 2026

┌──────────────────────────┐
│ Tokenizer + preprocessor │   spaCy, HF tokenizers
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Representation           │   TF-IDF, embeddings, transformer hidden states
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Task model               │   classifier, NER, generator, retriever
│ (classical or fine-tuned │
│  transformer or LLM API) │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Postprocessing           │   threshold, confidence calibration, output schema
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Serving + monitoring     │   FastAPI, drift detection, eval set
└──────────────────────────┘

The shape is the same whether the task model is logistic regression on TF-IDF features or a 70B-parameter LLM. Most production NLP is still in the middle: fine-tuned BERT-class models for classification / NER, with classical methods for retrieval and LLMs for the open-ended tasks they uniquely enable.
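To show how the five stages in the diagram fit together, here is a minimal standard-library sketch with one function per stage. The token weights, bias, and threshold are hypothetical hand-set values standing in for a trained model, and `predict` is an assumed name for the function a serving layer would wrap; none of this is taken from a real system.

```python
import math
import re
from collections import Counter

# Hypothetical hand-set weights standing in for a trained linear model.
WEIGHTS = {"free": 1.5, "winner": 2.0, "prize": 1.2, "meeting": -1.5, "invoice": -0.5}
BIAS = -1.0


def preprocess(text):
    """Stage 1: lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def represent(tokens):
    """Stage 2: bag-of-words counts as the feature representation."""
    return Counter(tokens)


def task_model(features):
    """Stage 3: linear score squashed to a probability with a sigmoid."""
    score = BIAS + sum(WEIGHTS.get(tok, 0.0) * n for tok, n in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def postprocess(prob, threshold=0.5):
    """Stage 4: apply a threshold and emit a fixed output schema."""
    return {"label": "spam" if prob >= threshold else "ham", "score": round(prob, 3)}


def predict(text):
    """Stage 5 entry point: what a serving layer would expose and monitor."""
    return postprocess(task_model(represent(preprocess(text))))


print(predict("You are a WINNER! Claim your free prize"))
print(predict("Agenda for the meeting tomorrow"))
```

Swapping `task_model` for a fine-tuned transformer or an LLM call would leave the other four functions, and their contracts, largely unchanged.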

6. When Classical NLP Still Wins

Note: Don't Reach for an LLM First

Classical methods earn their place in 2026 production systems for several reasons:

Cost: a TF-IDF + logistic regression spam classifier costs roughly $0.0001 per million inferences; an LLM call costs roughly 1000-10000× more per inference.
Latency: sub-millisecond inference on a single CPU versus hundreds of milliseconds for LLM APIs.
Determinism: same input → same output, every time. LLMs sample by default, and even greedy decoding is not always reproducible across hardware or API versions.
Auditability: coefficients of a linear model on TF-IDF tokens are directly inspectable; LLM behavior is not.
Compliance: many regulated industries require explainable models. Classical methods qualify; most LLMs don't.

Reach for LLMs when the problem is open-ended or low-volume, or when the user-facing value justifies the cost. Otherwise, classical approaches still ship.
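The determinism and auditability points can be sketched with a tiny hand-rolled logistic regression on bag-of-words counts. This is a stand-in written with the standard library so it runs anywhere, not scikit-learn's implementation; the six training messages, the learning rate, and the epoch count are all invented for illustration.

```python
import math
from collections import Counter

# Invented toy training set: (message, label) with 1 = spam, 0 = not spam.
TRAIN = [
    ("win a free prize now", 1),
    ("free prize inside claim now", 1),
    ("claim your free reward", 1),
    ("meeting notes attached", 0),
    ("see you at the meeting", 0),
    ("notes from the project review", 0),
]


def featurize(text):
    return Counter(text.lower().split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Plain stochastic gradient descent; zero init and fixed order, so no randomness."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            z = bias + sum(weights.get(t, 0.0) * n for t, n in feats.items())
            error = label - sigmoid(z)
            bias += lr * error
            for t, n in feats.items():
                weights[t] = weights.get(t, 0.0) + lr * error * n
    return weights, bias


def top_tokens(weights, k=3):
    """The audit step: list the tokens that push hardest toward 'spam'."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


weights, bias = train(TRAIN)
print(top_tokens(weights))
print(train(TRAIN) == (weights, bias))  # True: retraining reproduces the same model
```

Each coefficient maps to one token, so a reviewer can read off why a message was flagged, which is the property the compliance point relies on.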

7. The Course Map

Section 1 (this one): tokenization, bag-of-words, TF-IDF — the foundations every NLP pipeline still uses.
Section 2: classical models — Naive Bayes, CRFs, LDA. Workhorses for the past 25 years.
Section 3: word embeddings — Word2Vec, GloVe, FastText; a semantic-search project.
Section 4: sequence models — RNNs, LSTMs, seq2seq, attention; a neural machine translator.
Section 5: transformers — BERT, GPT, Hugging Face workflows.
Section 6: applications — QA, summarization, multi-label classification, ethics, capstone project.

8. Setup Check

pip install nltk==3.9.1 spacy==3.7.5 \
            scikit-learn==1.5.2 transformers==4.46.0 \
            datasets==3.0.0 sentencepiece==0.2.0

python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt_tab')"

Six packages cover this whole course:

nltk: classical preprocessing, tokenizers, stemmers.
spaCy: production-grade NLP pipeline (POS, NER, parsing).
scikit-learn: TF-IDF, Naive Bayes, logistic regression, evaluation.
transformers: Hugging Face's library for loading, fine-tuning, and running pretrained models.
datasets: preprocessed NLP datasets with one-line loading.
sentencepiece: subword tokenizer library used by many LLMs.
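To confirm the installation matches the pinned versions, a small check along these lines may help. It is a sketch that relies only on the standard library's `importlib.metadata`; the `EXPECTED` table simply repeats the pins from the pip command, and `check_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Pins copied from the pip install command in this section.
EXPECTED = {
    "nltk": "3.9.1",
    "spacy": "3.7.5",
    "scikit-learn": "1.5.2",
    "transformers": "4.46.0",
    "datasets": "3.0.0",
    "sentencepiece": "0.2.0",
}


def check_versions(expected):
    """Map each package to (wanted, found, matches); found is None if not installed."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_versions(EXPECTED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name:15s} wanted={wanted:8s} found={found} {status}")
```

A mismatch is not necessarily fatal, but later sections assume these versions, so pinning them avoids surprises from API changes.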

9. The Mental Model
