Introduction to NLP and Text Data
Natural Language Processing (NLP) is the discipline of building systems that work with human language: text, speech, dialog, documents. It is one of the oldest applied AI fields (the 1954 Georgetown machine translation experiment predates the term "machine learning" itself) and the one that has changed most over the last decade, moving from rule-heavy classical pipelines to transformer-based models that perform tasks no one explicitly programmed them to do. This course is the practitioner's path through the pieces of NLP that still matter in 2026: preprocessing, classical models, embeddings, sequence models, transformers, and the applications you'll actually build.
1. What NLP Actually Covers
| Family | Examples |
|---|---|
| Classification | Spam detection, sentiment analysis, topic labeling |
| Sequence labeling | Named-entity recognition, part-of-speech tagging |
| Generation | Translation, summarization, question answering, chat |
| Retrieval | Semantic search, RAG, document deduplication |
| Structured extraction | Form filling, contract analysis, knowledge-graph construction |
| Speech / multimodal | ASR, TTS, vision-language models |
This course covers the first four families end-to-end and briefly tours the fifth. Speech and multimodal get their own dedicated courses.
2. Why Language Is Hard
Natural language is the messiest data type ML touches. Five reasons it stays hard even after decades of work:
- Ambiguity: "I saw the man with the telescope" has two parses (who has the telescope?); a model has to disambiguate from context.
- Compositionality: "great" and "not great" have opposite meanings even though they differ by one word; a unigram bag-of-words discards word order, so it cannot tell which word "not" modifies.
- Vocabulary size: dictionary counts put English at roughly 170,000 words in current use, and a single Reddit thread can introduce ten new tokens. Open vocabulary is the default.
- Cultural and contextual meaning: "this is sick" is praise on Twitter and a complaint at the doctor's office.
- Multilingual reality: there are ~7,000 living languages; the web has substantial data for only ~100, and most NLP research has been English-centric.
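The compositionality point can be made concrete with a small sketch. It uses only the Python standard library; the two example sentences and the helper names `tokenize`, `bag_of_words`, and `bigrams` are illustrative assumptions, not part of any library. Two sentences with opposite meanings produce identical unigram counts, while counting adjacent word pairs recovers part of the lost order.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text: str) -> Counter:
    """Unigram counts: word order is discarded."""
    return Counter(tokenize(text))


def bigrams(text: str) -> Counter:
    """Counts of adjacent token pairs: keeps local word order."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


a = "the plot was good, not bad"
b = "the plot was bad, not good"

# Same words, opposite meaning: the unigram representations are identical.
print(bag_of_words(a) == bag_of_words(b))  # True

# Bigrams such as ("not", "bad") vs ("not", "good") tell them apart.
print(bigrams(a) == bigrams(b))  # False
```

Adding n-grams is only a partial fix, since it captures local order but not longer-range structure.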
3. The Four Eras of NLP
| Era | Years | Approach |
|---|---|---|
| Rule-based | 1950s-1990s | Hand-written grammars; expert systems |
| Statistical | 1990s-2010s | Naive Bayes, HMMs, CRFs, n-grams; bag-of-words |
| Neural | 2013-2018 | Word embeddings, RNNs/LSTMs, seq2seq with attention |
| Transformer / LLM | 2018-present | BERT, GPT, in-context learning, foundation models |
No era fully replaced the one before it: production NLP in 2026 still uses TF-IDF for low-latency retrieval, CRFs for structured extraction, and rule-based systems where the specification is precise. The right choice depends on the problem.
4. The Deep-Learning Inflection Points
- 2013 — Word2Vec (Mikolov et al.): "king − man + woman ≈ queen". Dense, semantic word embeddings.
- 2014 — Sequence-to-Sequence (Sutskever et al.): encoder-decoder LSTMs for translation.
- 2015 — Attention (Bahdanau et al.): removed the seq2seq information bottleneck.
- 2017 — Transformer (Vaswani et al.): replaced recurrence with attention. The architectural shift that made everything else possible.
- 2018 — BERT & GPT-1: pretraining + fine-tuning became the standard recipe.
- 2020 — GPT-3: in-context learning; prompting replaces fine-tuning for many tasks.
- 2022-23 — ChatGPT, GPT-4, Claude, LLaMA: LLMs are general-purpose tools; NLP merges with "applied AI".
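The "king − man + woman ≈ queen" arithmetic from the 2013 entry can be sketched with toy vectors. Everything here is an assumption for illustration: the 3-dimensional vectors are hand-built so the axes read roughly as (royalty, maleness, femaleness), and `cosine` and `analogy` are hypothetical helper names. Real Word2Vec embeddings are learned from text and have far more dimensions.

```python
import math

# Hypothetical hand-built embeddings; real ones are learned, not designed.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates mirrors common practice for analogy queries, where the nearest neighbor of the target would otherwise often be one of the inputs.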
5. The Production NLP Stack in 2026
┌──────────────────────────┐
│ Tokenizer + preprocessor │ spaCy, HF tokenizers
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Representation           │ TF-IDF, embeddings, transformer hidden states
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Task model               │ classifier, NER, generator, retriever
│ (classical or fine-tuned │
│ transformer or LLM API)  │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Postprocessing           │ threshold, confidence calibration, output schema
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Serving + monitoring     │ FastAPI, drift detection, eval set
└──────────────────────────┘
The shape is the same whether the task model is logistic regression on TF-IDF features or a 70B-parameter LLM. Most production NLP is still in the middle: fine-tuned BERT-class models for classification / NER, with classical methods for retrieval and LLMs for the open-ended tasks they uniquely enable.
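As a rough sketch of that shape, the first four stages can be written as swappable functions. This is a toy under stated assumptions, not a production design: the function names, the `SPAM_WEIGHTS` table, and the `BIAS` value are all hypothetical, and the hand-set weights stand in for a trained model. The serving and monitoring stage is left out.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Stage 1: tokenizer + preprocessor (lowercase, keep letter runs)."""
    return re.findall(r"[a-z']+", text.lower())


def represent(tokens):
    """Stage 2: representation (here, bag-of-words counts)."""
    return Counter(tokens)


# Hypothetical hand-set weights standing in for a trained task model.
SPAM_WEIGHTS = {"free": 1.5, "winner": 2.0, "prize": 1.5,
                "meeting": -1.5, "report": -1.0}
BIAS = -1.0


def task_model(features):
    """Stage 3: task model (a logistic-regression-style scorer)."""
    z = BIAS + sum(SPAM_WEIGHTS.get(tok, 0.0) * n for tok, n in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def postprocess(prob, threshold=0.5):
    """Stage 4: postprocessing (threshold plus a fixed output schema)."""
    label = "spam" if prob >= threshold else "ham"
    return {"label": label, "score": round(prob, 3)}


def pipeline(text):
    """Chain the stages; any one can be swapped without touching the others."""
    return postprocess(task_model(represent(preprocess(text))))


print(pipeline("You are a WINNER! Claim your free prize"))
print(pipeline("Agenda for the Monday meeting and the quarterly report"))
```

Replacing `represent` with an embedding lookup, or `task_model` with a call to a fine-tuned transformer, changes one stage and leaves the rest of the chain intact.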
6. When Classical NLP Still Wins
7. The Course Map
- Section 1 (this one): tokenization, bag-of-words, TF-IDF — the foundations every NLP pipeline still uses.
- Section 2: classical models — Naive Bayes, CRFs, LDA. Workhorses for the past 25 years.
- Section 3: word embeddings — Word2Vec, GloVe, FastText; a semantic-search project.
- Section 4: sequence models — RNNs, LSTMs, seq2seq, attention; a neural machine translator.
- Section 5: transformers — BERT, GPT, Hugging Face workflows.
- Section 6: applications — QA, summarization, multi-label classification, ethics, capstone project.
8. Setup Check
pip install nltk==3.9.1 spacy==3.7.5 \
scikit-learn==1.5.2 transformers==4.46.0 \
datasets==3.0.0 sentencepiece==0.2.0
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt_tab')"
Six packages cover this whole course:
- nltk: classical preprocessing, tokenizers, stemmers.
- spaCy: production-grade NLP pipeline (POS, NER, parsing).
- scikit-learn: TF-IDF, Naive Bayes, logistic regression, evaluation.
- transformers: Hugging Face's library for loading, fine-tuning, and running pretrained transformer models.
- datasets: preprocessed NLP datasets with one-line loading.
- sentencepiece: subword tokenizer library (BPE and unigram) used by many LLMs, for example T5 and the original LLaMA.
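One way to confirm the install worked is to compare installed versions against the pins in the pip command. The sketch below uses only the standard library's `importlib.metadata`; the `EXPECTED` table and the `check_versions` helper are hypothetical names introduced here, with the version numbers copied from the command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command in this section.
EXPECTED = {
    "nltk": "3.9.1",
    "spacy": "3.7.5",
    "scikit-learn": "1.5.2",
    "transformers": "4.46.0",
    "datasets": "3.0.0",
    "sentencepiece": "0.2.0",
}


def check_versions(expected=EXPECTED):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == pinned)
    return report


for name, (installed, ok) in check_versions().items():
    if ok:
        status = "ok"
    elif installed is None:
        status = "missing"
    else:
        status = "version differs"
    print(f"{name:<14} {installed or '-':<10} {status}")
```

A differing version is not necessarily a problem, but matching the pins makes it more likely that the course's examples behave as written.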