AIMaks

Introduction to NLP and Text Data

Natural Language Processing (NLP) is the discipline of building systems that work with human language: text, speech, dialog, and documents. It is one of the oldest applied AI fields (the 1954 Georgetown machine translation experiment predates the term "machine learning" itself) and arguably the one that has changed most in the last decade, moving from rule-heavy classical pipelines to transformer-based models that perform tasks no one explicitly programmed them to do. This course is the practitioner's path through the pieces of NLP that still matter in 2026: preprocessing, classical models, embeddings, sequence models, transformers, and the applications you'll actually build.

1. What NLP Actually Covers

Family                | Examples
Classification        | Spam detection, sentiment analysis, topic labeling
Sequence labeling     | Named-entity recognition, part-of-speech tagging
Generation            | Translation, summarization, question answering, chat
Retrieval             | Semantic search, RAG, document deduplication
Structured extraction | Form filling, contract analysis, knowledge-graph construction
Speech / multimodal   | ASR, TTS, vision-language models

This course covers the first four families end-to-end and briefly tours the fifth. Speech and multimodal get their own dedicated courses.

2. Why Language Is Hard

Natural language is the messiest data type ML touches. Five reasons it stays hard even after decades of work:

  • Ambiguity: "I saw the man with the telescope" has two parses (who has the telescope?); a model has to disambiguate from context.
  • Compositionality: "not great" and "great" differ by one word yet have opposite meanings; a bag-of-words representation records that "not" occurred but not what it negates.
  • Vocabulary size: English has roughly 170,000 words in current use by dictionary count, and a single Reddit thread can introduce ten new tokens. Open vocabulary is the default.
  • Cultural and contextual meaning: "this is sick" is praise on Twitter and a complaint at the doctor's office.
  • Multilingual reality: there are ~7,000 living languages; the web has substantial data for ~100; most NLP research has been English-centric.
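The compositionality and open-vocabulary points can be made concrete with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and made-up example sentences; `bag_of_words` is a hypothetical helper, not a library function.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(text.lower().split())


a = bag_of_words("the plot was good not bad")
b = bag_of_words("the plot was bad not good")

# Opposite sentiment, identical representation: the counts cannot tell
# which adjective "not" applies to.
print(a == b)  # True

# Open vocabulary: any token outside the known vocabulary has no feature at all.
vocabulary = set(a)
oov = [tok for tok in "the plot was mid".split() if tok not in vocabulary]
print(oov)  # ['mid']
```

Both problems come up again later in the course: n-grams and sequence models recover some word order, and subword tokenizers reduce the out-of-vocabulary problem.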

3. The Four Eras of NLP

Era               | Years        | Approach
Rule-based        | 1950s-1990s  | Hand-written grammars; expert systems
Statistical       | 1990s-2010s  | Naive Bayes, HMMs, CRFs, n-grams; bag-of-words
Neural            | 2013-2018    | Word embeddings, RNNs/LSTMs, seq2seq with attention
Transformer / LLM | 2018-present | BERT, GPT, in-context learning, foundation models

No era fully replaced the one before it: production NLP in 2026 still uses TF-IDF for low-latency retrieval, CRFs for structured extraction, and rule-based systems where the specification is precise. The right choice depends on the problem.
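To give a sense of why TF-IDF remains attractive for low-latency retrieval, here is a rough standard-library sketch. The three documents are invented, the tokenizer is a plain whitespace split, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common variant chosen for illustration; a production system would more likely use a library implementation.

```python
import math
from collections import Counter

# Hypothetical help-center snippets used only for this illustration.
docs = [
    "reset your password from the account settings page",
    "shipping usually takes three to five business days",
    "contact support to change the email on your account",
]


def tokenize(text):
    return text.lower().split()


doc_tokens = [tokenize(d) for d in docs]
n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in doc_tokens for term in set(tokens))


def tfidf(tokens):
    """Map tokens to a sparse {term: weight} vector using smoothed IDF."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_vectors = [tfidf(tokens) for tokens in doc_tokens]


def search(query):
    """Return the index of the document most similar to the query."""
    q = tfidf(tokenize(query))
    scores = [cosine(q, d) for d in doc_vectors]
    return max(range(n_docs), key=scores.__getitem__)


best = search("how do i reset my password")
print(docs[best])  # reset your password from the account settings page
```

Everything here is counting and a dot product over sparse vectors, with no model to load, which is the reason this family of methods is still used where latency matters.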

4. The Deep-Learning Inflection Points

  • 2013 — Word2Vec (Mikolov et al.): "king − man + woman ≈ queen". Dense, semantic word embeddings.
  • 2014 — Sequence-to-Sequence (Sutskever et al.): encoder-decoder LSTMs for translation.
  • 2015 — Attention (Bahdanau et al.): removed the seq2seq information bottleneck.
  • 2017 — Transformer (Vaswani et al.): replaced recurrence with attention. The architectural shift that made everything else possible.
  • 2018 — BERT & GPT-1: pretraining + fine-tuning became the standard recipe.
  • 2020 — GPT-3: in-context learning; prompting replaces fine-tuning for many tasks.
  • 2022-23 — ChatGPT, GPT-4, Claude, LLaMA: LLMs are general-purpose tools; NLP merges with "applied AI".
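The "king − man + woman ≈ queen" arithmetic from the 2013 entry can be sketched with toy vectors. The two-dimensional vectors below are hand-made for illustration (one axis for "royalty", one for "gender"), not real Word2Vec output, and `analogy` is a hypothetical helper; trained embeddings have hundreds of dimensions and only approximately satisfy the relation.

```python
import math

# Hand-made 2-D vectors: axis 0 is "royalty", axis 1 is "gender".
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.8, 1.0),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the three input words from the candidates is the usual convention for this kind of query, since the nearest neighbor of the target vector is otherwise often one of the inputs.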

5. The Production NLP Stack in 2026

code
┌──────────────────────────┐
│ Tokenizer + preprocessor │   spaCy, HF tokenizers
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Representation           │   TF-IDF, embeddings, transformer hidden states
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Task model               │   classifier, NER, generator, retriever
│ (classical or fine-tuned │
│  transformer or LLM API) │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Postprocessing           │   threshold, confidence calibration, output schema
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ Serving + monitoring     │   FastAPI, drift detection, eval set
└──────────────────────────┘

The shape is the same whether the task model is logistic regression on TF-IDF features or a 70B-parameter LLM. Most production NLP is still in the middle: fine-tuned BERT-class models for classification / NER, with classical methods for retrieval and LLMs for the open-ended tasks they uniquely enable.
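The first four stages of the stack can be sketched as plain functions to show why the shape stays fixed while the task model changes. Everything below is a toy stand-in: the function names, the `SPAM_WORDS` list, and the 0.3 threshold are assumptions made for illustration, and serving and monitoring are left out.

```python
from typing import Callable, Dict, List

SPAM_WORDS = {"free", "winner", "prize"}  # hypothetical keyword list


def preprocess(text: str) -> List[str]:
    """Stage 1: lowercase and split on whitespace (stand-in for a real tokenizer)."""
    return text.lower().split()


def represent(tokens: List[str]) -> Dict[str, int]:
    """Stage 2: bag-of-words counts (stand-in for TF-IDF or embeddings)."""
    counts: Dict[str, int] = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts


def keyword_model(features: Dict[str, int]) -> float:
    """Stage 3: toy task model; returns the fraction of tokens that are spam words."""
    total = sum(features.values())
    hits = sum(n for tok, n in features.items() if tok in SPAM_WORDS)
    return hits / total if total else 0.0


def postprocess(score: float, threshold: float = 0.3) -> Dict[str, object]:
    """Stage 4: apply a threshold and emit a fixed output schema."""
    return {"label": "spam" if score >= threshold else "ham", "score": round(score, 2)}


def run_pipeline(text: str, model: Callable[[Dict[str, int]], float]) -> Dict[str, object]:
    """Chain the stages; any callable with the same signature can replace the model."""
    return postprocess(model(represent(preprocess(text))))


print(run_pipeline("free prize for every winner", keyword_model))
# {'label': 'spam', 'score': 0.6}
print(run_pipeline("meeting moved to friday at noon", keyword_model))
# {'label': 'ham', 'score': 0.0}
```

Because `run_pipeline` only depends on the model's input and output types, swapping the keyword scorer for a trained classifier would leave the other stages untouched.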

6. When Classical NLP Still Wins

7. The Course Map

  1. Section 1 (this one): tokenization, bag-of-words, TF-IDF — the foundations every NLP pipeline still uses.
  2. Section 2: classical models — Naive Bayes, CRFs, LDA. Workhorses for the past 25 years.
  3. Section 3: word embeddings — Word2Vec, GloVe, FastText; a semantic-search project.
  4. Section 4: sequence models — RNNs, LSTMs, seq2seq, attention; a neural machine translator.
  5. Section 5: transformers — BERT, GPT, Hugging Face workflows.
  6. Section 6: applications — QA, summarization, multi-label classification, ethics, capstone project.

8. Setup Check

code
pip install nltk==3.9.1 spacy==3.7.5 \
            scikit-learn==1.5.2 transformers==4.46.0 \
            datasets==3.0.0 sentencepiece==0.2.0

python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt_tab')"

Six packages cover this whole course:

  • nltk: classical preprocessing, tokenizers, stemmers.
  • spaCy: production-grade NLP pipeline (POS, NER, parsing).
  • scikit-learn: TF-IDF, Naive Bayes, logistic regression, evaluation.
  • transformers: Hugging Face's library for loading and fine-tuning pretrained transformer models.
  • datasets: Hugging Face's dataset library, with one-line loading of common NLP datasets.
  • sentencepiece: subword tokenizer (BPE and unigram) used by many LLMs, including T5 and the original LLaMA.
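After installing, a short script can confirm that the environment matches the pinned versions. This is a sketch using the standard library's `importlib.metadata`; the `check_versions` helper and its status strings are illustrative choices, not part of any of the six packages.

```python
from importlib import metadata

# Versions pinned in the install command above.
PINNED = {
    "nltk": "3.9.1",
    "spacy": "3.7.5",
    "scikit-learn": "1.5.2",
    "transformers": "4.46.0",
    "datasets": "3.0.0",
    "sentencepiece": "0.2.0",
}


def check_versions(pinned):
    """Return {package: status}, where status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for name, expected in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == expected else f"found {found}, expected {expected}"
    return report


for package, status in check_versions(PINNED).items():
    print(f"{package:<14} {status}")
```

The script reads installed package metadata rather than importing the libraries, so it runs quickly and reports a missing package instead of raising an ImportError.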

9. The Mental Model
