Key AI Concepts Every PM Should Know
Lesson 2 of 18 · AI for Product Managers
Lesson 1 gave you the term hierarchy. This lesson gives you the working vocabulary: the concepts that come up in every serious AI feature spec, every vendor evaluation, and every conversation with the engineers building your product. By the end you will be able to read a model card, push back on a bad estimate, and put realistic costs on your roadmap.
1. Training Data — Quality, Recency, Bias
A model's behavior is downstream of its training data. Three properties of that data drive the product properties you care about:
- Quality. Curated, deduplicated, filtered data → better models. "Garbage in, garbage out" is still the most reliable rule in ML.
- Recency. Every model has a knowledge cutoff (e.g., GPT-5 ~early-2025, Claude 4.5 ~mid-2025). Anything after the cutoff is invisible unless you supply it via context (RAG) or tools (web search).
- Bias. Whatever skews exist in the data — language, culture, profession, gender — show up in outputs. This is a product risk, a legal risk in some jurisdictions (EU AI Act), and a brand risk.
2. Generalization vs Memorization (Overfitting)
ML's core promise: a model trained on examples should perform well on new examples it has never seen. When that fails because the model memorized the training set instead of learning the underlying pattern, it's called overfitting. The model "knows the training set" but flunks reality.
| Symptom | What's actually happening |
|---|---|
| Demo perfect, prod broken | Demo data leaked into training; real users send different inputs |
| Eval scores great, NPS bad | Eval set isn't representative of real traffic |
| Model regresses on a redesign | UX changed input distribution; model never saw the new shape |
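A toy sketch of the gap between memorizing and generalizing. Neither "model" below is real ML: one is a lookup table of training answers, the other applies a rule. The task (is a number even?) and all names are assumptions chosen for illustration.

```python
# Toy illustration: a memorizer vs. a rule on the task "is n even?".
train = [(n, n % 2 == 0) for n in range(0, 20)]      # examples seen in training
test = [(n, n % 2 == 0) for n in range(100, 120)]    # new inputs from "real users"

lookup = dict(train)

def memorizer(n):
    # Returns the stored answer; guesses False for anything it never saw.
    return lookup.get(n, False)

def generalizer(n):
    # Applies the underlying rule, so unseen inputs are handled too.
    return n % 2 == 0

def accuracy(model, examples):
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)

train_acc_mem = accuracy(memorizer, train)    # 1.0: perfect on what it memorized
test_acc_mem = accuracy(memorizer, test)      # 0.5: no better than a constant guess
test_acc_gen = accuracy(generalizer, test)    # 1.0: the rule carries over
```

The memorizer's perfect training score is the "demo perfect, prod broken" row in miniature: the score only means something when it is measured on inputs the model has not seen.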
3. Model Size, Parameters, Context Window
| Term | What it is | What it implies |
|---|---|---|
| Parameters | Number of "knobs" in the network. 8B, 70B, 405B, 1T+ | Bigger ≈ smarter (within a family) but slower and more expensive |
| Context window | Max tokens (input + output) the model can attend to in one call. 128K, 200K, 1M, 2M in 2026 | How much prompt, history, and documents fit in one call; anything beyond it must be dropped or summarized |
| Output limit | Max tokens it can generate. Often 4K-64K, sometimes 200K with extended-output modes | Caps the length of generated content per call |
Frontier models in 2026 typically advertise 200K-2M token context windows (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5). But "advertised" ≠ "useful": practical performance often degrades past 100K tokens, especially for retrieval inside the prompt. Lesson 3 covers this in detail.
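Because the window covers input and output together, a quick budget check is useful when sizing a feature. The sketch below uses hypothetical function names and numbers, and assumes the window is shared exactly as the table describes; individual vendors may count it differently.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window

def max_input_tokens(context_window: int, max_output_tokens: int) -> int:
    """Largest prompt that still leaves room for the requested output."""
    return max(context_window - max_output_tokens, 0)

# Hypothetical numbers: a 200K window with 8K reserved for the answer.
ok = fits_context(150_000, 8_000, 200_000)       # True
too_big = fits_context(195_000, 8_000, 200_000)  # False: 203K > 200K
room = max_input_tokens(200_000, 8_000)          # 192000
```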
4. Tokens — the Billing Unit
LLMs don't read characters or words; they read tokens. A token is a chunk of text — usually a frequent sub-word — that the model's tokenizer was trained to recognize. As a PM you need three things:
- Rule of thumb: 1 token ≈ 3-4 English characters ≈ 0.75 words. 1,000 tokens ≈ 750 words ≈ 1.5 pages.
- Pricing: input tokens are cheaper than output tokens (output often costs 3-5x more per token). Cached input tokens are cheaper still (often around 10x cheaper than uncached input).
- Non-English text and code use more tokens. Japanese, Chinese, Arabic — sometimes 2-3x the tokens of equivalent English. This shows up in your bill.
| Model (illustrative 2026) | Output $/1M | Cached input $/1M |
|---|---|---|
| Claude Opus 4 | $75 | $1.50 |
| Claude Sonnet 4.5 | $15 | $0.30 |
| Claude Haiku 4 | $5 | $0.10 |
| GPT-5 | $30 | $1.00 |
| GPT-4o-mini | $0.60 | $0.075 |
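The rule of thumb and the pricing structure combine into a back-of-envelope cost estimate. This is a minimal sketch: the function names are hypothetical, the prices are placeholders rather than a vendor quote, and the characters-per-token ratio only holds roughly for English prose.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from the ~4 characters per token rule of thumb (English only)."""
    return max(1, round(len(text) / chars_per_token))

def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m,
                     cached_tokens=0, cached_price_per_m=0.0):
    """Cost of one call. Prices are USD per 1M tokens.

    cached_tokens is the part of the input billed at the cached rate.
    """
    fresh = input_tokens - cached_tokens
    total = (fresh * input_price_per_m
             + cached_tokens * cached_price_per_m
             + output_tokens * output_price_per_m)
    return total / 1_000_000

# Hypothetical prices: $3 input, $15 output, $0.30 cached input per 1M tokens.
no_cache = request_cost_usd(10_000, 1_000, 3.0, 15.0)
with_cache = request_cost_usd(10_000, 1_000, 3.0, 15.0,
                              cached_tokens=8_000, cached_price_per_m=0.30)
```

With these placeholder prices the uncached call costs about $0.045 and the cached one about $0.023. Multiply by expected monthly request volume to get a first-pass budget line.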
5. Latency, Throughput, and Why "Average" Lies
| Metric | What it measures | Why it matters |
|---|---|---|
| Time-to-first-token (TTFT) | Delay before any output appears | UX feel. Sub-1s feels snappy; >3s feels broken |
| Tokens/sec (TPS) | Streaming generation speed | 50+ TPS reads as fast; below 20 feels sluggish |
| p50 latency | Median end-to-end time | The "typical" experience |
| p99 latency | Time that 99% of requests finish within; the slowest 1% take longer | The complaints, the support tickets, the churn |
In production, p99 is often 5-10x p50. A model that's "fast on average" can still be unusable for the users who hit the long tail. Always demand both p50 and p99 in eval reports.
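A small worked example shows how an average can hide the long tail. The latency samples are invented, and `percentile` is a hypothetical helper using the nearest-rank method, which is one of several common percentile definitions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in seconds: 97 fast requests and 3 slow ones.
latencies = [1.0] * 97 + [9.0] * 3
mean = sum(latencies) / len(latencies)   # 1.24
p50 = percentile(latencies, 50)          # 1.0
p99 = percentile(latencies, 99)          # 9.0
```

The mean of 1.24s looks healthy, yet three users in every hundred wait 9 seconds. That is why an eval report with only an average is incomplete.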
6. Hallucination — What It Is, Why It Happens
A hallucination is a confident, plausible, factually wrong output. LLMs hallucinate because they were trained to produce likely next tokens, not true ones. Truth was a side effect of training data quality, not an objective.
You cannot get hallucination to 0%. You can drive it down a long way:
- RAG. Retrieve real documents at query time and ask the model to answer from them — and only from them.
- Tool use. Let the model call calculators, databases, web search; ground answers in tool outputs.
- Structured output / JSON schema. Force the model to emit data in a fixed shape; reject malformed output.
- Citations. Force the model to cite the passage it used; reject answers without citations.
- Evals + human review for high-stakes outputs.
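The structured-output and citation checks above can be sketched as a gate that runs before an answer reaches the user. The JSON shape (`answer`, `citations`) and the function name are assumptions for illustration, not any vendor's schema.

```python
import json

def validate_answer(raw_output: str, source_ids: set) -> dict:
    """Parse the model's JSON reply and reject it unless it cites known sources."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("malformed JSON") from exc
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing 'answer' string")
    citations = data.get("citations")
    if not isinstance(citations, list) or not citations:
        raise ValueError("answer has no citations")
    # Every citation must point at a document we actually retrieved.
    unknown = [c for c in citations if not isinstance(c, str) or c not in source_ids]
    if unknown:
        raise ValueError(f"cites unknown sources: {unknown}")
    return data

sources = {"doc-1", "doc-2"}  # hypothetical IDs of the retrieved documents
good = validate_answer('{"answer": "Refunds take 5 days.", "citations": ["doc-1"]}', sources)
```

A rejected output would typically trigger a retry or a fallback message. Note that this only checks that a citation exists and is known, not that the cited passage actually supports the claim.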
7. Fine-Tuning vs Prompting vs RAG
These are the three ways to "customize" model behavior. They are not interchangeable.
| | Prompting | RAG | Fine-tuning |
|---|---|---|---|
| What it changes | What you ask the model | What context the model sees | The model's weights |
| Best for | Behavior, tone, format | Knowledge / facts | Style + niche domain expertise |
| Setup time | Hours | Days-weeks | Weeks-months |
| Per-query cost | Baseline | +retrieval cost (small) | Often cheaper at scale |
| Updates | Edit prompt | Re-index docs | Re-train |
| Default 2026 answer | Always start here | Add when knowledge is the issue | Add when style/domain is the issue and prompting failed |
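To make the RAG column concrete, here is a toy retrieval step followed by prompt assembly. Word overlap stands in for embedding similarity, and the documents, function names, and prompt wording are all hypothetical.

```python
def score(query: str, chunk: str) -> int:
    """Count query words that also appear in the chunk (a crude stand-in for embeddings)."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks ranked by overlap with the query."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks, top_k=2):
    """Put only the retrieved chunks in front of the model, not the whole corpus."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return ("Answer using only the context below. If the answer is not there, say so.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Enterprise plans include single sign-on.",
]
prompt = build_prompt("how long do refunds take", docs, top_k=1)
```

Updating the product's knowledge here means editing `docs` and re-indexing; the model's weights never change, which is the practical difference from fine-tuning.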
8. Closed vs Open-Weight Models
| | Closed (GPT, Claude, Gemini) | Open weight (Llama, DeepSeek, Qwen, Mistral) |
|---|---|---|
| Quality | Frontier | ~6-12 months behind frontier; close on many tasks |
| Cost | Predictable per-token API pricing | Variable: hosted (cheap) or self-hosted (GPU bills) |
| Data residency | Vendor's terms | Yours |
| Customization | Limited (fine-tune offering varies) | Full — you have the weights |
| Vendor lock-in | Real | None |
| Eng overhead | ~Zero | Significant — inference infra is a real team |
9. Multimodality — the New Surface Area
Frontier models in 2026 are not just text. Most accept images natively (screenshots, photos, charts, PDFs); many handle audio (Gemini Live, GPT-4o voice) and video (Gemini 2.5 Pro for understanding, Sora 2 for generation). New product surfaces open up:
- Vision Q&A on screenshots — Notion AI, Linear AI summarizing screenshots of Figma; support tools reading user-uploaded screenshots.
- Document understanding — extract structured data from PDFs and images without OCR.
- Voice agents — real-time interruption-tolerant phone agents (sub-500ms round-trip).
- Video understanding — summarize meetings, support call recordings, demos.
10. Agents — What They Actually Are
An agent is the simplest possible recipe:
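```python
def run_agent(llm, execute_tool, prompt, available_tools):
    history = []  # (tool call, tool result) pairs seen so far
    done = False
    while not done:
        # The LLM sees the prompt, everything tried so far, and the tool list.
        response = llm(prompt, history, available_tools)
        if response.is_tool_call:
            result = execute_tool(response.tool, response.args)
            history.append((response, result))
        else:
            done = True  # a plain-text reply means the model is finished
    return response.text
```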
An LLM, a list of tools (search, send_email, query_db, create_pr), and a loop. That's it. The "intelligence" is in the LLM choosing which tool to call next. Cursor's agent mode, Claude Code, Linear's auto-triage, and Devin all share this shape underneath.
Where agents work in 2026:
- Constrained tool sets, short horizons (3-10 steps), reversible actions, recoverable failures — coding, data extraction, customer support routing.
Where they still struggle:
- Long horizons (50+ steps), irreversible actions without human approval, multi-agent coordination, novel tool combinations.
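The agent recipe can be exercised end to end with a scripted stand-in for the model. Everything here is hypothetical: `fake_llm`, the `lookup_order` tool, and the `Response` shape are invented for the demo, and the `max_steps` cap is an added assumption reflecting the short-horizon point above.

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    text: str = ""
    tool: str = ""
    args: dict = field(default_factory=dict)

    @property
    def is_tool_call(self) -> bool:
        return bool(self.tool)

def fake_llm(prompt, history, tools):
    """Scripted stand-in for a model: look up the order once, then answer."""
    if not history:
        return Response(tool="lookup_order", args={"order_id": "A17"})
    _, result = history[-1]
    return Response(text=f"Order A17 is {result}.")

TOOLS = {"lookup_order": lambda order_id: "shipped"}  # hypothetical tool registry

def execute_tool(name, args):
    return TOOLS[name](**args)

def run_agent(llm, prompt, tools, max_steps=5):
    history = []
    for _ in range(max_steps):  # step cap bounds cost if the model never finishes
        response = llm(prompt, history, tools)
        if not response.is_tool_call:
            return response.text, len(history)
        history.append((response, execute_tool(response.tool, response.args)))
    raise RuntimeError("agent did not finish within max_steps")

answer, tool_calls = run_agent(fake_llm, "Where is order A17?", list(TOOLS))
```

Swapping `fake_llm` for a real model call changes nothing about the loop, which is why the hard product questions are about which tools to expose and which actions need human approval.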
11. Cost Levers — How Each Shows Up on the Bill
| Lever | Mechanism | Typical savings |
|---|---|---|
| Smaller model tier | Route easy queries to Haiku/Mini, hard ones to frontier | 3-15x |
| Prompt caching | Cache stable system prompts and reused context | 5-10x on input tokens |
| Shorter outputs | Cap max_tokens; use structured JSON instead of prose | 2-5x on output cost |
| RAG instead of giant prompts | Retrieve top-K relevant chunks, not the full corpus | 10-100x on input tokens |
| Batch APIs (24h SLA) | Anthropic/OpenAI batch tier | ~50% off |
| Self-hosted open weights | Trade $/req for fixed GPU bill | Crosses over above ~10M req/month |
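Two of these levers reduce to simple arithmetic worth doing before a roadmap review. The per-request costs, traffic split, and GPU bill below are invented inputs, and the function names are hypothetical.

```python
def blended_cost_per_request(share_small, small_cost, frontier_cost):
    """Average cost when a share of traffic is routed to the smaller model."""
    return share_small * small_cost + (1 - share_small) * frontier_cost

def breakeven_requests_per_month(gpu_bill_per_month, api_cost_per_request):
    """Monthly volume at which a fixed GPU bill equals per-request API spend."""
    return gpu_bill_per_month / api_cost_per_request

# Hypothetical inputs: $0.002 per small-model call, $0.02 per frontier call,
# and 80% of traffic easy enough for the small model.
blended = blended_cost_per_request(0.8, 0.002, 0.02)
savings_factor = 0.02 / blended

# Hypothetical inputs: $20,000/month of GPUs vs. $0.002 per API request.
breakeven = breakeven_requests_per_month(20_000, 0.002)
```

With these inputs, routing cuts the average cost from $0.02 to about $0.0056 per request (roughly 3.6x), and self-hosting breaks even near 10 million requests per month. Change the inputs to your own traffic mix before quoting either number.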
12. The PM AI Vocabulary Cheat Sheet
| Term | One-line meaning |
|---|---|
| Token | Sub-word unit; the billing atom |
| Context window | Max tokens per call (input + output) |
| Temperature | 0 = deterministic-ish, 1+ = creative; main randomness knob |
| System prompt | Persistent instructions that steer behavior |
| Few-shot | Showing examples in the prompt |
| Chain-of-thought | Asking the model to "think step by step" |
| RAG | Retrieval-Augmented Generation; bring docs to the model |
| Embedding | A vector representation of text used for semantic search |
| Fine-tuning | Adjusting model weights on your data |
| Tool / function calling | Model emits structured calls to your code |
| Agent | LLM + tools + loop |
| Eval | Automated test suite for model quality |
| Hallucination | Confident, plausible, wrong |
| Prompt injection | Attacker text that overrides your instructions |
| p99 latency | Latency that only the slowest 1% of requests exceed; where user complaints come from |
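The temperature entry is easier to reason about with numbers. The sketch below applies the standard softmax-with-temperature formula to three invented token scores; real models do this over a vocabulary of many thousands of tokens, and vendors may layer other sampling controls on top.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as 'always pick the top score'")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
```

At temperature 0.2 the top token gets about 99% of the probability, at 1.0 about 67%, and at 2.0 about 51%. Low temperature suits extraction and classification; higher values suit brainstorming.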