Key AI Concepts Every PM Should Know

35 min read · Video · AI Literacy for PMs

2 of 18 · AI for Product Managers

Lesson 1 gave you the term hierarchy. This lesson gives you the working vocabulary — the concepts that come up in every serious AI feature spec, every vendor evaluation, and every conversation with the engineers building your product. By the end you will be able to read a model card, push back on a bad estimate, and write a roadmap that prices itself correctly.

1. Training Data — Quality, Recency, Bias

A model's behavior is downstream of its training data. Three properties of that data drive the product properties you care about:

Quality. Curated, deduplicated, filtered data → better models. "Garbage in, garbage out" is still the most reliable rule in ML.
Recency. Every model has a knowledge cutoff (e.g., GPT-5 ~early-2025, Claude 4.5 ~mid-2025). Anything after the cutoff is invisible unless you supply it via context (RAG) or tools (web search).
Bias. Whatever skews exist in the data — language, culture, profession, gender — show up in outputs. This is a product risk, a legal risk in some jurisdictions (EU AI Act), and a brand risk.

2. Generalization vs Memorization (Overfitting)

ML's core promise: a model trained on examples should perform well on new examples it has never seen. When that fails because the model simply memorized the training set, it's called overfitting. The model "knows" the training set but flunks reality.

Symptom	What's actually happening
Demo perfect, prod broken	Demo data leaked into training; real users send different inputs
Eval scores great, NPS bad	Eval set isn't representative of real traffic
Model regresses on a redesign	UX changed input distribution; model never saw the new shape
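
A toy sketch of the gap between memorizing and generalizing. Everything here is made up for illustration: the "model" is just a lookup table of its training examples, and the even/odd labeling rule stands in for whatever your product actually needs to learn. It scores perfectly on data it has seen and no better than a coin flip on data it has not.

```python
# Toy illustration only: the data, the labeling rule, and the "model" are hypothetical.

def true_label(x: int) -> str:
    """The real-world rule we would like the model to learn."""
    return "even" if x % 2 == 0 else "odd"


class MemorizingModel:
    """Stores training pairs verbatim and learns no rule at all."""

    def __init__(self) -> None:
        self.table = {}

    def fit(self, xs: list) -> None:
        for x in xs:
            self.table[x] = true_label(x)

    def predict(self, x: int) -> str:
        # Unseen inputs fall back to one fixed answer.
        return self.table.get(x, "even")


def accuracy(model: MemorizingModel, xs: list) -> float:
    correct = sum(model.predict(x) == true_label(x) for x in xs)
    return correct / len(xs)


train = list(range(0, 100))        # what the model saw during training
held_out = list(range(100, 200))   # stand-in for real traffic it never saw

model = MemorizingModel()
model.fit(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)

print(f"train accuracy:    {train_acc:.0%}")
print(f"held-out accuracy: {held_out_acc:.0%}")
```

The practical habit this motivates: always ask which data an eval score was measured on, and whether any of it could have been seen during training.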

3. Model Size, Parameters, Context Window

Term	What it is	What it implies
Parameters	Number of "knobs" in the network. 8B, 70B, 405B, 1T+	Bigger ≈ smarter (within a family) but slower and more expensive
Context window	Max tokens (input + output) the model can attend to in one call. 128K, 200K, 1M, 2M in 2026	How much you can stuff into a prompt before it forgets the start
Output limit	Max tokens it can generate. Often 4K-64K, sometimes 200K with extended-output modes	Caps the length of generated content per call

Frontier models in 2026 typically advertise 200K-2M token context windows (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5). But "advertised" ≠ "useful": practical performance often degrades past 100K tokens, especially for retrieval inside the prompt. Lesson 3 covers this in detail.
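
Because the context window covers input and output together, a quick budget check is worth doing before you promise a feature. The sketch below is a minimal version of that arithmetic; the function names and the example numbers are assumptions for illustration, not any vendor's API.

```python
def remaining_input_budget(
    context_window: int,
    max_output_tokens: int,
    system_prompt_tokens: int = 0,
) -> int:
    """Tokens left for documents and chat history after reserving
    room for the output and the system prompt."""
    return max(context_window - max_output_tokens - system_prompt_tokens, 0)


def fits_in_context(
    input_tokens: int,
    context_window: int,
    max_output_tokens: int,
    system_prompt_tokens: int = 0,
) -> bool:
    """True if the request stays inside the window."""
    budget = remaining_input_budget(
        context_window, max_output_tokens, system_prompt_tokens
    )
    return input_tokens <= budget


# Hypothetical feature: 200K window, 8K reserved for output, 2K system prompt.
budget = remaining_input_budget(200_000, 8_000, 2_000)
print(budget)
print(fits_in_context(150_000, 200_000, 8_000, 2_000))
print(fits_in_context(195_000, 200_000, 8_000, 2_000))
```

Given the degradation noted above, a team might reasonably pass a self-imposed cap (say, 100K) as `context_window` rather than the advertised figure.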

4. Tokens — the Billing Unit

LLMs don't read characters or words; they read tokens. A token is a chunk of text — usually a frequent sub-word — that the model's tokenizer was trained to recognize. As a PM you need three things:

Rule of thumb: 1 token ≈ 3-4 English characters ≈ 0.75 words. 1,000 tokens ≈ 750 words ≈ 1.5 pages.
Pricing: input tokens are cheaper than output tokens (output often costs 3-5x as much per token). Cached input tokens are cheaper still (often about one-tenth the price of fresh input).
Non-English text and code use more tokens. Japanese, Chinese, Arabic — sometimes 2-3x the tokens of equivalent English. This shows up in your bill.
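
The rule of thumb above can be turned into a rough estimator. This is a back-of-envelope sketch, not a tokenizer: the function name and the `language_multiplier` parameter are hypothetical, and real counts should come from the vendor's own tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; English prose only


def estimate_tokens(english_word_count: int, language_multiplier: float = 1.0) -> int:
    """Rough token estimate from a word count.

    language_multiplier is an assumption you supply: 1.0 for English,
    something like 2.0-3.0 for languages that tokenize less efficiently.
    """
    return math.ceil(english_word_count / WORDS_PER_TOKEN * language_multiplier)


english = estimate_tokens(750)
non_english = estimate_tokens(750, language_multiplier=2.0)
print(english, non_english)
```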

Model (illustrative 2026)	Input $/1M	Output $/1M	Cached input $/1M
Claude Opus 4	$15	$75	$1.50
Claude Sonnet 4.5	$3	$15	$0.30
Claude Haiku 4	$1	$5	$0.10
GPT-5	$10	$30	$1.00
GPT-4o-mini	$0.15	$0.60	$0.075
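
To see how these prices turn into a per-request cost, here is a minimal sketch using the illustrative Claude Sonnet 4.5 row ($3 input, $15 output, $0.30 cached input per 1M tokens). The function name and the request sizes are assumptions chosen for the example.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
    *,
    input_per_m: float,
    output_per_m: float,
    cached_per_m: float,
) -> float:
    """Cost of one call, given per-1M-token prices.

    input_tokens counts only fresh (uncached) input.
    """
    total = (
        input_tokens * input_per_m
        + output_tokens * output_per_m
        + cached_input_tokens * cached_per_m
    )
    return total / 1_000_000


# Illustrative prices from the table row, in USD per 1M tokens.
sonnet = {"input_per_m": 3.0, "output_per_m": 15.0, "cached_per_m": 0.30}

# Hypothetical request: 2,000 input tokens and 500 output tokens.
no_cache = request_cost_usd(2_000, 500, **sonnet)

# Same request, but 1,500 of the input tokens are served from cache.
with_cache = request_cost_usd(500, 500, 1_500, **sonnet)

print(f"no cache:   ${no_cache:.5f} per request")
print(f"with cache: ${with_cache:.5f} per request")
```

Note that even in this small example the 500 output tokens cost more than the 2,000 input tokens, which is why output length is usually the first thing to look at.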

5. Latency, Throughput, and Why "Average" Lies

Metric	What it measures	Why it matters
Time-to-first-token (TTFT)	Delay before any output appears	UX feel. Sub-1s feels snappy; >3s feels broken
Tokens/sec (TPS)	Streaming generation speed	50+ TPS reads as fast; below 20 feels sluggish
p50 latency	Median end-to-end time	The "typical" experience
p99 latency	The time 99% of requests finish within; the worst 1% are slower	The complaints, the support tickets, the churn

In production, p99 is often 5-10x p50. A model that's "fast on average" can still be unusable for the users who hit the long tail. Always demand both p50 and p99 in eval reports.
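
A small sketch of why the average hides the tail. The latency samples are invented for illustration, and the percentile uses the simple nearest-rank method; monitoring tools may use a slightly different interpolation.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 95 fast requests, 5 slow ones.
latencies = [1.0] * 95 + [9.0] * 5

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean: {mean:.1f}s  p50: {p50:.1f}s  p99: {p99:.1f}s")
```

The mean and the median both look acceptable here, yet one user in twenty waits nine seconds. Only the p99 figure surfaces that.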

6. Hallucination — What It Is, Why It Happens

A hallucination is a confident, plausible, factually wrong output. LLMs hallucinate because they are trained to produce likely next tokens, not true ones. Truth is a side effect of training data quality, not a training objective.

You cannot get hallucination to 0%. You can drive it down a long way:

RAG. Retrieve real documents at query time and ask the model to answer from them — and only from them.
Tool use. Let the model call calculators, databases, web search; ground answers in tool outputs.
Structured output / JSON schema. Force the model to emit data in a fixed shape; reject malformed output.
Citations. Force the model to cite the passage it used; reject answers without citations.
Evals + human review for high-stakes outputs.
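
The structured-output and citation items above can be combined into a single gate in front of the user. The sketch below assumes a hypothetical response shape (a JSON object with an `answer` string and a `citations` list of passage IDs); your own schema will differ.

```python
import json


def validate_answer(raw: str, retrieved_ids: set) -> dict:
    """Return the parsed answer, or raise ValueError if it should be rejected.

    The expected shape {"answer": str, "citations": [str, ...]} is an
    assumption made for this example.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("malformed JSON") from exc

    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing 'answer' string")

    citations = data.get("citations")
    if not isinstance(citations, list) or not citations:
        raise ValueError("no citations")
    if not all(isinstance(c, str) for c in citations):
        raise ValueError("citations must be passage IDs")

    # A citation only counts if it points at a passage we actually retrieved.
    unknown = [c for c in citations if c not in retrieved_ids]
    if unknown:
        raise ValueError(f"cites passages that were not retrieved: {unknown}")
    return data


retrieved = {"doc-1", "doc-2"}
good = validate_answer('{"answer": "14 days", "citations": ["doc-1"]}', retrieved)
print("accepted:", good["answer"])

for bad in ['{"answer": "14 days", "citations": []}', "not json at all"]:
    try:
        validate_answer(bad, retrieved)
    except ValueError as err:
        print("rejected:", err)
```

A check like this does not prove the answer is true. It only guarantees the answer is well-formed and points at a real source that a reviewer can inspect.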

7. Fine-Tuning vs Prompting vs RAG

These are the three ways to "customize" model behavior. They are not interchangeable.

	Prompting	RAG	Fine-tuning
What it changes	What you ask the model	What context the model sees	The model's weights
Best for	Behavior, tone, format	Knowledge / facts	Style + niche domain expertise
Setup time	Hours	Days-weeks	Weeks-months
Per-query cost	Baseline	+retrieval cost (small)	Often cheaper at scale
Updates	Edit prompt	Re-index docs	Re-train
Default 2026 answer	Always start here	Add when knowledge is the issue	Add when style/domain is the issue and prompting failed
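
To make the RAG column concrete, here is a minimal sketch of "changing what context the model sees". It is deliberately simplified: plain keyword overlap stands in for embedding-based search, the three documents are invented, and the prompt wording is only one possible phrasing.

```python
def overlap_score(query: str, doc: str) -> int:
    """Count distinct query words that also appear in the document.
    A crude stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents that best match the query."""
    ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list, k: int = 2) -> str:
    """Assemble a prompt that contains only the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base.
docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Enterprise plans include SSO and audit logs.",
]

query = "how many days do refunds take?"
prompt = build_prompt(query, docs, k=1)
print(prompt)
```

Updating the product's knowledge here means editing `docs` and re-indexing, with no change to the model itself. That is the property the "Updates" row of the table is describing.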

8. Closed vs Open-Weight Models

	Closed (GPT, Claude, Gemini)	Open weight (Llama, DeepSeek, Qwen, Mistral)
Quality	Frontier	~6-12 months behind frontier; close on many tasks
Cost	Predictable per-token API pricing	Variable: hosted (cheap) or self-hosted (GPU bills)
Data residency	Vendor's terms	Yours
Customization	Limited (fine-tune offering varies)	Full — you have the weights
Vendor lock-in	Real	None
Eng overhead	~Zero	Significant — inference infra is a real team

9. Multimodality — the New Surface Area

Frontier models in 2026 are not just text. Most accept images natively (screenshots, photos, charts, PDFs); many handle audio (Gemini Live, GPT-4o voice) and video (Gemini 2.5 Pro for understanding, Sora 2 for generation). New product surfaces open up:

Vision Q&A on screenshots — Notion AI, Linear AI summarizing screenshots of Figma; support tools reading user-uploaded screenshots.
Document understanding — extract structured data from PDFs and images without OCR.
Voice agents — real-time interruption-tolerant phone agents (sub-500ms round-trip).
Video understanding — summarize meetings, support call recordings, demos.

10. Agents — What They Actually Are

Stripped down, an agent is a very simple recipe:

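def run_agent(llm, prompt, tools):
    history = []
    while True:
        # The model sees the prompt, everything so far, and the tools it may call.
        response = llm(prompt, history, tools)
        if response.is_tool_call:
            # Run the requested tool and feed the result back in.
            result = tools[response.tool](**response.args)
            history.append((response, result))
        else:
            # No tool requested: this is the final answer.
            return response.text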

An LLM, a list of tools (search, send_email, query_db, create_pr), and a loop. That's it. The "intelligence" is in the LLM choosing which tool to call next. Cursor's agent mode, Claude Code, Linear's auto-triage, and Devin all share this shape underneath.

Where agents work in 2026:

Constrained tool sets, short horizons (3-10 steps), reversible actions, recoverable failures — coding, data extraction, customer support routing.

Where they still struggle:

Long horizons (50+ steps), irreversible actions without human approval, multi-agent coordination, novel tool combinations.
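
Two of the guardrails implied above, a hard step limit and human approval for irreversible actions, fit into the same loop with very little extra code. The sketch below is runnable offline because a scripted function stands in for the model. The `Step` shape, the tool names, and the `approve` callback are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One model turn: a tool call (tool is set) or a final answer (text)."""
    text: str = ""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)


def run_agent(
    llm: Callable[[list], Step],
    tools: dict,
    irreversible: set,
    approve: Callable[[str, dict], bool],
    max_steps: int = 10,
) -> str:
    history: list = []
    for _ in range(max_steps):  # short horizon: hard cap on steps
        step = llm(history)
        if step.tool is None:
            return step.text  # final answer
        if step.tool in irreversible and not approve(step.tool, step.args):
            result = "denied by human reviewer"
        else:
            result = tools[step.tool](**step.args)
        history.append((step, result))  # the model sees the outcome next turn
    return "stopped: step limit reached"


def fake_llm(history: list) -> Step:
    """Scripted stand-in for a real model."""
    if not history:
        return Step(tool="lookup_order", args={"order_id": "A1"})
    if len(history) == 1:
        return Step(tool="send_refund", args={"order_id": "A1"})
    return Step(text=f"Done. Last tool result: {history[-1][1]}")


tools = {
    "lookup_order": lambda order_id: f"order {order_id}: delivered",
    "send_refund": lambda order_id: f"refund sent for {order_id}",
}

answer = run_agent(
    fake_llm,
    tools,
    irreversible={"send_refund"},
    approve=lambda tool, args: False,  # reviewer says no
)
print(answer)
```

In a real product the `approve` callback would be a review queue or a confirmation dialog, and the list of irreversible tools is a product decision as much as an engineering one.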

11. Cost Levers — How Each Shows Up on the Bill

Lever	Mechanism	Typical savings
Smaller model tier	Route easy queries to Haiku/Mini, hard ones to frontier	3-15x
Prompt caching	Cache stable system prompts and reused context	5-10x on input tokens
Shorter outputs	Cap max_tokens; use structured JSON instead of prose	2-5x on output cost
RAG instead of giant prompts	Retrieve top-K relevant chunks, not the full corpus	10-100x on input tokens
Batch APIs (24h SLA)	Anthropic/OpenAI batch tier	~50% off
Self-hosted open weights	Trade $/req for fixed GPU bill	Crosses over above ~10M req/month
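
The self-hosting crossover in the last row is a break-even calculation you can redo with your own numbers. In the sketch below the $20,000 monthly GPU bill and the $2 per 1,000 requests API cost are hypothetical inputs, picked only so the result lands at the ~10M requests/month figure from the table.

```python
def breakeven_requests_per_month(
    gpu_bill_usd_per_month: float,
    api_usd_per_1k_requests: float,
) -> float:
    """Monthly request volume at which a fixed GPU bill equals API spend."""
    if api_usd_per_1k_requests <= 0:
        raise ValueError("API cost must be positive")
    return gpu_bill_usd_per_month / api_usd_per_1k_requests * 1000


def cheaper_option(
    requests_per_month: int,
    gpu_bill_usd_per_month: float,
    api_usd_per_1k_requests: float,
) -> str:
    """Compare the two bills at a given volume (ignores staffing costs)."""
    api_bill = requests_per_month / 1000 * api_usd_per_1k_requests
    return "self-hosted" if gpu_bill_usd_per_month < api_bill else "api"


# Hypothetical inputs for illustration only.
GPU_BILL = 20_000.0   # USD per month, fixed
API_PER_1K = 2.0      # USD per 1,000 requests

breakeven = breakeven_requests_per_month(GPU_BILL, API_PER_1K)
print(f"break-even: {breakeven:,.0f} requests/month")
print(cheaper_option(5_000_000, GPU_BILL, API_PER_1K))
print(cheaper_option(20_000_000, GPU_BILL, API_PER_1K))
```

This comparison leaves out the engineering team needed to run inference infrastructure, which usually pushes the real break-even point higher.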

12. The PM AI Vocabulary Cheat Sheet

Term	One-line meaning
Token	Sub-word unit; the billing atom
Context window	Max tokens per call (input + output)
Temperature	0 = deterministic-ish, 1+ = creative; main randomness knob
System prompt	Persistent instructions that steer behavior
Few-shot	Showing examples in the prompt
Chain-of-thought	Asking the model to "think step by step"
RAG	Retrieval-Augmented Generation; bring docs to the model
Embedding	A vector representation of text used for semantic search
Fine-tuning	Adjusting model weights on your data
Tool / function calling	Model emits structured calls to your code
Agent	LLM + tools + loop
Eval	Automated test suite for model quality
Hallucination	Confident, plausible, wrong
Prompt injection	Attacker text that overrides your instructions
p99 latency	The time 99% of requests finish within; the slowest 1% are where the complaints come from
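
Of the terms above, temperature is the one most easily shown in code. The sketch below applies the standard softmax-with-temperature formula to three made-up scores (logits) for candidate next tokens. How a given vendor handles temperature 0 or values above 1 is not covered here.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.2)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)

for name, probs in [("T=0.2", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability sits on the top-scoring token, which is why outputs become close to deterministic. At high temperature the alternatives get a real chance of being picked.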
