
Key AI Concepts Every PM Should Know


Lesson 1 gave you the term hierarchy. This lesson gives you the working vocabulary — the concepts that come up in every serious AI feature spec, every vendor evaluation, and every conversation with the engineers building your product. By the end you will be able to read a model card, push back on a bad estimate, and write a roadmap that prices itself correctly.

1. Training Data — Quality, Recency, Bias

A model's behavior is downstream of its training data. Three properties of that data drive the product properties you care about:

  • Quality. Curated, deduplicated, filtered data → better models. "Garbage in, garbage out" is still the most reliable rule in ML.
  • Recency. Every model has a knowledge cutoff (e.g., GPT-5 ~early-2025, Claude 4.5 ~mid-2025). Anything after the cutoff is invisible unless you supply it via context (RAG) or tools (web search).
  • Bias. Whatever skews exist in the data — language, culture, profession, gender — show up in outputs. This is a product risk, a legal risk in some jurisdictions (EU AI Act), and a brand risk.

2. Generalization vs Memorization (Overfitting)

ML's core promise: a model trained on examples should perform well on new examples it has never seen. When that fails, because the model memorized the training set instead of learning the underlying pattern, it's called overfitting. The model "knows" its training data but flunks reality.

Symptom | What's actually happening
Demo perfect, prod broken | Demo data leaked into training; real users send different inputs
Eval scores great, NPS bad | Eval set isn't representative of real traffic
Model regresses on a redesign | UX changed input distribution; model never saw the new shape
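
The gap between performance on seen and unseen data is the usual first check for memorization. A minimal sketch, assuming a toy "model" that is nothing more than a lookup table over its training examples; the data, labels, and function names are made up for illustration:

```python
# Toy illustration: a "model" that memorizes its training set.
# All data and names here are hypothetical.

def train_memorizer(examples):
    """Return a lookup table mapping each training input to its label."""
    return dict(examples)

def predict(model, text, default="unknown"):
    """Answer from the lookup table; fall back to a default for unseen inputs."""
    return model.get(text, default)

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for text, label in examples if predict(model, text) == label)
    return correct / len(examples)

train = [("refund please", "billing"), ("app crashes", "bug"), ("cancel plan", "billing")]
held_out = [("I want my money back", "billing"), ("screen freezes", "bug")]

model = train_memorizer(train)
train_acc = accuracy(model, train)        # perfect on data it has seen
held_out_acc = accuracy(model, held_out)  # fails on new phrasings
print(f"train={train_acc:.2f} held_out={held_out_acc:.2f}")
```

A large gap between the two numbers is the signal to look for. A real eval would use far more examples, ideally sampled from production traffic.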

3. Model Size, Parameters, Context Window

Term | What it is | What it implies
Parameters | Number of "knobs" in the network. 8B, 70B, 405B, 1T+ | Bigger ≈ smarter (within a family) but slower and more expensive
Context window | Max tokens (input + output) the model can attend to in one call. 128K, 200K, 1M, 2M in 2026 | How much you can stuff into a prompt before it forgets the start
Output limit | Max tokens it can generate. Often 4K-64K, sometimes 200K with extended-output modes | Caps the length of generated content per call

Frontier models in 2026 typically advertise 200K-2M token context windows (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5). But "advertised" ≠ "useful": practical performance often degrades past 100K tokens, especially for retrieval inside the prompt. Lesson 3 covers this in detail.
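
Because the window covers input and output together, a spec can include a simple pre-flight budget check. A rough sketch, assuming a 4-characters-per-token heuristic rather than a real tokenizer; the function names and the example prompt are hypothetical:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; a real tokenizer will give different counts."""
    return max(1, round(len(text) / chars_per_token))

def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

prompt = "Summarize the attached contract. " * 1000  # hypothetical long prompt
prompt_tokens = estimate_tokens(prompt)
print(prompt_tokens, fits_context(prompt_tokens, 4_000, 128_000))
```

Passing this check only means the call will not be rejected for length. It says nothing about quality deep into a long prompt.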

4. Tokens — the Billing Unit

LLMs don't read characters or words; they read tokens. A token is a chunk of text — usually a frequent sub-word — that the model's tokenizer was trained to recognize. As a PM you need three things:

  • Rule of thumb: 1 token ≈ 3-4 English characters ≈ 0.75 words. 1,000 tokens ≈ 750 words ≈ 1.5 pages.
  • Pricing: input tokens are cheaper than output tokens (often 3-5x cheaper). Cached input tokens are cheaper still (often 10x cheaper than uncached input).
  • Non-English text and code use more tokens. Japanese, Chinese, Arabic — sometimes 2-3x the tokens of equivalent English. This shows up in your bill.

Model (illustrative 2026) | Input $/1M | Output $/1M | Cached input $/1M
Claude Opus 4 | $15 | $75 | $1.50
Claude Sonnet 4.5 | $3 | $15 | $0.30
Claude Haiku 4 | $1 | $5 | $0.10
GPT-5 | - | $30 | $1.00
GPT-4o-mini | $0.15 | $0.60 | $0.075
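
The word-to-token rule of thumb and per-million pricing combine into a quick cost estimate. A back-of-the-envelope sketch; the default prices are illustrative placeholders (output 5x input, cached input 10x cheaper), and the traffic figures are invented for the example:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 input_price=3.00, output_price=15.00, cached_price=0.30):
    """Dollar cost of one call; prices are per 1M tokens and purely illustrative."""
    uncached = input_tokens - cached_tokens
    if uncached < 0:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    total = (uncached * input_price
             + cached_tokens * cached_price
             + output_tokens * output_price)
    return total / 1_000_000

def words_to_tokens(words):
    """Rule of thumb: 1 token is roughly 0.75 English words."""
    return round(words / 0.75)

# Hypothetical feature: 1,500-word prompt, 300-word answer, 100,000 calls a month.
per_call = request_cost(words_to_tokens(1500), words_to_tokens(300))
monthly = per_call * 100_000
print(f"per call ${per_call:.4f}, per month ${monthly:,.2f}")
```

In this example the answer is a fifth the length of the prompt yet accounts for half the cost, which is why output length is worth budgeting explicitly.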

5. Latency, Throughput, and Why "Average" Lies

Metric | What it measures | Why it matters
Time-to-first-token (TTFT) | Delay before any output appears | UX feel. Sub-1s feels snappy; >3s feels broken
Tokens/sec (TPS) | Streaming generation speed | 50+ TPS reads as fast; below 20 feels sluggish
p50 latency | Median end-to-end time | The "typical" experience
p99 latency | End-to-end time exceeded only by the slowest 1% of requests | The complaints, the support tickets, the churn

In production, p99 is often 5-10x p50. A model that's "fast on average" can still be unusable for the users who hit the long tail. Always demand both p50 and p99 in eval reports.
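
To see how an average hides the tail, here is a small sketch that computes mean, p50, and p99 from a list of request latencies. It uses the nearest-rank percentile definition, which is one of several common conventions, and the latency sample is invented:

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical end-to-end latencies in seconds: 98 typical requests, 2 slow ones.
latencies = [1.0] * 98 + [9.0, 12.0]

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}s p50={p50:.1f}s p99={p99:.1f}s")
```

Here the mean (about 1.2s) looks healthy while p99 is nine times the median, so a report that shows only the average would miss the users who wait 9 seconds or more.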

6. Hallucination — What It Is, Why It Happens

A hallucination is a confident, plausible, factually wrong output. LLMs hallucinate because they are trained to produce likely next tokens, not true ones. Truth is a side effect of training data quality, not a training objective.

You cannot get hallucination to 0%. You can drive it down a long way:

  • RAG. Retrieve real documents at query time and ask the model to answer from them — and only from them.
  • Tool use. Let the model call calculators, databases, web search; ground answers in tool outputs.
  • Structured output / JSON schema. Force the model to emit data in a fixed shape; reject malformed output.
  • Citations. Force the model to cite the passage it used; reject answers without citations.
  • Evals + human review for high-stakes outputs.
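
The structured-output and citation checks in the list above can be enforced in a few lines of application code. A minimal sketch, assuming the model was asked to reply with JSON of the shape {"answer": ..., "citations": [...]}; that schema, the passage ids, and the function name are all hypothetical:

```python
import json

def validate_answer(raw_output: str, retrieved_ids: set) -> dict:
    """Parse the model's JSON reply and reject it unless it is well formed and cited.

    The expected shape {"answer": str, "citations": [passage ids]} is a
    hypothetical schema chosen for this sketch.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("malformed JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    answer = data.get("answer")
    citations = data.get("citations")
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("missing answer")
    if not isinstance(citations, list) or not citations:
        raise ValueError("answer has no citations")
    # Every citation must point at a passage that was actually given to the model.
    unknown = [c for c in citations if not isinstance(c, str) or c not in retrieved_ids]
    if unknown:
        raise ValueError(f"cites passages that were never retrieved: {unknown}")
    return data

retrieved = {"doc-12", "doc-40"}  # ids of passages actually supplied to the model
ok = validate_answer('{"answer": "Refunds take 5 days.", "citations": ["doc-12"]}', retrieved)
print(ok["answer"])
```

A check like this catches malformed or uncited output, not a wrong answer that cites a real passage, so it complements evals and human review and does not replace them.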

7. Fine-Tuning vs Prompting vs RAG

These are the three ways to "customize" model behavior. They are not interchangeable.

Aspect | Prompting | RAG | Fine-tuning
What it changes | What you ask the model | What context the model sees | The model's weights
Best for | Behavior, tone, format | Knowledge / facts | Style + niche domain expertise
Setup time | Hours | Days-weeks | Weeks-months
Per-query cost | Baseline | +retrieval cost (small) | Often cheaper at scale
Updates | Edit prompt | Re-index docs | Re-train
Default 2026 answer | Always start here | Add when knowledge is the issue | Add when style/domain is the issue and prompting failed
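
To make "what context the model sees" concrete, here is a toy RAG pipeline: pick the most relevant chunk, then build a prompt around it. The word-overlap score is a deliberately crude stand-in for embedding similarity, and the documents and function names are invented for illustration:

```python
def score(query: str, chunk: str) -> int:
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list) -> str:
    """Assemble a prompt that tells the model to answer only from the context."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Refunds are processed within 5 business days.",
    "The mobile app supports offline mode on iOS and Android.",
    "Annual plans can be cancelled at any time from the billing page.",
]
question = "How long do refunds take?"
top = retrieve(question, docs, k=1)
print(build_prompt(question, top))
```

Updating the system's knowledge here means editing `docs` and nothing else, which is the practical difference from fine-tuning: no weights change.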

8. Closed vs Open-Weight Models

Aspect | Closed (GPT, Claude, Gemini) | Open weight (Llama, DeepSeek, Qwen, Mistral)
Quality | Frontier | ~6-12 months behind frontier; close on many tasks
Cost | Predictable per-token API pricing | Variable: hosted (cheap) or self-hosted (GPU bills)
Data residency | Vendor's terms | Yours
Customization | Limited (fine-tune offering varies) | Full: you have the weights
Vendor lock-in | Real | None
Eng overhead | ~Zero | Significant: inference infra is a real team

9. Multimodality — the New Surface Area

Frontier models in 2026 are not just text. Most accept images natively (screenshots, photos, charts, PDFs); many handle audio (Gemini Live, GPT-4o voice) and video (Gemini 2.5 Pro, Sora 2). New product surfaces open up:

  • Vision Q&A on screenshots — Notion AI, Linear AI summarizing screenshots of Figma; support tools reading user-uploaded screenshots.
  • Document understanding — extract structured data from PDFs and images without OCR.
  • Voice agents — real-time interruption-tolerant phone agents (sub-500ms round-trip).
  • Video understanding — summarize meetings, support call recordings, demos.

10. Agents — What They Actually Are

An agent follows a very simple recipe:

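def run_agent(llm, execute_tool, prompt, tools):
    """Call the model in a loop until it answers in text instead of a tool call."""
    history = []  # (response, tool result) pairs fed back into every call
    while True:
        response = llm(prompt, history, tools)
        if response.is_tool_call:
            # The model asked for a tool: run it and record the result.
            result = execute_tool(response.tool, response.args)
            history.append((response, result))
        else:
            # A plain-text reply means the agent is done.
            return response.text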

An LLM, a list of tools (search, send_email, query_db, create_pr), and a loop. That's it. The "intelligence" is in the LLM choosing which tool to call next. Cursor's agent mode, Claude Code, Linear's auto-triage, and Devin all have this shape underneath.

Where agents work in 2026:

  • Constrained tool sets, short horizons (3-10 steps), reversible actions, recoverable failures — coding, data extraction, customer support routing.

Where they still struggle:

  • Long horizons (50+ steps), irreversible actions without human approval, multi-agent coordination, novel tool combinations.
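
The two lists above suggest two cheap guardrails: cap the number of steps, and require human sign-off before any irreversible tool runs. A sketch of both, using a scripted stand-in for the model so it runs on its own; `Step`, `IRREVERSIBLE`, `run_guarded`, and the `approve` callback are hypothetical names, not part of any real agent framework:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One model response: either a tool request or a final text answer."""
    text: str = ""
    tool: str = ""
    args: dict = field(default_factory=dict)

IRREVERSIBLE = {"send_email"}  # hypothetical list of tools that need sign-off

def run_guarded(llm, tools, approve, max_steps=10):
    """Agent loop with a step cap and human approval for irreversible tools."""
    history = []
    for _ in range(max_steps):
        step = llm(history)
        if not step.tool:
            return step.text  # final answer ends the loop
        if step.tool in IRREVERSIBLE and not approve(step):
            result = "denied by reviewer"  # the model sees the refusal next turn
        else:
            result = tools[step.tool](**step.args)
        history.append((step, result))
    raise RuntimeError(f"stopped after {max_steps} steps without an answer")

# Scripted stand-in for a real model: search, try to email, then answer.
script = iter([
    Step(tool="search", args={"query": "open bugs"}),
    Step(tool="send_email", args={"to": "team@example.com"}),
    Step(text="Found 3 open bugs; email was not sent."),
])
tools = {"search": lambda query: ["bug-1", "bug-2", "bug-3"],
         "send_email": lambda to: "sent"}
answer = run_guarded(lambda history: next(script), tools, approve=lambda step: False)
print(answer)
```

In a real product the `approve` callback would be a review UI or a ticket, and the step cap would be tuned to the task. Both are design choices for this sketch, not a standard.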

11. Cost Levers — How Each Shows Up on the Bill

Lever | Mechanism | Typical savings
Smaller model tier | Route easy queries to Haiku/Mini, hard ones to frontier | 3-15x
Prompt caching | Cache stable system prompts and reused context | 5-10x on input tokens
Shorter outputs | Cap max_tokens; use structured JSON instead of prose | 2-5x on output cost
RAG instead of giant prompts | Retrieve top-K relevant chunks, not the full corpus | 10-100x on input tokens
Batch APIs (24h SLA) | Anthropic/OpenAI batch tier | ~50% off
Self-hosted open weights | Trade $/req for fixed GPU bill | Crosses over above ~10M req/month
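
The self-hosting crossover is plain break-even arithmetic: a fixed monthly bill divided by the per-request saving. A sketch with invented inputs; the GPU bill and the per-request API cost are hypothetical figures, chosen only so the result lands near the table's order of magnitude:

```python
def breakeven_requests(gpu_cost_per_month: float, api_cost_per_request: float,
                       self_hosted_cost_per_request: float = 0.0) -> float:
    """Monthly request volume at which a fixed GPU bill equals the API bill."""
    saving_per_request = api_cost_per_request - self_hosted_cost_per_request
    if saving_per_request <= 0:
        raise ValueError("self-hosting never pays off at these prices")
    return gpu_cost_per_month / saving_per_request

# Hypothetical inputs: $20,000/month for GPUs and on-call, $0.002 per API request.
volume = breakeven_requests(20_000, 0.002)
print(f"break-even at about {volume:,.0f} requests/month")
```

The fixed cost should include the engineers who run the inference stack, not just the GPUs. Leaving them out makes the crossover look much closer than it is.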

12. The PM AI Vocabulary Cheat Sheet

Term | One-line meaning
Token | Sub-word unit; the billing atom
Context window | Max tokens per call (input + output)
Temperature | 0 = deterministic-ish, 1+ = creative; main randomness knob
System prompt | Persistent instructions that steer behavior
Few-shot | Showing examples in the prompt
Chain-of-thought | Asking the model to "think step by step"
RAG | Retrieval-Augmented Generation; bring docs to the model
Embedding | A vector representation of text used for semantic search
Fine-tuning | Adjusting model weights on your data
Tool / function calling | Model emits structured calls to your code
Agent | LLM + tools + loop
Eval | Automated test suite for model quality
Hallucination | Confident, plausible, wrong
Prompt injection | Attacker text that overrides your instructions
p99 latency | The 1% of slowest requests; the user complaints
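
Of the terms above, temperature is the easiest to show numerically. In the common formulation the model's raw scores are divided by the temperature before being turned into probabilities, so low values concentrate probability on the top token. A small sketch with invented scores; actual providers may implement sampling differently:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually handled as argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 almost all of the probability sits on the first token, and at 2.0 the three options are much closer together, which is the "deterministic-ish" versus "creative" behavior described in the table.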