Understanding Model Capabilities and Limitations

25 min read · Reading · AI Literacy for PMs

3 of 18 · AI for Product Managers

Lesson 1 told you what AI is. Lesson 2 gave you the vocabulary. This lesson is the field guide: what frontier models in 2026 reliably do, where they break, and how to scope a feature without getting blindsided. The single most expensive PM mistake in AI is scoping against a capability the model only almost has. This lesson is the defense against that.

1. The 2026 Capability Frontier — What Works Reliably

Before listing failure modes, anchor on what is now boring and reliable. If your feature lives here, you can ship.

Capability	Reliability	Production example
Summarization (1-50 page docs)	High	Notion AI, Granola
Classification / intent / routing	High	Intercom Fin, Zendesk AI
Structured extraction (JSON from text/images)	High	Ramp invoice parsing, Brex receipt OCR
Drafting (emails, marketing copy, PRDs)	High with editing	Gmail Help Me Write, Linear AI
Code completion (single-file)	High	Cursor, GitHub Copilot
Code agent (multi-file, with tests)	Medium-high	Claude Code, Cursor agent, Devin
Q&A over your own docs (RAG)	High with eval	Glean, Notion Q&A
Vision Q&A (screenshots, PDFs)	High	Claude vision, GPT-4o vision
Translation (major languages)	High	DeepL Pro, Google Translate
Multi-step reasoning (3-10 steps)	Medium-high	Math word problems, agent planning

2. Where Models Are Unreliable

Capability	Reliability	Why it fails
Long-horizon planning (50+ steps)	Low	Compounding error; lost context
Exact arithmetic at scale	Low without tools	Models are token predictors, not calculators
Novel research / discovery	Low	Trained on past; novelty is by definition out-of-distribution
Niche domain expertise (rare languages, obscure law)	Variable	Sparse training data
Multi-agent coordination	Low-medium	Context fragmentation; no shared state
Real-time current events (without tools)	Zero past cutoff	Frozen training data
Personally-identifying retrieval ("what's MY address")	Zero by design	Privacy; no training memory of individuals
Adversarial input (prompt injection)	Variable	No clean separation between data and instructions
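The "compounding error" entry for long-horizon planning can be made concrete with a little arithmetic. The sketch below assumes each step succeeds independently with the same probability, which is a simplification; real agent steps are correlated, and the 99% figure is illustrative rather than measured.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps fail independently with identical probability,
    which is a simplifying assumption, not a property of real agents.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Illustrative per-step reliability of 99%.
    for steps in (5, 10, 50):
        print(f"{steps:>2} steps: {chain_success_rate(0.99, steps):.3f}")
```

Under these assumptions a step that works 99% of the time yields a 50-step chain that completes cleanly only about 60% of the time, which is why the 3-10 step range sits in Section 1 and 50+ steps sits here.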

3. Why Benchmarks ≠ Real Product Performance

Vendors quote benchmark scores: MMLU, GPQA, SWE-Bench, HumanEval, MMMU, ARC-AGI. These are signal, not ground truth, for your product.

Benchmark	What it measures	What it misses
MMLU	57 academic subjects, multiple choice	Open-ended generation; tone; instruction following
GPQA	PhD-level science Q&A	How the model behaves on dumb questions
SWE-Bench	Real GitHub issues, model writes a PR	Your codebase, your style, your build system
HumanEval	164 short Python functions from a docstring	Multi-file refactors; legacy code; non-Python
MMMU	Multimodal university exam questions	Real screenshots, real PDFs, real charts
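Because a benchmark score says little about your own inputs, the usual complement is a small eval set drawn from your product's traffic. The following is a minimal sketch: `fake_model`, `EVAL_SET`, and the exact-match scoring rule are all hypothetical stand-ins, and a real harness would call your actual model and likely use a looser grader.

```python
from typing import Callable


def pass_rate(model_fn: Callable[[str], str],
              cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model output matches the expected label.

    Matching is case-insensitive exact match after trimming whitespace;
    this is an illustrative grading rule, not a recommended standard.
    """
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for prompt, expected in cases
        if model_fn(prompt).strip().lower() == expected.strip().lower()
    )
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (intent classification)."""
    return "refund" if "money back" in prompt.lower() else "other"


# Hypothetical held-out examples; in practice these come from real tickets.
EVAL_SET = [
    ("I want my money back", "refund"),
    ("Where is my order?", "shipping"),
    ("Please give me my money back now", "refund"),
    ("How do I reset my password?", "other"),
]

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(fake_model, EVAL_SET):.2f}")
```

The point of the design is that the same `EVAL_SET` can be rerun against any vendor or model version, so the number you compare is about your product rather than about MMLU.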

4. The Jagged Frontier

Frontier LLMs are not uniformly smart. They can solve a graduate-level physics problem and then fumble the arithmetic in a 4-row spreadsheet. They can write a clean PRD and then mis-format the date. This shape is sometimes called the jagged frontier: capability varies enormously across superficially similar tasks.

Examples that surprise people every quarter:

Models that ace SAT reading can fail at counting the letters in a word.
Models that write production code can fail at counting how many functions they wrote.
Models that understand calculus can compute 3.11 - 3.9 wrong.
Models that pass the bar exam can fabricate citations.
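For the arithmetic case, the usual design response is to hand the calculation to a deterministic tool instead of trusting generated digits. Here is a minimal sketch of such a tool using Python's standard `decimal` module; the function name `exact_subtract` and the idea of exposing it as a model-callable tool are assumptions for illustration.

```python
from decimal import Decimal


def exact_subtract(a: str, b: str) -> str:
    """Subtract two decimal numbers given as strings, exactly.

    Taking strings (not floats) avoids binary floating-point rounding,
    so the result is what a person would get on paper.
    """
    return str(Decimal(a) - Decimal(b))


if __name__ == "__main__":
    print(exact_subtract("3.11", "3.9"))  # -0.79
```

The model's job then shrinks to deciding which numbers to pass in, which sits much closer to the reliable side of the frontier than producing the digits itself.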

5. Failure Modes You Must Design Around

Failure	Frequency	Mitigation
Hallucination	1-15% even on frontier	RAG, tool use, citations, evals, HITL for high-stakes
Refusal of valid request	0.5-5%	Prompt design, system prompt tuning, fallback to alt model
Off-topic / drift	1-5%	System prompt anchoring, output schema, post-filter
Format break (invalid JSON, bad markdown)	0.1-2%	Structured output APIs, JSON schema, retry-with-fix
Prompt injection	Domain-dependent	Treat user input as data; don't put it in system prompt; output filtering; least-privilege tools
Latency spike (p99)	Always	Timeout + fallback model; degraded mode UX
Vendor outage	Twice a year-ish	Multi-provider fallback or graceful degradation
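The "retry-with-fix" mitigation for format breaks might look roughly like the sketch below. `model_fn` is a hypothetical callable wrapping whatever model API you use, and `FlakyModel` is a test double that returns broken JSON once; neither is a real vendor interface.

```python
import json
from typing import Callable, Optional


def call_with_json_retry(model_fn: Callable[[str], str],
                         prompt: str,
                         required_keys: set[str],
                         max_attempts: int = 2) -> Optional[dict]:
    """Ask for JSON; on a bad reply, retry with the error described.

    Returns the parsed dict, or None if every attempt fails so the
    caller can fall back to a degraded experience.
    """
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = model_fn(attempt_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            attempt_prompt = (f"{prompt}\n\nYour previous reply was not valid "
                              f"JSON ({exc.msg}). Reply with JSON only.")
            continue
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
        attempt_prompt = (f"{prompt}\n\nYour previous reply was missing one of "
                          f"{sorted(required_keys)}. Reply with JSON only.")
    return None


class FlakyModel:
    """Hypothetical test double: truncated JSON first, valid JSON after."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return '{"intent": "refund"' if self.calls == 1 else '{"intent": "refund"}'


if __name__ == "__main__":
    print(call_with_json_retry(FlakyModel(), "Classify: money back", {"intent"}))
```

Returning `None` rather than raising keeps the failure path explicit: the caller decides whether to try a fallback model or show the degraded-mode UX from the table above.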

6. The Cost-Quality-Latency Triangle

Pick two. The triangle is not negotiable; only the trade-off point is.


            QUALITY
              ▲
              |
              |
   COST ◄─────┴─────► LATENCY

You optimize for	You sacrifice	Typical move
Quality + low latency	Cost	Frontier model, dedicated capacity, no caching tricks
Quality + low cost	Latency	Batch API (24h), self-hosted, smaller-model + retry-on-fail
Low cost + low latency	Quality	Mid-tier model, terse prompts, accept some failures

Real products often run multiple branches: latency-critical paths use one config, batch reports use another, premium tier users get the third.
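Running multiple branches usually comes down to a small routing table. The sketch below is one possible shape; the tier names, timeouts, and the `pick_config` rule are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model_tier: str    # hypothetical label, e.g. "mid-tier" or "frontier"
    timeout_s: float   # how long the caller is willing to wait
    use_batch: bool    # whether to send work through a batch endpoint


# Illustrative values only; tune against your own latency and cost data.
CONFIGS = {
    "interactive": ModelConfig("mid-tier", 2.0, False),
    "batch_report": ModelConfig("frontier", 86_400.0, True),
    "premium": ModelConfig("frontier", 10.0, False),
}


def pick_config(path: str, is_premium: bool) -> ModelConfig:
    """Choose a trade-off point per request instead of one global setting."""
    if path == "batch_report":
        return CONFIGS["batch_report"]   # quality + low cost, slow
    if is_premium:
        return CONFIGS["premium"]        # quality + low latency, expensive
    return CONFIGS["interactive"]        # low cost + low latency


if __name__ == "__main__":
    print(pick_config("chat", is_premium=False))
    print(pick_config("batch_report", is_premium=True))
```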

7. Practical Limits Worth Respecting

8. Capability Creep — Roadmapping Against a Moving Target

What was infeasible 12 months ago ships today. Every PM is rebuilding their feature against a model that's smarter than the one they scoped against.

Year	What was hard	What's now boring
2022	Conversational fluency	Now: ChatGPT, table stakes
2023	Multi-step reasoning	Now: o1/o3 reasoning models
2024	Multimodal (vision in chat)	Now: GPT-4o, Claude vision, default
2025	Long-context (1M tokens)	Now: Gemini 2.5, Claude 4.5
2026	Reliable multi-step coding agents	Now: Cursor agent, Claude Code

9. The PM Capability Checklist

Before scoping any AI feature, run through this list with your engineering lead:

Is this on the reliable side of the frontier? (Section 1 vs Section 2 of this lesson.)
Do we have a held-out eval set? If not, the feature is unscopable.
What's the cost per use, and what's the revenue? Unit economics in the spec, not after launch.
What's the p99 latency budget? What's the fallback when we miss it?
What's our hallucination tolerance? What's the recovery when it happens?
What's the failure UX? Models will fail; the failure mode is part of the design, not an afterthought.
Are we exposed to prompt injection? Does user input flow into a high-privilege agent or tool call?
Which model tier? With which fallback? "Frontier always" is rarely right.
How will we know when a model upgrade improves or regresses us? Regression evals on every version bump.
What's the human-in-the-loop story for the top 5% of high-stakes outputs?
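The unit-economics question in the checklist is simple enough to put in the spec as a formula. The sketch below assumes per-million-token pricing; the token counts and prices are made-up placeholders, so substitute your vendor's current rates and your own measured usage.

```python
def cost_per_use(input_tokens: int,
                 output_tokens: int,
                 usd_per_m_input: float,
                 usd_per_m_output: float,
                 calls_per_use: int = 1) -> float:
    """Estimated model cost in USD for one use of a feature.

    Prices are expressed per million tokens; one "use" may involve
    several model calls (retries, agent steps, and so on).
    """
    per_call = (input_tokens / 1_000_000 * usd_per_m_input
                + output_tokens / 1_000_000 * usd_per_m_output)
    return per_call * calls_per_use


if __name__ == "__main__":
    # Placeholder numbers: 2,000 input and 500 output tokens per call,
    # $3 and $15 per million tokens, 3 calls per feature use.
    cost = cost_per_use(2_000, 500, 3.0, 15.0, calls_per_use=3)
    print(f"cost per use: ${cost:.4f}")
```

Comparing that figure with revenue per use before launch is what turns "unit economics in the spec" from a slogan into a go or no-go number.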
