Understanding Model Capabilities and Limitations
Lesson 3 of 18 · AI for Product Managers
Lesson 1 told you what AI is. Lesson 2 gave you the vocabulary. This lesson is the field guide: what frontier models in 2026 reliably do, where they break, and how to scope a feature without getting blindsided. The single most expensive PM mistake in AI is scoping against a capability the model only almost has. This lesson is the defense against that.
1. The 2026 Capability Frontier — What Works Reliably
Before listing failure modes, anchor on what is now boring and reliable. If your feature lives here, you can ship.
| Capability | Reliability | Production example |
|---|---|---|
| Summarization (1-50 page docs) | High | Notion AI, Granola |
| Classification / intent / routing | High | Intercom Fin, Zendesk AI |
| Structured extraction (JSON from text/images) | High | Ramp invoice parsing, Brex receipt OCR |
| Drafting (emails, marketing copy, PRDs) | High with editing | Gmail Help Me Write, Linear AI |
| Code completion (single-file) | High | Cursor, GitHub Copilot |
| Code agent (multi-file, with tests) | Medium-high | Claude Code, Cursor agent, Devin |
| Q&A over your own docs (RAG) | High with eval | Glean, Notion Q&A |
| Vision Q&A (screenshots, PDFs) | High | Claude vision, GPT-4o vision |
| Translation (major languages) | High | DeepL Pro, Google Translate |
| Multi-step reasoning (3-10 steps) | Medium-high | Math word problems, agent planning |
2. Where Models Are Unreliable
| Capability | Reliability | Why it fails |
|---|---|---|
| Long-horizon planning (50+ steps) | Low | Compounding error; lost context |
| Exact arithmetic at scale | Low without tools | Models are token predictors, not calculators |
| Novel research / discovery | Low | Trained on past; novelty is by definition out-of-distribution |
| Niche domain expertise (rare languages, obscure law) | Variable | Sparse training data |
| Multi-agent coordination | Low-medium | Context fragmentation; no shared state |
| Real-time current events (without tools) | Zero past cutoff | Frozen training data |
| Personally-identifying retrieval ("what's MY address") | Zero by design | Privacy; no training memory of individuals |
| Adversarial input (prompt injection) | Variable | No clean separation between data and instructions |
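The "compounding error" entry for long-horizon planning is easier to feel with numbers. The sketch below assumes each step succeeds independently with the same probability, which is a simplification (real agent steps are correlated and can sometimes recover), and the function name `chain_success` is illustrative only.

```python
def chain_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent steps multiply, so reliability decays geometrically.
    return per_step_reliability ** steps


if __name__ == "__main__":
    # Compare a short task, a medium task, and a long-horizon task.
    for steps in (3, 10, 50):
        print(steps, round(chain_success(0.99, steps), 3))
```

Under this assumption, a model that is 99% reliable per step finishes a 50-step chain cleanly only about 60% of the time, which is why a task that looks fine at 3-10 steps can fall into the "Low" row at 50+.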
3. Why Benchmarks ≠ Real Product Performance
Vendors quote benchmark scores: MMLU, GPQA, SWE-Bench, HumanEval, MMMU, ARC-AGI. These are signal, not ground truth, for your product.
| Benchmark | What it measures | What it misses |
|---|---|---|
| MMLU | 57 academic subjects, multiple choice | Open-ended generation; tone; instruction following |
| GPQA | PhD-level science Q&A | How the model behaves on dumb questions |
| SWE-Bench | Real GitHub issues, model writes a PR | Your codebase, your style, your build system |
| HumanEval | 164 short Python functions from a docstring | Multi-file refactors; legacy code; non-Python |
| MMMU | Multimodal university exam questions | Real screenshots, real PDFs, real charts |
4. The Jagged Frontier
Frontier LLMs are not uniformly smart. They can solve a graduate-level physics problem and then fumble the arithmetic in a 4-row spreadsheet. They can write a clean PRD and then mis-format the date. This shape is sometimes called the jagged frontier: capability varies enormously across superficially similar tasks.
Examples that surprise people every quarter:
- Models that ace SAT reading can fail at counting the letters in a word.
- Models that write production code can fail at counting how many functions they wrote.
- Models that understand calculus can compute `3.11 - 3.9` wrong.
- Models that pass the bar exam can fabricate citations.
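The arithmetic and letter-counting examples above are the kind of work that is usually cheaper to hand to deterministic code than to the model. A minimal sketch of two such "tools", using only the Python standard library; the function names are hypothetical, and how they get exposed to a model (function calling, an agent framework, a post-processing step) depends on your stack.

```python
from decimal import Decimal


def exact_subtract(a: str, b: str) -> Decimal:
    """Subtract two decimal strings exactly instead of predicting the digits."""
    return Decimal(a) - Decimal(b)


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    print(exact_subtract("3.11", "3.9"))    # -0.79
    print(count_letter("strawberry", "r"))  # 3
```

The design point for scoping: if a feature depends on exact numbers or exact counts, plan for a tool call from the start rather than hoping the next model version closes the gap.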
5. Failure Modes You Must Design Around
| Failure | Frequency | Mitigation |
|---|---|---|
| Hallucination | 1-15% even on frontier | RAG, tool use, citations, evals, HITL for high-stakes |
| Refusal of valid request | 0.5-5% | Prompt design, system prompt tuning, fallback to alt model |
| Off-topic / drift | 1-5% | System prompt anchoring, output schema, post-filter |
| Format break (invalid JSON, bad markdown) | 0.1-2% | Structured output APIs, JSON schema, retry-with-fix |
| Prompt injection | Domain-dependent | Treat user input as data; don't put it in system prompt; output filtering; least-privilege tools |
| Latency spike (p99) | Always | Timeout + fallback model; degraded mode UX |
| Vendor outage | Twice a year-ish | Multi-provider fallback or graceful degradation |
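The "retry-with-fix" mitigation for format breaks can be sketched roughly as follows. This is an illustration under assumptions, not a vendor API: `call_model` is a hypothetical stand-in for whatever client your team uses (it takes a prompt string and returns the raw reply), and the key check is a simplified substitute for a real JSON schema validator.

```python
import json
from typing import Callable


def generate_json(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Ask for JSON, validate it, and retry with the error fed back to the model."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("top-level value is not a JSON object")
            missing = required_keys - set(parsed)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return parsed
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = str(exc)
            # Tell the model what was wrong so the next attempt can fix it.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {last_error}\n"
                "Reply with valid JSON only."
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```

The final `RuntimeError` is where the failure UX takes over: after the retry budget is spent, the product needs a defined fallback (an alternate model, a degraded mode, or a human) rather than an unhandled exception.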
6. The Cost-Quality-Latency Triangle
Pick two. The triangle is not negotiable; only the trade-off point is.
[Diagram: a triangle with Quality at the apex, Cost at the left base corner, and Latency at the right base corner.]
| You optimize for | You sacrifice | Typical move |
|---|---|---|
| Quality + low latency | Cost | Frontier model, dedicated capacity, no caching tricks |
| Quality + low cost | Latency | Batch API (24h), self-hosted, smaller-model + retry-on-fail |
| Low cost + low latency | Quality | Mid-tier model, terse prompts, accept some failures |
Real products often run multiple branches: latency-critical paths use one config, batch reports use another, premium tier users get the third.
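Running multiple branches usually comes down to a small routing table. The sketch below is one hypothetical way to express it: the tier labels, timeout values, and the `pick_config` rules are all assumptions for illustration, not recommendations, and real routing would also consider quotas and per-request cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model_tier: str          # hypothetical tier label, not a real model name
    use_batch_api: bool      # True trades latency for lower cost
    timeout_seconds: float   # illustrative latency budget


# One hypothetical config per row of the trade-off table.
CONFIGS = {
    "premium": ModelConfig("frontier", use_batch_api=False, timeout_seconds=10.0),
    "batch": ModelConfig("frontier", use_batch_api=True, timeout_seconds=86_400.0),
    "interactive": ModelConfig("mid-tier", use_batch_api=False, timeout_seconds=2.0),
}


def pick_config(is_premium_user: bool, is_batch_job: bool) -> ModelConfig:
    """Route a request to the config that matches its trade-off point."""
    if is_batch_job:
        return CONFIGS["batch"]        # quality + low cost, sacrifices latency
    if is_premium_user:
        return CONFIGS["premium"]      # quality + low latency, sacrifices cost
    return CONFIGS["interactive"]      # low cost + low latency, sacrifices quality
```

Keeping the trade-off in one table like this makes it something a PM and an engineering lead can review together, instead of a decision scattered across call sites.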
7. Practical Limits Worth Respecting
8. Capability Creep — Roadmapping Against a Moving Target
What was infeasible 12 months ago ships today. Every PM is rebuilding their feature against a model that's smarter than the one they scoped against.
| Year | What was hard | What's now boring |
|---|---|---|
| 2022 | Conversational fluency | Now: ChatGPT, table stakes |
| 2023 | Multi-step reasoning | Now: o1/o3 reasoning models |
| 2024 | Multimodal (vision in chat) | Now: GPT-4o, Claude vision, default |
| 2025 | Long-context (1M tokens) | Now: Gemini 2.5, Claude 4.5 |
| 2026 | Reliable multi-step coding agents | Now: Cursor agent, Claude Code |
9. The PM Capability Checklist
Before scoping any AI feature, run through this list with your engineering lead:
- Is this on the reliable side of the frontier? (Section 1 vs Section 2 of this lesson.)
- Do we have a held-out eval set? If not, the feature is unscopable.
- What's the cost per use, and what's the revenue? Unit economics in the spec, not after launch.
- What's the p99 latency budget? What's the fallback when we miss it?
- What's our hallucination tolerance? What's the recovery when it happens?
- What's the failure UX? Models will fail; the failure mode is part of the design, not an afterthought.
- Are we exposed to prompt injection? Does user input flow into a high-privilege agent or tool call?
- Which model tier? With which fallback? "Frontier always" is rarely right.
- How will we know when a model upgrade improves or regresses us? Regression evals on every version bump.
- What's the human-in-the-loop story for the top 5% of high-stakes outputs?
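The held-out eval set and the regression check on every version bump can be sketched in a few lines. This is a simplified illustration: it assumes exact-match grading (many real evals need a rubric or a grader model), the `model` callables are hypothetical stand-ins for your actual clients, and the 1-point `tolerance` is an arbitrary example threshold.

```python
from typing import Callable, Sequence


def pass_rate(
    model: Callable[[str], str],             # hypothetical: prompt in, answer out
    eval_set: Sequence[tuple[str, str]],     # held-out (input, expected) pairs
) -> float:
    """Fraction of held-out examples the model answers with an exact match."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    passed = sum(
        1 for prompt, expected in eval_set if model(prompt).strip() == expected
    )
    return passed / len(eval_set)


def is_regression(
    baseline_rate: float, candidate_rate: float, tolerance: float = 0.01
) -> bool:
    """True when the candidate scores more than `tolerance` below the baseline."""
    return candidate_rate < baseline_rate - tolerance
```

In use, you would compute `pass_rate` for the current model and the candidate on the same held-out set, then block the upgrade when `is_regression` returns `True`. The same numbers answer the opposite question too: whether a cheaper tier is good enough for this feature.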