AIMaks

Understanding Model Capabilities and Limitations

25 min read · AI Literacy for PMs
Lesson 3 of 18 · AI for Product Managers

Lesson 1 told you what AI is. Lesson 2 gave you the vocabulary. This lesson is the field guide: what frontier models in 2026 reliably do, where they break, and how to scope a feature without getting blindsided. The single most expensive PM mistake in AI is scoping against a capability the model only almost has. This lesson is the defense against that.

1. The 2026 Capability Frontier — What Works Reliably

Before listing failure modes, anchor on what is now boring and reliable. If your feature lives here, you can ship.

| Capability | Reliability | Production example |
|---|---|---|
| Summarization (1-50 page docs) | High | Notion AI, Granola |
| Classification / intent / routing | High | Intercom Fin, Zendesk AI |
| Structured extraction (JSON from text/images) | High | Ramp invoice parsing, Brex receipt OCR |
| Drafting (emails, marketing copy, PRDs) | High with editing | Gmail Help Me Write, Linear AI |
| Code completion (single-file) | High | Cursor, GitHub Copilot |
| Code agent (multi-file, with tests) | Medium-high | Claude Code, Cursor agent, Devin |
| Q&A over your own docs (RAG) | High with eval | Glean, Notion Q&A |
| Vision Q&A (screenshots, PDFs) | High | Claude vision, GPT-4o vision |
| Translation (major languages) | High | DeepL Pro, Google Translate |
| Multi-step reasoning (3-10 steps) | Medium-high | Math word problems, agent planning |

2. Where Models Are Unreliable

| Capability | Reliability | Why it fails |
|---|---|---|
| Long-horizon planning (50+ steps) | Low | Compounding error; lost context |
| Exact arithmetic at scale | Low without tools | Models are token predictors, not calculators |
| Novel research / discovery | Low | Trained on past; novelty is by definition out-of-distribution |
| Niche domain expertise (rare languages, obscure law) | Variable | Sparse training data |
| Multi-agent coordination | Low-medium | Context fragmentation; no shared state |
| Real-time current events (without tools) | Zero past cutoff | Frozen training data |
| Personally-identifying retrieval ("what's MY address") | Zero by design | Privacy; no training memory of individuals |
| Adversarial input (prompt injection) | Variable | No clean separation between data and instructions |
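
The "compounding error" behind the long-horizon row can be made concrete with a little arithmetic. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real agent steps are correlated), and the 99% per-step figure is an illustrative assumption, not a measured number.

```python
import math

def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

def max_steps(per_step: float, target: float) -> int:
    """Longest chain whose end-to-end success stays at or above target."""
    return math.floor(math.log(target) / math.log(per_step))

# Illustrative per-step reliability of 99% (an assumption, not a benchmark).
for n in (5, 10, 50):
    print(n, round(chain_success(0.99, n), 3))
# 5 0.951
# 10 0.904
# 50 0.605

print(max_steps(0.99, 0.90))  # 10
```

Under these assumptions, a step that works 99% of the time yields a 50-step plan that finishes cleanly only about 60% of the time, which is why checkpoints and human review tend to matter more as the chain grows.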

3. Why Benchmarks ≠ Real Product Performance

Vendors quote benchmark scores: MMLU, GPQA, SWE-Bench, HumanEval, MMMU, ARC-AGI. These are signal, not ground truth, for your product.

| Benchmark | What it measures | What it misses |
|---|---|---|
| MMLU | 57 academic subjects, multiple choice | Open-ended generation; tone; instruction following |
| GPQA | PhD-level science Q&A | How the model behaves on dumb questions |
| SWE-Bench | Real GitHub issues, model writes a PR | Your codebase, your style, your build system |
| HumanEval | 164 short Python functions from a docstring | Multi-file refactors; legacy code; non-Python |
| MMMU | Multimodal university exam questions | Real screenshots, real PDFs, real charts |
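
Since public benchmarks miss your own data, the usual complement is a small eval set drawn from your product. The following is a minimal sketch of that idea, not a full eval framework: the examples, the labels, and the `keyword_baseline` function are all hypothetical, and the baseline stands in for a real model call so the snippet runs offline.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical held-out examples: (input text, expected label).
EVAL_SET = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I open settings", "bug"),
    ("How do I export my data?", "how_to"),
]

def pass_rate(predict: Callable[[str], str],
              examples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of examples where the prediction matches the expected label."""
    hits = sum(1 for text, expected in examples if predict(text) == expected)
    return hits / len(examples)

def keyword_baseline(text: str) -> str:
    """Stand-in for a model call so the sketch runs without network access."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered or "error" in lowered:
        return "bug"
    return "how_to"

print(pass_rate(keyword_baseline, EVAL_SET))  # 1.0
```

Swapping `keyword_baseline` for a function that calls each candidate model gives a pass rate per model on your own inputs, which is the number to compare against a vendor's benchmark claims.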

4. The Jagged Frontier

Frontier LLMs are not uniformly smart. They can solve a graduate-level physics problem and then fumble the arithmetic in a 4-row spreadsheet. They can write a clean PRD and then mis-format the date. This shape is sometimes called the jagged frontier: capability varies enormously across superficially similar tasks.

Examples that surprise people every quarter:

  • Models that ace SAT reading can fail at counting the letters in a word.
  • Models that write production code can fail at counting how many functions they wrote.
  • Models that understand calculus can compute 3.11 - 3.9 wrong.
  • Models that pass the bar exam can fabricate citations.
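
Two of the examples above (letter counting and decimal subtraction) are cases where ordinary code is exact, so one common design is to route them to a tool rather than ask the model. The sketch below shows only the deterministic side; the function names are hypothetical, and how a model would invoke them (tool calling, code execution) is left out.

```python
from decimal import Decimal

def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a word."""
    return word.lower().count(letter.lower())

def exact_subtract(a: str, b: str) -> Decimal:
    """Subtract two decimal strings without binary floating-point error."""
    return Decimal(a) - Decimal(b)

print(count_letter("strawberry", "r"))  # 3
print(exact_subtract("3.11", "3.9"))    # -0.79
```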

5. Failure Modes You Must Design Around

| Failure | Frequency | Mitigation |
|---|---|---|
| Hallucination | 1-15% even on frontier | RAG, tool use, citations, evals, HITL for high-stakes |
| Refusal of valid request | 0.5-5% | Prompt design, system prompt tuning, fallback to alt model |
| Off-topic / drift | 1-5% | System prompt anchoring, output schema, post-filter |
| Format break (invalid JSON, bad markdown) | 0.1-2% | Structured output APIs, JSON schema, retry-with-fix |
| Prompt injection | Domain-dependent | Treat user input as data; don't put it in system prompt; output filtering; least-privilege tools |
| Latency spike (p99) | Always | Timeout + fallback model; degraded mode UX |
| Vendor outage | Twice a year-ish | Multi-provider fallback or graceful degradation |
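
To show what "retry-with-fix" for a format break might look like, here is a rough sketch. It assumes a hypothetical `call_model` callable that takes a prompt string and returns the model's raw text; no real vendor API is implied, and a production version would more likely lean on a structured-output API where one is available.

```python
import json
from typing import Callable, Set

def get_json(call_model: Callable[[str], str], prompt: str,
             required_keys: Set[str], max_attempts: int = 3) -> dict:
    """Ask for JSON; on a parse or schema failure, retry with the error attached."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("top-level value is not an object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            # Feed the failure back so the next attempt can correct it.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {err}\n"
                "Reply with valid JSON only."
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")

# Fake model for illustration: first reply is truncated, second is valid.
replies = iter(['{"intent": "refund"', '{"intent": "refund", "confidence": 0.9}'])
result = get_json(lambda _prompt: next(replies),
                  "Classify: I want my money back",
                  {"intent", "confidence"})
print(result["intent"])  # refund
```

The cap on attempts matters as much as the retry itself: once it is exhausted, the caller needs a failure path (fallback model or degraded UX) rather than an endless loop.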

6. The Cost-Quality-Latency Triangle

Pick two. The triangle is not negotiable; only the trade-off point is.

            QUALITY

              |
              |
   COST ◄─────┴─────► LATENCY

| You optimize for | You sacrifice | Typical move |
|---|---|---|
| Quality + low latency | Cost | Frontier model, dedicated capacity, no caching tricks |
| Quality + low cost | Latency | Batch API (24h), self-hosted, smaller-model + retry-on-fail |
| Low cost + low latency | Quality | Mid-tier model, terse prompts, accept some failures |

Real products often run several of these configurations at once: latency-critical paths use one, batch reports use another, and premium-tier users get a third.
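
One way to picture this is a small routing table keyed by product path. The sketch below is illustrative only: the path names, tier labels, and timeout values are all hypothetical, and a real system would attach actual model identifiers and fallbacks to each entry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    tier: str          # hypothetical tier label, not a real model name
    timeout_s: float   # per-request latency budget in seconds
    use_batch: bool    # True when the request can wait for a batch job

# Hypothetical routing table; paths and values are placeholders.
ROUTES = {
    "autocomplete": ModelConfig(tier="mid", timeout_s=1.0, use_batch=False),
    "nightly_report": ModelConfig(tier="frontier", timeout_s=3600.0, use_batch=True),
    "premium_chat": ModelConfig(tier="frontier", timeout_s=10.0, use_batch=False),
}
DEFAULT = ModelConfig(tier="mid", timeout_s=5.0, use_batch=False)

def pick_config(path: str) -> ModelConfig:
    """Return the config for a product path, falling back to a safe default."""
    return ROUTES.get(path, DEFAULT)

print(pick_config("nightly_report").use_batch)  # True
print(pick_config("unknown_path").tier)         # mid
```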

7. Practical Limits Worth Respecting

8. Capability Creep — Roadmapping Against a Moving Target

What was infeasible 12 months ago ships today. Every PM is rebuilding their feature against a model that's smarter than the one they scoped against.

| Year | What was hard | What's now boring |
|---|---|---|
| 2022 | Conversational fluency | Now: ChatGPT, table stakes |
| 2023 | Multi-step reasoning | Now: o1/o3 reasoning models |
| 2024 | Multimodal (vision in chat) | Now: GPT-4o, Claude vision, default |
| 2025 | Long-context (1M tokens) | Now: Gemini 2.5, Claude 4.5 |
| 2026 | Reliable multi-step coding agents | Now: Cursor agent, Claude Code |

9. The PM Capability Checklist

Before scoping any AI feature, run through this list with your engineering lead:

  1. Is this on the reliable side of the frontier? (Section 1 vs Section 2 of this lesson.)
  2. Do we have a held-out eval set? If not, the feature is unscopable.
  3. What's the cost per use, and what's the revenue? Unit economics in the spec, not after launch.
  4. What's the p99 latency budget? What's the fallback when we miss it?
  5. What's our hallucination tolerance? What's the recovery when it happens?
  6. What's the failure UX? Models will fail; the failure mode is part of the design, not an afterthought.
  7. Are we exposed to prompt injection? Does user input flow into a high-privilege agent or tool call?
  8. Which model tier? With which fallback? "Frontier always" is rarely right.
  9. How will we know when a model upgrade improves or regresses us? Regression evals on every version bump.
  10. What's the human-in-the-loop story for the top 5% of high-stakes outputs?
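
Item 3 on the checklist (unit economics) comes down to simple arithmetic that can sit in the spec. The sketch below assumes token-based pricing quoted per million tokens; every number in the example (token counts, prices, revenue per use) is a made-up placeholder, not any vendor's actual rate.

```python
def cost_per_use(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float,
                 calls_per_use: int = 1) -> float:
    """Model cost in USD for one use of the feature."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return per_call * calls_per_use

def gross_margin(revenue_per_use: float, cost: float) -> float:
    """Share of revenue left after model cost."""
    return (revenue_per_use - cost) / revenue_per_use

# Placeholder inputs: 4,000 tokens in, 500 out, two model calls per use,
# at hypothetical prices of $3 and $15 per million tokens.
cost = cost_per_use(4000, 500, usd_per_m_input=3.0, usd_per_m_output=15.0,
                    calls_per_use=2)
print(round(cost, 4))                       # 0.039
print(round(gross_margin(0.10, cost), 2))   # 0.61
```

Retries, fallbacks, and eval runs all add calls, so `calls_per_use` is usually the input worth pressure-testing with engineering.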