Zero-Shot and Few-Shot Prompting
The single most-cited insight from the GPT-3 paper (Brown et al. 2020, "Language Models are Few-Shot Learners") was that you can teach a frozen LLM a new task by showing it a handful of examples in the prompt: no fine-tuning, no weight updates, no training run. In 2026 that capability has matured into a craft with rules of thumb, common pitfalls, and recent surprises (on instruction-tuned models, a well-written zero-shot prompt often beats few-shot). This lesson walks the spectrum from zero-shot through one-shot and few-shot to many-shot, covers when each wins, and shows how to pick and order examples in production.
1. The Definitions
| Mode | Examples in prompt | Typical when |
|---|---|---|
| Zero-shot | 0 | Simple, well-known tasks; instruction-tuned models |
| One-shot | 1 | Format cue; "look like this" |
| Few-shot | 2-8 | Custom labels, edge cases, structured output |
| Many-shot | 16-1000+ | Long-context models; replaces light fine-tuning |
In the original GPT-3 paper, the gap between zero-shot and few-shot was often 10-30 accuracy points on NLP benchmarks. Modern instruction-tuned models in 2026 (gpt-4o, gpt-4.1, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro) are dramatically stronger at zero-shot: the gap has compressed to 0-5 points on common tasks, and few-shot wins big mainly on custom formats, niche domains, and structured-output edge cases.
2. Zero-Shot in Practice
```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Classify the sentiment of the following review as "
            "positive, negative, or neutral. Reply with one word.\n\n"
            "Review: The flight was on time and the seats were comfortable."
        ),
    }],
    temperature=0,
)
print(resp.choices[0].message.content)
# positive
```
Three things this prompt does right: (a) names the task ("classify the sentiment"), (b) names the label space ("positive, negative, or neutral"), (c) constrains the output ("one word"). Zero-shot wins or loses on instruction clarity. Most "zero-shot doesn't work" complaints are actually "the instruction was vague" complaints.
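The three ingredients are easier to keep explicit when the prompt is assembled from parts. The following is a minimal sketch; the helper name `build_zero_shot_prompt` and its parameters are hypothetical and not part of any SDK.

```python
def build_zero_shot_prompt(task, labels, text, output_rule="Reply with one word."):
    """Assemble a zero-shot classification prompt from its three ingredients.

    Hypothetical helper for illustration; not part of the OpenAI SDK.
    """
    if len(labels) < 2:
        raise ValueError("need at least two labels")
    # (b) spell out the label space, e.g. "positive, negative, or neutral"
    if len(labels) == 2:
        label_text = f"{labels[0]} or {labels[1]}"
    else:
        label_text = ", ".join(labels[:-1]) + f", or {labels[-1]}"
    # (a) name the task, (c) constrain the output, then append the input
    return f"{task} as {label_text}. {output_rule}\n\nReview: {text}"


prompt = build_zero_shot_prompt(
    task="Classify the sentiment of the following review",
    labels=["positive", "negative", "neutral"],
    text="The flight was on time and the seats were comfortable.",
)
print(prompt)
```

Keeping the label list as data also means the same list can be reused to validate the model's reply.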
3. Few-Shot in Practice
```python
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Classify each review's sentiment as POS, NEG, or NEU.\n\n"
            "Review: Fast delivery, item as described.\n"
            "Sentiment: POS\n\n"
            "Review: Arrived broken, customer service ignored my emails.\n"
            "Sentiment: NEG\n\n"
            "Review: Box was a bit dented but the product works.\n"
            "Sentiment: NEU\n\n"
            "Review: The flight was on time and the seats were comfortable.\n"
            "Sentiment:"
        ),
    }],
    temperature=0,
    max_tokens=4,
)
print(resp.choices[0].message.content.strip())
# POS
```
This prompt combines three examples with a custom label space (POS/NEG/NEU instead of full words). Few-shot is the right tool when the label set is unusual, the format is strict, or the task is ambiguous in plain English.
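In practice the example block is usually generated from data rather than typed by hand. The following is a sketch under that assumption; `build_few_shot_prompt` is a hypothetical helper, and the "Review:" / "Sentiment:" field names simply mirror the prompt above.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Render (review, label) pairs and the query in one consistent format.

    Hypothetical helper for illustration only.
    """
    blocks = [instruction]
    for review, label in examples:
        blocks.append(f"Review: {review}\nSentiment: {label}")
    # End on an open "Sentiment:" so the model completes just the label.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


examples = [
    ("Fast delivery, item as described.", "POS"),
    ("Arrived broken, customer service ignored my emails.", "NEG"),
    ("Box was a bit dented but the product works.", "NEU"),
]
prompt = build_few_shot_prompt(
    "Classify each review's sentiment as POS, NEG, or NEU.",
    examples,
    "The flight was on time and the seats were comfortable.",
)
print(prompt)
```

Because every example and the final query pass through the same template, the delimiters and casing cannot drift apart.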
4. The Sweet Spot: 1-5 Shots, Diminishing After 8
Empirically across many tasks (classification, extraction, formatting, simple generation), accuracy plotted against shot count looks like:
```
accuracy
▲
│              ╭─────────── diminishing
│           ╭──╯            returns
│       ╭───╯
│   ╭───╯
│ ╭─╯
│╱
┼────┬────┬────┬────┬────┬────▶ shots
0    1    2    4    8    16
```
The big jump is 0 → 1 (especially for format). Going from 1 to 5 shots brings diminishing returns. Beyond 8-16, most tasks plateau or even regress on smaller models due to context dilution. Google DeepMind's "Many-Shot In-Context Learning" paper (Agarwal et al., 2024) showed that with very long contexts and hundreds of examples, a second growth regime emerges: useful, but specific to long-context models and tasks where you have hundreds of high-quality examples available.
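5. Order Matters: Recency Bias
Models tend to over-weight the labels of the most recent examples, so a run of same-label shots at the end of the prompt skews the next prediction toward that label.
```python
import random

# Hypothetical placeholder shots; each is a (review, label) pair.
pos1, pos2, pos3 = [(f"positive review {i}", "POS") for i in (1, 2, 3)]
neg1, neg2, neg3 = [(f"negative review {i}", "NEG") for i in (1, 2, 3)]
neu1, neu2 = [(f"neutral review {i}", "NEU") for i in (1, 2)]

# Bad: all positives first, all negatives last
shots = [pos1, pos2, pos3, neg1, neg2, neg3]  # model leans NEG on the next input

# Good: balanced across labels, then put in a seeded (reproducible) random order
random.seed(42)
shots = [pos1, neg1, neu1, pos2, neg2, neu2]
random.shuffle(shots)
```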
6. Shot Diversity Beats Shot Quantity
Three carefully chosen, diverse examples almost always outperform six near-duplicate ones. Diversity dimensions to cover:
- Label diversity — every class in your label space should appear at least once.
- Length diversity — mix short and long inputs.
- Edge-case diversity — include the sarcastic, the ambiguous, the multi-clause cases that the model would otherwise default-classify wrong.
- Distractor diversity — for extraction tasks, include examples where the obvious-looking candidate is the wrong answer.
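Label and length diversity can be enforced mechanically; edge-case and distractor coverage still need manual curation. The following is a rough sketch, assuming a pool of `{"text", "label"}` dicts; `pick_diverse_shots` is a hypothetical helper, not a library function.

```python
from collections import defaultdict


def pick_diverse_shots(pool, k):
    """Pick up to k shots, covering every label first, then varying length.

    Hypothetical helper. `pool` is a list of {"text": ..., "label": ...} dicts.
    """
    by_label = defaultdict(list)
    for ex in pool:
        by_label[ex["label"]].append(ex)

    # Within each label, alternate shortest / longest so input lengths vary.
    queues = []
    for label in sorted(by_label):
        items = sorted(by_label[label], key=lambda ex: len(ex["text"]))
        ordered = []
        while items:
            ordered.append(items.pop(0))      # shortest remaining
            if items:
                ordered.append(items.pop())   # longest remaining
        queues.append(ordered)

    # Round-robin across labels so each class appears before any repeats.
    picked = []
    while len(picked) < k and any(queues):
        for queue in queues:
            if queue and len(picked) < k:
                picked.append(queue.pop(0))
    return picked


pool = [
    {"text": "Great.", "label": "POS"},
    {"text": "Fast delivery, item exactly as described, would buy again.", "label": "POS"},
    {"text": "Works well.", "label": "POS"},
    {"text": "Broken.", "label": "NEG"},
    {"text": "Arrived broken and customer service ignored my emails.", "label": "NEG"},
    {"text": "Box was dented but the product works.", "label": "NEU"},
]
shots = pick_diverse_shots(pool, k=3)
print([s["label"] for s in shots])
```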
7. Format the Examples Like the Output
Few-shot is teaching format as much as semantics. Your examples must be in exactly the format you want the model to produce — same delimiters, same field names, same casing, same trailing characters.
```python
# Anthropic Messages API: multi-turn few-shot is preferred over flat text
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    system="Extract the company name and amount from each invoice line.",
    messages=[
        {"role": "user", "content": "Acme Corp $1,234.50"},
        {"role": "assistant", "content": '{"company": "Acme Corp", "amount": 1234.50}'},
        {"role": "user", "content": "BetaWorks Ltd. $99.99"},
        {"role": "assistant", "content": '{"company": "BetaWorks Ltd.", "amount": 99.99}'},
        {"role": "user", "content": "Gamma Industries $500"},
    ],
    max_tokens=128,
)
print(resp.content[0].text)
# {"company": "Gamma Industries", "amount": 500}
```
Embedding few-shot examples as alternating user / assistant turns is more reliable than packing them all into a single user message. Anthropic explicitly recommends this; OpenAI works either way, but the multi-turn pattern generalizes better to multi-step agents.
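The alternating-turn list can be generated from stored examples instead of written by hand. A small sketch, assuming examples are kept as (input, parsed output) pairs; `shots_to_messages` is a hypothetical helper, and serializing every output with `json.dumps` is one way to keep the assistant turns in an identical format.

```python
import json


def shots_to_messages(examples, query):
    """Turn (input, output_dict) pairs into alternating user/assistant turns.

    Hypothetical helper; the result has the list-of-dicts shape that the
    `messages` parameter in the example above takes.
    """
    messages = []
    for user_text, output in examples:
        messages.append({"role": "user", "content": user_text})
        # One serializer for every example keeps delimiters and key order fixed.
        messages.append({"role": "assistant", "content": json.dumps(output)})
    # The real query goes last, as an unanswered user turn.
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("Acme Corp $1,234.50", {"company": "Acme Corp", "amount": 1234.5}),
    ("BetaWorks Ltd. $99.99", {"company": "BetaWorks Ltd.", "amount": 99.99}),
]
messages = shots_to_messages(examples, "Gamma Industries $500")
for m in messages:
    print(m["role"], "|", m["content"])
```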
8. The "ICL ≈ Gradient Descent" Intuition
An influential 2022-2023 line of research (Akyürek, Dai, von Oswald and others) showed that for simple tasks, in-context learning can be mechanistically equivalent to running a few gradient-descent steps inside the forward pass — the attention layers implement a kind of meta-learner. You don't need this to be literally true to use the intuition: each shot is approximately one step of "training" the model on your task.
Practitioner consequence: the marginal value of shot N+1 is similar to the marginal value of one more fine-tuning step. After ~5 shots, the model has roughly internalized your task; further shots clarify edge cases rather than redefining the task.
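For the simplest possible case the correspondence can be checked numerically. The sketch below assumes a toy setting in the spirit of the cited work: linear regression with squared loss, weights initialized at zero, and unnormalized (softmax-free) linear attention. It illustrates the intuition only and makes no claim about how production LLMs compute.

```python
import random

random.seed(0)
d, n, lr = 3, 5, 0.1  # input dimension, number of shots, learning rate

# In-context examples (x_i, y_i) and a query x_q; the values are arbitrary.
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]
x_q = [random.gauss(0, 1) for _ in range(d)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# One gradient-descent step on L(w) = 1/(2n) * sum_i (w.x_i - y_i)^2 from w = 0.
# The gradient at w = 0 is -(1/n) * sum_i y_i * x_i,
# so after one step w_1 = (lr/n) * sum_i y_i * x_i.
w1 = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
pred_gd = dot(w1, x_q)

# Linear attention: keys are x_i, values are y_i, the query is x_q, and each
# value is weighted by the raw dot product x_i . x_q, scaled by lr/n.
pred_attn = lr / n * sum(y * dot(x, x_q) for x, y in zip(xs, ys))

print(pred_gd, pred_attn)  # the two predictions agree up to float rounding
```

In this setting each additional shot contributes one more term to the sum, which is the sense in which a shot behaves like a small amount of extra training signal.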
9. When Zero-Shot Beats Few-Shot
In 2026, with frontier instruction-tuned models, zero-shot often wins. Specifically when:
- The task is well-known and well-named (sentiment classification, summarization, translation, grammar-fix).
- Your few-shot examples are not great (low quality, label imbalance, narrow distribution).
- The few-shot examples bias the model toward a wrong heuristic that the zero-shot prompt would avoid.
- You're paying per token and the few-shot examples are eating most of your budget.
- You're chaining many calls in an agent and few-shot bloat compounds.
The empirical rule that holds in 2026: start with a careful zero-shot prompt; add few-shot only when a measurable evaluation shows it wins. Don't add shots out of superstition.
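"Measurable evaluation" can be as small as an accuracy comparison on a held-out labeled set. A minimal sketch: the function names, the stub classifiers standing in for real model calls, and the `min_gain=0.02` threshold are all illustrative assumptions.

```python
def accuracy(predict, labeled):
    """Fraction of (text, gold_label) pairs that `predict` gets right."""
    correct = sum(1 for text, gold in labeled if predict(text) == gold)
    return correct / len(labeled)


def keep_few_shot(zero_shot, few_shot, labeled, min_gain=0.02):
    """Return True only if few-shot beats zero-shot by at least `min_gain`."""
    return accuracy(few_shot, labeled) - accuracy(zero_shot, labeled) >= min_gain


# Stubs standing in for real model calls (hypothetical behavior).
def zero_shot(text):
    return {"great": "POS", "awful": "NEG", "fine": "NEU"}.get(text, "NEU")


def few_shot(text):
    return {"great": "POS", "awful": "NEG", "fine": "NEU", "love it": "POS"}[text]


dev_set = [("great", "POS"), ("awful", "NEG"), ("fine", "NEU"), ("love it", "POS")]
print(accuracy(zero_shot, dev_set), accuracy(few_shot, dev_set))
print(keep_few_shot(zero_shot, few_shot, dev_set))
```

A real dev set should be large enough that a gain of a few points is not noise; four items are used here only to keep the example short.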
10. Lost-in-the-Middle
Long few-shot prompts run into Liu et al. 2023's "Lost in the Middle" finding: facts placed in the middle of a long context are recalled noticeably worse than facts at the start or end. For few-shot prompting specifically:
- The first example anchors the format.
- The last example biases the next output (recency).
- Examples in the middle are weighted least.
If you have 5 critical examples and one is "the tricky edge case", put it at the start or the end — never in the middle.
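The placement rule above can be applied automatically once the critical examples are flagged. A sketch, assuming a caller-supplied predicate; `place_critical_at_edges` is a hypothetical helper.

```python
def place_critical_at_edges(examples, is_critical):
    """Reorder shots so critical ones sit at the start and end, never the middle.

    Hypothetical helper. `is_critical` is a caller-supplied predicate.
    """
    critical = [ex for ex in examples if is_critical(ex)]
    regular = [ex for ex in examples if not is_critical(ex)]
    # Alternate critical shots between the front and the back of the prompt.
    front, back = critical[0::2], critical[1::2]
    return front + regular + back[::-1]


examples = ["a", "b", "EDGE1", "c", "EDGE2"]
ordered = place_critical_at_edges(examples, lambda ex: ex.startswith("EDGE"))
print(ordered)
# ['EDGE1', 'a', 'b', 'c', 'EDGE2']
```

With a single critical example it lands first, where it also anchors the format.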
11. Dynamic Shot Picking via kNN
The advanced production technique: instead of using the same N shots for every query, retrieve the most similar examples to the current query from a large pool. Same pattern as RAG but applied to few-shot.
```python
# Build a pool of labeled examples; embed them once.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

pool = [
    {"text": "Fast delivery, item as described.", "label": "POS"},
    {"text": "Arrived broken, customer service ignored.", "label": "NEG"},
    {"text": "Box dented but product works.", "label": "NEU"},
    # ... thousands more
]
pool_emb = embedder.encode([x["text"] for x in pool], normalize_embeddings=True)


def pick_shots(query, k=4):
    q = embedder.encode(query, normalize_embeddings=True)
    scores = pool_emb @ q  # cosine similarity, since embeddings are normalized
    return [pool[i] for i in np.argsort(scores)[-k:][::-1]]


shots = pick_shots("The flight was on time and seats comfortable.")
```
Empirically, dynamic kNN-shot selection beats static random shots by 3-10 points on classification benchmarks. It is worth the engineering effort when your task is hard enough that 3-10 points matter.