Zero-Shot and Few-Shot Prompting

The single most-cited insight from the GPT-3 paper (Brown et al. 2020, "Language Models are Few-Shot Learners") was that you can teach a frozen LLM a new task by showing it a handful of examples in the prompt: no fine-tuning, no weight updates, no training run. In 2026 that capability has matured into a craft with rules of thumb, common pitfalls, and one recent surprise: on instruction-tuned models, a well-written zero-shot prompt often beats few-shot. This lesson walks the spectrum from zero-shot through one-shot and few-shot to many-shot, covers when each wins, and shows how to pick and order examples in production.

1. The Definitions

Mode	Examples in prompt	Typical when
Zero-shot	0	Simple, well-known tasks; instruction-tuned models
One-shot	1	Format cue; "look like this"
Few-shot	2-8	Custom labels, edge cases, structured output
Many-shot	16-1000+	Long-context models; replaces light fine-tuning

In the original GPT-3 paper, the gap between zero-shot and few-shot was large, often on the order of 10-30 accuracy points on the NLP benchmarks reported. With modern instruction-tuned 2026 models (gpt-4o, gpt-4.1, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro), zero-shot is dramatically stronger: the gap has typically compressed to 0-5 points on common tasks, and few-shot wins big mainly on custom formats, niche domains, and structured-output edge cases.

2. Zero-Shot in Practice

code

from openai import OpenAI
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Classify the sentiment of the following review as "
            "positive, negative, or neutral. Reply with one word.\n\n"
            "Review: The flight was on time and the seats were comfortable."
        )
    }],
    temperature=0,
)
print(resp.choices[0].message.content)
# positive

Three things this prompt does right: (a) names the task ("classify the sentiment"), (b) names the label space ("positive, negative, or neutral"), (c) constrains the output ("one word"). Zero-shot wins or loses on instruction clarity. Most "zero-shot doesn't work" complaints are actually "the instruction was vague" complaints.
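
The three ingredients above can be captured in a small prompt builder. This is a minimal sketch: `build_zero_shot_prompt` and its parameters are hypothetical names chosen for illustration, not part of any SDK, and the "Review:" prefix simply mirrors the example above.

```python
def build_zero_shot_prompt(task, labels, text, output_rule="Reply with one word."):
    """Assemble a zero-shot prompt that names the task, the label space, and the output constraint."""
    if len(labels) < 2:
        raise ValueError("need at least two labels")
    if len(labels) == 2:
        label_list = f"{labels[0]} or {labels[1]}"
    else:
        label_list = ", ".join(labels[:-1]) + f", or {labels[-1]}"
    # Task + label space + output constraint, then the input on its own line
    return f"{task} as {label_list}. {output_rule}\n\nReview: {text}"


prompt = build_zero_shot_prompt(
    task="Classify the sentiment of the following review",
    labels=["positive", "negative", "neutral"],
    text="The flight was on time and the seats were comfortable.",
)
print(prompt)
```

Keeping the three parts as separate arguments makes it harder to ship a prompt that silently drops the label space or the output constraint.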

3. Few-Shot in Practice

code

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Classify each review's sentiment as POS, NEG, or NEU.\n\n"
            "Review: Fast delivery, item as described.\n"
            "Sentiment: POS\n\n"
            "Review: Arrived broken, customer service ignored my emails.\n"
            "Sentiment: NEG\n\n"
            "Review: Box was a bit dented but the product works.\n"
            "Sentiment: NEU\n\n"
            "Review: The flight was on time and the seats were comfortable.\n"
            "Sentiment:"
        )
    }],
    temperature=0,
    max_tokens=4,
)
print(resp.choices[0].message.content.strip())
# POS

This prompt combines three examples with a custom label space (POS/NEG/NEU instead of full words). Few-shot is the right tool when the label set is unusual, the format is strict, or the task is ambiguous in plain English.
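
In practice the examples usually live in a list rather than a hand-written string. The sketch below renders such a list into the same "Review:/Sentiment:" layout; `build_few_shot_prompt` is a hypothetical helper, not an SDK function.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Render (review, label) pairs, then the query, all in one consistent layout."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    # The query uses the same layout but leaves the label open for the model
    blocks.append(f"Review: {query}\nSentiment:")
    return instruction + "\n\n" + "\n\n".join(blocks)


examples = [
    ("Fast delivery, item as described.", "POS"),
    ("Arrived broken, customer service ignored my emails.", "NEG"),
    ("Box was a bit dented but the product works.", "NEU"),
]
prompt = build_few_shot_prompt(
    "Classify each review's sentiment as POS, NEG, or NEU.",
    examples,
    "The flight was on time and the seats were comfortable.",
)
print(prompt)
```

Generating every block from one template guarantees the shots and the query share delimiters and casing.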

4. The Sweet Spot: 1-5 Shots, Diminishing After 8

Empirically across many tasks (classification, extraction, formatting, simple generation), accuracy plotted against shot count looks like:

code

accuracy
   ▲
   │              ╭───────────  diminishing
   │           ╭──╯              returns
   │       ╭───╯
   │   ╭───╯
   │ ╭─╯
   │╱
   ┼────┬────┬────┬────┬────┬────▶ shots
   0    1    2    4    8    16

The big jump is 0 → 1 (especially for format). Going from 1 to 5 still helps, but each added shot buys less. Beyond 8-16, most tasks plateau, and smaller models can even regress due to context dilution. Google DeepMind's "Many-Shot In-Context Learning" paper (Agarwal et al., 2024) showed that with very long contexts and hundreds of examples, a second growth regime emerges. That is useful, but specific to long-context models and to tasks where you have hundreds of high-quality examples available.
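
Rather than trusting a generic curve, you can measure it on your own task. The sketch below sweeps shot counts against a small labeled evaluation set. `accuracy_by_shot_count` and `classify` are hypothetical names, and `toy_classify` is a stand-in for a real model call so the example runs offline.

```python
def accuracy_by_shot_count(classify, shot_pool, eval_set, shot_counts=(0, 1, 2, 4, 8)):
    """Return {k: accuracy}, using the first k shots from shot_pool for each k.

    classify(shots, text) is assumed to wrap your model call and return a label.
    """
    results = {}
    for k in shot_counts:
        shots = shot_pool[:k]
        correct = sum(classify(shots, text) == label for text, label in eval_set)
        results[k] = correct / len(eval_set)
    return results


def toy_classify(shots, text):
    """Toy stand-in for a model: it can only answer with labels it has seen in the shots."""
    seen = {label for _, label in shots}
    guess = "NEG" if "broken" in text else "POS"
    return guess if guess in seen else "UNKNOWN"


shot_pool = [("Great.", "POS"), ("Broken on arrival.", "NEG")]
eval_set = [("Loved it.", "POS"), ("It was broken.", "NEG")]
curve = accuracy_by_shot_count(toy_classify, shot_pool, eval_set, shot_counts=(0, 1, 2))
print(curve)
# {0: 0.0, 1: 0.5, 2: 1.0}
```

With a real model behind `classify`, the point where this curve flattens is the shot count worth paying for.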

5. The Order Matters: Recency Bias

code

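import random

# Placeholder (text, label) shots; swap in your own examples
pos1, pos2, pos3 = ("Loved it.", "POS"), ("Works great.", "POS"), ("Fast shipping.", "POS")
neg1, neg2, neg3 = ("Arrived broken.", "NEG"), ("Never again.", "NEG"), ("Total waste.", "NEG")
neu1, neu2 = ("It is okay.", "NEU"), ("Does the job.", "NEU")

# Bad: all positives first, all negatives last
shots = [pos1, pos2, pos3, neg1, neg2, neg3]   # model tends to lean NEG on the next input

# Good: balanced across labels, then shuffled with a fixed seed for reproducibility
random.seed(42)
shots = [pos1, neg1, neu1, pos2, neg2, neu2]
random.shuffle(shots)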
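
A shuffle can still leave a run of one label at the end of the prompt by chance. A deterministic alternative is to round-robin over the labels. The helper below is a sketch under that assumption; `interleave_by_label` is a hypothetical name, and shots are assumed to be (text, label) pairs.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_label(shots):
    """Round-robin over labels so no label forms a long run in the prompt."""
    by_label = defaultdict(list)
    for text, label in shots:
        by_label[label].append((text, label))
    ordered = []
    # Take one shot per label per round; shorter groups simply run out early
    for group in zip_longest(*by_label.values()):
        ordered.extend(shot for shot in group if shot is not None)
    return ordered


shots = [("a", "POS"), ("b", "POS"), ("c", "NEG"), ("d", "NEG"), ("e", "NEU")]
ordered = interleave_by_label(shots)
print([label for _, label in ordered])
# ['POS', 'NEG', 'NEU', 'POS', 'NEG']
```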

6. Shot Diversity Beats Shot Quantity

Three carefully chosen, diverse examples almost always outperform six near-duplicate ones. Diversity dimensions to cover:

Label diversity — every class in your label space should appear at least once.
Length diversity — mix short and long inputs.
Edge-case diversity — include the sarcastic, the ambiguous, the multi-clause cases that the model would otherwise default-classify wrong.
Distractor diversity — for extraction tasks, include examples where the obvious-looking candidate is the wrong answer.
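
The first two dimensions are easy to check mechanically before a shot set ships. The sketch below reports missing labels and whether both short and long inputs are present. `diversity_report` and the 8-word threshold are illustrative assumptions; edge-case and distractor coverage still need a human review.

```python
def diversity_report(shots, label_space, short_max_words=8):
    """Check a shot set for label coverage and a mix of short and long inputs."""
    labels_seen = {label for _, label in shots}
    lengths = [len(text.split()) for text, _ in shots]
    return {
        "missing_labels": sorted(set(label_space) - labels_seen),
        "has_short": any(n <= short_max_words for n in lengths),
        "has_long": any(n > short_max_words for n in lengths),
    }


shots = [
    ("Fast delivery, item as described.", "POS"),
    ("Arrived broken, customer service ignored my emails and the refund "
     "took three weeks to show up.", "NEG"),
]
report = diversity_report(shots, label_space=["POS", "NEG", "NEU"])
print(report)
# {'missing_labels': ['NEU'], 'has_short': True, 'has_long': True}
```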

7. Format the Examples Like the Output

Few-shot is teaching format as much as semantics. Your examples must be in exactly the format you want the model to produce — same delimiters, same field names, same casing, same trailing characters.

code

# Anthropic / messages — multi-turn few-shot is preferred over flat text
import anthropic
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    system="Extract the company name and amount from each invoice line.",
    messages=[
        {"role": "user",      "content": "Acme Corp $1,234.50"},
        {"role": "assistant", "content": '{"company": "Acme Corp", "amount": 1234.50}'},
        {"role": "user",      "content": "BetaWorks Ltd. $99.99"},
        {"role": "assistant", "content": '{"company": "BetaWorks Ltd.", "amount": 99.99}'},
        {"role": "user",      "content": "Gamma Industries $500"},
    ],
    max_tokens=128,
)
print(resp.content[0].text)
# {"company": "Gamma Industries", "amount": 500}

Embedding few-shot examples as alternating user / assistant turns is often more reliable than packing them all into a single user message, because each example output sits exactly where the model's own reply will go. Both the Anthropic and OpenAI chat APIs accept this pattern; flat-text shots also work, but the multi-turn form generalizes better to multi-step agents.
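
Because format drift between examples is easy to miss by eye, a quick consistency check on the example outputs can help. This is a minimal sketch for JSON-object outputs; `check_example_format` is a hypothetical helper, and it only compares key names and key order, not value types.

```python
import json


def check_example_format(assistant_outputs):
    """Return True if every example parses as a JSON object with the same keys in the same order."""
    key_orders = set()
    for raw in assistant_outputs:
        parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed examples
        if not isinstance(parsed, dict):
            return False
        key_orders.add(tuple(parsed.keys()))
    return len(key_orders) == 1


good = [
    '{"company": "Acme Corp", "amount": 1234.50}',
    '{"company": "BetaWorks Ltd.", "amount": 99.99}',
]
bad = [good[0], '{"amount": 99.99, "name": "BetaWorks Ltd."}']

print(check_example_format(good))  # True
print(check_example_format(bad))   # False
```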

8. The "ICL ≈ Gradient Descent" Intuition

An influential 2022-2023 line of research (Akyürek, Dai, von Oswald and others) showed that in simplified settings, such as linear regression with simplified attention layers, in-context learning can be mechanistically equivalent to running a few gradient-descent steps inside the forward pass: the attention layers implement a kind of meta-learner. You don't need this to be literally true of production LLMs to use the intuition: each shot is approximately one step of "training" the model on your task.

Practitioner consequence: as a loose analogy, the marginal value of shot N+1 behaves like the marginal value of one more fine-tuning step, large at first and shrinking quickly. After ~5 shots, the model has roughly internalized your task; further shots clarify edge cases rather than redefining the task.
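
The toy case behind this intuition is small enough to check numerically. The sketch below is a simplified illustration in the spirit of that research, not a reproduction of any paper. It assumes linear regression, a zero initial weight vector, and unnormalized linear attention with the shots as keys and values. Under those assumptions, one gradient-descent step and one attention read-out give the same prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3                      # n in-context examples, d input features
X = rng.normal(size=(n, d))      # example inputs x_i
y = X @ rng.normal(size=d)       # example targets y_i from a hidden linear rule
x_query = rng.normal(size=d)     # the input we want a prediction for
lr = 0.1                         # learning rate of the single step

# One gradient-descent step on L(w) = 1/(2n) * sum_i (w.x_i - y_i)^2, starting from w = 0
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y) / n
w1 = w0 - lr * grad
pred_gd = w1 @ x_query

# Unnormalized linear attention: query x_query, keys x_i, values y_i
pred_attn = (lr / n) * np.sum(y * (X @ x_query))

print(np.isclose(pred_gd, pred_attn))
```

Each shot contributes one term to the sum, which is why adding a shot looks like adding a little more training signal in this toy setting.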

9. When Zero-Shot Beats Few-Shot

In 2026, with frontier instruction-tuned models, zero-shot often wins. Specifically when:

The task is well-known and well-named (sentiment classification, summarization, translation, grammar-fix).
Your few-shot examples are not great (low quality, label imbalance, narrow distribution).
The few-shot examples bias the model toward a wrong heuristic that the zero-shot prompt would avoid.
You're paying per token and the few-shot examples are eating most of your budget.
You're chaining many calls in an agent and few-shot bloat compounds.

The empirical rule that holds in 2026: start with a careful zero-shot prompt; add few-shot only when a measurable evaluation shows it wins. Don't add shots out of superstition.
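
The token-budget point is simple arithmetic, sketched below. All numbers are made-up placeholders (token counts, call volume, and price), and `few_shot_overhead` is a hypothetical helper; plug in your own measurements and your provider's current pricing.

```python
def few_shot_overhead(shot_tokens, base_tokens, calls, price_per_million_input_tokens):
    """Return (share of input tokens spent on shots, extra input cost across all calls)."""
    share = shot_tokens / (base_tokens + shot_tokens)
    extra_cost = shot_tokens * calls * price_per_million_input_tokens / 1_000_000
    return share, extra_cost


# Placeholder numbers: 600 tokens of shots on top of a 200-token prompt, one million calls
share, extra_cost = few_shot_overhead(
    shot_tokens=600,
    base_tokens=200,
    calls=1_000_000,
    price_per_million_input_tokens=2.0,
)
print(share, extra_cost)
# 0.75 1200.0
```

If the shots take three quarters of the input and the evaluation shows no accuracy gain, that spend buys nothing.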

10. Lost-in-the-Middle

Long few-shot prompts run into Liu et al. 2023's "Lost in the Middle" finding: facts placed in the middle of a long context are recalled noticeably worse than facts at the start or end. For few-shot prompting specifically:

The first example anchors the format.
The last example biases the next output (recency).
Examples in the middle are weighted least.

If you have 5 examples and one of them is "the tricky edge case", put it at the start or the end, never in the middle.
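
That placement rule can be applied automatically when shots are assembled. In the sketch below, `place_critical_at_edges` and the `is_critical` predicate are hypothetical names. Critical shots alternate between the front and the back of the list, and everything else fills the middle.

```python
def place_critical_at_edges(shots, is_critical):
    """Reorder shots so critical ones sit at the start and end, the rest in the middle."""
    critical = [s for s in shots if is_critical(s)]
    regular = [s for s in shots if not is_critical(s)]
    front = critical[0::2]        # 1st, 3rd, ... critical shots open the prompt
    back = critical[1::2][::-1]   # 2nd, 4th, ... close it, with the 2nd placed last
    return front + regular + back


shots = ["e1", "e2", "TRICKY", "e3", "e4"]
ordered = place_critical_at_edges(shots, is_critical=lambda s: s == "TRICKY")
print(ordered)
# ['TRICKY', 'e1', 'e2', 'e3', 'e4']
```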

11. Dynamic Shot Picking via kNN

The advanced production technique: instead of using the same N shots for every query, retrieve the most similar examples to the current query from a large pool. Same pattern as RAG but applied to few-shot.

code

# Build a pool of labeled examples; embed them once.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
pool = [
    {"text": "Fast delivery, item as described.",                "label": "POS"},
    {"text": "Arrived broken, customer service ignored.",        "label": "NEG"},
    {"text": "Box dented but product works.",                    "label": "NEU"},
    # ... thousands more
]
pool_emb = embedder.encode([x["text"] for x in pool], normalize_embeddings=True)

def pick_shots(query, k=4):
    q = embedder.encode(query, normalize_embeddings=True)
    scores = pool_emb @ q
    return [pool[i] for i in np.argsort(scores)[-k:][::-1]]

shots = pick_shots("The flight was on time and seats comfortable.")

Empirically, dynamic kNN-shot selection often beats static random shots, with gains in the range of 3-10 points reported on classification benchmarks. It is worth the engineering when your task is hard enough that 3-10 points matter.
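
Once shots are retrieved, they still have to be turned into a prompt. The sketch below converts a list shaped like the output of `pick_shots` (dicts with "text" and "label", most similar first) into alternating chat turns. Putting the most similar shot last, right before the query, is a design assumption based on the recency effect described earlier, not a measured result. `shots_to_messages` is a hypothetical helper.

```python
def shots_to_messages(shots, query):
    """Turn retrieved shots (most similar first) into chat turns, most similar last."""
    messages = []
    for shot in reversed(shots):  # the closest example ends up right before the query
        messages.append({"role": "user", "content": shot["text"]})
        messages.append({"role": "assistant", "content": shot["label"]})
    messages.append({"role": "user", "content": query})
    return messages


retrieved = [
    {"text": "Flight left on time, comfy seats.", "label": "POS"},   # most similar
    {"text": "Box dented but product works.", "label": "NEU"},       # less similar
]
messages = shots_to_messages(retrieved, "The flight was on time and seats comfortable.")
for m in messages:
    print(m["role"], "|", m["content"])
```

The resulting list can be passed as `messages` to either chat API shown earlier, alongside a system prompt that states the task.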

12. Decision Checklist

13. The Mental Model
