How LLMs Process Prompts
Lesson 1 of 16, Prompt Engineering Mastery
Prompt engineering is, at its core, an exercise in steering a function whose internals you cannot see. To steer it well, you have to know the shape of the input it actually consumes — not the English you typed, but the tokens, roles, and position-encoded vectors that arrive at the model. This lesson walks the lifecycle of a prompt end-to-end: from your string, through tokenization, into the chat-template that separates system / user / assistant turns, into the attention layers that condition every output token on every input token, and out the other side as sampled logits. By the end you'll know why "strawberry" has three r's the model can't count, why temperature=0 isn't actually deterministic in production, and how the 200k / 1M token context windows of 2026 frontier models change what's possible at the input layer.
1. The Prompt Lifecycle in One Diagram
┌─────────────────────────────────────────────────────────────┐
│ 1. Your input (Python str / JSON messages array) │
│ [{"role":"system", "content":"You are concise."}, │
│ {"role":"user", "content":"How many r's in │
│ 'strawberry'?"}] │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Chat template render (model-specific) │
│ "<|im_start|>system\nYou are concise.<|im_end|>\n │
│ <|im_start|>user\nHow many r's...<|im_end|>\n │
│ <|im_start|>assistant\n" │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Tokenize (BPE) → list[int] │
│ [27, 91, 318, 5011, ..., 1495, 320, 0, 13, 27, 91, ...] │
│ (typically ~0.75 words/token in English, i.e. ~1.3 tokens/word) │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Embed + position-encode → tensor[seq_len, d_model] │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Forward pass through N transformer layers │
│ Each output token attends to every preceding token. │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Final hidden state → logits over vocab (~100k tokens) │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Sampler (temperature, top_p, top_k, penalties) │
│ → next token id │
│ → loop back to step 5 (autoregressive) │
└──────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ 8. Detokenize ids back to a string for the user │
└─────────────────────────────────────────────────────────────┘
Every prompt-engineering technique you'll learn in this course is some kind of intervention on one of these eight stages. Few-shot prompting changes step 1. XML delimiters shape step 2. Token-budgeting fights step 3. Sampling parameters live at step 7. Knowing which stage you're steering is the difference between mystery and engineering.
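Stage 2 is the easiest one to see in code. The sketch below renders a messages array into the ChatML-style string shown in the diagram. The function name `render_chatml` and the `add_generation_prompt` flag are illustrative assumptions, and every model family defines its own template, so read this as a picture of the idea rather than any provider's actual implementation.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role-tagged messages into a ChatML-style string (illustrative only)."""
    parts = []
    for msg in messages:
        # Each turn is wrapped in start/end markers and tagged with its role.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn so the model's continuation is the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "How many r's in 'strawberry'?"},
]
rendered = render_chatml(messages)
print(rendered)
```

The open assistant turn at the end is what makes the reply the natural continuation of the rendered string.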
2. Tokenization: Where the Model Actually Reads
LLMs do not read characters or words. They read tokens: sub-word units produced by a Byte Pair Encoding (BPE) tokenizer (or a close variant like tiktoken's cl100k_base, Anthropic's Claude tokenizer, or Gemini's SentencePiece). BPE starts from individual bytes and iteratively merges the most frequent adjacent pair until the vocabulary hits a target size (~100k-256k for 2026 frontier models).
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o") # o200k_base, ~200k vocab
enc.encode("strawberry")
# [301, 675, 1395]: three tokens, i.e. sub-word pieces rather than ten characters
enc.encode("antidisestablishmentarianism")
# [519, 11503, 553, 14133, 8344, 2191] — six tokens
enc.encode("hello")
# [24912] — single token
enc.encode(" hello")
# [22691] — leading space gives a DIFFERENT token id
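The merge procedure described at the top of this section can be sketched in a few lines. This is a toy trainer under simplifying assumptions (no pre-tokenization regex, no special tokens, ties broken by first occurrence); the name `train_bpe` is made up for illustration and this is not how tiktoken or any production tokenizer is implemented.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Toy BPE: start from bytes, repeatedly merge the most frequent adjacent pair."""
    seq = [bytes([b]) for b in text.encode("utf-8")]  # one symbol per byte
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats, so further merges would not compress anything
        merges.append((left, right))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)  # replace the pair with one new symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq


merges, tokens = train_bpe("aaabdaaabac", num_merges=1)
print(merges)
print(tokens)
```

Each merge adds one entry to the vocabulary, which is why vocabulary size is a training-time choice rather than a property of the language.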
Three practitioner consequences:
- Tokens are not characters. The model sees "strawberry" as 2-3 chunks. When you ask "how many r's are in strawberry", the model has to reason at character level over opaque tokens. This is the famous "strawberry problem": modern frontier models mostly get it right via chain-of-thought, but the underlying reason it was ever hard is tokenization.
- Leading spaces matter. "hello" and " hello" are different tokens. When you build prompt templates with f-strings, an accidental space at the start of an inserted variable can change the model's behavior subtly.
- Cost and context are measured in tokens. English runs at roughly 0.75 words per token (about 1.3 tokens per word); code usually costs more tokens than prose of the same length because of punctuation, indentation, and long identifiers; CJK and emoji are token-heavy (sometimes 1+ tokens per character). Always measure with the actual tokenizer.
3. The Prompt as Autoregressive Continuation
The fundamental operation of a transformer LLM is: given a sequence of tokens, predict the next one. Generation is just this operation applied in a loop, feeding each new token back into the input until a stop token (or max length) is reached.
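# Pseudocode for autoregressive generation
def generate(prompt, max_length, temp=0.7):
    prompt_tokens = tokenize(prompt)
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        logits = model.forward(tokens)            # next-token logits, shape: [vocab_size]
        next_token = sample(logits, temp=temp)    # one of ~200k ids
        if next_token == STOP_TOKEN:              # stop before appending the stop token
            break
        tokens.append(next_token)
    return detokenize(tokens[len(prompt_tokens):])  # only the newly generated part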
This has a critical implication for prompting: the model is always trying to write the most likely continuation of what's already there. A prompt that ends with "Q: What's the capital of France?\nA:" works because, in the training data, the most likely continuation of a "Q: ... A:" pattern is the answer. Prompts that don't sit on a natural continuation pattern fight the model.
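To watch the loop run end to end, here is a toy version with a hand-written stand-in for the model. The five-word vocabulary and the `NEXT_LOGITS` table are invented for illustration; a real model computes logits from the whole sequence, not just the last token.

```python
VOCAB = ["<stop>", "the", "cat", "sat", "down"]
STOP_TOKEN = 0

# Hypothetical scores: NEXT_LOGITS[i][j] is the logit of token j following token i.
NEXT_LOGITS = {
    1: [0.0, 0.1, 3.0, 0.2, 0.1],  # after "the", "cat" scores highest
    2: [0.1, 0.0, 0.0, 3.0, 0.2],  # after "cat", "sat" scores highest
    3: [0.5, 0.2, 0.0, 0.0, 3.0],  # after "sat", "down" scores highest
    4: [3.0, 0.1, 0.0, 0.0, 0.0],  # after "down", the stop token scores highest
}


def forward(tokens):
    """Stand-in for a model forward pass: returns next-token logits."""
    return NEXT_LOGITS[tokens[-1]]


def generate(prompt_ids, max_length=10):
    tokens = list(prompt_ids)
    while len(tokens) < max_length:
        logits = forward(tokens)
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        if next_token == STOP_TOKEN:
            break
        tokens.append(next_token)  # feed the new token back in on the next pass
    return tokens[len(prompt_ids):]


completion = generate([1])  # prompt: "the"
print(" ".join(VOCAB[i] for i in completion))
```

The prompt fixes the starting point, and everything after it is one continuation step at a time.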
4. Attention Over the Prompt
Every transformer layer applies self-attention: for each output position, the model computes a weighted sum over every preceding input position. With N input tokens, standard attention is O(N²) in compute, and O(N²) in memory too whenever the full score matrix is materialized.
For each output token t:
    attention_t = softmax( Q_t · K_{1..t}^T / sqrt(d_k) ) · V_{1..t}
where
    Q_t                 = one query vector (the current position)
    K_{1..t}, V_{1..t}  = all preceding keys/values (the entire prompt)
Two prompt-engineering consequences:
- Context isn't free. Doubling the prompt length quadruples attention compute. Frontier models in 2026 use various optimizations (flash attention, sliding windows, ring attention, plus MoE routing for the feed-forward layers), but the cost scaling pressure is real and shows up in latency and pricing.
- Position effects are real. Empirically, LLMs attend most strongly to the start and end of the context window — the famous "lost in the middle" finding from Liu et al. 2023. If you put critical instructions in the middle of a 100k-token prompt, accuracy drops noticeably. Put critical instructions at the start AND restate them at the end.
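The formula and the quadratic-cost claim above can both be checked with a small single-head sketch in plain Python. The helper names (`attend`, `causal_pairs`) and the toy vectors are assumptions for illustration; real implementations are batched, multi-head, and run on accelerators.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One query attending over all preceding keys/values (single head)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


def causal_pairs(n):
    """Query-key pairs scored when each of n tokens attends to itself and its predecessors."""
    return n * (n + 1) // 2


weights, output = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
print(weights, output)
print(causal_pairs(2000) / causal_pairs(1000))  # close to 4: double the length, ~4x the pairs
```

The weights always sum to 1, so every extra token in the prompt competes with the others for a share of the same attention budget.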
5. The Chat Template: System / User / Assistant
Modern LLMs are not raw next-token predictors at the API level: they are chat models, fine-tuned on multi-turn conversations with role-tagged messages. When you call the chat completions API, your messages are assembled into a model-specific template (on the provider's side for hosted APIs, not in the client SDK):
| Role | Purpose | Practitioner notes |
|---|---|---|
| system | Persistent instructions, persona, rules | Higher-priority than user in most models. One per conversation. |
| user | The user's question / input | Can be many turns; attacker-controlled in production. |
| assistant | Model's prior responses | Used for multi-turn context and few-shot examples. |
| tool / function | Tool-call results | Required for agentic / tool-use flows. |
# OpenAI / chat.completions
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a terse Python assistant."},
        {"role": "user", "content": "Reverse a list."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

# Anthropic / messages: system is a top-level field, not a role
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-6",
    system="You are a terse Python assistant.",
    messages=[{"role": "user", "content": "Reverse a list."}],
    max_tokens=256,
)
print(resp.content[0].text)
The two SDKs disagree on whether system is a message role or a top-level field, which is minor friction when writing provider-agnostic code via litellm or your own wrapper. The semantics are the same: the system message is the highest-priority instruction the model sees.
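If you write your own wrapper, bridging the two shapes is mostly a matter of pulling system messages out of the list. The helper below is a hypothetical sketch (the name `to_anthropic_args` is not part of either SDK), and it assumes every message has a plain string `content`.

```python
def to_anthropic_args(messages):
    """Split OpenAI-style messages into (system, messages) for an Anthropic-style call."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Join multiple system messages; return None when there is no system text,
    # in which case the caller should omit the system field entirely.
    system = "\n\n".join(system_parts) if system_parts else None
    return system, rest


openai_style = [
    {"role": "system", "content": "You are a terse Python assistant."},
    {"role": "user", "content": "Reverse a list."},
]
system, anthropic_messages = to_anthropic_args(openai_style)
print(system)
print(anthropic_messages)
```

Keeping the conversion in one small function means the rest of your code can stay in a single message format.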
6. Sampling: How the Next Token Is Chosen
At each step the model emits logits, one float per token in the vocab. The sampler turns those logits into a single token. Five knobs you'll see across most APIs:
| Parameter | What it does | Typical range |
|---|---|---|
| temperature | Divides logits before softmax. 0 → argmax (greedy); higher → flatter distribution, more variety. | 0.0 - 1.5 |
| top_p (nucleus) | Sample only from the smallest set of tokens whose cumulative probability ≥ p. | 0.9 - 1.0 |
| top_k | Sample only from the K most-likely tokens. (Less common in OpenAI; standard in HF / vLLM.) | 20 - 100 |
| frequency_penalty | Penalize tokens proportional to how often they've already appeared. Reduces repetition. | 0.0 - 1.0 |
| presence_penalty | Penalize tokens that have appeared at all (0/1 binary). Encourages topic switching. | 0.0 - 1.0 |
# Assumes `client` from the example above and `msgs`, a list of role-tagged messages.
# Greedy decoding: always pick the argmax
client.chat.completions.create(model="gpt-4o", messages=msgs, temperature=0)
# Sampled decoding: pick from a flatter distribution
client.chat.completions.create(
    model="gpt-4o", messages=msgs, temperature=0.8, top_p=0.95
)
Practitioner defaults: temperature=0 for classification / extraction / RAG / code generation, temperature=0.7-1.0 for creative writing or brainstorming. top_p=0.95 is a fine default almost everywhere; only set top_k if your framework doesn't expose top_p.
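The first three knobs in the table are simple enough to write out by hand. The sketch below is a minimal sampler in plain Python, assuming the common order of operations (temperature, then top_k, then top_p); the two penalties are left out, and real inference stacks may order or implement these steps differently.

```python
import math
import random


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id from raw logits (illustrative, not any provider's sampler)."""
    if temperature == 0:
        # Dividing by zero is undefined, so temperature 0 is treated as argmax.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature divides the logits before softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least likely, then trim the tail.
    candidates = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in candidates:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set whose mass reaches top_p
                break
        candidates = kept

    weights = [probs[i] for i in candidates]  # choices() renormalizes these
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [1.0, 5.0, 2.0]
print(sample(logits, temperature=0))                 # greedy
print(sample(logits, temperature=0.8, top_p=0.95))   # sampled from the nucleus
```

With logits this lopsided, the top token alone already covers more than 95% of the probability mass, so the nucleus collapses to a single candidate; flatter distributions are where top_p starts to matter.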
7. Why temperature=0 Is Not Deterministic in Production
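One contributor that is easy to demonstrate is floating-point arithmetic: addition is not associative, so if a serving stack sums the same numbers in a different order (for example because of a different batch size, kernel, or piece of hardware), the resulting logits can differ in the last bits, and greedy decoding can flip whenever two candidates are nearly tied. The sketch below shows the effect in plain Python. The two-token setup and the accumulation orders are invented for illustration and do not model any specific provider's stack.

```python
def accumulate(terms):
    """Sum left to right, the way a simple sequential reduction would."""
    total = 0.0
    for t in terms:
        total += t
    return total


def greedy(logits):
    """Argmax with ties going to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Token 0 has a fixed logit. Token 1's logit is the same three terms,
# accumulated in two different orders on two hypothetical runs.
run_a = [0.6, accumulate([0.1, 0.2, 0.3])]
run_b = [0.6, accumulate([0.3, 0.2, 0.1])]

print(run_a[1] == run_b[1])          # the two sums differ in the last bit
print(greedy(run_a), greedy(run_b))  # so the greedy pick differs too
```

A one-token difference early in a generation then changes every later step, because each new token is fed back in as input.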
8. Log-Probabilities: Looking Inside the Sampler
Most chat APIs let you request the log-probability of each emitted token (and the top alternatives). This is underused. Log-probs are how you build classifiers without prompting for a class name, score the model's confidence, detect hallucinations early, or implement constrained decoding.
import math

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Sentiment of: I love this. Reply with only 'pos' or 'neg'."}],
    temperature=0,
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob)
    for alt in tok.top_logprobs:
        # exp(logprob) converts a log-probability back to a probability
        print(f"  alt: {alt.token!r:>8} logp={alt.logprob:.3f} p={math.exp(alt.logprob):.3f}")
# Example output (truncated):
# pos -0.001
#   alt:    'pos' logp=-0.001 p=0.999
#   alt:    'neg' logp=-7.214 p=0.001
#   alt:    'neu' logp=-9.832 p=0.000
The model's confidence in "pos" here is 99.9%. If you saw p=0.55 for "pos" vs p=0.42 for "neg" you'd know the model was guessing. That's a real signal you can route on in production.
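Routing on that signal can be as simple as a threshold. The sketch below takes the first token's top log-probabilities as a plain dict, renormalizes over the allowed labels, and falls back to a "review" bucket when the winner is not confident enough. The function name `route`, the 0.9 threshold, and the "review" label are all assumptions for illustration; pick a threshold from your own evaluation data.

```python
import math


def route(top_logprobs, labels=("pos", "neg"), min_confidence=0.9):
    """Return (decision, confidence) from a dict of token -> log-probability."""
    # Labels missing from the top alternatives are treated as probability 0.
    probs = {
        label: math.exp(top_logprobs[label]) if label in top_logprobs else 0.0
        for label in labels
    }
    total = sum(probs.values())
    if total == 0.0:
        return "review", 0.0
    best = max(probs, key=probs.get)
    confidence = probs[best] / total  # renormalize over the allowed labels only
    decision = best if confidence >= min_confidence else "review"
    return decision, confidence


confident = route({"pos": -0.001, "neg": -7.214, "neu": -9.832})
guessing = route({"pos": math.log(0.55), "neg": math.log(0.42)})
print(confident)
print(guessing)
```

Renormalizing over the allowed labels ignores probability mass the model put on off-list tokens such as 'neu', which is usually what you want for a closed-set classifier.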
9. Context Windows in 2026
| Model | Context window | Output limit | Notes |
|---|---|---|---|
| GPT-4.1 | 1,000,000 tokens | 32k | 1M as of mid-2025; usable but lost-in-the-middle is real |
| GPT-4o | 128,000 tokens | 16k | Workhorse default |
| Claude Sonnet 4.6 | 200,000 tokens | 64k | Strong long-context recall |
| Claude Opus 4.7 | 1,000,000 tokens | 64k | 1M tier; flagship reasoning |
| Gemini 2.5 Pro | 2,000,000 tokens | 64k | Largest commercial window |
A few practitioner numbers to memorize: a typical novel is ~100k tokens, a long technical paper is ~10-30k, and this whole course is roughly 200k tokens of HTML. "Context window = put the whole codebase in" is mostly real in 2026, but with three caveats:
- Cost scales linearly with input tokens. Prompt caching (Anthropic, OpenAI, Gemini all support it) cuts repeated prefix costs by 50-90%.
- Latency scales worse than linearly. A 500k-token prompt can take 30-90 seconds for the first token.
- Recall isn't perfect at the tail. Even "needle in a haystack" benchmarks show 95-99% on most models — meaning 1-5% of facts in a long doc go unretrieved. RAG often beats stuffing the whole corpus into context.
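The first caveat is plain arithmetic, and it is worth writing down once. The sketch below computes input cost with and without a cached prefix. The price of 2.00 per million tokens is a made-up placeholder, not any provider's rate, and the single flat `cache_discount` is a simplification of how real cache pricing works.

```python
def input_cost(prompt_tokens, price_per_million, cached_tokens=0, cache_discount=0.0):
    """Input cost in the same currency unit as price_per_million.

    cached_tokens: how many prompt tokens were served from the prompt cache.
    cache_discount: fraction taken off cached tokens (0.5 to 0.9 per the text above).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    fresh = prompt_tokens - cached_tokens
    billable = fresh + cached_tokens * (1.0 - cache_discount)
    return billable * price_per_million / 1_000_000


PRICE = 2.00  # hypothetical price per million input tokens

uncached = input_cost(500_000, PRICE)
cached = input_cost(500_000, PRICE, cached_tokens=450_000, cache_discount=0.9)
print(f"uncached: {uncached:.2f}, with cached prefix: {cached:.2f}")
```

The saving depends entirely on how much of the prompt is a stable prefix, which is an argument for putting long, unchanging material first and the per-request material last.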
10. Token-Budget Arithmetic
A useful mental model for any prompt:
budget = context_window - max_output_tokens - safety_margin
allocate_to:
    system_prompt     : 500 - 2,000 tokens
    few_shot_examples : 500 - 4,000 tokens
    retrieved_docs    : remaining (RAG: 60-80% of budget)
    user_query        : 50 - 500 tokens
    format_overhead   : ~50 - 200 tokens (XML tags, JSON keys)
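import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count(text: str) -> int:
    """Token count for one string under the model's tokenizer."""
    return len(enc.encode(text))

# system, query, and docs are your own prompt pieces (str, str, list[str])
prompt_tokens = count(system) + count(query) + sum(count(d) for d in docs)
headroom = 128_000 - prompt_tokens - 16_000  # context window minus prompt minus reserved output
print(f"used {prompt_tokens}, input headroom {headroom}")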
Always count, never estimate. The number of production bugs caused by silently truncated prompts (because the dev guessed at token counts instead of measuring) is enormous.
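One way to make truncation explicit instead of silent is to pack retrieved documents against the budget yourself. The sketch below is a hypothetical helper pair (`doc_budget` and `fit_docs` are illustrative names) that takes the counting function as a parameter. The whitespace-based counter in the demo is only a stand-in so the example runs without a tokenizer installed; in real use, pass a tokenizer-backed function such as the `count` defined above.

```python
def doc_budget(context_window, max_output_tokens, safety_margin, fixed_tokens):
    """Tokens left for retrieved docs after output, margin, and fixed prompt parts."""
    return context_window - max_output_tokens - safety_margin - fixed_tokens


def fit_docs(docs, budget, count):
    """Keep docs in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for doc in docs:
        cost = count(doc)
        if used + cost > budget:
            break  # stop here so the caller knows exactly what was dropped
        kept.append(doc)
        used += cost
    return kept, used


def word_count(text):
    """Stand-in counter for the demo only; not a real token count."""
    return len(text.split())


docs = ["a b c d", "e f g h", "i j k"]
kept, used = fit_docs(docs, budget=10, count=word_count)
print(kept, used)
print(doc_budget(128_000, 16_000, 1_000, 2_500))
```

Because the function returns what it kept, you can log or alert on dropped documents rather than discovering the truncation from a bad answer.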