How LLMs Process Prompts

Prompt Engineering Mastery · Lesson 1 of 16 · Prompting Foundations · 30 min read · video

Prompt engineering is, at its core, an exercise in steering a function whose internals you cannot see. To steer it well, you have to know the shape of the input it actually consumes — not the English you typed, but the tokens, roles, and position-encoded vectors that arrive at the model. This lesson walks the lifecycle of a prompt end-to-end: from your string, through tokenization, into the chat-template that separates system / user / assistant turns, into the attention layers that condition every output token on every input token, and out the other side as sampled logits. By the end you'll know why "strawberry" has three r's the model can't count, why temperature=0 isn't actually deterministic in production, and how the 200k / 1M token context windows of 2026 frontier models change what's possible at the input layer.

1. The Prompt Lifecycle in One Diagram

code

┌─────────────────────────────────────────────────────────────┐
│ 1. Your input (Python str / JSON messages array)            │
│    [{"role":"system",   "content":"You are concise."},      │
│     {"role":"user",     "content":"How many r's in         │
│                                    'strawberry'?"}]         │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Chat template render (model-specific)                    │
│    "<|im_start|>system\nYou are concise.<|im_end|>\n      │
│     <|im_start|>user\nHow many r's...<|im_end|>\n          │
│     <|im_start|>assistant\n"                               │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Tokenize (BPE) → list[int]                               │
│    [27, 91, 318, 5011, ..., 1495, 320, 0, 13, 27, 91, ...]  │
│    (roughly 1.3 tokens per word in English)                 │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Embed + position-encode → tensor[seq_len, d_model]       │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Forward pass through N transformer layers                │
│    Each output token attends to every preceding token.      │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Final hidden state → logits over vocab (~100k tokens)    │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Sampler (temperature, top_p, top_k, penalties)           │
│    → next token id                                          │
│    → loop back to step 5 (autoregressive)                   │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 8. Detokenize ids back to a string for the user             │
└─────────────────────────────────────────────────────────────┘

Every prompt-engineering technique you'll learn in this course is some kind of intervention on one of these eight stages. Few-shot prompting changes step 1. XML delimiters shape step 2. Token-budgeting fights step 3. Sampling parameters live at step 7. Knowing which stage you're steering is the difference between mystery and engineering.

2. Tokenization: Where the Model Actually Reads

LLMs do not read characters or words. They read tokens: sub-word units produced by a Byte Pair Encoding (BPE) tokenizer or a close relative (for example tiktoken's cl100k_base and o200k_base encodings, Anthropic's Claude tokenizer, or Gemini's SentencePiece-based tokenizer). BPE starts from individual bytes and iteratively merges the most frequent adjacent pair until the vocabulary reaches a target size (~100k-256k for 2026 frontier models).
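
The merge loop described above can be sketched in a few lines. This is a toy, byte-level trainer written for illustration only; the function name `train_bpe` and the sample text are assumptions, and production tokenizers add pre-tokenization rules, special tokens, and far larger training corpora.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> tuple[list[bytes], list[tuple[bytes, bytes]]]:
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]  # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats, so no merge is worth learning
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # replace the pair with one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=3)
print(merges)  # learned merge rules, most frequent pair first
print(tokens)  # the text re-expressed with the merged tokens
```

The learned merge list is the tokenizer: encoding new text replays those merges in order, which is why frequent strings end up as single tokens and rare ones split into several pieces.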

code

import tiktoken

# gpt-4o maps to the o200k_base encoding (~200k vocab)
enc = tiktoken.encoding_for_model("gpt-4o")

# The ids in the comments below are illustrative; run the code to see real values.
ids = enc.encode("strawberry")
print(ids, [enc.decode([i]) for i in ids])
# e.g. [301, 675, 1395]: a few sub-word pieces, not ten characters

enc.encode("antidisestablishmentarianism")
# e.g. [519, 11503, 553, 14133, 8344, 2191]: six tokens

enc.encode("hello")
# e.g. [24912]: a single token

enc.encode(" hello")
# e.g. [22691]: the leading space gives a DIFFERENT token id

Three practitioner consequences:

Tokens are not characters. The model sees "strawberry" as 2-3 chunks. When you ask "how many r's are in strawberry", the model has to reason character-level over an opaque token. This is the famous "strawberry problem" — modern frontier models mostly get it right via chain-of-thought, but the underlying reason it was ever hard is tokenization.
Leading spaces matter. "hello" and " hello" are different tokens. When you build prompt templates with f-strings, an accidental space at the start of an inserted variable can change the model's behavior subtly.
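Cost and context are measured in tokens. English averages roughly 1.3 tokens per word (about 0.75 words per token); code usually costs more tokens than prose of the same length because of punctuation, indentation, and unusual identifiers; CJK text and emoji are token-heavy (sometimes more than one token per character). Always measure with the actual tokenizer.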
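
To make the leading-space point concrete, here is a minimal sketch of a template helper that trims inserted fields before formatting. The helper name `render` and the sample strings are assumptions for illustration; whether trimming is appropriate depends on the field (it would be wrong for code or preformatted text).

```python
def render(template: str, **fields: str) -> str:
    """Fill a template after trimming stray whitespace from each field."""
    cleaned = {name: value.strip() for name, value in fields.items()}
    return template.format(**cleaned)

template = "Translate to French: {text}"
raw = " hello"  # e.g. the tail of a string split on ":" that kept its space

naive = f"Translate to French: {raw}"   # the stray space leaks into the prompt
safe = render(template, text=raw)

print(repr(naive))  # 'Translate to French:  hello'
print(repr(safe))   # 'Translate to French: hello'
```

The two strings differ by a single space, yet they tokenize differently, so the model sees two different inputs.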

3. The Prompt as Autoregressive Continuation

The fundamental operation of a transformer LLM is: given a sequence of tokens, predict the next one. Generation is just this operation applied in a loop, feeding each new token back into the input until a stop token (or max length) is reached.

code

# Pseudocode for autoregressive generation
def generate(prompt, max_length):
    prompt_tokens = tokenize(prompt)
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        logits = model.forward(tokens)         # next-token logits, shape: [vocab_size]
        next_token = sample(logits, temp=0.7)  # one of ~200k ids
        if next_token == STOP_TOKEN:
            break                              # don't emit the stop token itself
        tokens.append(next_token)
    return detokenize(tokens[len(prompt_tokens):])  # only the newly generated part

This has a critical implication for prompting: the model is always trying to write the most likely continuation of what's already there. A prompt that ends with "Q: What's the capital of France?\nA:" works because the most likely continuation of "Q:..../A:" in the training data is the answer. Prompts that don't sit on a natural continuation pattern fight the model.
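
A runnable toy can make the "most likely continuation" idea tangible. The sketch below stands in a bigram count table for the model, which is a drastic simplification (a real transformer conditions on the whole prefix, not just the last token); the corpus and the `generate` helper are made up for illustration.

```python
from collections import Counter, defaultdict

corpus = "Q: capital of France ? A: Paris . Q: capital of Japan ? A: Tokyo .".split()

# Count which token follows which: a stand-in for a trained model.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(prompt: list[str], max_new_tokens: int = 5, stop: str = ".") -> list[str]:
    """Greedy autoregressive loop: append the likeliest next token until stop."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # the toy model has never seen this token
        next_token = candidates.most_common(1)[0][0]  # greedy: argmax
        tokens.append(next_token)
        if next_token == stop:
            break
    return tokens[len(prompt):]

print(generate(["Q:", "capital", "of"]))
```

Even this toy continues a "Q: ... ? A: ..." pattern because that is what its training text looks like, which is the same reason Q/A-shaped prompts work on real models.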

4. Attention Over the Prompt

Every transformer layer applies causal self-attention: for each position, the model computes a weighted sum over that position and every position before it. With N input tokens, naive attention costs O(N²) in compute and memory.

code

For each output token t:
    attention_t = softmax( Q_t · K_{1..t}^T / sqrt(d_k) ) · V_{1..t}
                  ↑                ↑
                  one query        all preceding keys/values
                  vector           (the entire prompt)
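
The formula above can be traced by hand with a single query and three keys. This is a pure-Python sketch with made-up 2-dimensional vectors, assuming one attention head and no learned projections; real implementations run batched tensor code on accelerators.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query: list[float], keys: list[list[float]], values: list[list[float]]):
    """Attention output for one query over all preceding keys/values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([1.0, 1.0], keys, values)
print([round(w, 3) for w in weights])  # the third key matches the query best
print([round(x, 3) for x in output])
```

The weights always sum to 1, so every extra token in the prompt competes with the others for a share of each query's attention.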

Two prompt-engineering consequences:

Context isn't free. Doubling the prompt length roughly quadruples attention compute. Frontier models in 2026 use various optimizations (flash attention, sliding windows, ring attention, and MoE routing for the feed-forward layers), but the cost scaling pressure is real and shows up in latency and pricing.
Position effects are real. Empirically, LLMs attend most strongly to the start and end of the context window — the famous "lost in the middle" finding from Liu et al. 2023. If you put critical instructions in the middle of a 100k-token prompt, accuracy drops noticeably. Put critical instructions at the start AND restate them at the end.
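
Both consequences can be sketched briefly. `causal_pairs` counts the query-key pairs a naive causal attention pass scores, and `sandwich` is a hypothetical helper that places instructions before a long document and restates them after it; the `<document>` tag and the reminder wording are illustrative choices, not a required format.

```python
def causal_pairs(n_tokens: int) -> int:
    """Query-key pairs scored in one naive causal attention pass over n tokens."""
    return n_tokens * (n_tokens + 1) // 2

def sandwich(instructions: str, document: str) -> str:
    """Place instructions before the document and restate them after it."""
    return (
        f"{instructions}\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Reminder of the task: {instructions}"
    )

ratio = causal_pairs(2_000) / causal_pairs(1_000)
print(round(ratio, 3))  # close to 4: doubling length roughly quadruples the work

prompt = sandwich("Summarize the contract in 3 bullets.", "...long text...")
print(prompt)
```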

5. The Chat Template: System / User / Assistant

At the API level, modern LLMs are not exposed as raw next-token predictors. They are chat models, fine-tuned on multi-turn conversations with role-tagged messages. When you call a chat completions API, the provider renders your messages into a model-specific template, built from these roles, before tokenizing them:

Role	Purpose	Practitioner notes
`system`	Persistent instructions, persona, rules	Higher-priority than user in most models. One per conversation.
`user`	The user's question / input	Can be many turns; attacker-controlled in production.
`assistant`	Model's prior responses	Used for multi-turn context and few-shot examples.
`tool` / `function`	Tool-call results	Required for agentic / tool-use flows.
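
As a rough illustration of what template rendering does, the sketch below produces the ChatML-style string shown in step 2 of the lifecycle diagram. The function `render_chatml` is hypothetical; each model family defines its own template and special tokens, and hosted APIs apply it for you.

```python
def render_chatml(messages: list[dict]) -> str:
    """Render role-tagged messages into a ChatML-style string (illustrative)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # open the turn the model will complete
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "How many r's in 'strawberry'?"},
]
rendered = render_chatml(messages)
print(rendered)
```

The rendered string ends with an open assistant turn, so "answering the user" is still next-token continuation of a carefully shaped prefix.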

code

# OpenAI / chat.completions
from openai import OpenAI
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a terse Python assistant."},
        {"role": "user",   "content": "Reverse a list."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

# Anthropic / messages: system is a top-level field, not a role
import anthropic
anthropic_client = anthropic.Anthropic()  # separate name so the OpenAI client above stays usable
resp = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    system="You are a terse Python assistant.",
    messages=[{"role": "user", "content": "Reverse a list."}],
    max_tokens=256,
)
print(resp.content[0].text)

The two SDKs disagree on whether system is a message role or a top-level field — minor friction when writing provider-agnostic code via litellm or your own wrapper. The semantics are the same: the system message is the highest-priority instruction the model sees.
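
If you write your own wrapper, the conversion between the two shapes is small. The following is a minimal sketch, assuming messages are plain dicts with string content; `split_system` is a made-up helper name, and joining multiple system messages with blank lines is one possible design choice.

```python
def split_system(messages: list[dict]) -> tuple[str, list[dict]]:
    """Separate system text from the turns, for APIs with a top-level system field."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), turns

openai_style = [
    {"role": "system", "content": "You are a terse Python assistant."},
    {"role": "user", "content": "Reverse a list."},
]
system, turns = split_system(openai_style)
print(system)
print(turns)
# These could then be passed as system=system, messages=turns.
```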

6. Sampling: How the Next Token Is Chosen

At each step the model emits logits: one float per token in the vocab. The sampler turns those logits into a single token. Five knobs you'll see across most APIs (not every provider exposes all five):

Parameter	What it does	Typical range
`temperature`	Divides logits before softmax. 0 → argmax (greedy); higher → flatter distribution, more variety.	0.0 - 1.5
`top_p` (nucleus)	Sample only from the smallest set of tokens whose cumulative probability ≥ p.	0.9 - 1.0
`top_k`	Sample only from the K most-likely tokens. (Less common in OpenAI; standard in HF / vLLM.)	20 - 100
`frequency_penalty`	Penalize tokens proportional to how often they've already appeared. Reduces repetition.	0.0 - 1.0
`presence_penalty`	Penalize tokens that have appeared at all (0/1 binary). Encourages topic switching.	0.0 - 1.0
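
The difference between the two penalties is easier to see in code. This sketch follows the additive form that OpenAI's documentation has described (subtract count × frequency_penalty, plus a flat presence_penalty once a token has appeared); other providers may implement penalties differently, and the logits here are made-up numbers.

```python
from collections import Counter

def apply_penalties(logits: dict[str, float], generated: list[str],
                    frequency_penalty: float = 0.0,
                    presence_penalty: float = 0.0) -> dict[str, float]:
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated)
    return {
        token: logit
        - counts[token] * frequency_penalty        # grows with each repeat
        - (counts[token] > 0) * presence_penalty   # flat, applied once
        for token, logit in logits.items()
    }

logits = {"the": 2.0, "cat": 1.0, "dog": 1.0}
adjusted = apply_penalties(logits, ["the", "the", "cat"],
                           frequency_penalty=0.5, presence_penalty=0.25)
print(adjusted)  # {'the': 0.75, 'cat': 0.25, 'dog': 1.0}
```

Here "the" loses the most because it appeared twice, while "dog" is untouched because it has not appeared at all.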

code

# Greedy decoding — always pick the argmax
client.chat.completions.create(model="gpt-4o", messages=msgs, temperature=0)

# Sampled decoding — pick from a flatter distribution
client.chat.completions.create(model="gpt-4o", messages=msgs,
                               temperature=0.8, top_p=0.95)

Practitioner defaults: temperature=0 for classification / extraction / RAG / code generation, temperature=0.7-1.0 for creative writing or brainstorming. top_p=0.95 is a fine default almost everywhere; only set top_k if your framework doesn't expose top_p.
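
For intuition about how these knobs interact, here is a minimal sampler sketch over a tiny made-up vocabulary. The order of operations (top_k, then temperature, then top_p) is an assumption; inference stacks differ on ordering and edge cases.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Pick a token from {token: logit} using temperature, top_k, and top_p."""
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if temperature == 0:
        return ranked[0][0]  # greedy: argmax, no randomness
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the K most likely tokens
    peak = ranked[0][1]
    weights = [math.exp((logit - peak) / temperature) for _, logit in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for (token, _), p in zip(ranked, probs):
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, kept_probs = zip(*kept)
    return rng.choices(tokens, weights=kept_probs, k=1)[0]

logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}
print(sample(logits, temperature=0))
print(sample(logits, temperature=0.8, top_p=0.95, rng=random.Random(0)))
```

With these logits, "Paris" alone already exceeds 95% probability at temperature 0.8, so the nucleus contains a single token and both calls return it. Raising the temperature flattens the distribution and lets more tokens into the nucleus.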

7. Why temperature=0 Is Not Deterministic in Production

Warning

The Footgun That Bites Everyone

"temperature=0 means deterministic" is the most repeated and most wrong claim in prompt engineering. In production, the same prompt + temperature=0 will occasionally produce different outputs. Three real causes:

Batching non-determinism. Frontier APIs batch requests on the GPU. Floating-point addition isn't associative, so the order in which partial sums are accumulated inside a kernel can change with the batch your request landed in. The argmax can flip when two top tokens have very close logits.
GPU non-determinism. Some CUDA kernels (notably reductions and atomic adds) are non-deterministic by default. PyTorch's torch.use_deterministic_algorithms(True) helps locally; on hosted APIs you have no control.
Model versioning. Providers may silently route between checkpoint versions. OpenAI exposes seed and system_fingerprint for partial determinism; Anthropic does not (as of 2026).
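
The floating-point cause is easy to reproduce on a CPU. In this sketch the three addends and the competing logit of 0.6 are contrived so that the effect shows up in one line; on real hardware the differences arise inside large matrix reductions.

```python
# Same three numbers, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6

def greedy(logit_a: float, logit_b: float) -> str:
    """Greedy decoding between two candidate tokens."""
    return "a" if logit_a > logit_b else "b"

logit_b = 0.6
logit_a_order1 = (0.1 + 0.2) + 0.3   # accumulation order in one batch
logit_a_order2 = (0.3 + 0.2) + 0.1   # accumulation order in another batch
print(greedy(logit_a_order1, logit_b), greedy(logit_a_order2, logit_b))  # a b
```

Nothing about the inputs changed between the two runs, only the order of additions, and the "deterministic" argmax picked a different token.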

In practice, expect 95-99% reproduction at temperature=0. When you need more, set seed and log system_fingerprint so you can tell when the backend changed. For test suites, use snapshot tests over multiple runs, not single-call exact-match.
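
One way to structure such a multi-run check is sketched below. `agreement_rate` and `fake_model` are hypothetical names; the fake stands in for a real API call so the example runs offline, and the 0.9 threshold is an arbitrary choice you would tune per test.

```python
from collections import Counter
from typing import Callable

def agreement_rate(call_model: Callable[[str], str], prompt: str,
                   runs: int = 10) -> tuple[str, float]:
    """Call the model several times; return the modal output and its share."""
    outputs = [call_model(prompt) for _ in range(runs)]
    winner, count = Counter(outputs).most_common(1)[0]
    return winner, count / runs

# Stand-in for a real API call: diverges on 1 call in 10.
calls = iter(["4"] * 9 + ["four"])

def fake_model(prompt: str) -> str:
    return next(calls)

answer, rate = agreement_rate(fake_model, "What is 2 + 2?", runs=10)
print(answer, rate)
# Tolerate rare divergence, but fail if the modal answer drifts or agreement drops.
assert answer == "4" and rate >= 0.9
```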

8. Log-Probabilities: Looking Inside the Sampler

Most chat APIs let you request the log-probability of each emitted token (and the top alternatives). This is underused. Log-probs are how you build classifiers without prompting for a class name, score the model's confidence, detect hallucinations early, or implement constrained decoding.

code

import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Sentiment of: I love this. Reply with only 'pos' or 'neg'."}],
    temperature=0,
    logprobs=True,
    top_logprobs=5,   # return the five most likely alternatives per position
    max_tokens=1,
)
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob)
    for alt in tok.top_logprobs:
        # exp(logprob) converts a natural-log probability back to a probability
        print(f"  alt: {alt.token!r:>8}  logp={alt.logprob:.3f}  p={math.exp(alt.logprob):.3f}")
# Illustrative output (truncated to three alternatives):
# pos -0.001
#   alt:    'pos'  logp=-0.001  p=0.999
#   alt:    'neg'  logp=-7.214  p=0.001
#   alt:    'neu'  logp=-9.832  p=0.000

The model's confidence in "pos" here is 99.9%. If you saw p=0.55 for "pos" vs p=0.42 for "neg" you'd know the model was guessing. That's a real signal you can route on in production.
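
Routing on that signal could look like the sketch below. The function name `route`, the "needs_review" label, and the 0.5 margin are assumptions chosen for illustration; the input is a plain dict of label to log-probability, which you would build from the top alternatives the API returns.

```python
import math

def route(top_logprobs: dict[str, float], min_margin: float = 0.5) -> str:
    """Return the top label, or 'needs_review' when the top two are close."""
    probs = sorted(
        ((math.exp(lp), label) for label, lp in top_logprobs.items()),
        reverse=True,
    )
    (p1, best), (p2, _) = probs[0], probs[1]  # requires at least two labels
    return best if p1 - p2 >= min_margin else "needs_review"

print(route({"pos": -0.001, "neg": -7.214}))                  # pos
print(route({"pos": math.log(0.55), "neg": math.log(0.42)}))  # needs_review
```

A margin between the top two probabilities is only one possible rule; a fixed floor on the top probability, or calibration against labeled data, are common alternatives.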

9. Context Windows in 2026

Model	Context window	Output limit	Notes
GPT-4.1	1,000,000 tokens	32k	1M as of mid-2025; usable but lost-in-the-middle is real
GPT-4o	128,000 tokens	16k	Workhorse default
Claude Sonnet 4.6	200,000 tokens	64k	Strong long-context recall
Claude Opus 4.7	1,000,000 tokens	64k	1M tier; flagship reasoning
Gemini 2.5 Pro	2,000,000 tokens	64k	Largest commercial window

A few practitioner numbers to memorize: a typical novel is ~100k tokens, a long technical paper is ~10-30k, and this whole course is roughly 200k tokens of HTML. "Context window = put the whole codebase in" is mostly real in 2026, but with three caveats:

Cost scales linearly with input tokens. Prompt caching (Anthropic, OpenAI, Gemini all support it) cuts repeated prefix costs by 50-90%.
Latency scales worse than linearly. A 500k-token prompt can take 30-90 seconds for the first token.
Recall isn't perfect at the tail. Even "needle in a haystack" benchmarks show 95-99% on most models — meaning 1-5% of facts in a long doc go unretrieved. RAG often beats stuffing the whole corpus into context.

10. Token-Budget Arithmetic

A useful mental model for any prompt:

code

budget = context_window - max_output_tokens - safety_margin

allocate_to:
  system_prompt   :  500 -  2,000   tokens
  few_shot_examples:  500 -  4,000   tokens
  retrieved_docs  :  remaining (RAG: 60-80% of budget)
  user_query      :   50 -    500   tokens
  format_overhead :  ~50 - ~200     tokens (XML tags, JSON keys)

code

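import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

def count(text: str) -> int:
    return len(enc.encode(text))

# Placeholder inputs; substitute your real prompt parts.
system = "You answer strictly from the provided documents."
query = "What is the refund policy?"
docs = ["<retrieved document 1>", "<retrieved document 2>"]

# Content tokens only: the chat template adds a few tokens per message on top.
prompt_tokens = count(system) + count(query) + sum(count(d) for d in docs)
headroom = 128_000 - prompt_tokens - 16_000   # context window minus prompt minus reserved output
print(f"used {prompt_tokens}, headroom {headroom}")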

Always count, never estimate. The number of production bugs caused by silently truncated prompts (because the dev guessed at token counts instead of measuring) is enormous.
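
One way to avoid silent truncation is to decide explicitly which documents fit. The sketch below is a minimal version under stated assumptions: `pack_docs` is a hypothetical helper, the window sizes are toy numbers, and `rough_count` is a whitespace-based placeholder used only so the example runs offline. In real code, pass the actual tokenizer's counting function instead.

```python
from typing import Callable

def pack_docs(docs: list[str], count: Callable[[str], int], budget: int) -> list[str]:
    """Keep documents in order until the next one would exceed the token budget."""
    kept, used = [], 0
    for doc in docs:
        cost = count(doc)
        if used + cost > budget:
            break  # stop explicitly instead of letting the API truncate
        kept.append(doc)
        used += cost
    return kept

def rough_count(text: str) -> int:
    # Placeholder so the sketch runs offline; swap in a real tokenizer's count.
    return len(text.split())

context_window, max_output_tokens, safety_margin = 1_000, 200, 50
fixed = rough_count("You answer from the documents only.") + rough_count("What is the refund policy?")
doc_budget = context_window - max_output_tokens - safety_margin - fixed

docs = ["word " * 300, "word " * 300, "word " * 300]
kept = pack_docs(docs, rough_count, doc_budget)
print(f"{len(kept)} of {len(docs)} documents fit in a budget of {doc_budget}")
```

Because the cut is made in your code, you can log it, rerank before packing, or fall back to summarizing the documents that did not fit.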

11. The Mental Model

Result

What You Learned

Role-tagged messages are rendered through the chat template into one string; that string becomes tokens via BPE; the transformer attends over every preceding token; the sampler turns logits into the next token id; the loop runs until a stop token or the length limit. Tokenization is the source of strawberry-class bugs and cost surprises. Sampling parameters (temperature, top_p, penalties) are the "creativity dial": temperature=0 for tasks that need repeatable output, 0.7+ for creative ones, though temperature=0 isn't truly deterministic in production. Frontier 2026 models span 128k-2M context windows with prompt caching, but lost-in-the-middle, latency, and cost remind you to use the budget deliberately. Lesson 2 covers zero/few-shot prompting; Lesson 3 walks formatting and structure; Lesson 4 quizzes everything from this section.
