AIMaks

How LLMs Process Prompts

30 min read · Video · Prompting Foundations
1 of 16 · Prompt Engineering Mastery

Prompt engineering is, at its core, an exercise in steering a function whose internals you cannot see. To steer it well, you have to know the shape of the input it actually consumes — not the English you typed, but the tokens, roles, and position-encoded vectors that arrive at the model. This lesson walks the lifecycle of a prompt end-to-end: from your string, through tokenization, into the chat-template that separates system / user / assistant turns, into the attention layers that condition every output token on every input token, and out the other side as sampled logits. By the end you'll know why "strawberry" has three r's the model can't count, why temperature=0 isn't actually deterministic in production, and how the 200k / 1M token context windows of 2026 frontier models change what's possible at the input layer.

1. The Prompt Lifecycle in One Diagram

code
┌─────────────────────────────────────────────────────────────┐
│ 1. Your input (Python str / JSON messages array)            │
│    [{"role":"system",   "content":"You are concise."},      │
│     {"role":"user",     "content":"How many r's in          │
│                                    'strawberry'?"}]         │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Chat template render (model-specific)                    │
│    "<|im_start|>system\nYou are concise.<|im_end|>\n        │
│     <|im_start|>user\nHow many r's...<|im_end|>\n           │
│     <|im_start|>assistant\n"                                │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Tokenize (BPE) → list[int]                               │
│    [27, 91, 318, 5011, ..., 1495, 320, 0, 13, 27, 91, ...]  │
│    (typically ~0.75 words/token in English)                 │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Embed + position-encode → tensor[seq_len, d_model]       │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Forward pass through N transformer layers                │
│    Each output token attends to every preceding token.      │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Final hidden state → logits over vocab (~100k tokens)    │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Sampler (temperature, top_p, top_k, penalties)           │
│    → next token id                                          │
│    → loop back to step 5 (autoregressive)                   │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 8. Detokenize ids back to a string for the user             │
└─────────────────────────────────────────────────────────────┘

Every prompt-engineering technique you'll learn in this course is some kind of intervention on one of these eight stages. Few-shot prompting changes step 1. XML delimiters shape step 2. Token-budgeting fights step 3. Sampling parameters live at step 7. Knowing which stage you're steering is the difference between mystery and engineering.
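Stage 2 of the diagram can be sketched in a few lines. This is a minimal illustration, assuming the ChatML-style `<|im_start|>` / `<|im_end|>` markers shown in the diagram; `render_chatml` is a hypothetical helper, and real models each ship their own template, which the provider applies for you.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render role-tagged messages into one ChatML-style string (illustrative only)."""
    parts = []
    for msg in messages:
        # Each turn is wrapped in start/end markers and tagged with its role.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from this point.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "How many r's in 'strawberry'?"},
]
rendered = render_chatml(messages)
print(rendered)
```

The point of the sketch is that the role boundaries the model sees are just more text (later, more tokens) in a single flat sequence.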

2. Tokenization: Where the Model Actually Reads

LLMs do not read characters or words. They read tokens: sub-word units produced by a Byte Pair Encoding (BPE) tokenizer or a close relative (tiktoken's cl100k_base and o200k_base, Anthropic's Claude tokenizer, Gemini's SentencePiece). BPE starts from individual bytes and iteratively merges the most frequent adjacent pair until the vocabulary hits a target size (~100k-256k for 2026 frontier models).
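The merge loop described above can be sketched as a toy trainer. This is a simplified illustration, not how tiktoken or any production tokenizer is implemented: `train_bpe` is a hypothetical name, it works on one short string instead of a corpus, and it skips the pre-tokenization step real tokenizers apply first.

```python
from collections import Counter
from typing import List, Tuple


def train_bpe(text: str, num_merges: int) -> Tuple[List[bytes], List[Tuple[bytes, bytes]]]:
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    # Start from individual bytes, one token per byte.
    tokens: List[bytes] = [bytes([b]) for b in text.encode("utf-8")]
    merges: List[Tuple[bytes, bytes]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats, so another merge would not compress anything
        merges.append((left, right))
        # Rewrite the sequence, replacing every occurrence of the pair.
        merged: List[bytes] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = train_bpe("low lower lowest", num_merges=3)
print(tokens)   # the text as a shorter list of multi-byte pieces
print(merges)   # the learned merge rules, in the order they were learned
```

A real vocabulary is just the result of running this kind of loop for a very large number of merges over a very large corpus, which is why frequent strings end up as single tokens and rare ones are split.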

code
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to o200k_base, ~200k vocab

# Token ids depend on the encoding, so print them instead of trusting hard-coded values.
for text in ["strawberry", "antidisestablishmentarianism", "hello", " hello"]:
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]  # raw bytes behind each id
    print(f"{text!r}: {len(ids)} token(s) {ids} {pieces}")

# What to look for: the long, rare word splits into several sub-word pieces,
# and "hello" vs " hello" (leading space) map to DIFFERENT token ids.

Three practitioner consequences:

  • Tokens are not characters. The model sees "strawberry" as 2-3 chunks. When you ask "how many r's are in strawberry", the model has to reason character-level over an opaque token. This is the famous "strawberry problem" — modern frontier models mostly get it right via chain-of-thought, but the underlying reason it was ever hard is tokenization.
  • Leading spaces matter. "hello" and " hello" are different tokens. When you build prompt templates with f-strings, an accidental space at the start of an inserted variable can change the model's behavior subtly.
  • Cost and context are measured in tokens. English averages about 0.75 words per token (roughly 1.3 tokens per word); code usually costs more tokens per word because of punctuation, indentation, and long identifiers; CJK and emoji are token-heavy (sometimes 1+ tokens per character). Always measure with the actual tokenizer.
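One way to build intuition for why non-English text is token-heavy: byte-level BPE starts from UTF-8 bytes, and different scripts need different numbers of bytes per character before any merges apply. The sketch below only counts bytes; it is not a token count, since learned merges recover part of the gap, and `utf8_bytes_per_char` is a hypothetical helper name.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "hello",
    "CJK": "你好",
    "Emoji": "🍓",
}
for label, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{label:8} chars={len(text)} utf8_bytes={n_bytes} "
          f"bytes_per_char={utf8_bytes_per_char(text):.1f}")
```

ASCII letters are one byte each, common CJK characters are three, and most emoji are four, so a tokenizer has more merging to do before such text compresses to one token per character or better.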

3. The Prompt as Autoregressive Continuation

The fundamental operation of a transformer LLM is: given a sequence of tokens, predict the next one. Generation is just this operation applied in a loop, feeding each new token back into the input until a stop token (or max length) is reached.

code
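# Pseudocode for autoregressive generation. `tokenize`, `detokenize`, `model`,
# and `sample` stand in for a real tokenizer, network, and sampler.
def generate(prompt, max_length, stop_token):
    prompt_tokens = tokenize(prompt)
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        logits = model.forward(tokens)         # next-token logits, shape: [vocab_size]
        next_token = sample(logits, temp=0.7)  # one id out of the vocabulary
        if next_token == stop_token:
            break                              # stop without emitting the stop token
        tokens.append(next_token)
    # Return only the newly generated part, not the prompt.
    return detokenize(tokens[len(prompt_tokens):])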

This has a critical implication for prompting: the model is always trying to write the most likely continuation of what's already there. A prompt that ends with "Q: What's the capital of France?\nA:" works because the most likely continuation of "Q:..../A:" in the training data is the answer. Prompts that don't sit on a natural continuation pattern fight the model.
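To make the "most likely continuation" idea concrete, here is a runnable toy version of the loop. It swaps the transformer for a bigram count table (each token is predicted from the previous token only), which is a deliberate simplification; `train_bigram`, `generate`, and the tiny corpus are all made up for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(corpus: List[str]) -> Dict[str, Counter]:
    """Count which token follows which: a stand-in for a trained model."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts: Dict[str, Counter], prompt: List[str], max_new_tokens: int,
             stop_token: str, seed: int = 0) -> List[str]:
    """Sample one token at a time, feeding each back in as context."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = counts.get(tokens[-1])
        if not candidates:
            break  # the toy model has never seen this context
        choices, weights = zip(*candidates.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == stop_token:
            break
        tokens.append(next_token)
    return tokens[len(prompt):]  # only the continuation, not the prompt


corpus = ("Q: capital of France ? A: Paris <eos> "
          "Q: capital of Italy ? A: Rome <eos>").split()
model = train_bigram(corpus)
continuation = generate(model, ["Q:", "capital", "of", "France", "?", "A:"], 5, "<eos>")
print(continuation)
```

Because this toy model only looks at the last token ("A:"), it cannot tell France from Italy and may answer with either city. Attention over the whole prompt is what lets a real LLM condition the answer on the question.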

4. Attention Over the Prompt

Every transformer layer applies self-attention: for each output position, the model computes a weighted sum over every preceding input position. With N input tokens, naive attention is O(N²) in compute and memory.

code
For each output token t:
    attention_t = softmax( Q_t · K_{1..t}^T / sqrt(d_k) ) · V_{1..t}
                  ↑                ↑
                  one query        all preceding keys/values
                  vector           (the entire prompt)

Two prompt-engineering consequences:

  • Context isn't free. Doubling the prompt length quadruples attention compute. Frontier models in 2026 use various attention optimizations (flash attention, sliding windows, ring attention), and MoE routing trims the non-attention compute, but the cost scaling pressure is real and shows up in latency and pricing.
  • Position effects are real. Empirically, LLMs attend most strongly to the start and end of the context window — the famous "lost in the middle" finding from Liu et al. 2023. If you put critical instructions in the middle of a 100k-token prompt, accuracy drops noticeably. Put critical instructions at the start AND restate them at the end.
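The attention formula and its quadratic cost can be checked on a tiny example. This is a single-head, pure-Python sketch with no learned projections (the same vectors are used as queries, keys, and values), so it only illustrates the mechanics; `causal_attention` and `score_count` are hypothetical helper names.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q: List[Vector], K: List[Vector], V: List[Vector]) -> List[Vector]:
    """For each position t, attend over positions 0..t only (causal mask)."""
    d_k = len(K[0])
    outputs = []
    for t in range(len(Q)):
        scores = [sum(q * k for q, k in zip(Q[t], K[j])) / math.sqrt(d_k)
                  for j in range(t + 1)]
        weights = softmax(scores)
        out = [sum(w * V[j][c] for j, w in enumerate(weights))
               for c in range(len(V[0]))]
        outputs.append(out)
    return outputs


def score_count(n: int) -> int:
    """Number of query-key scores computed for a sequence of n tokens."""
    return n * (n + 1) // 2


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(X, X, X)
print(out)
print(score_count(1_000), score_count(2_000))  # doubling n roughly quadruples the work
```

The first position can only attend to itself, so its output equals its own value vector; every later position mixes in everything before it, which is where the N² term comes from.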

5. The Chat Template: System / User / Assistant

Modern LLMs are not raw next-token predictors at the API level; they are chat models, fine-tuned on multi-turn conversations with role-tagged messages. When you call a chat completions API, the provider renders your messages into a model-specific template. The roles are:

Role | Purpose | Practitioner notes
system | Persistent instructions, persona, rules | Higher-priority than user in most models. One per conversation.
user | The user's question / input | Can be many turns; attacker-controlled in production.
assistant | Model's prior responses | Used for multi-turn context and few-shot examples.
tool / function | Tool-call results | Required for agentic / tool-use flows.
code
# OpenAI / chat.completions
from openai import OpenAI
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a terse Python assistant."},
        {"role": "user",   "content": "Reverse a list."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)

# Anthropic / messages — system is a top-level field, not a role
import anthropic
client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-6",
    system="You are a terse Python assistant.",
    messages=[{"role": "user", "content": "Reverse a list."}],
    max_tokens=256,
)
print(resp.content[0].text)

The two SDKs disagree on whether system is a message role or a top-level field — minor friction when writing provider-agnostic code via litellm or your own wrapper. The semantics are the same: the system message is the highest-priority instruction the model sees.
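If you write your own wrapper, the role-versus-field difference reduces to one small conversion. The helper below is a sketch under the assumption that your internal format is the OpenAI-style messages list; `split_system` is a hypothetical name, not part of either SDK.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_system(messages: List[Message]) -> Tuple[str, List[Message]]:
    """Pull system messages out of an OpenAI-style list.

    Returns (system_text, remaining_messages), the shape needed by APIs
    that take the system prompt as a top-level field.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), rest


openai_style = [
    {"role": "system", "content": "You are a terse Python assistant."},
    {"role": "user", "content": "Reverse a list."},
]
system_text, chat_messages = split_system(openai_style)
print(system_text)
print(chat_messages)
```

The two return values map directly onto the `system=` and `messages=` arguments of the Anthropic call shown above.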

6. Sampling: How the Next Token Is Chosen

At each step the model emits logits, one float per token in the vocab. The sampler turns those logits into a single token. Five knobs you'll see across most APIs (not every provider exposes all five):

Parameter | What it does | Typical range
temperature | Divides logits before softmax. 0 → argmax (greedy); higher → flatter distribution, more variety. | 0.0 - 1.5
top_p (nucleus) | Sample only from the smallest set of tokens whose cumulative probability ≥ p. | 0.9 - 1.0
top_k | Sample only from the K most-likely tokens. (Less common in OpenAI; standard in HF / vLLM.) | 20 - 100
frequency_penalty | Penalize tokens proportional to how often they've already appeared. Reduces repetition. | 0.0 - 1.0
presence_penalty | Penalize tokens that have appeared at all (0/1 binary). Encourages topic switching. | 0.0 - 1.0
code
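from openai import OpenAI

client = OpenAI()
msgs = [{"role": "user", "content": "Name a color."}]  # placeholder conversation

# Greedy decoding: always pick the argmax
greedy = client.chat.completions.create(model="gpt-4o", messages=msgs, temperature=0)

# Sampled decoding: pick from a flatter distribution
sampled = client.chat.completions.create(model="gpt-4o", messages=msgs,
                                         temperature=0.8, top_p=0.95)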

Practitioner defaults: temperature=0 for classification / extraction / RAG / code generation, temperature=0.7-1.0 for creative writing or brainstorming. top_p=0.95 is a fine default almost everywhere; only set top_k if your framework doesn't expose top_p.
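To see how temperature, top_k, and top_p interact, here is a small reference sampler over a toy vocabulary. It is a sketch of the textbook definitions in the table above, not any provider's actual implementation (ordering of the filters and tie-breaking vary between frameworks); `sample_token` is a hypothetical name.

```python
import math
import random
from typing import Dict, Optional


def sample_token(logits: Dict[str, float], temperature: float = 1.0,
                 top_k: Optional[int] = None, top_p: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Pick one token from `logits`. top_k, if given, must be >= 1."""
    rng = rng or random.Random()
    if temperature <= 0:
        return max(logits, key=logits.get)  # greedy: plain argmax

    # Temperature: divide logits, then softmax (max subtracted for stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    # top_k: keep only the K most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, probs = zip(*kept)
    return rng.choices(tokens, weights=probs, k=1)[0]  # weights are renormalized


logits = {"a": 5.0, "b": 1.0, "c": 0.0}
print(sample_token(logits, temperature=0))
print(sample_token(logits, temperature=1.0, top_p=0.95, rng=random.Random(0)))
```

With these logits, token "a" already holds well over 95% of the probability mass at temperature 1.0, so a tight top_p collapses sampling to near-greedy behavior. Raising the temperature flattens the distribution and lets "b" and "c" back in.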

7. Why temperature=0 Is Not Deterministic in Production

8. Log-Probabilities: Looking Inside the Sampler

Some chat APIs (OpenAI's chat completions among them) let you request the log-probability of each emitted token and of the top alternatives. This is underused. Log-probs are how you build classifiers without prompting for a class name, score the model's confidence, detect hallucinations early, or implement constrained decoding.

code
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Sentiment of: I love this. Reply with only 'pos' or 'neg'."}],
    temperature=0,
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob)
    for alt in tok.top_logprobs:
        # exp(logprob) turns a natural-log probability back into a probability
        print(f"  alt: {alt.token!r:>8}  logp={alt.logprob:.3f}  p={math.exp(alt.logprob):.3f}")
# Illustrative output (values vary by model and run; remaining alternatives omitted):
# pos -0.001
#   alt: 'pos'   logp=-0.001  p=0.999
#   alt: 'neg'   logp=-7.214  p=0.001
#   alt: 'neu'   logp=-9.832  p=0.000

The model's confidence in "pos" here is 99.9%. If you saw p=0.55 for "pos" vs p=0.42 for "neg" you'd know the model was guessing. That's a real signal you can route on in production.
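Routing on that signal can be sketched as a small pure function. It assumes you have already collected the top alternatives into a `{token: logprob}` dict; the function name, the 0.9 threshold, and the renormalization over allowed labels are illustrative choices, not part of any API.

```python
import math
from typing import Dict, Tuple


def classify_with_confidence(logprobs: Dict[str, float], labels: Tuple[str, ...],
                             threshold: float = 0.9) -> Tuple[str, float, bool]:
    """Return (best_label, confidence, is_confident) from token log-probs."""
    probs = {lab: math.exp(logprobs[lab]) for lab in labels if lab in logprobs}
    if not probs:
        raise ValueError("none of the allowed labels appear in the log-probs")
    total = sum(probs.values())
    best = max(probs, key=probs.get)
    confidence = probs[best] / total  # renormalized over the allowed labels only
    return best, confidence, confidence >= threshold


# The two cases discussed above: a confident answer and a near coin flip.
confident = classify_with_confidence({"pos": -0.001, "neg": -7.214, "neu": -9.832},
                                     labels=("pos", "neg"))
guessing = classify_with_confidence({"pos": math.log(0.55), "neg": math.log(0.42)},
                                    labels=("pos", "neg"))
print(confident)
print(guessing)
```

In production, the `is_confident` flag is what you would branch on, for example sending low-confidence items to a stronger model or a human reviewer.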

9. Context Windows in 2026

Model | Context window | Output limit | Notes
GPT-4.1 | 1,000,000 tokens | 32k | 1M as of mid-2025; usable but lost-in-the-middle is real
GPT-4o | 128,000 tokens | 16k | Workhorse default
Claude Sonnet 4.6 | 200,000 tokens | 64k | Strong long-context recall
Claude Opus 4.7 | 1,000,000 tokens | 64k | 1M tier; flagship reasoning
Gemini 2.5 Pro | 2,000,000 tokens | 64k | Largest commercial window

A few practitioner numbers to memorize: a typical novel is ~100k tokens, a long technical paper is ~10-30k, and this whole course is roughly 200k tokens of HTML. "Context window = put the whole codebase in" is mostly real in 2026, but with three caveats:

  1. Cost scales linearly with input tokens. Prompt caching (Anthropic, OpenAI, Gemini all support it) cuts repeated prefix costs by 50-90%.
  2. Latency scales worse than linearly. A 500k-token prompt can take 30-90 seconds for the first token.
  3. Recall isn't perfect at the tail. Even "needle in a haystack" benchmarks show 95-99% on most models — meaning 1-5% of facts in a long doc go unretrieved. RAG often beats stuffing the whole corpus into context.
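The first caveat is easy to put numbers on. The sketch below uses a made-up price of $3.00 per million input tokens and a 90% discount on the cached prefix (the top of the 50-90% range quoted above); both figures and the function name are assumptions for illustration, and it ignores any surcharge a provider may apply when the cache is first written.

```python
def input_cost_usd(prefix_tokens: int, fresh_tokens: int,
                   price_per_mtok: float, cached_discount: float = 0.0) -> float:
    """Input cost of one request: a reusable prefix plus per-request fresh tokens.

    cached_discount is the fraction saved on the prefix (0.0 = no caching).
    """
    prefix_cost = prefix_tokens * price_per_mtok * (1.0 - cached_discount)
    fresh_cost = fresh_tokens * price_per_mtok
    return (prefix_cost + fresh_cost) / 1_000_000


PRICE = 3.00  # hypothetical USD per million input tokens

uncached = input_cost_usd(100_000, 500, PRICE)
cached = input_cost_usd(100_000, 500, PRICE, cached_discount=0.9)
print(f"uncached: ${uncached:.4f} per request")
print(f"cached:   ${cached:.4f} per request")
```

Under these assumed numbers, a 100k-token shared prefix dominates the bill, so caching it is worth far more than trimming the 500-token query.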

10. Token-Budget Arithmetic

A useful mental model for any prompt:

code
budget = context_window - max_output_tokens - safety_margin

allocate_to:
  system_prompt     :  500 -  2,000   tokens
  few_shot_examples :  500 -  4,000   tokens
  retrieved_docs    :  remaining (RAG: 60-80% of budget)
  user_query        :   50 -    500   tokens
  format_overhead   :  ~50 -   ~200   tokens (XML tags, JSON keys)
code
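import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count(text: str) -> int:
    """Number of tokens in `text` under the gpt-4o encoding."""
    return len(enc.encode(text))

# Placeholder inputs; substitute your real system prompt, query, and retrieved docs.
system = "You are a terse Python assistant."
query = "Reverse a list."
docs = ["First retrieved passage.", "Second retrieved passage."]

# Content tokens only; chat formatting adds a small per-message overhead on top.
prompt_tokens = count(system) + count(query) + sum(count(d) for d in docs)
headroom = 128_000 - prompt_tokens - 16_000   # window minus prompt minus reserved output
print(f"used {prompt_tokens}, headroom after reserving 16k for output: {headroom}")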

Always count, never estimate. The number of production bugs caused by silently truncated prompts (because the dev guessed at token counts instead of measuring) is enormous.
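One way to avoid silent truncation is to decide explicitly which documents make it into the prompt. The sketch below applies the budget formula from this section and then packs documents greedily; `doc_budget`, `pack_docs`, and the whitespace counter are hypothetical, and the counter is only a stand-in so the example runs without a tokenizer. In practice you would pass a real token counter such as the `count` function above.

```python
from typing import Callable, List


def doc_budget(context_window: int, max_output_tokens: int,
               safety_margin: int, fixed_prompt_tokens: int) -> int:
    """Tokens left for retrieved docs after output, margin, and fixed prompt parts."""
    return context_window - max_output_tokens - safety_margin - fixed_prompt_tokens


def pack_docs(docs: List[str], budget: int, count: Callable[[str], int]) -> List[str]:
    """Keep docs in order (assumed sorted best-first) while they fit the budget."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        cost = count(doc)
        if used + cost > budget:
            continue  # skip this one; a smaller doc later in the list may still fit
        kept.append(doc)
        used += cost
    return kept


def word_count(text: str) -> int:
    """Stand-in counter for the demo only; NOT a substitute for a real tokenizer."""
    return len(text.split())


docs = ["a b c", "d e f g h", "i j"]
kept = pack_docs(docs, budget=5, count=word_count)
print(kept)
print(doc_budget(128_000, 16_000, 1_000, 2_500))
```

Skipping an oversized document is a visible, loggable decision, whereas letting the provider truncate the prompt is not.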

11. The Mental Model
