Markov Decision Processes
2 of 34Reinforcement Learning
Markov Decision Processes
A Markov decision process (MDP) is the mathematical formalism underneath every RL problem. Tabular Q-learning, deep RL, and RLHF all operate on MDPs (or partially-observable MDPs). This lesson is the precise definition: what an MDP is, what the Markov property buys you, and how to recognize when your problem is — or isn't — an MDP.
1. The Five-Tuple Definition
An MDP is a tuple (S, A, P, R, γ):
| Symbol | What it is |
|---|---|
| S | Set of states |
| A | Set of actions |
| P(s' | s, a) | Transition probability — given current state and action, distribution over next states |
| R(s, a, s') | Reward function — scalar received on transition |
| γ ∈ [0, 1] | Discount factor — how much future reward is worth now |
Plus an optional initial state distribution. Every concrete RL problem can be written in this form, even if S is huge or continuous.
2. The Markov Property
P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0)When the property doesn't hold, you have a partially observable MDP (POMDP, Section 6). The fix is to enrich the state — stack past observations, use an RNN, or learn a state representation. Atari games are a classic example: a single frame isn't Markov (you can't tell ball direction from one frame), so DQN stacks 4 frames to recover the property.
3. Episodes and Returns
An episode is a sequence of (state, action, reward) tuples ending in a terminal state. The return from time t is the discounted sum of future rewards:
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... = Σ γ^k · r_{t+k+1}
The agent's goal is to maximize expected return.
4. Why Discount?
- Mathematical convenience — γ < 1 keeps G_t bounded even for infinite horizons.
- Uncertainty about the future — γ models "we may not get there" or "the model isn't accurate that far ahead".
- Time preference — economic discounting: "10 next year".
- Variance reduction — in practice, lower γ reduces variance of return estimates and stabilizes learning.
Typical values:
| γ | Use case |
|---|---|
| 0.99 | Default for most environments |
| 0.999 | Very long-horizon (continuous control, robotics) |
| 0.9 | Short-horizon games, faster learning |
| 0.0 | Bandit / contextual bandit (no future) |
5. Policies
A policy π is a mapping from states to actions:
- Deterministic — π(s) = a, one fixed action per state.
- Stochastic — π(a | s) = probability of taking action a in state s.
RL is the search for an optimal policy π* that maximizes expected return from every state.
6. Value Functions
Two value functions — for a fixed policy π:
V^π(s) = E_π [G_t | s_t = s] # state value
Q^π(s,a) = E_π [G_t | s_t = s, a_t = a] # action value
V tells you "how good is this state under π"; Q tells you "how good is this action in this state under π". They are linked:
V^π(s) = Σ_a π(a | s) · Q^π(s, a)
Q^π(s,a) = Σ_{s'} P(s' | s, a) · [ R(s,a,s') + γ · V^π(s') ]
7. Optimal Value Functions and Optimal Policies
The optimal value functions are defined as the maxima over all policies:
V*(s) = max_π V^π(s)
Q*(s,a) = max_π Q^π(s,a)
Once you know Q*, the optimal policy is trivial: take
argmax_a Q*(s, a). Most value-based RL algorithms
reduce to "estimate Q*". Lesson 4's Bellman equations describe
exactly how to compute it.
8. A Concrete Mini-MDP
┌─────┐ a=right (+1) ┌─────┐ a=right (+1) ┌─────┐
│ s0 │ ────────────▶ │ s1 │ ────────────▶ │ s2 │ (terminal +10)
└─────┘ └─────┘ └─────┘
│ a=left (0) │ a=left (-1)
▼ ▼
(loop) back to s0
S = {s0, s1, s2}
A = {left, right}
γ = 0.9
Q*(s2, anything) = 10 (already terminal). Q*(s1, right) = 1 + 0.9·10 = 10. Q*(s0, right) = 1 + 0.9·10 = 10. The optimal policy is "right everywhere"; total return from s0 is 1 + 0.9 + 0.81·10 ≈ 9.91. This is the kind of computation Sections 2 and 3 will automate.
9. Recognizing MDPs in the Wild
| Problem | S | A | R |
|---|---|---|---|
| Chess | Board state | Legal moves | +1 win, 0 draw, -1 loss |
| CartPole | (cart pos, vel, angle, angular vel) | {left, right} | +1 per step alive |
| RLHF (LLM) | Prompt + tokens so far | Next token | Reward model output at end |
| Robot arm | Joint positions, velocities | Torques | -distance to target |
| Recommender | User context + past actions | Item to show | Click / engagement |