Markov Decision Processes

35 min readvideoRL Foundations

2 of 34Reinforcement Learning

Markov Decision Processes

A Markov decision process (MDP) is the mathematical formalism underneath every RL problem. Tabular Q-learning, deep RL, and RLHF all operate on MDPs (or partially-observable MDPs). This lesson is the precise definition: what an MDP is, what the Markov property buys you, and how to recognize when your problem is — or isn't — an MDP.

1. The Five-Tuple Definition

An MDP is a tuple (S, A, P, R, γ):

Symbol	What it is
S	Set of states
A	Set of actions
P(s' \| s, a)	Transition probability — given current state and action, distribution over next states
R(s, a, s')	Reward function — scalar received on transition
γ ∈ [0, 1]	Discount factor — how much future reward is worth now

Plus an optional initial state distribution. Every concrete RL problem can be written in this form, even if S is huge or continuous.

2. The Markov Property

code

P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0)

In other words: the state captures all relevant information about the past. The Markov property is what makes RL tractable — we don't have to remember the entire history.

When the property doesn't hold, you have a partially observable MDP (POMDP, Section 6). The fix is to enrich the state — stack past observations, use an RNN, or learn a state representation. Atari games are a classic example: a single frame isn't Markov (you can't tell ball direction from one frame), so DQN stacks 4 frames to recover the property.

3. Episodes and Returns

An episode is a sequence of (state, action, reward) tuples ending in a terminal state. The return from time t is the discounted sum of future rewards:

code

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... = Σ γ^k · r_{t+k+1}

The agent's goal is to maximize expected return.

4. Why Discount?

Mathematical convenience — γ < 1 keeps G_t bounded even for infinite horizons.
Uncertainty about the future — γ models "we may not get there" or "the model isn't accurate that far ahead".
Time preference — economic discounting: " $10 n o w i s w or t hm or e t han$ 10 next year".
Variance reduction — in practice, lower γ reduces variance of return estimates and stabilizes learning.

Typical values:

γ	Use case
0.99	Default for most environments
0.999	Very long-horizon (continuous control, robotics)
0.9	Short-horizon games, faster learning
0.0	Bandit / contextual bandit (no future)

5. Policies

A policy π is a mapping from states to actions:

Deterministic — π(s) = a, one fixed action per state.
Stochastic — π(a | s) = probability of taking action a in state s.

RL is the search for an optimal policy π* that maximizes expected return from every state.

6. Value Functions

Two value functions — for a fixed policy π:

code

V^π(s)   = E_π [G_t | s_t = s]                # state value
Q^π(s,a) = E_π [G_t | s_t = s, a_t = a]       # action value

V tells you "how good is this state under π"; Q tells you "how good is this action in this state under π". They are linked:

code

V^π(s)   = Σ_a π(a | s) · Q^π(s, a)
Q^π(s,a) = Σ_{s'} P(s' | s, a) · [ R(s,a,s') + γ · V^π(s') ]

7. Optimal Value Functions and Optimal Policies

The optimal value functions are defined as the maxima over all policies:

code

V*(s)   = max_π V^π(s)
Q*(s,a) = max_π Q^π(s,a)

Once you know Q*, the optimal policy is trivial: take argmax_a Q*(s, a). Most value-based RL algorithms reduce to "estimate Q*". Lesson 4's Bellman equations describe exactly how to compute it.

8. A Concrete Mini-MDP

code

    ┌─────┐  a=right (+1)  ┌─────┐  a=right (+1)  ┌─────┐
    │  s0 │ ────────────▶ │  s1 │ ────────────▶ │  s2 │ (terminal +10)
    └─────┘                └─────┘                └─────┘
       │ a=left (0)            │ a=left (-1)
       ▼                       ▼
       (loop)              back to s0

S = {s0, s1, s2}
A = {left, right}
γ = 0.9

Q*(s2, anything) = 10 (already terminal). Q*(s1, right) = 1 + 0.9·10 = 10. Q*(s0, right) = 1 + 0.9·10 = 10. The optimal policy is "right everywhere"; total return from s0 is 1 + 0.9 + 0.81·10 ≈ 9.91. This is the kind of computation Sections 2 and 3 will automate.

9. Recognizing MDPs in the Wild

Problem	S	A	R
Chess	Board state	Legal moves	+1 win, 0 draw, -1 loss
CartPole	(cart pos, vel, angle, angular vel)	{left, right}	+1 per step alive
RLHF (LLM)	Prompt + tokens so far	Next token	Reward model output at end
Robot arm	Joint positions, velocities	Torques	-distance to target
Recommender	User context + past actions	Item to show	Click / engagement

10. Common Modeling Mistakes

← Previous lessonWhat is Reinforcement Learning?

Up next · Reward Signals and Return