AIMaks

Markov Decision Processes

35 min readvideoRL Foundations
2 of 34Reinforcement Learning

Markov Decision Processes

A Markov decision process (MDP) is the mathematical formalism underneath every RL problem. Tabular Q-learning, deep RL, and RLHF all operate on MDPs (or partially-observable MDPs). This lesson is the precise definition: what an MDP is, what the Markov property buys you, and how to recognize when your problem is — or isn't — an MDP.

1. The Five-Tuple Definition

An MDP is a tuple (S, A, P, R, γ):

SymbolWhat it is
SSet of states
ASet of actions
P(s' | s, a)Transition probability — given current state and action, distribution over next states
R(s, a, s')Reward function — scalar received on transition
γ ∈ [0, 1]Discount factor — how much future reward is worth now

Plus an optional initial state distribution. Every concrete RL problem can be written in this form, even if S is huge or continuous.

2. The Markov Property

code
P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0)
In other words: the state captures all relevant information about the past. The Markov property is what makes RL tractable — we don't have to remember the entire history.

When the property doesn't hold, you have a partially observable MDP (POMDP, Section 6). The fix is to enrich the state — stack past observations, use an RNN, or learn a state representation. Atari games are a classic example: a single frame isn't Markov (you can't tell ball direction from one frame), so DQN stacks 4 frames to recover the property.

3. Episodes and Returns

An episode is a sequence of (state, action, reward) tuples ending in a terminal state. The return from time t is the discounted sum of future rewards:

code
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... = Σ γ^k · r_{t+k+1}

The agent's goal is to maximize expected return.

4. Why Discount?

  • Mathematical convenience — γ < 1 keeps G_t bounded even for infinite horizons.
  • Uncertainty about the future — γ models "we may not get there" or "the model isn't accurate that far ahead".
  • Time preference — economic discounting: "10 next year".
  • Variance reduction — in practice, lower γ reduces variance of return estimates and stabilizes learning.

Typical values:

γUse case
0.99Default for most environments
0.999Very long-horizon (continuous control, robotics)
0.9Short-horizon games, faster learning
0.0Bandit / contextual bandit (no future)

5. Policies

A policy π is a mapping from states to actions:

  • Deterministic — π(s) = a, one fixed action per state.
  • Stochastic — π(a | s) = probability of taking action a in state s.

RL is the search for an optimal policy π* that maximizes expected return from every state.

6. Value Functions

Two value functions — for a fixed policy π:

code
V^π(s)   = E_π [G_t | s_t = s]                # state value
Q^π(s,a) = E_π [G_t | s_t = s, a_t = a]       # action value

V tells you "how good is this state under π"; Q tells you "how good is this action in this state under π". They are linked:

code
V^π(s)   = Σ_a π(a | s) · Q^π(s, a)
Q^π(s,a) = Σ_{s'} P(s' | s, a) · [ R(s,a,s') + γ · V^π(s') ]

7. Optimal Value Functions and Optimal Policies

The optimal value functions are defined as the maxima over all policies:

code
V*(s)   = max_π V^π(s)
Q*(s,a) = max_π Q^π(s,a)

Once you know Q*, the optimal policy is trivial: take argmax_a Q*(s, a). Most value-based RL algorithms reduce to "estimate Q*". Lesson 4's Bellman equations describe exactly how to compute it.

8. A Concrete Mini-MDP

code
    ┌─────┐  a=right (+1)  ┌─────┐  a=right (+1)  ┌─────┐
    │  s0 │ ────────────▶ │  s1 │ ────────────▶ │  s2 │ (terminal +10)
    └─────┘                └─────┘                └─────┘
       │ a=left (0)            │ a=left (-1)
       ▼                       ▼
       (loop)              back to s0

S = {s0, s1, s2}
A = {left, right}
γ = 0.9

Q*(s2, anything) = 10 (already terminal). Q*(s1, right) = 1 + 0.9·10 = 10. Q*(s0, right) = 1 + 0.9·10 = 10. The optimal policy is "right everywhere"; total return from s0 is 1 + 0.9 + 0.81·10 ≈ 9.91. This is the kind of computation Sections 2 and 3 will automate.

9. Recognizing MDPs in the Wild

ProblemSAR
ChessBoard stateLegal moves+1 win, 0 draw, -1 loss
CartPole(cart pos, vel, angle, angular vel){left, right}+1 per step alive
RLHF (LLM)Prompt + tokens so farNext tokenReward model output at end
Robot armJoint positions, velocitiesTorques-distance to target
RecommenderUser context + past actionsItem to showClick / engagement

10. Common Modeling Mistakes

Up next · Reward Signals and Return