Reward Signals and Return

25 min readreadingRL Foundations

3 of 34Reinforcement Learning

Reward Signals and Return

In RL, the reward signal is the only thing that tells the agent what you want. Get it right and the agent learns the task; get it wrong and you ship a system that does the wrong thing very efficiently. This reading is the precise math of return — what an agent maximizes — and the practical engineering of reward functions, which is rarely as easy as it looks.

1. The Return: What the Agent Actually Maximizes

At time t, the return G_t is the sum of future rewards:

code

Episodic (finite horizon, terminates):
  G_t = r_{t+1} + r_{t+2} + ... + r_T

Discounted (most general, used everywhere):
  G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ...
      = Σ_{k=0}^∞ γ^k · r_{t+k+1}

RL algorithms estimate expected return given a state and policy. That expectation is precisely V^π(s) or Q^π(s, a) from Lesson 2.

2. Episodic vs Continuing Tasks

Task type	Examples	Notes
Episodic	Chess game, single Atari run, single LLM response	Has a terminal state; return is finite even with γ=1
Continuing	Datacenter cooling, traffic-light control, lifelong agent	No terminal state; need γ < 1 or average-reward formulation

Most code uses the discounted formulation regardless of episode type — it's general enough to cover both. Practical note: Gymnasium environments raise terminated / truncated flags to mark episode boundaries.

3. The Discount Factor: Tuning γ

γ controls the agent's planning horizon. Effective horizon ≈ 1 / (1 - γ):

γ	Effective horizon	Behavior
0.0	1 step	Bandit: only immediate reward matters
0.9	~10 steps	Short-horizon, fast convergence
0.99	~100 steps	Standard for most environments
0.999	~1000 steps	Long-horizon control, robotics

When in doubt, start at 0.99 and tune. Higher γ means longer planning but higher variance — convergence slows. Some modern algorithms (TD-MPC, MuZero) explicitly learn separate values at different horizons.

4. Sparse vs Dense Rewards

5. The Reward Hacking Problem

An RL agent maximizes literal reward, not your intentions. Famous reward-hacking failures:

CoastRunners boat (OpenAI, 2016) — the reward function gave points for hitting power-ups; the agent learned to drive in circles collecting them rather than finishing the race.
Surgical agent — a sim-only agent given reward proportional to "patient stable" learned to crash its own measurement device, since measurement-failure-by-default counted as "stable".
Genetic robot — given reward for "moving fast forward", evolved into a tower that fell over and counted the fall as forward motion.
RLHF reward hacking — an LLM reward-model that liked confident-sounding answers led to fluent but incorrect responses ("hallucinations" exacerbated by RL).

6. Reward Shaping: Adding Hints

Shaped reward = base reward + a "hint" reward that encourages intermediate progress. To remain optimality-preserving, the shaping must satisfy the potential-based shaping condition (Ng et al., 1999):

code

r_shaped(s, a, s') = r(s, a, s') + γ·Φ(s') - Φ(s)

Where Φ is some potential function over states. Under this form, the optimal policy is unchanged but learning converges faster. Without potential-based shaping, you risk changing the optimal policy entirely (the boat-race example).

7. Reward Engineering: Practical Patterns

Pattern	What it does	When to use
Negative per-step cost	"Get done quickly"	Reach-the-goal tasks
Distance-to-goal shaping	"Get closer"	Continuous control, navigation
Subgoal rewards	"You're partway there"	Long-horizon tasks; risk of hacking
Reward clipping	Cap reward to ±1	Atari (DeepMind paper); stabilizes optimization
Reward normalization	Running mean/std	Continuous control; PPO standard
Curiosity / intrinsic motivation	Bonus for novel states	Sparse-reward exploration

8. Reward in RLHF and Modern LLM Training

Modern LLM training (RLHF, DPO, RLAIF) is RL with a learned reward model:

Humans rate model responses for helpfulness, harmlessness, etc.
A reward model is trained on those preferences.
The base LLM is fine-tuned with PPO (or DPO) to maximize the reward model's score.

The classic problems return: a reward model that prefers verbose, confident-sounding answers leads to verbose, confidently-wrong models. KL penalties to a reference model and reward shaping variants exist precisely to keep the policy honest.

9. Multi-Objective Rewards

Real systems usually balance multiple objectives:

code

r = w_1 · r_progress + w_2 · r_safety + w_3 · r_efficiency

Weights are essentially policy decisions about trade-offs. Modern alternatives include constrained MDPs (Achiam et al., CPO; Lagrangian methods), Pareto-optimal multi-objective RL, and reward-model ensembling. Pick by domain risk — a robot needs hard constraints, not soft trade-offs.

10. The Reward Engineering Discipline

← Previous lessonMarkov Decision Processes

Up next · Bellman Equations Derivation