AIMaks

Reward Signals and Return

In RL, the reward signal is the only thing that tells the agent what you want. Get it right and the agent learns the task; get it wrong and you ship a system that does the wrong thing very efficiently. This reading covers the precise math of return (the quantity an agent maximizes) and the practical engineering of reward functions, which is rarely as easy as it looks.

1. The Return: What the Agent Actually Maximizes

At time t, the return G_t is the sum of future rewards:

code
Episodic (finite horizon, terminates):
  G_t = r_{t+1} + r_{t+2} + ... + r_T

Discounted (most general, used everywhere):
  G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ...
      = Σ_{k=0}^∞ γ^k · r_{t+k+1}

RL algorithms estimate the expected return under a policy π. Starting from a state s, that expectation is V^π(s); starting from a state-action pair (s, a), it is Q^π(s, a). Both were defined in Lesson 2.
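The discounted sum can be computed for every time step in one backward pass, using the recursion G_t = r_{t+1} + γ·G_{t+1}. The following is a minimal sketch; the function name and the convention that `rewards[t]` stores r_{t+1} are assumptions made for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t for every t, where rewards[t] holds r_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses G_{t+1}: G_t = r_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
print(discounted_returns(rewards, gamma=1.0))  # [3.0, 2.0, 2.0]
```

With γ = 1 the result is the plain episodic sum, which is why the discounted form covers both cases.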

2. Episodic vs Continuing Tasks

| Task type | Examples | Notes |
|---|---|---|
| Episodic | Chess game, single Atari run, single LLM response | Has a terminal state; return is finite even with γ=1 |
| Continuing | Datacenter cooling, traffic-light control, lifelong agent | No terminal state; need γ < 1 or average-reward formulation |

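Most code uses the discounted formulation regardless of task type, since it is general enough to cover both. Practical note: in Gymnasium, env.step() returns separate terminated and truncated flags. Either one marks an episode boundary, but only terminated means a true terminal state was reached; truncated signals an external cutoff such as a time limit.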
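The two flags matter when building a one-step bootstrapped target. The sketch below is library-free and uses hypothetical helper names (`td_target`, `episode_over`); it assumes the common convention of dropping the bootstrap term only on true termination.

```python
def td_target(reward: float, next_value: float, terminated: bool, gamma: float) -> float:
    """One-step target r + gamma * V(s'), with V(s') treated as 0 at a terminal state."""
    # After true termination there is no future return, so the bootstrap term is dropped.
    # After truncation (e.g. a time limit) the task would have continued, so it is kept.
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap


def episode_over(terminated: bool, truncated: bool) -> bool:
    """Either flag means the environment should be reset before the next step."""
    return terminated or truncated


print(td_target(1.0, 10.0, terminated=True, gamma=0.5))   # 1.0
print(td_target(1.0, 10.0, terminated=False, gamma=0.5))  # 6.0
```

Treating a truncated episode as terminated would teach the value function that reward stops at the time limit, which is usually not what the task means.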

3. The Discount Factor: Tuning γ

γ controls the agent's planning horizon. Effective horizon ≈ 1 / (1 - γ):

| γ | Effective horizon | Behavior |
|---|---|---|
| 0.0 | 1 step | Bandit: only immediate reward matters |
| 0.9 | ~10 steps | Short-horizon, fast convergence |
| 0.99 | ~100 steps | Standard for most environments |
| 0.999 | ~1000 steps | Long-horizon control, robotics |

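When in doubt, start at 0.99 and tune. A higher γ extends the planning horizon but also raises the variance of return estimates, which slows convergence. Some modern algorithms (TD-MPC, MuZero) plan over a short horizon with a learned model and rely on a learned value function to account for rewards beyond it.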
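The rule of thumb 1 / (1 - γ) is easy to check numerically. The sketch below (function names are illustrative) also prints the discount weight γ^H applied to a reward exactly one effective horizon away; for γ close to 1 that weight approaches 1/e, roughly 0.37.

```python
def effective_horizon(gamma: float) -> float:
    """Approximate planning horizon 1 / (1 - gamma), valid for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


def weight_at_horizon(gamma: float) -> float:
    """Discount weight gamma**H for a reward H = effective_horizon(gamma) steps ahead."""
    return gamma ** effective_horizon(gamma)


for gamma in (0.0, 0.9, 0.99, 0.999):
    horizon = effective_horizon(gamma)
    weight = weight_at_horizon(gamma)
    print(f"gamma={gamma}: horizon ~{horizon:.0f} steps, weight at horizon {weight:.3f}")
```

Rewards beyond the effective horizon still count, but each one contributes less than about a third of its face value.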

4. Sparse vs Dense Rewards

5. The Reward Hacking Problem

An RL agent maximizes literal reward, not your intentions. Famous reward-hacking failures:

  • CoastRunners boat (OpenAI, 2016) — the reward function gave points for hitting power-ups; the agent learned to drive in circles collecting them rather than finishing the race.
  • Surgical agent — a sim-only agent given reward proportional to "patient stable" learned to crash its own measurement device, since measurement-failure-by-default counted as "stable".
  • Genetic robot — given reward for "moving fast forward", evolved into a tower that fell over and counted the fall as forward motion.
  • RLHF reward hacking — an LLM reward-model that liked confident-sounding answers led to fluent but incorrect responses ("hallucinations" exacerbated by RL).
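A toy calculation shows why looping can be the literal optimum. All numbers here are invented for illustration (a +20 finish bonus after ten steps, a respawning +1 power-up, a 1000-step cap); they are not taken from CoastRunners or any of the cases above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a reward sequence starting at k = 0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


GAMMA = 0.99
HORIZON = 1000  # hypothetical episode cap

# Intended behavior: nine unrewarded steps, then a one-off finish bonus.
finish_rewards = [0.0] * 9 + [20.0]
# Exploit: circle a respawning power-up worth +1 on every step.
loop_rewards = [1.0] * HORIZON

finish_return = discounted_return(finish_rewards, GAMMA)
loop_return = discounted_return(loop_rewards, GAMMA)
print(f"finish: {finish_return:.2f}, loop: {loop_return:.2f}")
```

Under these assumed rewards the loop earns several times the return of finishing, so an agent that prefers it is optimizing correctly; the reward function is what is wrong.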

6. Reward Shaping: Adding Hints

Shaped reward = base reward + a "hint" reward that encourages intermediate progress. To remain optimality-preserving, the shaping must satisfy the potential-based shaping condition (Ng et al., 1999):

code
r_shaped(s, a, s') = r(s, a, s') + γ·Φ(s') - Φ(s)

Here Φ is any real-valued potential function over states. Under this form the optimal policy is unchanged, and a well-chosen Φ can make learning converge faster. Shaping that is not potential-based risks changing the optimal policy entirely (the boat-race example).
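The reason the optimal policy survives is that the shaping terms telescope: along any path, their discounted sum is γ^T·Φ(s_T) - Φ(s_0), which does not depend on the route taken. The sketch below checks this on a hypothetical 1-D corridor with Φ(s) set to the negative distance to the goal; the environment, the -1 step reward, and all names are assumptions.

```python
GOAL = 3  # hypothetical goal cell in a 1-D corridor


def potential(state: int) -> float:
    """Phi(s): negative distance to the goal, so Phi(GOAL) == 0."""
    return -float(abs(GOAL - state))


def shaped_reward(r: float, s: int, s_next: int, gamma: float) -> float:
    """r_shaped(s, a, s') = r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)


def returns_along(path, gamma, step_reward=-1.0):
    """Base and shaped discounted returns for a list of visited states."""
    base = shaped = 0.0
    for k in range(len(path) - 1):
        s, s_next = path[k], path[k + 1]
        base += gamma ** k * step_reward
        shaped += gamma ** k * shaped_reward(step_reward, s, s_next, gamma)
    return base, shaped


direct = [0, 1, 2, 3]
detour = [0, 1, 0, 1, 2, 3]
for path in (direct, detour):
    base, shaped = returns_along(path, gamma=0.9)
    print(path, "shaping bonus:", round(shaped - base, 6))
```

Both paths receive the same total bonus, -Φ(0) = 3, so shaping shifts every return from that start state by a constant and leaves the ranking of paths intact.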

7. Reward Engineering: Practical Patterns

| Pattern | What it does | When to use |
|---|---|---|
| Negative per-step cost | "Get done quickly" | Reach-the-goal tasks |
| Distance-to-goal shaping | "Get closer" | Continuous control, navigation |
| Subgoal rewards | "You're partway there" | Long-horizon tasks; risk of hacking |
| Reward clipping | Cap reward to ±1 | Atari (DeepMind paper); stabilizes optimization |
| Reward normalization | Running mean/std | Continuous control; PPO standard |
| Curiosity / intrinsic motivation | Bonus for novel states | Sparse-reward exploration |
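Two of these patterns fit in a few lines. The sketch below shows clipping to ±1 and a running mean/std normalizer based on Welford's online algorithm. The class and function names are illustrative, and library implementations differ in detail (some scale by the standard deviation of returns without subtracting a mean), so treat this as one possible variant.

```python
import math


def clip_reward(r: float, limit: float = 1.0) -> float:
    """Cap the reward to the interval [-limit, +limit]."""
    return max(-limit, min(limit, r))


class RunningRewardNormalizer:
    """Tracks a running mean and population std of rewards (Welford's algorithm)."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, r: float) -> None:
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    def std(self) -> float:
        # Fall back to 1.0 until there are at least two samples.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 1.0

    def normalize(self, r: float) -> float:
        return (r - self.mean) / (self.std() + self.eps)


normalizer = RunningRewardNormalizer()
for r in [1.0, 2.0, 3.0, 4.0, 5.0]:
    normalizer.update(r)
print(normalizer.mean, normalizer.std(), normalizer.normalize(5.0))
print(clip_reward(7.5), clip_reward(-3.0), clip_reward(0.25))
```

Clipping discards reward magnitude, while normalization keeps relative scale; which one is appropriate depends on whether the size of a reward carries information the agent needs.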

8. Reward in RLHF and Modern LLM Training

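Preference-based LLM training (RLHF, RLAIF, and relatives such as DPO) optimizes a reward learned from preference data. The classic RLHF pipeline has three steps:

  1. Humans compare or rate model responses for helpfulness, harmlessness, etc.
  2. A reward model is trained on those preferences.
  3. The base LLM is fine-tuned with PPO to maximize the reward model's score. (DPO skips the explicit reward model and optimizes the policy directly on the preference data.)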

The classic problems return: a reward model that prefers verbose, confident-sounding answers leads to verbose, confidently wrong models. A KL penalty toward a reference model, along with other reward-shaping variants, limits how far the policy can drift in order to exploit the reward model's quirks.
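One common form of the KL penalty subtracts β times a sample estimate of the KL divergence from the reward model's score. The sketch below assumes per-token log-probabilities of the sampled response are already available from both models; the function name, the argument layout, and the sequence-level (rather than per-token) penalty are illustrative choices, not a specific library's API.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float,
) -> float:
    """Reward-model score minus beta times a sample estimate of KL(policy || reference)."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    # Summing log pi(token) - log pi_ref(token) over the sampled tokens gives a
    # single-sample estimate of the sequence-level KL divergence.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# The policy assigns higher probability to its own sample than the reference does,
# so part of the reward-model score is taken back.
print(kl_penalized_reward(2.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.5))  # 1.5
```

A larger β keeps the policy closer to the reference model at the cost of slower movement toward high reward-model scores.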

9. Multi-Objective Rewards

Real systems usually balance multiple objectives:

code
r = w_1 · r_progress + w_2 · r_safety + w_3 · r_efficiency

The weights are, in effect, policy decisions about how the objectives trade off against each other. Modern alternatives include constrained MDPs (Achiam et al., CPO; Lagrangian methods), Pareto-optimal multi-objective RL, and reward-model ensembling. Pick by domain risk: a robot needs hard constraints, not soft trade-offs.
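The difference between a fixed weighted sum and a Lagrangian method can be sketched briefly. In the first function the weights are hand-set; in the second, the penalty weight on a safety cost is adjusted by dual ascent until the average cost meets a limit. This is a generic Lagrangian update under assumed names and numbers, not an implementation of CPO.

```python
from typing import Dict


def scalarized_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Fixed trade-off: r = sum_i w_i * r_i."""
    return sum(weights[name] * components[name] for name in weights)


def update_lagrange_multiplier(lmbda: float, avg_cost: float, cost_limit: float, lr: float) -> float:
    """Dual ascent step: raise the penalty when the cost limit is violated,
    lower it otherwise, and never let it go below zero."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


weights = {"progress": 1.0, "safety": 0.5, "efficiency": 2.0}
components = {"progress": 1.0, "safety": -2.0, "efficiency": 0.5}
print(scalarized_reward(components, weights))  # 1.0

lmbda = 0.0
for avg_cost in (0.5, 0.5, 0.0):  # hypothetical per-iteration average safety costs
    lmbda = update_lagrange_multiplier(lmbda, avg_cost, cost_limit=0.25, lr=1.0)
    print(lmbda)
```

With the multiplier approach, the designer specifies the acceptable cost level and the trade-off weight is learned, rather than guessing a weight and hoping the resulting cost is acceptable.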

10. The Reward Engineering Discipline
