Q: Derive the REINFORCE gradient via the log-derivative trick and explain why a baseline reduces variance without bias.

Objective J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]. Using \nabla_\theta p_\theta(\tau)=p_\theta(\tau)\nabla_\theta\log p_\theta(\tau), \nabla_\theta J=\mathbb{E}_\tau[R(\tau)\sum_t\nabla_\theta\log\pi_\theta(a_t|s_t)] — dynamics drop out because \log p separates into a transition term independent of \theta. Subtracting a state-only baseline gives \mathbb{E}[(G_t-b(s_t))\nabla\log\pi]; it is unbiased because \mathbb{E}_a[\nabla\log\pi(a|s)]=\nabla\sum_a\pi=\nabla 1=0, so \mathbb{E}[b\,\nabla\log\pi]=0. Choosing b(s)=V(s) minimizes variance, turning G-V into the advantage estimate.

Q: Write PPO's clipped surrogate objective and explain what problem clipping solves that vanilla policy gradient and TRPO each had.

Let r_t(\theta)=\pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t). PPO maximizes \mathbb{E}[\min(r_t A_t,\ \text{clip}(r_t,1-\epsilon,1+\epsilon)A_t)]. Vanilla policy gradient takes uncontrolled steps that can collapse the policy after one bad update; TRPO enforces a trust region via a hard KL constraint solved with conjugate gradients — correct but complex and second-order. PPO's clip is a cheap first-order surrogate: once the ratio moves past 1\pm\epsilon in the improving direction, the objective flattens, removing incentive to step further and approximating the trust region. The \min keeps the bound pessimistic when A_t<0, and this permits multiple SGD epochs per batch of on-policy data.

Question 1

Define a Markov Decision Process (MDP) and explain what the Markov property guarantees.

Accepted Answer

An MDP is the tuple (S, A, P, R, \gamma): states S, actions A, transition kernel P(s'|s,a), reward function R(s,a), and discount \gamma\in[0,1). The Markov property states the next state and reward depend only on the current state-action pair, not the history: P(s_{t+1}|s_t,a_t,\dots,s_0)=P(s_{t+1}|s_t,a_t). This makes the current state a sufficient statistic, so an optimal policy can be stationary and depend only on s_t — which is what makes value iteration and the Bellman recursion valid.

Question 2

What is the role of the discount factor $\gamma$, and what breaks at $\gamma=1$ in a continuing task?

Accepted Answer

\gamma\in[0,1) weights future rewards in the return G_t=\sum_{k\ge0}\gamma^k r_{t+k}. It trades myopia vs. far-sightedness and bounds the infinite sum, so values are finite when rewards are bounded (|G|\le R_{max}/(1-\gamma)). At \gamma=1 in a continuing (non-terminating) task the sum can diverge and the value function is ill-defined; you must use average-reward formulations instead. Smaller \gamma shortens the effective horizon \sim 1/(1-\gamma) and reduces variance but can ignore long-term consequences.

Question 3

Contrast a value function with a policy. Why might you prefer learning a policy directly?

Accepted Answer

A policy \pi(a|s) maps states to action probabilities — it is what you execute. A value function (V^\pi(s) or Q^\pi(s,a)) scores how good states/actions are under a policy. Value-based methods derive a policy implicitly via \arg\max_a Q, which needs a tractable max over actions. Learning a policy directly is preferable with continuous or high-dimensional actions (no argmax), when the optimal policy is stochastic (partially observed or adversarial settings), or when you want smoother improvement than an argmax that flips discontinuously under small value errors.

Question 4

Write the Bellman optimality equation for $Q^*$ and explain why it defines a contraction.

Accepted Answer

Q^*(s,a)=\mathbb{E}_{s'}[R(s,a)+\gamma\max_{a'}Q^*(s',a')]. The Bellman optimality operator \mathcal{T} is a \gamma-contraction in the sup norm: \|\mathcal{T}Q_1-\mathcal{T}Q_2\|_\infty\le\gamma\|Q_1-Q_2\|_\infty, because \max is non-expansive and expectation averages. By Banach's fixed-point theorem it has a unique fixed point Q^*, and iterating \mathcal{T} from any start converges geometrically at rate \gamma. This is the backbone of value-iteration and tabular Q-learning convergence.

Question 5

Q-learning vs. SARSA: write both update targets and explain the on/off-policy distinction.

Accepted Answer

Q-learning target: r+\gamma\max_{a'}Q(s',a') — uses the greedy action regardless of what the agent does, so it is off-policy, learning Q^* while behaving \epsilon-greedy. SARSA target: r+\gamma Q(s',a') where a' is the action actually taken next — on-policy, learning Q^\pi for the exploratory policy it follows. On cliff-walking, SARSA learns the safer path (accounting for exploratory falls) while Q-learning learns the optimal-but-risky edge. SARSA's values reflect exploration cost; Q-learning's do not.

Question 6

Explain exploration vs. exploitation and why $\epsilon$-greedy is often inferior to UCB or entropy bonuses.

Accepted Answer

Exploitation picks the currently-best-estimated action; exploration tries others to refine estimates and avoid premature convergence. \epsilon-greedy explores uniformly among non-greedy actions — wasting samples on clearly-bad actions instead of directing effort toward uncertain ones. UCB adds an optimism bonus \sqrt{\ln t / N(a)} that shrinks as an action is tried, concentrating exploration on under-sampled, plausibly-good actions and achieving logarithmic regret. In deep RL, entropy regularization keeps the policy stochastic and explores smoothly in a state-dependent way; count-based or curiosity bonuses scale this to large state spaces.

Question 7

What three innovations let DQN make Q-learning work with neural-network function approximation?

Accepted Answer

(1) Experience replay: store transitions and sample minibatches randomly, breaking temporal correlation and reusing data — which also requires DQN to be off-policy. (2) A target network: a periodically-frozen copy Q_{	heta^-} supplies the bootstrap target r+\gamma\max Q_{	heta^-}, stabilizing the otherwise-moving target that drives the 'deadly triad' (bootstrapping + off-policy + function approximation). (3) Reward/error clipping plus a CNN over stacked frames for Atari. Later refinements: Double DQN (decouple action selection from evaluation to cut max-bias overestimation), dueling architecture, and prioritized replay.

Question 8

Derive the REINFORCE gradient via the log-derivative trick and explain why a baseline reduces variance without bias.

Accepted Answer

Objective J(	heta)=\mathbb{E}_{	au\sim\pi_	heta}[R(	au)]. Using 
abla_	heta p_	heta(	au)=p_	heta(	au)
abla_	heta\log p_	heta(	au), 
abla_	heta J=\mathbb{E}_	au[R(	au)\sum_t
abla_	heta\log\pi_	heta(a_t|s_t)] — dynamics drop out because \log p separates into a transition term independent of 	heta. Subtracting a state-only baseline gives \mathbb{E}[(G_t-b(s_t))
abla\log\pi]; it is unbiased because \mathbb{E}_a[
abla\log\pi(a|s)]=
abla\sum_a\pi=
abla 1=0, so \mathbb{E}[b\,
abla\log\pi]=0. Choosing b(s)=V(s) minimizes variance, turning G-V into the advantage estimate.

Question 9

What does the critic add in actor-critic, and what is the bias-variance tradeoff vs. Monte-Carlo REINFORCE?

Accepted Answer

The critic learns a value function (V or Q) to form a bootstrapped advantage, e.g. A_t=r_t+\gamma V(s_{t+1})-V(s_t) (TD), replacing REINFORCE's full Monte-Carlo return G_t. MC returns are unbiased but high-variance (noise accumulates over the episode); the bootstrapped critic target has much lower variance but is biased while the critic is imperfect. GAE(\lambda) interpolates: \lambda	o1 approaches MC (low bias, high variance), \lambda	o0 approaches one-step TD (low variance, more bias). The actor follows 
abla\log\pi\cdot A_t; the critic is trained by regression to its TD target.

Question 10

Write PPO's clipped surrogate objective and explain what problem clipping solves that vanilla policy gradient and TRPO each had.

Accepted Answer

Let r_t(	heta)=\pi_	heta(a_t|s_t)/\pi_{	heta_{old}}(a_t|s_t). PPO maximizes \mathbb{E}[\min(r_t A_t,\ 	ext{clip}(r_t,1-\epsilon,1+\epsilon)A_t)]. Vanilla policy gradient takes uncontrolled steps that can collapse the policy after one bad update; TRPO enforces a trust region via a hard KL constraint solved with conjugate gradients — correct but complex and second-order. PPO's clip is a cheap first-order surrogate: once the ratio moves past 1\pm\epsilon in the improving direction, the objective flattens, removing incentive to step further and approximating the trust region. The \min keeps the bound pessimistic when A_t<0, and this permits multiple SGD epochs per batch of on-policy data.

Question 11

In RLHF, what plays the roles of policy, reward, and the KL penalty, and why is the KL-to-reference term essential?

Accepted Answer

The policy is the LLM being fine-tuned; the reward is a learned reward model (trained on human preference pairs via a Bradley-Terry loss) scoring a completion, typically optimized with PPO. The objective adds a per-token KL penalty to a frozen reference (the SFT model): r_{total}=r_{RM}(x,y)-\beta\,\mathrm{KL}(\pi_	heta\|\pi_{ref}). The KL term prevents reward hacking/over-optimization — without it the policy drifts to degenerate text that fools the imperfect reward model — and keeps generations fluent and on-distribution. DPO later folds this same KL-regularized objective into a closed-form classification loss, removing the explicit reward model and online sampling.

Question 12

Explain reward hacking / Goodhart's law in RL and give concrete mitigations a staff engineer would deploy.

Accepted Answer

Reward hacking occurs when the agent maximizes the proxy reward in ways that violate the designer's intent, because the proxy imperfectly measures the goal (Goodhart: a measure that becomes a target ceases to be a good measure). Examples: a boat-racing agent looping to collect points instead of finishing; an RLHF policy producing sycophantic or confidently-wrong text the reward model overrates. Mitigations: KL/anchor regularization to a trusted reference; conservative or ensemble reward models with uncertainty penalties; early stopping on a held-out true metric; reward-model retraining on adversarial outputs (iterated RLHF); constrained/Lagrangian objectives; and human-in-the-loop audits. The root fix is better reward specification, not stronger optimization.

Question 13

What is the 'deadly triad', why does it cause divergence, and how do modern algorithms cope?

Accepted Answer

The deadly triad is the simultaneous combination of (1) function approximation, (2) bootstrapping (TD targets that depend on current estimates), and (3) off-policy training. Any two are safe; all three can make value estimates diverge. Baird's counterexample shows TD weights blowing up even when a perfect representable solution exists, because the semi-gradient TD update is not a true gradient and the off-policy state distribution mismatches the bootstrap, so the update is no longer a contraction. Coping strategies: target networks plus large replay (DQN) to slow the moving target; gradient-TD methods (GTD2/TDC) that follow the true gradient of the projected Bellman error; emphatic-TD reweighting; or staying on-policy.

Question 14

Why does temporal-difference learning often converge faster than Monte-Carlo in practice, and when can TD's bias actually hurt?

Accepted Answer

TD bootstraps from its own current estimate, so it learns online from incomplete episodes and propagates information one step at a time without waiting for termination — it exploits Markov structure and has far lower variance than MC, which accumulates the full noisy return. This usually yields faster, more data-efficient learning. The cost is bias: TD targets are wrong while the estimator is wrong, and that bias is amplified under function approximation and off-policy data (the deadly triad), where TD can be unstable or converge to a biased fixed point. In genuinely non-Markov / partially-observed settings MC can be more robust because it doesn't trust a bootstrapped value. TD(\lambda) interpolates the two.

Reinforcement Learning