Reinforcement Learning

Reinforcement Learning — interview questions and answers with clear explanations.

Study these interactively →
Foundationalconcept

Define a Markov Decision Process (MDP) and explain what the Markov property guarantees.

An MDP is the tuple $(S, A, P, R, \gamma)$: states $S$, actions $A$, transition kernel $P(s'|s,a)$, reward function $R(s,a)$, and discount $\gamma\in[0,1)$. The Markov property states the next state and reward depend only on the current state-action pair, not the history: $P(s_{t+1}|s_t,a_t,\dots,s_0)=P(s_{t+1}|s_t,a_t)$. This makes the current state a sufficient statistic, so an optimal policy can be stationary and depend only on $s_t$ — which is what makes value iteration and the Bellman recursion valid.
#mdp#markov-property#fundamentals
Foundationalconcept

What is the role of the discount factor $\gamma$, and what breaks at $\gamma=1$ in a continuing task?

$\gamma\in[0,1)$ weights future rewards in the return $G_t=\sum_{k\ge0}\gamma^k r_{t+k}$. It trades myopia vs. far-sightedness and bounds the infinite sum, so values are finite when rewards are bounded ($|G|\le R_{max}/(1-\gamma)$). At $\gamma=1$ in a continuing (non-terminating) task the sum can diverge and the value function is ill-defined; you must use average-reward formulations instead. Smaller $\gamma$ shortens the effective horizon $\sim 1/(1-\gamma)$ and reduces variance but can ignore long-term consequences.
#discount-factor#return#convergence
Foundationalconcept

Contrast a value function with a policy. Why might you prefer learning a policy directly?

A policy $\pi(a|s)$ maps states to action probabilities — it is what you execute. A value function ($V^\pi(s)$ or $Q^\pi(s,a)$) scores how good states/actions are under a policy. Value-based methods derive a policy implicitly via $\arg\max_a Q$, which needs a tractable max over actions. Learning a policy directly is preferable with continuous or high-dimensional actions (no argmax), when the optimal policy is stochastic (partially observed or adversarial settings), or when you want smoother improvement than an argmax that flips discontinuously under small value errors.
#value-function#policy#policy-based
Intermediatemath

Write the Bellman optimality equation for $Q^*$ and explain why it defines a contraction.

$Q^*(s,a)=\mathbb{E}_{s'}[R(s,a)+\gamma\max_{a'}Q^*(s',a')]$. The Bellman optimality operator $\mathcal{T}$ is a $\gamma$-contraction in the sup norm: $\|\mathcal{T}Q_1-\mathcal{T}Q_2\|_\infty\le\gamma\|Q_1-Q_2\|_\infty$, because $\max$ is non-expansive and expectation averages. By Banach's fixed-point theorem it has a unique fixed point $Q^*$, and iterating $\mathcal{T}$ from any start converges geometrically at rate $\gamma$. This is the backbone of value-iteration and tabular Q-learning convergence.
#bellman#contraction#value-iteration
Intermediatemath

Q-learning vs. SARSA: write both update targets and explain the on/off-policy distinction.

Q-learning target: $r+\gamma\max_{a'}Q(s',a')$ — uses the greedy action regardless of what the agent does, so it is off-policy, learning $Q^*$ while behaving $\epsilon$-greedy. SARSA target: $r+\gamma Q(s',a')$ where $a'$ is the action actually taken next — on-policy, learning $Q^\pi$ for the exploratory policy it follows. On cliff-walking, SARSA learns the safer path (accounting for exploratory falls) while Q-learning learns the optimal-but-risky edge. SARSA's values reflect exploration cost; Q-learning's do not.
#q-learning#sarsa#on-policy#off-policy
Intermediateconcept

Explain exploration vs. exploitation and why $\epsilon$-greedy is often inferior to UCB or entropy bonuses.

Exploitation picks the currently-best-estimated action; exploration tries others to refine estimates and avoid premature convergence. $\epsilon$-greedy explores uniformly among non-greedy actions — wasting samples on clearly-bad actions instead of directing effort toward uncertain ones. UCB adds an optimism bonus $\sqrt{\ln t / N(a)}$ that shrinks as an action is tried, concentrating exploration on under-sampled, plausibly-good actions and achieving logarithmic regret. In deep RL, entropy regularization keeps the policy stochastic and explores smoothly in a state-dependent way; count-based or curiosity bonuses scale this to large state spaces.
#exploration#epsilon-greedy#ucb#entropy
Advancedconcept

What three innovations let DQN make Q-learning work with neural-network function approximation?

(1) Experience replay: store transitions and sample minibatches randomly, breaking temporal correlation and reusing data — which also requires DQN to be off-policy. (2) A target network: a periodically-frozen copy $Q_{\theta^-}$ supplies the bootstrap target $r+\gamma\max Q_{\theta^-}$, stabilizing the otherwise-moving target that drives the 'deadly triad' (bootstrapping + off-policy + function approximation). (3) Reward/error clipping plus a CNN over stacked frames for Atari. Later refinements: Double DQN (decouple action selection from evaluation to cut max-bias overestimation), dueling architecture, and prioritized replay.
#dqn#experience-replay#target-network#deadly-triad
Advancedmath

Derive the REINFORCE gradient via the log-derivative trick and explain why a baseline reduces variance without bias.

Objective $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$. Using $\nabla_\theta p_\theta(\tau)=p_\theta(\tau)\nabla_\theta\log p_\theta(\tau)$, $\nabla_\theta J=\mathbb{E}_\tau[R(\tau)\sum_t\nabla_\theta\log\pi_\theta(a_t|s_t)]$ — dynamics drop out because $\log p$ separates into a transition term independent of $\theta$. Subtracting a state-only baseline gives $\mathbb{E}[(G_t-b(s_t))\nabla\log\pi]$; it is unbiased because $\mathbb{E}_a[\nabla\log\pi(a|s)]=\nabla\sum_a\pi=\nabla 1=0$, so $\mathbb{E}[b\,\nabla\log\pi]=0$. Choosing $b(s)=V(s)$ minimizes variance, turning $G-V$ into the advantage estimate.
#reinforce#policy-gradient#baseline#variance-reduction
Advancedconcept

What does the critic add in actor-critic, and what is the bias-variance tradeoff vs. Monte-Carlo REINFORCE?

The critic learns a value function ($V$ or $Q$) to form a bootstrapped advantage, e.g. $A_t=r_t+\gamma V(s_{t+1})-V(s_t)$ (TD), replacing REINFORCE's full Monte-Carlo return $G_t$. MC returns are unbiased but high-variance (noise accumulates over the episode); the bootstrapped critic target has much lower variance but is biased while the critic is imperfect. GAE($\lambda$) interpolates: $\lambda\to1$ approaches MC (low bias, high variance), $\lambda\to0$ approaches one-step TD (low variance, more bias). The actor follows $\nabla\log\pi\cdot A_t$; the critic is trained by regression to its TD target.
#actor-critic#advantage#gae#bias-variance
Advancedmath

Write PPO's clipped surrogate objective and explain what problem clipping solves that vanilla policy gradient and TRPO each had.

Let $r_t(\theta)=\pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$. PPO maximizes $\mathbb{E}[\min(r_t A_t,\ \text{clip}(r_t,1-\epsilon,1+\epsilon)A_t)]$. Vanilla policy gradient takes uncontrolled steps that can collapse the policy after one bad update; TRPO enforces a trust region via a hard KL constraint solved with conjugate gradients — correct but complex and second-order. PPO's clip is a cheap first-order surrogate: once the ratio moves past $1\pm\epsilon$ in the improving direction, the objective flattens, removing incentive to step further and approximating the trust region. The $\min$ keeps the bound pessimistic when $A_t<0$, and this permits multiple SGD epochs per batch of on-policy data.
#ppo#trpo#clipping#trust-region
Expertconcept

In RLHF, what plays the roles of policy, reward, and the KL penalty, and why is the KL-to-reference term essential?

The policy is the LLM being fine-tuned; the reward is a learned reward model (trained on human preference pairs via a Bradley-Terry loss) scoring a completion, typically optimized with PPO. The objective adds a per-token KL penalty to a frozen reference (the SFT model): $r_{total}=r_{RM}(x,y)-\beta\,\mathrm{KL}(\pi_\theta\|\pi_{ref})$. The KL term prevents reward hacking/over-optimization — without it the policy drifts to degenerate text that fools the imperfect reward model — and keeps generations fluent and on-distribution. DPO later folds this same KL-regularized objective into a closed-form classification loss, removing the explicit reward model and online sampling.
#rlhf#reward-model#kl-penalty#dpo
Expertconcept

Explain reward hacking / Goodhart's law in RL and give concrete mitigations a staff engineer would deploy.

Reward hacking occurs when the agent maximizes the proxy reward in ways that violate the designer's intent, because the proxy imperfectly measures the goal (Goodhart: a measure that becomes a target ceases to be a good measure). Examples: a boat-racing agent looping to collect points instead of finishing; an RLHF policy producing sycophantic or confidently-wrong text the reward model overrates. Mitigations: KL/anchor regularization to a trusted reference; conservative or ensemble reward models with uncertainty penalties; early stopping on a held-out true metric; reward-model retraining on adversarial outputs (iterated RLHF); constrained/Lagrangian objectives; and human-in-the-loop audits. The root fix is better reward specification, not stronger optimization.
#reward-hacking#goodhart#reward-shaping#safety
Expertconcept

What is the 'deadly triad', why does it cause divergence, and how do modern algorithms cope?

The deadly triad is the simultaneous combination of (1) function approximation, (2) bootstrapping (TD targets that depend on current estimates), and (3) off-policy training. Any two are safe; all three can make value estimates diverge. Baird's counterexample shows TD weights blowing up even when a perfect representable solution exists, because the semi-gradient TD update is not a true gradient and the off-policy state distribution mismatches the bootstrap, so the update is no longer a contraction. Coping strategies: target networks plus large replay (DQN) to slow the moving target; gradient-TD methods (GTD2/TDC) that follow the true gradient of the projected Bellman error; emphatic-TD reweighting; or staying on-policy.
#deadly-triad#divergence#off-policy#function-approximation
Expertconcept

Why does temporal-difference learning often converge faster than Monte-Carlo in practice, and when can TD's bias actually hurt?

TD bootstraps from its own current estimate, so it learns online from incomplete episodes and propagates information one step at a time without waiting for termination — it exploits Markov structure and has far lower variance than MC, which accumulates the full noisy return. This usually yields faster, more data-efficient learning. The cost is bias: TD targets are wrong while the estimator is wrong, and that bias is amplified under function approximation and off-policy data (the deadly triad), where TD can be unstable or converge to a biased fixed point. In genuinely non-Markov / partially-observed settings MC can be more robust because it doesn't trust a bootstrapped value. TD($\lambda$) interpolates the two.
#td-learning#monte-carlo#bootstrapping#bias-variance