Q: Describe the LSTM cell's gates and how the cell state $c_t$ enables long-range memory.

An LSTM has a forget gate f_t, input gate i_t, candidate \tilde c_t, and output gate o_t, all sigmoid/tanh functions of [h_{t-1}, x_t]. The cell state updates additively: c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t, and h_t = o_t \odot \tanh(c_t). The key is the additive update and the near-linear constant error carousel along c_t: gradients flow back through c_t scaled mainly by the forget gates rather than repeated matrix-Jacobian products, so when f_t \approx 1 information and gradient persist across many steps. The gates learn what to keep, write, and expose.

Question 1

What is a vanilla (Elman) RNN, and write its recurrence relation?

Accepted Answer

A vanilla RNN processes a sequence by maintaining a hidden state updated at each timestep from the current input and the previous state, sharing the same weights across all timesteps. The Elman recurrence is h_t = 	anh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) with output y_t = W_{hy} h_t + b_y. Weight sharing gives parameter efficiency and handles variable-length sequences, while h_t acts as a compressed memory of all prior inputs. The tanh nonlinearity is what lets it model nonlinear temporal dependencies.

Question 2

What is a bidirectional RNN, what does it gain, and what is its hard constraint?

Accepted Answer

A bidirectional RNN runs two independent recurrences — one left-to-right, one right-to-left — and concatenates (or sums) their hidden states at each position, so each output sees both past and future context. This helps tasks like NER, POS tagging, and acoustic modeling where downstream tokens disambiguate the current one. The hard constraint is that it needs the entire sequence up front: it cannot do streaming/online or autoregressive generation, where the future is unknown. It also doubles parameters and compute, and the backward pass cannot start until the full input is seen.

Question 3

Explain Backpropagation Through Time (BPTT) and the practical motivation for truncating it.

Accepted Answer

BPTT trains an RNN by unrolling it across timesteps into a deep feedforward graph with tied weights, then applying ordinary backprop; the gradient for a shared weight is the sum of its contributions at every timestep. For long sequences this is expensive (memory and compute grow with sequence length) and gradients across many steps become unstable. Truncated BPTT limits backprop to the last k steps (carrying the hidden state forward but cutting the gradient), bounding cost and mitigating vanishing/exploding gradients, at the price of being unable to learn dependencies longer than the truncation window.

Question 4

Describe the LSTM cell's gates and how the cell state $c_t$ enables long-range memory.

Accepted Answer

An LSTM has a forget gate f_t, input gate i_t, candidate 	ilde c_t, and output gate o_t, all sigmoid/tanh functions of [h_{t-1}, x_t]. The cell state updates additively: c_t = f_t \odot c_{t-1} + i_t \odot 	ilde c_t, and h_t = o_t \odot 	anh(c_t). The key is the additive update and the near-linear constant error carousel along c_t: gradients flow back through c_t scaled mainly by the forget gates rather than repeated matrix-Jacobian products, so when f_t \approx 1 information and gradient persist across many steps. The gates learn what to keep, write, and expose.

Question 5

How does a GRU differ from an LSTM, and when might you prefer it?

Accepted Answer

A GRU merges the LSTM's cell and hidden state into one vector and uses two gates: update z_t and reset r_t. The update is h_t = (1-z_t)\odot h_{t-1} + z_t \odot 	ilde h_t with 	ilde h_t = 	anh(W x_t + U(r_t \odot h_{t-1})), where z_t interpolates between the old state and the new candidate (an additive, leaky path like the LSTM). It has fewer parameters (no separate output gate or cell), trains faster, and often matches LSTM accuracy on smaller datasets or shorter sequences; LSTMs can edge it out on very long-range tasks where separating memory from exposure helps.

Question 6

Explain the seq2seq encoder-decoder architecture and the information-bottleneck problem that motivated attention.

Accepted Answer

Seq2seq uses an encoder RNN to compress an input sequence into a fixed-length context vector (its final hidden state), which initializes a decoder RNN that autoregressively generates the output. The bottleneck is that a single fixed vector must encode the entire source regardless of length, so accuracy degrades on long inputs and early-token information is lost. Bahdanau/Luong attention fixes this by letting the decoder compute, at each step, a weighted sum over all encoder hidden states (alignment), giving direct access to relevant source positions and removing the single-vector bottleneck — the conceptual seed of the Transformer.

Question 7

Implement an LSTM cell's forward pass in NumPy-style pseudocode for one timestep.

Accepted Answer

Fuse the recurrence with one weight matrix over [x_t, h_{t-1}]: z = W @ concat(x, h_prev) + b, where W has shape (4*H, D+H). Split into four chunks and gate: f = sigmoid(z[0:H]); i = sigmoid(z[H:2H]); g = tanh(z[2H:3H]); o = sigmoid(z[3H:4H]). Then c = f * c_prev + i * g (elementwise), and h = o * tanh(c). Return h, c. Initialize the forget-gate bias to ~1 so the cell defaults to remembering early in training. The single fused matmul is the standard efficiency trick over four separate ones.

Question 8

Derive why vanilla RNNs suffer vanishing/exploding gradients over long horizons.

Accepted Answer

The gradient of a loss at step t w.r.t. an earlier state h_k factors as a product of Jacobians: \frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod 	ext{diag}(\sigma'(\cdot)) W_{hh}^	op. Its norm is bounded by roughly (\gamma \|W_{hh}\|)^{t-k} where \gamma bounds the activation derivative. If the largest singular value of W_{hh} (times \gamma) is <1, this product shrinks exponentially toward zero (vanishing); if >1, it blows up (exploding). Vanishing kills long-range credit assignment; exploding is patched with gradient clipping.

Question 9

What is teacher forcing, and what is exposure bias?

Accepted Answer

Teacher forcing trains a sequence decoder by feeding the ground-truth previous token (rather than the model's own prediction) at each step, which stabilizes and speeds up training since errors don't compound during teaching. The downside is exposure bias: at inference the model conditions on its own (possibly wrong) prior outputs — a distribution it never saw in training — so small errors snowball. Mitigations include scheduled sampling (probabilistically feeding the model's own predictions during training), beam search at decode time, and sequence-level objectives; modern practice often accepts teacher forcing and leans on scale plus better decoding.

Question 10

What problem does Connectionist Temporal Classification (CTC) solve, and how does the blank token work?

Accepted Answer

CTC trains sequence models (e.g. speech or handwriting recognition) when input and output lengths differ and no per-frame alignment is given. It introduces a blank symbol and a many-to-one collapse: merge consecutive duplicate labels, then remove blanks (so 'h-e-e-l-l-o' with blanks → 'hello'; a blank between identical characters preserves a genuine repeat). The CTC loss marginalizes over all alignment paths that collapse to the target, \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi), computed efficiently with a forward-backward dynamic program. This yields a differentiable, alignment-free objective; decoding uses greedy or prefix-beam search.

Question 11

Why did Transformers largely replace RNNs for sequence modeling — give the deeper reasons beyond 'attention is better'.

Accepted Answer

Three structural wins. (1) Parallelism: RNNs are sequentially dependent (h_t needs h_{t-1}), so training can't parallelize over time; self-attention computes all positions simultaneously, exploiting GPU/TPU throughput. (2) Constant path length: any two tokens interact in O(1) hops via attention versus O(n) through an RNN's recurrence, so gradients and long-range dependencies propagate far more easily. (3) Scalability: parallelism plus stable optimization let Transformers ride compute/data scaling laws to billions of parameters. The cost is O(n^2) attention and the loss of an inductive bias for order (hence positional encodings).

Question 12

Initializing $W_{hh}$ as an orthogonal/identity matrix with ReLU was proposed for RNNs (IRNN). Explain the theory and why it still underperforms gated units on long sequences.

Accepted Answer

Vanishing/exploding gradients trace to the spectral norm of the repeated Jacobian \prod 	ext{diag}(\sigma')W_{hh}^	op. An orthogonal W_{hh} has all singular values exactly 1, preserving gradient norm; identity init plus ReLU (Le et al.'s IRNN) starts the dynamics near a norm-preserving copy-the-state map. But this holds only at initialization — as weights move during training the spectrum drifts, and ReLU's unbounded activations make exploding states likely. Gated units (LSTM/GRU) instead make the memory path additive and data-dependently gated, so the carry behavior is learned and adaptively maintained, not a fragile initial condition. Unitary/orthogonal-constrained RNNs enforce it throughout but cost expressiveness and compute.

Question 13

Speech-recognition scenario: you must label 16 kHz audio frames with characters, you have no frame-level alignment, and you need streaming low-latency output. Which loss/architecture, and what tradeoff vs. an attention encoder-decoder?

Accepted Answer

Use a unidirectional, streaming-capable acoustic model trained with CTC loss, or better an RNN-Transducer (RNN-T) for streaming. CTC handles the missing alignment by marginalizing over all blank-augmented label paths and emits per-frame, enabling low-latency online decoding; its weakness is assuming outputs are conditionally independent given the input (poor implicit language modeling). An attention encoder-decoder gives stronger language modeling and higher accuracy but needs the full utterance (non-streaming) and can suffer alignment failures on long audio. RNN-T combines CTC-style streaming with a learned prediction/joint network that relaxes CTC's independence assumption — the common production choice.

RNNs, LSTMs & Sequence Models