What is a vanilla (Elman) RNN, and write its recurrence relation?
A vanilla RNN processes a sequence by maintaining a hidden state updated at each timestep from the current input and the previous state, sharing the same weights across all timesteps. The Elman recurrence is $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$ with output $y_t = W_{hy} h_t + b_y$. Weight sharing gives parameter efficiency and handles variable-length sequences, while $h_t$ acts as a compressed memory of all prior inputs. The tanh nonlinearity is what lets it model nonlinear temporal dependencies.
#rnn#hidden-state#recurrence#weight-sharing
Foundationalconcept
What is a bidirectional RNN, what does it gain, and what is its hard constraint?
A bidirectional RNN runs two independent recurrences — one left-to-right, one right-to-left — and concatenates (or sums) their hidden states at each position, so each output sees both past and future context. This helps tasks like NER, POS tagging, and acoustic modeling where downstream tokens disambiguate the current one. The hard constraint is that it needs the entire sequence up front: it cannot do streaming/online or autoregressive generation, where the future is unknown. It also doubles parameters and compute, and the backward pass cannot start until the full input is seen.
#bidirectional#context#tagging#streaming
Intermediateconcept
Explain Backpropagation Through Time (BPTT) and the practical motivation for truncating it.
BPTT trains an RNN by unrolling it across timesteps into a deep feedforward graph with tied weights, then applying ordinary backprop; the gradient for a shared weight is the sum of its contributions at every timestep. For long sequences this is expensive (memory and compute grow with sequence length) and gradients across many steps become unstable. Truncated BPTT limits backprop to the last $k$ steps (carrying the hidden state forward but cutting the gradient), bounding cost and mitigating vanishing/exploding gradients, at the price of being unable to learn dependencies longer than the truncation window.
#bptt#training#truncation#gradients
Intermediateconcept
Describe the LSTM cell's gates and how the cell state $c_t$ enables long-range memory.
An LSTM has a forget gate $f_t$, input gate $i_t$, candidate $\tilde c_t$, and output gate $o_t$, all sigmoid/tanh functions of $[h_{t-1}, x_t]$. The cell state updates additively: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$, and $h_t = o_t \odot \tanh(c_t)$. The key is the additive update and the near-linear constant error carousel along $c_t$: gradients flow back through $c_t$ scaled mainly by the forget gates rather than repeated matrix-Jacobian products, so when $f_t \approx 1$ information and gradient persist across many steps. The gates learn what to keep, write, and expose.
#lstm#cell-state#forget-gate#gating
Intermediateconcept
How does a GRU differ from an LSTM, and when might you prefer it?
A GRU merges the LSTM's cell and hidden state into one vector and uses two gates: update $z_t$ and reset $r_t$. The update is $h_t = (1-z_t)\odot h_{t-1} + z_t \odot \tilde h_t$ with $\tilde h_t = \tanh(W x_t + U(r_t \odot h_{t-1}))$, where $z_t$ interpolates between the old state and the new candidate (an additive, leaky path like the LSTM). It has fewer parameters (no separate output gate or cell), trains faster, and often matches LSTM accuracy on smaller datasets or shorter sequences; LSTMs can edge it out on very long-range tasks where separating memory from exposure helps.
#gru#lstm#update-gate#reset-gate
Intermediatesystem-design
Explain the seq2seq encoder-decoder architecture and the information-bottleneck problem that motivated attention.
Seq2seq uses an encoder RNN to compress an input sequence into a fixed-length context vector (its final hidden state), which initializes a decoder RNN that autoregressively generates the output. The bottleneck is that a single fixed vector must encode the entire source regardless of length, so accuracy degrades on long inputs and early-token information is lost. Bahdanau/Luong attention fixes this by letting the decoder compute, at each step, a weighted sum over all encoder hidden states (alignment), giving direct access to relevant source positions and removing the single-vector bottleneck — the conceptual seed of the Transformer.
#seq2seq#encoder-decoder#attention#bottleneck
Intermediatecoding
Implement an LSTM cell's forward pass in NumPy-style pseudocode for one timestep.
Fuse the recurrence with one weight matrix over $[x_t, h_{t-1}]$: z = W @ concat(x, h_prev) + b, where W has shape (4*H, D+H). Split into four chunks and gate: f = sigmoid(z[0:H]); i = sigmoid(z[H:2H]); g = tanh(z[2H:3H]); o = sigmoid(z[3H:4H]). Then c = f * c_prev + i * g (elementwise), and h = o * tanh(c). Return h, c. Initialize the forget-gate bias to ~1 so the cell defaults to remembering early in training. The single fused matmul is the standard efficiency trick over four separate ones.
#lstm#implementation#numpy#forward-pass
Advancedmath
Derive why vanilla RNNs suffer vanishing/exploding gradients over long horizons.
The gradient of a loss at step $t$ w.r.t. an earlier state $h_k$ factors as a product of Jacobians: $\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod \text{diag}(\sigma'(\cdot)) W_{hh}^\top$. Its norm is bounded by roughly $(\gamma \|W_{hh}\|)^{t-k}$ where $\gamma$ bounds the activation derivative. If the largest singular value of $W_{hh}$ (times $\gamma$) is $<1$, this product shrinks exponentially toward zero (vanishing); if $>1$, it blows up (exploding). Vanishing kills long-range credit assignment; exploding is patched with gradient clipping.
What is teacher forcing, and what is exposure bias?
Teacher forcing trains a sequence decoder by feeding the ground-truth previous token (rather than the model's own prediction) at each step, which stabilizes and speeds up training since errors don't compound during teaching. The downside is exposure bias: at inference the model conditions on its own (possibly wrong) prior outputs — a distribution it never saw in training — so small errors snowball. Mitigations include scheduled sampling (probabilistically feeding the model's own predictions during training), beam search at decode time, and sequence-level objectives; modern practice often accepts teacher forcing and leans on scale plus better decoding.
What problem does Connectionist Temporal Classification (CTC) solve, and how does the blank token work?
CTC trains sequence models (e.g. speech or handwriting recognition) when input and output lengths differ and no per-frame alignment is given. It introduces a blank symbol and a many-to-one collapse: merge consecutive duplicate labels, then remove blanks (so 'h-e-e-l-l-o' with blanks → 'hello'; a blank between identical characters preserves a genuine repeat). The CTC loss marginalizes over all alignment paths that collapse to the target, $\sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi)$, computed efficiently with a forward-backward dynamic program. This yields a differentiable, alignment-free objective; decoding uses greedy or prefix-beam search.
#ctc#alignment#blank-token#forward-backward
Advancedconcept
Why did Transformers largely replace RNNs for sequence modeling — give the deeper reasons beyond 'attention is better'.
Three structural wins. (1) Parallelism: RNNs are sequentially dependent ($h_t$ needs $h_{t-1}$), so training can't parallelize over time; self-attention computes all positions simultaneously, exploiting GPU/TPU throughput. (2) Constant path length: any two tokens interact in $O(1)$ hops via attention versus $O(n)$ through an RNN's recurrence, so gradients and long-range dependencies propagate far more easily. (3) Scalability: parallelism plus stable optimization let Transformers ride compute/data scaling laws to billions of parameters. The cost is $O(n^2)$ attention and the loss of an inductive bias for order (hence positional encodings).
#transformers#parallelism#path-length#scaling
Expertconcept
Initializing $W_{hh}$ as an orthogonal/identity matrix with ReLU was proposed for RNNs (IRNN). Explain the theory and why it still underperforms gated units on long sequences.
Vanishing/exploding gradients trace to the spectral norm of the repeated Jacobian $\prod \text{diag}(\sigma')W_{hh}^\top$. An orthogonal $W_{hh}$ has all singular values exactly 1, preserving gradient norm; identity init plus ReLU (Le et al.'s IRNN) starts the dynamics near a norm-preserving copy-the-state map. But this holds only at initialization — as weights move during training the spectrum drifts, and ReLU's unbounded activations make exploding states likely. Gated units (LSTM/GRU) instead make the memory path additive and data-dependently gated, so the carry behavior is learned and adaptively maintained, not a fragile initial condition. Unitary/orthogonal-constrained RNNs enforce it throughout but cost expressiveness and compute.
#orthogonal-init#irnn#spectral-norm#unitary-rnn
Expertcert
Speech-recognition scenario: you must label 16 kHz audio frames with characters, you have no frame-level alignment, and you need streaming low-latency output. Which loss/architecture, and what tradeoff vs. an attention encoder-decoder?
Use a unidirectional, streaming-capable acoustic model trained with CTC loss, or better an RNN-Transducer (RNN-T) for streaming. CTC handles the missing alignment by marginalizing over all blank-augmented label paths and emits per-frame, enabling low-latency online decoding; its weakness is assuming outputs are conditionally independent given the input (poor implicit language modeling). An attention encoder-decoder gives stronger language modeling and higher accuracy but needs the full utterance (non-streaming) and can suffer alignment failures on long audio. RNN-T combines CTC-style streaming with a learned prediction/joint network that relaxes CTC's independence assumption — the common production choice.