Question 1

What are the canonical stages of the ML lifecycle, and why is it described as a loop rather than a pipeline?

Accepted Answer

The stages are problem framing, data collection/labeling, feature engineering, training/experimentation, evaluation, deployment/serving, and monitoring. It's a loop because production performance decays as the world shifts (data/concept drift), so monitoring feeds retraining triggers back into data and training. Unlike a one-way software build, a model is a function of data that ages; the feedback edge from monitoring to retraining is what makes it a cycle. Treating it as a linear pipeline causes silent degradation because nothing closes the loop between live behavior and the next model version.

Question 2

Distinguish experiment tracking, a model registry, and data/model versioning. Which tool owns which concern?

Accepted Answer

Experiment tracking (MLflow Tracking, Weights & Biases) logs runs: params, metrics, code version, artifacts — answering 'which config produced this metric.' A model registry (MLflow Registry, W&B Artifacts) is the system of record for trained model versions and their stage (Staging/Production/Archived), governing promotion and rollback. Data/model versioning (DVC, lakeFS) version-controls large datasets and model files by hash, storing pointers in git and blobs in object storage. They compose: track experiments → register the winning model → DVC pins the exact data that produced it for reproducibility.

Question 3

What is the difference between data drift and concept drift, and why does the distinction change your remediation?

Accepted Answer

Data drift (covariate shift) is a change in the input distribution P(X) while the relationship P(Y\mid X) stays fixed — e.g., new user demographics. Concept drift is a change in P(Y\mid X) itself — the same inputs now map to different labels (fraud patterns evolve). The distinction matters because data drift may be tolerable if the model still generalizes, and can sometimes be fixed by reweighting or collecting more of the new region; concept drift invalidates the learned mapping and almost always demands retraining on fresh labels. Label drift P(Y) is a third, separate signal.

Question 4

Why can't you reuse a standard software CI/CD pipeline unchanged for ML, and what extra gates does CI/CD for ML add?

Accepted Answer

Software CI tests code; ML systems also depend on data and the trained artifact, which standard CI ignores. CI/CD for ML adds: data validation (schema, distribution, null rates), model training as a pipeline step, model evaluation gates (accuracy/fairness thresholds vs a champion), and a staged rollout (shadow/canary) rather than a binary deploy. Reproducibility requires pinning data and hyperparameters, not just the commit. The unit of release is the (code, data, model) triple, so 'continuous training' (CT) joins CI/CD — Google's MLOps maturity model calls this CI/CD/CT.
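
As an illustration of the evaluation gate described above, here is a minimal sketch of a champion/challenger check that a pipeline step might run before staged rollout. The function name `passes_gate`, the metric keys, and the thresholds are all hypothetical and would be chosen per project.

```python
def passes_gate(champion, challenger, min_gain=0.0, max_fairness_gap=0.05):
    """Return True if the challenger may proceed to staged rollout.

    champion / challenger: dicts with hypothetical keys
    'accuracy' (higher is better) and 'fairness_gap' (lower is better).
    """
    beats_champion = challenger["accuracy"] >= champion["accuracy"] + min_gain
    fair_enough = challenger["fairness_gap"] <= max_fairness_gap
    return beats_champion and fair_enough


# Illustrative metric values, not real results.
champion = {"accuracy": 0.91, "fairness_gap": 0.03}
good = {"accuracy": 0.93, "fairness_gap": 0.04}
unfair = {"accuracy": 0.95, "fairness_gap": 0.09}
```

Here `passes_gate(champion, good)` passes, while `unfair` is blocked despite its higher accuracy, which is the point of gating on more than one metric.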

Question 5

When is batch (offline) serving preferable to real-time (online) serving, and what are the cost/latency tradeoffs?

Accepted Answer

Batch serving precomputes predictions on a schedule (e.g., nightly) and stores them for cheap lookup; it fits when inputs are known ahead and freshness of minutes-to-hours is fine — churn scores, recommendations, lead scoring. It maximizes throughput and amortizes GPU cost but can't react to just-arrived inputs. Real-time serving computes on request, needed when the input only exists at request time (fraud on a live transaction, dynamic pricing) or freshness is critical; it costs more per prediction, requires low-latency feature retrieval, and must be provisioned for peak load. Many systems do both (streaming/lambda hybrid).

Question 6

Compare TorchServe, TF Serving, NVIDIA Triton, KServe, and BentoML — what layer does each occupy?

Accepted Answer

TF Serving and TorchServe are framework-specific model servers (TensorFlow/SavedModel and PyTorch), handling loading, versioning, and gRPC/REST inference. NVIDIA Triton is a framework-agnostic, multi-backend server (ONNX, TensorRT, PyTorch, TF, Python) with dynamic batching and concurrent model execution, optimized for GPU. KServe (formerly KFServing) is a Kubernetes-native control plane: it wraps any of these servers behind a standardized InferenceService CRD with autoscaling (incl. scale-to-zero via Knative), canary, and explainers. BentoML packages model + Python service + deps into a 'Bento' archive that can be built into an OCI image and deployed to many runtimes. TF Serving/TorchServe/Triton = runtime; KServe = orchestration; BentoML = packaging.

Question 7

Explain dynamic batching in an inference server. Why does it improve GPU utilization and what's the cost?

Accepted Answer

Dynamic batching holds incoming requests in a short queue (a max-delay window, e.g., 2 ms) and merges them into one batched forward pass. GPUs are throughput devices: a single request underutilizes the SIMT cores, so batching raises arithmetic intensity and amortizes kernel-launch and memory-transfer overhead, sharply increasing requests/sec. The cost is added tail latency — every request waits up to the batch window — so you tune max_batch_size and max_queue_delay against an SLA. It only helps if the model is GPU-bound and arrival rate is high enough to fill batches; under light load it just adds delay.
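
The queueing rule above can be sketched as a small simulation. This is a simplified model under stated assumptions, not any server's actual scheduler: a batch is dispatched when it is full or when its oldest request has waited the full window. The function name `form_batches` and its parameters are hypothetical.

```python
def form_batches(arrival_ms, max_batch_size, max_queue_delay_ms):
    """Group sorted arrival times into batches.

    A batch is dispatched when it is full, or when the oldest queued
    request has waited max_queue_delay_ms, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival times in batch]).
    """
    batches = []
    queue = []
    for t in arrival_ms:
        # Flush the queue if its deadline passed before this arrival.
        if queue and t > queue[0] + max_queue_delay_ms:
            batches.append((queue[0] + max_queue_delay_ms, queue))
            queue = []
        queue.append(t)
        if len(queue) == max_batch_size:
            batches.append((t, queue))  # full batch leaves immediately
            queue = []
    if queue:
        # Leftover requests leave when the window of the oldest one expires.
        batches.append((queue[0] + max_queue_delay_ms, queue))
    return batches


batches = form_batches([0.0, 0.5, 1.0, 1.5, 10.0], max_batch_size=4, max_queue_delay_ms=2.0)
```

In this run the four closely spaced requests leave together at 1.5 ms, while the lone request arriving at 10.0 ms waits the full 2 ms window and is dispatched at 12.0 ms in a batch of one, which is the light-load penalty described above.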

Question 8

Describe shadow deployment vs canary deployment vs blue-green. When do you choose each for a model?

Accepted Answer

Shadow (dark launch): the new model receives a copy of live traffic but its responses are discarded — validates latency and prediction behavior with zero user risk. Canary: route a small % of real traffic to the new model, watch business/quality metrics, then ramp; catches real-world regressions but exposes some users. Blue-green: stand up the full new version alongside old, then flip 100% atomically with instant rollback; fast to revert but no gradual exposure. Choose shadow to de-risk silently, canary when you need real outcome signals (clicks, conversions) unavailable offline, blue-green for low-risk swaps needing instant rollback.

Question 9

What is a feature store, and what specific failure does online/offline feature consistency prevent?

Accepted Answer

A feature store (Feast, Tecton) centralizes feature definitions and serves them in two modes: an offline store (warehouse) for training and a low-latency online store (Redis/DynamoDB) for serving. It prevents training-serving skew — the bug where a feature is computed one way in training and a subtly different way in serving (different time windows, joins, or null handling), so the model sees inference inputs unlike anything it trained on, silently degrading accuracy. By compiling features once from a shared definition and guaranteeing point-in-time-correct joins (no label leakage), the store enforces that train and serve see identical feature logic.
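
To make "point-in-time-correct" concrete, here is a minimal sketch of the lookup rule: for a training row with event time `as_of`, use the latest feature value recorded at or before that time and never a later one. The function name and the toy history are hypothetical; real feature stores implement this as a join over many entities.

```python
from bisect import bisect_right


def point_in_time_value(feature_history, as_of):
    """Latest feature value with timestamp <= as_of, or None if none exists.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)  # first position strictly after as_of
    return feature_history[idx - 1][1] if idx > 0 else None


# Toy history for one entity: value changes at t=1, t=5, t=9.
history = [(1, 10.0), (5, 12.0), (9, 30.0)]
```

A label observed at time 6 is joined with the value written at time 5 (12.0), not the later value 30.0, which would leak future information into training.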

Question 10

What should an ML monitoring stack track beyond model accuracy, and why is accuracy often unavailable in production?

Accepted Answer

Ground-truth labels are usually delayed or absent in production (you learn true churn weeks later), so accuracy can't be computed live. The stack therefore monitors: operational metrics (latency, throughput, error rate, resource use), input drift (feature distributions vs a training baseline), prediction drift (output shifts), data-quality signals (nulls, schema violations, out-of-range), and delayed performance metrics once labels arrive. Proxy/business metrics (CTR, approval rate) act as leading indicators. The goal is to detect degradation from inputs and outputs alone, before delayed labels confirm it, and trigger investigation or retraining.

Question 11

Cert scenario: A real-time fraud model on a managed serving endpoint shows stable accuracy on its monitoring dashboard, but the fraud team reports rising missed fraud. Labels arrive ~30 days late. What is the MOST likely explanation and best next step?

Accepted Answer

The dashboard 'accuracy' is computed on stale, 30-day-old matured labels, so it reflects the model's past, not current, performance — meanwhile concept drift (evolving fraud tactics) has changed P(Y\mid X) and the live model is missing new fraud. The metric simply hasn't caught up. Best next step: stop trusting delayed accuracy as a real-time signal; stand up input/prediction drift monitoring and proxy/business metrics as leading indicators, and configure a drift- and performance-based retraining trigger with a shadow/canary rollout. Pulling forward a faster label source (analyst review queue) shortens detection lag. Retraining on fresh labeled fraud is the remediation.

Question 12

Cert scenario: You must deploy a new model version where you cannot risk any user-facing regression yet need to validate real-world latency and prediction parity at full production traffic before any user sees its output. Which rollout pattern, and why not canary?

Accepted Answer

Use a shadow (mirror) deployment: production traffic is duplicated to the new model, it runs at full real-world load and you compare its predictions and latency against the live model, but its responses are never returned to users — so zero regression risk. Canary is wrong here because it routes a real fraction of traffic to the new model, meaning some users would receive its unvalidated outputs, violating the 'no user sees it yet' constraint. Once shadow confirms parity and latency at scale, graduate to canary for outcome-metric validation, then full rollout with blue-green-style instant rollback.
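
The shadow pattern can be sketched at the application level as follows. This is an illustrative simplification: in practice the mirroring is usually done asynchronously or at the network layer so the shadow call adds no user-facing latency. The names `serve_with_shadow` and `mismatch_log` are hypothetical.

```python
def serve_with_shadow(features, live_model, shadow_model, mismatch_log):
    """Return the live model's prediction; run the shadow model on the side."""
    live_pred = live_model(features)
    try:
        shadow_pred = shadow_model(features)
        if shadow_pred != live_pred:
            # Record disagreements for offline parity analysis.
            mismatch_log.append((features, live_pred, shadow_pred))
    except Exception as exc:
        # A shadow failure is logged but must never reach the user.
        mismatch_log.append((features, live_pred, repr(exc)))
    return live_pred


# Stand-in models: simple threshold rules on a single numeric input.
def live(amount):
    return amount > 10


def shadow(amount):
    return amount > 5
```

Whatever the shadow model returns or raises, the caller only ever sees `live_pred`; the log is what feeds the parity and latency comparison.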

Question 13

Derive and contrast PSI and the KS test as drift detectors. When does each fail?

Accepted Answer

Population Stability Index: bin both samples, PSI=\sum_i (a_i-e_i)\ln(a_i/e_i) where a_i,e_i are actual/expected proportions per bin; rules of thumb: <0.1 stable, 0.1–0.25 moderate, >0.25 significant. The KS test uses the max gap between empirical CDFs, D=\sup_x|F_{ref}(x)-F_{cur}(x)|, giving a p-value. PSI is binning-sensitive (results swing with bin count/edges) and undefined when a bin is empty (needs smoothing). KS is sample-size sensitive — on big streams tiny, irrelevant shifts become 'significant' (p→0), and it's univariate, missing correlation drift. Both are marginal tests; neither catches multivariate drift where each feature looks fine but the joint distribution moved.
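
For the KS side, a minimal pure-Python sketch of the two-sample statistic D is given below. It computes only the max CDF gap, not the p-value; the function name `ks_statistic` is hypothetical.

```python
def ks_statistic(ref, cur):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    ref_sorted, cur_sorted = sorted(ref), sorted(cur)
    n, m = len(ref_sorted), len(cur_sorted)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref_sorted[i], cur_sorted[j])
        # Advance both CDFs past every point equal to x (handles ties).
        while i < n and ref_sorted[i] == x:
            i += 1
        while j < m and cur_sorted[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d
```

Identical samples give D = 0, fully separated samples give D = 1, and partial overlap lands in between.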

Question 14

Marginal drift tests pass on every feature yet the model degrades. How do you detect multivariate / joint drift?

Accepted Answer

Use methods that score the joint distribution. (1) Classifier two-sample test (domain classifier): label reference rows 0 and current rows 1, train a classifier; AUC meaningfully above 0.5 means the distributions are separable, i.e., drift — and feature importances localize it. (2) Maximum Mean Discrepancy (MMD), a kernel distance between sample distributions in an RKHS. (3) Monitor in a learned embedding/latent space rather than raw features. (4) Track the model's own confidence/uncertainty and prediction distribution. These catch correlation-structure shifts (each marginal unchanged but the copula moved) that per-feature PSI/KS miss; the classifier test is the most practical and interpretable.
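
As a small illustration of option (2), the sketch below computes a biased estimate of squared MMD with an RBF kernel on a toy case where both marginals are identical but the correlation flips sign. The kernel bandwidth `gamma` and the toy data are assumptions chosen for illustration.

```python
import math


def rbf(a, b, gamma=1.0):
    """RBF kernel between two points given as tuples."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of points."""
    def mean_k(ps, qs):
        return sum(rbf(p, q, gamma) for p in ps for q in qs) / (len(ps) * len(qs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)


grid = (-1.0, -0.5, 0.0, 0.5, 1.0)
ref = [(v, v) for v in grid]    # second feature equals the first
cur = [(v, -v) for v in grid]   # same marginals, correlation reversed
```

Each feature's values are the same set in `ref` and `cur`, so per-feature PSI or KS would report no change, while `mmd2(ref, cur)` is clearly positive. In practice the threshold is usually set with a permutation test rather than a fixed cutoff.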

Question 15

Design the policy for retraining triggers. Why is 'retrain on a fixed schedule' usually the wrong default?

Accepted Answer

Combine triggers: (a) performance-based — delayed-label metric crosses an SLA floor; (b) drift-based — input/prediction drift exceeds a threshold for a sustained window; (c) data-volume — enough new labeled data accumulated; (d) scheduled, as a backstop. Fixed-schedule alone is wrong because it's both wasteful (retraining a still-good model burns compute and risks shipping a worse model via training noise) and unsafe (a sudden concept shift mid-cycle goes uncorrected until the next tick). Always gate the retrained model behind an offline eval vs the incumbent champion plus a shadow/canary stage. Guard against feedback loops where the model's own outputs pollute future training labels.

Question 16

What makes an ML training run truly reproducible, and which sources of nondeterminism survive even after you pin code, data, and seeds?

Accepted Answer

Reproducibility needs the full triple pinned: code (git SHA), data (content hash via DVC/lakeFS), and environment (container image digest + locked deps), plus seeds for all RNGs (Python, NumPy, framework). Even then, nondeterminism survives: GPU kernels use atomic/non-associative floating-point reductions whose order varies (cuDNN nondeterministic algorithms), multi-threaded/async data loading changes ordering, distributed all-reduce ordering, and hardware/driver differences. You suppress these with torch.use_deterministic_algorithms(True), fixed cuDNN flags, single-threaded loaders, and pinned CUDA — at a throughput cost. Bitwise reproducibility is often impractical; aim for statistical reproducibility (metrics within tolerance) and log everything needed to rebuild.
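
One hedged way to "log everything needed to rebuild" is to fold the pinned inputs into a single fingerprint stored with the run. The sketch below uses only the standard library; the function name and the argument values are hypothetical placeholders for a real git SHA, dataset bytes, and image digest.

```python
import hashlib
import json


def run_fingerprint(git_sha, data_bytes, image_digest, seed):
    """Hash the (code, data, environment, seed) tuple into one identifier."""
    manifest = {
        "code": git_sha,
        "data": hashlib.sha256(data_bytes).hexdigest(),  # content hash of the data
        "env": image_digest,
        "seed": seed,
    }
    # sort_keys makes the serialization, and so the hash, order-independent.
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


fp = run_fingerprint("abc123", b"row1\nrow2\n", "sha256:deadbeef", 42)
```

Two runs with the same fingerprint share pinned inputs; they may still differ numerically for the GPU and threading reasons listed above.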

Question 17

In Kubernetes, why is a generic CPU-utilization HPA a poor autoscaler for GPU model serving, and what do you scale on instead?

Accepted Answer

GPU inference is often GPU-bound while CPU sits low, so a CPU-based HPA never scales up under real load, or scales on the wrong signal. GPUs also can't be fractionally shared by default and cold-start (model load into VRAM) is slow, so naive scaling thrashes. Scale instead on inference-relevant signals via custom/external metrics (KEDA, Prometheus adapter): request queue depth, batch latency, requests-per-second, or GPU utilization/memory. KServe adds concurrency-based autoscaling and scale-to-zero (via Knative) for spiky traffic. Provision GPU node pools with the device plugin, set realistic readiness probes to mask warm-up, and keep a warm minimum to absorb cold-start.

Question 18

Why is serving LLMs fundamentally different from serving a classifier, and what does vLLM's PagedAttention solve?

Accepted Answer

LLM inference is autoregressive: it generates token-by-token, so latency scales with output length and a request occupies the GPU for many steps. The dominant memory cost is the KV cache — keys/values for every prior token, per request — which grows with sequence length and concurrency. Naive serving pre-allocates a contiguous max-length KV buffer per request, wasting VRAM to internal/external fragmentation and capping batch size. vLLM's PagedAttention treats the KV cache like OS virtual memory: it's split into fixed-size blocks allocated on demand and referenced via a block table, eliminating fragmentation and enabling near-full memory use, prefix sharing across requests, and continuous (in-flight) batching — far higher throughput than static batching.
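
The block-table idea can be illustrated with a toy allocator. This is a conceptual sketch only, not vLLM's implementation: it tracks which physical blocks each request owns and allocates a new block only when the previous one is full. The class and method names are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: fixed-size KV blocks handed out on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve KV space for one more token; allocate a block only if needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)
```

With 16-token blocks, a 17-token request holds two blocks and wastes at most 15 slots, instead of reserving a contiguous max-length buffer up front.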

Question 19

What does observability for LLM applications require that classical ML monitoring doesn't?

Accepted Answer

Outputs are open-ended text, so there's no single ground-truth label or accuracy metric. You need trace-level observability over multi-step chains/agents (prompt, retrieved context, tool calls, intermediate steps, final output) — via tooling like LangSmith, Langfuse, Arize Phoenix, or OpenTelemetry GenAI semantics. Track token usage and cost per request, latency and time-to-first-token, and quality via online LLM-as-judge scoring, hallucination/groundedness checks, retrieval relevance, refusal/safety-filter rates, and PII/toxicity detectors. Monitor prompt and embedding drift, and capture user feedback. Because there's no label, evaluation leans on reference-free judges, regression suites of golden prompts, and guardrail metrics rather than a confusion matrix.

Question 20

What does model governance encompass, and how do lineage and a model card support audit/regulatory requirements?

Accepted Answer

Governance is the controls that make a model accountable: access control and approval workflows for promotion, an immutable audit trail of who trained/approved/deployed which version, full lineage (which data, code, and features produced a given prediction), bias/fairness and performance documentation, and a defined retirement path. A model card documents intended use, training data, evaluation across slices, limitations, and ethical considerations — the human-readable disclosure regulators expect. End-to-end lineage lets you answer 'why did the model decide X for this person' and reproduce or roll back; under regimes like the EU AI Act or model-risk rules (SR 11-7), this traceability plus documented validation is mandatory, not optional.

Question 21

Distinguish covariate shift (data drift), concept drift, and label shift formally in terms of which factor of the joint distribution P(X,Y) changes, and give one production symptom of each.

Accepted Answer

Decompose P(X,Y)=P(Y\mid X)P(X)=P(X\mid Y)P(Y). Covariate/data drift: P(X) changes while P(Y\mid X) stays fixed (e.g. new user demographics; inputs move but the true mapping holds). Concept drift: P(Y\mid X) changes — the input-output relationship itself shifts (fraud tactics evolve; same features now mean a different label). Label shift (prior probability shift): P(Y) changes with P(X\mid Y) fixed (disease prevalence rises). Symptoms: data drift shows feature-distribution divergence but maybe stable accuracy; concept drift shows accuracy decay with stable inputs; label shift shows calibration/base-rate errors and skewed predicted-class proportions.

Question 22

Define the Population Stability Index (PSI) and the KS statistic for feature drift. Compute PSI given expected bin proportions [0.4,0.4,0.2] and actual [0.5,0.3,0.2], state the usual thresholds, and name a key limitation the two share.

Accepted Answer

PSI=\sum_i (a_i-e_i)\ln(a_i/e_i). Bin1: (0.5-0.4)\ln(0.5/0.4)=0.1(0.2231)=0.0223; Bin2: (0.3-0.4)\ln(0.3/0.4)=(-0.1)(-0.2877)=0.0288; Bin3: 0. PSI\approx0.051 — below the 0.1 "no significant shift" line (0.1–0.25 moderate, >0.25 major). KS is the max CDF gap \sup_x|F_{ref}(x)-F_{cur}(x)|, distribution-free for continuous univariate data. Limitations: both are univariate/marginal — blind to joint or correlation drift and to concept drift entirely; PSI is binning-sensitive and unstable with empty bins; KS over-rejects at large sample sizes.
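
The arithmetic above can be checked with a few lines of Python. The `eps` floor is one simple smoothing choice for empty bins and is an assumption of this sketch, not part of the PSI definition.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index over pre-binned proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor so empty bins stay finite
        total += (a - e) * math.log(a / e)
    return total


value = psi([0.4, 0.4, 0.2], [0.5, 0.3, 0.2])
```

Rounded to three decimals, `value` is 0.051, matching the hand computation.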

Question 23

Implement a streaming drift detector with bounded memory that flags distribution shift online. Sketch the algorithm and its tradeoff.

Accepted Answer

Use ADWIN-style adaptive windowing or a reservoir-backed two-sample test. Sketch: keep a fixed reference window R (reservoir sample of recent baseline) and a sliding current window C. Per incoming point, update C (ring buffer, O(1)); periodically run a cheap statistic — compare means/variances or a streaming KS — and signal drift when it exceeds a threshold over a sustained sub-window, then reset the reference. ADWIN keeps a variable-length window and shrinks it when two sub-windows' means differ beyond a Hoeffding bound, giving automatic change-point detection in O(\log W) memory. Tradeoff: small windows react fast but raise false positives on noise; large windows are stable but lag real shifts — tune window/threshold against detection-delay vs false-alarm rate.
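
A minimal sketch of the fixed-window variant follows. It is deliberately simpler than ADWIN: it compares the sliding-window mean with a frozen baseline mean and requires several consecutive breaches before signalling. The class name, the mean-difference statistic, and the parameters are assumptions for illustration.

```python
from collections import deque


class WindowDriftDetector:
    """Flags drift when the current-window mean strays from a frozen reference mean."""

    def __init__(self, window, threshold, patience):
        self.window = window
        self.threshold = threshold    # allowed |mean difference|
        self.patience = patience      # consecutive breaches required
        self.reference_mean = None
        self.current = deque(maxlen=window)  # ring buffer, O(window) memory
        self.current_sum = 0.0
        self.breaches = 0

    def update(self, x):
        """Consume one value; return True when sustained drift is detected."""
        if len(self.current) == self.window:
            self.current_sum -= self.current[0]  # value about to be evicted
        self.current.append(x)
        self.current_sum += x
        if len(self.current) < self.window:
            return False
        mean = self.current_sum / self.window
        if self.reference_mean is None:
            self.reference_mean = mean           # first full window is the baseline
            return False
        breached = abs(mean - self.reference_mean) > self.threshold
        self.breaches = self.breaches + 1 if breached else 0
        if self.breaches >= self.patience:
            self.reference_mean = mean           # reset the reference after signalling
            self.breaches = 0
            return True
        return False


detector = WindowDriftDetector(window=5, threshold=1.0, patience=3)
stream = [0.0] * 20 + [5.0] * 20
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
```

On this stream the level shift begins at index 20 and the single alarm fires at index 23, showing the detection delay that the window size and persistence requirement buy in exchange for fewer false alarms.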

Question 24

A retrained model improves offline holdout accuracy but degrades live business metrics after deployment. Give the staff-level differential diagnosis.

Accepted Answer

Candidates: (1) Feedback loop / leakage — the new model's outputs influenced the labels or features it later trained on, inflating offline scores. (2) Training-serving skew — a feature is computed differently at serve time, so live inputs differ from the holdout. (3) Stale/non-representative holdout — it predates current drift, so offline accuracy measures the wrong distribution. (4) Optimizing a proxy misaligned with the business metric (accuracy up, but errors shifted onto high-value cases). (5) Selection/sampling bias in how the holdout was built (label availability isn't random). (6) Goodhart on the offline metric. Fix: evaluate on a recent, leakage-free, point-in-time-correct slice; validate feature parity; and gate on a live A/B against the champion measuring the actual business KPI, not offline accuracy alone.

Question 25

For an LLM-as-judge used as a production quality monitor, what failure modes make it unreliable, and how do you harden it?

Accepted Answer

Failure modes: position/order bias (favors the first or last response), verbosity and self-preference bias (prefers longer outputs or its own model family), low test-retest consistency, susceptibility to prompt injection from the judged content, and silent drift when the judge model is upgraded. It can also be miscalibrated — confident but wrong. Hardening: pairwise comparison with randomized order (average both orderings), force a rubric + reasoning-before-score, pin and version the judge model, calibrate against a human-labeled gold set and track judge-human agreement (Cohen's kappa) over time, delimit/quarantine the judged text against injection, run at temperature 0, and treat the judge as a noisy signal — ensemble or sample multiple judgments for high-stakes gates rather than trusting a single call.
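
The order-randomization hardening can be sketched as below. The `judge` callable is hypothetical: it stands in for an LLM call that sees a prompt and two candidate answers and returns 'first' or 'second'. This variant accepts a winner only when both orderings agree, which is one conservative way to combine them.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Ask the judge in both orders; accept a winner only if the verdicts agree.

    judge(prompt, first, second) is a hypothetical callable returning
    'first' or 'second'.
    """
    verdict_ab = judge(prompt, answer_a, answer_b)
    verdict_ba = judge(prompt, answer_b, answer_a)
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "a"
    if not a_wins_ab and not a_wins_ba:
        return "b"
    return "tie"  # verdict flipped with position: treat as no signal


# Stand-in judges for illustration only.
def position_biased_judge(prompt, first, second):
    return "first"  # always prefers whatever is shown first


def length_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"
```

A purely position-biased judge yields "tie" on every pair, so its bias is surfaced rather than silently counted as a preference.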

Question 26

Why does quantizing an LLM for serving (INT8/FP8/INT4) often barely move aggregate benchmark scores yet remain risky in production, and how do you validate it?

Accepted Answer

Aggregate benchmarks average over easy cases, so a small per-token accuracy loss is masked; perplexity and broad multiple-choice scores stay near-flat. The risk is in the tails: quantization disproportionately hurts low-frequency tokens, long-context reasoning, and code/math precision, and outlier activations can blow up if not handled (hence SmoothQuant, which shifts activation outliers into the weights; AWQ, which protects salient weight channels; and GPTQ, which compensates for rounding error layer by layer). It can also subtly shift refusal/safety behavior. Validate beyond aggregates: run task-specific eval suites, long-context and code/math benchmarks, a golden-prompt regression set, and a shadow deployment comparing full-precision vs quantized outputs on the live traffic distribution before cutover, measuring tail and per-slice regressions, not just the mean.

Question 27

Production labels arrive weeks late or never. How do you detect concept drift (a change in P(Y|X)) without ground truth, and why is feature-drift monitoring insufficient for this?

Accepted Answer

Feature drift only measures P(X); concept drift is a change in P(Y\mid X), which can shift while P(X) is stable, so marginal input monitoring misses it entirely (and fires false alarms on benign covariate shift). Without labels, proxy it: monitor the prediction distribution and class rates vs. expected priors; track confidence/entropy and softmax-margin collapse; use uncertainty/OOD scores and reconstruction error from an autoencoder; apply importance weighting or a domain classifier (train a model to separate reference from current rows; high AUC means drift). For label shift, BBSE estimates the new P(Y) from a confusion matrix and MLLS does so by maximum likelihood on calibrated outputs. Label-free performance estimators such as ATC or Mandoline can approximate accuracy under shift, but they rest on assumptions that concept drift can violate, so concept drift is only confirmed once delayed labels arrive; unsupervised signals are early-warning, not proof.
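
For the label-shift case, the BBSE estimate can be worked through for a binary classifier. The sketch below assumes pure label shift (P(X\mid Y) fixed) and solves the 2x2 system by hand; the function name and all numbers are illustrative.

```python
def bbse_two_class(joint_confusion, target_pred_rates, source_priors):
    """Estimate target class priors for a binary problem via BBSE.

    joint_confusion[i][j] = P_source(pred = i, label = j)
    target_pred_rates[i]  = P_target(pred = i), measured without labels
    source_priors[j]      = P_source(label = j)
    """
    (c00, c01), (c10, c11) = joint_confusion
    det = c00 * c11 - c01 * c10
    if abs(det) < 1e-12:
        raise ValueError("confusion matrix is singular; classifier is uninformative")
    m0, m1 = target_pred_rates
    # Solve C w = mu for importance weights w[j] = P_target(y=j) / P_source(y=j).
    w0 = (m0 * c11 - c01 * m1) / det
    w1 = (c00 * m1 - c10 * m0) / det
    return [w0 * source_priors[0], w1 * source_priors[1]]


# Source: priors 0.8 / 0.2, recall 0.9 on class 0 and 0.8 on class 1.
joint = [[0.72, 0.04], [0.08, 0.16]]
# Target predictions split 0.55 / 0.45, which a 50/50 label mix would produce.
estimated = bbse_two_class(joint, [0.55, 0.45], [0.8, 0.2])
```

With these inputs the estimate recovers target priors of about 0.5 and 0.5 using only unlabeled target predictions. It says nothing about P(Y\mid X) changing, which is why it does not substitute for delayed labels.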

Question 28

Design a principled retraining trigger that avoids both alarm fatigue from noisy drift tests and silent decay from a fixed schedule. Address multiple-comparisons, persistence, and the cost of acting.

Accepted Answer

Treat triggering as sequential decision-making, not a single threshold. Prefer change-point detectors built for streaming — ADWIN, DDM/EDDM, or Page-Hinkley on a performance/proxy signal — over per-batch p-value tests, since running KS/PSI on hundreds of features every window inflates false positives (control with Bonferroni/BH FDR). Require persistence: trigger only when drift holds across k consecutive windows or exceeds an effect-size (not just significance) band, decoupling statistical from practical significance. Gate on business impact — estimated accuracy loss × cost vs. retraining cost — so trivial drift doesn't fire. Combine a hard performance SLA breach (when labels exist) with unsupervised early-warning, plus a max-staleness fallback retrain. Always shadow/canary the retrained model before promotion to avoid retraining into a worse state.
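
The multiple-comparisons control mentioned above can be sketched with the Benjamini-Hochberg procedure applied to one window's per-feature p-values. The p-values here are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff_rank])


# Hypothetical per-feature drift-test p-values for one monitoring window.
p_values = [0.001, 0.045, 0.035, 0.2, 0.008]
flagged = benjamini_hochberg(p_values, fdr=0.05)
naive = [i for i, p in enumerate(p_values) if p < 0.05]
```

A naive per-feature 0.05 cutoff flags four of the five features, while the corrected procedure flags only features 0 and 4. A persistence rule over consecutive windows would then be applied on top of `flagged`.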

MLOps: Serving, Monitoring, Drift & CI/CD