Question 1

What are the main sources of bias in a machine learning system, and at which stage does each enter?

Accepted Answer

Bias enters at multiple stages. Historical bias is baked into the world the data describes (past hiring favored one group) even with perfect sampling. Representation/sampling bias comes from training data under-covering some groups. Measurement bias arises when labels or features are noisy proxies that differ by group (arrest as a proxy for crime). Aggregation bias fits one model where subgroups need different ones. Evaluation bias uses an unrepresentative benchmark. Deployment/feedback bias emerges when predictions reshape future data. Fixing only the model ignores upstream historical and measurement bias.

Question 2

What is the difference between interpretability and explainability?

Accepted Answer

Interpretability is an intrinsic property: the model is simple enough that a human understands its mechanism directly — linear/logistic regression, shallow trees, GAMs. You read the weights/splits and know why. Explainability is post-hoc: the model (a black box like a deep net or gradient-boosted ensemble) stays opaque, and you bolt on a separate technique (SHAP, LIME, saliency) to approximate or rationalize its behavior. Interpretable models give faithful reasons by construction; post-hoc explanations are approximations that can be unfaithful or misleading. High-stakes settings often favor inherently interpretable models over explained black boxes.

Question 3

Define demographic parity and equalized odds. Why can't you generally satisfy equalized odds and calibration at once?

Accepted Answer

Demographic parity requires equal positive-prediction rate across groups: P(\hat{Y}{=}1\mid A{=}a) constant. Equalized odds requires equal true-positive AND false-positive rates across groups: P(\hat{Y}{=}1\mid Y{=}y,A{=}a) constant for y\in\{0,1\}. The impossibility result (Kleinberg–Mullainathan–Raghavan; Chouldechova) shows that when base rates P(Y{=}1\mid A) differ across groups, calibration and equalized odds (equal FPR and FNR) cannot both hold except in degenerate cases (perfect prediction or equal base rates). So criteria conflict and you must choose which to prioritize from the harm model.
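To make the two definitions concrete, here is a minimal sketch that computes the per-group positive-prediction rate (demographic parity) and the per-group TPR/FPR (equalized odds). The function name `group_rates` and the toy labels are hypothetical, chosen so that demographic parity holds while equalized odds does not.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group positive-prediction rate, TPR and FPR for binary labels.

    Assumes every group contains at least one positive and one negative label.
    """
    y_true, y_pred, group = (np.asarray(v) for v in (y_true, y_pred, group))
    rates = {}
    for a in np.unique(group):
        in_group = group == a
        yt, yp = y_true[in_group], y_pred[in_group]
        rates[str(a)] = {
            "pos_rate": float(yp.mean()),        # P(Yhat=1 | A=a)
            "tpr": float(yp[yt == 1].mean()),    # P(Yhat=1 | Y=1, A=a)
            "fpr": float(yp[yt == 0].mean()),    # P(Yhat=1 | Y=0, A=a)
        }
    return rates

# Hypothetical toy data: two groups of four people each.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, group)
dp_gap = abs(rates["a"]["pos_rate"] - rates["b"]["pos_rate"])
tpr_gap = abs(rates["a"]["tpr"] - rates["b"]["tpr"])
fpr_gap = abs(rates["a"]["fpr"] - rates["b"]["fpr"])
print(dp_gap, tpr_gap, fpr_gap)
```

Both groups receive positive predictions at the same rate, so the demographic-parity gap is zero, yet the TPR and FPR gaps are not: satisfying one criterion says nothing about the other.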

Question 4

Explain how SHAP values are computed and what property makes them theoretically principled.

Accepted Answer

SHAP attributes a prediction to features using Shapley values from cooperative game theory: each feature's contribution is its average marginal effect over all orderings of feature inclusion, \phi_i=\sum_{S\subseteq F\setminus\{i\}}\frac{|S|!(|F|-|S|-1)!}{|F|!}[f(S\cup\{i\})-f(S)]. This is the unique attribution satisfying efficiency (contributions sum to prediction minus baseline), symmetry, dummy (null player), and additivity. Exact computation is exponential, so SHAP uses approximations: KernelSHAP (weighted linear regression on coalitions) or TreeSHAP (polynomial-time exact for trees). 'Missing' features are handled by marginalizing against a background dataset.
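The formula can be checked by brute force on a tiny model. The sketch below enumerates every coalition, so it is exponential and only illustrative. The toy `model`, the input `x`, and the single all-zeros `baseline` are assumptions; using one baseline point is a simplification of marginalizing over a background dataset, and none of this is the `shap` library's API.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions S of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight |S|! (|F|-|S|-1)! / |F|! from the formula above.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for S in combinations(others, size):
                with_i = value_fn(set(S) | {i})
                without_i = value_fn(set(S))
                phi[i] += weight * (with_i - without_i)
    return phi

# Hypothetical toy model with one additive term and one interaction term.
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

def model(z):
    return 2.0 * z[0] + z[1] * z[2]

def value_fn(S):
    # "Missing" features are replaced by the baseline value (a simplification).
    z = [x[j] if j in S else baseline[j] for j in range(len(x))]
    return model(z)

phi = shapley_values(value_fn, len(x))
print(phi, sum(phi), model(x) - model(baseline))
```

The additive term is credited entirely to feature 0, the interaction is split evenly between features 1 and 2 (symmetry), and the attributions sum to the prediction minus the baseline output (efficiency).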

Question 5

What does LIME do differently from SHAP, and what is LIME's key weakness?

Accepted Answer

LIME (Local Interpretable Model-agnostic Explanations) explains one prediction by perturbing the instance, querying the black box on the perturbed samples, weighting them by proximity to the original, and fitting a sparse linear surrogate locally. The surrogate's coefficients are the explanation. Unlike SHAP it has no game-theoretic guarantees and no additivity/efficiency. Its key weakness is instability: explanations depend heavily on the perturbation distribution, kernel width, and random sampling, so the same instance yields different explanations across runs. It also assumes local linearity, which fails near sharp decision boundaries.
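The perturb, weight, and fit loop can be sketched in a few lines of NumPy. This is an illustrative approximation, not the `lime` package: it uses Gaussian perturbations and plain weighted least squares rather than a sparse surrogate over an interpretable representation, and `lime_sketch`, `black_box`, and all parameter values are hypothetical.

```python
import numpy as np

def lime_sketch(predict_fn, x, n_samples=2000, scale=0.1, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return its slopes."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(0.0, scale, size=(n_samples, x.size))  # perturb the instance
    y = predict_fn(X)                                         # query the black box
    dist = np.linalg.norm(X - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)        # proximity kernel
    A = np.hstack([np.ones((n_samples, 1)), X - x])           # intercept + features
    sw = np.sqrt(weights)
    # Weighted least squares via row scaling by sqrt(weight).
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]

def black_box(X):
    # Hypothetical nonlinear model standing in for the opaque classifier.
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([0.0, 1.0])
coef = lime_sketch(black_box, x0)
coef_other_seed = lime_sketch(black_box, x0, seed=1)
print(coef, coef_other_seed)
```

The slopes land near the true local gradient (1, 2) here because the function is smooth around `x0`. Changing only the seed already changes the coefficients slightly; changing `scale` or `kernel_width` changes what "local" means, which is the instability described above.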

Question 6

Define $(\varepsilon,\delta)$-differential privacy and explain the role of each parameter.

Accepted Answer

A randomized mechanism M is (\varepsilon,\delta)-differentially private if for all neighboring datasets D,D' (differing in one record) and all output sets S: P(M(D)\in S)\le e^{\varepsilon}P(M(D')\in S)+\delta. \varepsilon is the privacy budget: smaller means stronger privacy (outputs nearly indistinguishable whether or not any one person is in the data); it bounds the multiplicative leakage. \delta is a small additive failure probability allowing the e^\varepsilon bound to be violated rarely; it should be cryptographically small, well below 1/n. Pure DP is \delta{=}0. Privacy degrades under composition as you run more queries.
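As a minimal sketch of the pure-DP case (\delta{=}0), the Laplace mechanism on a counting query shows how \varepsilon controls the noise. Adding or removing one record changes a count by at most 1, so the sensitivity is 1 and the noise scale is 1/\varepsilon. The function name `laplace_count` and the toy data are hypothetical.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    true_count = sum(1 for record in data if predicate(record))
    sensitivity = 1.0                      # one record changes the count by at most 1
    scale = sensitivity / epsilon          # smaller epsilon -> more noise
    return true_count + rng.laplace(loc=0.0, scale=scale)

# Hypothetical dataset: ages of five people; query "how many are 18 or older?"
ages = [15, 22, 37, 41, 12]

def is_adult(age):
    return age >= 18

rng = np.random.default_rng(0)
strong_privacy = laplace_count(ages, is_adult, epsilon=0.1, rng=rng)
weak_privacy = laplace_count(ages, is_adult, epsilon=10.0, rng=rng)
print(strong_privacy, weak_privacy)
```

Under basic sequential composition, answering k such queries at \varepsilon each costs k\varepsilon in total, which is why the budget has to be tracked across queries.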

Question 7

Distinguish evasion, poisoning, and backdoor attacks, and which part of the ML lifecycle each targets.

Accepted Answer

Evasion (adversarial examples) targets inference: a crafted, often imperceptible perturbation x{+}\delta makes a fixed trained model misclassify; the training set is untouched. Poisoning targets training: the attacker injects or corrupts training samples to degrade accuracy or shift a decision boundary — an availability or integrity attack on the learned model. A backdoor (trojan) is targeted poisoning where the model behaves normally except when a specific trigger pattern is present, then outputs the attacker's chosen label; it's hard to detect because clean accuracy is unaffected. Defenses differ: robust training/detection for poisoning vs adversarial training/certified defenses for evasion.

Question 8

Under the EU AI Act, what are the risk tiers and what obligations attach to a high-risk system?

Accepted Answer

The EU AI Act (entered into force Aug 2024, phased application through 2026–2027 — prohibitions from Feb 2025, GPAI rules from Aug 2025, most high-risk from Aug 2026) uses a risk-based pyramid: (1) Unacceptable — banned (social scoring, manipulative/subliminal techniques, most real-time remote biometric ID in public). (2) High-risk — permitted with strict obligations (biometrics, critical infrastructure, employment, credit, education). (3) Limited — transparency duties (disclose AI interaction, label deepfakes/generated content). (4) Minimal — largely unregulated. High-risk obligations: risk-management system, data governance, technical documentation, logging/traceability, human oversight, accuracy/robustness/cybersecurity, conformity assessment, EU-database registration. General-purpose models have their own transparency and systemic-risk tier.

Question 9

Why is attention not a reliable explanation in transformer models?

Accepted Answer

Attention weights show where a layer reads from, but high attention does not equal high causal importance for the output. Jain & Wallace and Serrano & Smith showed attention is often not faithful: alternative attention distributions can produce the same prediction (attention is not unique), and zeroing the highest-attention weights frequently changes the output less than zeroing the weights that gradient-based importance ranks highest. Attention is one of many components (residual stream, MLPs, value vectors, later layers) determining the output, and it's diffuse across heads/layers. Treat it as a weak heuristic, not a faithful attribution; prefer gradient/ablation/causal methods for importance claims.

Question 10

How does DP-SGD make neural network training differentially private, and what is the accuracy cost?

Accepted Answer

DP-SGD modifies SGD two ways per step: (1) per-example gradient clipping to an L_2 norm C, bounding any single example's influence (sensitivity); (2) adding Gaussian noise \mathcal{N}(0,\sigma^2C^2) to the summed clipped gradients before the update. A privacy accountant (moments accountant / RDP) tracks cumulative \varepsilon using subsampling amplification from minibatching. Cost: clipping biases gradient estimates and noise raises variance, hurting convergence — accuracy drops most on underrepresented classes/tails (Bagdasaryan et al.). Larger batches, more data, and careful C tuning mitigate but don't eliminate the privacy–utility tradeoff.
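The two modifications can be sketched as a single update on precomputed per-example gradients. This is a NumPy illustration under stated assumptions, not a library API: `dp_sgd_step` and its arguments are hypothetical, `noise_multiplier` plays the role of \sigma, and the privacy accountant that turns the noise level into a cumulative \varepsilon is omitted.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm (C).
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    # Gaussian noise with standard deviation sigma * C, added to the sum.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean

# Hypothetical batch: one large gradient (norm 5) and one small (norm 0.5).
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
params = np.zeros(2)
rng = np.random.default_rng(0)

# noise_multiplier=0 isolates the effect of clipping (this gives no privacy).
clipped_only = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                           noise_multiplier=0.0, rng=rng)
private = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=1.0, rng=rng)
print(clipped_only, private)
```

The large gradient is rescaled to norm 1 while the small one passes through unchanged, which is the bias that clipping introduces; the noise term then adds variance on top.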

Question 11

What privacy guarantee does federated learning provide on its own, and why is it insufficient?

Accepted Answer

Federated learning keeps raw data on-device and shares only model updates (gradients/weights) with a server that aggregates them. By itself it provides data minimization, not a formal privacy guarantee: gradients leak information. Gradient-inversion attacks (Deep Leakage from Gradients) reconstruct training images/text from a single client's update; membership inference works on shared models. So FL must be combined with secure aggregation (server sees only the sum, not individual updates) plus differential privacy (clipped, noised updates for a formal \varepsilon bound). FL alone is a system-architecture choice, not a privacy mechanism.
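The idea that the server sees only the sum can be illustrated with pairwise cancelling masks. This is a toy sketch: a single shared random generator stands in for the pairwise secrets that a real protocol would derive through key agreement, client dropouts are ignored, and `masked_updates` is a hypothetical name.

```python
import numpy as np

def masked_updates(updates, seed=0):
    """Add pairwise masks that cancel in the sum: client i adds, client j subtracts."""
    rng = np.random.default_rng(seed)   # stand-in for pairwise shared secrets
    masked = [u.astype(float).copy() for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.normal(size=updates[0].shape)
            masked[i] += mask
            masked[j] -= mask
    return masked

# Hypothetical client updates.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = masked_updates(updates)

server_sum = sum(masked)      # what the server can compute
true_sum = sum(updates)
print(masked[0], server_sum, true_sum)
```

Each masked update looks like noise on its own, but the masks cancel in the aggregate. The aggregate itself can still leak, which is why the answer above pairs secure aggregation with differential privacy.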

Question 12

Derive the FGSM adversarial perturbation and explain why adversarial examples exist even for accurate models.

Accepted Answer

Fast Gradient Sign Method maximizes the loss under an L_\infty budget \epsilon: linearize the loss as L(x{+}\delta)\approx L(x)+\nabla_x L^\top\delta; maximizing subject to \|\delta\|_\infty\le\epsilon gives \delta=\epsilon\,\mathrm{sign}(\nabla_x L(x,y)). Goodfellow et al. argue adversarial examples arise from the model's excessive linearity in high dimensions: a tiny per-pixel change accumulates over n inputs as a weight-aligned dot product growing \sim n\epsilon, flipping the logit even though each change is imperceptible. So vulnerability is a property of linear behavior in high dimensions, not overfitting, which is why more data alone doesn't fix it and adversarial training is needed.
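The linearity argument can be checked on a logistic-regression model, where the input gradient has a closed form: dL/dx = (p - y)w. The sketch below is illustrative; the weights, input, and \epsilon are hypothetical, the bias is chosen so the clean logit is exactly 2, and the adversarial input is not clipped back to a valid pixel range.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(w, b, x, y, eps):
    """x + eps * sign(grad_x L) for the logistic (cross-entropy) loss."""
    grad_x = (sigmoid(w @ x + b) - y) * w   # dL/dx = (p - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
n = 1000
w = rng.normal(0.0, 0.1, size=n)    # hypothetical trained weights
x = rng.uniform(0.0, 1.0, size=n)   # hypothetical input with "pixels" in [0, 1]
b = 2.0 - w @ x                     # bias chosen so the clean logit is 2
y = 1.0                             # true label; the model predicts it correctly
eps = 0.05

x_adv = fgsm(w, b, x, y, eps)
logit_clean = w @ x + b
logit_adv = w @ x_adv + b
print(logit_clean, logit_adv, eps * np.abs(w).sum())
```

Each coordinate moves by only 0.05, but the logit shifts by \epsilon\|w\|_1, which grows with the input dimension and is enough here to flip the sign of the logit.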

Question 13

How does a model-extraction (model-stealing) attack work, and what defenses actually help?

Accepted Answer

An adversary with only query access (a prediction API) reconstructs a functionally equivalent copy: send many inputs, collect outputs (labels, probabilities, or logits), and train a surrogate to mimic them; richer outputs (full softmax/confidences) make extraction cheaper. This enables IP theft and serves as a stepping stone to transfer-based evasion, because adversarial examples crafted on the surrogate often transfer to the original. Defenses: rate-limiting and anomaly detection on query patterns, returning top-1 labels instead of full probability vectors, output perturbation/rounding, watermarking the model so a stolen copy is provable, and prediction poisoning that subtly distorts outputs to corrupt the surrogate. No single defense is sufficient; combine detection with output minimization.
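Why full probabilities make extraction cheap can be seen in the simplest case: a logistic-regression victim that returns exact confidences. Inverting the sigmoid turns each query into a linear equation in the secret parameters, so a handful of queries recovers them. The victim model, `victim_api`, and the query budget below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
secret_w = rng.normal(size=5)   # parameters the attacker never sees directly
secret_b = 0.3

def victim_api(X):
    """Hypothetical prediction API that returns full-precision probabilities."""
    return 1.0 / (1.0 + np.exp(-(X @ secret_w + secret_b)))

# Attacker: query random inputs and invert the sigmoid to get logits.
queries = rng.normal(size=(50, 5))
probs = victim_api(queries)
logits = np.log(probs / (1.0 - probs))

# Each logit is linear in (w, b), so least squares recovers the parameters.
A = np.hstack([queries, np.ones((len(queries), 1))])
stolen, *_ = np.linalg.lstsq(A, logits, rcond=None)
stolen_w, stolen_b = stolen[:-1], stolen[-1]
print(np.max(np.abs(stolen_w - secret_w)), abs(stolen_b - secret_b))
```

With top-1 labels or coarsely rounded probabilities the logits can no longer be inverted exactly, so the attacker needs many more queries and a learned surrogate; this is the intuition behind output minimization.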

Question 14

A user invokes GDPR's 'right to erasure' against your deployed ML model. What does compliance actually require, and what's the hard part?

Accepted Answer

Article 17 erasure means deleting the individual's personal data, including from training pipelines. The easy part is purging raw records and backups. The hard part is the trained model: weights can memorize and leak training data (membership-inference and extraction attacks demonstrate this), so a model trained on the deleted record may still 'contain' it. Strict compliance can require retraining without that record — expensive — or machine-unlearning (SISA sharded training that retrains only affected shards, or influence-function-based approximate unlearning). Regulators increasingly treat the model as potential personal data. DP training also limits per-record influence, easing future erasure claims.
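The sharding idea behind SISA can be sketched with a deliberately trivial "model". In this toy, each shard's model is just the mean of its records and the ensemble averages the shard models; real SISA trains a proper model per shard and also slices and checkpoints within shards. The class name `SisaMean` and the data are hypothetical.

```python
import numpy as np

class SisaMean:
    """Toy SISA-style ensemble: one 'model' (a mean) per shard of the data."""

    def __init__(self, records, n_shards):
        # Round-robin assignment of records to shards.
        self.shards = [list(records[i::n_shards]) for i in range(n_shards)]
        self.models = [self._train(shard) for shard in self.shards]
        self.retrain_count = 0

    @staticmethod
    def _train(shard):
        return float(np.mean(shard))

    def unlearn(self, record):
        """Delete a record and retrain only the shard that contained it."""
        for k, shard in enumerate(self.shards):
            if record in shard:
                shard.remove(record)
                self.models[k] = self._train(shard)
                self.retrain_count += 1
                return
        raise ValueError("record not found in any shard")

    def predict(self):
        return float(np.mean(self.models))

ensemble = SisaMean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_shards=3)
before = ensemble.predict()
ensemble.unlearn(5.0)
after = ensemble.predict()
print(before, after, ensemble.retrain_count)
```

Only one of three shard models is retrained, and the result matches what training from scratch on the same shards without the deleted record would produce. The cost saving scales with the number of shards, at the price of each shard model seeing less data.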

Question 15

What is the 'lethal trifecta' for LLM agents, and why does removing any one leg neutralize it?

Accepted Answer

Simon Willison's lethal trifecta is the simultaneous combination of: (1) access to private/sensitive data, (2) exposure to untrusted content (web pages, emails, documents an attacker can write), and (3) the ability to exfiltrate — externally communicate (make requests, send data, render attacker-controlled URLs). When all three coexist, a prompt injection hidden in the untrusted content can instruct the agent to read the private data and send it out — a complete data-theft chain. Removing any one breaks it: no private data, nothing worth stealing; no untrusted input, no injection vector; no outbound channel (allow-list egress, no auto-fetch), nothing leaves. The deterministic fix is architectural — cut a leg — not 'better prompting'.
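The "cut a leg" rule can be expressed as a simple configuration check run before an agent is deployed. This is a sketch under assumptions: the `AgentCapabilities` fields and the function name are hypothetical, and deciding whether a real tool counts as an exfiltration channel takes more than a boolean.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool      # leg 1: access to private/sensitive data
    sees_untrusted_content: bool  # leg 2: exposure to attacker-writable content
    can_exfiltrate: bool          # leg 3: any outbound communication channel

def has_lethal_trifecta(caps):
    """True only when all three legs are present at the same time."""
    return (caps.reads_private_data
            and caps.sees_untrusted_content
            and caps.can_exfiltrate)

# Hypothetical email assistant: reads the inbox, processes incoming mail,
# and can fetch arbitrary URLs.
email_agent = AgentCapabilities(True, True, True)

# Cutting one leg, e.g. restricting egress to an allow-list with no auto-fetch.
no_egress = replace(email_agent, can_exfiltrate=False)

print(has_lethal_trifecta(email_agent), has_lethal_trifecta(no_egress))
```

The check is deliberately deterministic: it depends on what the agent is wired to do, not on how the model responds to a given prompt.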

Question 16

Your team wants to use post-hoc SHAP explanations to satisfy a regulator that a credit model is fair and non-discriminatory. As principal engineer, why push back, and what do you propose instead?

Accepted Answer

Push back because post-hoc explanations can be unfaithful and adversarially manipulated: Slack et al. (2020) constructed a biased model plus scaffolding that fools LIME/SHAP into producing innocuous, fairness-clean explanations — so SHAP is not evidence of fairness, only a story about it. Fairness is a property of outcomes across groups, not of an explanation. Propose: (1) measure outcome fairness directly (equalized odds, FPR/FNR gaps, calibration by group) on held-out and live data; (2) prefer an inherently interpretable model (monotonic GAM / scorecard) for high-stakes credit so reasons are faithful by construction; (3) document data lineage, governance, and a bias-monitoring loop. Explanations support transparency duties; they don't certify fairness.

Question 17

Why does differential privacy disproportionately harm minority subgroups, and what does this imply for combining privacy and fairness goals?

Accepted Answer

DP bounds any single record's influence via gradient clipping and noise. Majority-group patterns are reinforced by many records, so the signal survives the noise; minority/tail patterns rest on few records, so clipping and noise wash out their gradient signal — Bagdasaryan et al. showed accuracy degradation under DP-SGD is far larger for underrepresented classes. This creates a direct tension: the mechanism protecting individuals erodes the model's ability to learn small-group structure, worsening fairness gaps. Implication: privacy and fairness can be antagonistic, not free to stack. Mitigations include group-aware clipping budgets, more data for tails, larger \varepsilon where justified, or per-group accounting — but you must measure subgroup utility, not just aggregate accuracy, when adding DP.

Responsible AI: Fairness, Explainability, Privacy & Security