Question 1

What does the GCP Professional Machine Learning Engineer (PMLE) certification validate, and what are the broad domains the exam covers?

Accepted Answer

PMLE validates that you can design, build, productionize, optimize, and monitor ML and generative-AI solutions on Google Cloud, spanning low-code and custom approaches. It is a ~50-60 question, 2-hour, 200 USD exam (multiple-choice/multi-select), valid 2 years, with roughly 3+ years industry ML experience and 1+ year on Google Cloud recommended. The six weighted domains: architecting low-code AI (BigQuery ML, AutoML, ML APIs, Model Garden), collaborating to manage data and models (Workbench/Colab Enterprise, ML metadata, model cards), scaling prototypes into models (Vertex AI training, distributed strategies, hardware), serving and scaling models (endpoints, batch, Feature Store), automating and orchestrating pipelines (Vertex AI Pipelines, CI/CD), and monitoring (skew/drift, Explainable AI, responsible AI). The current blueprint folds in generative AI, Model Garden, and Agent Builder.

Question 2

On Vertex AI, what is the difference between an online prediction Endpoint and a Batch Prediction job, and when do you choose each?

Accepted Answer

An online Endpoint hosts a deployed model behind a low-latency REST/gRPC service with autoscaling (min/max replicas, target utilization or request rate); use it for real-time, per-request inference where latency matters. A Batch Prediction job is asynchronous, reads inputs from Cloud Storage or BigQuery, scores them in bulk, and writes results back—no persistent endpoint, no per-request latency guarantee. Choose batch for large offline scoring (nightly scores, backfills) where throughput and cost-efficiency beat latency; choose online when an application needs synchronous responses. Endpoints also support traffic splitting across model versions for A/B testing and canary rollouts, which batch jobs do not.

Question 3

Scenario: An analytics team is fluent in SQL but has no ML engineering staff. They want to forecast weekly sales directly from a 2 TB table already in BigQuery, with minimal data movement. What is the most appropriate Google Cloud approach?

Accepted Answer

Use BigQuery ML to train a time-series model in place with CREATE MODEL ... OPTIONS(model_type='ARIMA_PLUS'), then forecast with ML.FORECAST. BQML lets SQL-literate users build and serve models without exporting data, avoiding the cost and governance overhead of moving 2 TB. It keeps training and prediction inside BigQuery's engine and integrates evaluation via ML.EVALUATE. Vertex AI custom training or AutoML would require data export or managed datasets and ML tooling the team lacks; pre-trained ML APIs do not fit a custom forecasting task. ARIMA_PLUS handles seasonality, holidays, and anomalies automatically, making it the lowest-friction, lowest-cost fit for this constraint.
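As a rough illustration of the two statements named above, the sketch below only assembles the SQL text in Python. The model, table, and column names (sales.weekly_forecast, sales.weekly, week_start, revenue) are hypothetical, and submitting the strings to BigQuery is left to whatever client the team already uses.

```python
def arima_plus_sql(model: str, table: str, ts_col: str, value_col: str,
                   horizon: int = 12) -> tuple[str, str]:
    """Return (train_sql, forecast_sql) for a BigQuery ML ARIMA_PLUS model.

    Every identifier passed in is a placeholder chosen by the caller.
    """
    train_sql = (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        "OPTIONS(\n"
        "  model_type = 'ARIMA_PLUS',\n"
        f"  time_series_timestamp_col = '{ts_col}',\n"
        f"  time_series_data_col = '{value_col}'\n"
        ") AS\n"
        f"SELECT {ts_col}, {value_col} FROM `{table}`"
    )
    forecast_sql = (
        "SELECT * FROM ML.FORECAST(\n"
        f"  MODEL `{model}`,\n"
        f"  STRUCT({horizon} AS horizon, 0.9 AS confidence_level))"
    )
    return train_sql, forecast_sql


# Hypothetical names for the 2 TB weekly sales table and the model to create.
train_sql, forecast_sql = arima_plus_sql(
    "sales.weekly_forecast", "sales.weekly", "week_start", "revenue")
print(train_sql)
print(forecast_sql)
```

Both statements run inside BigQuery, so nothing in this flow exports the table.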

Question 4

Pseudo-code a Vertex AI Pipelines (KFP v2) component that trains a model and outputs an artifact, and explain why pipelines aid MLOps.

Accepted Answer

In KFP v2 you decorate a Python function as a component and type its outputs as artifacts:

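from kfp.dsl import component, Dataset, Input, Model, Output

@component(base_image='...')
def train(data: Input[Dataset], model: Output[Model]):
    # fit() and save() stand in for your own training and serialization code
    m = fit(data.path)
    save(m, model.path)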

A @dsl.pipeline wires components, passing artifacts so the orchestrator tracks lineage. Pipelines aid MLOps by making each step containerized, parameterized, cached, and reproducible; Vertex ML Metadata auto-records artifact lineage (which data/version produced which model), enabling retraining triggers, run comparison, and CI/CD via Cloud Build. They turn an ad-hoc notebook into a versioned, schedulable, auditable workflow.
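To make the "cached and reproducible" point concrete, here is a toy sketch in plain Python of step caching keyed on a step's name, parameters, and input digests. It is an analogy, not how Vertex AI Pipelines or KFP implement caching internally; cache_key, run_step, and fake_train are hypothetical names.

```python
import hashlib
import json

_cache: dict[str, object] = {}


def cache_key(step: str, params: dict, input_digests: list[str]) -> str:
    """Hash everything that determines a step's output."""
    payload = json.dumps(
        {"step": step, "params": params, "inputs": input_digests},
        sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step, fn, params, input_digests):
    """Run fn(**params) unless an identical invocation is cached.

    Returns (result, cache_hit).
    """
    key = cache_key(step, params, input_digests)
    if key in _cache:
        return _cache[key], True
    result = fn(**params)
    _cache[key] = result
    return result, False


calls = []


def fake_train(learning_rate):
    """Hypothetical training step; records each real execution."""
    calls.append(learning_rate)
    return f"model-lr-{learning_rate}"


first = run_step("train", fake_train, {"learning_rate": 0.1}, ["sha256:abc"])
second = run_step("train", fake_train, {"learning_rate": 0.1}, ["sha256:abc"])
third = run_step("train", fake_train, {"learning_rate": 0.1}, ["sha256:def"])
```

The second call is served from the cache because nothing that feeds the step changed; the third reruns because the input data digest differs.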

Question 5

On Vertex AI, how do you safely roll out a new model version to a production endpoint, and how does traffic splitting work?

Accepted Answer

Deploy the new model version to the same Endpoint as an additional DeployedModel, then set the endpoint's traffic split (a map of deployed-model IDs to integer percentages summing to 100). Start with a canary—e.g., 5% to the new version, 95% to the current—monitor latency, errors, and prediction/skew metrics via Model Monitoring, then progressively shift to 50/50 and finally 100% if healthy, rolling back instantly by resetting the split if metrics regress. Because both versions share the endpoint, clients hit one stable URL while traffic is divided server-side, enabling A/B testing and zero-downtime canary releases. Keep the old version deployed until the new one is proven, then undeploy it to free resources.
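The staged rollout described above can be sketched as plain data manipulation. This does not call the Vertex AI SDK: the deployed-model IDs are hypothetical, and applying each split to the endpoint is left to your deployment tooling.

```python
def validate_split(split: dict[str, int]) -> None:
    """Raise ValueError unless values are integers in 0..100 summing to 100."""
    if any(not isinstance(p, int) or p < 0 or p > 100 for p in split.values()):
        raise ValueError("each percentage must be an integer in 0..100")
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")


def canary_plan(current_id: str, new_id: str,
                steps=(5, 50, 100)) -> list[dict[str, int]]:
    """Build the sequence of traffic splits for a staged rollout."""
    plan = []
    for pct in steps:
        split = {new_id: pct, current_id: 100 - pct}
        validate_split(split)
        plan.append(split)
    return plan


def rollback(current_id: str, new_id: str) -> dict[str, int]:
    """Send all traffic back to the current version."""
    return {current_id: 100, new_id: 0}


# Hypothetical deployed-model IDs on one endpoint.
plan = canary_plan("deployed-model-v1", "deployed-model-v2")
for split in plan:
    print(split)
```

Advance to the next split only after the monitoring checks for the current stage pass; rollback() returns the split to apply if they do not.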

Question 6

AutoML vs. Vertex AI custom training: what are the real engineering tradeoffs, and what data-size/control conditions push you toward one?

Accepted Answer

AutoML (Tabular/Image/Text/Video) automates architecture search, feature engineering, and hyperparameter tuning—fast time-to-value, strong baselines, minimal code—but it is a managed black box with limited architectural control, can be costly in node-hours, and has minimum data requirements (AutoML Tabular needs roughly 1,000+ rows). Custom training gives full control over framework, architecture, loss, and distributed strategy, and is necessary for novel architectures, custom objectives, special hardware (TPU), or tight cost/latency tuning—at the price of more engineering effort. Choose AutoML when you want a strong model quickly without ML staff and the problem is standard; choose custom training when you need control, have ML engineers, or AutoML can't express your model.

Question 7

Explain how Vertex AI Feature Store prevents training-serving skew and what the offline vs. online stores are for.

Accepted Answer

Vertex AI Feature Store is a centralized repository so the exact same feature values and computation logic are used at training and serving time, eliminating skew that arises when features are recomputed differently in two code paths. The offline store holds historical feature values for building training datasets using point-in-time correct lookups—retrieving each feature value as of the label's timestamp—to avoid label leakage. The online store serves the latest feature values at low latency for real-time prediction. By ingesting features once and reading from both, a serving request sees the same feature definitions used during training, and you gain feature reuse, versioning, and monitoring across teams.
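A point-in-time correct lookup can be sketched in a few lines of plain Python. This is a conceptual illustration, not the Feature Store API; the feature history and its integer day stamps are made up.

```python
from bisect import bisect_right


def point_in_time_value(history, label_ts):
    """Return the latest feature value with timestamp <= label_ts, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature: a customer's rolling transaction count by day number.
txn_count_history = [(1, 3), (5, 7), (9, 20)]

# A label observed on day 6 must see the day-5 value, not the later day-9 one.
print(point_in_time_value(txn_count_history, 6))  # 7
print(point_in_time_value(txn_count_history, 0))  # None (no value existed yet)
```

Joining on the latest value instead (20) would leak information from after the label was observed into the training row.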

Question 8

When training a large neural net on Vertex AI, when would you choose TPUs over GPUs, and what model characteristics make a TPU a poor fit?

Accepted Answer

TPUs excel at large, dense matrix-multiply workloads with high arithmetic intensity—large batch sizes, big transformer/CNN models, and frameworks (JAX, TensorFlow, PyTorch/XLA) that compile to XLA. They deliver superior throughput-per-dollar for regular, static-shape computation, and TPU pods scale to thousands of chips for very large training. TPUs are a poor fit for models with dynamic shapes, heavy custom CUDA ops, lots of control flow or sparse/irregular operations, small models, or workloads that can't fill the large batch the systolic array needs; they also require XLA-compatible code. GPUs (e.g., A100/H100) are more flexible for custom ops, dynamic graphs, and smaller or rapidly iterating models.

Question 9

Differentiate Vertex AI Model Monitoring's three signals: training-serving skew, prediction drift, and feature-attribution drift. What does each require?

Accepted Answer

Training-serving skew compares live serving feature distributions against the training data distribution (so it needs the training dataset as a baseline) and flags when production inputs diverge from what the model learned. Prediction drift compares the current serving feature (or prediction) distribution against an earlier serving window, with no training baseline required, and catches gradual post-deployment shifts. Feature-attribution drift tracks how feature importance/contribution changes over time and requires Explainable AI to be configured so attributions exist. Skew and drift are scored per feature with a statistical distance, Jensen-Shannon divergence for numerical features and L-infinity distance for categorical features, and an alert fires when the distance exceeds a threshold. Use skew to validate the deployment matches training; use drift to detect post-deployment distribution change.

Question 10

Compare MirroredStrategy and MultiWorkerMirroredStrategy in TensorFlow distributed training, and how you'd request the right hardware on Vertex AI.

Accepted Answer

MirroredStrategy does synchronous data-parallel training across multiple GPUs on a single machine: each replica holds a model copy, gradients are all-reduced across local devices, and weights stay in sync. MultiWorkerMirroredStrategy extends the synchronous all-reduce across multiple machines (workers) using a collective implementation (e.g., NCCL ring) over the network, configured via the TF_CONFIG environment variable. On Vertex AI custom training you set worker pool specs: for MirroredStrategy, a single replica whose machine spec has acceleratorCount greater than 1; for MultiWorkerMirroredStrategy, multiple replicas (a chief plus workers, each with accelerators). Multi-worker adds network overhead and fault-tolerance concerns, so prefer single-node multi-GPU until the model/batch outgrows one machine.
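For reference, the TF_CONFIG value that MultiWorkerMirroredStrategy reads is a JSON document describing the cluster and the current task. The sketch below builds one by hand; the host names and ports are hypothetical, and as far as I know Vertex AI custom training sets this variable on each replica for you, so treat this as a way to read the format rather than something you need to write.

```python
import json


def tf_config(chief: str, workers: list[str],
              task_type: str, task_index: int) -> str:
    """Serialize a TF_CONFIG value for one task in a chief-plus-workers cluster."""
    cluster = {"chief": [chief], "worker": workers}
    if task_type not in cluster:
        raise ValueError("task_type must be 'chief' or 'worker'")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index out of range for task_type")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical addresses for one chief and two workers.
cfg = tf_config("host-0:2222", ["host-1:2222", "host-2:2222"], "worker", 1)
print(cfg)
```

Every task gets the same "cluster" block and a different "task" block, which is how each process knows its role in the all-reduce.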

Question 11

Scenario: A fraud model must explain individual predictions to regulators, the model is a deep neural net, and inputs are tabular. Which Vertex Explainable AI method fits, and why not the alternatives?

Accepted Answer

Use Integrated Gradients: it is designed for differentiable models (neural nets), attributing the prediction to each input feature by integrating gradients along a path from a baseline to the input, and satisfies axioms (completeness/sensitivity) that make per-prediction explanations defensible. Sampled Shapley is model-agnostic and works for non-differentiable models (e.g., tree ensembles) but is computationally expensive and only an approximation via sampling. XRAI is region-based and meant for image models, not tabular features. Configure the method and the baseline in ExplanationMetadata/ExplanationParameters at deploy time. Integrated Gradients gives gradient-based, low-variance attributions for a differentiable tabular network—the right and most efficient match here.
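To show what "integrating gradients along a path from a baseline to the input" means, here is a minimal sketch on a tiny logistic model in plain Python. The weights, input, and all-zeros baseline are made up, and this is a numerical illustration of the method, not the Vertex Explainable AI implementation.

```python
import math

WEIGHTS = [1.5, -2.0, 0.5]  # hypothetical model weights
BIAS = -0.25


def predict(x):
    """Logistic model: sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of predict() with respect to each input feature."""
    p = predict(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann approximation of the path integral of gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x = [2.0, 0.5, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness: attributions sum to f(x) - f(baseline), up to numerical error.
gap = abs(sum(attributions) - (predict(x) - predict(baseline)))
print(attributions, gap)
```

The completeness check is what makes the per-feature numbers defensible: they account for the whole change in the score relative to the chosen baseline, so the baseline itself should be justified to the regulator.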

Question 12

For ML preprocessing on GCP, when do you reach for Dataflow versus Dataproc, and why does Dataflow's model matter for training-serving consistency?

Accepted Answer

Dataflow is the managed Apache Beam service: a unified batch+streaming model with autoscaling and no cluster management—ideal for ML preprocessing because the same Beam pipeline (e.g., via tf.Transform) can compute features consistently in batch for training and in streaming for serving, directly reducing training-serving skew. Dataproc is managed Hadoop/Spark, the right choice when you have existing Spark/Hadoop jobs, MLlib code, or want fine cluster control and lift-and-shift of an on-prem ecosystem. Choose Dataflow for new, serverless, skew-safe feature pipelines and unified batch/stream; choose Dataproc to run existing Spark workloads or when you need ephemeral clusters with specific OSS tooling. tf.Transform bakes the analyze-then-transform graph into the serving signature.
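The analyze-then-transform idea can be shown without any framework. The sketch below is a plain-Python analogue, not the tf.Transform API: analyze() makes a full pass over training data to compute statistics, and transform() applies those frozen statistics to any later value, including a serving request.

```python
from statistics import mean, pstdev


def analyze(training_values):
    """Full pass over training data: compute the statistics the transform needs."""
    return {"mean": mean(training_values), "std": pstdev(training_values)}


def transform(value, stats):
    """Apply z-score scaling using the frozen training statistics."""
    return (value - stats["mean"]) / stats["std"]


# Hypothetical training column.
stats = analyze([10.0, 20.0, 30.0])

# The same function and the same stats are used for training rows and for
# a single serving request, so the two code paths cannot diverge.
train_features = [transform(v, stats) for v in [10.0, 20.0, 30.0]]
serving_feature = transform(25.0, stats)
print(stats, train_features, serving_feature)
```

Skew appears when serving recomputes the mean and standard deviation from its own traffic, or reimplements the scaling by hand; shipping the frozen statistics with the model removes that possibility.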

Question 13

On Vertex AI, you want to adapt Gemini to a domain. Compare prompt engineering, RAG, and supervised fine-tuning—what failure does each address?

Accepted Answer

Prompt engineering (few-shot, instructions, system prompts) shapes behavior with no training cost and is the first lever; it fails when the model lacks the knowledge or the task needs consistent structured output beyond what context can steer. RAG retrieves relevant documents (e.g., via Vertex AI Search or a vector store) and injects them into the prompt—the right fix when the gap is missing or freshly changing factual knowledge, since it grounds answers and reduces hallucination without retraining weights. Supervised fine-tuning (parameter-efficient tuning on Vertex AI) updates the model on labeled examples—use it for a durable style, format, or skill the base model can't reliably follow even with good prompts, but it won't reliably inject new facts. Often you combine: fine-tune for behavior, RAG for knowledge.
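The retrieve-then-inject step of RAG can be sketched with a toy bag-of-words retriever. This is a stand-in for a real vector store or Vertex AI Search, which it does not call; the documents, prompt wording, and helper names are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]


def build_prompt(query, docs, k=1):
    """Inject retrieved context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")


docs = [
    "refund policy allows returns within 30 days",
    "shipping takes five business days",
]
prompt = build_prompt("what is the refund policy", docs)
print(prompt)
```

Updating the answer when the policy changes means editing the document, not retraining the model, which is the failure RAG addresses and fine-tuning does not.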

Question 14

How do you evaluate a generative AI summarization solution on Vertex AI, and why are classification metrics like accuracy/F1 inappropriate?

Accepted Answer

Generative outputs are open-ended text with many valid phrasings, so exact-match accuracy/F1 (which assume a single correct discrete label) wrongly penalize correct paraphrases. Use reference-based overlap metrics—ROUGE for summarization (n-gram and longest-common-subsequence recall against references) and BLEU for translation-like tasks—plus embedding-based similarity (e.g., BERTScore) for semantic match. Critically, pair automated metrics with human preference evaluation or an LLM-as-a-judge rubric scoring relevance, faithfulness/groundedness (no hallucination), coherence, and safety, since ROUGE/BLEU correlate weakly with quality. Vertex AI's GenAI evaluation service supports both computed and model-based (judge) metrics; track groundedness against source documents for RAG. Define task-specific success criteria, not a single scalar.
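A small worked example makes the contrast with exact match concrete. The function below is a simplified ROUGE-1 (unigram overlap with clipped counts and whitespace tokenization, no stemming), written for illustration rather than as a replacement for a maintained ROUGE library or the Vertex evaluation service.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram-overlap precision, recall, and F1 with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # min count per shared token
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"

exact_match = float(candidate == reference)
scores = rouge1(candidate, reference)
print(exact_match, scores)
```

Exact match scores this reasonable paraphrase as 0, while ROUGE-1 recall credits 4 of the 6 reference tokens. It still misses that "sitting" matches "sat", which is why the answer pairs overlap metrics with semantic and judge-based evaluation.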

Question 15

On a Vertex AI online prediction endpoint, you configured Model Monitoring but only have access to the production request logs, not the original training dataset. Which detection type can you enable, and what statistical comparison does it perform under the hood for numerical vs. categorical features?

Accepted Answer

Without a training baseline you can only enable prediction *drift* detection (training-serving *skew* requires the training dataset as the baseline). Drift compares the live serving feature distribution in the current time window against an earlier serving-distribution baseline rather than against training data. For numerical features Vertex uses the Jensen-Shannon divergence between the two distributions; for categorical features it uses the L-infinity distance (the max change in any category's proportion). An alert fires when the computed distance exceeds the per-feature threshold you set. Skew, by contrast, would compare serving against training using the same metrics.
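The two distances named above are easy to compute by hand. The sketch below uses plain Python on already-binned data; the histograms, category proportions, and 0.1 threshold are made up, and how Vertex bins numerical features internally is not something this example tries to reproduce.

```python
import math


def l_infinity(p: dict[str, float], q: dict[str, float]) -> float:
    """Largest absolute change in any category's proportion."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (log base 2) over histograms with shared bins."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical categorical feature: payment method share, earlier vs current window.
earlier = {"card": 0.7, "wire": 0.3}
current = {"card": 0.5, "wire": 0.4, "crypto": 0.1}
cat_distance = l_infinity(earlier, current)

# Hypothetical numerical feature: normalized histogram over the same four bins.
num_distance = js_divergence([0.1, 0.4, 0.4, 0.1], [0.1, 0.2, 0.3, 0.4])

THRESHOLD = 0.1  # illustrative per-feature threshold
print(cat_distance, cat_distance > THRESHOLD)
print(num_distance, num_distance > THRESHOLD)
```

With base-2 logs the Jensen-Shannon divergence lies between 0 (identical histograms) and 1 (no overlap), which makes a single threshold scale comparable across features.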

Question 16

You're deploying a GPU-backed transformer to a Vertex AI online endpoint with spiky daytime traffic and near-zero overnight traffic. Compare setting min-replica-count to 0 vs 1, explain the default autoscaling signal and why CPU utilization can mislead for GPU models, and state when batch prediction is the correct choice instead.

Accepted Answer

Vertex autoscales between min and max replicas against a utilization target that defaults to 60%: CPU utilization on CPU-only deployments, and the higher of CPU utilization or GPU duty cycle when accelerators are attached (you can also set either target explicitly, or scale on a requests-per-replica metric). For GPU transformers CPU% alone is misleading because the GPU can be saturated while CPU sits low, which under-provisions replicas and tanks latency, so make sure a GPU duty-cycle target is what drives scaling. min-replicas=0 enables scale-to-zero where the deployment supports it (no idle cost overnight), but the first request after a scale-down pays a model-load cold start of seconds to minutes; min-replicas=1 keeps one warm replica, eliminating cold starts at the cost of 24/7 GPU billing. If predictions aren't latency-sensitive (overnight scoring of a whole dataset), use batch prediction, which spins up workers, scores in bulk to GCS/BigQuery, and tears down, leaving no standing endpoint cost.
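The effect of picking the wrong signal can be illustrated with a generic proportional scaling rule. The formula below is the Kubernetes-HPA-style rule, used here as an assumption; it is not Vertex AI's documented algorithm, and the utilization snapshot is invented.

```python
import math


def desired_replicas(current: int, utilization: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    wanted = math.ceil(current * utilization / target)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical snapshot: 2 replicas, GPU nearly saturated, CPU mostly idle.
cpu_util, gpu_duty = 0.20, 0.95

by_cpu = desired_replicas(2, cpu_util, 0.60, min_replicas=1, max_replicas=10)
by_gpu = desired_replicas(2, gpu_duty, 0.60, min_replicas=1, max_replicas=10)
print(by_cpu, by_gpu)  # 1 4
```

Scaling on CPU would shrink the deployment to one replica while the GPUs are overloaded; scaling on GPU duty cycle asks for four.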

Question 17

A retraining pipeline runs nightly but the model's offline metrics look fine while production complaints rise. Walk through the most likely MLOps failure modes and how Vertex AI surfaces them.

Accepted Answer

Most likely: training-serving skew or a silent feature pipeline divergence—offline eval uses the clean training distribution while live inputs have drifted, so the model is judged on the wrong distribution. Vertex AI Model Monitoring catches this via skew (vs. training baseline) and prediction drift (vs. an earlier serving window), alerting on per-feature distance such as Jensen-Shannon divergence. Other causes: label leakage inflating offline metrics, a feature computed differently online vs. offline (fix with Feature Store/tf.Transform single source), stale feature freshness, an upstream data-quality break that passes schema but shifts values, or a non-representative holdout. ML Metadata lineage lets you trace which data version produced the deployed model. The fix is monitoring-driven: detect drift, gate retraining on data-quality checks, and evaluate on a recent production-representative slice.

Question 18

At Google Cloud Next 2026 the platform was rebranded from Vertex AI to the Gemini Enterprise Agent Platform, with Model Garden and Agent Builder front and center. As a principal engineer, how would you frame the role of these managed GenAI services versus building agents directly on raw model APIs?

Accepted Answer

Model Garden is the curated catalog (Google first-party like Gemini plus open and partner models—now 200+, including Anthropic Claude) for discovering, evaluating, tuning, and one-click deploying foundation models, standardizing access and governance. Agent Builder provides managed orchestration for grounded, tool-using agents and RAG-backed enterprise search, handling retrieval, grounding, and connectors with built-in responsible-AI controls. The principal framing: prefer managed services for governance, security, evaluation, and time-to-value on standard enterprise patterns (grounded Q&A, search, retrieval agents); drop to the raw Gemini API plus custom orchestration only when you need control the managed layer can't express—bespoke tool graphs, custom guardrails, latency/cost tuning—while keeping deterministic validation and grounding around every model output. Note that going forward Vertex AI capabilities ship under the Agent Platform name.

Question 19

A fraud model on a Vertex AI endpoint shows stable input-feature distributions (no data drift) yet accuracy is silently degrading. The team wants monitoring that flags *why* predictions are shifting, not just that inputs moved. What Vertex AI capability addresses this, what is its hard prerequisite, and what is the cost/latency tradeoff of running it continuously?

Accepted Answer

They need feature-attribution monitoring, which tracks how much each feature *contributes* to predictions over time (skew vs. training attributions, or drift across serving windows). It catches concept-style shifts where input distributions look stable but the model's reliance on features changes. The hard prerequisite is Vertex Explainable AI enabled on the model (e.g. a sampled Shapley or integrated gradients configuration), since attribution monitoring is built on explanations. The tradeoff: computing attributions per request is far more expensive and slower than plain distribution monitoring, so you typically sample a fraction of traffic and accept added latency/compute cost rather than explaining 100% of predictions.

GCP Professional Machine Learning Engineer