Key Takeaway
Effective model monitoring combines statistical drift detection with business metric tracking, because data drift only matters when it impacts the outcomes your stakeholders care about. This playbook covers four monitoring layers with specific metrics, alert thresholds, detection methods, and automated response actions for each layer.
Prerequisites
- At least one ML model serving production traffic with logged predictions
- An observability stack (Prometheus/Grafana, Datadog, or equivalent) for metrics collection
- Access to ground truth labels or a proxy for model accuracy measurement
- A reference dataset representing the expected input distribution (typically the test or validation set)
- Basic understanding of statistical tests (KS test, PSI) and drift detection concepts
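The Population Stability Index mentioned above is worth sketching, since it recurs throughout drift monitoring. A minimal illustration (not code from this playbook): bin edges are taken from the reference distribution, both samples are converted to bin proportions, and a small epsilon guards against empty bins. A common rule of thumb is PSI < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 significant shift.

```python
import numpy as np

def population_stability_index(
    reference: np.ndarray,
    current: np.ndarray,
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compute PSI between a reference sample and a current sample.

    Bin edges come from the reference distribution so both samples
    are compared on the same grid; eps avoids log(0) on empty bins.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    ref_pct = ref_counts / ref_counts.sum() + eps
    cur_pct = cur_counts / cur_counts.sum() + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

In practice the reference sample is the training or validation set listed in the prerequisites, and the current sample is a recent window of production traffic.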
The Four Monitoring Layers
Model monitoring operates at four layers, each answering a different question:

- Data quality monitoring: is the input data well-formed and within expected bounds?
- Feature drift monitoring: has the statistical distribution of inputs changed since training?
- Model performance monitoring: is the model still producing accurate predictions?
- Business impact monitoring: are the model's predictions driving the business outcomes we expect?

Each layer catches different failure modes, and no single layer is sufficient on its own.
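The layer taxonomy can be encoded directly, which is useful when wiring dashboards or alert routing. The example metric names below are illustrative choices, not prescribed by this playbook:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringLayer:
    """One layer of the monitoring stack: what it asks and a sample metric."""
    name: str
    question: str
    example_metric: str

LAYERS = [
    MonitoringLayer("data_quality", "Is the input well-formed and in bounds?", "null_rate"),
    MonitoringLayer("feature_drift", "Has the input distribution shifted since training?", "psi"),
    MonitoringLayer("model_performance", "Is the model still accurate?", "rolling_auc"),
    MonitoringLayer("business_impact", "Are predictions driving expected outcomes?", "conversion_rate"),
]
```

A registry like this lets each layer's checks emit metrics under a consistent naming scheme in whatever observability stack you use.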
Layer 1: Data Quality Monitoring
Data quality monitoring is the first line of defense. It catches issues before they reach the model: schema violations (unexpected types, missing required fields), value range violations (negative ages, future dates, out-of-vocabulary categories), null rate spikes (a feature that is suddenly missing for a large percentage of requests), and volume anomalies (traffic significantly above or below expected levels). These checks should run on every incoming request or batch, with alerting thresholds calibrated to your traffic patterns.
"""Data quality monitoring for model inputs.
Validates incoming data against expected schemas
and distributions, catching upstream pipeline issues
before they corrupt model predictions.
"""
from dataclasses import dataclass
from typing import Dict, List, Optional, Any
import numpy as np
@dataclass
class QualityCheckResult:
"""Result of a single data quality check."""
check_name: str
passed: bool
metric_value: float
threshold: float
details: str
class DataQualityMonitor:
"""Monitor incoming model inputs for quality issues."""
def __init__(
self,
feature_schemas: Dict[str, Dict[str, Any]],
null_rate_threshold: float = 0.05,
volume_deviation_threshold: float = 0.5,
):
self.schemas = feature_schemas
self.null_threshold = null_rate_threshold
self.volume_threshold = volume_deviation_threshold
self._baseline_volume: Optional[float] = None
def check_null_rates(
self, batch: Dict[str, List],
) -> List[QualityCheckResult]:
"""Check null rates for each feature in a batch."""
results = []
for feature, values in batch.items():
null_count = sum(1 for v in values if v is None)
null_rate = null_count / len(values) if values else 0
results.append(QualityCheckResult(
check_name=f"null_rate_{feature}",
passed=null_rate <= self.null_threshold,
metric_value=null_rate,
threshold=self.null_threshold,
details=(
f"{feature}: {null_rate:.2%} null "
f"({null_count}/{len(values)})"
),
))
return results
def check_value_ranges(
self, batch: Dict[str, List],
) -> List[QualityCheckResult]:
"""Validate feature values against defined ranges."""
results = []
for feature, values in batch.items():
schema = self.schemas.get(feature, {})
min_val = schema.get("min")
max_val = schema.get("max")
if min_val is None and max_val is None:
continue
non_null = [v for v in values if v is not None]
if not non_null:
continue
violations = sum(
1 for v in non_null
if (min_val is not None and v < min_val)
or (max_val is not None and v > max_val)
)
violation_rate = violations / len(non_null)
results.append(QualityCheckResult(
check_name=f"range_{feature}",
passed=violation_rate <= 0.01,
metric_value=violation_rate,
threshold=0.01,
details=(
f"{feature}: {violations} values "
f"outside [{min_val}, {max_val}]"
),
))
return resultsUnlock the full Knowledge Base
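Of the four check types listed above, the snippet covers null rates and value ranges; the monitor stores a `volume_deviation_threshold`, but the corresponding method falls outside this excerpt. A minimal standalone sketch of a volume-anomaly check, assuming a rolling baseline of per-window request counts (the function name and signature are illustrative):

```python
def check_volume_anomaly(
    current_count: int,
    baseline_counts: list[int],
    deviation_threshold: float = 0.5,
) -> bool:
    """Return True if the current window's request count deviates
    from the baseline mean by more than deviation_threshold
    (expressed as a fraction of the baseline)."""
    if not baseline_counts:
        return False  # no baseline yet; cannot judge
    baseline = sum(baseline_counts) / len(baseline_counts)
    if baseline == 0:
        return current_count > 0  # any traffic is anomalous vs. zero
    deviation = abs(current_count - baseline) / baseline
    return deviation > deviation_threshold
```

As with the other checks, calibrate the threshold to your traffic patterns: services with strong daily or weekly seasonality usually need per-hour-of-day baselines rather than a single rolling mean.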