Key Takeaway
The most critical skill for AI on-call is quickly distinguishing between infrastructure failures and model quality degradation, because they require different response teams and different remediation approaches. This playbook provides structured runbooks, diagnostic decision trees, and escalation paths for the five most common categories of AI system alerts.
Prerequisites
- An existing on-call rotation and incident management process
- Model monitoring infrastructure with alerting configured (see: Model Monitoring Playbook)
- Access to model serving logs, metrics dashboards, and deployment tooling
- Model rollback capability (the ability to revert to a previous model version within minutes)
- Contact information for the ML engineering team and data engineering team for escalations
AI On-Call Is Different
Traditional on-call engineers diagnose infrastructure failures: services are down, databases are slow, the network is partitioned. AI on-call adds a category that does not exist in traditional systems: the service is up and responding, but the responses are wrong. An inference endpoint that returns HTTP 200 with a JSON body containing a subtly incorrect prediction looks healthy to every standard monitoring tool. Diagnosing these silent failures requires an understanding of model behavior, data distributions, and quality metrics that most on-call engineers have never worked with.
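To make the silent-failure case concrete, here is a minimal sketch that contrasts an availability check with a quality check over the same set of responses. It assumes a small set of labeled canary requests whose correct answers are known in advance. The `CanaryResult` structure, the function names, and the 0.9 accuracy threshold are all hypothetical and chosen for illustration, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    """One response to a labeled canary request (hypothetical structure)."""
    status_code: int   # HTTP status returned by the inference endpoint
    prediction: str    # label the model returned
    expected: str      # known-correct label for this canary input


def availability_ok(results: list[CanaryResult]) -> bool:
    """What a standard uptime check sees: did every request return HTTP 200?"""
    return all(r.status_code == 200 for r in results)


def accuracy(results: list[CanaryResult]) -> float:
    """Fraction of canary predictions that match the expected label."""
    if not results:
        raise ValueError("need at least one canary result")
    correct = sum(r.prediction == r.expected for r in results)
    return correct / len(results)


def quality_ok(results: list[CanaryResult], min_accuracy: float = 0.9) -> bool:
    """Quality check; the default threshold is an assumption to tune per model."""
    return accuracy(results) >= min_accuracy


# Every request succeeds at the HTTP level, but half the predictions are wrong.
results = [
    CanaryResult(200, "fraud", "fraud"),
    CanaryResult(200, "ok", "fraud"),
    CanaryResult(200, "ok", "ok"),
    CanaryResult(200, "fraud", "ok"),
]
print(availability_ok(results))  # True
print(quality_ok(results))       # False
```

The point of the sketch is that the two checks answer different questions: the first passes on any well-formed response, while the second can only be evaluated when some form of ground truth or reference output is available.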
This playbook bridges that gap with structured runbooks that guide on-call engineers through AI-specific diagnosis without requiring deep ML expertise. The runbooks take a triage-first approach: first determine whether the issue is infrastructure (on-call can resolve), data pipeline (escalate to data engineering), or model quality (escalate to the ML team), then follow the matching resolution path.
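The three-way routing described above can be sketched as a small function. This is an illustration under stated assumptions, not the playbook's own decision tree: the three boolean signals, the `AlertSignals` and `Route` names, and the order in which the checks run (infrastructure, then data pipeline, then model quality) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    INFRASTRUCTURE = "on-call resolves"
    DATA_PIPELINE = "escalate to data engineering"
    MODEL_QUALITY = "escalate to ML team"
    NO_ACTION = "no issue detected"


@dataclass
class AlertSignals:
    """Hypothetical summary of what the dashboards show when the alert fires."""
    endpoint_healthy: bool    # serving layer is up, latency and error rate normal
    input_data_healthy: bool  # upstream features are fresh and schema-valid
    quality_metric_ok: bool   # monitored model quality metric is within bounds


def triage(signals: AlertSignals) -> Route:
    """Route an alert to a resolution path.

    Assumed ordering: rule out infrastructure first, then the data pipeline,
    and attribute the problem to the model only when both look healthy.
    """
    if not signals.endpoint_healthy:
        return Route.INFRASTRUCTURE
    if not signals.input_data_healthy:
        return Route.DATA_PIPELINE
    if not signals.quality_metric_ok:
        return Route.MODEL_QUALITY
    return Route.NO_ACTION


# Service is up and inputs look fine, but the quality metric has degraded.
route = triage(AlertSignals(endpoint_healthy=True,
                            input_data_healthy=True,
                            quality_metric_ok=False))
print(route.value)  # escalate to ML team
```

Checking the cheaper, more familiar signals first is a design choice in this sketch: it keeps the on-call engineer in known territory for as long as possible and escalates to the ML team only once infrastructure and data causes have been ruled out.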
Triage Decision Tree