AI TRAINING
AI Observability and Monitoring in Production
Master the tools and practices to keep LLM-powered systems reliable, observable, and cost-efficient in production.
What it covers
This practitioner-level programme equips platform engineers and SREs with the skills to instrument, trace, and evaluate AI systems running in production. Participants learn to implement structured logging for LLM calls, detect model drift, set up cost and latency alerting, and run continuous evaluations against live traffic. The format combines hands-on labs using real observability tooling (LangSmith, Arize, Prometheus, OpenTelemetry) with architecture reviews and incident post-mortems. By the end, teams can build and operate a production-grade AI observability stack from scratch.
What you'll be able to do
- Instrument an LLM API pipeline with OpenTelemetry spans and structured JSON logs exportable to any backend (a minimal sketch follows this list)
- Configure automated eval pipelines that score model outputs on live traffic using LLM-as-judge and rule-based checks
- Build a drift detection alert that triggers when embedding cosine similarity or output toxicity scores shift beyond a set threshold
- Define and implement SLOs for AI endpoints covering P95 latency, error rate, and per-token cost budgets
- Diagnose and remediate a simulated production incident involving degraded LLM response quality, using an observability dashboard to isolate the cause
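To make the first outcome above concrete, here is a minimal sketch of the kind of instrumentation the labs build up, assuming an already-configured OpenTelemetry SDK and exporter, and using the OpenAI Python client purely as a stand-in for any provider; the span attribute names are illustrative, not official semantic conventions.

```python
# Minimal sketch: wrap an LLM call in an OpenTelemetry span and emit a
# structured JSON log line with prompt content and token counts. Assumes
# the OpenTelemetry SDK and an exporter are configured elsewhere; the span
# attribute names are illustrative, not official semantic conventions.
import json
import logging
import time

from openai import OpenAI  # example client; any SDK that reports token usage works
from opentelemetry import trace

tracer = trace.get_tracer("llm.pipeline")
logger = logging.getLogger("llm.calls")
client = OpenAI()

def traced_completion(prompt: str, model: str = "gpt-4o-mini") -> str:
    with tracer.start_as_current_span("llm.chat_completion") as span:
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        latency_ms = (time.perf_counter() - start) * 1000
        answer = response.choices[0].message.content or ""
        usage = response.usage

        # Span attributes become queryable in any OTel-compatible backend.
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_tokens", usage.prompt_tokens)
        span.set_attribute("llm.completion_tokens", usage.completion_tokens)
        span.set_attribute("llm.latency_ms", latency_ms)

        # Structured JSON log keeps prompt/response content and token counts,
        # which later cost attribution and live-traffic evals rely on.
        logger.info(json.dumps({
            "event": "llm_call",
            "model": model,
            "prompt": prompt,
            "response": answer,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "latency_ms": round(latency_ms, 1),
        }))
        return answer
```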
Topics covered
- Tracing LLM calls with OpenTelemetry and vendor-specific SDKs
- Structured logging strategies for prompt/response pipelines
- Eval-in-production: running automated quality checks against live traffic
- Drift detection for embedding models and output distributions (sketched below)
- Cost monitoring and token-budget alerting across providers
- Latency profiling and SLO definition for AI endpoints
- Observability tool landscape: LangSmith, Arize, Helicone, Datadog AI
- Incident response playbooks for degraded model behaviour
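As a taste of the drift-detection topic, the sketch below compares the centroid of a recent window of embeddings against a reference window, assuming those embeddings are already being collected; the 0.95 threshold and the alert hook are placeholders.

```python
# Minimal sketch of embedding-drift detection: compare the centroid of a
# recent window of embeddings against a reference (known-good) centroid and
# alert when cosine similarity drops below a threshold. The 0.95 threshold
# and the alert hook are illustrative placeholders.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_embedding_drift(
    reference: np.ndarray,    # shape (n_ref, dim): embeddings from a baseline window
    live: np.ndarray,         # shape (n_live, dim): embeddings from recent traffic
    threshold: float = 0.95,  # similarity below this triggers the alert
) -> bool:
    similarity = cosine_similarity(reference.mean(axis=0), live.mean(axis=0))
    drifted = similarity < threshold
    if drifted:
        # Swap in your alerting integration (Alertmanager, PagerDuty, Slack, ...).
        print(f"DRIFT ALERT: centroid similarity {similarity:.3f} < {threshold}")
    return drifted
```

The same threshold-and-alert pattern applies to output-quality signals such as toxicity scores.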
Delivery
Delivered as a 3-day intensive bootcamp (in-person or remote with live instruction), with approximately 60% hands-on lab time and 40% concept delivery and discussion. Participants work in small squads on a shared staging environment pre-configured with a multi-step LLM application. Materials include lab notebooks, architecture reference cards, and a starter observability stack template (Docker Compose + Prometheus + Grafana + LangSmith). Remote delivery uses breakout rooms for squad work with a shared cloud sandbox. Post-bootcamp, a follow-up office-hours session (2h) is recommended at the 30-day mark.
What makes it work
- Defining SLOs for AI endpoints before instrumenting: clarity on what 'good' looks like drives the right metric selection (see the sketch after this list)
- Embedding eval-in-production as a standard release gate alongside unit and integration tests
- Assigning clear ownership of cost and quality dashboards to a named team or on-call rotation
- Starting with a minimal viable observability stack (traces + cost + one quality metric) and iterating rather than boiling the ocean
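One way to make 'SLOs before instrumentation' tangible is to write the SLO down as data before any dashboard exists, then evaluate raw measurements against it. The sketch below assumes you can already query latencies, error counts, and token costs; the targets, field names, and example values are illustrative.

```python
# Minimal sketch: capture an AI endpoint's SLO as data, then evaluate raw
# measurements against it. Targets, field names, and example values are
# illustrative placeholders.
from dataclasses import dataclass

import numpy as np

@dataclass
class AIEndpointSLO:
    p95_latency_ms: float              # 95% of requests should complete within this time
    max_error_rate: float              # tolerated fraction of failed requests
    max_cost_per_1k_tokens_usd: float  # budget ceiling per 1,000 tokens

def evaluate_slo(
    slo: AIEndpointSLO,
    latencies_ms: list[float],
    error_count: int,
    request_count: int,
    total_cost_usd: float,
    total_tokens: int,
) -> dict[str, bool]:
    cost_per_1k = total_cost_usd / (total_tokens / 1000) if total_tokens else 0.0
    return {
        "latency_ok": float(np.percentile(latencies_ms, 95)) <= slo.p95_latency_ms,
        "errors_ok": (error_count / max(request_count, 1)) <= slo.max_error_rate,
        "cost_ok": cost_per_1k <= slo.max_cost_per_1k_tokens_usd,
    }

# Example targets: 800 ms P95, 1% error budget, $0.02 per 1k tokens.
slo = AIEndpointSLO(p95_latency_ms=800.0, max_error_rate=0.01, max_cost_per_1k_tokens_usd=0.02)
```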
Common mistakes
- Treating LLM observability as identical to classic APM — ignoring semantic drift and output quality as first-class signals
- Logging only errors and latency while omitting prompt content and token counts, making cost attribution impossible
- Running evals only offline at release time, with no mechanism to catch quality regressions in live traffic (a minimal live-traffic check is sketched after this list)
- Setting up dashboards without defining SLOs first, resulting in metrics that nobody acts on
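To illustrate the alternative to offline-only evals, here is a minimal rule-based check that could run over sampled live traffic, assuming structured log records with prompt and response fields like the logging sketch earlier; the rules and pass-rate handling are illustrative, and in practice they sit alongside LLM-as-judge scoring.

```python
# Minimal sketch of a rule-based eval over sampled live traffic: score each
# logged response against simple checks and return an overall pass rate that
# can feed a dashboard or alert. The rules, the assumed log fields, and the
# pass-rate handling are all illustrative.
import json

def rule_based_score(prompt: str, response: str) -> dict[str, bool]:
    return {
        "non_empty": len(response.strip()) > 0,
        "no_refusal": "i can't help with that" not in response.lower(),
        "reasonable_length": len(response) < 8000,
    }

def eval_live_sample(log_lines: list[str]) -> float:
    """Score a sample of structured llm_call log lines; return the pass rate."""
    passes, total = 0, 0
    for line in log_lines:
        record = json.loads(line)
        checks = rule_based_score(record.get("prompt", ""), record.get("response", ""))
        total += len(checks)
        passes += sum(checks.values())
    return passes / total if total else 1.0
```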
When NOT to take this
This bootcamp is not the right fit for teams that have not yet deployed any AI model or LLM feature to production; they would benefit more from an MLOps or LLM application-building programme before investing in observability infrastructure.
This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.