AI TRAINING
Data Pipelines for AI Workloads
Build production-grade data pipelines that reliably feed AI models with clean, versioned, observable data.
What it covers
This practitioner-level programme equips data engineers with the patterns and tooling to design, build, and operate data pipelines purpose-built for AI and ML workloads. Participants work through ELT design, streaming versus batch trade-offs, schema evolution strategies, and data quality gates using industry-standard tools including Airflow, dbt, Dagster, and Prefect. The format combines instructor-led sessions with hands-on labs where engineers implement real pipeline architectures on sample AI use cases. By the end, participants can ship resilient, observable pipelines that meet the data freshness and quality requirements of production ML systems.
What you'll be able to do
- Design and implement an ELT pipeline using dbt and a cloud data warehouse optimised for ML feature generation
- Choose between streaming and batch ingestion architectures based on model latency and data freshness requirements
- Configure Dagster or Prefect to orchestrate a multi-step AI data workflow with retries, branching, and SLA alerts
- Implement schema evolution policies that prevent silent data drift from breaking downstream model training
- Write and deploy data quality checks using Great Expectations or dbt tests that gate pipeline progression
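The quality-gate idea in the last bullet can be sketched in a few lines of plain Python. This is a hand-rolled illustration of what a gating check does, not the real Great Expectations or dbt test API; the column names and the 1% null threshold are assumptions for the example.

```python
# Illustrative quality gate: a batch-level check that raises (and so halts
# the pipeline step) instead of merely logging a warning.
# NOT the Great Expectations or dbt API -- a hand-rolled sketch.

def run_quality_gate(rows, max_null_rate=0.01,
                     required_columns=("user_id", "event_ts")):
    """Raise ValueError if the batch fails basic expectations."""
    if not rows:
        raise ValueError("quality gate failed: empty batch")
    for col in required_columns:
        nulls = sum(1 for r in rows if r.get(col) is None)
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            raise ValueError(
                f"quality gate failed: column {col!r} null rate "
                f"{null_rate:.2%} exceeds {max_null_rate:.2%}"
            )
    return True  # gate passed; the orchestrator may run the next step
```

The essential property is that the gate is a mandatory step that fails loudly, so a bad batch can never silently reach feature generation or training.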
Topics covered
- ELT patterns optimised for feature stores and model training data
- Streaming vs batch trade-offs for real-time inference pipelines
- Schema evolution and backward compatibility strategies
- Orchestration with Airflow, Dagster, and Prefect — when to use which
- Data transformation and lineage with dbt
- Data quality gates: expectations, anomaly detection, and alerting
- Pipeline observability: logging, metrics, and SLAs
- Handling large-scale data for LLM fine-tuning and RAG workloads
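The schema evolution topic above boils down to one question: does the new schema version break existing consumers? A minimal sketch, representing a schema as a `{column_name: type_name}` dict (an assumption for illustration, not any particular schema registry's API):

```python
# Sketch of a backward-compatibility check between two schema versions.
# Removed columns and type changes are breaking; new columns are additive
# and treated as backward compatible.

def breaking_changes(old_schema, new_schema):
    """Return a list of changes that would break downstream consumers."""
    problems = []
    for col, col_type in old_schema.items():
        if col not in new_schema:
            problems.append(f"removed column: {col}")
        elif new_schema[col] != col_type:
            problems.append(f"type change on {col}: {col_type} -> {new_schema[col]}")
    return problems
```

Run as a pipeline step, a non-empty result can block deployment of the producer change until downstream model-training code is updated.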
Delivery
Delivered as a 3–5 day intensive bootcamp, available in-person or fully remote via collaborative tooling (VS Code Live Share, shared cloud environments). Approximately 60% hands-on lab time across realistic datasets. Participants receive a pre-configured cloud sandbox (GCP or AWS) and access to recorded sessions for 90 days post-training. A capstone project — building an end-to-end pipeline for a simulated LLM embedding refresh workflow — is assessed and returned with written feedback.
What makes it work
- Establish data contracts between pipeline producers and ML consumers before writing transformation code
- Instrument pipelines with observability from day one — SLA tracking, freshness metrics, and anomaly alerts
- Run quality gates as mandatory pipeline steps rather than optional monitoring layers
- Align orchestration tool choice with the team's existing DevOps practices and cloud vendor ecosystem
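The first principle above, data contracts agreed before transformation code is written, can be made concrete with a validated record type at the producer/consumer boundary. A minimal sketch, assuming hypothetical field names and ranges:

```python
# Minimal data-contract sketch: the ML consumer's expectations encoded as a
# frozen dataclass that validates on construction. Field names and bounds
# are illustrative, not from any real contract.

from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureRowContract:
    user_id: int
    signup_days: int    # contract: must be >= 0
    churn_score: float  # contract: must be in [0.0, 1.0]

    def __post_init__(self):
        if self.signup_days < 0:
            raise ValueError("contract violation: signup_days must be >= 0")
        if not 0.0 <= self.churn_score <= 1.0:
            raise ValueError("contract violation: churn_score outside [0, 1]")

def validate_batch(raw_rows):
    """Fail fast at the boundary: reject the batch on the first violation."""
    return [FeatureRowContract(**row) for row in raw_rows]
```

Because the contract lives in code, a producer change that violates it fails in the pipeline rather than surfacing weeks later as degraded model metrics.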
Common mistakes
- Reusing analytical ETL pipelines for ML workloads without adapting for feature consistency and point-in-time correctness
- Choosing streaming by default without evaluating whether model latency requirements actually justify the operational overhead
- Skipping data quality gates in early pipeline versions and discovering silent schema drift only after model degradation in production
- Treating pipeline orchestration tool selection as purely technical and ignoring team familiarity and operational support costs
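The point-in-time correctness mistake in the first bullet is worth pinning down: a training example must only see feature values observed at or before its label timestamp, never later ones, or future information leaks into training. A sketch of the lookup (timestamps and values are illustrative):

```python
# Illustrative point-in-time lookup over a feature's history, given as a
# list of (timestamp, value) pairs sorted ascending by timestamp.

import bisect

def point_in_time_value(feature_history, label_ts):
    """Return the latest value observed at or before label_ts, else None."""
    timestamps = [ts for ts, _ in feature_history]
    i = bisect.bisect_right(timestamps, label_ts)
    if i == 0:
        return None  # no feature value existed yet at label time
    return feature_history[i - 1][1]
```

Analytical ETL pipelines typically serve only the latest value, which is exactly the shortcut that makes them unsafe to reuse for training-data generation.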
When NOT to take this
This training is not the right fit for teams that have not yet standardised on a cloud data warehouse and have no existing pipelines — such teams need foundational data engineering onboarding before tackling AI-specific pipeline patterns.
This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.