
AI TRAINING

Serving ML Models at Scale in Production

Build and operate high-throughput, low-latency ML inference infrastructure confidently and cost-efficiently.

Format
bootcamp
Duration
24–40h
Level
practitioner
Group size
6–16
Price / participant
€2K–€4K
Group price
€20K–€45K
Audience
ML platform engineers, MLOps engineers, and senior ML engineers responsible for deploying models to production
Prerequisites
Solid Python skills, familiarity with PyTorch or TensorFlow, prior experience deploying at least one ML model to a cloud or on-prem environment

What it covers

This practitioner-level programme covers the full lifecycle of deploying and scaling machine learning models in production environments. Participants gain hands-on experience with leading inference engines (vLLM, TGI, Triton Inference Server), batching strategies, quantisation techniques, and GPU/hardware selection. The training balances theoretical foundations with lab-based exercises, culminating in participants benchmarking and tuning a real inference stack end-to-end. The format combines instructor-led sessions with guided labs and peer code review.
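To give a flavour of the lab work, here is a minimal sketch of loading an AWQ-quantised model through vLLM's offline Python API; the checkpoint name and sampling values are illustrative choices, not fixed parts of the curriculum.

    # Minimal vLLM sketch: load an AWQ-quantised model and run batched generation.
    # The checkpoint name and parameter values are illustrative examples.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
        quantization="awq",           # weights are AWQ-quantised
        gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)

    # vLLM schedules these prompts with continuous batching internally.
    outputs = llm.generate(
        ["Explain continuous batching in one sentence.",
         "Why does quantisation reduce GPU memory use?"],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text.strip())

In production these engines typically run as standalone servers; the offline API is shown here only because it fits in a few lines.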

What you'll be able to do

  • Deploy and configure vLLM and Triton Inference Server to serve a large language model with defined latency SLOs
  • Apply INT8 and INT4 quantisation techniques and measure their accuracy/throughput tradeoffs on a real model
  • Design a batching strategy that maximises GPU utilisation while respecting p95 latency targets
  • Select and justify GPU hardware (A10G, A100, H100) based on model size, throughput requirements, and cost constraints
  • Instrument an inference service with Prometheus metrics and create a dashboard tracking GPU utilisation, token throughput, and error rates (see the instrumentation sketch after this list)
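The instrumentation outcome above can look like the following sketch, built on the standard prometheus_client library; the metric names and the run_model() call are hypothetical placeholders.

    # Sketch: expose inference metrics for Prometheus to scrape.
    # Metric names and the run_model() call are hypothetical placeholders.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    TOKENS_GENERATED = Counter(
        "inference_tokens_generated_total", "Total tokens generated")
    REQUEST_ERRORS = Counter(
        "inference_request_errors_total", "Failed inference requests")
    REQUEST_LATENCY = Histogram(
        "inference_request_latency_seconds", "End-to-end request latency",
        buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0))

    def handle_request(prompt: str) -> str:
        start = time.perf_counter()
        try:
            completion = run_model(prompt)  # hypothetical model call
            TOKENS_GENERATED.inc(len(completion.split()))
            return completion
        except Exception:
            REQUEST_ERRORS.inc()
            raise
        finally:
            REQUEST_LATENCY.observe(time.perf_counter() - start)

    start_http_server(9090)  # Prometheus then scrapes :9090/metrics

GPU utilisation itself is usually collected by a separate exporter such as NVIDIA's DCGM exporter rather than from application code.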

Topics covered

  • Inference engine deep-dive: vLLM, TGI, and Triton Inference Server
  • Continuous batching, dynamic batching, and request scheduling
  • Model quantisation: INT8, INT4, GPTQ, AWQ, and bitsandbytes
  • GPU selection, multi-GPU parallelism (tensor, pipeline, data)
  • Latency vs. throughput tradeoffs and SLO design
  • Model serving APIs: REST, gRPC, and streaming responses (see the streaming client sketch after this list)
  • Monitoring inference: latency percentiles, GPU utilisation, cost per token
  • Autoscaling strategies and KV-cache optimisation
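As an illustration of the streaming-API topic above, the sketch below consumes a server-sent-events completion stream from an OpenAI-compatible endpoint of the kind vLLM and TGI expose; the URL and model id are placeholders for whatever the lab environment serves.

    # Sketch: stream tokens from an OpenAI-compatible /v1/completions endpoint.
    # The URL and model id are placeholders.
    import json
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",  # placeholder endpoint
        json={
            "model": "served-model",             # placeholder model id
            "prompt": "List three batching strategies.",
            "max_tokens": 64,
            "stream": True,                      # request server-sent events
        },
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()

    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":                 # SSE stream terminator
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["text"], end="", flush=True)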

Delivery

Delivered as a 3–5 day intensive bootcamp, either on-site at the client's premises or fully remote via a cloud lab environment (AWS, GCP, or Azure GPU instances are provisioned for participants). Approximately 60% of the time is spent in hands-on labs and 40% in instructor-led conceptual sessions. Participants work in pairs during lab exercises. All lab code, benchmark scripts, and reference configurations are provided and retained by participants after the programme. A GPU-enabled cloud sandbox is included in the group price; participants using personal or company cloud accounts should budget for GPU time separately.

What makes it work

  • Define latency and throughput SLOs before benchmarking so hardware and batching decisions are grounded in real requirements
  • Run quantisation accuracy evaluations on a representative task-specific dataset, not just generic benchmarks
  • Instrument inference from day one — teams that add observability early iterate much faster on performance tuning
  • Treat model serving configurations as code (versioned, reviewed, tested) to avoid configuration drift between environments; a minimal sketch follows this list
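A minimal sketch of the last point, assuming a hypothetical serving_config.yaml kept under version control; the file and field names are examples, not a prescribed schema.

    # Sketch: validate a versioned serving config at startup so drift is caught
    # in code review rather than in production. File and field names are
    # hypothetical examples.
    from dataclasses import dataclass

    import yaml  # PyYAML

    @dataclass(frozen=True)
    class ServingConfig:
        model: str
        quantization: str        # e.g. "none", "int8", "awq"
        max_batch_size: int
        p95_latency_slo_ms: int

    def load_config(path: str) -> ServingConfig:
        with open(path) as f:
            raw = yaml.safe_load(f)
        cfg = ServingConfig(**raw)  # unknown or missing keys fail fast here
        if cfg.max_batch_size < 1:
            raise ValueError("max_batch_size must be >= 1")
        return cfg

    cfg = load_config("serving_config.yaml")  # hypothetical file in git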

Common mistakes

  • Defaulting to simple synchronous REST serving without considering continuous batching, leading to severe GPU underutilisation at scale
  • Applying aggressive quantisation (INT4) without validating accuracy degradation on the specific task, causing silent quality regressions in production
  • Over-provisioning GPU instances based on peak load without implementing autoscaling, resulting in excessive infrastructure costs
  • Ignoring KV-cache sizing and eviction policies, causing unpredictable latency spikes under concurrent load (see the percentile check after this list)
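Load tests catch latency spikes like these only if they assert on tail percentiles rather than averages; a minimal sketch with made-up latency samples:

    # Sketch: check tail latency against a p95 SLO after a load test.
    # The samples and the 500 ms target are made up for illustration.
    import statistics

    def check_slo(latencies_ms: list[float], p95_target_ms: float) -> bool:
        qs = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cuts
        p50, p95, p99 = qs[49], qs[94], qs[98]
        print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
        return p95 <= p95_target_ms

    samples = [120, 135, 150, 160, 210, 480, 950]  # one spike in the tail
    print("SLO met" if check_slo(samples, 500) else "SLO breached")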

When NOT to take this

This training is not appropriate for teams that have not yet trained or fine-tuned a model and are still evaluating whether ML is the right solution — they need a strategy or literacy programme first, not infrastructure depth.

This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.