
AI TRAINING

Serving ML Models at Scale in Production

Build and operate high-throughput, low-latency ML inference infrastructure confidently and cost-efficiently.

Format
bootcamp
Duration
24–40h
Level
practitioner
Group size
6–16
Price / participant
€2K–€4K
Group price
€20K–€45K
Audience
ML platform engineers, MLOps engineers, and senior ML engineers responsible for deploying models to production
Prerequisites
Solid Python skills, familiarity with PyTorch or TensorFlow, prior experience deploying at least one ML model to a cloud or on-prem environment

What it covers

This practitioner-level programme covers the full lifecycle of deploying and scaling machine learning models in production environments. Participants gain hands-on experience with leading inference engines (vLLM, TGI, Triton Inference Server), batching strategies, quantisation techniques, and GPU/hardware selection. The training balances theoretical foundations with lab-based exercises, culminating in participants benchmarking and tuning a real inference stack end-to-end. The format combines instructor-led sessions with guided labs and peer code review.
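To give a flavour of the lab work, here is a minimal sketch of loading an AWQ-quantised model through vLLM's offline Python API; the checkpoint name and sampling values are illustrative choices, not fixed parts of the curriculum.

    # Minimal vLLM sketch: load an AWQ-quantised model and run batched generation.
    # The checkpoint name and parameter values are illustrative examples.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ checkpoint
        quantization="awq",           # weights are AWQ-quantised
        gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
    )

    params = SamplingParams(temperature=0.7, max_tokens=128)

    # vLLM schedules these prompts with continuous batching internally.
    outputs = llm.generate(
        ["Explain continuous batching in one sentence.",
         "Why does quantisation reduce GPU memory use?"],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text.strip())

In production these engines typically run as standalone servers; the offline API is shown here only because it fits in a few lines.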

What you'll be able to do

  • Deploy and configure vLLM and Triton Inference Server to serve a large language model with defined latency SLOs
  • Apply INT8 and INT4 quantisation techniques and measure their accuracy/throughput tradeoffs on a real model
  • Design a batching strategy that maximises GPU utilisation while respecting p95 latency targets
  • Select and justify GPU hardware (A10G, A100, H100) based on model size, throughput requirements, and cost constraints
  • Instrument an inference service with Prometheus metrics and create a dashboard tracking GPU utilisation, token throughput, and error rates (see the instrumentation sketch after this list)
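The instrumentation outcome above can look like the following sketch, built on the standard prometheus_client library; the metric names and the run_model() call are hypothetical placeholders.

    # Sketch: expose inference metrics for Prometheus to scrape.
    # Metric names and the run_model() call are hypothetical placeholders.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    TOKENS_GENERATED = Counter(
        "inference_tokens_generated_total", "Total tokens generated")
    REQUEST_ERRORS = Counter(
        "inference_request_errors_total", "Failed inference requests")
    REQUEST_LATENCY = Histogram(
        "inference_request_latency_seconds", "End-to-end request latency",
        buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0))

    def handle_request(prompt: str) -> str:
        start = time.perf_counter()
        try:
            completion = run_model(prompt)  # hypothetical model call
            TOKENS_GENERATED.inc(len(completion.split()))
            return completion
        except Exception:
            REQUEST_ERRORS.inc()
            raise
        finally:
            REQUEST_LATENCY.observe(time.perf_counter() - start)

    start_http_server(9090)  # Prometheus then scrapes :9090/metrics

GPU utilisation itself is usually collected by a separate exporter such as NVIDIA's DCGM exporter rather than from application code.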

Topics covered

  • Inference engine deep-dive: vLLM, TGI, and Triton Inference Server
  • Continuous batching, dynamic batching, and request scheduling
  • Model quantisation: INT8, INT4, GPTQ, AWQ, and bitsandbytes
  • GPU selection, multi-GPU parallelism (tensor, pipeline, data)
  • Latency vs. throughput tradeoffs and SLO design
  • Model serving APIs: REST, gRPC, and streaming responses (see the streaming client sketch after this list)
  • Monitoring inference: latency percentiles, GPU utilisation, cost per token
  • Autoscaling strategies and KV-cache optimisation
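As an illustration of the streaming-API topic above, the sketch below consumes a server-sent-events completion stream from an OpenAI-compatible endpoint of the kind vLLM and TGI expose; the URL and model id are placeholders for whatever the lab environment serves.

    # Sketch: stream tokens from an OpenAI-compatible /v1/completions endpoint.
    # The URL and model id are placeholders.
    import json
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",  # placeholder endpoint
        json={
            "model": "served-model",             # placeholder model id
            "prompt": "List three batching strategies.",
            "max_tokens": 64,
            "stream": True,                      # request server-sent events
        },
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()

    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":                 # SSE stream terminator
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["text"], end="", flush=True)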

Delivery

Delivered as a 3–5 day intensive bootcamp, either on-site at the client's premises or fully remote via a cloud lab environment (AWS, GCP, or Azure GPU instances are provisioned for participants). Approximately 60% of the time is spent in hands-on labs and 40% in instructor-led conceptual sessions. Participants work in pairs during lab exercises. All lab code, benchmark scripts, and reference configurations are provided and retained by participants after the programme. A GPU-enabled cloud sandbox is included in the group price; participants using personal or company cloud accounts should budget for GPU time separately.

What makes it work

  • Define latency and throughput SLOs before benchmarking so hardware and batching decisions are grounded in real requirements
  • Run quantisation accuracy evaluations on a representative task-specific dataset, not just generic benchmarks
  • Instrument inference from day one — teams that add observability early iterate much faster on performance tuning
  • Treat model serving configurations as code (versioned, reviewed, tested) to avoid configuration drift between environments; a minimal sketch follows this list
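A minimal sketch of the last point, assuming a hypothetical serving_config.yaml kept under version control; the file and field names are examples, not a prescribed schema.

    # Sketch: validate a versioned serving config at startup so drift is caught
    # in code review rather than in production. File and field names are
    # hypothetical examples.
    from dataclasses import dataclass

    import yaml  # PyYAML

    @dataclass(frozen=True)
    class ServingConfig:
        model: str
        quantization: str        # e.g. "none", "int8", "awq"
        max_batch_size: int
        p95_latency_slo_ms: int

    def load_config(path: str) -> ServingConfig:
        with open(path) as f:
            raw = yaml.safe_load(f)
        cfg = ServingConfig(**raw)  # unknown or missing keys fail fast here
        if cfg.max_batch_size < 1:
            raise ValueError("max_batch_size must be >= 1")
        return cfg

    cfg = load_config("serving_config.yaml")  # hypothetical file in git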

Common mistakes

  • Defaulting to simple synchronous REST serving without considering continuous batching, leading to severe GPU underutilisation at scale
  • Applying aggressive quantisation (INT4) without validating accuracy degradation on the specific task, causing silent quality regressions in production
  • Over-provisioning GPU instances based on peak load without implementing autoscaling, resulting in excessive infrastructure costs
  • Ignoring KV-cache sizing and eviction policies, causing unpredictable latency spikes under concurrent load (see the percentile check after this list)
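Load tests catch latency spikes like these only if they assert on tail percentiles rather than averages; a minimal sketch with made-up latency samples:

    # Sketch: check tail latency against a p95 SLO after a load test.
    # The samples and the 500 ms target are made up for illustration.
    import statistics

    def check_slo(latencies_ms: list[float], p95_target_ms: float) -> bool:
        qs = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cuts
        p50, p95, p99 = qs[49], qs[94], qs[98]
        print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
        return p95 <= p95_target_ms

    samples = [120, 135, 150, 160, 210, 480, 950]  # one spike in the tail
    print("SLO met" if check_slo(samples, 500) else "SLO breached")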

When NOT to take this

This training is not appropriate for teams that have not yet trained or fine-tuned a model and are still evaluating whether ML is the right solution — they need a strategy or literacy programme first, not infrastructure depth.

This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.