
AI TRAINING

Deploying and Scaling ML Models in Production

Build and operate ML inference infrastructure that is performant, cost-efficient, and production-ready.

Format
bootcamp
Duration
24–40h
Level
practitioner
Group size
6–16
Price / participant
€2K–€4K
Group price
€20K–€45K
Audience
ML platform engineers, MLOps engineers, and senior ML engineers responsible for deploying models to production
Prerequisites
Solid Python skills, familiarity with PyTorch or TensorFlow, and prior experience deploying at least one ML model to a cloud or on-prem environment

What it covers

This practitioner-level programme covers the full lifecycle of deploying and scaling machine learning models in production. Participants get hands-on experience with the major inference engines (vLLM, TGI, Triton Inference Server), batching strategies, quantisation techniques, and GPU/hardware selection. The training combines theoretical foundations with lab exercises and culminates in benchmarking and optimising a complete inference stack. The format mixes instructor-led sessions, guided labs, and peer code reviews.
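
To give a flavour of the lab work, here is a minimal sketch of serving a model with vLLM's offline Python API, one of the engines covered; the model name is illustrative, and a GPU with sufficient memory is assumed:

    # Minimal vLLM offline inference sketch (pip install vllm; GPU required).
    from vllm import LLM, SamplingParams

    # Model name is illustrative; any Hugging Face causal LM works.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              gpu_memory_utilization=0.90)

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(
        ["Explain continuous batching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)

Passing a list of prompts lets vLLM batch them continuously, which is the mechanism behind the throughput gains discussed throughout the programme.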

By the end, you will be able to

  • Deploy and configure vLLM and Triton Inference Server to serve a large language model with defined latency SLOs
  • Apply INT8 and INT4 quantisation techniques and measure their accuracy/throughput tradeoffs on a real model
  • Design a batching strategy that maximises GPU utilisation while respecting p95 latency targets
  • Select and justify GPU hardware (A10G, A100, H100) based on model size, throughput requirements, and cost constraints
  • Instrument an inference service with Prometheus metrics and create a dashboard tracking GPU utilisation, token throughput, and error rates (see the instrumentation sketch after this list)
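
The sketch below shows one way to instrument a serving endpoint with the prometheus_client library; the metric names and the fake_generate stand-in are illustrative assumptions, not course material:

    # Illustrative Prometheus instrumentation of an inference handler.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUEST_LATENCY = Histogram(
        "inference_request_latency_seconds", "End-to-end request latency",
        buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
    )
    TOKENS_GENERATED = Counter("tokens_generated_total", "Total output tokens")
    ERRORS = Counter("inference_errors_total", "Failed inference requests")

    def fake_generate(prompt: str):
        # Stand-in for a real engine call (vLLM, TGI, Triton client, ...).
        time.sleep(0.05)
        return "a response", 42

    def handle_request(prompt: str) -> str:
        start = time.perf_counter()
        try:
            text, n_tokens = fake_generate(prompt)
            TOKENS_GENERATED.inc(n_tokens)
            return text
        except Exception:
            ERRORS.inc()
            raise
        finally:
            REQUEST_LATENCY.observe(time.perf_counter() - start)

    if __name__ == "__main__":
        start_http_server(9100)  # metrics exposed at http://host:9100/metrics
        while True:
            handle_request("ping")

In practice, GPU utilisation is usually scraped separately via NVIDIA's DCGM exporter rather than reported from application code.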

Topics covered

  • Inference engine deep-dive: vLLM, TGI, and Triton Inference Server
  • Continuous batching, dynamic batching, and request scheduling
  • Model quantisation: INT8, INT4, GPTQ, AWQ, and bitsandbytes (see the loading sketch after this list)
  • GPU selection, multi-GPU parallelism (tensor, pipeline, data)
  • Latency vs. throughput tradeoffs and SLO design
  • Model serving APIs: REST, gRPC, and streaming responses
  • Monitoring inference: latency percentiles, GPU utilisation, cost per token
  • Autoscaling strategies and KV-cache optimisation
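
As a concrete taste of the quantisation topic, a minimal sketch of INT8 loading via bitsandbytes through Hugging Face Transformers; the model name is illustrative, and the bitsandbytes and accelerate packages are assumed to be installed:

    # Illustrative INT8 loading with bitsandbytes via Transformers.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # place layers across available GPUs
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Loading quantised is only half the job; the accuracy/throughput tradeoff still has to be measured on the target task, which is the focus of the corresponding lab.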

Delivery format

Delivered as a 3–5 day intensive bootcamp, either on-site at the client's premises or fully remote via a cloud lab environment (AWS, GCP, or Azure GPU instances are provisioned for participants). Approximately 60% of the time is spent in hands-on labs; 40% in instructor-led conceptual sessions. Participants work in pairs during lab exercises. All lab code, benchmark scripts, and reference configurations are provided and retained by participants after the programme. A GPU-enabled cloud sandbox is included in the group price; participants using personal or company cloud accounts should budget separately.

What makes it work

  • Define latency and throughput SLOs before benchmarking so hardware and batching decisions are grounded in real requirements (see the SLO-check sketch after this list)
  • Run quantisation accuracy evaluations on a representative task-specific dataset, not just generic benchmarks
  • Instrument inference from day one — teams that add observability early iterate much faster on performance tuning
  • Treat model serving configurations as code (versioned, reviewed, tested) to avoid configuration drift between environments
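
To make the SLO point concrete, a minimal sketch of gating a benchmark run on a p95 latency target; the threshold and the latency samples are illustrative:

    # Illustrative SLO gate over benchmark latencies.
    P95_TARGET_S = 0.5  # example latency SLO agreed before benchmarking

    def p95(samples):
        s = sorted(samples)
        return s[int(0.95 * (len(s) - 1))]

    # Illustrative per-request latencies (seconds) from a benchmark run.
    latencies = [0.21, 0.27, 0.29, 0.30, 0.33, 0.34, 0.38, 0.45, 0.48, 0.49]

    measured = p95(latencies)
    print(f"p95 = {measured:.2f}s (target {P95_TARGET_S}s)")
    assert measured <= P95_TARGET_S, \
        "batching/hardware config misses the latency SLO"

Running this as a CI step against versioned serving configs also supports the configuration-as-code practice above.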

Common mistakes

  • Defaulting to simple synchronous REST serving without considering continuous batching, leading to severe GPU underutilisation at scale
  • Applying aggressive quantisation (INT4) without validating accuracy degradation on the specific task, causing silent quality regressions in production
  • Over-provisioning GPU instances based on peak load without implementing autoscaling, resulting in excessive infrastructure costs
  • Ignoring KV-cache sizing and eviction policies, causing unpredictable latency spikes under concurrent load (see the sizing sketch below)
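
To make the KV-cache point concrete, a back-of-envelope sizing calculation; all parameters are illustrative assumptions for an 8B-class model with grouped-query attention:

    # Illustrative KV-cache sizing; all numbers are assumptions.
    n_layers = 32
    n_kv_heads = 8          # grouped-query attention
    head_dim = 128
    bytes_per_elem = 2      # FP16
    max_seq_len = 4096
    concurrent_requests = 64

    # Factor of 2 accounts for keys and values, per token per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    cache_bytes = bytes_per_token * max_seq_len * concurrent_requests
    print(f"{bytes_per_token / 1024:.0f} KiB per token, "
          f"{cache_bytes / 1024**3:.1f} GiB at full concurrency")

At 32 GiB for the cache alone, a single 40 GB A100 leaves little headroom for the model weights themselves, which is exactly the failure mode behind the latency spikes described above.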

When NOT to take this training

This training is not appropriate for teams that have not yet trained or fine-tuned a model and are still evaluating whether ML is the right solution — they need a strategy or literacy programme first, not infrastructure depth.


This training is part of a Data & AI catalogue built for leaders who are serious about execution. Run the free diagnostic to see which trainings should be prioritised for your team.