
AI TRAINING

Deploying and Scaling ML Models in Production

Build and operate ML inference infrastructure that is performant, cost-efficient, and production-ready.

Format
bootcamp
Duration
24–40h
Level
practitioner
Group size
6–16
Price / participant
€2K–€4K
Group price
€20K–€45K
Audience
ML platform engineers, MLOps engineers, and senior ML engineers responsible for deploying models to production
Prerequisites
Solid Python skills, familiarity with PyTorch or TensorFlow, and prior experience deploying at least one ML model to a cloud or on-prem environment

What it covers

This practitioner-level programme covers the full lifecycle of deploying and scaling machine learning models in production. Participants get hands-on experience with the major inference engines (vLLM, TGI, Triton Inference Server), batching strategies, quantisation techniques, and GPU/hardware selection. The training combines theoretical foundations with lab exercises and culminates in benchmarking and optimising a complete inference stack. The format mixes instructor-led sessions, guided labs, and peer code reviews.
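
To give a flavour of the lab work, here is a minimal sketch of serving a model with vLLM's offline Python API, one of the engines covered; the model name is illustrative, and a GPU with sufficient memory is assumed:

    # Minimal vLLM offline inference sketch (pip install vllm; GPU required).
    from vllm import LLM, SamplingParams

    # Model name is illustrative; any Hugging Face causal LM works.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              gpu_memory_utilization=0.90)

    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(
        ["Explain continuous batching in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)

Passing a list of prompts lets vLLM batch them continuously, which is the mechanism behind the throughput gains discussed throughout the programme.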

By the end, you will be able to

  • Deploy and configure vLLM and Triton Inference Server to serve a large language model with defined latency SLOs
  • Apply INT8 and INT4 quantisation techniques and measure their accuracy/throughput tradeoffs on a real model
  • Design a batching strategy that maximises GPU utilisation while respecting p95 latency targets
  • Select and justify GPU hardware (A10G, A100, H100) based on model size, throughput requirements, and cost constraints
  • Instrument an inference service with Prometheus metrics and create a dashboard tracking GPU utilisation, token throughput, and error rates (see the instrumentation sketch after this list)
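
The sketch below shows one way to instrument a serving endpoint with the prometheus_client library; the metric names and the fake_generate stand-in are illustrative assumptions, not course material:

    # Illustrative Prometheus instrumentation of an inference handler.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUEST_LATENCY = Histogram(
        "inference_request_latency_seconds", "End-to-end request latency",
        buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
    )
    TOKENS_GENERATED = Counter("tokens_generated_total", "Total output tokens")
    ERRORS = Counter("inference_errors_total", "Failed inference requests")

    def fake_generate(prompt: str):
        # Stand-in for a real engine call (vLLM, TGI, Triton client, ...).
        time.sleep(0.05)
        return "a response", 42

    def handle_request(prompt: str) -> str:
        start = time.perf_counter()
        try:
            text, n_tokens = fake_generate(prompt)
            TOKENS_GENERATED.inc(n_tokens)
            return text
        except Exception:
            ERRORS.inc()
            raise
        finally:
            REQUEST_LATENCY.observe(time.perf_counter() - start)

    if __name__ == "__main__":
        start_http_server(9100)  # metrics exposed at http://host:9100/metrics
        while True:
            handle_request("ping")

In practice, GPU utilisation is usually scraped separately via NVIDIA's DCGM exporter rather than reported from application code.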

Topics covered

  • Inference engine deep-dive: vLLM, TGI, and Triton Inference Server
  • Continuous batching, dynamic batching, and request scheduling
  • Model quantisation: INT8, INT4, GPTQ, AWQ, and bitsandbytes (see the loading sketch after this list)
  • GPU selection, multi-GPU parallelism (tensor, pipeline, data)
  • Latency vs. throughput tradeoffs and SLO design
  • Model serving APIs: REST, gRPC, and streaming responses
  • Monitoring inference: latency percentiles, GPU utilisation, cost per token
  • Autoscaling strategies and KV-cache optimisation
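
As a concrete taste of the quantisation topic, a minimal sketch of INT8 loading via bitsandbytes through Hugging Face Transformers; the model name is illustrative, and the bitsandbytes and accelerate packages are assumed to be installed:

    # Illustrative INT8 loading with bitsandbytes via Transformers.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # place layers across available GPUs
    )

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Loading quantised is only half the job; the accuracy/throughput tradeoff still has to be measured on the target task, which is the focus of the corresponding lab.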

Delivery format

Delivered as a 3–5 day intensive bootcamp, either on-site at the client's premises or fully remote via a cloud lab environment (AWS, GCP, or Azure GPU instances are provisioned for participants). Approximately 60% of the time is spent in hands-on labs; 40% in instructor-led conceptual sessions. Participants work in pairs during lab exercises. All lab code, benchmark scripts, and reference configurations are provided and retained by participants after the programme. A GPU-enabled cloud sandbox is included in the group price; participants using personal or company cloud accounts should budget separately.

What makes it work

  • Define latency and throughput SLOs before benchmarking so hardware and batching decisions are grounded in real requirements (see the SLO-check sketch after this list)
  • Run quantisation accuracy evaluations on a representative task-specific dataset, not just generic benchmarks
  • Instrument inference from day one — teams that add observability early iterate much faster on performance tuning
  • Treat model serving configurations as code (versioned, reviewed, tested) to avoid configuration drift between environments
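
To make the SLO point concrete, a minimal sketch of gating a benchmark run on a p95 latency target; the threshold and the latency samples are illustrative:

    # Illustrative SLO gate over benchmark latencies.
    P95_TARGET_S = 0.5  # example latency SLO agreed before benchmarking

    def p95(samples):
        s = sorted(samples)
        return s[int(0.95 * (len(s) - 1))]

    # Illustrative per-request latencies (seconds) from a benchmark run.
    latencies = [0.21, 0.27, 0.29, 0.30, 0.33, 0.34, 0.38, 0.45, 0.48, 0.49]

    measured = p95(latencies)
    print(f"p95 = {measured:.2f}s (target {P95_TARGET_S}s)")
    assert measured <= P95_TARGET_S, \
        "batching/hardware config misses the latency SLO"

Running this as a CI step against versioned serving configs also supports the configuration-as-code practice above.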

Common mistakes

  • Defaulting to simple synchronous REST serving without considering continuous batching, leading to severe GPU underutilisation at scale
  • Applying aggressive quantisation (INT4) without validating accuracy degradation on the specific task, causing silent quality regressions in production
  • Over-provisioning GPU instances based on peak load without implementing autoscaling, resulting in excessive infrastructure costs
  • Ignoring KV-cache sizing and eviction policies, causing unpredictable latency spikes under concurrent load (see the sizing sketch below)
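
To make the KV-cache point concrete, a back-of-envelope sizing calculation; all parameters are illustrative assumptions for an 8B-class model with grouped-query attention:

    # Illustrative KV-cache sizing; all numbers are assumptions.
    n_layers = 32
    n_kv_heads = 8          # grouped-query attention
    head_dim = 128
    bytes_per_elem = 2      # FP16
    max_seq_len = 4096
    concurrent_requests = 64

    # Factor of 2 accounts for keys and values, per token per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    cache_bytes = bytes_per_token * max_seq_len * concurrent_requests
    print(f"{bytes_per_token / 1024:.0f} KiB per token, "
          f"{cache_bytes / 1024**3:.1f} GiB at full concurrency")

At 32 GiB for the cache alone, a single 40 GB A100 leaves little headroom for the model weights themselves, which is exactly the failure mode behind the latency spikes described above.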

When NOT to take this training

This training is not appropriate for teams that have not yet trained or fine-tuned a model and are still evaluating whether ML is the right solution — they need a strategy or literacy programme first, not infrastructure depth.


This training is part of a Data & AI catalogue built for leaders who are serious about execution. Run the free diagnostic to see which trainings should be prioritised for your team.