
AI TRAINING

Model compression and quantisation for production

Deploy lighter, faster models without sacrificing accuracy, using quantisation, pruning, and distillation.

Format
Bootcamp
Duration
16–24h
Level
Advanced
Group size
6–16
Price / participant
€2K–€4K
Group price
€18K–€45K
Audience
ML engineers and AI infrastructure engineers responsible for deploying models to production environments
Prerequisites
Solid Python skills, hands-on experience training or fine-tuning neural networks, and familiarity with at least one deep learning framework (PyTorch preferred)

What it covers

This practitioner-level programme covers the full range of model compression techniques: quantisation (GPTQ, AWQ, GGUF), structured and unstructured pruning, and knowledge distillation, with a rigorous analysis of the trade-offs involved. Participants work through hands-on labs compressing real LLMs and CNNs, measuring latency, memory footprint, and accuracy degradation across different hardware targets (GPU, CPU, edge). By the end of the programme, engineers can select and apply the right compression strategy for a given set of deployment constraints and integrate it into an existing MLOps pipeline.
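
To make the quantisation labs concrete, here is a minimal sketch of loading an LLM with 4-bit weights via Hugging Face transformers and bitsandbytes. The checkpoint name is a placeholder, and NF4 via bitsandbytes is an adjacent technique chosen for brevity; the actual labs may use the GPTQ, AWQ, or GGUF toolchains named above instead.

```python
# Minimal 4-bit post-training quantisation sketch (bitsandbytes NF4 via
# transformers). Model name is a placeholder; the course labs may use
# GPTQ/AWQ/GGUF toolchains rather than this config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative LLaMA-family checkpoint

# NF4 4-bit weights with bf16 compute, the common "QLoRA-style" setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Smoke test: generate a few tokens and check the memory footprint.
inputs = tokenizer("Model compression lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```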

By the end, you will be able to

  • Apply GPTQ, AWQ, and GGUF quantisation to a production LLM and measure the resulting accuracy-latency trade-off
  • Design and execute a pruning strategy that reduces model size by at least 50% while maintaining acceptable benchmark scores (a minimal pruning sketch follows this list)
  • Build a knowledge distillation pipeline transferring a large teacher model's capabilities to a smaller student model
  • Select the appropriate compression technique given a specific hardware target, latency budget, and accuracy threshold
  • Integrate a model compression step into an existing MLOps CI/CD pipeline with automated regression benchmarking
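
For the pruning outcome above, here is a minimal sketch of magnitude pruning with PyTorch's built-in torch.nn.utils.prune utilities, shown on a toy model for illustration:

```python
# Magnitude pruning sketch: zero out the 50% smallest-magnitude weights
# in every Linear layer, verify sparsity, then make the masks permanent.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Check global sparsity before baking the masks in.
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
total = sum(m.weight.nelement() for m in linears)
zeros = sum(int((m.weight == 0).sum()) for m in linears)
print(f"Sparsity: {zeros / total:.1%}")

# Remove the reparameterisation so the pruned weights are permanent.
for module in linears:
    prune.remove(module, "weight")
```

Note that unstructured pruning only zeroes entries inside dense tensors; realising actual size or speed gains requires sparse storage or structured pruning, which is exactly the trade-off the labs measure.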

Topics covered

  • Post-training quantisation: GPTQ, AWQ, GGUF formats and toolchains
  • Quantisation-aware training (QAT) and mixed-precision strategies
  • Structured vs unstructured pruning: magnitude, movement, and lottery ticket approaches
  • Knowledge distillation: teacher-student frameworks, task-specific and task-agnostic distillation (a loss sketch follows this list)
  • Accuracy vs latency vs memory trade-off curves and how to navigate them
  • Hardware-aware compression: GPU (CUDA), CPU (ONNX Runtime, OpenVINO), and edge (TFLite, CoreML)
  • Benchmarking compressed models: perplexity, MMLU, throughput, and memory profiling
  • Integrating compression into CI/CD and MLOps pipelines
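
As a reference point for the distillation topic, here is the classic soft-target distillation loss in PyTorch; the temperature and blending weight are illustrative defaults, not course-mandated values.

```python
# Classic soft-target distillation loss (Hinton et al.): KL divergence
# between temperature-softened teacher and student distributions,
# blended with ordinary cross-entropy on the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soften both distributions; the T^2 factor rescales gradients
    # back to the magnitude of the unsoftened loss.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```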

Delivery

Delivered over 3 days (in-person or virtual, instructor-led), split roughly into 40% theory and 60% hands-on labs. Participants need access to a GPU instance (A10 or better recommended; a cloud sandbox is provided if needed). Lab materials include pre-trained LLaMA-family and ResNet models for the compression exercises. A shared benchmarking dashboard is used throughout to compare participant results. Remote delivery uses breakout rooms for paired lab sessions. On-site delivery works well in a data-centre-adjacent tech hub or corporate AI lab setting.
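
The kind of measurement fed into that dashboard can be as simple as the following sketch, assuming a CUDA device and a model callable on a prepared batch:

```python
# Minimal latency/memory benchmark: median wall-clock latency over
# warmed-up runs plus peak GPU memory for a single configuration.
import time
import statistics
import torch

@torch.inference_mode()
def benchmark(model, batch, warmup: int = 5, runs: int = 50):
    for _ in range(warmup):           # warm up kernels and caches
        model(batch)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        model(batch)
        torch.cuda.synchronize()      # wait for async GPU work to finish
        latencies.append(time.perf_counter() - start)

    return {
        "p50_ms": statistics.median(latencies) * 1e3,
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```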

What makes it work

  • Establish a benchmark suite on domain-specific data before compression begins, so regressions are caught early
  • Pair compression experiments with a structured trade-off log that records size, latency, memory, and accuracy for every configuration
  • Involve DevOps and infrastructure engineers early to ensure compressed model formats are compatible with the serving stack
  • Run compression as part of the CI pipeline so every model update is automatically profiled against baseline metrics (a regression-gate sketch follows this list)
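
A regression gate of that kind might look like the following hypothetical sketch; the metric names, JSON file format, and tolerances are assumptions for illustration, not a prescribed standard.

```python
# Hypothetical CI regression gate: exit non-zero if a compressed model
# regresses beyond agreed tolerances versus the stored baseline.
import json
import sys

TOLERANCES = {
    "perplexity": 1.05,   # compressed may be at most 5% worse
    "latency_ms": 1.00,   # compressed must be at least as fast
}

def gate(baseline_path: str, candidate_path: str) -> int:
    baseline = json.load(open(baseline_path))
    candidate = json.load(open(candidate_path))
    failures = [
        f"{metric}: {candidate[metric]:.3f} vs baseline {baseline[metric]:.3f}"
        for metric, tol in TOLERANCES.items()
        if candidate[metric] > baseline[metric] * tol
    ]
    for failure in failures:
        print(f"REGRESSION {failure}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```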

Common mistakes

  • Applying aggressive INT4 quantisation without validating on domain-specific evaluation sets, leading to silent accuracy regressions in production
  • Treating compression as a one-time step rather than integrating it into the retraining and fine-tuning lifecycle
  • Ignoring hardware-specific kernel support, causing theoretically compressed models to run slower on target hardware
  • Conflating model size reduction with inference speed improvement without profiling actual throughput on the deployment target

When NOT to take this training

This training is not the right fit for a team that has not yet deployed a baseline model to production — if the team is still experimenting with model selection or fine-tuning, compression optimisation is premature and the investment will not deliver measurable value.


This training is part of a Data & AI catalogue built for leaders who are serious about execution. Run the free diagnostic to see which trainings should be prioritised for your team.