
AI TRAINING

Model compression and quantisation for production

Deploy lighter, faster models without sacrificing accuracy, using quantisation, pruning, and distillation.

Format
Bootcamp
Duration
16–24h
Level
Advanced
Group size
6–16
Price / participant
€2K–€4K
Group price
€18K–€45K
Audience
ML engineers and AI infrastructure engineers responsible for deploying models to production environments
Prerequisites
Solid Python skills, hands-on experience training or fine-tuning neural networks, and familiarity with at least one deep learning framework (PyTorch preferred)

What it covers

This practitioner-level programme covers the full range of model compression techniques: quantisation (GPTQ, AWQ, GGUF), structured and unstructured pruning, and knowledge distillation, with a rigorous analysis of the trade-offs involved. Participants work through hands-on labs compressing real LLMs and CNNs, measuring latency, memory footprint, and accuracy degradation across different hardware targets (GPU, CPU, edge). By the end of the programme, engineers can select and apply the right compression strategy for a given set of deployment constraints and integrate it into an existing MLOps pipeline.
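
To make the quantisation labs concrete, here is a minimal sketch of loading an LLM with 4-bit weights via Hugging Face transformers and bitsandbytes. The checkpoint name is a placeholder, and NF4 via bitsandbytes is an adjacent technique chosen for brevity; the actual labs may use the GPTQ, AWQ, or GGUF toolchains named above instead.

```python
# Minimal 4-bit post-training quantisation sketch (bitsandbytes NF4 via
# transformers). Model name is a placeholder; the course labs may use
# GPTQ/AWQ/GGUF toolchains rather than this config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative LLaMA-family checkpoint

# NF4 4-bit weights with bf16 compute, the common "QLoRA-style" setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Smoke test: generate a few tokens and check the memory footprint.
inputs = tokenizer("Model compression lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```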

By the end, you will be able to

  • Apply GPTQ, AWQ, and GGUF quantisation to a production LLM and measure the resulting accuracy-latency trade-off
  • Design and execute a pruning strategy that reduces model size by at least 50% while maintaining acceptable benchmark scores (a minimal pruning sketch follows this list)
  • Build a knowledge distillation pipeline transferring a large teacher model's capabilities to a smaller student model
  • Select the appropriate compression technique given a specific hardware target, latency budget, and accuracy threshold
  • Integrate a model compression step into an existing MLOps CI/CD pipeline with automated regression benchmarking
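
For the pruning outcome above, here is a minimal sketch of magnitude pruning with PyTorch's built-in torch.nn.utils.prune utilities, shown on a toy model for illustration:

```python
# Magnitude pruning sketch: zero out the 50% smallest-magnitude weights
# in every Linear layer, verify sparsity, then make the masks permanent.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Check global sparsity before baking the masks in.
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
total = sum(m.weight.nelement() for m in linears)
zeros = sum(int((m.weight == 0).sum()) for m in linears)
print(f"Sparsity: {zeros / total:.1%}")

# Remove the reparameterisation so the pruned weights are permanent.
for module in linears:
    prune.remove(module, "weight")
```

Note that unstructured pruning only zeroes entries inside dense tensors; realising actual size or speed gains requires sparse storage or structured pruning, which is exactly the trade-off the labs measure.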

Topics covered

  • Post-training quantisation: GPTQ, AWQ, GGUF formats and toolchains
  • Quantisation-aware training (QAT) and mixed-precision strategies
  • Structured vs unstructured pruning: magnitude, movement, and lottery ticket approaches
  • Knowledge distillation: teacher-student frameworks, task-specific and task-agnostic distillation (a loss sketch follows this list)
  • Accuracy vs latency vs memory trade-off curves and how to navigate them
  • Hardware-aware compression: GPU (CUDA), CPU (ONNX Runtime, OpenVINO), and edge (TFLite, CoreML)
  • Benchmarking compressed models: perplexity, MMLU, throughput, and memory profiling
  • Integrating compression into CI/CD and MLOps pipelines
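
As a reference point for the distillation topic, here is the classic soft-target distillation loss in PyTorch; the temperature and blending weight are illustrative defaults, not course-mandated values.

```python
# Classic soft-target distillation loss (Hinton et al.): KL divergence
# between temperature-softened teacher and student distributions,
# blended with ordinary cross-entropy on the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soften both distributions; the T^2 factor rescales gradients
    # back to the magnitude of the unsoftened loss.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```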

Delivery

Delivered over 3 days (in-person or virtual, instructor-led), split roughly into 40% theory and 60% hands-on labs. Participants need access to a GPU instance (A10 or better recommended; a cloud sandbox is provided if needed). Lab materials include pre-trained LLaMA-family and ResNet models for the compression exercises. A shared benchmarking dashboard is used throughout to compare participant results. Remote delivery uses breakout rooms for paired lab sessions. On-site delivery works well in a data-centre-adjacent tech hub or corporate AI lab setting.
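
The kind of measurement fed into that dashboard can be as simple as the following sketch, assuming a CUDA device and a model callable on a prepared batch:

```python
# Minimal latency/memory benchmark: median wall-clock latency over
# warmed-up runs plus peak GPU memory for a single configuration.
import time
import statistics
import torch

@torch.inference_mode()
def benchmark(model, batch, warmup: int = 5, runs: int = 50):
    for _ in range(warmup):           # warm up kernels and caches
        model(batch)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        model(batch)
        torch.cuda.synchronize()      # wait for async GPU work to finish
        latencies.append(time.perf_counter() - start)

    return {
        "p50_ms": statistics.median(latencies) * 1e3,
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```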

What makes it work

  • Establish a benchmark suite on domain-specific data before compression begins, so regressions are caught early
  • Pair compression experiments with a structured trade-off log that records size, latency, memory, and accuracy for every configuration
  • Involve DevOps and infrastructure engineers early to ensure compressed model formats are compatible with the serving stack
  • Run compression as part of the CI pipeline so every model update is automatically profiled against baseline metrics (a regression-gate sketch follows this list)
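
A regression gate of that kind might look like the following hypothetical sketch; the metric names, JSON file format, and tolerances are assumptions for illustration, not a prescribed standard.

```python
# Hypothetical CI regression gate: exit non-zero if a compressed model
# regresses beyond agreed tolerances versus the stored baseline.
import json
import sys

TOLERANCES = {
    "perplexity": 1.05,   # compressed may be at most 5% worse
    "latency_ms": 1.00,   # compressed must be at least as fast
}

def gate(baseline_path: str, candidate_path: str) -> int:
    baseline = json.load(open(baseline_path))
    candidate = json.load(open(candidate_path))
    failures = [
        f"{metric}: {candidate[metric]:.3f} vs baseline {baseline[metric]:.3f}"
        for metric, tol in TOLERANCES.items()
        if candidate[metric] > baseline[metric] * tol
    ]
    for failure in failures:
        print(f"REGRESSION {failure}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```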

Common mistakes

  • Applying aggressive INT4 quantisation without validating on domain-specific evaluation sets, leading to silent accuracy regressions in production
  • Treating compression as a one-time step rather than integrating it into the retraining and fine-tuning lifecycle
  • Ignoring hardware-specific kernel support, causing theoretically compressed models to run slower on target hardware
  • Conflating model size reduction with inference speed improvement without profiling actual throughput on the deployment target

When NOT to take this training

This training is not the right fit for a team that has not yet deployed a baseline model to production — if the team is still experimenting with model selection or fine-tuning, compression optimisation is premature and the investment will not deliver measurable value.


This training is part of a Data & AI catalogue built for leaders who are serious about execution. Run the free diagnostic to see which trainings should be prioritised for your team.