
AI TRAINING

Model Compression and Quantisation for Production

Ship smaller, faster models without sacrificing accuracy by mastering quantisation, pruning, and distillation.

Format
bootcamp
Duration
16–24h
Level
advanced
Group size
6–16
Price / participant
€2K–€4K
Group price
€18K–€45K
Audience
ML engineers and AI infrastructure engineers responsible for deploying models to production environments
Prerequisites
Solid Python skills, hands-on experience training or fine-tuning neural networks, and familiarity with at least one deep learning framework (PyTorch preferred)

What it covers

This practitioner-level programme teaches ML engineers the full spectrum of model compression techniques — quantisation (GPTQ, AWQ, GGUF), structured and unstructured pruning, and knowledge distillation — with a focus on measurable trade-off analysis. Participants work through hands-on labs compressing real LLMs and CNNs, benchmarking latency, memory footprint, and accuracy degradation across hardware targets (GPU, CPU, edge). By the end, engineers can select and apply the right compression strategy for a given deployment constraint and integrate it into an existing MLOps pipeline.
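As a taste of the simplest end of that spectrum, here is a minimal sketch of post-training dynamic quantisation in PyTorch, which converts linear-layer weights to INT8 in a single call. The toy model is illustrative only; the labs work with the LLaMA-family and ResNet models mentioned below, and GPTQ/AWQ/GGUF involve dedicated toolchains beyond this one-liner.

```python
import io

import torch
import torch.nn as nn

# Toy stand-in for a real model; the labs use far larger LLMs and CNNs.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantisation: weights stored as INT8,
# activations quantised on the fly at inference time (CPU path).
quantised = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialised_bytes(m: nn.Module) -> int:
    # Serialised state_dict size is a rough proxy for memory footprint.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print(serialised_bytes(model), serialised_bytes(quantised))
```

Even this crude comparison makes the course's central point: footprint drops immediately, but accuracy and latency on the actual target hardware still have to be measured, not assumed.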

What you'll be able to do

  • Apply GPTQ, AWQ, and GGUF quantisation to a production LLM and measure the resulting accuracy-latency trade-off
  • Design and execute a pruning strategy that reduces model size by at least 50% while maintaining acceptable benchmark scores
  • Build a knowledge distillation pipeline transferring a large teacher model's capabilities to a smaller student model
  • Select the appropriate compression technique given a specific hardware target, latency budget, and accuracy threshold
  • Integrate a model compression step into an existing MLOps CI/CD pipeline with automated regression benchmarking
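To make the distillation outcome above concrete, here is a minimal sketch of a Hinton-style soft-label distillation loss with temperature scaling. Function and parameter names are illustrative assumptions, not the course's lab code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (scaled by T^2, per Hinton et al.)
    with the ordinary hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy batch: gradients flow into the student logits only.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)  # teacher outputs are treated as constants
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```

The temperature softens both distributions so the student learns from the teacher's relative class rankings, not just its top prediction.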

Topics covered

  • Post-training quantisation: GPTQ, AWQ, GGUF formats and toolchains
  • Quantisation-aware training (QAT) and mixed-precision strategies
  • Structured vs unstructured pruning: magnitude, movement, and lottery ticket approaches
  • Knowledge distillation: teacher-student frameworks, task-specific and task-agnostic distillation
  • Accuracy vs latency vs memory trade-off curves and how to navigate them
  • Hardware-aware compression: GPU (CUDA), CPU (ONNX Runtime, OpenVINO), and edge (TFLite, CoreML)
  • Benchmarking compressed models: perplexity, MMLU, throughput, and memory profiling
  • Integrating compression into CI/CD and MLOps pipelines
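For the pruning topics above, PyTorch's built-in pruning utilities give a quick feel for unstructured magnitude pruning. A minimal sketch on a single layer (the labs apply this at model scale):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Unstructured L1 (magnitude) pruning: zero out the 50% smallest weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")  # ~0.50

# Make the pruning permanent (removes the mask and reparametrisation).
prune.remove(layer, "weight")
```

Note that zeroed weights only translate into real speedups when the serving stack has sparse-kernel support, which is exactly the hardware-aware point the course drives home.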

Delivery

Delivered over 3 days (in-person or virtual instructor-led), split roughly 40% theory and 60% hands-on labs. Participants need access to a GPU instance (A10 or better recommended; cloud sandbox provided if needed). Lab materials include pre-trained LLaMA-family and ResNet models for compression exercises. A shared benchmarking dashboard is used throughout to compare participant results. Remote delivery uses breakout rooms for pair-lab sessions. On-site delivery works well in a data-centre-adjacent tech hub or corporate AI lab setting.

What makes it work

  • Establish a benchmark suite on domain-specific data before compression begins, so regressions are caught early
  • Pair compression experiments with a structured trade-off log that records size, latency, memory, and accuracy for every configuration
  • Involve DevOps and infrastructure engineers early to ensure compressed model formats are compatible with the serving stack
  • Run compression as part of the CI pipeline so every model update is automatically profiled against baseline metrics
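The structured trade-off log in the second bullet can be as simple as a small record type that computes deltas against the uncompressed baseline. A minimal sketch with illustrative field names and made-up numbers:

```python
from dataclasses import dataclass, field

@dataclass
class CompressionRun:
    config: str        # e.g. "int4-awq-group128" (hypothetical label)
    size_mb: float
    latency_ms: float  # p50 on the target hardware
    memory_mb: float
    accuracy: float    # domain-specific benchmark score

@dataclass
class TradeoffLog:
    baseline: CompressionRun
    runs: list = field(default_factory=list)

    def record(self, run: CompressionRun) -> dict:
        """Log a run and return its deltas vs the uncompressed baseline."""
        self.runs.append(run)
        return {
            "config": run.config,
            "size_reduction": 1 - run.size_mb / self.baseline.size_mb,
            "speedup": self.baseline.latency_ms / run.latency_ms,
            "accuracy_drop": self.baseline.accuracy - run.accuracy,
        }

log = TradeoffLog(CompressionRun("fp16-baseline", 13000, 220, 15000, 0.71))
delta = log.record(CompressionRun("int4-awq", 3900, 95, 5200, 0.69))
```

Keeping every configuration in one log makes regressions visible the moment a new compression run lands, rather than after deployment.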

Common mistakes

  • Applying aggressive INT4 quantisation without validating on domain-specific evaluation sets, leading to silent accuracy regressions in production
  • Treating compression as a one-time step rather than integrating it into the retraining and fine-tuning lifecycle
  • Ignoring hardware-specific kernel support, causing theoretically compressed models to run slower on target hardware
  • Conflating model size reduction with inference speed improvement without profiling actual throughput on the deployment target
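The last mistake above is avoidable with even a crude wall-clock benchmark on the deployment target. A minimal sketch, assuming CPU inference (on GPU you would additionally call torch.cuda.synchronize() around each timing to get honest numbers):

```python
import time

import torch
import torch.nn as nn

@torch.no_grad()
def median_latency_ms(model: nn.Module, x: torch.Tensor,
                      warmup: int = 5, iters: int = 20) -> float:
    """Median wall-clock forward-pass latency in milliseconds."""
    model.eval()
    for _ in range(warmup):   # discard warmup runs (allocator, caches)
        model(x)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(x)
        times.append((time.perf_counter() - t0) * 1000)
    return sorted(times)[len(times) // 2]

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
ms = median_latency_ms(model, torch.randn(8, 512))
```

Comparing this number before and after compression, on the actual target hardware, is what separates a genuine speedup from a smaller file that runs no faster.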

When NOT to take this

This training is not the right fit for a team that has not yet deployed a baseline model to production — if the team is still experimenting with model selection or fine-tuning, compression optimisation is premature and the investment will not deliver measurable value.


This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.