AI TRAINING
Model Compression and Quantisation for Production
Ship smaller, faster models without sacrificing accuracy by mastering quantisation, pruning, and distillation.
What it covers
This practitioner-level programme teaches ML engineers the full spectrum of model compression techniques — quantisation (GPTQ, AWQ, GGUF), structured and unstructured pruning, and knowledge distillation — with a focus on measurable trade-off analysis. Participants work through hands-on labs compressing real LLMs and CNNs, benchmarking latency, memory footprint, and accuracy degradation across hardware targets (GPU, CPU, edge). By the end, engineers can select and apply the right compression strategy for a given deployment constraint and integrate it into an existing MLOps pipeline.
What you'll be able to do
- Apply GPTQ, AWQ, and GGUF quantisation to a production LLM and measure the resulting accuracy-latency trade-off
- Design and execute a pruning strategy that reduces model size by at least 50% while maintaining acceptable benchmark scores
- Build a knowledge distillation pipeline transferring a large teacher model's capabilities to a smaller student model
- Select the appropriate compression technique given a specific hardware target, latency budget, and accuracy threshold
- Integrate a model compression step into an existing MLOps CI/CD pipeline with automated regression benchmarking
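The 50% size-reduction target above can be illustrated with a minimal magnitude-pruning sketch in NumPy. This is a toy example, not lab material: real pruning is applied layer by layer, often with retraining, but the core idea of zeroing the smallest-magnitude weights is the same.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))          # stand-in for one weight matrix
pruned = magnitude_prune(w, sparsity=0.5)
achieved = 1.0 - np.count_nonzero(pruned) / pruned.size
print(f"achieved sparsity: {achieved:.2%}")
```

Note that unstructured sparsity like this only translates into speed or memory wins when the serving stack has sparse-kernel support, which is exactly the trade-off the labs measure.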
Topics covered
- Post-training quantisation: GPTQ, AWQ, GGUF formats and toolchains
- Quantisation-aware training (QAT) and mixed-precision strategies
- Structured vs unstructured pruning: magnitude, movement, and lottery ticket approaches
- Knowledge distillation: teacher-student frameworks, task-specific and task-agnostic distillation
- Accuracy vs latency vs memory trade-off curves and how to navigate them
- Hardware-aware compression: GPU (CUDA), CPU (ONNX Runtime, OpenVINO), and edge (TFLite, CoreML)
- Benchmarking compressed models: perplexity, MMLU, throughput, and memory profiling
- Integrating compression into CI/CD and MLOps pipelines
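To give a flavour of the post-training quantisation topic, here is a minimal symmetric per-tensor INT8 round-trip in NumPy. This is a simplified sketch: production toolchains such as GPTQ and AWQ use calibration data, per-group scales, and error-compensating updates rather than a single naive scale.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantisation: map floats to [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"memory: {w.nbytes / 1e6:.1f} MB -> {q.nbytes / 1e6:.1f} MB, "
      f"max abs error: {np.abs(w - w_hat).max():.2e}")
```

The 4x memory reduction (FP32 to INT8) is exact; whether it yields a latency win depends on kernel support on the target hardware, which is why the benchmarking topics above sit alongside the formats.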
Delivery
Delivered over 3 days (in-person or virtual instructor-led), split roughly 40% theory and 60% hands-on labs. Participants need access to a GPU instance (A10 or better recommended; cloud sandbox provided if needed). Lab materials include pre-trained LLaMA-family and ResNet models for compression exercises. A shared benchmarking dashboard is used throughout to compare participant results. Remote delivery uses breakout rooms for pair-lab sessions. On-site delivery works well in a data-centre-adjacent tech hub or corporate AI lab setting.
What makes it work
- Establish a benchmark suite on domain-specific data before compression begins, so regressions are caught early
- Pair compression experiments with a structured trade-off log that records size, latency, memory, and accuracy for every configuration
- Involve DevOps and infrastructure engineers early to ensure compressed model formats are compatible with the serving stack
- Run compression as part of the CI pipeline so every model update is automatically profiled against baseline metrics
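The automated regression gate described above can be sketched as a simple metric comparison against the baseline. The metric names, thresholds, and numbers here are hypothetical placeholders for a real benchmark suite, not a prescribed implementation.

```python
def gate(candidate: dict, baseline: dict,
         max_accuracy_drop: float = 0.01,
         max_latency_ratio: float = 1.0) -> list[str]:
    """Return the list of failed checks; an empty list means the candidate passes."""
    failures = []
    if baseline["accuracy"] - candidate["accuracy"] > max_accuracy_drop:
        failures.append("accuracy regression")
    if candidate["latency_ms"] > baseline["latency_ms"] * max_latency_ratio:
        failures.append("latency regression")
    return failures

# Illustrative numbers: an INT8 variant versus its FP16 baseline
baseline = {"accuracy": 0.712, "latency_ms": 84.0}
candidate = {"accuracy": 0.706, "latency_ms": 31.0}
print(gate(candidate, baseline))  # empty list: within the accuracy budget, faster
```

In CI, a non-empty return value would fail the build, so every model update is profiled against baseline metrics before it can ship.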
Common mistakes
- Applying aggressive INT4 quantisation without validating on domain-specific evaluation sets, leading to silent accuracy regressions in production
- Treating compression as a one-time step rather than integrating it into the retraining and fine-tuning lifecycle
- Ignoring hardware-specific kernel support, causing theoretically compressed models to run slower on target hardware
- Conflating model size reduction with inference speed improvement without profiling actual throughput on the deployment target
When NOT to take this
This training is not the right fit for a team that has not yet deployed a baseline model to production. If the team is still experimenting with model selection or fine-tuning, compression optimisation is premature and the investment will not deliver measurable value.
This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.