
AI TRAINING

Model Compression and Quantisation for Production

Ship smaller, faster models without sacrificing accuracy by mastering quantisation, pruning, and distillation.

Format
bootcamp
Duration
16–24h
Level
advanced
Group size
6–16
Price / participant
€2K–€4K
Group price
€18K–€45K
Audience
ML engineers and AI infrastructure engineers responsible for deploying models to production environments
Prerequisites
Solid Python skills, hands-on experience training or fine-tuning neural networks, and familiarity with at least one deep learning framework (PyTorch preferred)

What it covers

This practitioner-level programme teaches ML engineers the full spectrum of model compression techniques — quantisation (GPTQ, AWQ, GGUF), structured and unstructured pruning, and knowledge distillation — with a focus on measurable trade-off analysis. Participants work through hands-on labs compressing real LLMs and CNNs, benchmarking latency, memory footprint, and accuracy degradation across hardware targets (GPU, CPU, edge). By the end, engineers can select and apply the right compression strategy for a given deployment constraint and integrate it into an existing MLOps pipeline.
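As a taste of the simplest end of that spectrum, here is a minimal sketch of post-training dynamic quantisation in PyTorch, which converts linear-layer weights to INT8 in a single call. The toy model is illustrative only; the labs work with the LLaMA-family and ResNet models mentioned below, and GPTQ/AWQ/GGUF involve dedicated toolchains beyond this one-liner.

```python
import io

import torch
import torch.nn as nn

# Toy stand-in for a real model; the labs use far larger LLMs and CNNs.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantisation: weights stored as INT8,
# activations quantised on the fly at inference time (CPU path).
quantised = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialised_bytes(m: nn.Module) -> int:
    # Serialised state_dict size is a rough proxy for memory footprint.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print(serialised_bytes(model), serialised_bytes(quantised))
```

Even this crude comparison makes the course's central point: footprint drops immediately, but accuracy and latency on the actual target hardware still have to be measured, not assumed.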

What you'll be able to do

  • Apply GPTQ, AWQ, and GGUF quantisation to a production LLM and measure the resulting accuracy-latency trade-off
  • Design and execute a pruning strategy that reduces model size by at least 50% while maintaining acceptable benchmark scores
  • Build a knowledge distillation pipeline transferring a large teacher model's capabilities to a smaller student model
  • Select the appropriate compression technique given a specific hardware target, latency budget, and accuracy threshold
  • Integrate a model compression step into an existing MLOps CI/CD pipeline with automated regression benchmarking
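To make the distillation outcome above concrete, here is a minimal sketch of a Hinton-style soft-label distillation loss with temperature scaling. Function and parameter names are illustrative assumptions, not the course's lab code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (scaled by T^2, per Hinton et al.)
    with the ordinary hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy batch: gradients flow into the student logits only.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)  # teacher outputs are treated as constants
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```

The temperature softens both distributions so the student learns from the teacher's relative class rankings, not just its top prediction.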

Topics covered

  • Post-training quantisation: GPTQ, AWQ, GGUF formats and toolchains
  • Quantisation-aware training (QAT) and mixed-precision strategies
  • Structured vs unstructured pruning: magnitude, movement, and lottery ticket approaches
  • Knowledge distillation: teacher-student frameworks, task-specific and task-agnostic distillation
  • Accuracy vs latency vs memory trade-off curves and how to navigate them
  • Hardware-aware compression: GPU (CUDA), CPU (ONNX Runtime, OpenVINO), and edge (TFLite, CoreML)
  • Benchmarking compressed models: perplexity, MMLU, throughput, and memory profiling
  • Integrating compression into CI/CD and MLOps pipelines
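For the pruning topics above, PyTorch's built-in pruning utilities give a quick feel for unstructured magnitude pruning. A minimal sketch on a single layer (the labs apply this at model scale):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Unstructured L1 (magnitude) pruning: zero out the 50% smallest weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")  # ~0.50

# Make the pruning permanent (removes the mask and reparametrisation).
prune.remove(layer, "weight")
```

Note that zeroed weights only translate into real speedups when the serving stack has sparse-kernel support, which is exactly the hardware-aware point the course drives home.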

Delivery

Delivered over 3 days (in-person or virtual instructor-led), split roughly 40% theory and 60% hands-on labs. Participants need access to a GPU instance (A10 or better recommended; cloud sandbox provided if needed). Lab materials include pre-trained LLaMA-family and ResNet models for compression exercises. A shared benchmarking dashboard is used throughout to compare participant results. Remote delivery uses breakout rooms for pair-lab sessions. On-site delivery works well in a data-centre-adjacent tech hub or corporate AI lab setting.

What makes it work

  • Establish a benchmark suite on domain-specific data before compression begins, so regressions are caught early
  • Pair compression experiments with a structured trade-off log that records size, latency, memory, and accuracy for every configuration
  • Involve DevOps and infrastructure engineers early to ensure compressed model formats are compatible with the serving stack
  • Run compression as part of the CI pipeline so every model update is automatically profiled against baseline metrics
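The structured trade-off log in the second bullet can be as simple as a small record type that computes deltas against the uncompressed baseline. A minimal sketch with illustrative field names and made-up numbers:

```python
from dataclasses import dataclass, field

@dataclass
class CompressionRun:
    config: str        # e.g. "int4-awq-group128" (hypothetical label)
    size_mb: float
    latency_ms: float  # p50 on the target hardware
    memory_mb: float
    accuracy: float    # domain-specific benchmark score

@dataclass
class TradeoffLog:
    baseline: CompressionRun
    runs: list = field(default_factory=list)

    def record(self, run: CompressionRun) -> dict:
        """Log a run and return its deltas vs the uncompressed baseline."""
        self.runs.append(run)
        return {
            "config": run.config,
            "size_reduction": 1 - run.size_mb / self.baseline.size_mb,
            "speedup": self.baseline.latency_ms / run.latency_ms,
            "accuracy_drop": self.baseline.accuracy - run.accuracy,
        }

log = TradeoffLog(CompressionRun("fp16-baseline", 13000, 220, 15000, 0.71))
delta = log.record(CompressionRun("int4-awq", 3900, 95, 5200, 0.69))
```

Keeping every configuration in one log makes regressions visible the moment a new compression run lands, rather than after deployment.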

Common mistakes

  • Applying aggressive INT4 quantisation without validating on domain-specific evaluation sets, leading to silent accuracy regressions in production
  • Treating compression as a one-time step rather than integrating it into the retraining and fine-tuning lifecycle
  • Ignoring hardware-specific kernel support, causing theoretically compressed models to run slower on target hardware
  • Conflating model size reduction with inference speed improvement without profiling actual throughput on the deployment target
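The last mistake above is avoidable with even a crude wall-clock benchmark on the deployment target. A minimal sketch, assuming CPU inference (on GPU you would additionally call torch.cuda.synchronize() around each timing to get honest numbers):

```python
import time

import torch
import torch.nn as nn

@torch.no_grad()
def median_latency_ms(model: nn.Module, x: torch.Tensor,
                      warmup: int = 5, iters: int = 20) -> float:
    """Median wall-clock forward-pass latency in milliseconds."""
    model.eval()
    for _ in range(warmup):   # discard warmup runs (allocator, caches)
        model(x)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(x)
        times.append((time.perf_counter() - t0) * 1000)
    return sorted(times)[len(times) // 2]

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
ms = median_latency_ms(model, torch.randn(8, 512))
```

Comparing this number before and after compression, on the actual target hardware, is what separates a genuine speedup from a smaller file that runs no faster.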

When NOT to take this

This training is not the right fit for a team that has not yet deployed a baseline model to production — if the team is still experimenting with model selection or fine-tuning, compression optimisation is premature and the investment will not deliver measurable value.


This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.