AI TRAINING
Fine-Tuning LLMs: When, How, and Why
Decide confidently whether to fine-tune, prompt, or use RAG — then execute it correctly.
What it covers
Participants work through a structured decision framework comparing prompting, retrieval-augmented generation, and fine-tuning across cost, latency, and quality dimensions. The programme covers dataset curation, instruction-tuning formats, LoRA/QLoRA techniques, evaluation design, and deployment cost modelling. Hands-on labs use open-source tooling (Hugging Face, Axolotl, LM Evaluation Harness) on realistic domain datasets. By the end, teams can confidently scope, execute, and evaluate a fine-tuning project in their own infrastructure.
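The decision framework above can be sketched as a first-pass helper. This is an illustrative sketch only; the inputs, thresholds, and ordering are assumptions for demonstration, not the course's actual decision tree.

```python
# Hypothetical sketch of a prompting vs RAG vs fine-tuning decision tree.
# The questions and their ordering are illustrative assumptions.

def choose_approach(prompt_baseline_meets_quality: bool,
                    needs_private_knowledge: bool,
                    knowledge_changes_often: bool,
                    needs_style_or_format_control: bool) -> str:
    """Return a first-pass recommendation for an LLM use case."""
    if prompt_baseline_meets_quality:
        return "prompting"          # cheapest and fastest to ship
    if needs_private_knowledge and knowledge_changes_often:
        return "rag"                # keep volatile knowledge out of the weights
    if needs_style_or_format_control:
        return "fine-tuning"        # bake behaviour into the model itself
    return "rag"                    # default: try retrieval before training

print(choose_approach(False, True, True, False))  # -> rag
```

In practice each branch would be validated against a measured baseline rather than answered by intuition.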
What you'll be able to do
- Apply a structured decision tree to determine whether prompting, RAG, or fine-tuning is the right approach for a given use case
- Curate and format a domain-specific instruction dataset suitable for supervised fine-tuning
- Run a QLoRA fine-tuning job on an open-source model using Hugging Face TRL or Axolotl
- Design and execute an evaluation suite combining automated metrics and LLM-as-judge scoring
- Estimate total cost of ownership (GPU compute, storage, inference) for a fine-tuned model vs hosted API alternatives
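The dataset-curation skill above boils down to mapping raw records into a standard instruction format. A minimal sketch, assuming hypothetical `question`/`answer` source fields and targeting the widely used Alpaca-style `instruction`/`input`/`output` JSONL layout:

```python
import json

# Illustrative: convert raw Q&A records into Alpaca-style instruction JSONL
# for supervised fine-tuning. The source field names are assumptions.

raw_records = [
    {"question": "What is QLoRA?",
     "answer": "A parameter-efficient fine-tuning method over a 4-bit quantised base model."},
]

def to_alpaca(record: dict) -> dict:
    return {
        "instruction": record["question"],
        "input": "",                    # no extra context for this record
        "output": record["answer"],
    }

# One JSON object per line, as most trainers expect for .jsonl files
lines = [json.dumps(to_alpaca(r), ensure_ascii=False) for r in raw_records]
print(lines[0])
```

Real curation adds deduplication, length filtering, and manual spot checks on top of this mechanical conversion.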
Topics covered
- Prompting vs RAG vs fine-tuning: a cost-quality-latency decision tree
- Dataset curation, cleaning, and instruction-format design (JSONL, ShareGPT, Alpaca)
- Full fine-tuning vs parameter-efficient methods: LoRA, QLoRA, prefix-tuning
- Supervised fine-tuning (SFT) and RLHF/DPO alignment techniques
- Evaluation frameworks: BLEU, ROUGE, LLM-as-judge, domain-specific benchmarks
- Tooling selection: Hugging Face TRL, Axolotl, LLaMA-Factory, OpenAI fine-tune API
- Infrastructure and cost modelling: GPU hours, cloud vs on-prem, quantisation tradeoffs
- Deployment and monitoring of fine-tuned models in production
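To make the QLoRA and tooling topics concrete, here is a hedged sketch of what an Axolotl-style run configuration looks like. Field names follow Axolotl's config schema as commonly documented; exact keys and defaults vary by version, and every value below (model, dataset path, hyperparameters) is an illustrative assumption, not a recommendation.

```yaml
# Sketch of a QLoRA fine-tune config in the Axolotl style (values are examples)
base_model: meta-llama/Llama-3.1-8B      # assumed base model
load_in_4bit: true                        # QLoRA: 4-bit quantised base weights
adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true                  # apply LoRA to all linear layers

datasets:
  - path: data/train.jsonl                # assumed path
    type: alpaca                          # instruction/input/output records

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2.0e-4
output_dir: ./outputs/qlora-run
```

Consult the current Axolotl documentation before running; the point of the sketch is the shape of the config, not its specific values.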
Delivery
Delivered over 2–3 days, either in-person or fully remote via video conferencing with shared cloud GPU environments (e.g., Lambda Labs, RunPod, or AWS). Approximately 60% hands-on labs, 40% instruction and discussion. Participants receive a pre-configured notebook repository and retain access to lab materials post-training. A short async pre-work module (2–3 hours) on transformer fundamentals is recommended for mixed-level cohorts.
What makes it work
- Define a measurable evaluation benchmark before writing a single training example
- Start with the smallest model that meets quality requirements to minimise compute cost
- Invest heavily in dataset quality and diversity — the model's behaviour mirrors its training data
- Track experiments rigorously (Weights & Biases, MLflow) to enable reproducibility and regression detection
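Defining the benchmark first means reducing every run to one comparable number. A minimal sketch of blending an automated metric with LLM-as-judge scores into such a signal; the weighting, scales, and function name are assumptions for illustration:

```python
from statistics import mean

# Illustrative: blend an automated metric (0-1, e.g. ROUGE-L) with
# LLM-as-judge ratings (1..judge_max) into one regression-detection score.
# The 60/40 weighting is an assumption, not a prescription.

def suite_score(auto_metrics: list[float], judge_scores: list[int],
                judge_max: int = 5, judge_weight: float = 0.6) -> float:
    auto = mean(auto_metrics)
    judge = mean(judge_scores) / judge_max   # normalise judge ratings to 0-1
    return judge_weight * judge + (1 - judge_weight) * auto

baseline  = suite_score([0.42, 0.38], [3, 4, 3])   # frozen base model
candidate = suite_score([0.45, 0.41], [4, 4, 5])   # fine-tuned checkpoint
print(candidate > baseline)  # -> True: the fine-tune beats the baseline
```

Logging this single score per checkpoint in an experiment tracker makes regressions visible at a glance.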
Common mistakes
- Fine-tuning when a well-crafted system prompt or RAG pipeline would solve the problem at a fraction of the cost
- Using too little or poorly cleaned training data, producing a model that overfits or degrades on out-of-distribution inputs
- Neglecting evaluation design before training — leading to no reliable signal on whether the fine-tune actually improved the model
- Ignoring inference cost and latency implications of larger fine-tuned models compared to smaller prompted alternatives
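The last mistake is avoidable with back-of-envelope arithmetic before training. A sketch of the comparison; every price and volume below is an illustrative assumption, not a quote:

```python
# Illustrative cost comparison: self-hosting a fine-tuned model on a
# dedicated GPU vs paying a hosted API per token. All figures are assumed.

def self_hosted_monthly(gpu_hourly_usd: float, hours_per_month: float = 730) -> float:
    """Cost of keeping one GPU instance up all month."""
    return gpu_hourly_usd * hours_per_month

def api_monthly(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of the same traffic through a per-token hosted API."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

gpu = self_hosted_monthly(1.50)           # assumed $1.50/hr GPU -> $1,095/mo
api = api_monthly(200_000_000, 2.00)      # assumed 200M tokens at $2/M -> $400/mo
print(f"self-hosted ~${gpu:,.0f}/mo vs API ~${api:,.0f}/mo")
```

At these assumed volumes the hosted API wins; the crossover point shifts with traffic, model size, and utilisation, which is exactly what the cost-modelling module teaches participants to compute.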
When NOT to take this
A team that has never shipped an LLM-powered feature to production and is jumping straight to fine-tuning to avoid prompt engineering work should hold off: validate the use case with prompting first, before taking on fine-tuning's complexity and cost.
This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.