
AI TRAINING

Embeddings & Semantic Search Foundations

Build a working semantic search system using embeddings, similarity indices, and reranking techniques.

Format
bootcamp
Duration
14–24h
Level
practitioner
Group size
6–16
Price / participant
€1K–€3K
Group price
€12K–€28K
Audience
Software engineers, ML engineers, and data scientists building search or retrieval features
Prerequisites
Proficiency in Python; familiarity with basic ML concepts and REST APIs; no prior vector DB experience required

What it covers

This hands-on training covers the full semantic search pipeline, from selecting and fine-tuning embedding models to chunking strategies, vector indexing, and reranking. Participants build a functional semantic search prototype by the end of the training. The format combines short concept modules with guided coding labs, and targets engineers and data practitioners who want to move beyond keyword search. Learners leave with reusable code patterns and a clear mental model for integrating semantic search into production systems.
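As a taste of the lab material, here is a minimal sketch of one chunking strategy covered in the training (fixed-size windows with overlap). It is illustrative only, and splits on words rather than tokens for simplicity:

```python
# Illustrative sketch of fixed-size chunking with overlap; real pipelines
# would typically split on tokens, not whitespace-separated words.
def chunk_fixed(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words, each overlapping the previous by `overlap`."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

Tuning `size` and `overlap` per document type is exactly the kind of decision the labs make participants measure rather than guess.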

What you'll be able to do

  • Select and justify the right embedding model for a given domain and latency budget
  • Design and implement a chunking pipeline that preserves semantic coherence across document types
  • Build and query a vector index (FAISS or Qdrant) from scratch in Python
  • Add a cross-encoder reranker to a bi-encoder retrieval pipeline and measure the quality uplift
  • Evaluate retrieval quality using MRR and Recall@K on a labelled test set
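To make the "build and query a vector index" outcome concrete, here is a NumPy stand-in for a flat inner-product index (what FAISS calls `IndexFlatIP`); with L2-normalized vectors, inner product equals cosine similarity. The class name and shape are illustrative, not from the course materials:

```python
import numpy as np

# Minimal stand-in for a flat inner-product index (cf. FAISS IndexFlatIP).
# Vectors are L2-normalized on insert, so inner product == cosine similarity.
class FlatIndex:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, embeddings: np.ndarray) -> None:
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, embeddings / norms])

    def search(self, query: np.ndarray, k: int = 5):
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q          # cosine similarity to every stored vector
        top = np.argsort(-scores)[:k]      # indices of the k best matches
        return top, scores[top]
```

Brute-force search like this is exact; the ANN modules (HNSW, IVF) trade a little recall for much lower latency at scale.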

Topics covered

  • Embedding model taxonomy: dense vs sparse, open-source vs API-based
  • Text chunking strategies: fixed-size, sentence, semantic, and recursive splitting
  • Vector similarity metrics and their trade-offs: cosine, dot product, Euclidean
  • Vector databases and ANN indexes: FAISS, Qdrant, Weaviate, pgvector
  • Approximate nearest-neighbour search algorithms (HNSW, IVF)
  • Reranking with cross-encoders and bi-encoder pipelines
  • Evaluation metrics for retrieval quality: MRR, NDCG, Recall@K
  • Production considerations: latency, scaling, hybrid search (BM25 + dense)
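One intuition the metrics module builds can be verified in a few lines: on L2-normalized vectors, cosine, dot product, and Euclidean distance all induce the same ranking, since ‖a − b‖² = 2 − 2·cos(a, b) when both vectors are unit length. A quick illustrative check (not course code):

```python
import numpy as np

# On unit vectors, cosine == dot product, and Euclidean distance is a
# monotone function of cosine, so all three metrics rank documents identically.
rng = np.random.default_rng(42)
docs = rng.normal(size=(20, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = rng.normal(size=16)
q /= np.linalg.norm(q)

cos = docs @ q                        # cosine similarity (== dot product here)
l2 = np.linalg.norm(docs - q, axis=1) # Euclidean distance

rank_by_cos = np.argsort(-cos)        # highest similarity first
rank_by_l2 = np.argsort(l2)           # smallest distance first
assert (rank_by_cos == rank_by_l2).all()
```

The trade-offs only bite when vectors are *not* normalized (dot product then rewards magnitude), which is why the choice matters for some embedding models and not others.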

Delivery

Typically delivered over 2–3 days in-person or live-virtual (Zoom/Teams). Each half-day block pairs a 30-minute concept session with a 90-minute guided coding lab using Jupyter notebooks. Participants receive a GitHub repo with starter code, pre-indexed datasets, and solution branches. Remote delivery works well; in-person is preferred for the debugging-heavy indexing labs. A cloud sandbox (Google Colab Pro or provisioned GPU instance) is provided so participants can run experiments without local setup friction.

What makes it work

  • Start with a real internal document corpus during labs — participants retain far more when data is familiar
  • Benchmark hybrid search (BM25 + dense) against pure dense from day one to build intuition
  • Pair engineers with a data owner who can label a small golden test set for immediate evaluation
  • Follow up with a short architecture review session 2–4 weeks post-training to unblock production decisions
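The hybrid-search benchmarking advice above comes down to score fusion. A hypothetical sketch of the simplest variant, min-max normalizing each score list and blending with a tunable weight (function name and weights are illustrative; production systems often use reciprocal rank fusion instead):

```python
import numpy as np

# Hypothetical hybrid-search score fusion: normalize BM25 and dense scores
# to [0, 1], then take a weighted sum. alpha=1.0 is pure lexical, 0.0 pure dense.
def hybrid_scores(bm25: np.ndarray, dense: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    def minmax(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * minmax(bm25) + (1 - alpha) * minmax(dense)

bm25 = np.array([12.0, 3.0, 7.5, 0.0])   # toy lexical scores
dense = np.array([0.2, 0.9, 0.4, 0.1])   # toy cosine similarities
fused = hybrid_scores(bm25, dense, alpha=0.5)
best = int(np.argmax(fused))             # doc 1 wins despite a middling BM25 score
```

Benchmarking this blend against pure dense retrieval on a familiar corpus is the fastest way to build the intuition the bullet describes.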

Common mistakes

  • Using a single generic embedding model across all domains without evaluating domain-specific alternatives
  • Ignoring chunk size and overlap tuning, leading to poor retrieval precision on long documents
  • Skipping reranking entirely and assuming ANN retrieval quality is sufficient for production
  • Neglecting evaluation: shipping semantic search without a labelled test set or baseline comparison
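The last mistake is cheap to avoid: the two metrics named in the outcomes take a few lines each. A pure-Python sketch, where `results` maps each query to its ranked doc ids and `relevant` maps it to the gold set (names are illustrative):

```python
# Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query.
def mrr(results: dict, relevant: dict) -> float:
    total = 0.0
    for q, ranked in results.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)

# Recall@K: average fraction of each query's relevant docs found in the top K.
def recall_at_k(results: dict, relevant: dict, k: int) -> float:
    total = 0.0
    for q, ranked in results.items():
        total += len(set(ranked[:k]) & relevant[q]) / len(relevant[q])
    return total / len(results)
```

Even a small hand-labelled golden set scored this way gives a baseline to compare chunking, model, and reranking changes against.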

When NOT to take this

A team that has not yet shipped any ML model to production and is still debating whether to use AI at all — they need an AI literacy or use-case scoping workshop first, not a hands-on embeddings bootcamp.

This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.