
AI TRAINING

Vector Databases 101: Foundations for AI Engineers

Build production-ready vector search systems by mastering indexing, hybrid search, and database selection tradeoffs.

Format
bootcamp
Duration
12–20h
Level
practitioner
Group size
5–16
Price / participant
€1K–€3K
Group price
€8K–€20K
Audience
Software engineers, data engineers, and ML engineers building AI-powered search or retrieval systems
Prerequisites
Proficiency in Python; familiarity with REST APIs and basic machine learning concepts; prior exposure to embeddings or NLP is helpful but not required

What it covers

This hands-on technical training covers the core concepts behind vector databases, including embedding storage, approximate nearest neighbour search, and indexing algorithms such as HNSW and IVF. Participants compare leading solutions — Pinecone, Qdrant, Weaviate, and pgvector — through live coding exercises and benchmark tasks. The course addresses hybrid search patterns, cost and latency tradeoffs, and operational concerns such as scaling, filtering, and metadata management. By the end, engineers can confidently select, deploy, and query a vector database suited to their production workload.
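The ANN indexes mentioned above (HNSW, IVF) all approximate the same baseline operation: exact similarity ranking over every stored vector. A minimal pure-Python sketch of that baseline, with illustrative function names not tied to any provider SDK:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn(query, vectors, k=2):
    """Exact k-nearest-neighbour search by brute force.

    ANN indexes such as HNSW or IVF approximate exactly this
    ranking, trading a little recall for sub-linear query time.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

Brute force is O(n) per query, which is why it only serves as a correctness and recall baseline once collections grow beyond a few hundred thousand vectors.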

What you'll be able to do

  • Configure and query at least two vector databases (e.g. Qdrant and pgvector) against a real dataset using Python client libraries
  • Implement a hybrid search pipeline combining dense vector similarity with keyword filters and rank the results by relevance
  • Evaluate and justify the choice of vector database for a given workload based on latency, cost, and scalability benchmarks
  • Integrate a vector store into a retrieval-augmented generation (RAG) pipeline with chunking, upsert, and top-k retrieval
  • Apply metadata filtering and namespace strategies to support multi-tenant or multi-collection production deployments
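The hybrid search pattern from the second bullet can be sketched in a few lines. This is a hypothetical in-memory stand-in (the `docs` structure and function names are illustrative), not the API of any particular vector database:

```python
def hybrid_search(query_vec, keyword, docs, top_k=3):
    """Hybrid search sketch: pre-filter candidates by keyword,
    then rank the survivors by dense cosine similarity.

    `docs` is a list of dicts with 'text' and 'vec' keys — an
    in-memory stand-in for a real vector store collection.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    # Keyword pre-filter (a stand-in for BM25 or metadata filters).
    candidates = [d for d in docs if keyword.lower() in d["text"].lower()]
    # Rank the filtered set by vector similarity.
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in candidates[:top_k]]
```

Real systems often run the dense and sparse legs in parallel and fuse the two rankings (e.g. reciprocal rank fusion) rather than filtering first, but the filter-then-rank form shown here is the simplest place to start.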

Topics covered

  • Vector embeddings: generation, dimensionality, and semantic meaning
  • ANN indexing algorithms: HNSW, IVF, PQ, and ScaNN
  • Platform comparison: Pinecone, Qdrant, Weaviate, and pgvector
  • Hybrid search: combining dense vectors with BM25/keyword filters
  • Metadata filtering, namespaces, and multi-tenancy patterns
  • Cost modelling: managed vs self-hosted, storage vs query tradeoffs
  • Latency benchmarking and performance tuning
  • Integrating vector DBs into RAG and semantic search pipelines

Delivery

Delivered over two to three days, either in-person or fully remote via a shared cloud environment (participants receive pre-configured Jupyter notebooks and vector DB sandboxes). Approximately 60% of time is hands-on lab work; 40% is instructor-led concept explanation and live demonstration. A capstone exercise requires participants to build and benchmark a semantic search endpoint from scratch. Materials include a reference architecture cheat sheet, provider comparison matrix, and cost estimation spreadsheet.
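The benchmarking half of the capstone boils down to measuring per-query latency percentiles. A minimal sketch, assuming `run_query` is any zero-argument callable wrapping a search call (the helper name is illustrative):

```python
import time
import statistics

def benchmark_query(run_query, n_runs=100):
    """Measure per-query latency and report p50/p95 in milliseconds.

    `run_query` is any zero-argument callable that executes one
    search — e.g. a lambda wrapping a vector DB client call.
    """
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
    }
```

Reporting p95 rather than the mean matters because ANN query latency is typically long-tailed; a provider comparison based on averages alone can hide exactly the tail behaviour that breaks a latency SLA.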

What makes it work

  • Running benchmarks on actual production-representative data during training, not toy datasets
  • Establishing a cost and latency SLA before selecting a provider, using the comparison matrix from day one
  • Pairing training with an immediate internal proof-of-concept that uses the organisation's own embeddings
  • Including a platform-agnostic abstraction layer in the architecture so the vector DB can be swapped without rewriting application code
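The abstraction-layer idea from the last bullet can be sketched with a typing `Protocol`. The method names here are a hypothetical minimal interface, not any vendor's SDK; a Qdrant- or pgvector-backed class would satisfy the same contract:

```python
from typing import Protocol, Sequence, runtime_checkable

@runtime_checkable
class VectorStore(Protocol):
    """Provider-agnostic interface so application code never
    imports a vendor SDK directly (method names are illustrative)."""

    def upsert(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...
    def query(self, vector: Sequence[float], top_k: int) -> list: ...

class InMemoryStore:
    """Toy implementation used to show the swap-in pattern."""

    def __init__(self):
        self._data = {}

    def upsert(self, ids, vectors):
        self._data.update(zip(ids, (list(v) for v in vectors)))

    def query(self, vector, top_k):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        ranked = sorted(self._data, key=lambda i: dot(vector, self._data[i]), reverse=True)
        return ranked[:top_k]
```

Because the application depends only on `VectorStore`, swapping providers means writing one new adapter class rather than touching every call site.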

Common mistakes

  • Choosing a managed cloud vector DB for every use case without evaluating self-hosted alternatives that may be 5–10× cheaper at scale
  • Ignoring metadata filtering design upfront, leading to full-scan fallbacks that destroy latency at production volumes
  • Treating chunk size as a trivial detail, resulting in poor retrieval recall that is later misattributed to the LLM
  • Skipping benchmark tests on representative data and query distributions before committing to an indexing algorithm
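The chunk-size pitfall above is easy to make concrete. A fixed-size chunker with overlap is the usual baseline; the default values here are placeholders that should be validated against retrieval recall, not guessed:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Fixed-size character chunking with overlap.

    chunk_size and overlap are tuning knobs: chunks that are too
    large dilute the embedding, chunks that are too small lose
    context — both degrade recall in ways often blamed on the LLM.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Token-aware or sentence-boundary chunking usually outperforms raw character slicing, but even this baseline makes chunk size an explicit, benchmarkable parameter rather than a buried constant.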

When NOT to take this

This training is not the right fit for a team that has not yet decided to use LLMs or semantic search in production — if the use case is still hypothetical, a broader AI foundations workshop should come first to validate the problem before investing in infrastructure-level skills.


This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.