
AI TRAINING

Reinforcement Learning for Operational Optimisation

Build and deploy RL agents that outperform heuristics on real-world pricing, routing, and scheduling problems.

Format: programme
Duration: 30–45h
Level: advanced
Group size: 6–16
Price / participant: €3K–€6K
Group price: €25K–€55K
Audience: ML engineers and data scientists with supervised learning experience who need to tackle sequential decision-making problems in operations
Prerequisites: solid Python skills, familiarity with NumPy/PyTorch, and hands-on experience training supervised or unsupervised ML models

What it covers

This practitioner-level programme covers the full RL stack: Markov Decision Processes, policy gradient methods (PPO, A3C), value-based approaches (DQN, Rainbow), and multi-agent settings. Participants work in simulation environments (Gymnasium, RLlib) to tackle canonical ops problems — dynamic pricing, vehicle routing, and job-shop scheduling — then learn how to move agents from simulation to production. The programme balances theory lectures (40%) with hands-on coding labs (60%), culminating in a capstone where teams deploy an RL policy against a business KPI benchmark.
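The MDP machinery above rests on the discounted return that every method in the programme optimises. A minimal sketch (the reward sequence and discount factor below are illustrative, not from the course material):

```python
# Discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# Rewards and gamma here are illustrative placeholders.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # fold from the final step backwards
        g = r + gamma * g
    return g

# Three-step episode with reward 1 at every step:
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```

The backwards fold is the same recursion the Bellman equations exploit: the return at step t is the immediate reward plus gamma times the return at step t+1.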

What you'll be able to do

  • Formulate a real ops problem (pricing, routing, scheduling) as an MDP with correctly specified state space, action space, and reward function
  • Implement and tune a PPO agent in RLlib against a custom Gymnasium environment
  • Diagnose and fix common RL failure modes: reward hacking, instability, and slow convergence
  • Compare RL against supervised ML and OR baselines to make a justified build-vs-buy decision
  • Deploy a trained RL policy to a staging environment and monitor it against a business KPI
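The first skill above — formulating a pricing problem as an MDP — can be sketched as a toy environment with a Gymnasium-style reset()/step() API. The demand model, price points, and all numbers below are invented for illustration; a real environment would subclass `gymnasium.Env` and declare observation and action spaces:

```python
import random

class PricingEnv:
    """Toy dynamic-pricing MDP. API loosely follows Gymnasium's
    reset() -> (obs, info) and step() -> (obs, reward, terminated, truncated, info)."""
    PRICES = [8.0, 10.0, 12.0]  # discrete action space: three price points

    def __init__(self, horizon=30, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.inventory = 50
        return (self.t, self.inventory), {}  # state: (time step, stock left)

    def step(self, action):
        price = self.PRICES[action]
        # Toy linear demand curve: higher price -> lower expected demand.
        demand = max(0, round(self.rng.gauss(14 - price, 2)))
        sold = min(demand, self.inventory)
        self.inventory -= sold
        self.t += 1
        reward = sold * price                 # revenue this step
        terminated = self.inventory == 0      # sold out
        truncated = self.t >= self.horizon    # season over
        return (self.t, self.inventory), reward, terminated, truncated, {}

env = PricingEnv()
obs, _ = env.reset()
obs, reward, terminated, truncated, _ = env.step(1)  # charge €10
```

Notice that the reward is per-step revenue: choosing what the reward measures (revenue vs. margin vs. sell-through) is exactly the reward-function design question the programme drills.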

Topics covered

  • Markov Decision Processes: states, actions, rewards, discount factors
  • Value-based methods: DQN, Double DQN, Rainbow
  • Policy gradient methods: REINFORCE, PPO, A3C
  • Simulation environment design with Gymnasium and RLlib
  • Multi-agent RL for fleet and supply-chain settings
  • Dynamic pricing and demand-responsive RL policies
  • Vehicle routing and job-shop scheduling as RL problems
  • Sim-to-real transfer, reward shaping, and safe exploration
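The value-based topic above starts from tabular Q-learning, which the deep variants (DQN, Rainbow) generalise. A self-contained sketch on a made-up 5-state chain (start at state 0, reward 1 for reaching state 4) — the environment and hyperparameters are illustrative only:

```python
import random

# Tabular Q-learning on a 5-state chain. Actions: 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode.
random.seed(0)
N, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection.
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda act: Q[s][act])
        s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: bootstrap on the best next-state value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The learned greedy policy should move right from every non-goal state.
assert all(Q[s][1] > Q[s][0] for s in range(GOAL))
```

DQN replaces the table with a neural network and adds a replay buffer and target network, but the update rule is this same bootstrapped temporal-difference step.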

Delivery

Delivered as a 5-week blended programme: two 3-hour live virtual sessions per week led by an RL practitioner, supplemented by async reading and coding assignments. All labs run on cloud GPU instances (provided); participants need a laptop and a GitHub account. A private Slack workspace supports peer Q&A between sessions. In-person cohort delivery at client premises is available for groups of 10+, adding a full-day capstone hackathon.

What makes it work

  • Start with a small, well-scoped sub-problem where a simulator already exists or can be built cheaply before scaling
  • Involve domain experts (ops managers, logistics engineers) in reward function design and environment validation from day one
  • Establish clear baseline KPIs from OR or rule-based methods before training any agent, so improvement is measurable
  • Run parallel shadow deployments before switching RL policies into production to build stakeholder trust
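The shadow-deployment point above can be sketched as a tiny comparison harness: the candidate policy's actions are logged but never executed, and both policies are scored against the same outcome model. The policies, demand model, and KPI below are invented placeholders, not a prescribed implementation:

```python
# Shadow-deployment sketch: score incumbent and candidate on identical states.
def incumbent(state):
    return 10.0  # stand-in for the current fixed-markup heuristic

def candidate(state):
    # Stand-in for the trained RL policy: price up when demand is high.
    return 12.0 if state["demand"] > 5 else 8.0

def kpi(price, state):
    # Toy revenue proxy: demand shrinks as price rises above 10.
    return price * max(0, state["demand"] - 0.5 * (price - 10))

states = [{"demand": d} for d in (3, 6, 9)]  # logged decision points
live = sum(kpi(incumbent(s), s) for s in states)
shadow = sum(kpi(candidate(s), s) for s in states)
print(f"incumbent KPI={live:.0f}, shadow KPI={shadow:.0f}")
```

Only once the shadow KPI beats the live KPI consistently, across enough logged states to be statistically meaningful, does the RL policy earn the switch to production.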

Common mistakes

  • Designing a reward function that is easy to optimise but misaligned with the true business objective, leading to reward hacking
  • Skipping the simulation fidelity step and attempting sim-to-real transfer with an environment that does not capture key real-world constraints
  • Applying RL to problems where a well-tuned heuristic or mixed-integer programme already delivers near-optimal results at a fraction of the cost
  • Underestimating infrastructure complexity: RL agents in production require continuous monitoring and periodic retraining as environment dynamics shift
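The first mistake — a proxy reward misaligned with the business objective — fits in a few lines. Here the proxy metric is orders accepted while the true objective is profit; the orders and margins are made-up numbers for illustration:

```python
# Reward-hacking sketch: optimising the proxy (order count) hurts profit.
orders = [("A", 40.0), ("B", 15.0), ("C", -25.0)]  # (order id, profit margin)

# Policy shaped by the proxy reward: accept every order.
proxy_policy = [oid for oid, _ in orders]
# Policy aligned with the business objective: accept only profitable orders.
aligned_policy = [oid for oid, margin in orders if margin > 0]

def profit(accepted):
    return sum(m for oid, m in orders if oid in accepted)

# The proxy policy wins on the proxy metric yet loses money on order C.
assert len(proxy_policy) > len(aligned_policy)
assert profit(proxy_policy) < profit(aligned_policy)  # 30.0 < 55.0
```

A trained agent finds such loopholes far more reliably than this hand-written example suggests, which is why the programme has domain experts review every reward function before training starts.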

When NOT to take this

Skip it if your optimisation problem has a stable, fully observable state space and a well-defined objective function that integer programming solvers already handle within acceptable time. In that case, adding RL introduces unnecessary complexity, training cost, and interpretability risk with no measurable gain.

This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.