Fault-Tolerant Edge Robotics: Stateful Failure-Mitigation Framework
ROS 2, Kubernetes, Behavior Trees, OpenSCENARIO 2, Gazebo, MoveIt 2, Nav2, Prometheus, Docker
Project Overview
Work carried out during my master’s thesis (Aug 2024 - Jan 2025) at TU Braunschweig and Intel Labs.
- Built a reactive framework that detects application and communication failures and selects one of four fallback strategies (Restart, Warm Stand-by, Running Stand-by, or Hot Stand-by) depending on the task's real-time needs.
- Designed and implemented a Behavior-Tree-based monitoring system derived from OpenSCENARIO 2 task definitions. It evaluates ROS 2 topics, pod health, and latency thresholds in real time and triggers recovery only when an active task is impacted.
- Validated on a mobile manipulator (arm + base) in Gazebo and on real hardware; workloads include Nav2 navigation and MoveIt 2 manipulation.
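The task-aware trigger described above can be illustrated with a minimal sketch. It is plain Python rather than the actual Behavior Tree implementation, and every name in it (`HealthSample`, `Thresholds`, `should_recover`) as well as the threshold values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One monitoring snapshot; all field names are hypothetical."""
    pod_ready: bool      # e.g. the result of a pod readiness check
    topic_age_s: float   # seconds since the last message on a watched ROS 2 topic
    latency_ms: float    # measured latency between robot and edge server


@dataclass
class Thresholds:
    max_topic_age_s: float
    max_latency_ms: float


def is_unhealthy(sample: HealthSample, limits: Thresholds) -> bool:
    """Return True if any monitored signal violates its limit."""
    return (
        not sample.pod_ready
        or sample.topic_age_s > limits.max_topic_age_s
        or sample.latency_ms > limits.max_latency_ms
    )


def should_recover(sample: HealthSample, limits: Thresholds, task_active: bool) -> bool:
    """Trigger recovery only when a failure coincides with an active task."""
    return task_active and is_unhealthy(sample, limits)


# Placeholder limits, not values from the thesis.
limits = Thresholds(max_topic_age_s=1.0, max_latency_ms=50.0)
crashed = HealthSample(pod_ready=False, topic_age_s=3.0, latency_ms=20.0)

print(should_recover(crashed, limits, task_active=False))  # False: no task depends on the pod
print(should_recover(crashed, limits, task_active=True))   # True: an active task is impacted
```

Gating on `task_active` is what separates this from a generic health probe: a crashed pod that no running task needs does not trigger a fail-over.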
Problem & Motivation
Modern robots increasingly offload heavy computation, such as perception, SLAM, and planning, to edge servers, allowing them to run advanced algorithms without the weight and power draw of onboard GPUs. However, this shift introduces new risks: a single container crash or 5G network dropout can leave a robot stranded mid-task. Cloud-native fail-over tools such as Kubernetes restarts are fine for stateless web apps, but they are too slow for ROS 2 nodes that hold a live map or trajectory, and they lose that context on restart. I set out to give edge-deployed robots a state-preserving, real-time recovery path.
Challenges & Solutions
- State lost on pod restart → implemented a Task Proxy that re-publishes the last goal pose after recovery.
- False alarms from generic health probes → added task-aware checks in the BT monitor before declaring failure.
- Need for flexible standby strategies → designed a YAML policy layer so operators can switch between Restart, Warm Stand-by, Running Stand-by and Hot Stand-by at runtime with no code changes.
- Measuring real-time impact → combined ROS 2 bag recordings with cAdvisor metrics to trace detection, spin-up and hand-over events for each strategy.
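The Task Proxy idea from the first bullet can be sketched as a small goal cache. This is an illustrative simplification in plain Python, not the thesis code: the `TaskProxy` class, its method names, and the `Pose` tuple standing in for a ROS 2 pose message are all assumptions.

```python
from typing import Callable, Optional, Tuple

# (x, y, yaw): a simplified stand-in for a ROS 2 pose message (assumption).
Pose = Tuple[float, float, float]


class TaskProxy:
    """Forwards goals to a backend and remembers the last one for replay."""

    def __init__(self, send_goal: Callable[[Pose], None]) -> None:
        self._send_goal = send_goal
        self._last_goal: Optional[Pose] = None

    def submit(self, goal: Pose) -> None:
        """Cache the goal, then forward it to the current backend."""
        self._last_goal = goal
        self._send_goal(goal)

    def complete(self) -> None:
        """Forget the goal once the task finishes so it is not replayed."""
        self._last_goal = None

    def on_recovered(self) -> bool:
        """Re-send the cached goal after recovery; True if a goal was replayed."""
        if self._last_goal is None:
            return False
        self._send_goal(self._last_goal)
        return True


sent = []                        # records every goal the backend receives
proxy = TaskProxy(sent.append)
proxy.submit((2.0, 1.5, 0.0))    # the original goal
proxy.on_recovered()             # the backend came back: replay the goal
print(sent)                      # [(2.0, 1.5, 0.0), (2.0, 1.5, 0.0)]
```

Because the cache lives in the proxy rather than in the restarted pod, the goal survives the restart. Clearing it in `complete()` keeps a finished task from being replayed by a later, unrelated recovery.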
Key Takeaways
- Warm Stand-by is the sweet spot: a pre-initialised pod gives fast recovery without a noticeable CPU hit.
- Hot Stand-by is seamless but costly: parallel replicas guarantee instant takeover but double the compute budget, so this strategy is best reserved for safety-critical robots.
- Task-aware monitoring cuts noise: checking whether a task is active before triggering recovery eliminates unnecessary restarts and log spam.
- Behavior Trees centralise recovery logic: the mission, monitoring rules and fail-over actions live in one declarative file, making the system easy to audit and extend.
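Because the strategies trade recovery speed against compute cost, the choice is best made per workload. The sketch below shows one way a policy layer could map workloads to strategies and be reloaded at runtime. It assumes the YAML policy file has already been parsed into a plain dictionary. The `PolicyStore` class, the workload keys, the strategy strings, and the Restart default are hypothetical, not the thesis schema.

```python
from enum import Enum
from typing import Dict


class Strategy(Enum):
    RESTART = "restart"
    WARM_STANDBY = "warm_standby"
    RUNNING_STANDBY = "running_standby"
    HOT_STANDBY = "hot_standby"


class PolicyStore:
    """Maps workload names to fallback strategies; reloadable at runtime."""

    def __init__(self, default: Strategy = Strategy.RESTART) -> None:
        self._default = default  # assumed fallback for unlisted workloads
        self._by_workload: Dict[str, Strategy] = {}

    def load(self, parsed_policy: Dict[str, str]) -> None:
        """Replace the mapping with one parsed from the policy file.

        An unknown strategy name raises ValueError before the old mapping
        is touched, so a bad edit cannot leave the store half-updated.
        """
        new_mapping = {name: Strategy(value) for name, value in parsed_policy.items()}
        self._by_workload = new_mapping

    def strategy_for(self, workload: str) -> Strategy:
        return self._by_workload.get(workload, self._default)


store = PolicyStore()
store.load({"nav2": "warm_standby", "moveit2": "restart"})
print(store.strategy_for("nav2").value)     # warm_standby

store.load({"nav2": "hot_standby"})         # operator edits the policy; no code change
print(store.strategy_for("nav2").value)     # hot_standby
print(store.strategy_for("moveit2").value)  # restart (no entry, so the default applies)
```

Validating the whole mapping before swapping it in means a typo in the policy file is rejected as a unit and the running robot keeps its previous strategies.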