Improving Fault Tolerance with RPC Fallbacks in DoorDash’s Microservices

Failures in a large, complex microservice architecture are inevitable, so built-in fault tolerance (retries, replication, and fallbacks) is a critical part of preventing system-wide outages and a negative user experience.
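
As a rough illustration of how retries and a fallback can combine around a single remote call, here is a minimal Python sketch. It is not DoorDash's implementation: the helper name `call_with_fallback`, its parameters, and the policy of treating every exception as retryable are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    backoff_seconds: float = 0.0,
) -> T:
    """Try `primary` up to `retries + 1` times, then return `fallback()`."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Assumption for this sketch: every exception is retryable.
            # A real client would usually retry only transient errors.
            if attempt < retries:
                # Exponential backoff between attempts (no sleep after the last one).
                time.sleep(backoff_seconds * (2 ** attempt))
    # All attempts failed: degrade gracefully instead of propagating the error.
    return fallback()
```

A caller might pass the real RPC as `primary` and a cached or default response as `fallback`, so that users see a degraded result rather than an error when the dependency is down.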

Using Fault Injection Testing to Improve DoorDash Reliability 

Three steps are key to preventing outages in microservice applications, especially those that depend on cloud services: identify the potential causes of system failure, prepare for them, and test countermeasures before failure occurs.