“What would happen if we removed statement timeouts in our PostgreSQL databases?” That’s one of the questions asked in a management meeting. At the time I only responded that it would be bad: it would cause problems and make it harder to debug. However, I realize now that this is a topic many people don’t dig deeply into, so I’ve decided to share some of the best practices we follow at DoorDash.

So, what would happen? In the best case, we might notice nothing for a few weeks. However, in a fast-growing company like DoorDash, with hundreds of engineers contributing code every day to a complicated system, we would eventually have an outage introduced by unbounded resource usage, the kind of failure that a proper timeout setting can easily surface, if not prevent outright.

Why You Need Timeouts

We live in the real world, and resources are always limited. For actions like creating a connection, making database queries, or executing functions, a timeout won’t make any of them more efficient. However, a timeout creates an upper bound that stops outliers from causing outsized damage.

Timeouts have a direct impact on site latency, and site health is indicated by several signals: throughput, latency, error rate, and saturation. If an exception or return code is captured correctly, you will see the signal in the error rate. If you don’t have enough capacity, you may see the signal in saturation, e.g., web requests will queue up and throughput will drop. Similarly, if you don’t have timeout limits, the error rate won’t increase while resource consumption continues to grow. It is then just a matter of time before the system becomes saturated and throughput drops.

With good timeout settings, damage is kept under control and made more visible. Without timeout settings, your system may fail in unexpected ways and people will struggle to figure out why.

How to Set Timeout

Everything should have a timeout, but what should the value be? It is complicated. For example, PostgreSQL has various timeout-related settings: statement_timeout, lock_timeout, keepalives-related configurations, idle_in_transaction_session_timeout, and so on. Different domains, like queuing systems and web servers, have their own sets of timeouts. A good way to think about it is architecturally, considering your upstreams and downstreams. Tower of Hanoi Timeouts presents a great rule of thumb for structuring nested timeouts. tl;dr:

No child request should be able to exceed the timeout of the parent.

Whether the timeout covers connection creation, locks, or query/transaction execution, please make sure no child request can exceed the timeout of its parent. A timeout normally means some process is forced to exit. A parent-level timeout would hide issues that could have been caught at the child level and make them harder to root cause.
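To make the rule concrete, here is a minimal sketch in Python with illustrative numbers: a hypothetical 5-second request budget (similar in spirit to a web server timeout) bounding the database-level timeouts of the work it triggers. The run_query parameter is a stand-in for your data access layer.

import time

PARENT_TIMEOUT_S = 5.0  # e.g., the web server's overall request timeout

def remaining(deadline):
    # Seconds left in the parent's budget.
    return max(deadline - time.monotonic(), 0.0)

def handle_request(run_query):
    deadline = time.monotonic() + PARENT_TIMEOUT_S
    # Each child timeout is its own cap, clipped to the parent's remaining
    # budget, so no child request can outlive the parent.
    connect_timeout = min(1.0, remaining(deadline))
    statement_timeout = min(2.0, remaining(deadline))
    return run_query(connect_timeout=connect_timeout,
                     statement_timeout=statement_timeout)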

Specifically for Databases

Timeouts are often overlooked in the world of databases. A simple explanation is that many developers are optimistic and try to live in an ideal world of infinite resources. Among all of the resources, CPU and I/O are normally the most impacted by slow queries, which can be gated by statement timeout settings in PostgreSQL. For databases, whenever CPU or I/O is saturated, it’s a disaster. We tested some of our most resource-intensive queries on our staging databases, and AWS Performance Insights suggested we would need an instance type 1,000 times bigger to support them, which of course is not available in AWS (yet).

In AWS Aurora, timeouts can be even more important than in traditional RDS: unexpected I/O may add more pressure than the cluster can handle, forcing replica nodes to restart in order to recover to a healthy state. Aurora has an interesting architecture: all nodes, master and replicas alike, share the same underlying storage layer. You will still see millisecond-level replica lag, which isn’t really related to data replication; it takes the replicas some time to update in-memory structures like B-tree indexes. When an Aurora replica decides it is lagging too much, it triggers a restart to get things consistent quickly.

Statement Timeout Tuning at DoorDash

Inspired by a talk with our friend Marco Almeida, who lowered the PostgreSQL statement timeout to 1s at Thumbtack, we started tuning our statement timeout settings a year ago. The original setting was 30s, which was even longer than our uWSGI Harakiri setting. Currently, our main databases’ default statement timeout setting is 2s, and lowering it further is still a work in progress.
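As a rough illustration, here is how a 2s statement timeout can be applied per session in Python with psycopg2; the connection parameters are placeholders, and a real deployment would more likely set a database-level default with ALTER DATABASE ... SET statement_timeout = '2s'.

import psycopg2

conn = psycopg2.connect(host="db.example.internal", dbname="main", user="app")
with conn.cursor() as cur:
    cur.execute("SET statement_timeout = '2s'")  # cap query execution time
    cur.execute("SET lock_timeout = '1s'")       # fail fast on lock waits
    # Any statement exceeding 2s now raises psycopg2.errors.QueryCanceled,
    # so the failure shows up in error rates instead of silently eating CPU/IO.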

We use several tools to gain visibility and insight into how our statement timeout tuning affects our system. PgAnalyze helps us gather info from pg_stat_activity so that we know how long each query takes. Exceptions caused by transactions hitting the statement timeout are visible to us in Sentry and New Relic. Error rates are reviewed weekly, and we have organizational support to fix the errors because doing so is so important: it improves stability and performance and saves money. In this way we create a feedback loop that makes our system more and more stable.

Proper timeout settings contribute to production stability and debuggability. They are important. Spend time improving these configurations if you care about reliability, and do not think about removing them.

Introduction

As we migrate our systems to a microservices-oriented architecture at DoorDash, we have taken care to balance the excitement around anticipated benefits (development velocity, deployment independence, etc.) with a more measured assessment of our preparedness to work with the complexities introduced by more distributed systems.

One of the most common pain points with microservices — or, for that matter, any distributed system, finer or coarser-grained — is ensuring that service deployments maintain backwards compatibility and that the whole ecosystem plays well together in all environments, not just production.

An integration environment where teams can bring microservices together to ensure compatibility is critical to both development velocity and system correctness; but as the number of services grows, the coordination required to maintain this environment can become a heavy burden for development and infrastructure/operations teams.

At DoorDash, we looked towards applying consumer-driven contract testing as a technique to enforce compatibility testing. Conveniently, a mature and well-supported contract testing framework called Pact exists to enable contract testing between REST services and their consumers.

For our initial use case, we chose to adopt Pact to enforce the contracts between our mobile apps and the mobile backend-for-frontend (BFF).

Development Workflow with Pact

Pact works by connecting two sets of unit tests: one for the service consumer and one for the service provider. The consumer tests contain specifications about an HTTP request and the shape of the response. When the tests are run, the Pact framework turns the specifications into a contract that is uploaded to a repository (known as a broker). The broker is a web service that contains a collection of contracts and the associations between consumers and producers. With that in mind, the overall flow looks like this:

On the consumer:

  1. Write unit tests specifying the request and the response shape
  2. Run the tests, which, if successful, generate a Pact file
  3. Upload the Pact file to the broker

On the producer:

  1. Download the Pact files that involve the producer
  2. Run the tests, which, if successful, generate verification results
  3. Upload the verification results to the broker
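To make the consumer side concrete, here is a hedged sketch of a consumer test written with the pact-python library; our actual tests are in Swift and Kotlin (shown later), and the endpoint, field names, and provider state below are illustrative.

import atexit
import requests
from pact import Consumer, Provider

pact = Consumer("MobileApp").has_pact_with(Provider("MobileBFF"), port=1234)
pact.start_service()  # spins up the local Pact mock service
atexit.register(pact.stop_service)

def test_get_auth():
    (pact
     .given("a registered user exists")
     .upon_receiving("a request for auth info")
     .with_request("get", "/v1/auth")
     .will_respond_with(200, body={"token": "abc123"}))

    with pact:  # verifies the expected interaction actually happened
        response = requests.get(pact.uri + "/v1/auth")
        assert response.json()["token"] == "abc123"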

In our case, the consumers are the iOS and Android mobile apps. Next, we’ll see how to specify the consumer API contract.

iOS App: Pact on Swift

Integrating Pact testing with an iOS application is pretty straightforward, consisting of a few simple steps:

  1. Install the Pact mock service
  2. Install the PactConsumerSwift library
  3. Configure Xcode
  4. Write your test
  5. Upload the verification results to your Pact Broker

Install the Pact Mock Service

The Pact mock service spins up a local server that will generate Pacts when your tests are run. Your tests will provide a series of expectations (fields, HTTP status, etc.) for an endpoint (for example, /v1/auth) that are compiled into a Pact file.

To install the Pact mock service, simply run the following:

https://gist.github.com/f9d02ca3df44c776ec17aa1c3bf5b8db

Install the PactConsumerSwift Library

The PactConsumerSwift library is used by your application to conveniently interact with the mock service to generate pact files. It provides objects and methods that can be used to easily define expectations and pass these expectations to the mock server.

To incorporate the service into your project, simply add the following line to your Podfile:

https://gist.github.com/240fcf009ceeda83d9a50e11ef85e119

Configuring Xcode

In order to have the local pact server spin up when your tests run, we’ll add pre-actions and post-actions to your project target.

Select your target and click Edit Scheme, select the Test phase, and add a new Run Script Action for both pre-actions and post-actions. It should look something like this:

For pre-actions, enter the following and save:

https://gist.github.com/c9e052a711167511b681393fc6c26373

This will spin up the mock service on port 1234.

For post-actions, enter the following and save:

https://gist.github.com/c54f320635e767c9f08324620daf6f06

This simply stops the mock service.

Write Your Tests

A sample test is as follows:

https://gist.github.com/68fe810fbab57b91fdcda3b0613f15d1

This sets up the services that will be making the network calls. The MockService, provided by PactConsumerSwift, is initialized with provider and consumer keys; make sure these match the ones you previously used to set up your Pact broker.

The BootstrapClient is the network layer you are using in your application. Configure it with the base URL provided by the MockService (which will point network calls to your local server).

We use the MockService to start building our expectations. These are customizable and you can add or remove expectations that fit your particular use case. We can specify a response body that we expect, headers, HTTP status, and other related fields.

Make sure the request parameters are configured properly for the request that you want to test as these will be used by the broker to reconcile your Pact with the backend results.

This part of the test runs your network call against the local server (remember that earlier we initialized an instance of our network service with the base URL provided by the MockService). We run it within a closure passed into the run function on MockService so that the MockService can perform any configuration it needs before running the actual call.

And that’s it!

Upload the Verification Results to Your Pact Broker

If you run your tests and they complete successfully, you’ll see that a new Pact file has been generated for you in your /tmp/pacts folder at the root of your project. Now, all we need to do is to upload this to your Pact broker. We’ll do it manually with curl but this can be added to the post-actions of your test phase as well.

https://gist.github.com/8675cde291703d38c87a0aada1dabe9b

And that’s it! If the upload is successful, you’ll see the new Pact file in your Pact broker. Next, we’ll look at a similar example from the perspective of the Android client.

Android App: Pact on Kotlin

Add the Pact gradle plugin:

https://gist.github.com/7caf07ab8de1485d01ddaa49924f3b37

Add a task to let Pact know how to publish to the broker:

https://gist.github.com/92daa2c521b1a2cdf9e0418a2f201f6c

Add these 2 test dependencies:

https://gist.github.com/4041b6ccc8ca9004baff56d56afd9308

Define a Pact test which describes the interaction between the consumer and the provider. First, we define a Pact rule:

https://gist.github.com/ecc4fe5fefeda176d4d67104109f1ace

Create a test function to define the interaction:

https://gist.github.com/bd4234651d6d201ba0a92dc1153fd9f2

Finally, we can define a pact verification test for this pact:

https://gist.github.com/4e698cec81d9232355fe9f25afd1b9fb

Now that the consumer contracts have been defined, we can start to write tests to ensure the contracts are fulfilled by the mobile BFF.

On the Mobile BFF: Pact in Kotlin

Once you’ve defined the contract via the consumer tests, you can move on to verifying the contract. In order to verify the contract, we need a way to start the application and send it the HTTP requests that are part of the contract. In our architecture, the provider is the mobile BFF.

There are two main approaches to verification:

  1. Run the mobile BFF via the command line
  2. Run the mobile BFF via an integration test

Pact provides tooling to accomplish both approaches. Since the mobile BFF makes requests to downstream services, we chose to verify using an integration test, since it would be easier to mock the outgoing HTTP requests. We also considered verifying the contract via the command line, but that would have necessitated building in a “test” mode to the mobile BFF and we thought that would be too much effort compared to simply using an integration test.

To set up Pact for the provider, start by configuring build.gradle:

https://gist.github.com/c05e0014f9c6e3e6e7a2a29ff9e67fe2

This block of configuration sets up the system properties that Pact needs to find and download the necessary contracts from the broker. Since the contracts come from the broker, Pact will automatically try to upload the verification results when the tests are run. This should only happen on CI; otherwise, running the tests on your local workstation will overwrite the values stored in the broker.

Next, add the required dependencies:

https://gist.github.com/c98802ff8837de1fc82760b4192a3d2c

Then, we can start writing a test to verify the contract. Start by creating a JUnit test (we’re using JUnit 5 in this example) and add the necessary annotations to indicate that it is a Pact test.

https://gist.github.com/656f9b7b594f3fa59f33b841b2b49f1a

The @PactBroker annotation indicates that the test should fetch the contract from the broker using the system properties defined in build.gradle. @Provider tells Pact which provider this test is verifying. The last two annotations indicate that this is a Spring Boot integration test and instruct Spring to start the application on a random port. The selected port will be injected into the field annotated with @LocalServerPort.

The actual test is pretty simple.

We first configure the target in the PactVerificationContext. This is the URL where our application is running. Since the port field is injected at runtime, we use that port number to create the URL.

https://gist.github.com/8892f026571457ec50debbc56c92df59

Next, we perform one last piece of configuration. Since we expect the consumers to make authenticated requests, we add the Authentication header into the requests that Pact makes to the mobile BFF. Finally, we call context.verifyInteraction to execute the request.

https://gist.github.com/432035a17fabb0df829011f4d8b5ab05

It may seem strange that this test doesn’t actually have any test methods or assertions. This is because the tests themselves live in the Pact files. The contract created by the consumer is the actual test: it specifies what a request looks like and what the response is expected to contain. The Pact framework takes this specification and automatically turns it into a more traditional unit test. The benefit of this approach is that new contracts can be added by the consumer and they will automatically be executed by the provider when its test suite is run.

Conclusion

Contract testing is a strong supplement to traditional functional testing (see contract vs. functional testing), and contracts remain trustworthy even if API documentation becomes outdated.

Pact is relatively easy to set up and well supported within the community. Stay tuned for follow-up posts on the expanding use of contract-based testing in our systems and how we extend the concept to gRPC-based services!

Overview

  • Introduction
  • What is the assignment problem at DoorDash?
  • What is reinforcement learning?
  • Reinforcement learned assignment
  • Moving forward
  • Conclusion

Introduction

DoorDash recently held our thirteenth hackathon. Hackathons are our opportunity to explore new technologies and moon-shot ideas; they help us stay up-to-date with the world and think 10x. At DoorDash, we’re constantly thinking of ways to improve the customer experience, from reducing delivery times to increasing Dasher efficiency. Algorithms and artificial intelligence help lay the foundation for delightful customer experiences, and for this hackathon, we wanted to see if more advanced techniques could further improve our product. Our team of six applied an artificial intelligence technique called reinforcement learning to the assignment problem at DoorDash, beating our production algorithm in a simulation environment with both quicker delivery speeds and higher Dasher efficiencies. In this post, we will describe what that means and how we accomplished it.

The hackathon team

What is the assignment problem at DoorDash?

DoorDash provides a real-time platform for fulfilling consumer demand for merchant goods via a flexible Dasher fleet. Within DoorDash, the logistics team focuses on the magic that happens behind the scenes from when a consumer submits an order to when they receive that order at their door, including everything from supply/demand forecasting to automated delivery monitoring. The assignment problem is one specific area within logistics that deals with the following question: which Dasher should fulfill which delivery? We aim to make Dasher-delivery assignments that yield quick delivery speeds and healthy Dasher efficiency, where efficiency is the number of deliveries a Dasher performs per unit time. In making these decisions, we need to consider many factors, including but not limited to:

  • Consumer quoted times
  • Estimated order ready times
  • Travel estimations
  • Routing (for multi-delivery assignments)
  • Dasher utilization (percentage of time a Dasher is actively working on a delivery over an entire shift duration)
Figure 1: Assignment problem at DoorDash

The assignment algorithm at DoorDash has been crafted over the years to consider all these factors and more, in order to best serve consumers, merchants, and Dashers. However, given the breadth of the assignment problem, and the fact that we don’t have ground truths for what the best assignments are, improvements to the algorithm do not always come easily. When Hackathon XIII rolled around, we wondered, “Could an artificially intelligent assignment algorithm learn to improve itself?”

What is reinforcement learning?

Reinforcement learning is one of the most popular and powerful artificial intelligence algorithms today. Instances of reinforcement learning have reached mainstream news, such as AlphaGo, the reinforcement learned computer program that defeated the world’s top Go player. In short, the goal of reinforcement learning is to learn the best action given a state of the environment, in order to maximize the overall rewards. Here are the fundamental concepts in reinforcement learning, summarized in Figure 2.

State: The current status of the environment. It represents all the information needed to choose an action.

Action: The set of all possible moves at a state.

Reward: The feedback as a result of the action. Note that rewards can be immediate or delayed.

Policy: The strategy used to choose an action at each state.

Agent: The entity that takes actions and tries to learn the best policy.

Environment: The world that the agent interacts with.

Figure 2: Conceptual overview of reinforcement learning

Here’s an example application to illustrate these concepts. Let’s say we are designing a delivery robot and teaching it to pick up delivery items while avoiding pedestrians. In this case the robot is an agent trying to learn the optimal policy, i.e., the optimal action in each state. The states can be the area the robot is operating in, with the items and pedestrians at various locations, and the actions can be moving left, right, forward, or backward. The robot receives a positive reward when it picks up an item, a negative reward when it runs into a pedestrian, and no reward otherwise. With these settings the robot can learn the optimal policy to pick up all the items in a way that maximizes the reward.

The goal of reinforcement learning is to find the optimal policy. This is not trivial since, unlike in Markov Decision Processes, the rewards and transition probabilities between states are unknown, as seen in Figure 3. For this reason, there are many techniques to either obtain this information (model-based) or obtain the optimal policy directly (model-free), such as Monte Carlo methods, bootstrapping, and SARSA. The most commonly used is Q-learning, a model-free, off-policy method that directly gives us an estimate of the Q function to find the optimal policy.

Figure 3: Markov Decision Process with Unknown Transition Matrix Pij

It is worth mentioning that to use Q-learning, we need to collect data by following an exploration policy. When we are more concerned with learning about the world than maximizing utility, we can choose a no-exploitation, all-exploration policy, which explores the action space by always choosing a random action. When we care more about utility, we need to balance exploration and exploitation, so that we can reap the benefits of what we’ve learned while still not being blind to opportunities that we haven’t yet encountered. One common approach in the second case is the epsilon-greedy algorithm, which explores with probability epsilon and exploits with probability 1-epsilon.
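As a small illustration, here is epsilon-greedy action selection in Python; the action names and Q-values are made up for the example.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the best-known action.
    # q_values: dict mapping action -> estimated Q-value for the current state.
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: pick a random action
    return max(q_values, key=q_values.get)    # exploit: pick the best action

# Hypothetical Q-value estimates for three actions in some state
print(epsilon_greedy({"variant_a": 1.2, "variant_b": 0.7, "variant_c": 1.9}))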

In practice, when there are a large number of states, the state space is featurized and function approximation is used to determine the Q function. This makes the Q function a model instead of a lookup table. Oftentimes, handcrafting features is difficult, so deep learning models like fully connected or convolutional neural networks are used to represent the Q function. This approach is known as a Deep Q-Network (DQN) and is very useful when both feature dimensionality and data volume are high.
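A minimal sketch of such a Q-network in Keras might look like the following; the feature count, action count, and layer sizes are illustrative, not production values.

from tensorflow import keras

NUM_FEATURES = 32  # hypothetical size of the featurized state vector
NUM_ACTIONS = 4    # hypothetical number of actions

q_network = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(NUM_FEATURES,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_ACTIONS),  # one Q-value estimate per action
])
q_network.compile(optimizer="adam", loss="mse")  # regress toward target Q-values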

Reinforcement learned assignment

Now we will discuss how we applied reinforcement learning to the DoorDash assignment problem. To formulate the assignment problem in a way that’s suitable for reinforcement learning, we made the following definitions.

State: The outstanding deliveries and working Dashers, since they represent the current status of the world from an assignment perspective. Note that this means that the state space is practically infinite, since deliveries and Dashers individually can have many different characteristics (pick up location/time, drop off location/time, etc.) and there can be many different combinations of deliveries and Dashers.

Action: The most natural choice is the assignment of Dashers to deliveries. However, even with just 15 deliveries and 15 Dashers, the total number of combinations exceeds one trillion! Therefore, to significantly reduce the action space, we defined the actions to be different variants of the assignment algorithm with different parameters. This can be thought of as intelligent parameter tuning where the model learns which parameters are best for a given state of deliveries and Dashers.

Reward: Combination of delivery speeds and Dasher efficiency. We want deliveries to reach customers as quickly as possible while utilizing Dashers as efficiently as possible. This translates to maximizing delivery speeds and maximizing Dasher efficiency. The reward also includes a penalty for undelivered deliveries, to ensure that all the deliveries are actually assigned to Dashers and delivered to consumers.

With these concepts defined to fit reinforcement learning, we now needed to implement the two key components to actually perform reinforcement learning, the environment and the agent.

Environment: We need a system that can output states (deliveries and Dashers) and rewards (delivery speeds and Dasher efficiencies), as well as take in actions (variants of assignment algorithm) and subsequently update the states and rewards. Our production assignment system fits the bill, but we run the risk of hurting our actual deliveries and customers, since the agent might choose bad actions as it learns via an exploration policy. Fortunately, we already have a simulator for our assignment system that can take in an assignment algorithm and produce simulated assignments that mirror our production system. This assignment simulator is used for obtaining offline results before trying online experiments for the assignment system. Therefore, we used this simulator as our environment, allowing us to train our reinforcement learning model on states/actions/rewards that are accurate to our production system without impacting our customers.

Agent: We chose a deep neural network as the agent, since it makes sense to use a model for the Q function and featurize the states into high dimensional vectors, given our infinite state space. More details about this featurization and model will be covered in the next section.

A high level summary of our problem formulation can be found in Figure 4. In summary, the assignment simulator outputs the state, which the agent uses to choose a variant of the assignment algorithm. The assignment simulator runs the chosen algorithm and outputs the new state and the reward, which are both passed back to the agent. This cycle repeats at a preset time interval.

Figure 4: Reinforcement learning applied to the assignment problem

For the actual implementation, we wrapped the DoorDash assignment simulator into an OpenAI Gym environment and used Keras RL for the agent and training, as the two libraries are compatible out-of-the-box.
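A skeletal sketch of what such a wrapper can look like is below; the simulator interface (reset/step) and the dimensions are assumptions for illustration, not our actual simulator API.

import gym
import numpy as np
from gym import spaces

class AssignmentEnv(gym.Env):
    # States: featurized outstanding deliveries and working Dashers.
    # Actions: a discrete choice among assignment algorithm variants.

    def __init__(self, simulator, num_features=32, num_variants=4):
        self.simulator = simulator  # assumed to expose reset() and step()
        self.action_space = spaces.Discrete(num_variants)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(num_features,), dtype=np.float32)

    def reset(self):
        return self.simulator.reset()  # featurized initial state

    def step(self, action):
        # Run the chosen algorithm variant for one time interval.
        state, reward, done = self.simulator.step(action)
        return state, reward, done, {}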

Deep agent

As previously mentioned, we used a deep neural network as our agent. Recall that the Q function maps state and action pairs to expected rewards. Concretely, the network takes as input the state (represented as different features) and predicts the Q-value for each action, as seen in Figure 5.

Figure 5: Neural network as Q(s, a) approximators

For our problem, the features generated from the states are intended to capture the information about deliveries and Dashers that are useful for predicting the Q-value, which are future delivery speeds and Dasher efficiencies. Some examples of these features are the pick-up to drop-off distances, the ratio of deliveries to Dashers, and the estimated order ready times.

The model itself is a multi-layer dense (fully connected) network. A few different model parameters were tried, but more thorough hyperparameter tuning and architecture exploration will be done in future work.

This model is trained using the deep Q-learning algorithm as presented by Mnih et al. in their Deep Reinforcement Learning paper. Details can be found in the paper, but the general idea is that the environment and agent are used to generate states, actions, and rewards that are stored as experiences. These experiences are then drawn from to create training samples, which the model uses to optimize towards the maximum Q-value.

Results

Evaluation was done by making assignments for one day of data in one region. We first obtained baseline results by running the assignment simulator with the default production assignment algorithm and obtaining the average delivery speeds and Dasher efficiencies.

Then, to evaluate whether our model did any better, we had it pick the best assignment algorithm to use for each state over that same day and obtained the new average delivery speeds and Dasher efficiencies. The reinforcement learned model achieved, on average, a 6-second improvement in delivery speed and a 1.5-second improvement in Dasher efficiency across all deliveries. This may not seem like much, but when millions of deliveries are involved, these seconds quickly add up, and the results demonstrate that reinforcement learning can help with the assignment problem.

Moving forward

We made some progress during the hackathon, but there is always more to do. Here are some ways we would like to improve the agent:

  • More hyperparameter tuning, e.g. learning rate, hidden layer sizes.
  • Adding more features, i.e. generating more features from the states.
  • Structuring features in the form of a 3D grid by placing deliveries/Dashers in the grid based on latitude and longitude. The analogy is an image, which has a height, width, and depth. We can then try convolutional neural networks, which are popular for image based tasks, as the agent.

Conclusion

We have seen how applying reinforcement learning to the assignment problem at DoorDash has yielded an enhanced assignment algorithm. We believe reinforcement learning is a powerful tool that we can use to improve our on-demand logistics platform, and we are excited at the opportunity to further delight our customers using advanced artificial intelligence.

We would love to hear about your production applications of reinforcement learning. If solving these types of problems excites you, come join the team!

Acknowledgments

Thank you to the entire hackathon team for working on this project: Anne Zhou, Gary Ren, Ivy Wang, Richard Hwang, Sifeng Lin, Yixin Tang. And thank you hackathon committee for organizing another fun hackathon!

Walking into the cafeteria on my first day, I could not help but notice five words written on big colorful boards: humble, thoughtful, bold, optimistic, and relentless. These represented the company’s values and I have to admit the first time I read them, I was confused. Of course, I knew what the words meant individually, but put together they almost seemed contradictory. After all, how could someone be both thoughtful and relentless? Or be both humble and bold? As a software engineering intern on the Dispatch team, I witnessed how people at DoorDash lived these values and how seemingly opposing values could exist in a single organization.

My main project this summer was to train a model to detect high-variance routes for use in assignment decision making which can help prevent bad batching and reduce unnecessary customer waiting time. After a couple of weeks, the performance of my model had still not improved above the baseline. Sensing my frustration, Raghav Ramesh, my mentor, told me that 9 out of 10 experiments we ran turned out to be worse than the current implementation. But while the experiments failed to make it to the product, they succeeded in helping us gain a better understanding of our marketplace and build confidence towards the next experiment that could push our product forward. So I persisted and learned the importance of iterating and “failing” quickly.

Another initiative I worked on was a new internal package to help data scientists train and evaluate models. Yixin Tang, the ML engineer who designed and led this project, recognized the shortcomings of the existing package we used and instead of hacking together a workaround for himself, architected a completely different system that could improve the productivity and flexibility for anyone who trained models at the company. This was quite a daunting task as it required all models (and there are a lot of them) to be migrated to this new pipeline. But when the product was nearing completion and Yixin was complimented on its design, he quickly stated that the credit belonged to all the individuals who gave feedback and contributed throughout the development. This quickness to share credit is a characteristic that I believe stems from the mutual respect each person has for one another. Everyone is incredibly smart but also acutely aware of the strengths of others. This awareness helps everyone, myself included, feel appreciated, learn and grow together.

Humble, thoughtful, bold, optimistic, relentless. These are the values that all DoorDash people possess, but each individual is certainly more than what these five words can summarize. I can’t adequately express how grateful I am for having the opportunity to intern at DoorDash and to learn from the amazing people on the Dispatch team. It’s been an amazing summer and I am excited about what’s to come!

A special thanks to my mentor, Raghav Ramesh, and the Dispatch team for an amazing summer.

Overview

Monitoring is hugely important, especially for a site like DoorDash that operates 24/7. In modern-day DevOps, monitoring allows us to be responsive to issues and helps us maintain reliable services. At DoorDash, we use StatsD heavily for custom metrics monitoring. In this blog post, we take a deeper look at how DoorDash uses StatsD within its infrastructure.

What is StatsD?

According to The News Stack:

“StatsD is a standard and, by extension, a set of tools that can be used to send, collect, and aggregate custom metrics from any application. Originally, StatsD referred to a daemon written by Etsy in Node.js. Today, the term StatsD refers to both the protocol used in the original daemon, as well as a collection of software and services that implement this protocol.”

According to Etsy’s blog:

StatsD is built to “Measure Anything, Measure Everything”. Instead of TCP, StatsD uses UDP, which provides desirable speed with as little overhead as possible.

StatsD at DoorDash

At DoorDash, we use StatsD configured with a Wavefront backend to build a responsive monitoring and alerting system. We feed more than 100k pps (points per second) to Wavefront via StatsD, from both the infrastructure side and the application side. We use StatsD and Wavefront to monitor the traffic throughput of different API domains, RabbitMQ queue lengths, uWSGI and Envoy stats, restaurant delivery volumes and statuses, market health, etc. Wavefront’s powerful query language allows us to easily visualize and debug our time series data. The alerting system is flexible enough to notify us through email, Slack, or PagerDuty based on severity. Well-tuned alerts help us lower MTTD (Mean Time To Detect). We encourage our engineers to customize their own metrics to fully monitor their system’s health and performance.
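For a sense of what emitting custom metrics looks like, here is a minimal sketch using the Python statsd client; the host name and metric names are placeholders.

import statsd

client = statsd.StatsClient("statsd-proxy.internal", 8125)  # hypothetical host

client.incr("deliveries.created")          # counter: one more delivery created
client.timing("api.menu.fetch_ms", 112)    # timer: latency in milliseconds
client.gauge("rabbitmq.queue_length", 42)  # gauge: current queue depth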

At a startup, scaling issues come with a high growth rate, and last year the scalability of our metrics infrastructure was challenged. Our Growth engineering team reported that the volume of our end-user activities was much lower than expected. After cross-referencing our tracking data with other sources, we confirmed that the problem lay within our monitoring system. Solving these scaling issues along the way led us to a more scalable infrastructure.

Monitoring infrastructure evolution

  • One StatsD on one AWS EC2

At the beginning, we had one StatsD process running on an eight core AWS EC2 instance.

Illustration of one StatsD on one AWS EC2

The setup was quick and simple; however, when incidents happened, the load average alert on this EC2 instance never fired, even when the StatsD process was overloaded. We didn’t notice the issue until the Growth engineering team got paged for false alarms. Even though we had eight cores on the StatsD EC2 instance, the Etsy version of StatsD we run is single-threaded, so the instance-wide average CPU utilization was never high enough to trigger the alert. We also spent some time learning how the Linux kernel handles UDP requests and gained some visibility into the server’s capacity. By looking at /proc/net/udp, we can find the current queue size for each socket and whether it is dropping packets.

Example of /proc/net/udp file

In the example above, there are lots of dropped packets and a high rx_queue at local_address 00000000:1FBD (0.0.0.0:8125), the socket StatsD listens on.

Meanings of some of the columns:

  • local_address: Hexadecimal local address and port number of the socket.
  • rx_queue: Queue length for incoming UDP datagrams.
  • drops: The number of datagram drops associated with this socket. A non-zero number can indicate that StatsD was overloaded.

More references can be found here.
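As an illustration, a few lines of Python can flag sockets that are dropping datagrams; the column positions follow the standard Linux /proc/net/udp layout shown above.

def dropped_udp_sockets(path="/proc/net/udp"):
    with open(path) as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.split()
            port = int(fields[1].split(":")[1], 16)      # hex local port
            rx_queue = int(fields[4].split(":")[1], 16)  # tx_queue:rx_queue pair
            drops = int(fields[-1])                      # last column: drops
            if drops > 0:
                yield port, rx_queue, drops

for port, rx_queue, drops in dropped_udp_sockets():
    print(f"UDP port {port}: rx_queue={rx_queue} drops={drops}")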

  • NGINX as StatsD proxy to distribute traffic

After understanding the problem better, we started to look for solutions. There are already plenty of blog posts about how people scale their StatsD, but we also tried to explore other possible approaches. For example, we tried to use NGINX as a UDP proxy, but encountered some issues with the number of UDP connections. If we wanted to use NGINX, we would also need to figure out consistent hashing for UDP requests so that a given metric always hits the same StatsD; otherwise, counters can’t accumulate correctly and gauges will flip between values reported by multiple StatsD processes. Unbalanced hashing could also overload a particular StatsD node. So, we decided to pass on the NGINX solution at the time.

  • Local StatsD

A quick patch we applied to mitigate the packet loss issue (due to maxing out a single CPU core) was to set up something we called local StatsD. Basically, we installed StatsD on each EC2 instance itself and, in rare cases, inside each application container. It was an OK short-term solution, but it increased our Wavefront cost, since metrics were not as well batched as before. The higher cardinality also made Wavefront time series queries slower.

Illustration of local StatsD
  • StatsD proxy and StatsD on one EC2 host

To reduce the Wavefront cost and increase performance, we needed to aggregate our metrics before sending them to Wavefront. Each StatsD proxy is assigned a different port on the EC2 instance. From /proc/net/udp we knew it would be useful to allocate more memory to the receive buffer, so that we could hold more data during a traffic surge, assuming the StatsD process can consume the messages fast enough after the surge. We also did some Linux kernel tuning, adding the following configuration to /etc/sysctl.conf:

net.core.rmem_max = 2147483647
net.core.rmem_default = 167772160
net.core.wmem_default = 8388608
net.core.wmem_max = 8388608
Illustration of StatsD proxy and StatsD on one EC2 host
  • Sharding by application

The applications related to our Dasher dispatch system pump a lot of metrics into Wavefront, which often crashed our StatsD proxy and StatsD EC2 instances. So we started to run three StatsD EC2 instances: a dispatch StatsD for the dispatch system, a monolith StatsD for our monolith application, and a global StatsD for all other microservices to share.

Illustration of sharding by application
  • StatsD proxy + StatsD on multiple EC2

The previous solution worked for a while, but with continued growth, the sharded architecture was no longer enough to handle the traffic, and only a limited number of StatsD proxy processes and StatsD processes can run on a single host. We saw a lot of dropped packets on the global StatsD server. Instead of putting multiple StatsD proxies and multiple StatsD processes on the same host, we built a tier of StatsD proxies fronting another tier of StatsD processes. In this horizontally scalable architecture, we can add more StatsD proxies and StatsD processes whenever we see dropped data.

Illustration of StatsD proxy + StatsD on multiple EC2

Takeaways

  • Use AWS A records instead of AWS Weighted Round Robin (WRR) CNAME records for StatsD proxy DNS. One behavior we noticed with the WRR CNAME setup was that multiple servers would resolve the StatsD proxy DNS to the same IP address during the same period of time, likely due to how CNAME round robin works in AWS. This funneled traffic to a single StatsD proxy server and overloaded it. To evenly distribute the workload across the StatsD proxies, we decided to use an old-school DNS round robin of multiple A records.
  • Turn on deleteIdleStats in the StatsD configuration to handle missing metrics in Wavefront. deleteIdleStats is an option that, instead of sending 0, sends no values to Graphite for inactive counters, sets, gauges, or timers; for gauges, it unsets the gauge rather than resending the previous value. By turning this on, we can clearly identify “no data” metrics. And according to Wavefront, wrapping the alert condition with a default() function is an effective way of dealing with missing data.
  • Use Elastic IPs for StatsD proxy servers. The client resolves the StatsD proxy DNS to an IP via the StatsD library and, since it never tries to reconnect, caches that IP until the process is recycled. If the StatsD proxy servers are relaunched during this period, clients will no longer be sending to the right StatsD proxy unless the IP of the StatsD proxy server stays the same as before. By using Elastic IPs, we reduce misconnections between StatsD clients and StatsD proxy servers and lower the possibility of data loss. On the client side, configuring a max request limit for a WSGI server lets it re-resolve the DNS within an expected time window, which acts like a TTL and also helps reduce misconnections.
  • Continuous monitoring for StatsD EC2 and StatsD Proxy is imperative to avoid server crashes and mitigate potential disaster, especially when you are running services in the cloud. Some alerts based on the StatsD metrics are really sensitive, so we don’t want engineers to be paged because of missing data. Some of our StatsD metrics are used for our cloud resources’ autoscaling, so missing data will be a disaster.
  • CPU and memory analysis of the StatsD EC2 and StatsD proxy EC2 can help to choose the right size for the EC2 instances, which reduces unnecessary cloud resource cost.

Special thanks to Stephen Chu, Jonathan Shih, and Zhaobang Liu for their help in publishing this post.

If you see something we did wrong or could do better, please let us know! And if you find these problems interesting, come work with us!

One of the core mottoes on the engineering team at DoorDash is:

We are only as good as our next delivery!

With high traffic loads and orders flying in, every engineering decision has a critical impact on what our customers, merchants, and dashers will experience. We have to pay careful attention to the details and performance of the system to ensure all three sides of the business are operating flawlessly.

The game of caches

Caching is a common and well-practiced way to reduce load on the database and improve latency for any particular service. This is usually effective for read-intensive systems, e.g., fetching a menu for a restaurant. In-memory data stores such as Redis and Memcached are commonly used tools for such a task, though they introduce additional serialization/deserialization and network overhead. That overhead can be reduced with the help of an in-process thread-safe cache (usually an LRU hash map). The in-process cache serves as an L1 cache, while Redis serves as an L2 cache, and the DB serves as the master.

A typical caching setup

Reads are optimized when the L1 cache has been populated with the required entries. The problem arises when there is a cache miss, which can be due to cache expiration, a deployment roll-out, or a server restart. In this case, new requests must go to L2 or the DB to read the value again. Doing so during peak hours and high load can result in multiple parallel duplicate reads. This behavior is usually known as a cache stampede or cache miss storm, and it causes a spike in both network traffic and latency.

Existing work

There are some existing approaches that rely on locking, an external system refreshing the cache, or probabilistic early expiration. At DoorDash we wanted to solve the cache stampede that could be caused by an L1 cache miss, resulting in parallel duplicate reads to L2 or the DB. Using built-in constructs from Kotlin coroutines, we solved the problem without inventing another complex library. Towards the end of this post we will share some numbers and results we achieved. We rely heavily on gRPC, Netty, and Kotlin coroutines to keep our internal microservices performant. This article assumes that readers have a basic understanding of coroutines, or the equivalent in their technology stack (Go calls them goroutines, C# calls them tasks, Python 3 also calls them coroutines, etc.). While the solution discussed here is more specific to Kotlin, the general idea holds true everywhere and works really well for any event-loop-based async system. For example, the same effect can be achieved using a Node.js promise with a simple dictionary.

The debouncer approach

To solve the problem, we took inspiration from something front-end engineers use frequently. Debouncing is a common practice in the JS world to prevent duplicate events from firing and causing noise in the system. A well-known way to create a debouncing function looks something like this (using lodash or underscore.js):

let debouncedFetchData = _.debounce(fetchData, 100);
// use debouncedFetchData …

The above line creates a debounced function that delays invoking fetchData until after 100 milliseconds have elapsed since the last time the debounced function was invoked.

We wanted something similar to a debounced function, but instead of waiting for 100 milliseconds, the winning coroutine among the racing coroutines should quickly return a Deferred, and the remaining coroutines should wait on the returned Deferred rather than spinning up their own reads. Different technologies have different names for Deferred: JavaScript has Promise, C# has Task, and so on. This debouncing can be grouped by an operation ID. For example, if we are trying to fetch a menu from Redis or the database with ID 701064, we can use restaurant-fetch-701064 as a key to uniquely identify the operation. The operation may internally use exponential back-offs, call another service, read L2, fall back to the database, or even read multiple tables to produce one value; but it should uniquely identify an operation that we want to deduplicate.

Our solution relies on a coroutine-safe (just like thread-safe) scoreboard that tracks pending Deferreds by ID. After a coroutine has been scheduled to fulfill a Deferred for a given ID, subsequent coroutines with the same ID use that pending Deferred to wait for results. Once the Deferred completes, it is removed from the scoreboard. The reader code looks something like this:

https://gist.github.com/96f26f3a480cf8811e3581d48bf60ab1

The method getRestaurantMenus, when simultaneously invoked by many coroutines, will result in one of the coroutines winning the race condition and successfully entering the body to execute fetchMenuFromRemoteCacheOrDatabase. The debounce method immediately returns Deferred<List<Menu>> to all coroutines while fetchMenuFromRemoteCacheOrDatabase executes. All of the coroutines then proceed to await in order to read the results.

How does the debouncer actually work? In the code above, CoroutineDebouncer(ConcurrentHashMap()) relies on computeIfAbsent to create or read a Deferred atomically (be aware of the map implementation you use; make sure it applies the function only once). The precise implementation is dead simple and looks like this:

https://gist.github.com/1be77c565d374f47980bfbe0f6d5e744

The computeIfAbsent allows us to launch an async callback that is scheduled for execution; once it completes, we remove it from the pending board. For the ConcurrentMap parameter required in the constructor, we used a ConcurrentHashMap for simplicity, but this can be replaced with a NonBlockingHashMap for a higher-performing lock-free map, or with your own custom implementation that guarantees atomic operations.
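For readers outside the Kotlin ecosystem, here is a rough sketch of the same idea in Python with asyncio; because the event loop is single-threaded, the dictionary operations between awaits are effectively atomic, and the fetch function in the usage comment is hypothetical.

import asyncio

class Debouncer:
    # Deduplicates concurrent operations that share the same operation ID.

    def __init__(self):
        self._pending = {}  # operation ID -> in-flight asyncio.Task

    async def debounce(self, key, operation):
        task = self._pending.get(key)
        if task is None:
            # The first caller wins the race and schedules the real work.
            task = asyncio.ensure_future(operation())
            self._pending[key] = task
            task.add_done_callback(lambda _: self._pending.pop(key, None))
        # Every caller, winner included, awaits the same pending result.
        return await task

# Usage: await debouncer.debounce("restaurant-fetch-701064",
#                                 lambda: fetch_menu(701064))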

Comparing apples to apples

After applying the changes to our microservice, we benchmarked the new version against the old one. Our machine was a MacBook Pro with a 2.2 GHz i7 and 16 GB of RAM, with the JVM flags -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:+ParallelRefProcEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60.

The gRPC endpoint being tested performs a full read-through operation, going from an L1 cache (5 second TTL) to an L2 cache (10 second TTL) and finally falling back to our Postgres database. We used ghz to benchmark the service for 60 seconds with 2000 concurrent connections and no rate limit. We explicitly chose short expiration times to simulate multiple stampedes and observe the overall effect during the 60 second window. Here are the results:

Cold boot

Without debouncer:

Summary:
  Count: 887495
  Total: 60059.11 ms
  Slowest: 6908.89 ms
  Fastest: 0.55 ms
  Average: 135.10 ms
  Requests/sec: 14777.03
Response time histogram:
  0.546 [1]  |
  691.381 [870160] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1382.216 [7434] |
  2073.050 [4404] |
  2763.885 [2186] |
  3454.720 [209] |
  4145.555 [312] |
  4836.390 [559] |
  5527.225 [1056] |
  6218.060 [505] |
  6908.895 [669] |
Latency distribution:
  10% in 27.40 ms
  25% in 57.70 ms
  50% in 84.08 ms
  75% in 112.40 ms
  90% in 170.33 ms
  95% in 254.05 ms
  99% in 1549.95 ms
Status code distribution:
  [OK] 887495 responses

With debouncer:

Summary:
  Count: 1156274
  Total: 60041.89 ms
  Slowest: 1731.10 ms
  Fastest: 32.23 ms
  Average: 103.68 ms
  Requests/sec: 19257.79
Response time histogram:
  32.227 [1]  |
  202.115 [972011] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  372.003 [23286] |∎
  541.890 [2702] |
  711.778 [0]  |
  881.665 [0]  |
  1051.553 [0]  |
  1221.440 [0]  |
  1391.328 [43]  |
  1561.216 [942] |
  1731.103 [1015] |
Latency distribution:
  10% in 66.05 ms
  25% in 77.36 ms
  50% in 92.63 ms
  75% in 113.26 ms
  90% in 147.19 ms
  95% in 178.08 ms
  99% in 264.65 ms
Status code distribution:
  [OK] 1156274 responses

Warmed up

Without debouncer:

Summary:
  Count: 1053108
  Total: 60769.34 ms
  Slowest: 8058.86 ms
  Fastest: 0.38 ms
  Average: 114.43 ms
  Requests/sec: 17329.60
Response time histogram:
  0.382 [1]  |
  806.230 [982042] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1612.078 [9147] |
  2417.926 [5000] |
  3223.774 [1944] |
  4029.621 [479] |
  4835.469 [953] |
  5641.317 [371] |
  6447.165 [1]  |
  7253.012 [3]  |
  8058.860 [59]  |
Latency distribution:
  10% in 23.74 ms
  25% in 48.83 ms
  50% in 78.63 ms
  75% in 98.37 ms
  90% in 122.91 ms
  95% in 158.75 ms
  99% in 1474.71 ms
Status code distribution:
  [OK] 1053108 responses

With debouncer:

Summary:
  Count: 1321340
  Total: 60064.00 ms
  Slowest: 578.69 ms
  Fastest: 36.04 ms
  Average: 90.77 ms
  Requests/sec: 21998.87
Response time histogram:
  36.045 [1]  |
  90.309 [574748] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  144.573 [401937] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  198.838 [16613] |∎
  253.102 [705]  |
  307.367 [0]  |
  361.631 [0]  |
  415.896 [78]  |
  470.160 [2159] |
  524.425 [2099] |
  578.689 [1660] |
Latency distribution:
  10% in 68.67 ms
  25% in 76.69 ms
  50% in 87.01 ms
  75% in 99.63 ms
  90% in 112.48 ms
  95% in 124.60 ms
  99% in 176.08 ms
Status code distribution:
  [OK] 1321340 responses

Conclusion

Due to the reduced memory and network loads, we observed an average difference of more than ~4K requests per second. More importantly, the p99 latency was reduced from ~1550ms to ~265ms (almost 83% reduction) for the cold boot case, and from ~1475ms to ~176ms (almost 88% reduction) for the warmed-up case.

In order to verify and visually see how many extra calls we were saving over time, we instrumented the CoroutineDebouncer code above and added markers to count the number of times computeIfAbsent invoked the callback versus the total number of calls to the debounce method. We ran our benchmark for 120 seconds with 4000 concurrent requests and a mix of repeated random IDs to simulate real load. The results were encouraging:

Number of times computeIfAbsent applies the function vs. the total number of calls

We’ve also put together a sample JS implementation that you can use to simulate the results yourself. It follows the exact principle described above; you can fork it and play around with different parameters.

There are various approaches to evicting Deferreds, or to putting some sort of waiting timeout on the debouncer, but the core idea remains the same. For systems running in a cluster, this approach prevents a stampede from each instance, thus minimizing a cluster-wide stampede. In a very large cluster (thousands of nodes), we might still experience a stampede, which would require a different solution. So far this approach has worked well for us in production and helps us deliver good latency numbers.

With high traffic loads on our systems, every improvement matters and contributes to a snappy experience for our customers, merchants, and dashers. Craftsmanship and attention to detail help us deliver our next order in a timely manner.

In May, DoorDash participated in the O’Reilly Artificial Intelligence Conference in New York, where I presented on “How DoorDash leverages AI in its logistics engine.” In this post, I walk you through the core logistics problem at DoorDash and describe how we use Artificial Intelligence (AI) in our logistics engine.

LAST-MILE LOGISTICS IN A THREE-SIDED MARKETPLACE

Fulfilling deliveries at DoorDash requires effectively balancing the three sides of the marketplace (the Merchants, Dashers, and Consumers) and identifying the best Dasher to pick up a delivery from the Merchant and bring it to the Consumer.

Determining the optimal solution to this problem (referred to as the Vehicle Routing Problem) is NP-hard. The real-time, quick-turnaround nature of DoorDash introduces additional challenges: delivery requests come in continuously, Dashers are constantly in movement, and variance in restaurant operations and real-world events (traffic, weather, etc.) has pronounced effects on the solutions.

USING AI TO ACHIEVE OPERATIONAL EFFICIENCIES

To solve these problems more efficiently, DoorDash leverages various artificial intelligence (AI) and machine learning (ML) techniques to intelligently model the decision space and achieve near optimal solutions in seconds. Ultimately, across DoorDash’s tens of millions of deliveries, these techniques have led to shorter delivery times for Consumers, higher pay for Dashers, increased income for Merchant partners, and a better experience for all sides of the marketplace.

In my talk, I start with an overview of AI applications at DoorDash, then talk about how machine learning techniques complement traditional operations research techniques to enable logistics at DoorDash. In particular, I dive into two specific applications of machine learning: predicting delivery lifecycle timepoints and batching algorithms.

By Rohan Shanbhag and Wei Lin, Software Engineers

Most Android apps rely on network calls to a set of backend services. As an app grows, so does the complexity of network calls and data operations. Networking libraries like Retrofit and Volley provide all the functionality needed for basic API calls. However, for threading and sequencing of events beyond a single API call, you may end up writing messy, error-prone code. Enter RxJava, an extension of the Observable pattern that abstracts away the complexities of threading and synchronization and integrates seamlessly with Retrofit.

At DoorDash, we’ve experimented with RxJava and found that it has tremendous benefits. In this blog post, we’ll talk about how using RxJava allows us to handle network operations in a reliable, scalable way.

Let’s start off with a basic example in the DoorDash consumer app that makes an asynchronous API call via Retrofit to fetch the list of restaurants at a given location. Using Retrofit’s Callback<T> interface, we perform the request on a background thread and get the response in onResponse on the main thread.

https://gist.github.com/137dbbd8e4a58884c8c96f146ad11974

While this approach works fine for this basic use case, it does not scale much beyond that. Let’s consider a requirement change.

Chaining API Calls

Let’s say the requirement now changes to fetch restaurants based on the user’s default address on login. For this, we have to first make the API call to fetch the user info and then fetch the restaurant list. Using the previous approach, this is what it would look like:

https://gist.github.com/c63033d011bd498317d68378d0753504

Let’s look at the problems in this code:

  • It’s not readable because of multiple nested callbacks
  • It’s not trivial to make the API call aware of lifecycle changes, which can lead to crashes when updating the view
  • The API call to fetch user info is tightly coupled to the call that fetches the restaurant list

Let’s see how RxJava helps us with this problem:

https://gist.github.com/6daffe76b538bea24ee5989f6875ef0d

It addresses the above issues in the following manner:

  • Makes the code clean and readable
  • The fragment can subscribe to the Observable, which makes the API calls on a background thread and emits data on the main thread, as specified with subscribeOn and observeOn. It can unsubscribe from the Observable to stop receiving data from it, as done in onPause
  • Adding API calls to the chain is easy, without much overhead
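The gists referenced above are not embedded in this page. To illustrate the chaining pattern itself, here is a minimal sketch using RxPY, the Python implementation of Rx (our apps use RxJava on Android; the service functions and field names below are hypothetical):

```python
# pip install reactivex  (RxPY 4.x)
import reactivex as rx
from reactivex import operators as ops

def fetch_user():
    # Hypothetical stand-in for the async API call that fetches user info.
    return rx.of({"id": 42, "default_address": "303 2nd St, San Francisco"})

def fetch_restaurants(address):
    # Hypothetical stand-in for the restaurant-list API call.
    return rx.of(["Restaurant A", "Restaurant B"])

# flat_map chains the second call onto the result of the first,
# replacing the nested callbacks from the earlier example.
fetch_user().pipe(
    ops.flat_map(lambda user: fetch_restaurants(user["default_address"]))
).subscribe(
    on_next=lambda restaurants: print(restaurants),
    on_error=lambda error: print("request failed:", error),
)
```

In the real RxJava code, subscribeOn and observeOn route the work to a background thread and the results to the main thread, exactly as described in the list above.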

Making Parallel Requests

Let’s say we want to run an A/B test to display photos for the fetched restaurants. Assume we have an API call that returns a boolean experiment value for the current user, where “true” means we should show restaurant photos in the list. Using the same approach as before, we could add another flatMap operator to first fetch the restaurants and then fetch the experiment value. However, since the two calls are independent of each other, we should ideally make them in parallel and combine their results.

Let’s see how we can do this with the zipWith operator:

https://gist.github.com/2c59b83b8c777e1fc31c4d01f27a09ea

RxJava comes with a powerful set of operators that can be used to compose sequences together. For this use case, we make the two API calls in parallel and combine their results using the zipWith operator before emitting the final result.
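As a rough sketch of the same idea in RxPY (again with hypothetical calls), zip subscribes to both sources and pairs up their results:

```python
import reactivex as rx
from reactivex import operators as ops

def fetch_restaurants(address):
    return rx.of(["Restaurant A", "Restaurant B"])  # hypothetical API call

def fetch_experiment_value():
    return rx.of(True)  # hypothetical: True means show restaurant photos

# zip combines the two independent streams into one result.
rx.zip(fetch_restaurants("303 2nd St"), fetch_experiment_value()).pipe(
    ops.map(lambda pair: {"restaurants": pair[0], "show_photos": pair[1]})
).subscribe(on_next=print)
```

With real network calls, each source would be subscribed on a background scheduler so the two requests actually run in parallel.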

Conclusion

Hopefully these examples showed you that RxJava can help you easily adapt to the changing needs of your app via its APIs and operators. Our Android apps have benefited from using RxJava for solving problems such as combining data streams, search-as-you-type, and polling. Let us know if you have additional thoughts on how RxJava can help with your applications, or come join the team.

Customers come to DoorDash to discover and order from a vast selection of their favorite stores, so it is important to be able to surface what is most relevant to them. In a previous article, Powering Search & Recommendations at DoorDash, we discussed how we built our initial personalized search and discovery experience to surface the best stores for consumers based on their personal preferences.

There we showed how recommendations increased click-through rate by 25% over a baseline of showing the most popular restaurants. By incorporating latent information, and by reusing a training pipeline and gradient-boosted machine setup we use in other systems at DoorDash, we have seen click-through rate increase by another 5% in initial email tests, and we are in the process of testing and rolling out these changes more broadly in email and in-app.

Latent Information

At DoorDash, our recommendations problem differs from the typical e-commerce recommendations problem in that consumers only see stores that are close to them geographically (see “How we Designed Road Distances in DoorDash Search”). Because of the resulting sparsity in the consumer-to-store matrix, we started with the knowledge-based recommender system described in the previous article rather than an approach like collaborative filtering.

However, we do want to capture the kind of latent information that consumer and store similarity carries. To do this, we use a technique similar to word2vec from natural language processing; in our case, store2vec. With word2vec, the idea is that words can be encoded in a vector space that represents their semantic properties. For example, using word2vec, if we take the vector for “king”, subtract “man”, and add “woman”, we get “queen”.

Encoding DoorDash stores in a vector space holds the promise of semantically representing properties of stores that we don’t otherwise have information about, such as whether a store focuses on sweet items, whether it is trendy, or whether it is vegetarian.

For example, here are the most similar stores based on store2vec distance for Smitten Ice Cream in Los Altos and Darbar in Palo Alto:

Smitten Ice Cream: Baskin Robbins, Jamba Juice, Tin Pot Creamery, Krispy Kreme Doughnuts

Darbar: Amber Dhara, Amber India, Curry Up Now, Janta Indian Cuisine, Rangoon Ruby, Shiva’s

For store2vec, we embed stores as vectors using the word2vec (CBOW) algorithm from the gensim package, with the following modifications:

  1. Each store is a word in our vocabulary, and
  2. Each sentence is a list of stores viewed together in a user session.

For the word context, we found a context window size of 5 to work best. As quality constraints, we enforce minimum thresholds on the number of stores in a session and the number of sessions a store appears in.
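As a minimal sketch of this setup (not our production pipeline; parameter values are illustrative), training such an embedding with gensim might look like this:

```python
from gensim.models import Word2Vec

# Each "sentence" is one user session: the list of store ids viewed together.
sessions = [
    ["store_12", "store_98", "store_34"],
    ["store_98", "store_76", "store_12", "store_55"],
    # ... many more sessions in practice
]

model = Word2Vec(
    sentences=sessions,
    vector_size=100,  # illustrative embedding dimension
    window=5,         # the context window size mentioned above
    min_count=1,      # raise this to enforce the session-count threshold
    sg=0,             # sg=0 selects CBOW
)

vector = model.wv["store_12"]                # the store's embedding
similar = model.wv.most_similar("store_12")  # nearest stores in vector space
```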

This gives us a vector for every store. Then, to generate a vector for a consumer, we sum the vectors of each store they ordered from in the past 6 months or last 100 orders. To determine the distance between a store and a consumer, we take the cosine distance between the store’s vector and the consumer’s vector.
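Given the store vectors, the consumer vector and the distance computation are straightforward; here is a small numpy sketch with invented data:

```python
import numpy as np

def consumer_vector(order_history_vectors):
    # Sum the vectors of the stores the consumer ordered from
    # (past 6 months or last 100 orders).
    return np.sum(order_history_vectors, axis=0)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

store_vecs = {"burgers": np.array([0.9, 0.1]), "sushi": np.array([0.1, 0.9])}
consumer = consumer_vector([store_vecs["burgers"], store_vecs["sushi"]])
print(cosine_distance(store_vecs["burgers"], consumer))
```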

To illustrate this, we construct an example consumer whose order history consists of 4505 Burgers & BBQ and New Nagano Sushi (marked in the figure). We can see that burger and sushi restaurants are some of the closest points, but interestingly, so are some Korean restaurants. The points are plotted using t-SNE and the Tensorflow embedding projector. The distances listed on the right are the cosine distances between the consumer vector and each store vector.

Training Pipeline

This store2vec distance feature is one feature we added to our training pipeline for recommendations. The training pipeline consists of the following stages.

Positive and negative example generation: We sample past orders as positive examples. We reconstruct features as they would have appeared just before each order occurred, so that no future information leaks into training or testing. To generate negative examples, we use a noise contrastive approach: we randomly choose another store that the consumer could have ordered from.
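A toy sketch of this example generation (the data structures are hypothetical); the key point is that each positive order is paired with a randomly drawn store the consumer could plausibly have ordered from instead:

```python
import random

def make_examples(orders, candidate_stores_by_consumer):
    """orders: list of (consumer_id, ordered_store) pairs."""
    examples = []
    for consumer_id, ordered_store in orders:
        examples.append((consumer_id, ordered_store, 1))  # positive example
        # Noise contrastive negative: a random alternative store.
        alternatives = [s for s in candidate_stores_by_consumer[consumer_id]
                        if s != ordered_store]
        examples.append((consumer_id, random.choice(alternatives), 0))
    return examples
```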

Feature generation: Based on consumer and store data, we extract many features from the annotated data, such as categories, rating, popularity, and browse/click/order information.

Train/test split: We split the data 70% training and 30% test along a time boundary, so that we never test on data that occurred before the data we trained on.

Model training: We train logistic regression and gradient-boosted machine (GBM) models. For GBM models, we use LightGBM. These are the same frameworks we use for many other machine learning systems at DoorDash such as prep time prediction and batching prediction.

Model evaluation: The model is a binary classifier predicting P(order | consumer, store). To evaluate it for this ranking problem, we use the area under the precision/recall curve (AUC), a metric that does not change if the scores are uniformly inflated or deflated as long as the ranking remains the same. We also output business metrics for each model, such as average delivery fee and average rating, and inspect example users whose order histories conform to certain patterns, in order to sanity-check the output models.
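Putting the split, training, and evaluation stages together, here is a hedged sketch with LightGBM and scikit-learn; the feature names and numbers are invented, and the time-based split is done by sorting on an order timestamp:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import average_precision_score

# Invented labeled examples, with a timestamp for the time-based split.
data = pd.DataFrame({
    "store2vec_distance": [0.21, 0.80, 0.35, 0.95, 0.15, 0.60],
    "store_rating":       [4.6, 3.9, 4.2, 3.5, 4.8, 4.0],
    "label":              [1, 0, 1, 0, 1, 0],
    "order_ts":           pd.date_range("2018-01-01", periods=6, freq="D"),
}).sort_values("order_ts")

split = int(len(data) * 0.7)  # 70/30 split along time
train, test = data.iloc[:split], data.iloc[split:]
features = ["store2vec_distance", "store_rating"]

# min_child_samples=1 only because this toy dataset is tiny.
clf = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)
clf.fit(train[features], train["label"])

scores = clf.predict_proba(test[features])[:, 1]
print("PR-AUC:", average_precision_score(test["label"], scores))
```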

Future Work

Developing a recommendations model that includes latent features is only the second major step. Here are some areas we intend to explore in the future:

  • Generating recommendations with context: Generating a list of recommendations is helpful, but being able to show sections with descriptions can give people more confidence in the recommendations and allow us to personalize more of the DoorDash app
  • Store2vec optimizations: We can make these recommendations more powerful by enhancing store2vec. For example, we could include consumers in the same optimization process, meaning we would generate vectors for stores and consumers together instead of having a separate averaging step.
  • Freshness in recommendations: Based on impression data, we could adjust recommendations for a consumer as they use the product
  • New models: We have experimented with alternative models, like the seq2seq deep learning model shown below, and expect to see performance gains from integrating similar models.
Seq2Seq Deep Learning Model

Conclusion

Personalization holds promise for helping consumers on DoorDash find what they want quickly and for surfacing the restaurants most relevant to them. By applying latent features we were able to improve our predictions, and by applying our existing GBM-based machine learning systems to the problem we were able to get a large boost. Overall, we see approximately a 20% increase in offline AUC, and we are currently testing these models in email and in-app, where we see approximately a 5% increase in click-through rate.

If you are passionate about solving challenging problems in this space, we are hiring for the data science and machine learning team as well as the search & relevance team. If you are interested in working on other areas at DoorDash check out our careers page.

¹Although we try to optimize for increasing orders, click through rate is the primary metric for recommendations and search as it is the direct metric.

To A/B or not to A/B, that is the question

Overview

On the Dispatch team at DoorDash, we use simulation, empirical observation, and experimentation to make progress towards our goals; however, given the systemic nature of many of our products, simple A/B tests are often ineffective due to network effects. To be able to experiment in the face of network effects, we use a technique known as switchback testing, where we switch back and forth between treatment and control in particular regions over time. This approach resembles A/B tests in many ways, but requires certain adjustments to the analysis.

Dispatch at DoorDash

The core responsibility of the Dispatch system at DoorDash is to power the fulfillment of every delivery in DoorDash’s three-sided marketplace of Consumers, Dashers, and Merchants. To achieve this effectively, we focus on:

  1. The core assignment problem: deciding which Dasher is best suited to fulfill a delivery, in an efficient, robust, and scalable way
  2. Machine learning algorithms to predict the numerous timepoints in the life of a delivery: “When can this order reach the Consumer?”, “When will this order be ready for pickup?”, etc.
  3. Algorithms to decide how and when to group multiple deliveries headed in similar directions at similar times
  4. How to leverage various marketplace shaping strategies to balance supply and demand, including pricing changes

As we continuously iterate on these problems, we rely heavily on both simulation and experimentation to best serve all three sides of the marketplace. Offline simulation helps us check assignment algorithms, and offline evaluation and A/B testing help us iterate on prediction algorithms. When dealing with marketplace problems online, however, it is no surprise that network effects often make traditional A/B testing ineffectual. We’ll explain why with the example of SOS pricing below.

Marketplace Experimentation

SOS pricing is a demand shaping strategy used when we have too few Dashers relative to the incoming delivery volume. In these scenarios, in order to avoid overwhelming the Dasher fleet with an intractable number of deliveries, and thereby increasing Consumer delivery times, we enable SOS pricing, which increases the delivery fee. With this increase, demand gets throttled and shifted to later times; meanwhile, Dashers are motivated to dash more. When we introduce (or later modify) the SOS pricing algorithm, we would like to experiment to understand the impact on customer retention and delivery times. If we were to test changes as an A/B experiment on Consumers, 50% of Consumers would see SOS pricing and 50% wouldn’t. In this case, the first set of Consumers gets only half the benefit of supply equilibration (and, by extension, of reduced Consumer delivery times), while the other half gets that partial benefit without paying any extra fee, which makes our learning incomplete. The problem here is the network effect: both sets of Consumers share the same Dasher fleet, so adding a Consumer to the treatment group also affects the experience of Consumers in the control group, and we cannot establish independence between the two groups.

One way to get around network effects is to introduce the change and observe the impact before and after. For instance, we can compare Consumer delivery times before and after the introduction of SOS pricing. The main problem with this approach is the famous maxim that correlation doesn’t imply causation. Without a randomized experiment, we cannot be sure whether the results we see after a change are really because of that change, or merely coincide with other things changing in the system. DoorDash is a dynamic company with a lot changing every day, so relying on a pre-post comparison is a risky proposition. At a minimum, relying on pre-post analyses would create a dispatch bottleneck where we could only change one big thing at a time. Even that strategy, however, is insufficient for interpreting correlations as causal, because occurrences outside Dispatch’s control, like consumer promotions and weather, have a big impact on our normal dispatch metrics and might be the true cause of a pre-post observed difference. To take the most-cited line from the oft-cited Airbnb Medium post on A/B testing, “the outside world often has a much larger effect on metrics than product changes do.”

Switchback Experimentation

Due to the limitations of A/B tests and the insufficiency of pre-post comparisons, the Dispatch team recently decided to switch to a new analytic framework for much of its experimentation: switchback testing. Fun fact: switchback testing was originally employed in an agricultural context, specifically for cow lactation experiments.

In switchback testing, the core concept is that we switch back and forth between the control and treatment algorithms in a given region at alternating time periods. In the SOS pricing example, we would switch every 30 minutes between having SOS pricing and not having it. We then compare customer experience and marketplace efficiency metrics between the control and treatment time buckets, attributing each metric to the algorithm that made the decisions during that period.

Implementing Switchback Experiments

In implementing switchback experiments, we include two extra pieces of logic:

  1. We randomize the variant used for each time window, rather than simply randomizing the initial variant and flipping back and forth deterministically. With this approach, each time unit is a randomized experimental unit.
  2. We further split across different geographical units (regions) and randomize them independently, so region A could use the current algorithm for one window and the new algorithm for the next, while region B could use the new algorithm for the first window and the old algorithm for the next. This effectively creates a series of ‘time-region units,’ as shown in the sketch after this list.
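A minimal sketch of this randomization (not DoorDash’s actual service): seeding a PRNG with the experiment name, region, and time-window start makes each time-region unit an independently randomized, reproducible experimental unit:

```python
import hashlib
import random
from datetime import datetime, timezone

WINDOW_MINUTES = 30  # the "switchback window"

def window_start(now: datetime) -> int:
    minutes = int(now.timestamp() // 60)
    return minutes - minutes % WINDOW_MINUTES

def bucket(experiment: str, region_id: str, now: datetime,
           variants=("control", "treatment")):
    window = window_start(now)
    unit_id = f"{experiment}:{region_id}:{window}"       # the time-region unit
    seed = hashlib.sha256(unit_id.encode()).hexdigest()  # reproducible per unit
    variant = random.Random(seed).choice(variants)
    return variant, unit_id

variant, unit = bucket("sos_pricing_v2", "region_42",
                       datetime.now(timezone.utc))
```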

In many ways, switchback testing is exactly like A/B testing, but instead of randomizing based on something like deliveries, we randomize based on time-region units.

The rest of the section explains the design of the switchback service we use at DoorDash.

The switchback service is responsible for three functions:

  1. Storing metadata on all experiments (Metadata)
  2. Choosing the algorithm variant to use at certain point of time for any experiment (Bucketing)
  3. Handling tracking to enable metrics calculation and analysis (Tracking)

The most important metadata of an experiment is the “switchback window,” i.e., the duration of time for which we persist a variant before switching. Other metadata include the names and relative sizes of the bucket variants. Internally, the switchback service stores the following state:

  1. The start of the current time window
  2. A map with region id as the key and the current bucket name as its value
  3. A unique identifier for the experiment unit (time + region)

To achieve bucketing and tracking, the switchback service exposes two endpoints:

Registration

  • Based on the time passed to the registration endpoint, the switchback service checks and updates its internal state, e.g., it checks whether it is time to update the buckets and, if so, flips a (multi-sided) coin for every region to determine the bucket for the new time window

GetBucket

  • Used by the experimenting system to ask for the current bucket (variant) of an experiment
  • The switchback service fetches the bucket value from its state and sends a tracking event (which includes the experiment metadata and unit identifier) to the ingest pipeline

A set of ETL jobs that combine the above tracking events with our metrics tables enables analysis of the experiment results.

Analyzing Switchback Experiments

In order to use something like a t-test (which is often used to analyze the results of A/B tests) for our switchback experiments, we analyze based on the same units we randomize on: time-region units. This means we conduct our statistical tests on the average values of control and treatment time-region units, rather than on the individual values of control and treatment deliveries. The average value for the treatment and control groups is therefore a simple average over all time-region units, rather than an average weighted by deliveries. This is illustrated with a simple example of 14 deliveries below. While the simple and weighted averages usually converge over time, if they do not, it is an indication that your intervention has a different impact on units with many deliveries than on units with few. In the future, we plan to handle these divergent situations using multi-level modeling (MLM, also known as hierarchical modeling), which has the added advantage of likely reducing the margin of error of our statistical tests.

Sample of 14 deliveries (numeric table in appendix) with illustration of weighted vs. simple average
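A hedged sketch of this unit-level analysis (the column names and numbers are invented): the groupby averages each time-region unit first, so every unit counts equally no matter how many deliveries it contains:

```python
import pandas as pd
from scipy import stats

# One row per delivery; each delivery belongs to a time-region unit.
deliveries = pd.DataFrame({
    "unit_id": ["r1-t1", "r1-t1", "r1-t2", "r2-t1", "r2-t1", "r2-t2", "r2-t2"],
    "variant": ["control", "control", "treatment", "treatment",
                "treatment", "control", "control"],
    "delivery_time": [38.0, 42.0, 35.5, 33.0, 36.5, 41.0, 39.5],
})

# Simple (unweighted) average per time-region unit.
unit_means = (deliveries
              .groupby(["unit_id", "variant"], as_index=False)["delivery_time"]
              .mean())

treat = unit_means.loc[unit_means.variant == "treatment", "delivery_time"]
ctrl = unit_means.loc[unit_means.variant == "control", "delivery_time"]
t_stat, p_value = stats.ttest_ind(treat, ctrl, equal_var=False)
```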

Moreover, traditional A/B testing assumes independent units: knowing something about the behavior of one unit should tell us nothing about the behavior of another. This assumption is clearly violated for time-region units, since the performance of one time-region is highly correlated with, and can influence, the performance of the next. To give a concrete example, the average time it takes to complete deliveries in one area in one 10-minute chunk of time is highly correlated with the average time it takes to complete deliveries in the same area in the next 10-minute chunk, much more so than one delivery is with the next delivery assigned. To account for this lack of independence in our statistical tests, we use an approach robust to dependence called a sandwich estimator of variance (an R snippet of the code we use is included in the appendix). While strictly speaking this approach (or something similar) is required to avoid violating core assumptions, thus far, when running A/A tests (dummy experiments where there is no actual difference between “treatment” and “control”), we have found the sandwich estimator’s effect on our variance calculations to be under 10%.
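The appendix contains the R snippet we use; as a rough Python analogue with invented unit-level data, statsmodels can produce a cluster-robust (sandwich) variance estimate, here clustering time-region units by region:

```python
import numpy as np
import statsmodels.api as sm

# Invented unit-level data: one row per time-region unit.
region = np.array(["r1", "r1", "r1", "r2", "r2", "r2", "r3", "r3"])
treated = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
metric = np.array([35.5, 40.0, 34.0, 41.0, 36.5, 39.0, 33.5, 42.0])

X = sm.add_constant(treated)
# cov_type="cluster" yields a sandwich variance estimate that allows
# correlation between time windows within the same region.
fit = sm.OLS(metric, X).fit(cov_type="cluster", cov_kwds={"groups": region})
print(fit.params[1], fit.bse[1])  # treatment effect and its robust std. error
```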


Setting up Experiments and Doing Power Calculations

In creating our time-region units and determining how granular the geographic units should be and how quickly to switch back and forth, we really have to consider two factors: bias and margin of error.

Bias occurs when your randomization of time-region units is compromised and the sorts of deliveries in the treatment and control groups are not the same on average. For instance, we might test an algorithm that is reluctant to assign deliveries with longer distances between stores and customers. As a result, this treatment algorithm may appear to have a faster average time to completion, even if it is not a better algorithm, simply because it cherry-picks short deliveries and leaves the control group to complete the longer ones. Bias is more likely to occur if your regions get too small or you switch back too frequently; when we switch back less frequently, the treatment group is forced to complete many more of the longer deliveries itself rather than relying on the control group to clean up its mess. More on the health checks we use to guard against bias and verify successful randomization in the footnote.¹

Margin of error is how much uncertainty exists in our estimate of the impact of an intervention. Two main factors influence this uncertainty: the natural variation in the data we are observing and the number of units in our dataset. Without getting too technical (refer to sampling distributions for more information), the margin of error in our estimates is proportional to the natural variation in our sample divided by the square root of the number of samples. As time-region buckets get more granular, the natural variation in a metric tends to go up;² however, more granular buckets also provide more units in our dataset, which drives down uncertainty.
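As a back-of-the-envelope illustration of that trade-off (the numbers are purely invented): finer windows give more units but typically noisier per-unit averages:

```python
import math

def margin_of_error(sd: float, n_units: int, z: float = 1.96) -> float:
    # Margin of error is proportional to variation / sqrt(sample size).
    return z * sd / math.sqrt(n_units)

print(margin_of_error(sd=4.0, n_units=2_000))  # coarser, 60-minute windows
print(margin_of_error(sd=6.5, n_units=4_000))  # finer, 30-minute windows
```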

To understand the interplay between natural variation and the number of samples, we ran a series of long-term A/A tests. These tests let us examine how the margin of error for the average value of a metric in a time-region unit changed as we switched back and forth more quickly or more slowly. The results for one metric are summarized in the table below (note: given the small impact of the sandwich estimator in most situations, it is ignored in this table):

We were able to do a similar sort of analysis using different levels of geographic granularity.

Summarizing the above considerations about bias and margin of error: we do not want time-region units that are too small, or we risk introducing bias and failing one of our randomization health checks;³ on the other hand, we do not want time-region units that are too large, or we risk large margins of error and taking too long to get our learnings. Based on these analyses, we typically run our tests with 30-minute switchback periods using geographic divisions roughly at the city level.

Validation of our Switchback System

An extra verification step we are interested in asks: are the results from a switchback experiment reflective of reality? Concretely, if we see an X% increase in a metric in an experiment and then completely move over to the new algorithm, will we actually observe those gains in the real world? This is particularly important from a business perspective and for making reliable product decisions based on experimental results.

To do this, we observe the changes in metrics prior to and after the launch of the new algorithm. As described earlier, observing changes in a pre/post scenario is challenging for multiple reasons. However, over large time windows after product launches, we can reliably observe the impact on the metrics time series. So, what we look for here is just directional confirmation of the results we observe in the experiment.

Conclusion

As DoorDash tackled this problem, we were able to find some helpful resources like Lyft’s 3-part discussion of experimentation in a ridesharing marketplace. We felt there were still some questions left to answer and believe we’ve been able to make some additional observations particularly on the analysis and implementation details, and we hope our observations may help you if you’re considering similar issues. That being said, we fully admit we’re just getting started on unpacking exciting challenges like the ones discussed above. Special thanks to Yixin Tang and Viraj Bindra for their help in publishing this post.

If you see something we did wrong or could do better, please let us know! And if you find these problems interesting, come work with us!

Appendix

R-Snippet for Analysis with Sandwich Variance Estimator

¹ To guard against bias and ensure ‘successful randomization’, you can check that factors unaffected by your intervention have equal values between the treatment and control groups; in our example, a good value to check would be the distance between restaurant and customer. At DoorDash, however, we often find that simply checking that the expected proportion of deliveries is handled by treatment runs and control runs sufficiently guards against most instances of bias. To return to our example of an algorithm that is hesitant to assign long-distance deliveries, this would result in an unequal proportion of deliveries being assigned to the control group rather than the treatment group, and this deviation would indicate the existence of bias.

² For example, average delivery times in a specific 5-minute window at the city level vary more widely than do average delivery times across all of DoorDash’s regions in an hour.

³ However, if we notice any bias, we shut down the experiment and start over with coarser time and/or geographic units.