

Using a Multi-Armed Bandit with Thompson Sampling to Identify Responsive Dashers

March 15, 2022

Arjun Sharma


Maintaining Dasher supply to meet consumer demand is one of the most important problems for DoorDash to resolve in order to offer timely deliveries. When too few Dashers are on the road to fulfill orders, we take reactive actions to persuade more Dashers to begin making deliveries. One of the most effective things we can do is to message Dashers that there are a lot of orders in their location and that they should sign on to start dashing. Dashing during peak hours can mean a more productive shift, higher earnings, and more flexibility in choosing which offers to accept. 

We need to optimize which Dashers to target with our messages because approaching Dashers with no interest in dashing at that time can create a bad user experience.  Here we will describe a bandit-like framework to dynamically learn and rank the preferences of Dashers when we send out messages so that we can optimize our decisions about who to message at a given time.

Finding the best way to alert Dashers about low supply

Currently, we select Dashers to message by identifying who has been active in a given location and then choosing recipients at random. While this approach avoids overloading specific Dashers with messages, it does nothing to improve the conversion rate of Dashers signing on to the platform after receiving a push notification.

We need to find a methodology that uses our information about Dasher preferences while avoiding spamming Dashers who wouldn’t be interested in receiving notifications at that time. This problem statement lends itself to finding a machine learning approach that can:

  • Identify current responsive Dashers who are more likely to convert when asked to dash now
  • Identify Dashers who aren’t interested in these messages so we can avoid spamming them
  • Identify new responsive Dashers so that we don’t overtax our existing responsive Dashers
  • Rank Dashers by their willingness to dash when contacted so we know how to prioritize who to message at each send

ML approaches we considered 

One possible approach is to treat this as a supervised learning classification problem. We can use past labeled data – for example, whether a Dasher historically has signed on to dash when invited – and try to create a model that predicts a Dasher’s probability of dashing now when sent a message under a given set of features. 

While this approach is easy to frame as a binary classification model, there are some issues with this approach. What if Dasher preferences change over time? For example, a Dasher who is enrolled in college could be very responsive during breaks, but largely unavailable once school resumes. This type of non-stationary behavior would have to be handled by the model trainer through retraining and heavily weighing more recent observations.

Another problem with this approach is that it only optimizes for the probability of dashing when a message is sent. With this approach, we would only be sending messages to Dashers we already know are likely to convert.  There would be no basis to send messages to other Dashers, giving them a chance to self-identify as responsive Dashers.

Because of our constraints and what we are optimizing for, there are multiple benefits to using a bandit algorithm instead of supervised learning. We can construct a bandit-like procedure that allows us to dynamically explore Dashers to message, over time identifying and optimizing on Dashers who respond to messages. This approach would allow us to dynamically allocate traffic to Dashers who are more responsive. 

As Dasher preferences change over time, the algorithm can relearn dynamically which Dashers would be most likely to convert. We can even easily extend this framework to use a contextual bandit; if Dasher preferences change based on time of day or day of week, the bandit can be given these features as context to make more accurate decisions.

Next, we need to select which bandit framework to use in order to allocate traffic to Dashers dynamically.

A trio of possible bandits

There are multiple factors involved in determining which bandit to use. We want an algorithm that explores enough to adjust to changing Dasher preferences and yet still sends messages to Dashers who we already know are responsive. Several algorithms come to mind as possible choices:

The Epsilon-Greedy algorithm defines a parameter – epsilon – that determines how much to explore sending messages to Dashers about whom we don’t know as much.

  • Pros:
    • Easy to understand and implement
    • Makes it easier to prioritize known Dashers based on their likelihood to respond to messages
  • Cons:
    • Because we have to define this constant epsilon percentage, it does not improve over time. We can explore too little early on and too much later in the process
    • Experimentation is not dynamic; no matter what we have learned about Dashers’ preferences, we are always exploring at a fixed percentage
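A minimal sketch of how Epsilon-Greedy selection might look, assuming a `conversion_rates` map of observed per-Dasher conversion rates (the function name and the epsilon value are illustrative, not our production code):

```python
import random

def epsilon_greedy_pick(conversion_rates, epsilon=0.1, n=1):
    """With probability epsilon, explore by messaging random Dashers;
    otherwise exploit the highest observed conversion rates.
    `conversion_rates` maps dasher_id -> observed conversion rate."""
    dashers = list(conversion_rates)
    if random.random() < epsilon:
        return random.sample(dashers, n)  # explore: uniform random pick
    # exploit: rank by observed conversion rate, descending
    return sorted(dashers, key=conversion_rates.get, reverse=True)[:n]
```

Note that epsilon stays fixed: the fraction of exploratory sends never shrinks, no matter how much we have already learned, which is exactly the drawback listed above.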

The Upper Confidence Bound (UCB) bandit algorithm is based on the principle of “optimism in the face of uncertainty,” which translates to selecting the action with the highest upper confidence bound on its estimated reward.

  • Pros:
    • Finds the best-performing Dashers quickly
    • Once there’s enough data, starts to optimize sending messages to responsive Dashers instead of exploring 
  • Cons:
    • Difficult to communicate the strategy to stakeholders about why a specific action was taken
    • When there is an excess of new Dashers, this method could end up only messaging new Dashers until enough signal is received
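A sketch of the classic UCB1 scoring rule (the function name and exploration constant are illustrative). Note how a never-messaged Dasher receives an infinite score, which is why an influx of new Dashers can crowd out everyone else until enough signal accumulates:

```python
import math

def ucb_score(successes, trials, total_trials, c=2.0):
    """UCB1 score for one Dasher: empirical conversion rate plus an
    exploration bonus that is large for rarely-messaged Dashers."""
    if trials == 0:
        return float("inf")  # never-messaged Dashers are always tried first
    mean = successes / trials
    bonus = math.sqrt(c * math.log(total_trials) / trials)
    return mean + bonus
```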

Thompson Sampling takes a Bayesian approach to the problem. We assign a prior probability distribution to each Dasher that is updated to a posterior probability after reviewing observations.

  • Pros:
    • Intuitive approach that counts the successes and failures of each message sent to a Dasher 
    • Depending on the probability distribution used, we can take advantage of the conjugate relationship between prior and posterior probabilities and use a simple update rule to get the posterior probability
    • Easy to implement
    • Finds best-performing Dashers quickly
  • Cons:
    • Requires manually setting priors for new Dashers, whereas an approach like UCB automatically includes Dashers we have not previously messaged 

Why we chose Thompson Sampling

Given these three frameworks, we selected Thompson Sampling for its intuitive approach and ease of implementation. 

We started by defining our target function: the probability that a Dasher who receives a message will convert and sign on to DoorDash immediately. Next, we needed a prior for each Dasher from which we could sample to decide whom to message. A prior is a probability distribution that models the probability that a given Dasher will respond when messaged. Along with choosing an appropriate prior, we also need a method for updating it given new information. We used a beta distribution because it directly uses the number of successes (alpha) and number of failures (beta) to create a distribution over the probability of success. By using the conjugate relationship between the beta prior and its posterior, we get an intuitive update rule – add to alpha if a Dasher converts or, if not, add to beta. As we update the distribution following each message, its variance shrinks and we become more certain of the outcome. 
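A minimal sketch of this Beta-prior bookkeeping (class and method names are our own, not DoorDash’s production code):

```python
import random

class DasherPrior:
    """Beta distribution over one Dasher's conversion probability."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # success pseudo-count
        self.beta = beta    # failure pseudo-count

    def update(self, converted):
        # Conjugate update: one Bernoulli observation shifts one count.
        if converted:
            self.alpha += 1
        else:
            self.beta += 1

    def sample(self):
        # Draw a plausible conversion rate from the current posterior.
        return random.betavariate(self.alpha, self.beta)
```

As alpha and beta grow with each observed send, draws from `sample()` cluster more tightly around the Dasher’s true conversion rate, which is the shrinking variance described above.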

Our last decision when defining the prior was whether to start with pure exploration (a uniform distribution) or to use past data to inform our prior. We chose to inform each Dasher’s prior with previous message and conversion data to speed up the convergence of the distributions. We apply a weight-decay parameter to previous observations to favor recent data over historical observations. This way, when we start the experiment, the bandit has a head start on Dasher preferences without biasing too heavily toward old (and potentially stale) data.
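The informed prior with weight decay might be built like this (the function name and decay value are illustrative assumptions; `history` is a list of past outcomes, oldest first):

```python
def informed_prior(history, decay=0.9):
    """Build (alpha, beta) from past conversion outcomes, starting
    from a uniform Beta(1, 1) and down-weighting older observations."""
    alpha, beta = 1.0, 1.0
    n = len(history)
    for i, converted in enumerate(history):
        weight = decay ** (n - 1 - i)  # most recent observation gets weight 1
        if converted:
            alpha += weight
        else:
            beta += weight
    return alpha, beta
```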

Next, we needed to tune a set of hyperparameters vital to modeling the situation accurately. Among the steps we took were:

  • How long should each observation window be? If it’s too short, we can’t accumulate enough reward/penalty for each run; if it’s too long, the algorithm takes extra time to update and find high-performing Dashers.
  • How stationary is the problem? Dasher behavior changes over time, so we must give greater weight to recent observations than to those recorded further in the past. If a previously responsive Dasher ceases to respond, we need to update their probability distribution quickly.
  • What prior should we give new Dashers? We want to add new Dashers to the algorithm without degrading performance, while still giving them a chance to be selected so that we can learn whether they are high performers.
  • Given the imbalance in our data (most Dashers choose not to dash when messaged), how much weight should we give a success vs. a failure?
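For the last point, one illustrative way to handle the class imbalance is an asymmetric conjugate update (the specific weights below are assumptions, not production values):

```python
def weighted_update(alpha, beta, converted,
                    success_weight=5.0, failure_weight=1.0):
    """Asymmetric Beta update: because failures vastly outnumber
    successes, a conversion moves the posterior more than a miss,
    so rare successes aren't drowned out."""
    if converted:
        alpha += success_weight
    else:
        beta += failure_weight
    return alpha, beta
```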

After defining our beta distribution, update rule, and these hyperparameters, we are ready to use the bandit procedure to decide which Dashers to message. Whenever we are ready to send a message, we sample each Dasher’s prior distribution to estimate their probability of converting when messaged. We then rank the Dashers in descending order by their sampled value and take the top Dashers whose value exceeds a predetermined threshold, so that we don’t message Dashers the bandit has determined won’t convert. We define the number of Dashers to contact by first determining how many are needed to resolve the current shortage, then dividing that number by the average conversion rate for Dashers in that location. The bandit then messages the Dashers it has determined are most likely to get on the road.
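The send-time procedure above can be sketched as follows (function and parameter names are our own; `priors` maps each Dasher to a Beta `(alpha, beta)` pair, and the threshold value is illustrative):

```python
import math
import random

def pick_dashers_to_message(priors, shortage, avg_conversion_rate,
                            threshold=0.05):
    """Thompson-sampling send: sample each Dasher's posterior, rank
    the samples descending, drop anyone below the threshold, and cap
    the list at shortage / average conversion rate."""
    n_to_message = math.ceil(shortage / avg_conversion_rate)
    # One posterior draw per Dasher: a plausible conversion probability.
    samples = {d: random.betavariate(a, b) for d, (a, b) in priors.items()}
    ranked = sorted(samples, key=samples.get, reverse=True)
    eligible = [d for d in ranked if samples[d] > threshold]
    return eligible[:n_to_message]
```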


Currently, we are running experiments to test this bandit framework against our previous random sampling method. We are using a switchback experiment to measure the impact that improved message targeting has on the overall supply/demand balance for a given location. Using this testing framework, we can see not only whether more Dashers respond to messages, but also what effect these additional Dashers have on market supply. So far, we have seen an improvement in the conversion rate of messages sent by the bandit framework, which has allowed us to send fewer messages than our control variant required. We are experimenting further to validate the impact.


While we have tailored Thompson Sampling to a specific Dasher scenario, this solution can work in many different scenarios. Companies seeking to provide a personalized experience to all of their customers may have limited data to figure out how to best accomplish that. Thompson Sampling can help demonstrate which options give the greatest reward in a non-stationary environment. The method works well in a quickly changing business environment where there’s a need to dynamically optimize traffic. With a single model, we get the advantages of velocity, dynamic traffic allocation, and a solution that handles changing behavior over time.

While what we have done to date works well, there are many ways we can improve upon this approach. Currently, we only consider whether a Dasher signed on after receiving a message. But additional data lets us know that Dashers’ preferences change based on their location, time of day, day of week, and much more. Over time, we can encode this information as contextual features so that the bandit can make even smarter decisions.


This post is based in large part on the great work of our intern Hamlin Liu. We are excited to have him join us full time in August!
