
As DoorDash transitioned from Python monolith to Kotlin microservices, our engineering team was presented with a lot of opportunities to improve operational excellence and continue our obsession with reliability. While there are many ways we can improve our engineering metrics via optimized hardware and suitable system designs, one direct lever we can pull as developers to contribute to the overall engineering excellence is to write better and cleaner code on a daily basis.

And one way for developers to write cleaner code is to adopt the paradigm of functional programming (FP) using Kotlin. As a multi-paradigm general-purpose programming language, Kotlin provides a lot of the necessary toolkits we need to leverage FP in our day-to-day coding. 

In this post, we are going to talk about what functional programming is, what the benefits and potential downsides of it are, how it compares to the alternative paradigm imperative programming (IP), what Kotlin provides for developers to leverage FP, and examples of how we at DoorDash write FP-style code in Kotlin.

What is functional programming (FP)?

In a nutshell, FP is a programming paradigm in which programs are constructed by applying and composing functions. A typical program in FP works like this: Given some input, apply a series of functions (can be both big and small) on the input to get the desired output. This might seem trivial, but it has a lot of implications and rules behind how the functions are constructed and what is the scope of changes that can be affected by these functions. Together, these implications and rules are what make FP a great paradigm to consider.

Out of all the concepts in FP, the following three contribute the most to the benefits of adopting FP in our day-to-day programming. (We will discuss later in more detail how these concepts help us write better and cleaner code.)

  • Pure functions. By definition, pure functions have the same return values for the same input, and there are no side effects (such as updating other local variables and invoking I/O). For example, all mathematical functions, such as sum, max, and average, are pure functions.
  • Immutable states. Compared to mutable states we are familiar with–such as a variable that can be reassigned to any values or an array that we can insert or remove any values during runtime–immutable states are not modifiable after they have been created or assigned a value.
  • Function composition. As the word “composition” suggests, function composition refers to combining simple functions to build more complicated functions. In practice, the output of a function becomes the input of another function, which yields an output that is used for the input of another function, and so on.

It’s understandable not to have heard of these concepts before; in fact, this unfamiliarity is one of the main reasons why FP is not as widely used and adopted as other paradigms. FP sits in a different camp from imperative programming (IP), which includes the sub-paradigms of procedural programming and object-oriented programming (OOP) that most developers know well. Most computer science curriculums don’t cover FP as extensively as OOP, and often it isn’t covered at all. And while many mathematics courses cover the core concepts behind FP, such as pure functions and composition, they rarely connect those concepts to how they can be leveraged in the programming world.

How does FP compare with IP?

While there are many areas of difference between FP and IP, we will expand on Microsoft’s comparison of FP and IP in the context of .NET and emphasize these three areas:

  • Programmer focus. IP requires programmers to think about how to perform the algorithms and track internal changes in state to meet the desired outcome. In FP, however, programmers focus mainly on three things:
    • What are the inputs
    • What are the desired outputs
    • What transformations are needed to convert the inputs into outputs
  • State changes. There are basically no state changes in FP since immutable states are at the core of the paradigm. In IP, however, state changes are everywhere and crucial to the flow of execution because those state changes essentially are how the program keeps track of where it’s at and what to execute next.
  • Primary flow control. In FP, functions are applied to data collections such as arrays and maps to perform the desired transformations. Functions are first-class citizens; therefore, they can be assigned to variables, passed as arguments, and returned from other functions. On the other hand, IP relies heavily on loops, conditionals, and function calls (which can be pure or non-pure) to control the flow of the program and manipulate internal state to reach the desired end state.
Figure 1: The differences in programmer focus, state changes, and primary flow control between FP and IP

It’s apparent that not only are the methodologies different between FP and IP, but that in practice the way a programmer thinks during coding is also drastically different. That being said, FP and IP are not mutually exclusive. In fact, many programming languages, such as Kotlin, adopt a multi-paradigm mindset where programmers are free to use more than one paradigm in the same piece of code. For example, since Kotlin is designed to interoperate fully with Java, and Java is primarily an OOP language, one could expect that there will be a lot of OOP examples in Kotlin code. That does not stop programmers from applying FP-style functions on Java objects, which we will showcase more later.


Benefits of writing FP-style code

Now that we have seen the differences between FP and the more widely known IP, let’s look at what benefits FP brings to the table. In summary, there are three main advantages: 

  • Side-effect-free executions
  • Easy iterations on existing functions
  • Increased testability

Side-effect-free executions

As previously mentioned, pure functions guarantee no side effects other than producing the desired output. Pure functions do not modify the state of any of their inputs, nor do they modify any system-wide state. In a highly complex system like DoorDash’s, this property is highly valuable because, as a developer, you can trust that a function does exactly what it claims to do and nothing else. When multiple developers from teams across different departments work on the same code base, understanding the logic becomes straightforward because it’s easy to read the series of functions applied to the input and figure out what is being done without poking into each individual function.

Easy iterations on existing functions

Because all functions written in FP style have no side effects, it’s much easier to iterate on existing functions and logic. For example, suppose there are existing functions that perform a series of operations to calculate the base pay for a Dasher (our term for a driver) on a delivery. Let’s say we want to add a new feature such that the base pay is increased by 50% if the delivery was done during rush hour. Iterating on the existing logic is very easy; in fact, all we need is to add a new function to the end of the calculation funnel, one that multiplies its input by 1.5 if the delivery was done during rush hour. In this case, the input is the base pay calculated in the previous step. However, as developers, we don’t need to worry about where the input comes from or how it was calculated. As long as we know that the task of this pure function is to compute a new value, it’s a very easy function to write.
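As a rough, self-contained sketch of what that could look like (the function names and pay formula below are invented for illustration, not our actual pay logic), the new rule is just one more pure function appended to the chain:

// Hypothetical sketch: the new rush-hour rule is a single pure function
// appended to the end of the existing pay calculation.
fun applyRushHourBonus(basePay: Double, isRushHour: Boolean): Double =
    if (isRushHour) basePay * 1.5 else basePay

// Stand-in for the existing, unchanged calculation steps.
fun calculateBasePay(distanceMiles: Double, minutes: Int): Double =
    2.0 + 0.5 * distanceMiles + 0.1 * minutes

fun main() {
    val basePay = calculateBasePay(distanceMiles = 3.0, minutes = 20)
    val finalPay = applyRushHourBonus(basePay, isRushHour = true)
    println(finalPay)   // basePay * 1.5
}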

Increased testability

When a function is a pure function, the output of the function is deterministic given the same input. This makes testing the function much easier because the test can be structured as a set of inputs and their expected output, and it’s guaranteed that running the function through these inputs will always yield the same expected output. For example, suppose we have a function that takes an array of integers and returns the second-largest number from the array. This operation is pure because:

  • The function does not depend on anything else other than the input 
  • It doesn’t alter the input array or anything else in the system
  • Given the same array of numbers, the second-largest number will always be the same.

Therefore, the unit test for this function will be very straightforward because there is no need to mock any system variables or function calls, and the output is deterministic so there will be no flaky tests. Therefore, if we could all write FP-style programs, it would become much easier to write tests, especially for mission-critical applications.
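As a minimal sketch of that example (using a list rather than an array, and hypothetical names), both the pure function and its test fit in a few lines:

// A pure function: depends only on its input and does not modify it.
fun secondLargest(numbers: List<Int>): Int? =
    numbers.distinct().sortedDescending().getOrNull(1)

// Deterministic tests: the same inputs always yield the same outputs,
// so no mocks or system state are needed.
fun main() {
    check(secondLargest(listOf(3, 7, 1, 9)) == 7)
    check(secondLargest(listOf(5)) == null)
}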

Potential downsides to FP 

It would be too good to be true if FP brought only benefits to the table without any potential downsides. One downside, depending on the programming language and the compiler, is that each function call can create a new stack frame. Without optimization, the creation and destruction of these frames can quickly become significant runtime overhead for the application, even when we are performing trivial operations. Luckily, this downside is manageable: Kotlin provides the ability to make a function inline, which resolves much of the problem when properly used. Simply put, instead of creating a new stack frame and executing the code inside the function, an inline function replaces the function call with the function’s actual body, placing it in the body of the caller.
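For example, here is a hedged sketch of an inline higher-order function (the function itself is hypothetical): because it is marked inline, the compiler copies its body, and the lambda passed to it, directly into the call site, so no extra function object or call frame is created for the lambda.

// Marked inline: the body and the lambda are copied into each call site at compile time.
inline fun <T> measure(label: String, block: () -> T): T {
    val start = System.nanoTime()
    val result = block()
    println("$label took ${System.nanoTime() - start} ns")
    return result
}

fun main() {
    // Compiles roughly to the timing code wrapped directly around the expression.
    val total = measure("sum") { (1..1000).sum() }
    println(total)
}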

Another potential downside of FP is its speed and memory usage. Since each function essentially creates new data from the existing data, these data creations can take extra time and space to be instantiated in memory. In IP, on the other hand, we mostly deal with mutable data structures that can be updated in place without allocating new memory. The problem of runtime speed can be mitigated by parallelism. Naturally, most pure functions in FP are highly parallelizable, which means we can run a large pool of functions without worrying about how they interact with each other or how they will affect the system variables. An effective strategy for running functions in parallel can potentially bring net-positive speed improvement to the program.

One of the most common operations in modern applications is Input/Output (I/O). When I/O is involved, it means the application is now dealing with the outside world. Examples of I/O include prompting the user for an input, invoking a remote procedure call (RPC) to another service, and reading data from a database. Because of the unpredictable nature of I/O tasks, they are most likely not pure, meaning both the input and output are not deterministic. When we are dealing with I/O tasks, writing pure functions forcefully to handle I/O is not the right approach. In fact, given the multi-paradigm nature of many modern programming languages like Kotlin, developers should choose the paradigm based on what’s best for the task at hand instead of strictly following one paradigm for the whole application. In the world of Kotlin, developers can use the standard I/O library from Kotlin, as well as the one from Java.

What does Kotlin provide for developers to leverage FP?

Before we get into the real actions of how to write FP code in Kotlin, it’s natural to wonder, is Kotlin even the right language for FP? The short answer is, definitely yes! In fact, one of the top FAQs from the official Kotlin language website states that “Kotlin has both object-oriented and functional constructs. Kotlin can use both OO and FP styles, or mix elements of the two.” So what features and tools does Kotlin have so that developers can write FP-style code?

Higher-order functions and lambdas

There’s a dedicated section in the Kotlin documentation on this topic, so we won’t go over all the details. In summary, since Kotlin functions are first-class citizens, they can be stored in variables, passed as function arguments, and returned from other functions, and types can be defined for them. With this capability, common FP functions such as fold can be written easily in Kotlin, because we can pass any accumulative function into fold to combine the data.
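As a quick, hedged illustration (the numbers are made up), fold takes the combining function as an argument, and because functions are first-class the combiner can also be stored in a variable and reused:

fun main() {
    val tips = listOf(2.5, 4.0, 1.25)

    // The lambda (acc, tip) -> acc + tip is the accumulative function passed to fold.
    val total = tips.fold(0.0) { acc, tip -> acc + tip }

    // The same combiner stored in a variable and passed by reference.
    val add: (Double, Double) -> Double = { a, b -> a + b }
    val totalAgain = tips.fold(0.0, add)

    println("$total $totalAgain")   // 7.75 7.75
}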

On top of supporting higher-order functions, lambda expressions are neat ways to simplify the code without writing all the function declarations that usually cause a lot of mess in the code. In a nutshell, lambda expressions are functions that are not declared but are passed immediately as an expression. This makes reasoning and understanding the code much easier since we don’t need to jump through hoops to find out what the function actually does.

As a quick example, consider the following code snippet:

deliveries.sumOf { delivery -> delivery.customerTip }

In this snippet, sumOf is a higher-order function because it takes another function as an argument, and { delivery -> delivery.customerTip } is a lambda expression, which takes in a delivery object and returns the customer tip amount of the delivery. We will show more real-life examples of writing FP-style code in Kotlin in later sections.

Collection-based operations

Kotlin provides a powerful set of collection-based operations that can be used to facilitate FP-style computation. According to the Kotlin documentation, given a list of items, common operations fall into these groups: 

  • Transformations: Transform all items in the data collection
  • Filtering: Return a subset of the items based on certain criteria
  • Grouping: Group them into smaller groups of items based on certain criteria
  • Retrieving collection parts: Return a subset of items in some fashion
  • Retrieving single elements: Return an item based on certain criteria
  • Ordering: Order the data collection based on certain criteria from each item
  • Aggregate: Return a single value after applying some operations to all items

All the functions for collections from the standard library are in the Kotlin Collections API documentation. In later sections, we will see how developers at DoorDash usually utilize these common operations on a regular basis.

Comparing Kotlin with languages like Python, JavaScript, and C++

While Kotlin provides such a powerful set of tools for developers to write FP code, these tools and functions are not exclusive to Kotlin. In fact, many modern languages support FP-style development and provide similar sets of collection-based operations, especially in newer releases of these languages. The following table summarizes how Kotlin compares with these popular programming languages in terms of the availability of some features we’ve discussed so far.

Feature | Kotlin | Python | JavaScript/TypeScript | C++
Higher-order functions | Yes | Yes | Yes | Yes (introduced in C++11)
Lambda expressions | Yes | Yes | Yes | Yes (introduced in C++11)
Function type | Yes | Partially (dynamic typing) | No in JS, yes in TS | Yes
Transformations | Yes | Yes | Yes | Yes (no map function, but has a transform function)
Grouping | Yes | No (not built-in, needs other packages) | Yes | No
Aggregate | Yes | Yes | Yes | No

While Kotlin supports all the features natively, other modern languages, such as TypeScript (which is the primary language for web clients at DoorDash), also have built-in library support. Thus, the knowledge of FP and common operations in Kotlin can easily be transferred to other modern languages in day-to-day coding.

Examples of how we at DoorDash write FP-style code in Kotlin

Now that we understand what FP is, what its pros and cons are, and what Kotlin provides for us to write FP-style code, it’s time to see FP in action. In all the examples below, we will use the following data classes as context. Note that all the examples are hypothetical and for illustrative purposes only.

data class Delivery(
    val id: UUID,
    val dasherId: UUID,
    val basePay: Double,
    val customerTip: Double,
    val dropOffTime: Calendar
)

data class Dasher(
    val id: UUID,
    val name: String
)

Let’s start with an easy but very common example: Given a list of deliveries, return a list of the total pay amounts that are greater than $10.

Let’s first look at how we can do this in IP style.

val totalPayAmounts = mutableListOf<Double>()
for (delivery: Delivery in deliveries) {
    val totalPay = delivery.basePay + delivery.customerTip
    if (totalPay > 10) {
        totalPayAmounts.add(totalPay)
    }
}
return totalPayAmounts

For the sake of comparison, here’s the thinking process behind this code snippet:

  1. Create an empty container that will hold the desired output
  2. Loop through each delivery in the input
  3. Compute the total pay
  4. If the total pay is greater than $10, add it to the output container
  5. Return the output container

Now let’s look at how we can write the same logic in FP style.

return deliveries
    .map { delivery -> delivery.basePay + delivery.customerTip }
    .filter { totalPay -> totalPay > 10 }

And the thinking process behind this code snippet:

  1. Transform all deliveries into the total pay of each delivery
  2. Filter and keep only those that have total pay greater than $10

From this example, it’s not hard to see how different the mindsets are between FP and IP. In the imperative style, the logic flows from top to bottom, using a mutable state (totalPayAmounts) and a for loop to compute the end result. In contrast, FP focuses on the input data and how we transform and filter it. In the FP-style code snippet, no additional state is introduced and no loops are used. Instead, it uses Kotlin’s built-in collection functions map and filter, in conjunction with two lambda expressions, to compute the final result list. Overall, it makes the logic easier to read and reduces the extra state created in the program.

Let’s look at another, more elaborate example. Suppose we have a list of deliveries, and we want to keep only the deliveries that have a customer tip greater than $5, find the latest 10 deliveries by drop-off time, and get the Dasher IDs of these deliveries. As before, we’ll start with how we can write this in IP style.

val filteredDeliveries = mutableListOf<Delivery>()
for (delivery: Delivery in deliveries) {
    if (delivery.customerTip > 5) {
        filteredDeliveries.add(delivery)
    }
}

// Sort by delivery.dropOffTime descending
filteredDeliveries.sortWith(compareByDescending<Delivery> { it.dropOffTime })

val result = mutableListOf<UUID>()
for (i in filteredDeliveries.indices) {
    result.add(filteredDeliveries[i].dasherId)
    if (i == 9) break
}

For the same logic, here is the FP style code:

val result = deliveries
    .filter { it.customerTip > 5 }
    .sortedByDescending { it.dropOffTime }
    .map { it.dasherId }
    .take(10)

Here we use the special Kotlin identifier it, which refers to a lambda expression’s single parameter implicitly. In the lambdas above, each it refers to a delivery object in the list.

There’s no doubt how clean and elegant the FP-style code looks compared to the IP code. Reading the snippet is basically reading plain English:

  1. Filter the deliveries to keep only those with customer tip greater than $5
  2. Sort the list in descending order of delivery drop-off time
  3. Transform the elements into the Dasher ID of the delivery
  4. Take the first 10 elements from the list

While this example looks simple enough for illustration purposes, it’s not hard to see how flexible it is if we want to apply more complex logic to the collection. Suppose that multiple teams want to filter the list of deliveries based on their own logic, call them complexFilterFunc1, complexFilterFunc2, and so on. They can simply apply the filtering logic directly to the deliveries by calling the functions in a series. Since filter is a higher-order function, it can take other functions as the argument.

val result = deliveries
    .filter { complexFilterFunc1(it) }
    .filter { complexFilterFunc2(it) }
    .filter { ... }
    ...

Better yet, because these filtering functions are pure, they can be reordered and invoked in any order without changing the underlying logic.

val result = deliveries
    .filter { complexFilterFunc3(it) }
    .filter { complexFilterFunc1(it) }
    .filter { ... }
    ...

If passing it to all the filtering functions seems redundant, Kotlin offers a way to pass a function reference to a higher-order function using the double colon (::) operator:

val result = deliveries
    .filter(::complexFilterFunc1)
    .filter(::complexFilterFunc2)
    .filter(...)
    ...

By now we should be familiar with how to write FP-style code that transforms one list into another. What if we want to transform the list into another data structure, like a map? This is not only possible but also very common in our day-to-day coding. Let’s look at an example.

Suppose we have a list of deliveries. We now want to see for each Dasher, how much they earned in tips for each hour of the day. The end result will be structured as a map from Dasher ID to another map, where the key is the hour of the day, and the value is the total customer tip they earned. We will start by looking at how we will do this in IP style.

val dasherIdToDeliveries = mutableMapOf<UUID, MutableList<Delivery>>()
for (delivery: Delivery in deliveries) {
    if (dasherIdToDeliveries.containsKey(delivery.dasherId)) {
        dasherIdToDeliveries[delivery.dasherId]!!.add(delivery)
    } else {
        dasherIdToDeliveries[delivery.dasherId] = mutableListOf(delivery)
    }
}

val resultMap = mutableMapOf<UUID, MutableMap<Int, Double>>()
for ((dasherId, deliveriesByDasher) in dasherIdToDeliveries) {
    val hourToTotalTipMap = mutableMapOf<Int, Double>()
    for (delivery in deliveriesByDasher) {
        val hour = delivery.dropOffTime.get(Calendar.HOUR_OF_DAY)
        if (hourToTotalTipMap.containsKey(hour)) {
            hourToTotalTipMap[hour] = hourToTotalTipMap[hour]!! + delivery.customerTip
        } else {
            hourToTotalTipMap[hour] = delivery.customerTip
        }
    }
    resultMap[dasherId] = hourToTotalTipMap
}
return resultMap

This is definitely not a piece of clean code. It uses a double for-loop, two mutable maps, and two if-else blocks to get the final result. Now let’s look at how we can write this in FP style.

val result = deliveries
    .groupBy { it.dasherId }
    .mapValues { it.value
        .groupBy { delivery -> 
            delivery.dropOffTime.get(Calendar.HOUR_OF_DAY) 
        }
        .mapValues { hourToDeliveries -> 
            hourToDeliveries.value.sumOf { delivery -> 
                delivery.customerTip
            }
        }
    }

There are a few new functions being used, so we will explain what they do first before going through the code:

  • groupBy: Given a list of items, return a map from each key produced by the key selector (in this case, the lambda expression) to the list of items that share that key
  • mapValues: Given a map, return a new map with the same keys, where each value is obtained by applying the transformation function to the original entry
  • sumOf: Given a list of items, return the sum of the values produced by the selector function for each item

With these definitions in mind, the FP-style code reads like this:

  1. Group the list of deliveries by Dasher ID
  2. For each group of deliveries, group by drop-off hour
  3. For each sub-group of deliveries (from grouping by drop-off hour), sum by customer tip

This example demonstrates the capability to group and aggregate a collection of data with FP in Kotlin. It’s very common to apply nested collection-based functions to the intermediate collections created in the previous step and transform them into whatever new data type is needed. This is a very powerful capability because developers are not restricted to transforming data into the same type as the input.

Conclusion

Functional programming is a powerful programming paradigm that can help developers easily write cleaner and better code for day-to-day programming needs. This is especially true when developers are working on mission-critical operations, large distributed systems, and intensive data transformation. Leveraging it along with other common paradigms like object-oriented programming can help achieve the best of both worlds, especially with the rich ecosystem that the Kotlin language provides. While FP has its potential downsides, with modern techniques and thoughtful designs, we can aim for increased simplicity, testability, and readability without sacrificing efficiency and speed.

Maintaining Dasher supply to meet consumer demand is one of the most important problems for DoorDash to resolve in order to offer timely deliveries. When too few Dashers are on the road to fulfill orders, we take reactive actions to persuade more Dashers to begin making deliveries. One of the most effective things we can do is to message Dashers that there are a lot of orders in their location and that they should sign on to start dashing. Dashing during peak hours can mean a more productive shift, higher earnings, and more flexibility in choosing which offers to accept. 

We need to optimize which Dashers to target with our messages because approaching Dashers with no interest in dashing at that time can create a bad user experience.  Here we will describe a bandit-like framework to dynamically learn and rank the preferences of Dashers when we send out messages so that we can optimize our decisions about who to message at a given time.

Finding the best way to alert Dashers about low supply

Currently we select Dashers to message by identifying who has been active in a given location and then selecting recipients at random. While this approach doesn’t overload specific Dashers with messages, it doesn’t improve the conversion rate of Dashers coming onto the platform after receiving a push notification.

We need to find a methodology that uses our information about Dasher preferences while avoiding spamming Dashers who wouldn’t be interested in receiving notifications at that time. This problem statement lends itself to finding a machine learning approach that can:

  • Identify current responsive Dashers who are more likely to convert when asked to dash now
  • Identify Dashers who aren’t interested in these messages so we can avoid spamming them
  • Identify new responsive Dashers so that we don’t overtax our existing responsive Dashers
  • Rank Dashers by their willingness to dash when contacted so we know how to prioritize who to message at each send

ML approaches we considered 

One possible approach is to treat this as a supervised learning classification problem. We can use labeled past data (for example, whether a Dasher historically has signed on to dash when invited) and try to create a model that predicts a Dasher’s probability of dashing now when sent a message, given a set of features.

While this approach is easy to frame as a binary classification model, there are some issues with this approach. What if Dasher preferences change over time? For example, a Dasher who is enrolled in college could be very responsive during breaks, but largely unavailable once school resumes. This type of non-stationary behavior would have to be handled by the model trainer through retraining and heavily weighing more recent observations.

Another problem with this approach is that it only optimizes for the probability of dashing when a message is sent. With this approach, we would only be sending messages to Dashers we already know are likely to convert.  There would be no basis to send messages to other Dashers, giving them a chance to self-identify as responsive Dashers.

Because of our constraints and what we are optimizing for, there are multiple benefits to using a bandit algorithm instead of supervised learning. We can construct a bandit-like procedure that allows us to dynamically explore Dashers to message, over time identifying and optimizing on Dashers who respond to messages. This approach would allow us to dynamically allocate traffic to Dashers who are more responsive. 

As Dasher preferences change over time, the algorithm can relearn dynamically which Dashers would be most likely to convert. We can even easily extend this framework to use a contextual bandit; if Dasher preferences change based on time of day or day of week, the bandit can be given these features as context to make more accurate decisions.

Next, we need to select which bandit framework to use in order to allocate traffic to Dashers dynamically.

A trio of possible bandits

There are multiple factors involved in determining which bandit to use. We want an algorithm that explores enough to adjust to changing Dasher preferences and yet still sends messages to Dashers who we already know are responsive. Several algorithms come to mind as possible choices:

The Epsilon-Greedy algorithm defines a parameter – epsilon – that determines how much to explore sending messages to Dashers about whom we don’t know as much.

  • Pros:
    • Easy to understand and implement
    • Makes it easier to prioritize known Dashers based on their likelihood to respond to messages
  • Cons:
    • Because we have to define this constant epsilon percentage, it does not improve over time. We can explore too little early on and too much later in the process
    • Experimentation is not dynamic; no matter what we have learned about Dashers’ preferences, we are always exploring at a fixed percentage

The Upper Confidence Bound (UCB) bandit algorithm is based on the principle of “optimism in the face of uncertainty,” which translates to selecting the action with the highest upper confidence bound on its estimated reward.

  • Pros:
    • Finds the best-performing Dashers quickly
    • Once there’s enough data, starts to optimize sending messages to responsive Dashers instead of exploring 
  • Cons:
    • Difficult to communicate to stakeholders why a specific action was taken
    • When there is an excess of new Dashers, this method could end up only messaging new Dashers until enough signal is received

Thompson Sampling takes a Bayesian approach to the problem. We assign a prior probability distribution to each Dasher, which is updated to a posterior distribution as we observe outcomes.

  • Pros:
    • Intuitive approach that counts the successes and failures of each message sent to a Dasher 
    • Depending on the probability distribution used, we can take advantage of the conjugate relationship between prior and posterior probabilities and use a simple update rule to get the posterior probability
    • Easy to implement
    • Finds best-performing Dashers quickly
  • Cons:
    • Requires manually setting priors for new Dashers; an approach like UCB always includes Dashers we have not previously messaged 

Why we chose Thompson Sampling

Given these three frameworks, we selected Thompson Sampling for its intuitive approach and ease of implementation. 

We started by defining our target: the probability that a Dasher who receives a message will convert and sign on to DoorDash immediately. After this, we needed to define a prior for each Dasher from which we could sample to decide who to message. A prior is a probability distribution that models the probability that a given Dasher will respond when messaged. Along with choosing an appropriate prior, we also needed a method for updating it given new information. We used a beta distribution because it directly uses the number of successes (alpha) and the number of failures (beta) to model the probability of success. Thanks to the conjugate relationship between the beta prior and its posterior, the update rule is intuitive: add to alpha if a Dasher converts, otherwise add to beta. As we update the distribution after each message, its variance shrinks as we become more certain of the outcome.

Our last decision when defining the prior was whether to start with pure exploration (a uniform prior) or to use past data to inform our prior. We chose to inform each Dasher’s prior with previous message and conversion data to speed up the convergence of the distributions. We apply a weight-decay parameter to previous observations to favor recent data over historical observations. This way, when we start the experiment, the bandit has a head start on Dasher preferences without biasing too heavily toward old, and potentially stale, data.
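To make this concrete, here is a minimal Kotlin sketch of the idea; the class, the decay scheme, and the use of Apache Commons Math for Beta sampling are our illustrative choices, not the production implementation:

import org.apache.commons.math3.distribution.BetaDistribution

// Hypothetical per-Dasher prior: alpha counts conversions, beta counts non-conversions.
// alpha = beta = 1.0 corresponds to a uniform (pure exploration) prior.
class DasherPrior(val dasherId: String, var alpha: Double = 1.0, var beta: Double = 1.0) {

    // Conjugate update rule: add to alpha on a conversion, otherwise add to beta.
    // The decay factor down-weights older observations so recent behavior dominates.
    fun update(converted: Boolean, decay: Double = 0.95) {
        alpha *= decay
        beta *= decay
        if (converted) alpha += 1.0 else beta += 1.0
    }

    // Thompson Sampling: draw one value from the current Beta(alpha, beta) distribution.
    fun sample(): Double = BetaDistribution(alpha, beta).sample()
}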

Next, we needed to tune a set of hyperparameters vital to modeling the situation accurately. The questions we considered included:

  • How long should each observation window be? If it’s too short, we can’t accumulate enough reward or penalty for each run. If it’s too long, it takes extra time for the algorithm to find high-performing Dashers.
  • How stationary is the problem? Dasher behavior changes over time, so we must give greater weight to recent observations than to those recorded in the past. If a previously responsive Dasher stops responding, we need to update our probability distribution quickly.
  • What prior should we give new Dashers? It’s important to add new Dashers to the algorithm without degrading performance, while still giving them a chance to be selected so that we can learn whether they are high-performing Dashers.
  • Given the imbalance in the data (many more Dashers choose not to dash when messaged), how much weight should we give success versus failure?

After defining our beta distribution, update rule, and these hyperparameters, we are ready to use the bandit procedure to decide which Dashers to message. In our experiment, whenever we are ready to send out a message, we let the bandit sample all prior distributions to give us the probability of converting when messaged. We then rank the Dashers in descending order by their sampled value and take the top Dashers whose sampled value is greater than a predetermined threshold so that we don’t message Dashers who the bandit has determined won’t convert. We define the number of Dashers to contact by first determining how many are needed to resolve the current shortage. We then divide that number by the average conversion rate for Dashers in that location. The bandit then can message the Dashers who it has determined are most likely to get on the road.
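A hedged sketch of that selection step, building on the DasherPrior sketch above (the threshold and shortage inputs are placeholders):

import kotlin.math.ceil

// Hypothetical selection: sample each Dasher's distribution, apply the threshold,
// rank by sampled value, and cap the count based on the current shortage.
fun selectDashersToMessage(
    priors: List<DasherPrior>,
    threshold: Double,          // minimum sampled conversion probability to be messaged
    dashersNeeded: Int,         // Dashers needed to resolve the current shortage
    avgConversionRate: Double   // average conversion rate for this location
): List<String> {
    val messagesToSend = ceil(dashersNeeded / avgConversionRate).toInt()
    return priors
        .map { it.dasherId to it.sample() }               // one Thompson sample per Dasher
        .filter { (_, sampled) -> sampled > threshold }
        .sortedByDescending { (_, sampled) -> sampled }
        .take(messagesToSend)
        .map { (dasherId, _) -> dasherId }
}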

Results

Currently, we are running experiments to test this bandit framework against our previous random sampling method. We are using a switchback experiment to measure the impact that improved message targeting has on the overall supply/demand balance for a given location. Using this testing framework, we not only see if there is an increase in Dashers who respond to messages, but we can also see what effect these additional Dashers have on the market supply. So far, we have seen an improvement in the conversion rate of messages sent in the bandit framework, which has allowed us to send fewer messages than required by our control variant. We are experimenting further to prove the impact.

Conclusion

While we have tailored Thompson Sampling to a specific Dasher scenario, this solution can work in many different scenarios. Companies seeking to provide a personalized experience to all of their customers may have limited data to figure out how to best accomplish that. Thompson Sampling can help demonstrate which options give the greatest reward in a non-stationary environment. The method works well in a quickly changing business environment where there’s a need to dynamically optimize traffic. With a single model, we get the advantages of velocity, dynamic traffic allocation, and a solution that handles changing behavior over time.

While what we have done to date works well, there are many ways we can improve upon this approach. Currently, we only consider whether a Dasher signed on after receiving a message. But additional data lets us know that Dashers’ preferences change based on their location, time of day, day of week, and much more. Over time, we can encode this information as contextual features so that the bandit can make even smarter decisions.

Acknowledgements

This post is based in large part on the great work of our intern Hamlin Liu. We are excited to have him join us full time in August!

At DoorDash, machine learning (ML) models are invoked many millions of times each day. Each of them uses dozens or hundreds of features that take a dazzling amount of computational power to produce. 

These models, which play many critical roles, including fraud detection, must meet stringent requirements of reliability and correctness in order to be put into production. We also need to be able to quickly adapt them to evolving business needs and improved understanding of the problems being addressed. 

In this article, we describe the practice of “dark shipping” of ML models. This practice allows us to balance the tension between the needs of reliability and speed for model deployment, which can be challenging in some areas of ML application, such as for models that prevent fraud and abuse.

The challenges of launching ML fraud models

The challenges to successfully launching machine learning fraud models include:

  • Complex feature engineering
  • Scaling and availability
  • Correctness in production

Let’s start by examining them individually.

Complex feature engineering

Our anti-fraud specialists are in constant search for insights into how to identify and stop fraud, even as the fraudsters are in constant search of new ways to perpetrate fraud.

The insights produced by anti-fraud specialists then need to be implemented in ways that can be leveraged by machine learning algorithms. This is usually done through the process of feature engineering, which involves data scientists who create the new features, train, and evaluate different model variants, settling on the most promising features and modeling techniques.

These features and models then need to be fully trained and put into production by ML engineers, which leads us to the next challenge.

Scaling and availability

Once a novel fraud-fighting approach has been identified and validated by anti-fraud specialists and data scientists, it then needs to be delivered to production. DoorDash has a capable general-purpose machine learning platform. The anti-fraud ML model capability, while leveraging the DoorDash ML platform, is invoked in the context of the overall anti-fraud platform. Leveraging these two platforms allows us to address the challenges of scale and availability, while tying complex ML models into the context of fighting fraud.

As a result, hundreds of complex model features are computed in real-time and the models are invoked for nearly every interaction with the platform, resulting in activation of anti-fraud measures depending on decisions rendered by the models.

Ensuring correctness in production

In addition to meeting the challenges of scale and availability, we must meet the challenges of end-to-end correctness while invoking the models. Potentially, lots of things can go wrong, and even though we test the models at every stage during the model development lifecycle, the final answer to model correctness can be found only in production, with real, novel data.

This presents a conundrum: What if the new version of the model we shipped is less efficient than the previous model at stopping fraud? Even worse, what if the new model has a catastrophic defect, leading to the blocking of every attempted transaction? Another nightmare scenario: What if the model performs as expected but exerts prohibitively high load on our systems, due to expensive queries? At DoorDash volumes, a regression of that kind can result in systems quickly grinding to a halt under unexpected load.

Clearly, we cannot ship a model to production and just hope for the best.

A familiar challenge – change management 

Generally speaking, change management is a familiar problem, particularly in large, business-critical software systems. In fact, the vast majority of production regressions and outages are caused by human-introduced changes, such as changes to code or configuration of the systems.

To meet the challenge of change management, the software industry has developed a large body of knowledge, skills, and tools when it comes to the rollout of code and configuration.

Modern large-scale software systems deploy continuously or nearly so. One of the techniques that makes this possible is shipping new code darkly: The new code paths are “guarded” by feature flags and are not activated on deployment but are activated afterward, usually gradually and under careful observation of relevant metrics. If a regression is observed, the offending code paths can be turned off quickly, without the need for code rollbacks or roll-forward hotfixes, both of which usually take much longer.
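As a simplified, self-contained illustration (the flag store and scoring functions below are invented; real systems use a dynamic configuration service), a darkly shipped code path typically looks like this:

// Hypothetical flag store; in practice this is a dynamic configuration service
// that can be flipped without a deployment.
object FeatureFlags {
    private val enabled = mutableSetOf<String>()
    fun isEnabled(flag: String): Boolean = flag in enabled
    fun enable(flag: String) { enabled.add(flag) }
    fun disable(flag: String) { enabled.remove(flag) }
}

// The new code path ships dark: deployed, but only exercised once the flag is ramped up.
fun fraudScore(orderAmount: Double): Double =
    if (FeatureFlags.isEnabled("use_new_fraud_model")) {
        newModelScore(orderAmount)    // new, guarded code path
    } else {
        legacyScore(orderAmount)      // existing behavior, unchanged
    }

// Stand-ins for the actual scoring implementations.
fun newModelScore(orderAmount: Double): Double = orderAmount * 0.01
fun legacyScore(orderAmount: Double): Double = orderAmount * 0.02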

ML adds additional complications of change management

However, as mentioned above, management of change for ML models presents additional complications, including:

  • Data quality: Both at the time of training and at the time of inference (production operation), we need to make sure that the data is extracted consistently, without errors.
  • Training stability: For example, sensitivity to hyperparameter values and consistency on retraining
  • Difficulty of automating verification: Writing good tests for code is hard enough. Writing similar testing suites for ML models is nearly impossible. Yet somehow we must control the quality of model scores and decisions.
  • Difficulty of sense-making: While the source code can be examined directly to find bugs and make sense of its workings, the ML models are less easily interpretable. 

With ML models, even more so than with “regular” code, expectations of correctness can be verified only in production. But how to do it safely? By using a dark rollout. 

Solution: Dark rollout of ML models 

After a reasonable pre-production validation, we ship the model to production in a manner that allows us to fully validate it with real traffic before we allow it to make live decisions. Below is the sequence of steps developed and practiced by the DoorDash Anti-Fraud DSML team.

Step 0: Pre-production iterations

Before a model goes to production, it is iterated rapidly and extensively in the development environments, where it is updated, trained, evaluated, and tuned, with a turnaround time ranging from minutes to hours. Once the backtesting results look consistently good, it’s time to go to production.

Step 1: Production: Shadow traffic, 1% volume

If new model features require additional production code (for example, to integrate with novel data sources), it’s added as dark code paths, along with the model invocation code.

These changes are highly standardized: They leverage the Anti-Fraud team’s rule engine and DoorDash’s ML service, together implementing a complete model lifecycle. The result is a trained model that can serve predictions reliably and at scale.

The rule engine provides important facilities for fault isolation, observability through logging and metrics, integration with data sources, as well as integration into overall DoorDash microservice architecture.

These facilities allow us to exercise the new model with “shadow” traffic (that is, without any business decision impact), with a volume as low as just a fraction of a percent. 
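A hedged sketch of what shadow invocation at a small sampling rate can look like (the names and the logging here are illustrative; in practice the rule engine handles sampling, logging, and fault isolation generically):

import kotlin.random.Random

// Fraction of production traffic on which the new model runs in shadow mode.
const val SHADOW_SAMPLE_RATE = 0.01

fun evaluateFraudRisk(features: Map<String, Double>): Boolean {
    // The incumbent model still makes the real decision.
    val decision = incumbentModelBlocks(features)

    // The candidate model runs only on a sampled fraction of traffic; its score is
    // logged for offline comparison and never affects the live decision.
    if (Random.nextDouble() < SHADOW_SAMPLE_RATE) {
        runCatching { candidateModelScore(features) }
            .onSuccess { score -> println("shadow_score=$score live_decision=$decision") }
            .onFailure { err -> println("shadow_error=${err.message}") }
    }
    return decision
}

// Illustrative stand-ins for the two models.
fun incumbentModelBlocks(features: Map<String, Double>): Boolean =
    (features["risk_score"] ?: 0.0) > 0.9

fun candidateModelScore(features: Map<String, Double>): Double =
    features.values.sum() / features.size.coerceAtLeast(1)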

At this time, the model is exercised safely (at low volume and with shadow traffic only), while in a true production environment, end-to-end. This allows us to verify multiple things:

  • There are no errors due to misconfiguration, missing data sources, timeouts, etc.
  • The model performance is within expected parameters.
  • All features are extracted correctly; that is, inference-time feature extractors produce the same values as training-time feature extractors.
  • There are no anomalies in system metrics, such as high latencies, memory consumption, CPU utilization, etc.

These checks are performed with both specialized tools (for example, for feature extraction consistency) and our standard observability and alerting stack (time-series dashboards, log monitoring, alerting, and paging services).

Step 2: Production: Shadow traffic, 100% volume

We can now ramp up the shadow traffic to 100% of the volume, which serves two purposes:

  • We can analyze model performance without risking any adverse business impact.
  • We can make sure there’s no undue deterioration of system metrics due to additional load.

Step 3: Experiment: Incumbent model vs. new model

By now, we are reasonably confident that the model will perform well. But will it do better than the previous champion model? To find out, we use the DoorDash Curie experimentation system, setting up an experiment that compares the performance of the old and the new models in a rigorous evaluation. Once we see statistically significant improvement, the new model is ramped up to receive 100% of the live traffic – until a newer version arrives to challenge the champion!

Conclusion

The practice of shipping ML models darkly enables us to iterate on production ML deployments quickly while minimizing risk of regressions. This is achieved by applying production change-management practices borrowed from modern software engineering and adapted for the specifics of machine learning. We encourage ML practitioners to explore this and other techniques that bridge the gap between applied ML and modern production engineering.

When DoorDash was on a monolithic application, developers’ end-to-end (e2e) testing needs were met by sandboxes, but when DoorDash moved from the monolith to microservices, we needed a more scalable approach to production testing. E2e testing is a technique to verify a product from beginning to end to ensure the application flow behaves as expected in a production environment.

Sandboxes stopped meeting our needs because running all the microservices requires more hardware resources than a single machine can provide, so we initially set up a staging environment for developers to e2e test their features before deploying or enabling them in a production environment. A staging environment is an isolated environment intended to be a replica of the production environment without personally identifiable information (PII). But the staging environment didn’t solve all of our developers’ e2e testing needs, and DoorDash’s growth made it increasingly difficult to operate and maintain.

We had also noticed that developers started using custom workarounds in the production environment for e2e testing. But the nonstandard custom workarounds were hard to build and maintain and were, in fact, slowing down developers.

To address the limitations of the staging environment and bring standardization to e2e testing, we moved e2e testing into the production environment through multi-tenancy, establishing guardrails and building developer tooling to speed common tasks so that developers can easily e2e test their features in a production environment. As a result, the new solution increases software reliability, developer velocity, and production environment safety.

Why e2e testing is not reliable in the staging environment

We noticed that e2e testing in the staging environment was becoming unreliable due to several limitations, such as a lack of observability and alerting and the lack of a good way to simulate production environment data. These limitations caused the staging environment to deviate from the production environment and led to development flows where the staging environment was either ignored or misused.

Lack of observability and alerting over the e2e functionality in the staging environment 

The staging environment was missing critical observability and alerting over its e2e functionality. Even though individual services had good test coverage, observability, and alerting, that was not sufficient to ensure that all the services worked in unison to produce the expected e2e behavior. Because the staging environment doesn’t impact real user functionality, and because the focus is constantly on the production environment, it became hard over time to build this kind of observability and alerting over e2e functionality in staging.

Lack of tooling to simulate production data in the staging environment easily

For proper e2e functionality of the staging environment, staging databases need to be populated with data that is a good simulation of the production environment data. This data simulation is also important to make the e2e behavior in the staging environment similar to the production environment. Otherwise, it is hard to gain confidence that changes verified in the staging environment work similarly in the production environment. A naive approach to copy the production environment data into the staging environment would easily result in personally identifiable information (PII) ending up in the staging environment, violating our compliance requirements. These compliance requirements make the production environment data simulation difficult at the scale of DoorDash.    


Development flows where the staging environment is either ignored or misused

Due to the above-discussed limitations, the staging environment became increasingly divergent from the production environment. Given the lack of observability over the e2e functionality with the staging environment, many teams began to ignore the staging environment and directly roll out features to the production environment. This development flow resulted in the staging environment missing critical functionality compared to the production environment. Additionally, given there are no service-level objectives (SLO) with the staging environment, many developers ended up deploying new, unstable service versions in the staging environment for individual service testing. This development flow resulted in the staging environment becoming even more unsuitable for e2e testing.

Why we decided to move e2e testing into the production environment

Because the staging environment had become less beneficial for e2e testing and fixing staging involved a huge effort, we decided to move e2e testing into the production environment.

Why not fix staging? Here were our reasons:

  • Adding observability and alerting over the e2e functionality in the staging environment involves an entire overhaul of all the services in the staging environment. Given there are hundreds of microservices in our stack, this requires huge development work and collaboration across the entire engineering organization.
  • There is no readily available tool to simulate the production environment data in the staging environment. Moreover, a naive approach to copy the production environment data into the staging environment would easily result in PII ending up in the staging environment, breaking our compliance requirements.
  • The development flow needs to be enforced so that the staging environment uses stable service versions and configurations that match the production environment. Even though this is relatively easy to implement, it represents a very significant cultural shift.

In contrast, we noticed the following advantages with the production environment:

  • The production environment has a very high focus on observability and strict service-level agreements (SLAs) available for e2e testing use cases.
  • Using the production environment for e2e testing with necessary isolation would avoid data duplication between the environments and eliminate the risk of breaking compliance requirements.
  • The production environment has enforcements in place to use stable service versions only.          

Further, we noticed that developers had been using the production environment for e2e testing based on their custom workarounds rather than maintaining the services and simulating production environment data in the staging environment. Custom workarounds included:

  • Allowlisting/blocklisting one’s employee account to customize the e2e functionality.
  • Ad hoc scripts to create test users and use multiple runtime configurations to change the behavior for these test users.
  • Building custom logic to filter test users and data in analytics.

Due to the nonstandard nature of these workarounds, developers spent a lot of time building/maintaining them, which motivated us to move e2e testing into the production environment with a focus on production environment safety and developer velocity.

How we moved e2e testing into our production environment

Once we decided to move e2e testing into the production environment, we identified the following requirements for the new solution based on the properties of the isolated staging environment.

  • The staging environment is isolated and not accessible to the external world (i.e., outside VPN). We required the same property in the new solution for security reasons. This requirement implies the PII generated by e2e testing is isolated and not accessible to the outside world.
  • To simplify and standardize the custom workarounds, we needed a standard mechanism to customize software behavior for e2e testing in the production environment while sharing most of the existing software behavior with the production environment.

Using multi-tenancy to unlock e2e testing in the production environment

The above requirements pointed us to a solution using multi-tenancy, a concept in which the same instance of the software is shared with different user management, data, and configuration.

Multi-tenancy is an architecture paradigm where one software application and its supporting infrastructure are designed to serve multiple customer segments, also called tenants. We addressed our e2e testing challenges in the staging environment by adopting the multi-tenancy model and introducing a new tenant named DoorTest in the production environment. The multi-tenancy model enabled us to easily isolate test data and customize software behavior for the DoorTest tenant.

How we incorporated multi-tenancy in our stack

To isolate DoorTest tenant data and customize software behavior for the DoorTest tenant, we first needed a standard way for all the services to identify the tenancy of the incoming requests.

First, we defined a convention for the tenant value to bring standardization. The tenant value has two levels. The first level, defined as L0, represents a high-level product vertical (e.g., DoorDash, DoorTest, Drive, Storefront). The second level, defined as L1, represents a further classification within the given product vertical (e.g., developer sandbox). The tenant value, identified by the string <L0:L1>, is propagated across all the services through OpenTelemetry. Services can isolate data and define custom behavior based on the L0 and L1 tenant values.

We provided an option in our mobile and web apps to set the tenant value, i.e., the <L0:L1> string. This tenant value is attached as an HTTP header to all DoorDash API calls. The header is transformed into baggage format at the CDN layer and propagated across all our services through OpenTelemetry. This process enables our apps to operate with DoorTest tenant users and data.

Figure #1: Mobile client has an option in the debug panel to set L0 and L1 tenant values. Once these values are set and the app is restarted, the app will operate with DoorTest tenant users and data.
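On the service side, a hedged sketch of reading the propagated tenant value might look like the following; the baggage key name and the helper function are ours for illustration, assuming the OpenTelemetry Java baggage API:

import io.opentelemetry.api.baggage.Baggage

// Hypothetical helper: read the propagated "<L0:L1>" tenant value from OpenTelemetry baggage.
// The key name "tenant-id" is illustrative, not necessarily the actual header/baggage key.
fun currentTenant(): Pair<String, String> {
    val raw = Baggage.current().getEntryValue("tenant-id") ?: "DoorDash:default"
    val parts = raw.split(":", limit = 2)
    return parts[0] to parts.getOrElse(1) { "" }
}

fun isDoorTestTraffic(): Boolean = currentTenant().first == "DoorTest"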

How we isolated test data and customized software behavior for the DoorTest tenant

Through the above-discussed process, all the services in our stack can identify the tenancy of incoming requests. We then took the following steps to isolate test data and customize behavior for the DoorTest tenant:

  1. We created a new database for the DoorTest tenant and added a new column to all user-related tables called tenant-id, which stores the string <L0:L1>. This step introduces physical isolation of the DoorTest tenant (L0) data and logical isolation of the L1 tenant’s data.
  2. We added a query routing layer to pick the correct database. All the queries are updated to add an extra filter for the tenant-id column based on the incoming request tenant information.
  3. In the production environment database, we enforced a unique non-null constraint on the phone_number column, consistent with real-world behavior. We relaxed this constraint by making the phone_number column nullable in the DoorTest database, which simplified the creation of test users since we don’t need to find a unique phone number for each test user.
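A minimal sketch of what that routing and filtering might look like; the names (findConsumer, DbPool, the pool wiring) are illustrative stand-ins, not our actual service code.

interface TenantContext { l0: string; l1?: string }
interface DbPool { query(sql: string, params: unknown[]): Promise<unknown[]> }

async function findConsumer(
  tenant: TenantContext,
  consumerId: string,
  pools: { production: DbPool; doorTest: DbPool },
) {
  // Physical isolation: DoorTest (L0) traffic goes to its own database.
  const pool = tenant.l0 === "DoorTest" ? pools.doorTest : pools.production
  // Logical isolation: every query also filters on the <L0:L1> tenant id.
  const tenantId = tenant.l1 ? `${tenant.l0}:${tenant.l1}` : tenant.l0
  return pool.query(
    "SELECT * FROM consumer WHERE id = $1 AND tenant_id = $2",
    [consumerId, tenantId],
  )
}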

Once we isolated the test data, we noticed that more guardrails were necessary to enhance production environment safety against e2e testing.

Figure #2: Mobile and web clients have an option to set DoorTest tenancy. The tenant information is attached as an HTTP header to all outgoing DoorDash API calls. The CDN layer transforms the HTTP header into baggage format. The DoorTest tenant context is propagated across all the services through OpenTelemetry. Each individual service applies data isolation based on the tenancy context. 

Building guardrails to enhance production environment safety against e2e testing

We leveraged the multi-tenancy model to build guardrails to enhance production environment safety against e2e testing: 

  • Since the new tenant DoorTest is only applicable for internal usage, we needed the new tenant to be accessible only on VPN. To achieve this, we added a safeguard to our CDN layer that inspects incoming traffic and removes the tenant header if the value is DoorTest and the client IP is not on the VPN (see the sketch after this list).
  • E2e testing in the production environment shouldn’t impact real users, so we built the following guardrails by using the tenancy context of the incoming requests in the backend services:
    • Test consumers cannot place orders with real stores. Similarly, real consumers cannot place orders with test stores.
    • Test Dashers (DoorDash’s term for delivery drivers) cannot get orders assigned from real stores. Similarly, real Dashers cannot get orders assigned from test stores.
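A minimal sketch of these guardrails, assuming hypothetical names (the x-tenant-id header and the isVpnIp helper) that stand in for our real CDN and service code.

// CDN-layer safeguard: strip the tenant header for DoorTest traffic coming
// from outside the VPN, so external clients can never act as the test tenant.
function stripTenantHeaderIfExternal(
  request: Request,
  isVpnIp: (ip: string) => boolean,
): Request {
  const tenant = request.headers.get("x-tenant-id") ?? ""
  const clientIp = request.headers.get("cf-connecting-ip") ?? ""
  if (tenant.startsWith("DoorTest") && !isVpnIp(clientIp)) {
    const headers = new Headers(request.headers)
    headers.delete("x-tenant-id")
    return new Request(request, { headers })
  }
  return request
}

// Backend guardrail: consumers, Dashers, and stores may only transact within
// the same L0 tenant (test with test, real with real).
function canTransact(consumerTenantL0: string, storeTenantL0: string): boolean {
  return consumerTenantL0 === storeTenantL0
}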

With the above guardrails, we no longer needed the custom workarounds that were hard to maintain, thus improving developer speed. In addition, the multi-tenancy model paved the way for building more such guardrails in a standardized fashion.

Building tooling for developers to speed e2e testing in the production environment

Along with building the new DoorTest tenant and its guardrails, we identified the need for UI tooling that automates the most common tasks. The new tool enables the following functionality to speed up e2e testing in the production environment:

  • Automatic creation of test users (consumer, dasher, etc.) in the new tenant
  • Simulation of a test user’s address or geographical location
  • Easy access to a list of pre-created test stores and the ability to automatically place orders from them
  • Ability to create Dasher shifts and automatically assign test orders to them

This internal UI tool is backed by a microservice that exposes the same functionality through gRPC API so that developers can leverage the same functionality in automated e2e tests. Through this tooling, developers can create various scenarios reliably and quickly for their e2e testing purposes.

Conclusion

Lots of companies develop using staging environments, but we found that our staging environment was hard to maintain and operate in a large distributed system. So we developed a solution that moved e2e testing into the production environment using multi-tenancy with guardrails. 

By moving e2e testing to the production environment, we observed the following results:

  • Because the production environment has a high focus on observability and alerting, e2e testing there is much more reliable than it was in the staging environment.
  • The multi-tenancy model provides a standard mechanism for services to do data isolation and customize behavior for e2e testing. This eliminates the need for copying production data into the staging environment and the need for custom workarounds that are otherwise needed for developers to perform e2e testing in production.
  • As a result of the standardization brought by multi-tenancy, it became possible to build developer tooling that automates the most common tasks to speed up e2e testing.
  • Since this multi-tenancy model provides isolation and safety for e2e testing in the production environment, we developed shift-left tooling through Signadot. This tooling allows developers to e2e test their features during the development phase to gain more confidence over their code changes in addition to unit/integration tests. We will follow up with a detailed blog post explaining our journey with this shift-left tooling.

Acknowledgements

Thanks to Marco Chirico, Adam Rogal, Amit Gud, Jessica Mckenna, Carlos Herrera, Ivar Lazzaro, Maria Sharkina, Ignacio Scopetta and many other individuals who contributed to this effort. 

During Women’s History Month and throughout the year, DoorDash works to broaden access to opportunity, highlight women-owned businesses on our platform, and spotlight women who are leaders in their communities. As a company, we’ve also taken intentional steps in all aspects of our hiring, pay equity and promotion efforts to empower equal outcomes.

The powerful legacies of women are too often forgotten and, to promote a more equitable society, we all must highlight and celebrate the work of women. That’s why in addition to our external spotlights, we also took time internally to celebrate, connect and grow together through the stories of present and past women leaders in collaboration with our Women@DoorDash Employee Resource Group (ERG).

We kicked off Women’s History Month with a meet and greet, offering time for our leaders and community members to break off into small groups, make connections and engage in conversation. Our women leadership team, featuring Imani Grant (General Manager, Southeast), Katie Egan (Chief of Staff), and Liangxiao Zhu (Vice President of Engineering) also took part in a fireside chat, sharing how they’re changing the tide and leaving their mark in traditionally male-dominated fields.

“Women@DoorDash has been a platform for me to elevate my understanding of our business, think critically about my career goals, and engage with and learn from women across our organization. I’m passionate about education as a lever for women in tech to level the playing field, and Women@ is an incredible platform to facilitate that education.

“As a three-sided marketplace business, there are many teams at DoorDash that I don’t cross paths with organically in my work. Through Women@, I’ve been able to learn from women across our Dasher team, Engineering team, Leadership team, and many others, which has given me the ability to scale as an operator and business partner.”
Aislinn McNiece, Senior Associate, S&O

To close out the month, we were joined by several women business owners who have partnered with DoorDash to share challenges they’ve experienced and lessons learned as women in the restaurant business.

“Through our Women@ ERG, I have had the opportunity to meet and listen to women who are in spaces I would have never otherwise interacted with,” said Marissa Patton, Support Trainer. “It has given me a community of people to learn from. I think that is irreplaceable not only in a business, but life as well.”

The university recruiting team also hosted a panel session for women and non-binary students to connect with leaders from the Women@ ERG and gain awareness about our university program, learn about opportunities offered at the company and gain a perspective into what life is like at DoorDash.

With these insights, we hope to help pave the road ahead for women and allies to help claim influence and power in the workplace.

As we continue to celebrate the contributions of women, we recognize there is still work to be done in establishing gender equality. In our long-term goal to bridge that gap, we’ve also established programs to support women-owned businesses and identified partnerships to help us in the recruitment of talented women.

How we’re supporting women-owned businesses

DoorDash is supporting women entrepreneurs in raising capital through business loans with loan matching opportunities and Entrepreneurship & Access Programs in the United States and Canada.

Throughout Women’s History Month, we spotlighted women-owned businesses on our platform in partnership with Re:Her and Bumble. We also highlight businesses through an in-app “Women Owned” store banner throughout the year.

Click here to learn more about these initiatives and our partnerships.

How we’re recruiting talented women

Women Who Code
For four years, DoorDash has sponsored and supported Women Who Code, an international non-profit organization dedicated to inspiring women to excel in technology careers. They provide an avenue into tech and empower women with skills needed for professional advancement.

Women in Product
DoorDash has been an annual sponsor of Women in Product, which strives to equip women to thrive in product management careers at all levels.

Grace Hopper Celebration (AnitaB.org)
The GHC celebration was founded in 1994 and since then, it has been the premier gathering for women in technical roles, bringing together professionals from all levels of experience to network, professionally develop, and find their place in tech. DoorDash will be a sponsor at their upcoming conference in Orlando.

Sistas in Sales
DoorDash has sponsored the organization and their annual summit. The organization unites women of color in professional sales careers through networking, panel discussions, and socials. In 2021, we brought in DoorDash speakers to present at their annual summit. We also hosted a virtual event to highlight open roles on our various sales teams.

InHerSight
DoorDash is a proud partner of InHerSight, a platform that measures how well companies support women from entry level to the c-suite. Their website allows women to anonymously rate and review their workplace against metrics such as inclusive policies, pay equity, and culture. The platform not only allows us the opportunity to highlight the phenomenal women in our organization, but to learn how we can be a more inclusive and equitable place for women to excel.

The Muse
Founded by women, the Muse is the go-to destination for the next-gen workforce to research companies and careers. DoorDash partners with the Muse to help job seekers understand our company culture, workplace, and values through our employee stories.

Women’s history, women’s futures

At DoorDash, our work to support women-owned businesses is an ongoing effort that is reflected in our platform, voice, and resources. As we celebrate women’s history and the contributions they’ve made, we’re equally excited and looking forward, optimistically, to a more equitable world.

“In today’s work-from-home world, it can be easy to miss the fact that there’s a big, growing community of empowered women and allies at DoorDash,” said Margot Dale, Strategy & Ops Manager, New Verticals. “My favorite part of being a leader for Women@ ERG is getting that constant reminder that there are people from all corners of the company that care about my success and wellness. And I love providing that same reminder to others.”

To learn more about how we’re making room at the table and to find open roles, click here

It’s never been easier to get into iOS development than it is today. The tools are largely free, and there is a plethora of quality educational content available online, including some from Apple itself. As anyone who has done iOS for a long time will tell you, there is a magical feeling of starting a new project and quickly creating something from scratch.

However, as the size of a codebase increases and the number of contributing developers multiplies, that feeling of rapid progress can degrade into a slow trudge. Software development is a very iterative process, so the amount of time between making a small code change and being able to test/experience that change is very important to overall developer productivity. At DoorDash, we’ve been working to minimize this latency, keeping our development fast and our engineers happy.

We’ll talk about a few of the approaches we’re taking to increase development velocity, but the first one is also the easiest: buy your iOS engineers new M1 Max MacBook Pros!

New M1 MacBooks make everything faster

Apple released the consumer-grade M1 chips in the Mac Mini and MacBook Air in late 2020 (with impressive performance), and was widely rumored to be releasing the new MacBook Pros soon after. We wanted to get ahead of the game, so we started to make a business case for upgrading our entire iOS team to these new laptops as soon as possible.

From a business standpoint, the math is straightforward: An investment in developer productivity has a “payback period”. It’s simply a function of the expected time savings and the engineering cost. Instead of recovering the investment in dollars, it can be recovered in engineering time that would have been otherwise lost. In our case, that required estimating how much time these new laptops would likely save the average engineer each day and multiplying out appropriately.

Based on rough usage patterns, we assumed an average iOS engineer does around five clean builds and 30 incremental builds each day. We multiply that by our average build times to get a sense of how much time each engineer spends “waiting for the compiler” each day. A certain xkcd comic comes to mind.

Similarly, to calculate the time saved, we needed to measure how much faster the new laptops are. Doing this in advance of the hardware being available, we had to make some estimates based on the consumer-grade chips as well as Apple’s marketing information—with a bit of a haircut applied.

In our case, we were able to estimate that upgrading an iOS engineer would result in about 35 minutes of “saved time” each and every day. Multiplying that by the average cost of employing an iOS engineer, we found a “payback period” of less than two months!

Figure 1: Plotting the cost of “time spent compiling” of the old vs. new laptops shows the one-time cost of the hardware upgrade quickly pays for itself.
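To show the shape of this napkin math, here is a tiny sketch. Every number in it (per-build savings, hourly cost, laptop price) is a made-up placeholder, not a DoorDash figure.

// All numbers below are illustrative assumptions, not actual DoorDash data.
const cleanBuildsPerDay = 5
const incrementalBuildsPerDay = 30
const cleanBuildSavingMin = 4        // e.g. a clean build dropping from ~8 to ~4 minutes
const incrementalBuildSavingMin = 0.5

const savedMinutesPerDay =
  cleanBuildsPerDay * cleanBuildSavingMin +
  incrementalBuildsPerDay * incrementalBuildSavingMin  // = 35 minutes/day

const engineerCostPerHour = 200      // hypothetical fully loaded cost
const laptopCost = 4300              // hypothetical price of the upgrade

const savedDollarsPerDay = (savedMinutesPerDay / 60) * engineerCostPerHour
const paybackWorkingDays = laptopCost / savedDollarsPerDay  // ~37 working days with these placeholders

console.log(`~${savedMinutesPerDay} min/day saved, payback in ~${Math.ceil(paybackWorkingDays)} working days`)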

Importantly, this rough napkin math was so overwhelmingly positive that we didn’t need to account for a bunch of the less-obvious or harder-to-measure advantages from these faster machines. For example, we didn’t quantitatively take into account how engineers would be less frustrated, how Xcode would be more responsive, that auto-complete would work faster, or that there would be reduced context-switching while waiting for compile times.

With a solid business case, we placed an order to upgrade all current and upcoming iOS engineers to a 16-inch MacBook Pro with M1 Max and 64 GB RAM. 

What are the actual results?

Fast forward a couple of months, and we finally got our first shipment of these new machines. We wasted no time setting up a side-by-side build speed comparison between the new and old laptops and eagerly awaited the results.

Figure 2: Timing multiple “clean” builds of our benchmark project on a 2019 Intel i9 MacBook Pro and a 2021 M1 Max MacBook Pro showed consistent performance improvements.

The real-world results exceeded the estimates we made in our pitch. All told, our benchmark build went from 7:46 min. to 3:48 min. That’s almost twice as fast!

Twice. As. Fast. 

We found that to be especially impressive. If you think through what other changes you might make to a large codebase to speed it up by the same percentage — a bunch of micro optimizations that can be done here, some obsolete code to be removed there, etc.— you would not see anywhere near the same relative speedup, even with months of work.

However, not everything is good news. With the strong demand for these laptops and the ongoing global supply chain issues, we have yet to receive the replacement laptops for all of our engineers—but they’re en route. 

How else to speed up compile times?

Even with all this extra horsepower, it still takes minutes for the average build to complete. Not terrible, but not exactly the rapid iteration we’re aiming for. And as our team and codebase continues to grow, our time spent compiling will continue to increase — probably faster than Apple can decrease it with faster laptops. So what else can we do?

We can modularize our codebase. Instead of having to compile a million lines of code, we break the app down into many smaller chunks of code, mapping out the dependencies between them. Individual engineers and teams can then work on subsets of the codebase, dramatically improving development iteration speed. This is still a work in progress for us, but we’re starting to see some of the benefits and will have more to say about it in the future.

In parallel with our modularization effort, we’re also adopting new technologies like SwiftUI and Xcode Previews. These technologies allow us to almost entirely remove the tweak-compile-and-run loop when developing user interfaces. Instead, we can define our preview data and “live code” the actual view’s code, getting nearly instantaneous feedback right in the Xcode canvas. While this only helps “view” development, we’ve found that to be an area where the fast iteration is especially helpful.

We hope to have more articles that go into depth on these alternative methods of speeding up our iOS development, so stay tuned.

Conclusion

At the end of the day, if you’re doing professional development in Xcode all day, there is little doubt that upgrading to one of these new M1 MacBook Pros is a great choice. And if working in a large iOS codebase with a bunch of other passionate developers sounds interesting to you, come join us. We’ll have an M1 Max MacBook Pro waiting for you.

Finding the right balance between engineering and marketing has always been a challenge. In engineering projects, one focuses primarily on the user experience (UX), and subsequently, the developer experience (DX) that it takes to get there. But when marketing is added to the equation, the experience of marketing stakeholders, including the content team and designers, needs to be factored in.

Historically, marketing at DoorDash has been divided by DoorDash’s three-sided marketplace with Consumer, Dasher (delivery drivers), and Merchant, all with their own websites, content, design, and external agency support. As the number of external marketing websites grew, the quality, consistency, and manageability of these external websites began to suffer.

As the first engineering hires within the Growth Marketing group at DoorDash, we started with a thesis that speed, performance, unification, and bringing engineering best practices into marketing are important principles. We spent 2021 building a scalable, highly performant platform that would allow marketers to standardize and scale up internationally – without the need for engineers or designers after initial setup – all on one multi-tenant, team-agnostic, unified platform.

Simplifying, then adding lightness (legacy CMS woes)

Marketing at DoorDash in September 2020 was a simpler time. The company had a few months before its initial public offering (IPO) and COVID-19 had completely changed the restaurant landscape for the worse. The initial charter was to build a web team and manage the merchant acquisition website (get.doordash.com), responsible for more than 50% of restaurant sign-ups for the company.

The legacy CMS was being pushed to its limits. Issues included:

  • Minimal templatization and reusability – components, navs, headers, and footers were out of sync and inconsistent from years of manual edits
  • No brand or UI/UX consistency across websites
  • 99% of content was not translated
  • Analytics were hit and miss; duplicative tags and console errors
  • Events and queries were not documented
  • Pages were cloned and manually hardwired to Salesforce campaigns

In addition to the merchant acquisition website, there were nearly two dozen other websites in a similar state within DoorDash. It came as no surprise that teams had formed a habit of using external agencies to launch siloed microsites on a dizzying array of external platforms. 

All these teams were trying to do far too much, too manually, and too slowly. We needed to switch gears and build a platform that would put in UI/UX guardrails and promote engineering best practices, helping the company scale its marketing websites globally.

Building a Ferrari (or static site generation with Next.js)

DoorDash is a React and TypeScript shop, with a Design Language System and shared UI component library called Prism, managed by our design team. We wanted to choose a stack that would be consistent, manageable, and aligned with the values of other engineering teams – remember, the goal of this project is unification, not division.

We started with the merchant acquisition website and were determined to build a “Ferrari” that would be optimized for speed and performance – and we’d need to build a platform that we could reuse and leverage for other orphan websites down the line.

With the Google algorithm updates (Web Vitals) looming on the horizon, there would be a renewed focus on performance, page speed, and user experience. Static site generation was the obvious path forward, and our options quickly narrowed down to Gatsby and Next.js.

We staged proofs-of-concept for both frameworks and were impressed with the speed of static site generation.  Still, the developer experience of Next.js felt far more like a natural extension of React, and it was love at first test drive, for several reasons:

  • Production-ready (we’re in good company with other enterprise users)
  • Large community support
  • TypeScript support

We would dub this project internally as “Project Ferrari.” Next, we’d need to find a content management system (CMS) that could keep up.

Putting the stack aside, we knew we’d need a headless CMS for marketers and copywriters to create and manage content without the help of an engineer and designer – they couldn’t commit directly to a GitHub repo, could they?

We pared the list of candidates down to a handful of content management systems, with only a few meeting our stringent security and compliance requirements. Even fewer felt as well-rounded, documented, and tenured as Contentful, which offered the following features that made the decision a simple one:

  • Translation and localization support (i18n and l10n)
  • SAML SSO integration for centrally managed account permissions
  • Security certifications (ISO 27001 and/or SOC2 Type II)

Once we defined the tech stack and content management system, we needed to align on the technical project goals.

Defining technical project goals

We needed to ensure that the platform we were building would be easily reusable and scalable beyond just the merchant acquisition website that we were using as our initial pilot. Thinking about our long-term goals for the project, we knew we needed to:

  • Create a multi-tenant, team-agnostic website platform
  • Centralize content management across all core marketing web properties
  • Focus on templatization and sharing components for a cohesive UI/UX
  • Standardize and document analytics and event tracking
  • Optimize for performance – Google Web Vitals, page speed, and SEO
  • Live and breathe accessibility (a11y), internationalization (i18n), and localization (l10n)

Internationalization (i18n) as inclusiveness

When we started this project, DoorDash was officially live in the United States, Canada, and Australia, and we pushed marketers to start thinking about internationalization as inclusiveness – all campaigns launched in the United States needed to include both English and Spanish. At the same time, Canada required both English and French (with strict enforcement in Québec).

In 2021, we also launched in Japan and Germany – our language requirements on the web were growing exponentially, far faster than the team could manage manually on the old platform.

Fortunately, Contentful has built-in connectors for popular services such as Memsource or Smartling, making it easy for marketers to manage translations and localization directly in the CMS (see Figure 1 below). In the code, we went with an extremely lightweight package (<300 bytes, including dependencies) called Rosetta (Next.js code example here).

Figure 1: The two-way flow of translation strings between Contentful and our translation service
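As a rough sketch of how the Rosetta package mentioned above can be wired up (the keys and strings below are made up; see the linked Next.js example for the real integration):

import rosetta from 'rosetta'

// Hypothetical translation tables; in practice these strings come from
// Contentful and our translation service.
const i18n = rosetta({
  en: { signup: { cta: 'Become a DoorDash partner' } },
  es: { signup: { cta: 'Conviértete en socio de DoorDash' } },
})

i18n.locale('es')
console.log(i18n.t('signup.cta')) // "Conviértete en socio de DoorDash"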

Setting up CI/CD and hosting

At this point, we were trying to figure out where to host our Next.js project. If we were to leverage existing infrastructure at DoorDash, we’d need to dockerize Next.js, integrate with existing Jenkins CI/CD pipelines, and find a way to get it hosted on AWS – and that would require hiring additional DevOps engineers to help set up and manage this infrastructure, a task and scope far larger than our small Marketing Engineering team could handle.

There had to be an easier way. We simply wanted to commit code to a repo, trigger a build (automatically with webhooks), and deploy a website in minutes. 

We also wanted our team to spend more hands-on time working on marketing projects instead of getting in the weeds managing CI/CD infrastructure. More importantly, there was an urgency to see our work directly impact restaurants and merchants faster (we were now in the pandemic winter surge of 2021 and restaurants were closing their doors again).

Deploying to the edge

By chance, we stumbled upon a discussion on Hacker News on a Cloudflare blog post called, “Introducing Cloudflare Pages: the best way to build JAMstack websites.” In the comments, there were engaging replies by Cloudflare product managers and engineers. After some sleuthing, we reached out asking if they could squeeze us in for early beta access to Cloudflare Pages.

Cloudflare Pages takes care of the build and deploy process with a simple “git push” and we already know that static pages generated by Next.js are fast… But what if we could build and deploy static pages directly to the edge of Cloudflare’s global network (see Figure 2 below)? 

Figure 2: Deploying static content to Cloudflare’s edge network helps us reach our visitors globally at breakneck speeds.

Exactly 18 minutes after we reached out to Cloudflare, we received a reply from a product manager with a green light and test accounts! And that’s been our experience and relationship with Cloudflare ever since – fast-paced, energetic, collaborative, and overwhelmingly positive. The road ahead was finally clear.

Measuring performance and Web Vitals

February through April were wide open throttle (WOT) and a complete blur. Our team took charge of the UI/UX design and mockups ourselves, leveraging Prism. But soon, we realized that our small team wouldn’t be able to build and migrate thousands of marketing pages by summer.

Understandably, our stakeholders wanted to see proof of what kind of lap times our so-called Ferrari could actually put down (in production) before they committed to the entire migration and encouraged us to run an A/B test.

To test and track our real-world performance, we would need to standardize on analytics events and measure user experience and signup conversion on the website. We use the “track” API call from Segment.io with the following payload from the “web-vitals” project:

// SegmentEventName, SegmentEventVersions, and useSegment are internal helpers
// (not part of the Segment SDK itself).
const WebVitalNameMap = {
  FCP: SegmentEventName.WEB_VITALS_FCP,
  LCP: SegmentEventName.WEB_VITALS_LCP,
  CLS: SegmentEventName.WEB_VITALS_CLS,
  FID: SegmentEventName.WEB_VITALS_FID,
  TTFB: SegmentEventName.WEB_VITALS_TTFB,
}

// Called for each metric reported by the "web-vitals" library; forwards the
// measurement to Segment as a track event.
export function trackWebVital(name: string, id: string, value: number): void {
  const segment = useSegment()
  const eventName = WebVitalNameMap[name]

  if (eventName) {
    segment.track(eventName, {
      version: SegmentEventVersions.NewMxWebsite,
      page_name: window?.location?.pathname || '',
      page_url: window?.location?.href || '',
      id,
      value,
    })
  } else {
    console.warn(`Web Vital Name Map Missing: ${name}`)
  }
}

We then pipe event data into Amplitude (see Figure 3), where we can easily create charts to track performance. Note the red, downward trend is a good thing here (we’re adding lightness and shaving off lap times from our Core Web Vitals metrics with each and every code release):

Figure 3: We use Amplitude fed from Segment.io data to monitor our Web Vitals (LCP, FID, and CLS) for each and every marketing deployment

A/B testing with Cloudflare Workers

Now that all the pieces were in place, it was time to release the first version of our new website in a controlled rollout. We created an A/B test for our highest-traffic page and the entry point for merchant onboarding, get.doordash.com.

But how could we effectively split traffic 50/50 between our old CMS and a new CMS and platform on a completely different stack, while seamlessly using the same URL structure and subdomain? Cloudflare Workers and the DoorDash Traffic Engineering Team to the rescue:

// https://developers.cloudflare.com/workers/examples/ab-testing
 
function handleRequest(request) {
  const NAME = "experiment-0"

  // The Responses below are placeholders. You can set up a custom path for each test (e.g. /control/somepath).
  const TEST_RESPONSE = new Response("Test group") // e.g. await fetch("/test/somepath", request)
  const CONTROL_RESPONSE = new Response("Control group") // e.g. await fetch("/control/somepath", request)

  // Determine which group this requester is in.
  const cookie = request.headers.get("cookie")
  if (cookie && cookie.includes(`${NAME}=control`)) {
    return CONTROL_RESPONSE
  } else if (cookie && cookie.includes(`${NAME}=test`)) {
    return TEST_RESPONSE
  } else {
    // If there is no cookie, this is a new client. Choose a group and set the cookie.
    const group = Math.random() < 0.5 ? "test" : "control" // 50/50 split
    const response = group === "control" ? CONTROL_RESPONSE : TEST_RESPONSE
    response.headers.append("Set-Cookie", `${NAME}=${group}; path=/`)

    return response
  }
}

addEventListener("fetch", event => {
  event.respondWith(handleRequest(event.request))
})

In a matter of weeks, the DoorDash Analytics team waved the checkered flag and confirmed that our new platform resulted in a +12.12% increase in lead conversion and a +21.33% increase in the closed won conversion rate (at 99% statistical significance).

Beyond signup conversion, we improved time spent on site by +95.5% and lifted Web Vitals overall, decreasing LCP by 30.5% (see Figure 4). The Largest Contentful Paint metric, according to Google, “is an important, user-centric metric for measuring perceived load speed because it marks the point in the page load timeline when the page’s main content has likely loaded…”

Figure 4: Migrating to our new platform pushed Web Vitals into the green for the first time

The results of the A/B test far exceeded expectations and were a much-needed validation of months of hard work. Now that we had proven ourselves and our platform, we would move on to the long, arduous process of creating, testing, and deploying a library of reusable UI components, writing documentation, and training the marketing team on how to use Contentful to build pages on the fly – without the need for an engineer or designer to create custom code or mockups for each and every project.

Spinning up a new website is now only a matter of minutes – simply add a project to the monorepo, provision Contentful accounts, set up a Cloudflare Worker, and you’re off to the races! Did we somehow manage to capture the bleeding-edge performance of a Ferrari with the long-term vision, scalability, and reliability of The Toyota Way?

Scaling up the platform

Given the success of the initial website, it became clear that the overarching goal of the Marketing Engineering team would be to extend and manage our multi-tenant, team-agnostic platform that could help all our separate marketing teams unify their web presences and scale up.

As the platform grew (see Figure 5 below), we helped the business expand beyond restaurants into other verticals (such as grocery, alcohol, convenience, flowers, etc.) and launch into new international markets (Japan and Germany) – effectively stress-testing each strategic pillar that we build into every component in our codebase:

  • Analytics and reporting – using Segment.io for event tracking and piping the data into Amplitude so marketers and engineers alike can easily create meaningful charts; we also use Google Analytics, Google Tag Manager, and a third-party tool for GDPR compliance
  • Forms and attribution – integrating with an ESP and Salesforce CRM for both email and campaign-specific routing and attribution
  • Performance and Web Vitals – optimizing the codebase, reducing third-party scripts and pixels, measuring Web Vitals, and developing for mobile-first
  • Internationalization (i18n) and localization (l10n) – integrating our translation service directly in Contentful
  • Accessibility (a11y) – using Lighthouse Accessibility Scoring (part of Web Vitals), Web Accessibility Evaluation Tool (WAVE), color contrast, keyboard navigation, etc.
  • A/B testing and CRO – leveraging Optimizely to help quickly launch A/B and multi-armed bandit tests

Figure 5: Overview of the Marketing Engineering platform and integrations, including Cloudflare Pages and Contentful

We’re proud to have been hands-on across the company, having an outsized impact on improving onboarding processes, pricing transparency, privacy and GDPR, reliability of our APIs, and doubling down on accessibility, internationalization, and localization for all.

What’s next for Marketing Engineering?

For 2022, we’re starting to share the platform and components for multi-tenancy by empowering other teams like Dasher, Caviar, Chowbotics (and more) to move onto our infrastructure.

Our small but mighty team is also growing (there are three of us engineers in marketing now, with the support of two Salesforce analysts and a data science and testing strategist). The need for Marketing Engineering to help the company scale internationally and for businesses beyond restaurants is now stronger than ever. We haven’t slowed our momentum or taken our foot off the gas pedal since.

At DoorDash, we rely on experimentation to make decisions regarding model improvements and product changes because we cannot perfectly predict the results in advance. But experiments conducted by the Dasher team aimed at improving delivery quality and Dasher experience face challenges that arise from conflicting goals, including minimizing interference effects while simultaneously capturing learning effects and maximizing power. 

Challenges in marketplace experimentation

In short, there is no one-size-fits-all solution for designing experiments that meets every demand for results. Our experiments walk a fine line between three key goals:

  • Reduce interference (network) effects. The fact that DoorDash represents a three-sided marketplace introduces complex dependencies. In an order-level A/B test, treatment and control orders that occur simultaneously in the same geographic area will not be independent because they rely on the availability of a shared Dasher fleet. One way to solve this problem is to design a switchback experiment in which we randomly assign treatment vs. control to region/time window pairs instead of deliveries. More details about network effects and switchback can be found in this blog post
  • Capture learning effects. While switchback may solve the network effects problem, it cannot measure Dasher learning or other long-term effects such as Dasher retention because it cannot provide a consistent treatment experience for Dashers. Even if we can sometimes detect an effect with relatively high power in a short time period, we may need to run a longer experiment because it takes time – possibly several weeks – for Dashers to adjust their behaviors in response to a treatment. 
  • Maximize power. If we try to design experiments that capture learning effects and minimize network effects, we would end up with lower power tests that may struggle to detect changes in key metrics. For example, we could run a market-level A/B experiment to overcome network effects and capture learning effects, but it would have comparatively low power because the number of markets is small relative to, say, the number of Dashers, which limits our effective sample size. Alternatively, we could run two separate experiments – say, a switchback for network effects and a Dasher A/B for learning effects. While these two experiment types have medium and high power, respectively, running two separate experiments adds complexity and could extend the time required to obtain meaningful results.

The randomization trilemma

In most of our experiments, we are interested in multiple business metrics around Dasher experience and delivery quality, including such things as delivery duration and lateness, which involve both network effects and learning effects. Therefore, we may need to design each experiment differently based on the characteristics of different randomization units. The most common three randomization units that our team uses are switchback, Dasher A/B, and market A/B. The trilemma is outlined in Figure 1. 

Figure 1: The randomization trilemma
  • Switchback – Pros: minimal network effects, moderate power. Cons: cannot capture learning effects.
  • Dasher A/B – Pros: high power, can capture learning effects. Cons: severe network effects.
  • Market A/B – Pros: little to no network effects, can capture learning effects. Cons: low power.

Impact of randomization on power  

Different randomization methods produce different power because deliveries in the same switchback units, Dashers, or markets are not independent. Consequently, we use cluster robust standard errors, or CRSE, to prevent potential false positive results. More details can be found in this blog post. As a result, the experimental power is not determined by the number of deliveries but instead by the number of clusters, which are the number of switchback units, Dashers, or markets in the three randomization methods described above. Various switchback units, Dashers, or markets may have different numbers of deliveries, and this volume imbalance can further reduce the effective power. 

Note that we can increase the power of a switchback experiment by reducing the time window or region size, but these reductions come at the cost of potential increases in network effects. Switchback experiments are not completely free from network effects; they can still exist around the geographical and temporal borders between clusters. The smaller the switchback units, the more severe the network effects will be because a greater percentage of deliveries may be near time windows or geographical boundaries where network effects may still be present.

It is also important to note that because of the limited number of markets, a simple randomization for market A/B may lead to pre-experiment bias between treatment and control for some metrics. As a result, we usually need to use stratification and diff-in-diff methods to better account for potential pre-experiment bias and ensure that the parallel trend assumption holds; synthetic control will be necessary if this assumption is violated.

Creating the optimal experiment design

In order to maximize use of our experiment bandwidth, we dynamically design experiments based on the expected trade-offs between network effects, learning effects, and power for each intervention. We may choose to run a single test or multiple test types in parallel to triangulate the true impact while still maintaining power.

Trade-offs between power and network effects

Not all of our experiments affect every order in the system; some affect only a subset of total orders. For instance, an experiment that only impacts flower delivery may directly affect only 1% of total volume. Even if we randomize such a test at the order level, the network effects could be so small that we can ignore them because it’s unlikely these treatment and control deliveries will compete for the same group of Dashers. In this situation, we may opt not to run a switchback experiment, instead running a Dasher A/B or other A/B experiment to significantly increase power.

When we expect only strong network effects or learning effects

In this easy scenario, we can run a switchback experiment to mitigate network effects if we want to test a new delivery duration estimation model, or we can run a Dasher A/B experiment to address Dasher learning effects with high power if we want to test a Dasher app product change. We can follow the regular power analysis to determine the minimum experiment duration – usually rounded to weeks to avoid day-of-week effects – or calculate the minimum detectable effect, or MDE, given a minimum experiment duration required by the metric definition, such as Dasher retention. If there is no conflict with other experiments, we can run it in all markets to maximize the power.

When both strong network effects and learning effects are expected

This is the most common scenario that we face. In general, we could run a market A/B diff-in-diff experiment or two separate experiments – one switchback and one Dasher A/B – either by time or by markets. 

A single market A/B diff-in-diff

A market A/B experiment is appropriate when it’s reasonable to expect that the power will be sufficiently high – in other words, when the expected effects from the treatment are significantly larger than the calculated MDE for the test. Running this single experiment avoids conducting two separate experiments, which adds complexity and could delay learning and decision making. However, because of the small number of markets, simple randomization may not give us comparable separation between treatment and control; therefore, we tend to use stratified randomization to split the markets to maximize power. For example, if we have two metrics – X1 and X2 – we may split all markets into, say, nine strata (high/medium/low based on each metric), and then randomly assign an equal number of markets in each stratum to treatment and control. 
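As a rough illustration of that stratified assignment, here is a simplified sketch; the metric names, cutoffs, and the crude shuffle are placeholders, not our production experimentation code.

interface Market { id: string; x1: number; x2: number }

// Bucket a metric into low / medium / high using illustrative cutoffs.
function bucket(value: number, cutoffs: [number, number]): 'low' | 'medium' | 'high' {
  if (value < cutoffs[0]) return 'low'
  if (value < cutoffs[1]) return 'medium'
  return 'high'
}

// Stratify markets on two metrics (3 x 3 = 9 strata), then randomly assign
// half of each stratum to treatment and half to control.
function stratifiedAssign(
  markets: Market[],
  x1Cutoffs: [number, number],
  x2Cutoffs: [number, number],
): Map<string, 'treatment' | 'control'> {
  const strata = new Map<string, Market[]>()
  for (const m of markets) {
    const key = `${bucket(m.x1, x1Cutoffs)}/${bucket(m.x2, x2Cutoffs)}`
    strata.set(key, [...(strata.get(key) ?? []), m])
  }

  const assignment = new Map<string, 'treatment' | 'control'>()
  for (const group of strata.values()) {
    const shuffled = [...group].sort(() => Math.random() - 0.5) // crude shuffle, fine for a sketch
    shuffled.forEach((m, i) => {
      assignment.set(m.id, i % 2 === 0 ? 'treatment' : 'control')
    })
  }
  return assignment
}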

Two separate experiments

Market A/B, however, usually does not give us enough power. Instead, we have to conduct two different experiments: a Dasher A/B to measure Dasher retention and a switchback to measure delivery quality metrics. Because the two experiments cannot run simultaneously in the same markets, we have to allocate our limited experiment bandwidth either sequentially or geographically as illustrated in Figures 2 and 3.

1) Geographical separation 

One way to run two experiments in parallel would be to allocate traffic geographically between a switchback and a Dasher A/B experiment. In such a geographical split, we can efficiently divide the markets between the two experiments in a way – for instance, 70/30 – that requires roughly the same number of weeks based on power analysis (see Figure 2a). An inefficient market split would require one experiment to run for four weeks while the other one needs to run for eight weeks, as illustrated in Figure 2b. Such an inefficient design delays the final decision-making for this experiment. Regardless of the optimal split ratio, we always recommend splitting the markets through stratified randomization to reduce any potential noise introduced by the split.

Figure 2: An example of a geographical separation

2) Sequential separation 

We often prefer a sequential allocation over a geographical allocation to fast fail and iterate (see Figure 3b). This is particularly useful if we believe that aspects of the intervention require tuning before we commit to a longer-term learning test where impacts, by definition, will take multiple weeks to observe.

Figure 3: An example of a sequential separation

Consider an experiment where we want to measure both Dasher retention and delivery quality, but market A/B does not have enough power. We can run a switchback experiment first to measure quality effects and then run a Dasher A/B experiment to measure retention effects. If quality effects do not meet our expectations, we can quickly iterate and improve the treatment design to run another switchback experiment without having to wait for a long-term Dasher A/B experiment to complete. Once we are good with the quality metrics and finalize the treatment design, we can start the long-term Dasher A/B experiment to measure retention. 

Conclusion

In DoorDash’s data-driven culture, experimentation drives how and when to ship product changes and iterate on model improvements. As a result, we need to develop rapid, high-power testing procedures so that we can quickly learn and make decisions. Because of the complex interactions between consumers, Dashers, and merchants on the DoorDash marketplace and platform, however, most of our interventions involve trade-offs between network effects, learning effects, and power. No single experimental framework can serve for all of the various interventions that we want to test. We therefore came up with a set of experiment design solutions that can be used under a variety of scenarios. By applying these methods, using variance reduction techniques, and running multiple experiments simultaneously, we are able to significantly improve our experimentation speed, power, and accuracy, allowing us to ship more product changes and model improvements confidently. These approaches are practical, easy to implement, and applicable to a wide variety of marketplace use cases.

Resolving customer problems at scale is not an easy task, which is why the most feasible solution is implementing a chat feature; however, making these chat experiences unified and cohesive is a challenge in itself. 

DoorDash customers often want to be able to talk to a customer support representative immediately. One of the mechanisms that we feature on our site to facilitate this communication is a text-based chat feature. In a previous blog post, we outlined how we integrated chat into the DoorDash app. Building on this foundation, we have extended that to include both live and automated support.

The general problems with our existing chat solution 

Our legacy chat was fragmented across our different types of users (consumers, Dashers–our name for delivery drivers–and merchants), using multiple applications running on the different platforms (iOS, Android, Web). This division of customer applications also included support apps, which created a lot of duplication of effort and made it harder to adopt best practices in a centralized way. 

One of the biggest issues with our legacy chat solution was its inability to pass along user data to a support agent. A user would be asked to fill out a form, then would be put into a queue to connect with an agent. The agent would then end up asking for the same information, even though the user had already filled out the form. We needed a way to pass along this context to the agent. 

We also wanted to automate a lot of the support process so customers would not necessarily even need to talk to a support agent but could immediately get the information they needed through the app. 

Building a better chat solution

We wanted to build one chat support platform for all of our customers in a way that would allow heavy reuse across platforms and ensure a consistent user experience.

Creating a better chat application involved building: 

  • Common UI components for all chat implementations at DoorDash
  • An extensible backend system that allows for integrating with different third-party services
  • A system to allow for high scalability and resilience

Building common UI components

We wanted to create a single chat implementation that would work with support as well as the Consumer to Dasher chat that is offered for each order. Recently, we upgraded our Consumer to Dasher chat using Sendbird. Before that change, when a Dasher wanted to reach a consumer they would send a message that the consumer would receive as an SMS message. This messaging was migrated to utilize Sendbird and embedded directly in the Consumer and Dasher app. We built on this work to create a single chat implementation, allowing us to have a consistent experience across all chat at DoorDash. It also allowed both flows to benefit from improvements and functionality made for the other one. 

By consolidating all the components from past platform applications into a single library, we could create a UI component system and codify it into our applications. While it took effort to create a UI for each of the different platforms (Web, iOS, Android), once created, we could reuse the components across the multiple use cases where chat is displayed on that platform. Whenever additional customization was needed, we could rely on each platform having a single module that can be imported into the separate apps and customized as needed. This allowed us to build out the multiple use cases faster than if we had to create each use case independently.

Building an extensible backend system 

The backend system was built with multiple layers, allowing us to split responsibilities between internal services and third parties. Utilizing third parties for functionality such as chat, natural language processing (NLP), and agent ticket management allowed us to move more quickly without having to build functionality that does not directly differentiate us. We kept pieces that are core to our business logic in-house in our decision-making engine, such as determining the expected reason for reaching out or possible resolutions for the customer, leveraging existing work that had been done by other teams. By keeping control of the integration and customer journey, we have visibility into the different chats and their usage and can easily make updates.

To allow the platform to handle processing at common layers, we created a new API gateway for all the different callbacks from third parties. Since the processing does not need to be synchronous, we push the messages onto a messaging queue to ensure they are processed reliably. The downstream processing is then agnostic of the source and driven by the data in the different events.

Figure #1: Data flow through centralized chat service, integrating with 3rd parties
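A simplified sketch of that gateway pattern, with a hypothetical endpoint, topic, and payload shape, and with Express standing in for our actual service framework.

import express from 'express'

// Minimal stand-in for a message queue producer (e.g. Kafka in practice).
interface QueueProducer {
  publish(topic: string, message: object): Promise<void>
}

// One gateway endpoint for all third-party chat callbacks: acknowledge
// immediately, enqueue for reliable asynchronous processing, and let the
// downstream workers stay agnostic of the source.
export function chatWebhookGateway(queue: QueueProducer) {
  const app = express()
  app.use(express.json())

  app.post('/webhooks/chat/:provider', async (req, res) => {
    await queue.publish('chat-events', {
      provider: req.params.provider, // e.g. "sendbird"
      receivedAt: Date.now(),
      payload: req.body,
    })
    res.sendStatus(202) // accepted; workers consume the queue downstream
  })

  return app
}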

Solving problems through automation

As the improved support chat is now more deeply embedded into the core DoorDash ecosystem, we started to implement the automation mentioned earlier. We can look at the path the user took to open the chat as well as the current state to provide some immediate feedback to the user, which may both reduce their wait time for the desired information and reduce the cost of involving a human agent.

By integrating chat with our Decision Engine platform, we enabled operations to build in any functionality, including providing updates on the status of an order, issuing credits and refunds, and providing custom messaging. This will allow DoorDash to rely more upon automation and less on manual human operators. Of course, users can still escalate at any point in time and get the DoorDash support they expect. 

In addition to allowing a user to select from a predefined set of problems, we also let them type in their reason for contacting us. This text is then fed to an NLP engine, which can kick off appropriate workflows or escalate. It should also allow us to capture signals such as user sentiment and identify common issues, providing us with data on common follow-up scenarios that we should add.

Quick response pills were also provided, allowing the user to indicate common answers.

Figure #2: Quick response pill offered in a DoorDash support chat

Rolling out the migration

We wanted to ensure a smooth rollout of the new chat solution, so we started by rolling out minimal functionality to a small subset of users and then added more functionality and users in subsequent ramps.

The first scenario we started with was the web experience, for the following reasons:

  • There had been no post-delivery chat functionality on the web, so we did not have to worry about breaking existing behavior while adding new capabilities to the site
  • The volume on web is lower than mobile, so there was lower risk
  • We could iterate faster since we were not tied to mobile release cycles

Once it was live, we monitored multiple key performance metrics, including: 

  • Feedback from agents and customers
  • The error rate
  • A CSAT metric (customer satisfaction)
  • How long customer problems took to solve (AHT = average handle time)
  • The rate at which they would contact us again for the same issue (FCR = first contact resolution)
  • The rate at which they would reach out to an agent (MTO = manual tasks per order)

Once we were confident in the experience and our performance metrics were doing well, we quickly ramped automation and additional functionality, including uploading images, supporting deliveries while they are still ongoing, embedding into the Cx and Dx mobile apps, and allowing agents to transfer chats.

We utilized feature flags to ramp each independently so we could ramp up slowly or ramp down quickly in case any issues were discovered. Metrics were captured by our analytics system, and we were able to verify behavior. 

Figure #3: Traffic being ramped, with additional chat flows ramped both up and down after the initial rollout.
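To illustrate the idea, here is a tiny sketch of a percentage-based ramp check; the flag name and client interface are hypothetical, not our actual feature flag system.

// Returns a stable bucket in [0, 100) for a given flag and user, so the same
// user keeps the same experience as the ramp percentage moves up or down.
interface FlagClient {
  bucket(flagName: string, userId: string): number
}

function shouldUseNewChat(flags: FlagClient, userId: string, rampPercent: number): boolean {
  return flags.bucket('new-support-chat', userId) < rampPercent
}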

Conclusion

We are already seeing benefits now that the updated support chat has been ramped on the web and is live for Dashers on mobile, and automation has been added around some of the responses.

  • We have reduced escalation rates and improved CSAT. 
  • We are now able to automate away certain conditions, reducing the number of manual touch points needed per delivery. 
  • Customers are able to solve problems faster and without manual intervention, while still having the ability to escalate to a human. 
  • With common platform processing, we have reduced the time needed to ramp additional automation compared to what was possible previously.
  • With the common UI layers, we now provide a consistent branding / look and feel, making chat feel much more embedded into our applications on all the different platforms. 

Starting a career in software engineering can be overwhelming at first, but recognizing and overcoming imposter syndrome helps build confidence. When I started my career in tech, I felt underprepared and overwhelmed despite having studied computer science in college and doing multiple internships at prominent tech companies. 

From the beginning, there were new coding languages to learn, huge codebases to navigate, unfamiliar observability tools to use and the code review process to grasp. Different companies had different requirements and expectations, depending on their size and scale. The combination of learning new skills and moving into uncharted territory was a lot to take on all at once.

Besides the technical challenges, I was learning what it was like to collaborate within a larger team and with cross-functional partners. Overwhelmed by the demands of my new job, I was also dispirited by how easily my teammates seemed to navigate all the tools, languages, codebases and processes. Comparing myself with colleagues hurt my confidence and made me question whether I had what it takes to succeed in this industry.

What I was feeling was the textbook definition of imposter syndrome; the uneasiness caused by feelings of inadequacy had the potential to derail my career before it had even really started. 

What is imposter syndrome? 

Imposter syndrome is described as “feelings of self-doubt and personal incompetence that persist despite your education, experience and accomplishments.” Given the demanding transition from school into the tech industry, it is not hard to see why a majority of tech workers feel imposter syndrome at some point in their career.

The effects of imposter syndrome 

Imposter syndrome prevents people from performing their best by reducing their confidence in their own abilities. When consistently feeling self-doubt and incompetence, people are less likely to take on the challenges and risks that are necessary to grow their career and subsequently move up the career ladder. 

Imposter syndrome can be initiated and compounded by negative treatment within the workplace. Imposter syndrome is externally validated when experiences like the following occur: 

  • Being interrupted in meetings
  • Peers assuming someone is not the lead on a project they do in fact lead 
  • Being assigned tasks no one else wants to take on, such as taking meeting notes or planning team events 

While imposter syndrome affects many types of people, the above situations most commonly arise for women, nonbinary people and underrepresented talent, producing more barriers to their career growth. 

What I learned combatting imposter syndrome 

I worked with my mentors on ways to overcome my imposter syndrome. Over time, I became more confident in what I could accomplish. This personal growth led me not only to be successful in my career, but also to help others new to the industry in the same way my mentors helped me. 

Two tactics that I leveraged and recommend to others are:

  • Manage expectations
  • Partner with a mentor

Let’s take a closer look at each of these and discuss the kinds of resources that new engineers can take advantage of using examples from my experience mentoring junior engineers at DoorDash. 

Managing expectations

Many aspects of imposter syndrome arise when people have inaccurate or unrealistic expectations about how they are supposed to perform in their job. By managing expectations, people can better frame their performance in a positive way and feel confident that they are doing their best. These expectations tend to fall in the following buckets: 

  • The journey vs. the progress 
  • Pacing: a sprint vs. a marathon 
  • Setting appropriate goals 
  • Everyone starts somewhere 

The journey vs. the progress 

For many engineers I work with, a common issue is focusing more on how far they have yet to go instead of how far they have already come. While there are many benefits associated with looking forward, it’s important to appreciate the progress that has already been made to avoid triggering imposter syndrome. Many people I work with discount the headway they have been able to make despite all the obstacles in their path; instead, their eyes are always looking ahead to what is next. Many successfully graduated from top higher education institutions and secured highly coveted jobs in the industry. 

Despite these incredible accomplishments, many junior engineers I work with only think about what they have left to accomplish and how long it will take to achieve their goals. Not only does the next goal seem daunting, but the slow pace to reach those goals can also feel like a barrier. It can be beneficial to focus on how far they have already come to remind them of what they have accomplished and the numerous challenges they have already overcome. With the right perspective, it is easier to appreciate prior achievements and use this to fuel the journey ahead.

Pacing: a sprint vs. a marathon 

Another important frame of reference I tell my mentees to use is to think of their whole career as a marathon rather than a sprint. In a sprint, runners expend all their energy in a short amount of time, pushing the limits of how fast they can possibly go. In a marathon, on the other hand, runners pace themselves to reach the distant finish line as fast as they reasonably can. Careers are long, and the end is often not in sight. By setting a steady pace, like a marathon runner, new engineers will be able to travel far while also picking up interesting information and new skill sets along the way that they might otherwise have zoomed past.

I have seen many new engineers try to ramp up quickly by glossing over important learning steps, such as fully understanding how certain code flows work and the reasons behind the way certain code is structured. This ends up hurting their overall progression because they write code that does not fit the current structure of the codebase or unknowingly rewrite existing functionality. If they had taken more time at the beginning to better understand the foundations of the codebase, they would have been able to make progress faster.

Feeling like their current pace of performance is not sufficient, many people sprint in their jobs and burn out when they do not save enough energy for the next task or challenge. This sprint mentality can come from internal pressure on oneself or from pressure within the company to deliver projects – or both. While these quick wins can provide instant gratification, repeatedly exerting too much effort to deliver projects faster can increase the feeling that one is not cut out for the job, amplifying the impression that too much effort is required to be successful. 

In reality, it can take many months, if not years, to build up the proper skill set to reach one's full potential. While pulling an all-nighter or working extremely hard to solve a challenge may work in school, it's not sustainable over a multi-decade career. Therefore, planning for the long term rather than focusing on the short term allows junior engineers to sustain an extended career in tech and develop a more realistic perspective about what they can reasonably get done in a given amount of time.

Setting appropriate goals 

In my mentoring experience, I see many new engineers enter the industry sprinting because they are chasing a goal they may not have fully thought through, which can have detrimental effects if it is not attained. Setting an audacious goal like getting promoted as fast as possible or becoming a manager in record time may actually negatively impact someone’s drive or ambition if the goal is not realistic. Lofty aims can create high expectations that people cannot satisfy. Instead of chasing an unrealistic goal and feeling imposter syndrome when such high expectations do not come to fruition, I tell new engineers to focus on developing their core skills and finding the primary interests that will guide them for the rest of their tech career. After they’ve identified these, they will be able to understand what they are trying to achieve and avoid constantly feeling inadequate.

Everyone starts somewhere 

When my mentees feel imposter syndrome because peers seem so much more suited to their jobs, I remind them that everyone started somewhere. Everyone was new to the job at some point and dealt with the steep learning curve involved in entering the workforce. It’s important to recognize the value of experience. Over time, experience helps people develop pattern recognition as they solve similar issues, allowing them to come to a resolution faster the next time such an issue comes up. This experience may not always be easy to recognize, which is why people can feel overwhelmed if they compare their performance to others. It’s important to understand that their peers started at the beginning just as they are doing now; they overcame challenges and developed strong skill sets over time. Instead of comparing one’s performance to more-experienced colleagues, it is better to acknowledge each personal success as progress toward the experience required to conquer any roadblock. 

Partner with a mentor 

I believe overcoming imposter syndrome is easier with the help of a mentor who has had similar experiences and seen comparable situations. Because imposter syndrome can be improved with a shift in perspective, a trusted advisor can help a mentee see the larger career picture. My own mentors articulated for me what a new engineer typically experiences, pointing out universal situations versus those that were unique to me; this process allowed me to understand whether what I was experiencing was normal. My mentors also helped me to differentiate areas of growth from actual competence. This showed me where I could work to improve so that I could meet expectations and course correct to get back on my career trajectory.

Mentors can illustrate a successful career path, helping to define growth well beyond the initial phases of a career. For women and underrepresented groups in particular, mentors can serve as role models when there may not be many examples of success to look up to. I personally had mentors who showed me my potential future opportunities in tech simply through the positions in which they excelled.

Taking advantage of company mentorship opportunities 

DoorDash offers mentorship and career development programs that help overcome imposter syndrome by creating a supportive work environment. DoorDash has a program called NeoBuddy/FemBuddy that pairs new engineers with seasoned colleagues from a different team. This lets new engineers reach out beyond the team to learn about how others handle problematic situations and grow their network at the company. There is also an Engineering Technical Mentorship program to pair engineers pursuing a new skill, such as machine learning or Android, with a volunteer mentor who has experience in this area within the company. This is a great opportunity for new engineers to work one-on-one with expert engineers in a specific area to improve their skills in a safe learning environment.

DoorDash also offers programs to support underrepresented groups in tech. I participated in WOLF – Women’s Leadership Forum. WOLF is a cohort-style program focused on encouraging emerging women leaders through coaching, mentorship and improved access to senior leadership. I enjoyed meeting with other women in technical roles across the company and learning how to develop my career with sessions covering topics including networking, “Influencing Without Authority” and “Being Strategic.” Programs like WOLF allow participants to find a community of like-minded people who can be a support group. 

I have grown my DoorDash support network through the NeoBuddy/FemBuddy, Eng Tech Mentorship and WOLF programs. Any time I need to shift my perspective or manage my expectations, I can reach out to this network for support. These strong support groups provide people I can turn to as I seek to accomplish more goals in my career. 

Conclusion

Imposter syndrome is more a symptom of a lack of confidence than anything else. For that reason, I remind new engineers to believe in themselves, because without confidence they will not take the risks that let them make mistakes and grow from those experiences. If instead they believe they can learn from every situation, attain the skills needed to be successful and bounce back from any setback, there is little downside to taking on challenges and risks; the worst-case scenario becomes a learning experience rather than a personal catastrophe. I ask new engineers not to let imposter syndrome get in the way of taking risks, freeing them to see their true potential. 

It is not easy to overcome imposter syndrome, but acknowledging it frees anyone to create a plan to overcome it step-by-step. By managing expectations and partnering with a mentor, anyone experiencing imposter syndrome can overcome it and achieve their full potential.