
Understanding the contents of a large digital catalog is a significant challenge for online businesses, but this challenge can be addressed with self-supervised neural network models. Product discovery in particular becomes difficult once a digital catalog grows too large to label or analyze manually.

For DoorDash, having a deep understanding of our catalog can help with product recommendation, search, promotional campaigns, and operational intelligence. While we have worked in the past on building a human-in-the-loop system to tag our items, we need a generalizable, semantically meaningful way of associating items to power machine learning use cases.

In this article we describe an approach to train high-quality generalizable embeddings by using techniques in self-supervised learning on our internal search data. We also discuss trade-offs with alternative methodologies and go over the details of the model training and evaluation process for our selected solution.

The problem with a large, growing online catalog

The DoorDash catalog is extremely large and constantly getting larger as we add new partners and verticals. As ML powers more core aspects of DoorDash’s platform, we need to provide a way for teams to process the catalog without building bespoke models. Note that unlike our previous discussion around tagging, which was focused on human-interpretable labels for the catalog, here our goal is to develop a representation of the items in the catalog that can be used by ML systems to fulfill many use cases.

Understanding the contents of the catalog is important in order to operate the business and power many consumer-facing and internal applications such as:

  • Recommendations of new stores based on consumers’ known preferences
  • Recommending items to a consumer when they interact with a new store
  • Retrieving relevant stores and items for a search query
  • Automatically suggesting promotions for stores similar to a consumer’s recent order history
  • Understanding what kinds of items consumers purchase after a search 

The above use cases span multiple separate teams at DoorDash, but we need to find a common way to represent items in the catalog that is usable by all teams.

How to represent items in the catalog

One way to formalize the problem is to think about how we can represent items in the catalog in a manner that preserves good metric properties, meaning that similar items should have similar representations. A natural representation in this case would be to use embeddings that preserve intuitive relationships between items. For example, we would expect “tacos” and “burritos” to be more similar to each other than to “pad thai” because the former are both Mexican foods and pad thai is Asian food.

Figure 1: An example of queries (green) and item (yellow) representations in the same latent space. We want to learn an embedding representation where lines of the same color have high cosine similarity (have a small angle between them) and lines of different colors have a small cosine similarity (large angle between them). Note this means we need to be able to encode queries and items into the same space and learn high quality representations for both of them.

In a search retrieval context, we also want to be able to create a query embedding which can be compared to item and store embeddings in order to retrieve the most relevant results. Our model needs to embed both queries and items into the same latent space (Figure 1) in order to make them comparable. For example, once we have embeddings for the query “mexican” and item “taco” we would be able to measure the cosine similarity between the query embedding and item embedding to know that “taco” is a relevant result.

We can also easily build embeddings that capture store cuisine types and consumer preferences by treating stores and consumers as bags of item embeddings. This method keeps the store and consumer embeddings in the same latent space, and thus comparable to items, allowing us to inject catalog knowledge into store recommendation and personalization models.

Figure 2: By defining consumer embeddings (blue) as the average of their item embeddings (green) we can learn consumers’ different preferences. In the above diagram a consumer who regularly purchases Mexican food will have an embedding closer to Mexican dishes than a consumer who frequently purchases Asian food. A consumer who purchases both would have an embedding between the Mexican food and Asian food clusters.
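As a rough illustration of the “bag of item embeddings” idea (the functions and toy vectors below are our own sketch, not part of any DoorDash library), a store or consumer vector is simply the mean of its item vectors, and relevance can then be scored with cosine similarity:

import numpy as np

def average_embedding(item_embeddings: np.ndarray) -> np.ndarray:
    # Treat a store or consumer as the mean of its item embeddings.
    return item_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-dimensional vectors purely for illustration.
taco = np.array([0.9, 0.1, 0.0])
burrito = np.array([0.8, 0.2, 0.0])
pad_thai = np.array([0.1, 0.9, 0.1])

mexican_lover = average_embedding(np.stack([taco, burrito]))
print(cosine_similarity(mexican_lover, taco))      # high: similar cuisine
print(cosine_similarity(mexican_lover, pad_thai))  # lower: different cuisine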

The main challenge we have to solve is how to use our limited labeled data to effectively train embeddings on possibly very rare classes. Our solution was to train embeddings by applying self-supervised methods to DoorDash’s large volume of search data. Before describing it, we’ll review some of the more traditional techniques for training embeddings to understand why they don’t work for our problem.

A review of standard techniques to build embeddings

There are several standard approaches to training embeddings that do not work well for our use case. Traditional approaches include Word2vec training on item IDs or training deep learning classifiers and taking the output of the last linear layer. More recently, it has also become common in natural language processing (NLP) to finetune a large pre-trained model like BERT. However, for DoorDash’s problem of large, sparse catalogs that are continuously evolving, these methods have a few disadvantages:

Alternative 1: Word2vec embeddings on entity IDs

Word2vec embeddings can be trained on any set of entity IDs using customer behavior such as views or purchases. These embeddings learn the relationships between IDs by assuming that entities a customer interacts with in the same session are related to each other, similarly to the Word2vec distributional hypothesis. In fact, at DoorDash we already train these kinds of embeddings regularly for stores and consumers to use in recommendations and other personalization applications. See Figure 3 for an example architecture for this on item IDs.

Figure 3: The architecture of a CBOW style Word2vec model trained on Item IDs. Given a set of context item IDs we will attempt to predict a candidate item ID that belongs to the context.
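For reference, a minimal version of this kind of ID-based training can be assembled with gensim’s Word2Vec by treating each session’s interacted item IDs as a “sentence”; the session data below is invented purely for illustration:

from gensim.models import Word2Vec

# Each "sentence" is the sequence of item IDs one customer interacted with in a session.
sessions = [
    ["item_123", "item_456", "item_789"],
    ["item_456", "item_789", "item_999"],
    ["item_123", "item_999"],
]

# sg=0 selects the CBOW formulation: predict an item ID from its context items.
model = Word2Vec(sentences=sessions, vector_size=64, window=5, min_count=1, sg=0)
print(model.wv["item_456"].shape)         # (64,)
print(model.wv.most_similar("item_456"))  # IDs that tend to co-occur in sessions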

However, Word2vec embeddings suffer from some drawbacks for the purpose of preserving semantic similarity for a large catalog. First, they require regular retraining as new entities get added to the catalog. Because millions of items are added daily, retraining these embeddings daily is computationally expensive. Furthermore, embeddings trained using this method are prone to suffering from sparsity issues, because IDs that customers interact with infrequently do not get trained well.

Alternative 2: Embeddings from deep neural networks trained on a supervised task

It has been observed empirically that deep neural networks that achieve low training error on classification tasks can learn high-quality representations of the target classes. The output of the last hidden layer of the network can then be treated as an embedding of the original input. With a large, diverse, high-quality labeled dataset, this approach can be very effective at learning embeddings to reuse for classification tasks.

Figure 4: This is a standard architecture for a text classification model. We can use the output of the last linear layer as our embedding.

However, this method of training does not always guarantee good metric properties for the underlying embeddings. Because our priority is ease of use for downstream applications, we’d like these embeddings to be easily comparable using simple metrics like cosine similarity. Because this method is supervised, the quality of the learned metric depends heavily on the quality of the annotated training set. We need to ensure that the dataset has hard negative samples so that the model can learn to discriminate between closely related labels. This problem is especially exacerbated for rare classes that will have limited data samples. Our solution, described below, circumvents this issue by automatically generating samples from unlabeled data and learning a representation for the label.

Alternative 3: Fine tuning a pre-trained language model such as BERT

With recent advances in training large models in NLP on large corpora, it has become popular to fine tune these models to learn embeddings for a specialized task via transfer learning (see Figure 5 for a sample architecture). A popular pre-trained model is BERT and this approach can be straightforwardly implemented using popular open source libraries. This approach can often overcome the problem of data sparsity and for general NLP problems provides a very strong baseline.

Figure 5: An example of using BERT to fine tune our classifier. Note the similarity to the architecture in Figure 3, but by leveraging the large corpus BERT has been trained on, the quality of the output embeddings is significantly better. 

While BERT embeddings are a significant improvement on the baseline, they suffer from slow training and inference times due to model size. Even a distilled model such as DistilBERT or ELECTRA can be much slower than custom models that are much smaller. We’ve also observed that with enough domain-specific data, even if it is unlabeled, self-supervised methods have substantially better metric properties for our task compared to pre-trained language models.

Our solution: using self-supervised learning to train embeddings 

After eliminating the above approaches, we went with self-supervised methods to train embeddings based on the item name and search query. By using subword information, such as character-level information, these embeddings can also generalize to text unseen in the training data.

In order to ensure good metric properties, we use a Siamese Neural Network (also called a Twin network) architecture with triplet loss. The triplet loss attempts to pull similar examples together and push dissimilar examples apart in the latent space. We use Twin networks to ensure that the encoders used for query and item text both embed into the same latent space in a way that preserves distances between similar examples.

Constructing a dataset

In order to train with a triplet loss we need a dataset with the structure <anchor, positive example, negative example>. For our problem we define the anchor as the raw query text and treat items that are “relevant” and “irrelevant” to the query as positive and negative samples, respectively.

To construct this dataset (see Figure 6 for a sample), we need to develop a set of heuristics to formulate the training task. The following heuristics were used to determine relevant and irrelevant items which correspond to a positive and negative training sample respectively:

  • An item X is relevant for a query Q if a user searched for query Q, immediately purchased X afterwards in the same session, and X is the most expensive item in the basket

This heuristic for positive samples ensures that we only take the main item in a cart, which we assume is likely the most relevant.

  • An item X is irrelevant for query Q if X was purchased for a query R where the Levenshtein distance of Q and R is > 5

This heuristic for negative samples guarantees that items purchased for similar queries (e.g., “burger” and “burgers”) are not treated as irrelevant. Note that generating hard negative samples can be crucial for preventing mode collapse. In our case we noticed that even this simple heuristic and the natural variation in the text were sufficient for training. In the future we hope to investigate more sophisticated mining techniques.

Figure 6: This is a sample of the training dataset we use for training with triplet loss. We have items related to the anchor in the “positive” column and irrelevant items in the “negative” column. Note that our samples are noisy (e.g., “thai fresh rolls” are not “sushi”), but our training process is robust to this, because it is only trying to learn that the positive sample is more similar to the anchor than the negative sample.
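The sketch below shows one way the triplet construction could look in code. It is our own simplification of the heuristics above: it assumes the (query, purchased item) pairs have already been filtered down to the main item in each cart, and the Levenshtein implementation is ours.

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def build_triplets(purchases):
    # purchases: list of (query, purchased_item) pairs from search sessions.
    triplets = []
    for query, positive_item in purchases:
        for other_query, negative_item in purchases:
            # Only treat items purchased for sufficiently different queries as negatives.
            if levenshtein(query, other_query) > 5:
                triplets.append((query, positive_item, negative_item))
    return triplets

print(build_triplets([("burger", "cheeseburger"), ("pad thai", "chicken pad thai")]))
# [('burger', 'cheeseburger', 'chicken pad thai'),
#  ('pad thai', 'chicken pad thai', 'cheeseburger')]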

Furthermore, we did minimal normalization on the inputs, only lowercasing all strings and removing punctuation. This allows the trained model to learn to be robust to spelling errors and other natural variations in language.

Raw input          Processed input
Chicken Burrito    ['chi', 'cke', 'n b', 'urr', 'ito']
Burger + salad     ['bur', 'ger', ' sa', 'lad']
Figure 7: Sample inputs and their processed trigram outputs. Note that we retain space characters to be able to identify word boundaries.

In order to ensure our model can generalize to samples with out-of-vocabulary tokens, we used character trigram sequences to process the inputs (Figure 7). We experimented with multiple alternative tokenization schemes (word n-grams, byte-pair encoding, WordPiece, and word + character n-grams) but found trigrams had similar or superior predictive performance and could be trained more quickly. We also found that by using a bidirectional LSTM to process our inputs in the encoder layer, we removed most of the need for sophisticated tokenization.
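A minimal tokenizer consistent with the examples in Figure 7 (our own sketch, not the production code) lowercases the text, strips punctuation, collapses whitespace, and splits the result into consecutive character trigrams:

import re

def to_trigrams(text: str) -> list[str]:
    # Lowercase and drop punctuation, collapsing any extra whitespace that results.
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Split into consecutive character trigrams, keeping spaces so the model
    # can still see word boundaries.
    return [cleaned[i:i + 3] for i in range(0, len(cleaned), 3)]

print(to_trigrams("Chicken Burrito"))  # ['chi', 'cke', 'n b', 'urr', 'ito']
print(to_trigrams("Burger + salad"))   # ['bur', 'ger', ' sa', 'lad']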

Laying out the model’s architecture

The model is a Siamese network (Figure 8) that uses encoders composed of deep neural networks and a final linear layer that outputs the embeddings. All weights are shared between the encoders, which ensures that the encodings for all heads land in the same latent space. The outputs of the encoders are then used to calculate a triplet loss.

Figure 8. The above diagram shows the general architecture of the Siamese network. We encode the anchor (e.g., the query), a positive example, and a negative example, and minimize the triplet loss over these embeddings. The encoders share weights, and the goal of the learning task is to learn the weights for the encoder. We take the outputs of the last layer of the encoder (typically a linear layer) as the embeddings for an input.

A triplet loss (with margin) is defined as:

L(a, p, n, margin) = max(d(a, p) - d(a, n) + margin, 0)

where a is the anchor, p is the positive sample, n is the negative sample, and d is some distance function (typically taken to be Euclidean distance).
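For instance, with hypothetical distances d(a, p) = 0.4 and d(a, n) = 1.0 and a margin of 1.0, the loss is max(0.4 - 1.0 + 1.0, 0) = 0.4; once the negative has been pushed far enough away that d(a, n) ≥ d(a, p) + margin, the loss drops to zero and that triplet stops contributing gradient.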

Figure 9: For the “Mexican” query (red) the triplet loss tries to pull the embeddings for the positive items (yellow) closer and push the negative items (grey) further apart. After training, the embeddings for similar items should be clustered together.

Intuitively, minimizing this loss brings positive samples closer to the anchor and pushes negative samples further away from the anchor (Figure 9).

import torch


class SiameseNetwork(torch.nn.Module):
    def __init__(self, learning_rate, transforms, model, **kwargs):
        super().__init__()

        self.learning_rate = learning_rate
        self.transforms = transforms
        # A single shared encoder is used for the anchor, positive, and negative inputs.
        self._encoder = model(**kwargs)
        self.loss = torch.nn.TripletMarginLoss(margin=1.0, p=2)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)

    def _loss(self, anchor, pos, neg):
        return self.loss(anchor, pos, neg)

    def forward(self, anchor, seq1, seq2):
        # Encode all three inputs with the same (weight-shared) encoder so the
        # resulting embeddings live in the same latent space.
        anchor = self._encoder(anchor)
        emb1 = self._encoder(seq1)
        emb2 = self._encoder(seq2)
        return anchor, emb1, emb2
Figure 10. A code sample of the Siamese neural network architecture. We abstract away the encoder details here to demonstrate how the forward pass and loss are calculated.
Figure 11. The actual encoder architecture is a bidirectional LSTM followed by a feed-forward network. The LSTM is responsible for processing a sequence of character trigrams into a vector and we use the projection head to further improve the quality of the learned embedding.

The encoder (Figure 11) is a bidirectional LSTM followed by a feed-forward network as a projection head. We find that using a feed-forward network with ReLU units adds modeling power. We take the output of the final layer of the projection head (represented here separately as a linear layer) as our final embedding, which is used to compute the loss.

import einops
import torch


class LSTMEncoder(torch.nn.Module):
    def __init__(self, output_dim, n_layers=1, vocab_size=None, embedding_dim=None,
                 embeddings=None, bidirectional=False, freeze=True, dropout=0.1):
        super().__init__()
        if embeddings is None:
            # Learn the trigram embedding table from scratch.
            self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        else:
            # Start from pre-trained trigram embeddings, optionally frozen.
            _, embedding_dim = embeddings.shape
            self.embedding = torch.nn.Embedding.from_pretrained(embeddings=embeddings, padding_idx=0, freeze=freeze)

        self.lstm = torch.nn.LSTM(embedding_dim, output_dim, num_layers=n_layers,
                                  bidirectional=bidirectional, dropout=dropout, batch_first=True)
        self.directions = 2 if bidirectional else 1

        # Projection head: a small feed-forward network on top of the LSTM output.
        self._projection = torch.nn.Sequential(
            torch.nn.Dropout(dropout),
            torch.nn.Linear(output_dim * self.directions, output_dim),
            torch.nn.BatchNorm1d(output_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(output_dim, output_dim),
            torch.nn.BatchNorm1d(output_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(output_dim, output_dim, bias=False),
        )

    def forward(self, x):
        embedded = self.embedding(x)  # [batch size, sent len, emb dim]
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final hidden states of the forward and backward directions.
        hidden = einops.rearrange(hidden, '(layer dir) b c -> layer b (dir c)', dir=self.directions)
        return self._projection(hidden[-1])
  
Figure 12. An example of the encoder architecture. We use a simple LSTM followed by a feedforward network here.
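Putting the two pieces together, a single training step might look like the following sketch; the vocabulary size, dimensions, batch of random trigram indices, and optimizer loop are illustrative rather than the production configuration:

# Assumes torch, SiameseNetwork (Figure 10), and LSTMEncoder (Figure 12) are in scope.
model = SiameseNetwork(
    learning_rate=1e-3,
    transforms=None,
    model=LSTMEncoder,
    output_dim=128,       # embedding dimension
    vocab_size=20_000,    # number of distinct character trigrams
    embedding_dim=64,
    bidirectional=True,
)
optimizer = model.configure_optimizers()

# anchor/positive/negative: padded batches of trigram indices, shape [batch, seq_len].
anchor = torch.randint(0, 20_000, (32, 12))
positive = torch.randint(0, 20_000, (32, 12))
negative = torch.randint(0, 20_000, (32, 12))

optimizer.zero_grad()
anchor_emb, pos_emb, neg_emb = model(anchor, positive, negative)
loss = model._loss(anchor_emb, pos_emb, neg_emb)
loss.backward()
optimizer.step()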

We have also explored alternative approaches to self-supervised learning, such as contrastive learning, but found that its sensitivity to batch size led to unstable training. We’ll continue to explore more alternatives in this space, as this is a fast-advancing area in ML research with significant successes in computer vision. Other methods amenable to large datasets with limited labels, such as GraphSAGE, are alternatives we are currently exploring to train embeddings that better incorporate customer behavior.

Model performance evaluation

We evaluate the model with both qualitative checks, such as inspecting a UMAP projection of the embeddings, and quantitative metrics, such as F1-score against a baseline.

We evaluated qualitative results by looking at UMAP projections for the embeddings (Figure 13). In particular we can see that similar classes are projected near each other, meaning that the embeddings capture semantic similarity well.

Figure 13: The UMAP projection of the embeddings on a labeled dataset. Notice the clustering of similar classes, which implies good embedding quality.
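The projection itself can be produced with the umap-learn package. A minimal sketch, using randomly generated stand-ins for the real item embeddings and class labels:

import numpy as np
import umap
import matplotlib.pyplot as plt

# Stand-ins for the real data: item_embeddings would come from the trained encoder
# and labels would be an integer class id per item.
rng = np.random.default_rng(42)
item_embeddings = rng.normal(size=(500, 128))
labels = rng.integers(0, 10, size=500)

projection = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(item_embeddings)

plt.scatter(projection[:, 0], projection[:, 1], c=labels, s=3, cmap="tab20")
plt.title("UMAP projection of item embeddings")
plt.show()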

Given the promising results of the qualitative evaluation, we also did a more rigorous benchmarking of the model on some baseline classification tasks to understand the quality of the embeddings as well as potential gains from using them in other internal models.

Model type                              Performance
FastText Baseline                       baseline
LSTM Classifier (cross-entropy loss)    +15%
Siamese Neural Network                  +23%

In terms of quantitative metrics, our model improved over the F1-score baseline (a FastText classifier trained on the class labels) by ~23%. This is a substantial gain, especially since the Siamese neural network is evaluated on a zero-shot classification task while the baseline is trained on labeled data.
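The zero-shot evaluation can be as simple as embedding each class label with the same encoder and assigning every item to its nearest label by cosine similarity. The sketch below illustrates that idea; encode stands in for the trained Twin-network encoder and is not an actual function from our codebase.

import numpy as np

def zero_shot_classify(item_texts, class_names, encode):
    # encode: callable mapping a list of strings to an [n, dim] array of embeddings,
    # e.g. the trained Twin-network encoder.
    item_vecs = encode(item_texts)
    label_vecs = encode(class_names)
    # Normalize so the dot product equals cosine similarity.
    item_vecs = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    label_vecs = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    similarities = item_vecs @ label_vecs.T  # [n_items, n_classes]
    return [class_names[i] for i in similarities.argmax(axis=1)]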

Furthermore, we noticed that using these embeddings as features for downstream classification tasks leads to significant improvements in sample efficiency. While training tagging models, we observed that a FastText classifier required more than three times as much labeled data to train comparably accurate models. This suggests that the learned representations carry substantial information about the content of the text.

Given the substantial improvement in both F1 performance and sample efficiency when using these embeddings in classification tasks, we’ve begun to deploy the embeddings as features available for consumption by other models at DoorDash.

A walkthrough of a sample application enabled by catalog embeddings 

Here we’ll describe one simple application of these embeddings to give an example of the new product use cases we can enable via catalog embeddings.

In order to improve content recommendations to consumers, we would like to programmatically generate carousels based on the user’s most recent orders. For example, if a consumer has recently ordered from “Papa John’s Pizza” other fast food pizza chains might be a good recommendation. To populate this carousel we want to retrieve stores which are similar to the store the consumer most recently purchased from. 

Without embeddings we would need to build a dedicated model that takes into account <consumer_id, last_store_id> and attempts to predict the probability of conversion on every candidate store_id. With embeddings we can instead use a two stage process:

  1.  Use a filtering step to retrieve the stores most similar to last_store_id 
  2.  Do a personalized ranking of filtered candidates for each consumer, using a pre-existing ranker.

Because the filtering step is a fast cosine similarity computation and we do not need to collect any data for a dedicated ranker, this process is relatively fast and simple to implement. See Figure 14 for more details on this process. Also note that computing store embeddings is straightforward: we average the item embeddings on each store’s menu, which can be done in a batch process to reduce real-time system load.

Figure 14: (A) The architecture of a traditional model to predict and rank a set of stores related to the consumer’s previous purchase. (B) In contrast, we can compute store embeddings from item embeddings (step 0) to change this ranking problem to a two stage process of retrieval (step 1) and ranking (step 2) where we first filter relevant stores and then rank them using an existing conversion-optimized ranker.
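A simplified version of the retrieval step (step 1 in Figure 14) might look like the sketch below; the store embedding table, the toy vectors, and the top-k value are illustrative only:

import numpy as np

def similar_stores(last_store_id, store_embeddings, k=50):
    # store_embeddings: dict of store_id -> precomputed mean of its menu item embeddings.
    query_vec = store_embeddings[last_store_id]
    scores = {}
    for store_id, vec in store_embeddings.items():
        if store_id == last_store_id:
            continue
        scores[store_id] = float(
            query_vec @ vec / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        )
    # Keep the top-k most similar stores; these candidates then go to the
    # existing personalized ranker (step 2 in Figure 14).
    return sorted(scores, key=scores.get, reverse=True)[:k]

store_embeddings = {
    "taqueria": np.array([0.9, 0.1]),
    "burrito_bar": np.array([0.8, 0.2]),
    "thai_kitchen": np.array([0.1, 0.9]),
}
print(similar_stores("taqueria", store_embeddings, k=1))  # ['burrito_bar']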

The effort needed to train a dedicated ranker is substantially higher than using this kind of pre-computed embedding. We can iterate much faster on product ideas like this and prove their impact on the user experience prior to investing in dedicated rankers. Furthermore, these embeddings can be used directly as model inputs to improve recommendations. 

Conclusion

Above we have discussed the problem of training item embeddings that preserve semantically meaningful relationships. With these embeddings we have immediately unlocked opportunities that would otherwise be time-consuming and expensive to support.

These types of embeddings, and self-supervised methods in general, are especially helpful for developing immediately reusable ML products at companies with fast-growing catalogs. While other ML approaches might be more suitable for specialized tasks or for text that is less automatically generated, we’ve found that self-supervised embeddings can still provide strong baseline performance for tasks requiring high-quality representations of text data. We also observe that domain-specific embeddings generally work better for internal applications such as search and recommendations compared to off-the-shelf embeddings like FastText or BERT.

We have already begun to test and deploy these embeddings across multiple surfaces in recommendations and programmatic merchandising. For these use cases, we have seen immediate substantial improvements in the performance of models using these embeddings and we’re looking to deploy them in more applications.

Further Reading

[1] Siamese Neural Networks for One-shot Image Recognition. https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf 

[2] A Simple Framework for Contrastive Learning of Visual Representations. https://www.cs.toronto.edu/~hinton/absps/simclr.pdf 

[3] Deep Metric Learning with Triplet Loss. https://arxiv.org/pdf/1412.6622.pdf

[4] FaceNet: A Unified Embedding for Face Recognition and Clustering. https://arxiv.org/pdf/1503.03832.pdf

In the tech industry, known for high stress and long hours, burnout is a significant problem that often goes unnoticed, making it difficult to address. With the advent of the COVID-19 pandemic, many developers, especially those with children, have seen increased levels of burnout, which is disastrous for their health and productivity. Overall, burnout is difficult to detect because it’s hard to distinguish from merely being tired, and people with a strong work ethic often ignore it while putting their team’s needs first. Fortunately, burnout can be alleviated by better communication, rest, establishing better work-life boundaries, and utilizing available company resources. 

This past year I had to deal with my own bout with burnout. If I had known more about it and taken advantage of available resources sooner, it might not have become such a severe situation. This article will discuss how to detect burnout and what to do about it. These ideas are based on my personal experience diagnosing burnout and overcoming it.

What is burnout and why is it hard to detect 

Burnout is defined as an extreme state of exhaustion. The World Health Organization (WHO) offers this description:

“Burnout is a syndrome conceptualized as resulting from chronic workplace stress that has not been successfully managed. It is characterized by three dimensions:

  • feelings of energy depletion or exhaustion;
  • increased mental distance from one’s job, or feelings of negativism or cynicism related to one’s job; and
  • reduced professional efficacy.

Burnout refers specifically to phenomena in the occupational context and should not be applied to describe experiences in other areas of life.”

Burnout is extremely hard to detect because it feels like a lot of familiar things, namely fatigue and stress. Additionally, highly motivated people who get burnt out often fail to recognize it because they have a strong work ethic, making it hard to acknowledge the problem. 

Why the pandemic has increased burnout  

Before we jump into the topic, it’s important to mention that, while burnout has always been an issue, the COVID-19 pandemic and its necessary restrictions accelerated it for many. Most tech companies switched from going to the office to working remotely, and for parents, this shift was especially difficult, because it meant balancing new hours and childcare responsibilities at the same time. 

What are the main causes of burnout

Burnout often arises from three main factors: not realizing you are experiencing it, putting the team’s needs before your own, and a lack of communication with the team. Let’s go through each issue and discuss any underlying psychological components.

Denial is often the biggest challenge 

For many people who are suffering from burnout the biggest challenge is actually realizing it. Burnout often just feels like normal tiredness, not an unfamiliar feeling at all. Many people may have trouble drawing the line when trying to decide if they are doing too much and need to take a break. This kind of burnout denial is fairly problematic since it allows what starts as a minor problem to get much worse. Psychology tells us that when presented with a challenge like burnout, many people will be inclined to deny they even have a problem rather than start working to overcome it. Because the ability to work at peak productivity can be central to a successful career, many people are unwilling to recognize how it might be affecting their health.


When a strong work ethic gets in the way 

The second issue that often arises is that stoicism or a strong work ethic can lead to burnout. Many people are committed to doing a great job and may become overwhelmed in the process. Because they are driven to succeed, they may not let anyone know that they are struggling and in the process get burnt out. Many people are primed not to ask for help or show weakness publicly. Often people will be concerned about the effects of the actor-observer bias, where people are likely to attribute their own actions to short-term environmental factors and others’ actions to long-term dispositional factors. It’s reasonable that people will worry that not pulling their weight on the team will make them look bad in front of their peers, when in reality the team would likely empathize with their situation and acknowledge they are working too hard. Because of this reluctance to ask for help or publicly admit they need a break, people will choose stoicism instead, which can prevent them from taking the measures that would avert the effects of burnout.

The issue of the lack of communication 

Lastly, while it may seem clear to the person who is getting burnt out, without proper communication it’s likely that their manager or peers will not respond unless the issue has been raised directly. Many people are inclined to take people at their word and proceed accordingly. Adding the element of remote work makes this even more difficult because managers and colleagues can only interact with you over the phone or on Zoom, where it’s easier to keep up appearances. According to psychology, people have two things working against them here. First is a general bias towards the truth, meaning people generally trust what others are saying to them. When someone says something like, “yes, everything is fine, I’m just a little tired,” we instinctively believe them, even if we have doubts that that’s actually the case. Second, people often fall victim to the illusion of transparency. We assume that our emotions are written on our faces and that people close to us should be able to read our facial expressions and body language to know when we are struggling, even if we say otherwise. The reality is that we are not transparent at all, and observing people’s expressions and behaviors is just not a reliable way to learn how they truly feel. Therefore, there is just no easy way to know that intervention is needed without it being asked for.

Tackling burnout with communication, rest, and better boundaries  

Like many parents during the pandemic, I suffered from burnout. Here is my story and how I was able to overcome it. All of this started for me when the shelter-in-place orders were put into place. Previously I maintained a good work-life balance: I went to work, dropped off my kid at daycare during the day, spent plenty of quality time with my family and friends, and improved my fitness at the gym.

That all changed with the shelter-in-place order. My three-year-old daughter could not attend daycare anymore, forcing me to take on unusual work hours to care for her and help out my pregnant wife as she approached her due date. This situation created all the elements that bring on burnout. Initially I did not notice that I was suffering. I was tired from working long hours and taking care of my household. Working unusual hours, from 7am-12pm and then from 7pm-11pm, meant that I was working all the time, since I also handled childcare and household chores during the day and had to respond to emergencies or questions. This work schedule meant that even after the workday was finished I would still dream about coding. While I was clearly not giving myself enough time to rest, I thought I was just tired from working hard and nothing more.

When things started to intensify I suffered from the other cause of burnout, stoicism. While I was already working a lot, I would still do on-call shifts and take emergency calls outside of my scheduled working hours because I wanted the team to be successful. While I was helping the team there was a high personal cost to constantly taking emergency calls and never really taking a break from work. 

The last issue was that, until I told my team, they were unaware of my predicament and thought I was handling all my responsibilities. As the consumer Android app tech lead, I led the team to the successful completion of a full app rewrite, which is a huge endeavor on top of our responsibility to maintain the original app for millions of consumers. While we were focused on delivering this project, entirely remotely, it would not have been easy for my team to realize I was struggling. The team pinging me for emergencies was making my burnout worse, since I never got a real break, but the team did not know I was not in a state to handle these requests because I never brought it up.

How to tackle burnout 

Once you realize that you are experiencing burnout, it can be tackled by communicating about it, taking time off, establishing work boundaries, and utilizing all available resources. Let’s dive into each one of these, which I utilized to get over my bout of burnout. 

Why communication is key 

For anyone dealing with burnout, the first step to overcoming this issue is to be able to talk about it openly, including acknowledging to yourself that you are burning out. In my experience, things only started to get better once I actually acknowledged the problem and talked to my manager and team about it. The fact that I was so stoic and focused on being a team player actually let me down in the end. If I had simply explained to my coworkers not to reach out to me outside of my working hours, or told my manager I was burning out and needed a break, then the team would have helped me take on some of the tasks I was handling.

Although DoorDash is always keen on executing quickly, our teams practice a lot of empathy and would have understood my situation and been able to help. The missing piece was just that they were not aware of how bad my stress and burnout really were. For example, the moment I reached my breaking point the team was totally willing to let me stop doing on-call shifts even if that meant that each of them would need to pick up my slack. The team and my manager realized that I had been pushing too hard for too long and were understanding, not judgmental.

Taking time off is the best way to heal 

Another key aspect of getting over burnout is taking the time off you need to recover. When I told my manager that I was suffering from burnout, he did not hesitate to have me follow my doctor’s suggestion and take two weeks off. Having two weeks to not think about work and recharge my batteries really helped me through this whole ordeal. Because DoorDash has unlimited PTO, my manager recommended that I prioritize my health, take the time to recover, and not worry about my work goals. The team was happy to accommodate my situation, and when I returned, I was energized and ready to hop back in and make more of an impact.

The importance of establishing a work-life balance

A key element to preventing additional burnout was ensuring that I stuck to my own work schedule and that teammates respected it. Part of what made me suffer from burnout was not separating myself from work. By working long hours outside my typical pre-pandemic work schedule, and responding to pings constantly, I never gave myself the break I needed. It did not help that when working from home there is no physical separation from work, which made it even easier to never turn off and rest. 

After taking time off, I committed to maintaining a more manageable work schedule. Rather than just saying I was unavailable, I would turn off all non-emergency notifications, and only check work emails or notifications during working hours. I made sure to push back on items that were out of scope and set expectations with my team, to ensure project plans were realistic and not overwhelming. This enabled me to get all my work done, without taking work home or pushing myself to the limit. 

Previously, I had thought that being available all the time was a big asset to the team. In reality, it was better to be more professional, set firm boundaries, and protect my mental health. I was pleasantly surprised that the team totally understood this and was happy to work within the boundaries I had set up. 

Taking advantage of other resources 

In addition to the strategies above, taking advantage of any available resources will help get through burnout. While I primarily leaned on DoorDash’s unlimited time off policy, I took advantage of other resources that many people don’t realize their company offers.  

Utilizing employee resource groups 

Considering that some of the biggest contributors to burnout are lack of communication and stoicism with the team, one of the best resources available was talking to someone else who would understand my situation. Fortunately for me there were a number of employee resource groups (called ERGs at DoorDash). I joined the Parents@ ERG, which gave me a safe space to talk to other parents about the struggles I was dealing with. Not only did it help me open up about my burnout, it was also a safe space to discuss other issues without worrying about being judged or seeming like I wasn’t being a team player. Ultimately, working with the Parents@ ERG helped me realize I had burnout and gave me the push to talk to my team and manager about it.

Utilizing other resources 

Another benefit I took advantage of was the expanded wellness reimbursement program. Before COVID-19 I had used the gym to unwind, but with all gyms closed, losing this outlet made it harder to separate my work life from my personal time. Luckily, I was able to take advantage of the expanded wellness program to buy a subscription to an online meditation app, which helped me calm down after stressful days at work.

There were also a number of other potentially helpful resources I did not utilize. For example, I could have taken advantage of DoorDash’s free decompression sessions, which are led remotely by mental health professionals. The lesson here is that while you are burning out, it may seem like you’re alone but there are probably resources at your disposal that you can utilize to approach the problem in the best way for you. 

Conclusion

The COVID-19 pandemic introduced new stresses into many of our lives, but it’s important to note that even as vaccinations increase and reopening occurs, these additional stressors won’t necessarily go away entirely. With DoorDash and others moving to hybrid work models in the months ahead, we will still need to balance these new variables, and can take learnings to better shape the way we work going forward. All of us need to watch for burnout because it’s hard for us as individuals, and our teams, to detect. Ultimately, the best way to mitigate burnout for yourself and others is to communicate actively and openly with our teams and managers. Leaning on your support system at work and enforcing your own boundaries will help you maintain a healthier work-life balance, and continue showing up as your best self.

Forecasting is essential for planning and operations at any business — especially those where success is heavily indexed on operational efficiency. Retail businesses must ensure supply meets demand across volatile changes in seasonal preferences and consumer demand. Manufacturers need to ensure they have the right amount of supplies and inventory in order to fulfill orders without locking up money in idle or unused resources. Other industries rely on forecasting for staffing, vendor commitments, and financial planning among a host of other applications.

Similarly, DoorDash has many forecasting needs covering everything from deliveries to marketing to financial performance. To ensure that our internal business partners get the information they need, our Data Science team developed what we call our Forecast Factory, a platform allowing all our teams to set up their own forecasts without the help of a dedicated team of data scientists. We’ll discuss the general characteristics of forecasts and the challenges of managing them at scale, then explain how we were able to overcome these challenges by building our Forecast Factory.

What forecasting solutions cover 

Despite the different operational processes forecasts support, there are many commonalities among the implementations:

  • Benchmarking current course and speed
    • If a static operation is maintained, what would demand be? How long would our inventory last? How many customer support tickets will our agents face? 
  • Scenario planning
    • If we take a specific action, how would demand change? How would the timing of ordering new inventory reduce the likelihood of running out of inventory?
  • Granular decision-making
    • Can we staff additional support agents only for the vendors that will be overloaded? Can we pull forward restocking only for SKUs that are at risk? Can we focus resources on only those business lines most at risk for missing the plan?

As a logistics provider operating multiple business lines at international scale and supporting thousands of merchant partners, DoorDash finds forecasting especially critical. The health of our marketplace requires us to constantly manage supply and demand between Dashers (our term for delivery drivers), merchants, and consumers. Acquisition is tied to longer-term demand projections. Support agent staffing needs to be matched to the expected volume of support tickets to maintain quality outcomes for our merchants, Dashers, and consumers. The Merchant team needs to understand cuisine preferences to ensure the right selection is available for our consumers. Every aspect of our business is underpinned by a need for continuously updated, reliable forecasts to enable top-notch operational efficiency across thousands of diverse geographies.

Alleviating the challenges of managing thousands of forecasts

Supporting quality forecasts for even one of these applications could easily demand the attention of several data scientists. Scaling to thousands of forecasts with that level of dedicated support would be untenable. Other challenges (especially with diffusely managed forecasts) include:

  • Incompatible and inconsistent formats housed in different mediums (Excel, database, business intelligence (BI) tools) making forecasts difficult to compare
    • Example: Volume forecasts are weekly totals housed in a database while the support team uses daily task ratios against volume housed in Excel.
  • Lack of a central location for upper management to analyze against operational outcomes
    • Example: Daily financial reporting requires SQL for forecasts housed in databases, manual entry for Excel, and downloading from dashboards for those managed with BI tools leading to heavy integration efforts.
  • Inefficient handoffs to dependent business partners 
    • Example: Demand planning locks its forecast Wednesday mornings but supply planning needs to make decisions for the week ahead on Monday.
  • Inability to easily incorporate relevant business knowledge 
    • Example: Hundreds of local market operators have flexibility over marketing promotions with an aggregate material effect but the overhead of collecting hundreds of individual actions prevents incorporation to a central forecast.

DoorDash faced many of these challenges. As a matrixed organization, data scientists and engineers are spread across verticals, such as consumer, merchant, logistics, Dasher, operational excellence, and marketing, all with separate needs. Compounding the problem is that the marketplace is active 24 hours a day, business strategy is heavily dependent on matching outcomes to operational targets, and the entire process has to be optimized at a sub-ZIP Code level.

Solving enterprise forecasting with a centralized platform

DoorDash created Forecast Factory, a centralized forecasting toolkit that can accept human-in-the-loop adjustments and business knowledge, to solve these operational pain points. Forecast Factory enables operational teams to onboard critical forecasts for managed execution, presentation, and analysis. 

This platform has the following benefits: 

  • Scalability: Teams now have an easy way to plug in their data, get a forecast, submit adjustments, do scenario planning, finalize a model, and lock and distribute the results.
  • Consistency: Ensure that forecasts have a consistent format and centralized location with defined accuracy metrics.
  • Timing: Scheduling, parallelization, and automatic resource allocation mean forecasts are ready in time for operational processes on different schedules to consume.
  • Accuracy: Access to a suite of best-in-class machine learning algorithms.
  • Access: Partners have a consistent interface and growing suite of visualizations to disseminate forecasts and targets.

Designing the Forecast Factory 

Our centralized platform is built around a modular forecasting toolkit that allows teams to customize pipelines to their specific needs with moderate technical knowledge.

The main toolkit components are:

Figure 1: The Forecast Factory toolkit combines best-in-class time series algorithms with dynamic processing of the series to alleviate arduous manual work that teams often use to prepare a series such as outlier removal and accounting for external factors like holidays and promotions. This entire process can be wrapped in a configurable grid search for parameter selection along any time dimension and prediction cadence. Dynamic processing techniques can be selected for by the grid search based on historical accuracy or explicitly locked.
  1. Historical Data: A SQL query pulls a series with a target column and a date column that are mapped to the system through a configuration file. The toolkit can produce reliable forecasts with as little as 28 days of data for a daily series. The interface was designed to make this as easy as possible for our end users and allow them to update the query with direct changes.
  2. Time Series Data Slicer: This method enables training the algorithm over different time units, horizons, frequencies, and optimization periods. It also allows teams to match forecast characteristics to their objectives, even on the same series. This component handles all these date transformations throughout the training and prediction process.
  • Unit: hourly, daily, weekly, monthly, etc.
  • Horizon: number of units ahead for predictions (e.g. seven days, six weeks, etc.)
  • Frequency: refresh schedule (e.g. daily, weekly, etc.)
  • Optimization period: critical time frame for accuracy in units that training optimizes for
  3. Data Preprocessing: This step handles necessary adjustments to the input series, each of which can be selected for by the algorithm based on accuracy. The toolkit has ready-made options to remove the effects of outliers, holidays, outages, promotions, and other events that deviate from the current course and speed of the series. These can be added directly as features in later steps, but allow an important interface to control for these effects in time series models (e.g. exponential smoothing) that only operate on the series itself. Preprocessors (and prediction processors) can include models built on the residuals of the time series to provide a further bridge for using external features to estimate and control for difficult-to-specify effects like weather.
  4. Time Series Algorithm: The toolkit is designed to be agnostic to modeling algorithm choice. Right now, most forecasts are built on time series models combined with processing to control for external effects. However, other models, such as Gradient Boosted Machines and Prophet, can be utilized through the ability to pass feature sets to the model through the Time Series Data Slicer to make the right features available at the right time.
  5. Prediction Processing: This step is similar to the preprocessing step except done on the predictions themselves for the future. Similar adjustments outside of current course and speed are done here for time series models. Adjustments for known promotions, calculated holiday coefficients, and residual adjustments based on weather models are examples of steps currently implemented.

  6. Component Selector/Grid Search: This object allows the user to specify which components and parameters to backtest over a given time window. Different combinations of external parameters such as preprocessors, lookback windows (limiting input data range), and postprocessors are specified here along with internal model-specific parameters such as additive/multiplicative seasonality for exponential smoothing algorithms. This provides the toolkit the ability to learn the best algorithm parameters but also make adjustments like shortening the input time series during periods of rapid change and dismissing holiday adjustments when the pattern seems to have deviated from history. The ability to specify custom loss functions also means we can tailor selection to specific team objectives.
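To make the component list above concrete, a pipeline configuration might conceptually look like the following. This is purely illustrative Python, not the actual Forecast Factory API; every key and value is an assumption on our part.

# Hypothetical configuration sketch; these names do not correspond to real Forecast Factory objects.
forecast_config = {
    "historical_data": {
        "sql": "SELECT order_date AS ds, deliveries AS y FROM daily_deliveries",
        "date_column": "ds",
        "target_column": "y",
    },
    "slicer": {
        "unit": "daily",           # granularity of the series
        "horizon": 28,             # predict 28 units ahead
        "frequency": "daily",      # refresh schedule
        "optimization_period": 7,  # optimize accuracy for the first week
    },
    "preprocessing": ["remove_outliers", "holiday_adjustment"],
    "algorithm": {"name": "exponential_smoothing"},
    "prediction_processing": ["apply_known_promotions"],
    "grid_search": {
        "lookback_windows": [90, 180, 365],
        "seasonality": ["additive", "multiplicative"],
        "loss": "mape",
    },
}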

Forecast Factory infrastructure

In order to operationalize the toolkit, Forecast Factory is embedded within an architecture that allows user input, interactive exploration of candidate and locked forecasts, accuracy summaries and the submission of on-the-fly adjustments:

Figure 2: The platform infrastructure manages Forecast Factory job runs, accepts user input and stores/visualizes results.

The Forecast Factory’s infrastructure relies on Dagster for job orchestration and Databricks for execution. Standard visualizations allow users to explore various candidate forecasts, including the one marked as best by the toolkit, and decide which to lock as final for their use case. Accuracy visualizations then allow users to see how various forecasts performed over time and during which periods so they can make more informed selections for the future.

The main pipeline components are:

  1. Base ETL: This is the query supplied by the end user to provide the input series for forecasting.
  2. Pre-Forecast Code: Any pre-forecast code is executed here, such as calculating features, training a residual predictor prior to its use in preprocessing or prediction processing, or checking for any reported outages in previous days.
  3. Base Forecasts: The champion model selected by the algorithm as well as any other candidate forecasts (e.g. the team typically likes to observe a version without holiday effects to see how the algorithm is accounting for the impact, a version with a longer lookback window to compare long and short term, etc.).
  4. Select Adjustments: Users can select specified adjustments to the input series or predictions. There is a standard adjustment database that allows users to create their own preprocess or prediction process adjustments directly via a Python API or by importing from a Google sheet. This ensures users with business knowledge (a promotion next Wednesday) can supply that input without the need to go through the Data Science team.
  5. Adjusted Base Forecasts: Some adjustments (removing a past promotion from the input series) may require the forecast algorithm to be rerun. This step accounts for that and produces the adjusted forecast.
  6. Collect Candidates: Candidates are stored in a database with a schema that attaches metadata (e.g. internal and external parameters) and controls which forecast is locked as final and exposed to the end user process. An exploration visualization allows users to see candidates side-by-side and compare against past actuals and a suite of metrics.
  7. Select Final Forecast: Once the final forecast is selected and marked as such in the data model, metadata around the forecast is stored, any required hierarchies are built (e.g. normalized market level forecasts to top level forecast), and accuracy information is populated. Historical accuracy for the locked forecast (and candidates) can be viewed via an Accuracy Dashboard.
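As a rough illustration of how these stages could be orchestrated, the sketch below wires hypothetical ops into a Dagster job; the op names and bodies are our own simplification, not DoorDash’s actual pipeline code.

from dagster import job, op

@op
def base_etl():
    """Run the user-supplied SQL and return the input series."""
    ...

@op
def pre_forecast(series):
    """Compute features, train residual predictors, and check for outages."""
    ...

@op
def base_forecasts(prepared_series):
    """Fit the candidate models and return their predictions."""
    ...

@op
def apply_adjustments(forecasts):
    """Apply user-submitted preprocess and prediction-process adjustments."""
    ...

@op
def collect_candidates(adjusted_forecasts):
    """Store candidates plus metadata and expose the locked forecast."""
    ...

@job
def forecast_factory_pipeline():
    collect_candidates(apply_adjustments(base_forecasts(pre_forecast(base_etl()))))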

Conclusion 

The toolkit and pipeline infrastructure above provide a modular structure that enables generalization to a wide range of business partner use cases while shortening onboarding cycles. Partners can directly alter the query as changes in underlying data or process happen and/or new geographies or levels are added. Business users can quickly and intuitively submit adjustments without having to involve a data scientist to apply the changes manually. If data scientists from specific focus areas want to include their own algorithms, custom loss functions, or otherwise, these components can be extended without having to rewrite the whole system.

Acknowledgements

Thank you to everyone who has contributed to the Forecast Factory project! DSML: Lauren Savage, Qiyun Pan, Chad Akkoyun. Platform: Brian Seo, Swaroop Chitlur, Hebo Yang. And all of our amazing partners.

In our previous article, Using Display Modules to Enable Rapid Experimentation on DoorDash’s Homepage, we discussed the concept of Display Modules, and how we built them to speed up development and implement a more flexible experimentation paradigm. Although the Display Module system partially solved some of the challenges involved, we felt we could improve DoorDash’s product development velocity even more. 

The improvements we wanted to make were:

  1. Releasing faster, by relying more on backend releases rather than mobile releases
  2. Removing inefficiencies in both backend and frontend codebases
  3. Fixing problems caused by our previous server-driven UI attempt

We’ll go over these problems in more detail in the next few sections.

Mobile releases take a lot of time

Mobile releases are often an intensive process for the following reasons:

  • App store reviews take time, as opposed to backend releases, which only need internal approval.
  • More rigor (and hence time) is required for testing and during the release ramp, compared to backend releases, because mobile releases cannot be rolled back instantly. The only option with a faulty release is to send out a new version that solves the previous release’s bugs.
  • Not all users upgrade their applications. There are always situations where a small percentage of users run older versions of the app. This can be tedious because subsequent backend releases will need to support, in addition to new versions, these old versions for as long as users are still using them.

Redundant implementations

We also wanted to eliminate the need for similar logic to be implemented across the stack. One issue was that we had duplicative implementations of the same business logic across multiple clients. Often, this leads to slightly different functional behavior. With additional rigor, it is possible to ensure that parallel client behaviors remain in sync, but this rigor takes time, and doesn’t allow for fast execution. Additionally, there might be some business logic performed on the client side that should have been done on the backend, causing errors, as the information the user saw was not based on the source of truth in the database.

The second inefficiency was that client-side components that looked very similar were difficult to reuse, meaning that even incremental changes took a lot of development time. As DoorDash went through a hypergrowth phase, the time it took to coordinate and communicate across different teams to converge on a good design (from both UI and engineering perspectives) also increased. It became harder to continuously engineer for the optimal separation of business logic and view logic, leading to many duplicative implementations of very similar views.

Iterating on our previous server-driven UI attempt 

Our previous attempt achieved a lot of our prior goals, and also brought our attention to some other potential areas of improvement.

Firstly, UI components were strongly coupled with data models. Any deviation from existing data models required changes to multiple microservices, not even counting client-side changes. This was an area where we saw an opportunity to improve our execution velocity.

Secondly, our response format was a heterogeneous array containing different components. Each of the components had certain constant envelope fields and differing content fields based on the type of the components. While in theory this was designed to be flexible, in practice we observed this was prone to deserialization errors.

Thirdly, while we were no longer using a static API response shape, and we were able to find enough flexibility to dynamically rank within component types, we wanted universal capabilities to rank content and unlock more avenues for personalization to better serve what our customers want. 

Designing generic UI components 

Our analysis of the problems listed above helped us come up with the following broad design requirements:

  • Decouple the backend responses from backend data models, and instead couple them to the views as much as possible. Removing business logic in the client reduces the overall complexity, and enables views to be naturally reusable.
  • Define new components whose implementation details would be completely defined and owned by the client teams. Moreover, we wanted new UI components to be easily shareable across different product use cases. This would reduce incremental effort if utilizing a previously defined component.

By investing in this framework, we hoped to unlock faster development speeds, allowing us to launch new features faster. As a platform team, we wanted to provide our product teams the ability to change feed experiences on the fly, which would open the door to exciting projects in the future. Our team wanted to unlock new avenues for personalization and relevance and improve the ease of experimentation.

Reviewing previous literature

We read through existing technical articles and reviewed videos from others implementing similar frameworks. We came across John Sundell’s talk on the subject and Spotify’s open source HubFramework which, while deprecated, was a good starting point. We also took a look at similar efforts by Instacart. We did not initially look at Airbnb’s server-driven UI, but upon doing more research while working on later iterations, we were pleasantly surprised to come across articles and videos describing efforts similar to our own solution.

Building the minimal viable product (MVP)

We wanted to build an MVP and test it out in a production environment in a not-so-critical product. This kind of testing would let us get real 360-degree feedback from customers and cross-functional teams, as well as client and backend engineers.

Designing the Facet

We designed a Facet, a building block based on the design principles described above, which is meant to map one-to-one to a view on the screen. In order to communicate view logic rather than business logic, we decided to define the UI components in terms of UI primitives (i.e. as view models, or models that bind one-to-one to various aspects of the view) instead of data models. Therefore, we created the following primitives to describe an individual view:

// The Guest of Honor.
message Facet {
  // ID of this instance of this element.
  google.protobuf.StringValue id = 1;

  // Which component should this information be rendered in the form of?
  Component component = 2;
  
  // All text fields pertaining to this element.
  Text text = 3;
  
  // All image fields pertaining to this element.
  Images images = 4;

  // Any data that doesn't fit other fields in this proto
  google.protobuf.Struct custom = 5;

  // Events related to this element, for example: clicks, selection etc.
  Events events = 6;

  // Nesting
  repeated Facet children = 7;
  
  // Only contains event data. The event name is implicit based on the placement of the "logging" field.
  // This one refers to clicks and views
  google.protobuf.Struct logging = 8;

  // Layout Data
  Layout layout = 9;

  // facet level style
  Style style = 10;

  // Component information
  message Component {
    // Maps to an Id in the component library - https://docs.google.com/document/d/1IWSggUGns5fMTUq6ysVP3ZpRB0jZqIjhpUvH90_ncZI/ 
    google.protobuf.StringValue id = 1;

    // This can be used in the future to define a fallback component group for old(er) app versions
    google.protobuf.StringValue category = 2;
  }

  // Text fields
  message Text {
    // Means different things in the context of different components. Refer to component library for details.
    google.protobuf.StringValue title = 1;

    // Means different things in the context of different components. Refer to component library for details.
    google.protobuf.StringValue subtitle = 2;

    // Means different things in the context of different components. Refer to component library for details.
    google.protobuf.StringValue accessory = 3;

    // Means different things in the context of different components. Refer to component library for details.
    google.protobuf.StringValue description = 4;

    // Any other text
    map<string, google.protobuf.StringValue> custom = 5;
  }

  // Image fields
  message Images {
    // Means different things in the context of different components. Refer to component library for details.
    Image main = 1;

    // Means different things in the context of different components. Refer to component library for details.
    Image icon = 2;

    // Means different things in the context of different components. Refer to component library for details.
    Image background = 3;

    // Means different things in the context of different components. Refer to component library for details.
    Image accessory = 5;

    // Means different things in the context of different components. Refer to component library for details.
    map<string, Image> custom = 4;
  }

  // Image type
  message Image {
    // URI for image
    google.protobuf.StringValue uri = 1;

    // Placeholder text / string
    google.protobuf.StringValue placeholder = 2;

    // Placeholder local asset identifier (in case)
    google.protobuf.StringValue local = 4;

    // Display Style
    Style style = 3;

    // Display Style Enum
    enum Style {
      // If unset
      STYLE_UNSPECIFIED = 0;

      // Rect image w/ rounded corners
      STYLE_ROUNDED = 1;

      // Circular image
      STYLE_CIRCLE = 2;
      // Add more here as needed
    }
  }

  // Event
  message Events {
    // Click option definition
    Action click = 1;
  }

  // Action
  message Action {
    // Action name
    google.protobuf.StringValue name = 1;

    // Action - related data
    google.protobuf.Struct data = 2;
  }
}
  1. Text

Most individual views have a text hierarchy, so we created a text model that semantically describes the importance of each text field. Depending on the type of view, such as a row, page header, carousel, or information view, the text is laid out according to that view’s implementation.

"text": {
    "title": "Frosty Bear",
    "subtitle": "Pizza, Vegetarian, DashPass",
    "description": "4.8, 14,400+ ratings, Free delivery over $12",
    "annotation": "$10 off, DashPass only"
}
  2. Images

We had a similar concept for images, where we created an image model that semantically describes importance and placement across a view.

"images": {
    "main": "https://img.cdn4dd.com/media/XXX.png",
    "background": "https://img.cdn4dd.com/media/YYY.png",
    "accessory": "dashpass-logo"
}
  3. Events

We wanted to come up with a model that semantically describes how a user interacts with the view. Currently, we only use “click”, but our design keeps the option open to introduce other types of actions, such as drag or swipe.

  4. Logging/analytics data

Previously, the clients had to do a lot of data-massaging to log attributes for view impressions and clicks. This introduced a decent amount of complexity, because views might need to know a lot of business logic unrelated to the view itself, such as an item cell needing to log its parent store, the consumer submarket ID, or the delivery fee.

Because the clients were going to be much more data-agnostic, this information could no longer be derived on the client, so we started sending the attribute keys and values in a simple [String: String] dictionary that the clients would log with the views. This approach made client-side views cleaner because we no longer had to worry about business logic in views purely for data analytics. It also allowed us to define a consistent logging workflow, i.e. always log action and impression events with the attributes defined in this field plus some standard client-side additions like session ID, user ID, client version, or OS version. One minor downside is that this approach bloated the response: adding logging attributes to each view meant sending a lot of repeated information in the same API response.

  5. ID

Each Facet contains an ID field which is intended to be unique within the scope of a response. The ID needs to be unique because it is used to diff the view tree for clients, and it is also used for saving the objects to a database on the Android client for caching.
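As a rough, hypothetical illustration of how these IDs might be used (the actual client code is Swift and Kotlin, and the real diffing operates on the view tree), a minimal Python sketch of diffing two renders of a Facet list by ID could look like:

def diff_facets(old_facets, new_facets):
    """Diff two renders of a Facet list by ID, returning which facets were
    inserted and which were removed. Assumes each facet dict has a unique "id"."""
    old_ids = {facet["id"] for facet in old_facets}
    new_ids = {facet["id"] for facet in new_facets}
    return {
        "inserted": [facet for facet in new_facets if facet["id"] not in old_ids],
        "removed": [facet for facet in old_facets if facet["id"] not in new_ids],
    }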

  6. Custom (optional) 

We ran into some issues where the semantically described models weren’t precise enough to describe some aspects of the views, especially for certain aspects that we wanted to be customizable per view, such as utilizing the same exact view, but with a light background versus a dark one. To add customization between components, we started to utilize the Google Protobuf Struct model, which is untyped. The goal is that we shouldn’t have to use them in most cases, but the provision exists if we have a case that doesn’t fit any of the other primitives.

  7. Component

This field contains component information which informs the client on how to use the remaining information contained in the Facet.

  8. Children (optional) 

This field is an array of Facets, which provides the option of nesting these building blocks recursively.

  9. Style

We didn’t originally specify size, colors, or font, and previously used the Custom dynamically-typed object to specify these attributes when needed. Since then, we started matching styling types with our design language system library, which has helped mitigate our usage of Custom and reduce redundancies between components. 

/// Struct that indicates styling elements, e.g. background color, font, dls style, etc.
struct Style {
    /// Specifies configurable background color. Right now, only
    /// addressPicker utilizes this
    public var backgroundColor: ColorSemantic?
    /// Specifies the size class. Is useful for button components, currently
    public var sizeClass: SizeClass?
    /// Maps to a Prism `type`. Somewhat overlapping in specs with
    /// `backgroundColor` field, since the type will typically
    /// specify the background color, in addition to foregroundColor, etc
    public var dlsType: DLSType?
}

Product use cases: collections, tiles, and landing pages

Now that we had a design, we needed to test it in a production environment. Conveniently, we were in the process of developing a new type of screen real estate: tile collections and custom landing pages. This was a great opportunity to field-test our design, because it included components we were building from scratch as well as components that integrated with our previous solution. This testing would also give us a sense of how easy future migration efforts could be, and it was an ideal testing ground because it was not a critical feature. While there are some risks associated with coupling a launch with new technology, we were confident we could keep them from spiraling out of control.

Figure 1: An instance of our collections carousel, on the left, was folded into the existing homepage design/framework. Individual collection tiles can lead to different variations of landing pages, shown on the right. This illustrates how our Facet concept can be interlaced within a predominantly non-Facet response as well as in a greenfield Facet-only response.

As shown in Figure 1, above, tile collections are actually part of the homepage. Because the homepage followed the Display Module pattern, we didn’t want to completely revamp it for an MVP. By definition, an MVP’s scope should be minimal. To conform to Display Modules on the homepage, we actually defined a new Display Module that was essentially a wrapper around the corresponding Facet for a collection.

Because the landing page was brand new, we had the ability to design it completely from scratch using Facets. When defining components, we employed a pragmatic approach: define components that are needed immediately rather than making defining the entire library a prerequisite for our work.

The business and product teams were excited about the new screen real estate and the promise of flexibility. Mobile developers were excited about how fun and easy it was to design a component and how much it reduced incremental work. Although there was some initial investment, and more effort is required when new components are added, the effort was considerably less than with the previous solution.

From V0 to V1: adopting and expanding the MVP

Although DoorDash’s business model began with restaurant food delivery, our future lies in adding other delivery use cases, such as convenience items and groceries. In order to incorporate these new verticals in the customer experiences, we embarked on an ambitious project to re-architect the UI elements on our homepage to allow different configurations for different use cases. 

The previous API response structure that powered our homepage was geared toward displaying restaurants, and served restaurant data models, but the new designs called for a paradigm shift to a feed that split up the homepage into different categories. We knew we were making a drastic change to the backend of our homepage that would still employ mostly the same broader components as before, and we wanted to experiment with different layouts and allow for the backend to completely configure the entire page.

Model updates to support animations and interactions

Our previous proof-of-concept was designed to be a simple, linear layout for a scrollable feed of simple UI components. Our homepage is more data and interaction-heavy than other parts of our app, so we had to come up with ways to support features that weren’t needed in our MVP.

Paging: We don’t want to give the user all the available stores in one response; that isn’t scalable, and our users wouldn’t appreciate such a huge dump of data, so we needed to support pagination. To accomplish this, we added a new case to our Action enum to represent loading content. The content referred to here would potentially include offset information. In cases where paging is appropriate, we also added a key/value pair of Action to our outer response structure.

enum Action { 
    case navigate
    case loadContent(offset: String)
}

Reloading specific sections: Our MVP simply served components to be laid out sequentially, but could not group parts of the page. We ultimately decided to structure our response in a way that indicated separate groupings with IDs. Some actions might specify that they are only relevant to a particular section (or set of sections).

message FacetSection {
  google.protobuf.StringValue id = 1;
  repeated Facet header = 2;
  repeated Facet body = 3;
  Layout layout = 4;
}

Bulking up the component library

Our layout engine and FacetFactory, a class that utilizes a factory pattern to produce UI components, were originally implemented for our MVP. At the start of the first version of the project for powering new verticals, we needed to update our main layout engine to support the required new features, as well as develop many new UI components.

Having multiple engineers working on each of the iOS and Android apps in parallel worked better and was ultimately faster with this new paradigm because, with each new component, we did not have to design a new API response structure; we merely defined a new component, mapping its different aspects onto the text, images, and events structures. Each engineer also had greater velocity, as the data coming down was already in view model form, so it was easy and intuitive to create reusable components client-side. We did not need to write a new response model and domain model client-side with each new component, which also had the side effect of letting us worry less about deserialization issues.

Note: The only data that isn’t in view-model form is the data in custom fields, which is data-dependent and forces us to write custom deserialization logic. We tried to avoid custom objects, and when they were necessary, we made them as lightweight as possible, which helped with velocity, and kept as many fields nullable as possible to help mitigate deserialization issues.

Challenges, pitfalls, and learnings

As with any new system implementation at scale, we faced various challenges and learned lessons along the way. Production testing, with both less critical and more critical use cases, revealed design issues we needed to overcome. We also uncovered edge cases in need of solutions. Addressing these and other issues led to a more reliable and scalable production system.

Versioning of client Facet capabilities

Mobile engineers can’t deploy a new version of the app to every single device. Even with strong new version adoption there will always be a percentage of users who continue to use an older version of the app because they don’t or can’t update it. When we quickly iterate over components and features, users with outdated app versions may have a sub-optimal experience or, worse, run into bugs that prevent successful task completion.

Scenarios:

  1. One reason for outdated experiences could be that apps have older versions of native components. This situation is not specific to server-driven UI, and is a reasonable and unavoidable effect of individual users not updating their apps.
  2. Another reason could be that outdated apps will not understand new components deployed outside of a standard app update. These new components might be designed for a newer version of the app, so the app cannot map them to a view. This issue might not be so bad in itself, as omitting a new component is not too different from having older versions of the app out in the wild, which existed before those components were developed, but with server-driven UI, this leads to interesting edge cases like:
    • Container components, such as carousels or lists, where the app doesn’t understand any of the children (the actual content), which could lead to an empty container being displayed.
    • If we created a new version of an instrumental component, such as the store row in the homepage, and it was different enough that it warranted a new Facet ID, older versions of the app would omit those views as soon as the backend starts sending the newly versioned ones. 
  3. Yet another reason for broken experiences is that clients might understand the component, but a new action type might be unsupported, in which case the component will cause an error on action, but might still be rendered, ultimately resulting in a poor experience.

Our solution to minimize broken experiences for older clients

We ultimately decided on a two-pronged approach to solve the versioning problem. We designed a simple Semantic Versioning-based system that allowed clients to specify the extent of their Facet rendering capabilities (or lack thereof) while making a call to the backend. We may, in the future, map this to versions of our DLS library, but the current version maps to various feature sets, i.e. components and action sets they support.

Additionally, we added some simple validation on the client-side, so that the versioning does not get out of control:

  1. Omit container Facets if we don’t recognize any of the children
  2. Omit Facets with navigation actions that we do not understand

We found that these two simple checks reduced our need to version our libraries by about 50%.
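For illustration only, here is a minimal Python sketch of these two checks (the real logic lives in our Swift and Kotlin clients, and the supported component and action sets below are hypothetical):

# Hypothetical sets of component IDs and action names this app version supports.
SUPPORTED_COMPONENTS = {"page.header", "carousel.standard", "store.row"}
SUPPORTED_ACTIONS = {"navigate", "load_content"}

def prune_unsupported(facet):
    """Return a copy of the facet with unsupported children removed,
    or None if the facet itself should be omitted."""
    surviving_children = [
        pruned
        for pruned in (prune_unsupported(child) for child in facet.get("children", []))
        if pruned is not None
    ]
    # Check 1: omit container facets if we don't recognize any of the children.
    if facet.get("children") and not surviving_children:
        return None
    # Check 2: omit facets whose click action we do not understand.
    action = (facet.get("events") or {}).get("click", {}).get("name")
    if action is not None and action not in SUPPORTED_ACTIONS:
        return None
    # Baseline behavior: omit components this app version cannot render.
    component_id = (facet.get("component") or {}).get("id")
    if component_id is not None and component_id not in SUPPORTED_COMPONENTS:
        return None
    return {**facet, "children": surviving_children}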

The pitfalls of using SwiftProtobuf

We initially tried using Protobuf, our response specification, to auto-generate client-side models, using the SwiftProtobuf library. We ran into issues with seemingly safe changes, such as adding an enum case, that broke the application. Because we were only generating a handful of models, including section, Facet, text, images, and events, we ended up removing the SwiftProtobuf library as a dependency and implementing the models manually.

Heavy user interaction and state maintenance

Several of our components have complex interactions, animations, and functionality that is dependent on state. Had we relied on the backend to supply the state, we might have needed additional API requests, which would be error-prone. It would also have led to a less responsive UI, as we would have had to wait for the API responses to come back before updating the state.

Figure 2: Relying on the backend to supply the state for our app’s Filters Carousel would have been unnecessarily complex and likely caused slow response times.

Taking the Filters Carousel as an example, we already had a modular filters component in which the backend returns a Filter data model that determines its behavior. For example, we would generate the appropriate query parameters upon selection of a filter, based on the type of filter, such as binary, collection, or range, and the options sent from the backend. We thought about making this query generation backend-driven, but quickly realized that updating the selection state and scroll position, and having the backend send down the query parameters we need in a round-trip with the API response, would be unnecessarily complex. Consequently, we decided on a hybrid approach for this component where the backend sends down the data in a custom field so that we could continue to use our local, modular filters UI component with minimal changes.

Another example of a Facet which has state maintenance and interactions is the Address Picker view. The Address Picker updates based on the overall user location, which is stored in-memory in app sessions. If we wanted to move toward it being powered by the Facet API, there would be a delay in updating the address. There were minimal returns in having it powered by Facets, so we decided to go with the data-dependent approach.

Convenience fields are a double-edged sword without centralized usage, tracking, and review

One pitfall we kept running into was gating the usage of custom for various components. As mentioned above, in order to support further customization of views that cannot be encapsulated in text, images, and events, we utilized a dynamic object called custom. Although a dynamic object is great for flexibility, there are downsides caused by the lack of type safety and the potential for deserialization errors. We currently mitigate the deserialization errors by ensuring that an unrecognized custom object does not fail deserialization of the entire response. Additionally, as we mentioned earlier, some of our components are hybrid in the sense that they are partially data-dependent. 
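As a rough sketch of this mitigation (not our actual client code, which is Swift and Kotlin), a parser can treat the custom payload as best-effort so a bad blob drops only that field rather than the whole response; parse_custom below is a hypothetical per-component hook:

def parse_custom(component_id, payload):
    """Hypothetical per-component parser for the untyped custom payload;
    raises if the payload doesn't match what this client version expects."""
    ...

def parse_facet(raw):
    """Parse one facet, tolerating unrecognized custom payloads."""
    facet = {
        "id": raw.get("id"),
        "component": raw.get("component"),
        "text": raw.get("text"),
        "images": raw.get("images"),
        "children": [parse_facet(child) for child in raw.get("children", [])],
    }
    try:
        facet["custom"] = parse_custom((raw.get("component") or {}).get("id"), raw.get("custom"))
    except Exception:
        # Drop only the custom payload; keep rendering the rest of the facet.
        facet["custom"] = None
    return facet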

We realize that utilizing this custom field is a bit of an anti-pattern. When other engineers onboard onto this new system, they usually gravitate toward using it instead of the prototypical Facet fields, such as text, images, and action, so we are constantly trying to minimize our reliance on custom and to find automated ways of deterring its use except for the most dire needs.

Web support required modifications to our design 

Since we didn’t integrate with all client platforms at the same time, some quirks were overlooked. We started developing this system for iOS slightly before Android, and did not adapt it for our web application until several quarters later. Interestingly, this system almost constrained the design of the parallel features on the web, because our goal was to have the same APIs power all the clients. Although the web design was similar to mobile, recent product updates sought to use the larger and more varied screen real estate that the web affords, and layouts need to be slightly different on the web in order to support so many different, and larger, screen sizes. 

For example, the views in our store feed section of the page were meant for the store to be treated as an edge-to-edge component, and therefore we called it Store Row. However, for the web, it did not make sense for the component to be edge-to-edge, and we needed several of the views to fit in one row, like a grid, and needed to encapsulate the Store Rows in a grid-like container just for the web. We were able to find workarounds within our framework which weren’t too painful, such as updating the web-specific layout engine that lives on the client, which we feel is a testament to the robustness of our overall design. 

We also ran into scenarios where certain components needed to behave differently across platforms while maintaining most of their functionality. As a fix, we introduced extra custom parameters for the impacted components and updated the logic within the Facet components to ignore them on mobile apps or consider them on the web.

One major challenge when implementing the Facets framework for web was orchestrating Facet data through GraphQL queries. The DoorDash web app uses GraphQL to query data from the backend. GraphQL requires the client application to request data fields explicitly, so the client needs a clear understanding of the data structure: the actual data fields and their hierarchy. The Facet data model was designed to be recursive, so there was no clear way to know how many nodes were present in the data, or whether nodes had any children, prior to requesting the data. This structure makes requesting Facet data via GraphQL close to impossible, or in the best case inefficient. 

To overcome this challenge, we implemented logic on the gateway layer to flatten the Facet response before sending it to the client app, and logic on the client side to reconstruct the original data structure from the flattened response. 
We used an efficient, fast, and very straightforward algorithm to flatten the data. The code below presents a sneak peek of the flattened data: “carousel:filters” has 4 children, and the 4 nodes immediately before it should be considered its children.

header:
0: {id: "filter:dashpass_eligible", childrenCount: 0,…}
1: {id: "filter:star_rating", childrenCount: 0,…}
2: {id: "filter:eta", childrenCount: 0,…}
3: {id: "filter:price_range", childrenCount: 0,…}
4: {id: "carousel:filters", childrenCount: 4,…}
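As a minimal sketch of one way to implement this (assuming the post-order layout shown above, where a node’s direct children immediately precede it), the gateway could flatten the tree and the client could rebuild it with a small stack; this is an illustration, not necessarily our exact algorithm:

def flatten(facet):
    """Flatten a Facet tree post-order: children first, then the parent
    annotated with how many direct children precede it."""
    flat = []
    for child in facet.get("children", []):
        flat.extend(flatten(child))
    node = {key: value for key, value in facet.items() if key != "children"}
    node["childrenCount"] = len(facet.get("children", []))
    flat.append(node)
    return flat

def reconstruct(flat_facets):
    """Rebuild the original tree(s) from a flattened list."""
    stack = []
    for node in flat_facets:
        node = dict(node)
        count = node.pop("childrenCount", 0)
        children = []
        if count:
            children = stack[-count:]
            del stack[-count:]
        node["children"] = children
        stack.append(node)
    return stack  # the remaining nodes are the roots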

Fallback components were unnecessary

We originally were very concerned about updates to important views, such as the Store Row or the standard carousel, that would cause us to version that component with a different component ID but that would be sorely missed if omitted. Because of this fear, we have a property on Component called Category, and we created a prototypical view for each category, such as a prototypical page-header, standard-carousel, and store-row. However, we have not found this to be useful at all: when we version components, we typically have a very specific new design in mind, in which case we utilize our Semantic Versioning system to know which components the clients can support before sending a response back to the client.

Results

Ultimately, the expansion of this project to create a complex server-driven system was successful in powering different verticals and supporting new features. Because it was such a huge project, it was heavily staffed with three client engineers per platform (iOS and Android). Each platform built 19 net new components, in addition to re-writing a handful of existing components on iOS, and we did not run into any issues with merge conflicts. 

The ability to work independently on the same feature was critical to the execution of a redesigned homepage, as we were rewriting the entire homepage to be modular and more flexible to non-restaurant merchants, such as grocery stores, flower shops, convenience stores, and chocolate shops.

Figure 3: The end result of our Facets concept helped us launch multiple critical product surfaces.

Additionally, choosing a homepage rewrite project to expand the system to support so many new components and new features resulted in a robust component library. This effort has led to several teams reusing our views and layout engine to power brand new features. 

Next steps

We are currently in the process of customizing view attributes such as font, color, and padding. There was previously no way to customize these simple attributes per component, so each component’s text styling was constant across all instances of that component. This led to some redundancy, with some components, such as headers, being very similar. We are now making the backend aware of certain style semantics, such as text, spacing, and size, as defined by client DLS libraries. This will allow for even greater flexibility and a reduction of redundancy, while maintaining a cohesive brand and theme.

Conclusion 

Many companies struggle with slow mobile development and adoption hurdles. Our solution allows faster iteration of products. For mobile developers, that means a larger initial investment in creating a layout engine and pragmatically building a large component library, but once components are defined, a page can be laid out with those components through backend deploys, which ultimately results in faster, more parallelizable development and less time building duplicative implementations. We’re constantly experimenting with new layout changes to find the ideal configuration as we onboard more verticals. From a business perspective, this also means a tremendous increase in ship velocity, because most client-side code can now be reused and there is less dependence on mobile release cycles.

Acknowledgement

The authors would like to thank Jeff Cosgriff, Xisheng Yao, Jimmy Zhou, Byran Yang, Kirtan Patel, Daniel Kelley, Liviu Romanascu, Erik Zhang, Bingxin Zhang, Fiona Miao, Salmaan Rizvi, Calvin Chueh, Ephraim Russo, Suke Hozumi, Ezra Berger, Wayne Cunningham, Josh Zhu, Rui Hu, Kathryn Gonzalez, Manolo Sañudo, Mauricio Barrera, Jimmy Liu, Eric Gu for contributing to this project.

DoorDash delivers millions of orders every day with the help of DeepRed, the system at the center of our last-mile logistics platform. But how does DeepRed really work and how do we use it to keep the marketplace running smoothly? To power our platform we needed to solve the “dispatch problem”: how to get each order from the store to the customer, via Dashers, as efficiently as possible. In this blog post, we will discuss the details of the dispatch problem, how we used ML and optimization to solve the problem, and how we continuously improve our solution with simulations and experimentation.

Understanding the dispatch problem

To better understand the dispatch problem, we will take a look at the goals DeepRed tries to achieve for each part of our three-sided market, and then we will examine the hurdles we face. 

The goals of dispatch

Let’s first define the goals we are trying to achieve when dispatching Dashers. Our goals are two-fold: 

  • Propose offers to Dashers as efficiently as possible so they can maximize their earning opportunities.
  • Deliver orders fast and on time so consumers and merchants are happy with their experience. 

Reaching these goals requires overcoming a number of challenges. We approach each challenge using machine learning and optimization solutions, and use simulation and experimentation methods to build on that performance.  

Finding the best Dasher 

To find the best Dasher to deliver an order, we need to consider a number of different factors. The most important factor is the geographical location of Dashers. We typically want to find a Dasher who is as close as possible to the store to minimize the total travel time. The second factor we look at is ensuring the Dasher will arrive at the right time. If we dispatch a Dasher too early, they will have to wait for the order to be ready. If we dispatch too late, the food will sit too long and could get cold, while the merchants and consumers become upset that the food wasn’t delivered as quickly as possible. Another factor is batching, utilizing Dashers as effectively as possible by looking for opportunities where a single Dasher can pick up multiple orders at the same store (or a set of nearby stores).

Figure 1: The goal of dispatch is to find the best Dasher to pick up each order once it’s ready at the merchant and deliver it to the customer. (This figure appeared in a previous blog post.)

Accounting for marketplace conditions

There are other marketplace conditions outside of our control that play into our decisions of which Dasher to choose. The most important one is the supply and demand balance in any given market. While we try to make sure there are enough Dashers available to fulfill orders, there may be times when we don’t have enough Dashers to pick up all the orders. In these undersupply scenarios, we have to make tradeoffs about which orders to pick up now versus later. These are also situations where it is beneficial to look for batching opportunities where a consumer can get their order faster if a single Dasher is able to pick up multiple orders at the same time. We also need to look at conditions like the weather and traffic that can impact delivery times or cause Dashers to refuse orders at higher rates than we would normally expect. For example, if it’s raining and many Dashers use motorbikes we can expect fewer accepted deliveries, which can cause lateness and hurt our ability to complete our goals. 

Tackling the dispatch problem

Taking on such a complex problem was a two-stage process. First, we built a sophisticated dispatch service that utilizes a number of ML and optimization models to understand the state of the marketplace and make the best possible offers to Dashers to meet the needs of our marketplace. The second stage was to build simulation and experimentation platforms that would allow us to make continual improvements to our dispatch service. Both of these methods help us achieve our goals and continue to get 1% better every day. In the following sections, we will go through our dispatch system’s architecture and how it handles a sample delivery. Then we will describe how we leverage our simulation and experimentation platforms to improve our decisions. 

Building DeepRed: our dispatch service 

At a high level, the dispatch engine is built on two sets of mathematical models. The first set of models are ML models that are trained to estimate how an order will unfold if we offer it to a particular Dasher. These models are focused on making predictions about each individual order, store, and Dasher.


Once the estimates are made they are fed into our second modeling layer, our mixed-integer optimization model. The optimization model makes the final recommendations about which orders to offer to which Dashers. Whereas the ML layers are focused on making individual estimates for each order, the optimization layer is focused on making system-wide decisions for the whole marketplace.

Taken together, the ML models and the optimization layer distill millions of data points from our marketplace into a set of dispatch decisions that ensure each order is offered to the Dasher who can deliver it from store to consumer as efficiently as possible.

An order’s journey through dispatch

The best way to understand how we solve the dispatch problem is to consider how an individual order works its way through DeepRed’s complex system. We’ll look at how an order passes through the multiple layers of DeepRed starting with our offer candidate generator, proceeding to our ML layer that estimates how those offers could play out in the real world, and then passing through our optimization layer that makes final recommendations.

Figure 2: Orders make their way through DeepRed’s three layers: offer generation, the ML layer, and our optimization model.

Constructing potential offers

When a new order arrives to our dispatch engine, we first update our understanding of the current state of the marketplace and how this order interacts with the Dashers and other orders. We are looking to find which Dashers are nearby and available to pick up the new order. These Dashers could be waiting for their next order, in which case we can offer them a new order right now, or they could be finishing up another order, in which case we can plan to offer them a new order as soon as they complete their current delivery.

Our focus at this stage is not limited to which Dashers are available: we also look at what other orders are waiting to be picked up. If there is another order being picked up at the same store or on the same block as our order, it may make sense to offer both orders to the same Dasher. The same can be true if there is another delivery that needs to be dropped off near where our order needs to be delivered.

By looking at the available Dashers and other orders, we are able to construct potential offers for our new order: a set of Dashers that this order could be offered to and possibly other orders that could be picked up by the same Dasher. These potential offers then get sent to the ML layer where we predict what might happen to these offers in the real world.

Predicting how an order will play out in the ML layer

With a set of potential offers in hand, we are ready to make some estimates using our ML models, including, but not limited to: order ready times, travel times, and offer acceptance likelihood.

  1. The first question we want to answer using ML is when an order will be ready for pickup (order ready time). We estimate the order ready times based on a prep time model. A previous article, Solving for Unobserved Data in a Regression Model Using a Simple Data Adjustment, described how we estimate an order’s prep time and how we overcame the challenges of making a prediction with censored data.
  2. The second set of questions our ML layer helps answer has to do with estimating travel times, the amount of time it will take a Dasher to travel to the store and then to deliver the order to the customer’s desired location. In addition to these travel times, there are multiple aspects of the Dasher journey that we model separately: how long it will take the Dasher to find parking at the merchant and consumer locations, how long it will take them to manage the logistics of picking up the order, and how long it will take them to return to their vehicle. Each of these steps of the order delivery journey requires a separate model. Some are based on tree models built on our Sibyl ML platform. Other estimates are based on simpler naive models. For example, we can estimate parking time at a particular store by using the average amount of time it took Dashers to find parking at that store over the past month (a rough sketch of this naive estimate follows this list).
  3. A third and final question for our ML layer to answer is the likelihood each Dasher will accept the order if we offer it to them. Because Dashers have the freedom to accept or decline each offer, we work to anticipate the types of offers that are more likely to be accepted and present them to the most relevant Dasher. It is important to make sure every order is still delivered on time even if one or more Dashers turns down the offer before we find one who will accept it.
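As referenced in the second item above, the naive baseline for parking time can be as simple as a rolling per-store average. Here is a minimal sketch, with hypothetical field names:

from datetime import datetime, timedelta, timezone

def naive_parking_estimate(parking_events, store_id, lookback_days=30, now=None):
    """Estimate parking time at a store as the average observed parking
    duration (in seconds) at that store over the past month."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=lookback_days)
    samples = [
        event["duration_seconds"]
        for event in parking_events
        if event["store_id"] == store_id and event["observed_at"] >= cutoff
    ]
    return sum(samples) / len(samples) if samples else None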

Once our new order passes through each of the three sets of models in our ML layer – order ready time, travel times, and acceptance rate – we now have a much better understanding of what we need to do to deliver the order as efficiently as possible. The rest of the work of making the final dispatch decision is up to our optimization layer.

Making final offers in the optimization layer

The optimization layer is our new order’s last stop in its journey before being dispatched to a Dasher. The optimization model does the work of scoring and ranking potential offers, making decisions about batching orders, and strategically delaying dispatches when necessary. 

Upon arriving in the optimization layer, our new order’s potential offers are scored and ranked to allow the mixed-integer program (MIP) to make its decisions, a process that we describe in a previous article, titled Next-Generation Optimization for Dasher Dispatch at DoorDash. Our scoring function is designed to recognize tradeoffs between efficiency (using Dasher time as efficiently as possible) and quality (getting deliveries to consumers as quickly as possible), while trying to account for explained and unexplained variance in our ML estimates of order ready times, travel times, and Dasher acceptance rate. Once every offer is scored, we solve the MIP using Gurobi, a software-based commercial MIP solver that is able to solve this type of problem at scale very efficiently.
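To make the optimization step concrete, here is a heavily simplified sketch, not DeepRed’s actual formulation, of an assignment-style MIP written with Gurobi’s Python API: given a score for each candidate (Dasher, order) pair, pick the offers that maximize total score while assigning each order to at most one Dasher and each Dasher at most one offer.

import gurobipy as gp
from gurobipy import GRB

def choose_offers(scores):
    """scores: dict mapping (dasher_id, order_id) -> offer score."""
    model = gp.Model("dispatch_sketch")
    assign = model.addVars(list(scores.keys()), vtype=GRB.BINARY, name="assign")
    dashers = {dasher for dasher, _ in scores}
    orders = {order for _, order in scores}
    # Each order is offered to at most one Dasher in this round.
    model.addConstrs((assign.sum("*", order) <= 1 for order in orders), name="per_order")
    # Each Dasher receives at most one offer in this round.
    model.addConstrs((assign.sum(dasher, "*") <= 1 for dasher in dashers), name="per_dasher")
    model.setObjective(gp.quicksum(scores[pair] * assign[pair] for pair in scores), GRB.MAXIMIZE)
    model.optimize()
    return [pair for pair in scores if assign[pair].X > 0.5]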

In addition to scoring and ranking individual orders, the optimization model also considers which orders should be batched, i.e. served by the same single Dasher, to gain efficiency. Batching works particularly well when orders will be picked up from the same merchant, which reduces the number of pickup transactions, or multiple nearby merchants, which reduces the number of parking instances. The time it takes to prepare an order is actually helpful and can afford us extra time to pick up another order along the way, rather than simply delaying an offer to a Dasher. In some cases, batching leads to improvements in both efficient utilization of Dasher time, and faster order delivery — especially when we don’t have a lot of Dashers available in that area. In other cases, we may be slightly delaying delivery time to help our Dashers get more earning opportunities from serving multiple simultaneous orders without violating our on-time promise to our customers. By making adjustments to the scoring function to accommodate batched orders and compare them to single orders, we allow the solver to make these tradeoffs efficiently for us.

An important final step before issuing final offer recommendations is to decide whether we choose the best available Dasher right now or delay the dispatch to choose a more optimal Dasher who might become available in the near future. Another reason to delay dispatch is if we want to wait until closer to the order ready time so that the Dasher will not have to wait too long at the store. Using our dynamic dispatch engine, the optimization model is able to make the best tradeoff between dispatching right away or waiting.

After we score and rank offers, consider the best batches, and decide whether to delay dispatch, our order eventually completes its journey through dispatch. The order is offered to the Dasher we have chosen, and we wait to see if they will accept or decline the offer. If necessary, we will find another Dasher to offer the order to, until the order is picked up at the store and delivered to the happy customer. Our order has been dispatched!

Managing interactions between the ML and optimization layers

In our sample journey through dispatch, our new order passes through the ML layer and then the optimization layer in sequence before a Dasher is dispatched to pick up the order. But in reality there is a lot of complex interaction between our ML and optimization models. These interactions pose three important challenges that we addressed when building DeepRed.

  • Dealing with garbage in garbage out 
  • Avoiding overfitting 
  • Handling cascading variability 

Let’s go through these challenges one by one. 

  • The first problem could be termed garbage in, garbage out. In a previous article, Maintaining Machine Learning Model Accuracy Through Monitoring, we described how we monitor for model drift in our ML models. From DeepRed’s perspective, if our prep time or travel time predictions start to drift over time, the quality of our optimization will decline as well. For this reason, we continuously retrain our ML models and use rolling historical and real-time features that make sure the inputs to our models stay fresh.
  • The second challenge is the risk of overfitting parameters in our optimization model. Overfitting occurs when parameters are tuned precisely to conditions observed in the past, but result in suboptimal performance when given new inputs that may not match historic conditions. Even if our ML models are trained using regularization to avoid overfitting to the training data, we can still risk overfitting parameters in our optimization model if we tune our parameters naively based on empirical feedback to optimize short-term performance. We can fall into a local optimum trap where any improvements to the accuracy of our ML models fail to improve the global optimization output. To combat overfitting, we have implemented Bayesian optimization techniques to tune parameters in a robust, adaptive way.
  • The third challenge is cascading variability. Each of our ML models contributes variance to our model of what will happen upon offering a particular set of orders to a particular Dasher. This variability accumulates as a route becomes more complex or longer, for instance, via batching. We designed our scoring function to account for the added variance from more complex routes by adding a penalty term that scales with each of these forms of complexity and discourages DeepRed from making offers with high variability. 

By understanding and addressing the challenges posed by interactions between the ML and optimization layers, we designed DeepRed to be as robust as possible.

How we improve the dispatch service

Making improvements to the models that power dispatch is challenging both because our dispatch decisions are critical to DoorDash’s business and because any product changes can have complex interactions and downstream impacts throughout DeepRed. We employ two modeling approaches to mitigate this challenge: experimentation and simulation. Offline simulation helps us do early-stage exploration of product ideas to evaluate and anticipate their system-wide impacts before undertaking a large implementation effort. Rigorous experimentation helps us measure the holistic impact of all changes — minor or substantial — to the underlying ML or optimization models within dispatch. In this section, we’ll describe how we leverage our experimentation and simulation platforms to drive continuous improvements within DeepRed.

Figure 3: The lifecycle of a DeepRed product change to one of our ML or optimization models starts with simulation as a sandbox for ideas, and ends with experimentation in our marketplace.

Simulating real-world marketplace conditions

Simulation is a tool we can use to accelerate the pace and scale of innovation by prototyping new concepts and understanding the potential impacts of product changes. There are some questions that are difficult to answer via experiment, such as the impact of product changes that require substantial engineering work across the platform, or how DeepRed would perform under marketplace supply and demand conditions that we have not observed historically in the real world.

Running simulations allows us to create counterfactual worlds where we can estimate how novel ideas or different environmental conditions would impact core performance metrics, without the downside risk of degrading customer experience or business metrics. The simulator we are building can imitate consumer, Dasher, and merchant behaviors interacting with the current or test dispatch system. Simulations give us insight about how the dispatch models perform in different operating conditions, including how efficiently we could handle high demand, low Dasher availability, and other likely future scenarios that we don’t observe today.

Testing product changes in production

Our experimentation platform helps us measure the actual performance of changes to our ML and optimization models based on key business metrics. We use experimentation methods to scientifically measure whether our improvements are actually moving the needle. With experimentation, there are two ways to ensure we get accurate measurements of the impact of product changes: thoughtful design and rigorous analysis.

  • The first approach is to be thoughtful about how we design our experiments. In a previous article, titled Switchback Tests and Randomized Experimentation Under Network Effects at DoorDash, we describe the challenge of experimentation in DoorDash’s marketplace setting and how we overcome network effects using switchback experimentation designs. In addition to network effects, we also have to worry about interaction effects between experiments. Because dispatch is so central to everything we do at DoorDash and the system is complex, we have a sizable team of data scientists and engineers running many experiments at the same time. With a weekly iteration cycle and lots of interacting experiments, we have sought ways to reduce the confidence intervals on our top-line metrics. One way we tackle this is by randomly splitting up our geographies and time periods each week into two separate groups. We use the first group to run a large number of exploratory experiments. In this group, we get a noisy initial measurement for these experiments. Experiments that show promising results from the first group can advance the following week to the second group where we run a select number of experiments (no more than three) to get a less noisy final measurement.
  • The second approach to accurate measurement is to be careful about how we analyze our experiments. Specifically, it is important to find ways to reduce variance in our estimators for efficiency and quality metrics. Some of our variance reduction methods, such as cluster robust standard error, have been discussed in a previous article. We also use post-hoc regression methods and analysis of interaction effects.
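As a generic illustration of the cluster-robust analysis mentioned in the last bullet (a minimal sketch, not our actual analysis pipeline), regressing a metric on a treatment indicator and clustering standard errors on the randomization unit might look like:

import numpy as np
import statsmodels.api as sm

def treatment_effect_with_clustered_se(metric, treated, cluster_ids):
    """Estimate a treatment effect with standard errors clustered on the
    switchback randomization unit (e.g., a region-time window)."""
    y = np.asarray(metric, dtype=float)
    X = sm.add_constant(np.asarray(treated, dtype=float))
    result = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": np.asarray(cluster_ids)})
    return result.params[1], result.bse[1]  # effect estimate and clustered SE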

Building the right team

Building the right team is critical to tackle the complex set of challenges that the dispatch problem presents. To develop models in ML, optimization, experimentation, and simulation requires a diverse set of skills in data science. We have built and grown a dispatch data science team that is diversified across disciplines (OR, ML, causal inference, statistics) and industry experience (ridesharing, gig economy, Google).

Conclusion

The goal of dispatch at DoorDash is to find the right Dasher to deliver each order from the merchant to the customer. The dispatch decisions we make define the experience our Dashers, customers, and merchants will have, and the efficiency with which our marketplace operates. To solve the dispatch problem, we used ML and optimization to build our dispatch engine DeepRed. We leaned on experimentation and simulation to make continuous improvements to DeepRed to keep things running as smoothly as possible. These efforts wouldn’t be possible without the diverse and talented Data Science team we’ve built to tackle this exciting set of problems.

DoorDash uses machine learning to determine where best to spend its advertising dollars, but a rapidly changing market combined with frequent delays in data collection hampered our optimization efforts. Our new attribution forecast model lets us predict the efficacy of ad campaigns from their initial data, helping make critical business decisions earlier.

Usually, we need to wait some time to measure ad performance due to our attribution methodology. This slow reaction time means that if one ad channel performs particularly well, we cannot move our marketing dollars to that channel quickly. 

Our new attribution forecast model predicts an ad’s final conversion volume after observing only its initial data. This allows us to utilize more recent data, optimizing our conversions per ad spend by scaling the best channels as performance changes over time.

Beyond routine optimization, this forecasting framework is especially useful during marketing experiments–where identifying the winning ad sooner accelerates impact–and could be extended to other delayed-information problems, such as ticket sales for concerts, hotel bookings, or holiday flower sales. 

Intro to attribution methodology

Before we talk about the forecast model, let’s discuss what an attribution system is and how to choose one. An attribution system helps companies measure the effectiveness of marketing campaigns. It has two key elements: allocation method and attribution window.

Reviewing allocation methods

The allocation method determines how to assign credit across marketing touchpoints when a new consumer places their first order, commonly called conversion. Allocation methods can be either single touch or multi-touch, depending on the number of channels that get credit for the conversion. See Figure 1 for an illustration.

  • Single touch assigns all conversion credit to a single channel, assuming that either the first or last touchpoint drives the conversion.
  • Multi-touch considers all touchpoints along the path to conversion. As a simple example, it might distribute credit to all touchpoints evenly.
Figure 1: The choice of allocation method affects how much credit each channel receives. In this example, the customer interacted with three marketing touchpoints before placing their first order: a DoorDash TV commercial, an ad on social media, and a paid keyword search ad.

Trade-off in attribution windows

Intuitively, the impact of marketing isn’t always immediate. After the customer in the example above saw the DoorDash commercial on TV, they might have taken days or even a few weeks before placing their first order. Thus, another key question in attribution methodology is how many days to look back when defining marketing touchpoints. This time period is the attribution window.

A long attribution window allows companies to recognize more conversions for any given ad. Figure 2, below, shows the difference in cost curve between two attribution windows. With a shorter attribution window (the seven-day attribution below in red), the new customers who converted after seven days aren’t credited to the ad, leading to a cost curve that underestimates the ad efficiency.


At the same time, conversion patterns are different across ads. For example, some ads, such as those on search channels, lead to a quick conversion, while ads on other channels, such as TV or radio, are slower. A shorter attribution window will cause us to underestimate the attribution for slower ads more, resulting in suboptimal marketing decisions.

Figure 2: Each dot represents one day’s marketing spend and the attributed new customers. When the attribution window is short, attributed conversion data is incomplete, resulting in an underestimating cost curve.

However, too long of an attribution window is also undesirable because it requires a longer wait time to fully measure an ad’s performance. In a rapidly changing market, longer wait times may result in out-of-date cost curves. If we could somehow include this data that isn’t available yet, the extra information would materially improve our cost curves and in turn marketing decisions, as shown in Figure 3:

Figure 3: With a long attribution window, the latest conversion data (red) is not available until many days after an ad is bought. The latest available data (black) is stale, yielding an out-of-date cost curve.

The problem with our legacy attribution system

Currently, DoorDash uses a several-day last-touch attribution system for all digital marketing channels, which provides a good balance between a holistic view of conversions for most ads, and reasonable wait time for fully refreshed attribution performance. 

However, an attribution window of several days still means that ads posted in the last few days are operating off of incomplete attribution data, which can’t inform marketing decisions until the window has elapsed. Given the rapid changes in the food delivery marketing landscape, having to wait before reacting to recent data isn’t ideal. We needed a way to unlock our most recent data.

Forecasting final outcomes

Before we jump into the details, let’s discuss an ideal solution. Our ad channels come in different shapes and sizes. For example, we run a small number of high-spend TV ads and a huge number of low-spend search ads (sometimes targeting individual obscure keywords, like misspellings of “DoorDash”). Ideally, our solution should handle both small and large ads.

The approach we chose was to build a forecasting model that predicts final attribution data from a limited amount of initial data.

Defining forecast accuracy

An easy way to measure the performance of our forecast model is by backtesting. Backtesting means training the model on old data and checking whether it can predict more recent data.

The main performance metric we picked is mean absolute error (MAE):

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| c_i - \hat{c}_i \right| \]

where \(c_i\) is the number of conversions attributed to ad \(i\) and the hat ^ distinguishes a prediction from the actual value. Because MAE simply takes the absolute value of forecasting errors, it isn’t biased toward larger ads (unlike root mean square error, RMSE) or smaller ones (unlike mean absolute percentage error, MAPE).

However, one pitfall of MAE is that it scales with conversion volume, which makes it harder to compare across channels or other segments, such as day of week. To facilitate comparison, we normalized MAE by conversion volume:

\[ \mathrm{Normalized\ MAE} = \frac{\mathrm{MAE}}{\tfrac{1}{n}\sum_{i=1}^{n} c_i} = \frac{\sum_{i=1}^{n} \left| c_i - \hat{c}_i \right|}{\sum_{i=1}^{n} c_i} \]
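
To make these metrics concrete, here is a minimal Python sketch (with made-up numbers) of how MAE and normalized MAE can be computed for one backtesting period; the function names are ours, not production code:

```python
def mae(actual, predicted):
    """Mean absolute error across ads."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def normalized_mae(actual, predicted):
    """MAE divided by the average conversion volume, so channels and
    segments of different sizes can be compared on the same scale."""
    avg_volume = sum(actual) / len(actual)
    return mae(actual, predicted) / avg_volume

# Hypothetical backtest: final attributed conversions per ad vs. forecasts.
actual_conversions = [120, 45, 300, 10]
forecast_conversions = [110, 50, 320, 12]

print(f"MAE: {mae(actual_conversions, forecast_conversions):.2f}")
print(f"Normalized MAE: {normalized_mae(actual_conversions, forecast_conversions):.3f}")
```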

Building the forecast model

We wanted a forecast model that updates its predictions as we collect more data. It should be able to predict the final attribution outcome, whether there are four days of observations to work with or ten. The more of this initial data the model has, the more accurate the forecast should be.

We evaluated two types of models: 

  • Simple heuristic models 
  • Machine learning models

Simple heuristic models

The simplest models we considered assume that an ad’s conversion pattern in the future will be the same as over the past N weeks. For example, suppose we want to predict the number of conversions attributed to an ad at the end of a 30-day window, \(c(30\,d)\). The prediction on day \(t\) (the \(t\)th day after the ad is posted) is

\[ \hat{c}(30\,d) = c(t) \cdot \frac{c_{\mathrm{hist}}(30\,d)}{c_{\mathrm{hist}}(t)} \]

where \(c(t)\) is the number of attributed conversions observed so far and \(c_{\mathrm{hist}}\) denotes the corresponding cumulative conversion counts from the past N weeks. This approach directly applies a historical ratio to predict final conversions from the current observation.
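
Here is a minimal sketch of this heuristic with hypothetical numbers; a real implementation would average the historical curve over the past N weeks of comparable ads and handle missing data:

```python
def predict_final_conversions(observed_so_far, t, historical_curve):
    """Project final (30-day) conversions from the conversions observed by
    day t, assuming the ad follows the same conversion curve that similar
    ads did over the past N weeks.

    historical_curve[d] = average cumulative conversions by day d,
    computed from past ads in the same channel/region (illustrative).
    """
    ratio = historical_curve[30] / historical_curve[t]
    return observed_so_far * ratio

# Hypothetical historical curve: by day 4, past ads had accumulated 60%
# of their eventual 30-day conversions.
historical_curve = {4: 60.0, 30: 100.0}

# An ad has 90 attributed conversions after 4 days of observation.
print(predict_final_conversions(observed_so_far=90, t=4,
                                historical_curve=historical_curve))  # 150.0
```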

Below are some of the parameters or variations we explored with this heuristic model. We selected the best parameterized model by backtesting, as described in the previous section.

  • Number of weeks N used to calculate historical conversion ratio. This parameter corresponds to the question of how long the conversion pattern stays the same: too long (larger N) might be slow to capture market changes, while too short (smaller N) might be noisy. We considered values from one to twelve weeks.
  • Aggregation. Related to the previous point, small ads might generate too little data to confidently calculate the historical conversion ratio. Aggregating similar (e.g., same channel, creative, or region) ads when calculating the ratio can decrease noise.
  • Seasonality adjustments. Seasonality, especially day of week, plays an important role in our new customer conversions. For example, a consumer is more likely to place their first order on a weekend night than a Tuesday night. To account for that, we could  calculate a different historical ratio for each day of the week.

Machine learning models

This forecast is a typical regression problem, so we also tested several standard machine learning regression methods, including gradient-boosted trees (LightGBM).
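
As one hedged example, a gradient-boosted tree regressor such as LightGBM can be trained to map early observations and ad features to the final attributed conversions; the features and toy data below are illustrative only, not our actual feature set:

```python
import numpy as np
import lightgbm as lgb

# Illustrative features: conversions observed so far, days since the ad was
# posted, day of week, and an integer-encoded channel identifier.
X_train = np.array([
    [90, 4, 2, 0],
    [15, 4, 5, 1],
    [300, 7, 6, 0],
    [8, 3, 1, 1],
    [120, 5, 0, 0],
    [40, 6, 3, 1],
])
# Target: final conversions at the end of the attribution window.
y_train = np.array([150, 22, 410, 14, 190, 55])

# Tiny toy dataset, so relax the leaf-size constraint; real training would
# use many weeks of historical ads.
model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1,
                          min_child_samples=1)
model.fit(X_train, y_train)

# Predict the final outcome for a new ad with 4 days of observations.
print(model.predict(np.array([[110, 4, 3, 0]])))
```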

Results

As shown in Figure 4, below, the LightGBM and simple heuristic models significantly outperform the other models.

Figure 4: Each bar represents the average normalized MAE across the five backtesting weeks. The error bar on the top shows the standard error.

However, how would this accuracy improvement translate to better marketing decisions? To better understand the true impact, we plugged the forecast model predictions back into our downstream workflow and used them to draw cost curves. Figure 5, below, shows that when these predictions are included, an example ad’s cost curve captures spend efficiency more accurately, which in turn helps us assign our marketing budget more optimally. In this case, without the forecast predictions we would underestimate the ad’s performance and mistakenly move some of its budget to other channels.

Figure 5: When only historical actuals (black dots) are used to construct a cost curve (black line), the curve poorly predicts future ad performance (blue triangles). But when forecast model output (red squares) is also included, the cost curve (red line) becomes more accurate, as in Figure 3.

As with forecast accuracy, we backtested cost curve accuracy using the same approach and the same normalized MAE metric. Currently, the cost curves are able to achieve a reasonable normalized MAE with delayed historical data, as shown in Figure 6. By plugging in the forecast predictions, the best models (simple heuristic and LightGBM) further decrease the error significantly.

Figure 6: Adding the more recent predicted attribution data from the two best forecast models (heuristic and LightGBM) significantly improves cost curve accuracy.

In conclusion, by taking advantage of recent data more quickly, the attribution forecast model significantly enhances our ability to draw cost curves and make marketing decisions. We found that the heuristic and LightGBM models performed similarly, so we chose to productionize the heuristic model because it’s simpler and more interpretable.

Summary and next steps

At this point one might ask, why does the simplest model perform best? We think there are two reasons:

  • Strong existing patterns: Consumer conversions after a marketing touchpoint usually follow a particular pattern. The majority of consumers are going to convert in the first few days, and the number gradually reduces. External factors play a relatively small role in impacting consumer behavior when the onboarding funnel is short. Therefore, a simple heuristic adequately captures the conversion flow.
  • Limited amount of data: Typically, we have only several days of observations on which to base a forecast prediction. With this small amount of data, more complex ML models don’t show many advantages. 

When patterns are less obvious, or the ads of interest span a hierarchy of different regions or countries, a simple heuristic might not perform as well and a more advanced model might be justified.

A similar methodology could be applied to attribution systems at other companies. Depending on the data available, an easy model like the simple heuristic could be a great place to start. Beyond marketing attribution, the approach also applies to other delayed-response situations, such as predicting final concert ticket volume from the first few days of sales or final hotel occupancy on a given day from early bookings. Learning the final outcome sooner enables faster reactions and better decisions.

Given the variability in the Technical Program Manager (TPM) role, it can be hard to know if a new opportunity will grow your career massively or fall short of your expectations. After years of building world-class TPM teams, previously at Uber and now at DoorDash, I’ve developed a few principles for how to build and run a high-functioning and impactful TPM organization. If you are looking for a new TPM opportunity and focused on growth, scope, and impact, you should seek a team that follows these principles.

The mission I set for my TPM team at DoorDash is to drive complex, cross-functional engineering initiatives. Their primary contributions include effective program management, lightweight and appropriate process definition, and ensuring program success metrics are understood and achieved. All of this is driven through the lens of a strong technical perspective, which allows the TPM to contribute directly to the program’s successful definition and execution by uncovering and solving for gaps rather than simply reporting progress given by engineers who are working on the program. A strong TPM will enable proactive and timely strategic decisions, clear alignment across teams and stakeholders, accountability and ownership, and successful execution of the program’s objectives. 

The TPM team has a strong leadership position in engineering

TPM should be a centralized function reporting to the head of engineering. Although TPMs represent a tiny fraction of the engineering team, they support the success of cross-functional engineering programs and deliver the most complex projects.

Impactful TPM teams have a seat at the leadership table, allowing them to give input on how engineering builds for reliability, velocity, and integrity. Their work helps engineering scale effectively and get ahead of the biggest technical and growth challenges, as well as successfully deliver the largest cross-functional initiatives. They also help the engineering team decide when to build more generally, with a goal of centralization and reuse, such as platforms that scale across multiple use cases, or when a specific use case makes the most sense. 

Great TPM teams drive alignment and process improvements that push engineering to operationally up-level over time. Additionally, given how cross-functionally TPMs operate and embed, they are a critical component to a growing engineering team’s culture, diversity and inclusion, and collaborative development.

Of course, there are downsides to a centralized TPM team, particularly in surfacing feedback from engineering and other cross-functional stakeholders about how individual TPMs perform in their roles, which requires TPM managers to regularly check in with these stakeholders. From my experience, however, there is an advantage for the TPM team and the engineering team if those downsides are managed effectively by strong TPM leadership rather than distributing TPMs to work under engineering leaders. 

I believe it is better for the TPM team, the individual TPMs, and the engineering organization if the TPM team is centralized. The advantages of a centralized TPM team include:

  • Centralized TPM teams create synergy which leads to best practices, shared learnings, and a rapid way to pinpoint answers to questions around service ownership or organizational expertise. This TPM network is particularly important in the high-growth stages of an engineering organization, where things change rapidly and are documented lightly. TPMs work most effectively amidst a support system to share ideas, learn what worked well for other teams, and combine their collective knowledge across multiple engineering and cross-functional teams.
  • A centralized reporting structure preserves the impartiality of the TPM. Although their reporting structure is centralized, the TPMs still embed with the engineering teams they work with on a particular program. This embed model allows them to work closely with engineering and gain trust with the project teams, but it also provides a safe alternative path to raise, brainstorm, and resolve more contentious technical, operational, cultural, or organizational issues.
  • Centralized TPM teams have specialized TPM leaders in manager roles who generally have many years of TPM experience themselves. These leaders understand the TPM discipline deeply and can optimally develop and coach TPMs in a way that helps them grow and develop their core competencies and the impact they deliver. Most engineering leaders don’t have this expertise, so a decentralized reporting structure is sub-optimal for TPM growth and effectiveness.
  • TPM teams remain lean through a centralized structure. An experienced TPM manager places TPMs on teams where they are needed, rather than a decentralized model where each engineering team has its own TPM. In the latter case, TPMs often end up doing less specialized work, such as sprint/scrum ownership, because the engineering leader mistakenly believes this is the best use of their time, when these tasks generally should be handled by engineering or product managers. Centralized teams of impactful TPMs contribute to an engineering culture of ownership. This model also promotes scope and growth opportunities for TPMs, making it easier to recruit and retain top talent.
  • A centralized team provides an organization-wide view, embedding TPMs where they can support a company’s top priorities, even when these priorities are shifting rapidly. At times this means moving a TPM from one embedded team’s priorities to another team or program that is a higher impact or bigger risk for the company. The flexibility of a centralized reporting structure encourages these priority discussions and changes as appropriate to accomplish the most critical projects for business and engineering success. 

The TPM team is highly leveraged

The best TPM teams are a critical part of tackling the company’s biggest problems. To ensure my teams stay highly leveraged, I’m particularly conscious of how we build the team in its early stages, recruiting a founding dozen or so team members who are top talents in their domains as well as program management experts so we can build around their leadership as they embed in each top-level engineering area. Having senior team members seed the team early on sets it up for future leverage and impact, and lets us add more junior TPMs around them more efficiently as the engineering team grows.  

Although ratios are a terrible measure for a small engineering organization, they become a reasonable way to plan hiring and headcount as engineering becomes a bigger organization. My target ratio of engineer to TPM at scale is approximately 50 to 1. I also ensure that each engineering team they embed with understands the value they bring and gives them an opportunity to be part of strategic as well as tactical execution and planning. 

Ensuring the team is highly leveraged also helps in attracting and retaining senior TPM talent. This high leverage ensures there are growth opportunities for everyone and removes the need to compete with one another for the scope and impact needed for career growth. TPMs should constantly be automating and inventing, implementing light processes for the team, or teaching someone basic program management techniques–looking for how to make themselves unnecessary so they can work on solving the organization’s new challenges. I often tell people, tongue in cheek, that if we meet our goals we will work ourselves out of a job. Of course, in a company experiencing rapid growth, the business and scale issues mean there are always new problems to tackle. But solving problems systemically is a core tenet of high-performing TPM teams.

Being highly leveraged is also an important part of the TPM team’s internal dynamic. A team that sees opportunities for growth will support its other members, share best practices and learnings, coach and mentor more junior team members, and create synergy by working together to solve bigger problems consistently across organizations. They will challenge each other to get better, which makes engineering better. This synergy is the real superpower of a centralized TPM team, and as a leader, the most important way to nurture it is to ensure there is opportunity for this kind of cross pollination. If a program management organization believes in adding a program manager to every two-pizza project team, then its leadership doesn’t follow this leverage principle, which impairs the growth and synergy created by a team that does. 

The TPM team is empowered to fiercely prioritize the work they drive

The work we decline is just as important as the initiatives we take on, so it is important to cultivate clear principles for deciding which work will produce impact and is uniquely suited to our strengths as TPMs. Much like the scoping decisions we make when building an MVP, engineering time is valuable and must be balanced against the return a feature provides. Similarly, I want to cultivate a team that is looking for the highest-value way to spend their time every day to deliver great engineering outcomes. As one of my mentors used to say, "Spend your heartbeats on what you value." For most TPMs, what we value is the impact we have on the engineering team's success.

As a TPM leader, I do my best to set each TPM up for success in the team they are embedded in. Before embedding someone in a new area, I work with the leadership to rank all their priorities in order to figure out what we should and should not support.  

Additionally, once a TPM has been working with the team and knows them and their tech stack well, they should begin to cultivate their own viewpoint. This will likely mean adding suggestions for what should be on this stack-ranked list, based on gaps they see in such things as how the team plans and delivers work, supports their partner teams, handles inbound requests, manages technical debt, and operates on call. Identifying the right priorities to improve engineering effectiveness is a big part of the value an embedded TPM can provide, over time, as they establish trust with local engineering leadership and can inform and influence how the team allocates its engineering time. 

Scaling engineering is one of the core competencies of a strong TPM organization. Often the initiatives we don’t take on are an opportunity to up-level our engineering teams by guiding them and developing their ability to own initiatives themselves, rather than project managing everything for them. Instead of taking on project management, for instance, we might help teams establish better milestones to reduce a project’s risk or suggest a tool to help track project dependencies more effectively. Teaching project management techniques and tools to our engineering managers and technical leads who own delivery of more contained projects helps us scale and up-level the engineering organization. When a team can project manage its own contained projects, we maximize the TPM team’s leverage, saving them for bigger projects that involve many engineering teams and really need a project management expert for success, rather than smaller ones that only cross one or two teams. We are also simultaneously contributing to a high-performance engineering culture with a strong ownership mentality.

TPMs should be able to focus on making the biggest contribution with their limited time. TPMs are not just part of one team; they are part of two: the engineering team they embed with and the centralized TPM team. Everyone on a team should be willing to do whatever it takes to move the team forward, including mundane work. TPMs are not above certain work. Even as a director, I have personally ordered dinner each night for a team I had holed up in a war room to solve a huge time-bound system scaling issue, because at that moment it was the most important way I could contribute. The ability to choose your focus based on what will have the most impact is something you earn with what you deliver, the value you add, the partnership you provide, and the trust you build.

The TPM team consistently drives technical clarity 

As TPMs, we embed in engineering and grow deep domain knowledge from the engineering organization we work with. We develop expertise on our tech stack, partnering closely with the tech leads and engineering managers to come up with the short-term deliverables and shape the long-term strategy and technical direction. TPMs should be able to bridge the business needs with the technical challenges to help engineering balance time-to-market with reliability and scalability. Additionally, TPMs use their deep technical knowledge to help drive technical decisions, mitigate technical risks, and propose alternatives. 

Our partnership with engineering is founded on our ability to add solution value, not just track deliverables. TPMs operate cross-functionally, driving programs that include many teams and integration points, and this breadth and scale typically increases as the TPM gains experience and seniority. In a strong engineering team like DoorDash’s, I’m not as worried about how each individual piece is designed and built because I trust our engineers and tech leads to drive towards the correct technical solution for their piece of the problem. 

A great TPM, however, ensures this technical solution is built within the context of the bigger picture, making sure the business, technical, security, and compliance needs of the overall solution are well understood by teams. By holding this broad picture, TPMs can actively ensure that all the pieces of the puzzle actually build a holistic solution once they come together, and that there are no gaps or overlaps between teams and parts of the solution. This is where TPMs really shine and where their technical skills are tested. They should be looking for these gaps or overlaps and actively solving for them in the project, as well as aligning the milestones, deliverables, and dependencies of the project so that issues are found early and solved, reducing risk to delivery dates. Technical depth also allows a TPM to ask the right questions around scale, growth, and success metrics for the solution so that we invest in an implementation that will serve our customers appropriately, particularly in a hyper-growth environment.

Technical as well as business acumen allow us to drive the appropriate trade-off discussions and ensure the big picture is kept in mind while solving the immediate problem. Having technical depth allows us to drive accountability by being part of the process end-to-end, not just tracking its progress in a spreadsheet and asking engineers for estimates and project updates. That’s when a TPM team really sizzles, bringing value to engineering by making complex problems consumable, leading to simple and effective solutions.

The TPM team tackles difficult problems in a blameless culture

TPM teams walk the nuanced tightropes of ownership and accountability. As TPMs, we drive programs that span many teams, and none of the people involved in doing the real work report to us. Much of what we do involves leading delivery of something cross-cutting and complex through influence, alignment, and providing accountability against the project plan and end goals. Leading with influence is a core competency for an effective TPM, and we should strive to create clarity and alignment by constructing a narrative for why the program matters and gaining commitment based on the strength of that narrative relative to other business or technical priorities.

At DoorDash we believe in a culture of ownership and use a DRI (directly responsible individual) model to drive that ownership with clarity. Every major initiative at DoorDash has a DRI, and sometimes, but not always, that is a TPM. 

The DRI is responsible for ensuring that the program is completed successfully. This means ensuring it has the resources needed, that issues are resolved quickly, that risks are managed effectively, and that key decisions are made in a timely manner, allowing the program to move forward. If the TPM is the DRI on the program, their role is clear–ensuring it is successfully delivered. But even if someone else is the DRI, the TPM will be a key partner in the DRI’s, and program’s, success, because the key skills of organization, definition, alignment, identifying and mitigating risks, unblocking and resolving issues, and tracking progress against project goals are all key responsibilities of the TPM. A strong TPM is not going to let a program fail because of unresolved issues, lack of alignment, or lack of resources. These are the problems we fight to solve so a program can be delivered successfully.

One of the most important things a centralized TPM team provides is neutrality. Even though the TPM may be embedded with a particular engineering team or area, the separate reporting structure means that the TPM remains a neutral party. This is again a power to be wielded with care and earned with stakeholder respect and trust, but I want my team to feel able to highlight and engage with the team and leadership to resolve weaknesses, whether technical, organizational, or cultural, which impede a program’s success. 

The team should be incentivized to operate with openness and honesty and to create clarity. They need to work with engineering to set aggressive goals and milestones, without padding, and report accurately on where the program is against those goals, particularly if there are key work streams that are yellow or red against plans. We can’t fix issues we don’t acknowledge, and neutrality ensures that there is some air cover to ask tough questions and shine a light on problems, with the end goal of working with the teams to get to a solution and successfully deliver the program together. Often, problems are exposed at the borders between engineering teams, where a cross-functional project is suffering because there is no alignment between teams on goals, responsibilities, or timelines. TPMs need to bridge these gaps between teams, and neutrality paves the path for that to happen in a blameless way.

The TPM team is recognized and valued by the organization for its ability to bring order to chaos

At DoorDash, each of the TPMs builds recognition and the team’s reputation incrementally, day by day, with what we deliver: managing projects over the finish line, ensuring each planning cycle is a little smoother than the last, running effective postmortems to avoid future outages, and driving monthly operational reviews that gradually tighten up engineering execution.

When I joined DoorDash to lead the TPM organization, I knew we had crossed the tipping point the first time a leader outside engineering asked me for a TPM to drive a critical company objective. People ask for us when something is big, complex, broad, and difficult. I might not always say yes, adhering to the principle of fierce prioritization, but I’m happy when someone sees a problem and thinks of us first. That’s when I really know we’ve made it, at least for today. We have to get up and do it all over again tomorrow, with the next hard problem. But that’s the fun of this job, if one is lucky enough to get to do it at a rapidly growing company, solving interesting problems at scale with a group of really amazing and committed engineers. I love building that TPM team and delivering that impact, together with engineering.

Final thoughts

Your role on a team and at a company is contextual, because we operate day-to-day in a specific environment, not a vacuum, and are impacted dramatically by the characteristics of that environment. The qualities of the work we do and the time we spend working are heavily impacted by the culture as well as the leadership of the company we join. As leaders, therefore, we are responsible not only for the teams we build and develop, but also for our thoughtful and intentional impact on providing an optimal environment for that team to thrive and do their best work in service to the company. 

I chose to come to DoorDash for many reasons, and one of them was certainly the culture I saw here and how that culture was developed and nurtured by its leadership team. I didn’t have to change anything culturally to be successful or to develop and build a world-class TPM team within DoorDash Engineering, and the growth and scale trajectory provided by the business creates the opportunities for scope and impact that compel great TPMs. 

In other words, the scaffolding was already in place for the team I wanted to build and be part of. This team is one where TPMs are not only valued for their basic program management skills but also for their strategic thinking, technical contributions, problem solving, positive cultural impact, and operational talents. As I continue building our TPM team at DoorDash, I strive to also build the best environment and culture for that team to be a valued partner in engineering’s success. I am incredibly fortunate to have the opportunity to build a world-class TPM team at this amazing company.

The DoorDash TPM team is growing. Check out these opportunities if you are interested in joining us!

In our previous article of this series we covered the decision we made at DoorDash to move to a microservice architecture, the technologies we chose, and how we approached the transition. As with most decisions in computer science, this one also came with its unique sets of challenges and shortcomings. This article focuses on the main pain-points we encountered in moving off of the monolith, with particular focus on the following:

  • Changes in reliability patterns 
  • Data migration 
  • Workflow visibility
  • Data access layer 
  • Overall testability
  • Deployment etiquette
  • Tech stack proliferation

Hopefully our summation of these challenges will be useful for those beginning the journey of migrating to a microservice architecture, as we give an overview of common gotchas of leaving the monolith as well as high-level solutions that we have or still are in the process of implementing to address them.

Changes in reliability patterns

When rearchitecting a system from a monolith to a service-oriented architecture (SOA), it is common to run into reliability challenges that can get out of control if not addressed early on in the process. These challenges are typically due to the intrinsic nature of working with an SOA instead of a monolith: when dealing with the monolith, working with cross-domain business logic is as easy as making a simple function call. When moving to microservices this kind of cross-domain interaction gets more complicated because of internal process communication being replaced by RPC, which can hurt latency and reliability unless the proper guardrails are put in place.

DoorDash’s monolith was a single big service backed by a single Postgres cluster as the most frequently accessed external dependency. As long as both components are healthy, the system can be considered functional for the most part. Once microservices are in the picture, however, anti-patterns, such as the following, begin appearing: 

  • Strong dependencies, where services become unable to fulfill a user request, even partially, unless one or more dependent services are also functional.
  • T0 services (Golden Workflows), a new class of services that are critical to the business logic, so much so that any downtime in any one of these services results in the application becoming unavailable.

In both the cases above, the net effect is that the platform’s SLOs, either at the service level or system wide, becomes subject to uptime compound probability: if two services are part of the Golden Workflows (like order fulfillment and the dispatch logistics service are on DoorDash’s platform) and they both have a 99.9% uptime, the uptime of the workflow itself becomes 99.8%:

\[ A_{S \wedge D} = A_S \times A_D = 0.999 \times 0.999 \approx 0.998 \]

Formula 1: The probability of two stochastically independent events happening, like the availability (A) of a service (S) and of its dependency (D), is given by the product of the probability of each one being available.

When too many services are entangled in a mesh of strong dependencies, the platform becomes a distributed monolith. A distributed monolith is strictly worse than a regular monolith in terms of reliability because failures even in seemingly unimportant flows can bring down the whole system.

In order to prevent proliferation of cases like the one described above and mitigate the negative effect of service dependencies, we introduced new access patterns for cross-service communication, such as:

  • Request fallback, where we make sure there is an alternative source of critical data in case the primary source is unavailable. When this pattern is employed, we compound the downtime probability instead of the uptime, resulting in higher availability. For example, if the main source of some particular data has 99.9% availability and the fallback also has 99.9% availability, a service’s uptime won’t be as impacted, as shown in Formula 2, below.
  • Fail open, where we partially serve a request when a dependency is unavailable by simply ignoring the failure and applying a default behavior, removing the strong dependency from our microservice. For example, if we are unable to shorten a URL we wish to send over SMS to a consumer, we use the full URL instead of failing the request. (A minimal sketch of both patterns follows Formula 2, below.)
\[ A_{D \vee D_F} = 1 - (1 - A_D)(1 - A_{D_F}) = 1 - (0.001)^2 = 0.999999 \]
\[ A_S \times A_{D \vee D_F} = 0.999 \times 0.999999 \approx 0.999 \]

Formula 2: The probability of two not mutually exclusive events happening, like either service S’s dependency D or its fallback DF being available at a given time, is given by the probability of D and DF not being down at the same time. Compounded with service S’s assumed availability (99.9%), this results in a barely impacted uptime.
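
Here is a minimal sketch of both access patterns, with hypothetical service and cache names; the production versions live in shared libraries, as described next:

```python
import logging

logger = logging.getLogger(__name__)

def get_with_fallback(fetch_primary, fetch_fallback):
    """Request fallback: try the primary source of critical data and, if it
    fails, fall back to an alternative source (e.g., a cache or replica).
    Availability then compounds as in Formula 2."""
    try:
        return fetch_primary()
    except Exception:
        logger.warning("primary source unavailable, using fallback")
        return fetch_fallback()

def get_or_default(fetch, default):
    """Fail open: if the dependency is non-critical, ignore the failure and
    apply a default behavior (e.g., send the full URL when the URL shortener
    is down)."""
    try:
        return fetch()
    except Exception:
        logger.warning("non-critical dependency unavailable, failing open")
        return default

# Hypothetical usage:
# store = get_with_fallback(lambda: store_service.get(store_id),
#                           lambda: store_cache.get(store_id))
# link = get_or_default(lambda: shortener.shorten(url), default=url)
```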

In the spirit of making these patterns widely used, we introduced libraries that take care of them automatically, based on service owners’ recommendations. We also instituted a reliability bootcamp to help new developers become aware of the most common gotchas of building microservices.

Migrating data from monolithic storage

When migrating to a microservices architecture there’s an expectation that each individual service is the owner and sole source of truth of its data. In DoorDash’s case, the monolith wasn’t simply limited to the application code, our data storage was also monolithic. For the most part, our data existed in a single Postgres cluster which was quickly reaching its limits in terms of the primary node’s vertical scalability. As we broke down the monolithic application into microservices, we also had to perform data extraction from the monolithic storage into multiple clusters.

Extracting data from a database (DB) is a well known problem in the industry, and one with no easy solution, especially when the expectation is to have zero downtime during the migration. One of the most common approaches to data extractions is using double writes and double reads, where services gradually switch their traffic from the monolithic database instance to a new one, table-by-table, as shown by Figure 1, below:

Figure 1: Double reads and double writes enable a gradual shift from one database to another by writing all the data to both databases and slowly moving off the original.

To give a high level summary of how this works, the approach consists of writing all new rows of a specific set of tables in both the new and old DB, while a backfiller job copies the old rows. Once there’s enough confidence on the new DB, read traffic is gradually shifted to the new DB until the old one is finally decommissioned.
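
To make the pattern concrete, here is a heavily simplified sketch of a double-write wrapper (hypothetical class and method names); the race conditions discussed below are exactly what such a naive version runs into:

```python
class DualWriteRepository:
    """Writes every row to both the old and the new database while the
    migration is in flight; reads are gradually shifted to the new database
    via a runtime flag."""

    def __init__(self, old_db, new_db, read_from_new=False):
        self.old_db = old_db
        self.new_db = new_db
        self.read_from_new = read_from_new

    def save(self, table, row):
        # The old DB stays the source of truth until the cutover.
        self.old_db.insert(table, row)
        try:
            self.new_db.insert(table, row)
        except Exception:
            # Mismatches are reconciled later by the backfiller job.
            pass

    def get(self, table, key):
        db = self.new_db if self.read_from_new else self.old_db
        return db.get(table, key)
```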

While this approach works in most cases, at DoorDash we found some hidden complexities in the form of race conditions that can arise depending on the specifics of the access patterns and schemas being migrated. Indeed, no configuration change can instantly propagate to all the components of a distributed system, so for brief periods of time there was no single source of truth for a given row or partition of rows. In tables exposing multiple uniqueness constraints, this can result in data inconsistencies or conflicts that often need to be manually resolved.

Moreover, this approach would require either a common data access layer to manage the double reads/writes or work done by all the service owners on a service-by-service basis, which would be costly. A common data access layer is usually present in a monolith, but depending on the order in which data is extracted relative to the application logic extraction, this might not be true. At a time when stack proliferation was a problem, as new microservices were born faster than company standards were created, we opted for a different yet successful approach: an atomic swap of a single source of truth.

This topic deserves an article of its own, which we will publish in the future; it will touch on many interesting technical aspects of database management systems in general and Postgres in particular.

Ensuring workflow visibility and observability

One of the advantages of running a monolithic application is that, more often than not, there is a common middleware layer which intercepts all calls and can enforce all sorts of common functionalities. This is very helpful because it can monitor everything from authentication to authorization, as well as metrics and logging.

When running a monolith, metrics are predictable, dashboards can be easily replicated, and minimal domain knowledge is needed to come up with a common set of relevant indicators for workflows that are useful for creating new cross-domain measurements. For example, service level indicators (SLIs) can be identified for all workflows as they are all exposing the same metrics, with the same naming and labels, allowing a more consistent definition of per-workflow SLOs.

In the microservice-extraction frenzy, it’s easy to end up in a situation where each team adopts a different tech stack and library versions, creating their own conventions around metrics. This situation results in teams developing their own separate tribal knowledge.

When that happens, it not only becomes difficult for non-domain owners to understand other domains’ metrics, but often situations arise where it’s really hard to tell which service is involved in a given workflow. This ambiguity makes it very difficult to identify superfluous strong dependencies (as defined above) until there is a full-on outage.

To solve this domain tribalism problem, it is important to make the early effort to specify an observability standard, a set of company-wide recommendations that define what needs to be measured in every single service as well as the naming and labelling. In addition to that standard, adopting transparent solutions to distributed tracing (a la OTEL) sooner rather than later saves lots of headaches in answering questions like: “Why did the increase in p99 response time of a given service result in a huge traffic drop of a seemingly unrelated service?”
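
For example, an observability standard can be codified in a shared helper so that every service emits the same metric names and labels; the snippet below uses the Prometheus Python client purely as an illustration:

```python
from prometheus_client import Counter, Histogram

# Company-wide metric names and labels, identical in every service, so
# dashboards and SLIs can be defined once and reused across domains.
REQUEST_COUNT = Counter(
    "service_request_total",
    "Total requests handled",
    ["service", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "service_request_latency_seconds",
    "Request latency in seconds",
    ["service", "endpoint"],
)

def record_request(service, endpoint, status, latency_seconds):
    """Single entry point every service uses to report a handled request."""
    REQUEST_COUNT.labels(service=service, endpoint=endpoint, status=status).inc()
    REQUEST_LATENCY.labels(service=service, endpoint=endpoint).observe(latency_seconds)
```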

As the standardization effort becomes more substantial, and new internal frameworks are born, it’s also essential to include all the above knowledge in these frameworks so that future generations of architecture can benefit from them and once again gain that centralized layer of control for endpoint observability.

Building a data access layer

Once again in this article, it seems like we are going to praise the monolith for all of its centralized components that can be tweaked by a few expert engineers in ways that benefit everyone in the company. In this case, we refer to the monolith’s data access layer, in which even the smallest change can result in tremendous benefits for the whole team. The data access layer is a component usually found in monolithic applications that intercepts all queries to external data storages, and is able to run any custom code to observe, optimize, and redirect such queries.

While it’s risky to have a single database cluster that holds all a company’s data, it is actually good to have a single codebase that handles all the storage access. This centralized access layer can be used and tweaked to obtain things like:

  • Predictable query observability (described in the previous section)
  • Automatic caching
  • Multi-tenancy
  • Automatic primary/replica routing w/ read-your-own-writes capabilities
  • Query optimization
  • Connection pooling
  • Control over suboptimal access patterns (N+1 anyone?) 
  • Control over schema migrations for online tables (a simple CONCURRENTLY keyword can make the difference between an outage and a smooth index creation)

To be completely fair, one of the advantages of moving to a microservice architecture is the ability to experiment with new database technologies that might fit a specific use case better than others. But, at the end of the day, there’s a chance that most services in an engineering organization are using homogeneous DB types. And they could really use all the good stuff mentioned above.

Moving from a centralized data access layer to a distributed system is a problem that is still very much open at DoorDash, and also widely discussed. Possible solutions involve things like building a centralized schema migration tool, gated by the Storage team, that provides linting and code reviews ensuring that migrations are safe before they run in production. In addition to that, DoorDash’s Core Platform and Storage teams have recently invested in a centralized data access layer in the form of a DB gateway, which is deployed in isolation for each DB cluster and replaces the SQL interface for microservices with an abstract API served by a gRPC gateway. Such a gateway needs many precautions, such as isolated deployments, versioning, and configuration, to make sure it doesn’t become a single point of failure. Figure 2, below, shows at a high level what this data gateway looks like.

Figure 2: The data access layer becomes the point-of-contact between services and their storage, hiding complexity such as caching or routing.
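
As a rough illustration (not the actual gRPC API), the kind of cross-cutting behavior such a gateway centralizes might look like this:

```python
class DataGateway:
    """Sits between services and their storage so that cross-cutting concerns
    (caching, primary/replica routing, connection pooling, query control)
    live in one place instead of in every service."""

    def __init__(self, primary, replica, cache, cache_ttl_seconds=60):
        self.primary = primary
        self.replica = replica
        self.cache = cache
        self.cache_ttl = cache_ttl_seconds

    def get(self, key):
        # Automatic caching in front of the read replica.
        value = self.cache.get(key)
        if value is None:
            value = self.replica.get(key)
            self.cache.set(key, value, ttl=self.cache_ttl)
        return value

    def put(self, key, value):
        # Writes always go to the primary; invalidate the cache so subsequent
        # reads don't serve stale data (a simple read-your-own-writes strategy).
        self.primary.put(key, value)
        self.cache.delete(key)
```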

Ensuring testability

Experienced engineers will feel that sense of déjà vu every time they see a staging environment falling into oblivion at a speed that is directly proportional to the number of heterogeneous services populating it. Staging environment degradation is a process at risk of accelerating during the microservice migration frenzy: this is a time when new services are extremely fluid and change continuously, especially in staging, which often doesn’t have the same SLOs that are expected from a production environment, eventually rendering it nearly unusable. Services with many dependencies compound this degradation.

In order to grow past this problem, testing needs to evolve together with architecture. Along with introducing new testing frameworks for integration testing, synthetic monitoring, and load testing, DoorDash recently began the journey of putting in place the necessary guardrails in its production environment to allow for safe testing in production. These guardrails are based on the principle of allowing our developers to experiment with new features and bug fixes in production without the risk of polluting the real traffic or, even worse, data.

This topic is a widely covered one in the industry, and going into the details of what DoorDash is building to make this happen probably deserves its own article. For now, here is a high-level overview of the main components and guardrails that make up our production testing environment:

  • Proxies that reroute test traffic from the edge into local development environments
  • Standardized test traffic definition and propagation in each hop of a request through OpenTelemetry (OTEL) baggage
  • Data at rest isolation through query-routing and query-filtering based on traffic tenancy (enforced by the data access layer discussed above)
  • Configuration isolation through namespacing of our experiments, runtime configuration, and secrets, based on traffic tenancy (enforced by common libraries)
  • Test Actors Service which provides test users to developers
  • Developer console for managing testing environments and creating new scenarios

One important objective of the testing-in-production project at DoorDash is that, once test traffic is generated, all guardrails around test data are enforced at the platform/infrastructure level without requiring any knowledge from the microservices. By ensuring that our services are tenant-agnostic, we prevent the proliferation of tenant-specific branching in our services’ codebase that we would inevitably have otherwise.
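
As an illustrative sketch (not our production code), test-traffic tenancy can be carried in OpenTelemetry baggage and enforced at the data access layer, keeping the services themselves tenant-agnostic:

```python
from opentelemetry import baggage, context

def mark_as_test_traffic():
    """Tag the current request context as test-tenant traffic. With context
    propagation configured, the tag travels with every downstream hop."""
    ctx = baggage.set_baggage("tenancy", "test")
    return context.attach(ctx)  # caller should context.detach(token) when done

def current_tenancy():
    """Read the tenancy from the active context, defaulting to production."""
    return baggage.get_baggage("tenancy") or "production"

def route_query(query, prod_db, test_db):
    """Data-at-rest isolation enforced below the services: test traffic is
    transparently routed to an isolated store."""
    db = test_db if current_tenancy() == "test" else prod_db
    return db.execute(query)
```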

Dealing with tech stack proliferation

When looking back at the challenges DoorDash faced in building the existing architecture, one simple answer would often come to mind: just build a common library.

In the monolithic world, where everything ran in a single codebase based on the Django framework, it was really easy to build new libraries for common usage as well as upgrading existing ones whenever a new feature or security improvement was released. The thought of having a single requirement file to update for all teams to benefit from was a comforting one.

As we moved to microservices, developers began experimenting with languages and solutions that felt the most appropriate for the problem at hand. In the beginning, services were born using a variety of languages, namely Python3, Kotlin, Java, and Go. On one hand, this was a really good time for the company: by getting hands-on experience with multiple languages we were able to eventually standardize on a few technologies, and started our internal standardization effort based on Kotlin. On the other hand, however, it became really difficult to share code and add new service-level functionalities. Accommodating all our different stacks now required writing libraries for multiple languages as well as relying on the deployment cadence of each service so services can pick up whichever new library version is needed. 

After that initial period of experimentation, we started building internally-supported frameworks and libraries for greenfield services, added linting to all our repositories to catch dependencies that need to be upgraded, and began the effort of reducing the number of repositories, while keeping roughly the same number of microservices (some organizations are currently testing one monorepo per org). For the most part, the Kotlin Platform team at DoorDash is responsible for leading these standardization efforts, providing developers with the basic templates and frameworks that solve some of the issues discussed in the previous sections of this article.

Defining deployment etiquette

So far we have focused on a number of challenges involved in building a microservice architecture that are mostly rooted in the same principle: the shared codebase of a monolith has some advantages that microservices risk losing. Another aspect to consider, however, is how things that are normally perceived as an advantage of moving away from a monolithic architecture can in reality hide some challenges. Case in point: the ability to freely deploy services independently of each other.

With the monolith, deployments were more predictable: a single release team handled all deployments, and a new version of the whole application was released to the public at a regular cadence. The deployment freedom that microservices come with, however, can result in proliferation of both good and bad practices, with the latter causing different kinds of outages from time to time. Moreover, unpredictable deployment times can cause delayed response from oncalls of related services, such as upstream dependencies.

Mitigating these problems and establishing proper deployment etiquette required DoorDash’s Release Team to shift focus from being the deployment gatekeepers to building deployment tools aimed at enforcing those best practices, such as raising a warning whenever a peak hour deployment is attempted, or by providing easy ways to roll back code with a simple click of a button. Moreover, global kill switches have been put in place to freeze all unapproved deployments in certain critical situations, so as to prevent unaware developers from deploying new code during, for example, an outage. Finally, pipelines have been built to implement a global changelog, which gives visibility on each and every change that happens in our production environment, from deployment to runtime configuration changes. The global changelog is a powerful resource in the event of an outage, as it gives engineers the ability to identify what caused the issue and promptly roll back.
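
As a simplified, hypothetical illustration of how tooling can encode this etiquette:

```python
from datetime import datetime, timezone

PEAK_HOURS = range(11, 14)  # e.g., lunch rush, in local time (illustrative)

def check_deployment_guardrails(kill_switch_active, now=None):
    """Return warnings/blocks to surface before a deploy proceeds."""
    now = now or datetime.now()
    issues = []
    if kill_switch_active:
        issues.append("BLOCKED: global deployment freeze is active (ongoing incident).")
    if now.hour in PEAK_HOURS:
        issues.append("WARNING: deploying during peak hours; consider waiting.")
    return issues

def record_changelog_entry(changelog, service, version, author):
    """Append to the global changelog so any production change can be traced
    and promptly rolled back during an outage."""
    changelog.append({
        "service": service,
        "version": version,
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
```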

Lessons learned from migrating off the monolith

After discussing all the pain points of leaving the monolith, one might question why we did it in the first place. Despite all the advantages a small engineering team can get from working on a monolith, the advantages that come from moving to microservices are tremendous and worth solving the pain points mentioned above. We gain the ability to independently scale and deploy single components, reduce the blast radius of bad deployments, scale as an organization, and experiment more quickly.

To benefit from a microservice architecture, an organization needs to approach the extraction carefully. After all, losing commonly shared code can lead to inconsistent behaviors across services for things like visibility and data access, deployment freedom can become detrimental without the proper guardrails in place, tech stack proliferation can grow out of control, testability can become more challenging, and bad patterns in establishing service dependencies can lead to distributed monoliths.

At DoorDash we faced all of these challenges in one form or another, and learned that investing in standardization, good practices, testability, and common data access/observability layers will result in a more reliable and maintainable microservice ecosystem in the long run.

I am pleased to announce DoorDash’s newest engineering office, located in Los Angeles, showing our commitment to the second largest city in the US, and one of the most diverse places in the world! We believe LA’s approximately 25,000 restaurants represent unlimited opportunities for DoorDash. The growth potential in LA will be matched by the size of this new office’s engineering workforce, which will support mission critical aspects of our business. We are looking for qualified engineers of all levels to join our team in LA. 

Our mission critical focus: DoorDash’s financial infrastructure 

A significant part of our platform involves ensuring reliable and secure payments. Customers should be able to pay for their deliveries in whatever form they prefer, while merchants and Dashers (our term for delivery drivers) receive their earnings as promptly as possible. Ensuring efficient, scalable, accurate, and reliable financial transactions is a mission critical part of the DoorDash platform. The LA engineering office will be at the center of that universe for DoorDash’s worldwide business. Additionally, the LA office will be responsible for expanding DoorDash’s business to exciting new verticals and products. Let’s dive a bit deeper into the teams that will comprise the DoorDash LA office. 

Money Products team 

The Money Products team has one of the biggest impacts on our customers as it focuses on making payments through the DoorDash platform as easy as possible. This team’s work affects virtually every DoorDash user and helps delight and retain customers through increased convenience. The Money Products team also makes it easier for Dashers and merchants to secure the revenue they receive when using the platform. The Money Products team designs and implements new and innovative programs that will have an immediate, real-life impact on DoorDash users by facilitating access to and payment for services. For example, with our DasherDirect prepaid debit card Dashers get paid daily, earn 2% cash back on all gas purchases, and can easily manage their money and rewards via a mobile banking app. 

Money Platform team 

While the Money Products team focuses on payment methods and user experience, the Money Platform team builds and operates a software platform to manage the daily transactions made through DoorDash. Every month, millions of people order food from DoorDash using different credit cards, resulting in billions of dollars in gross order volume. The Money Platform team ensures that each transaction executes securely, efficiently, and accurately, with money moved to the correct accounts seamlessly. This core team supports the DoorDash business model globally.  

Gifting 

An exciting aspect of working for a fast-growing company like DoorDash is the ability to work on new and innovative projects. We are excited to be building this totally new product that will reinvent sending gifts, making the process faster, cheaper, and more convenient. One of the opportunities in joining the LA office is being able to create something totally new that will delight our customers. 

By joining the LA office, you will not only join a hyper-growth company but also help set DoorDash’s LA office culture. Our team is growing quickly and we are looking for talented engineers from diverse backgrounds who have a passion for impact.

If the opportunity to build financial products and new features for DoorDash interests you, come join us today!

Check out our open roles:

Software Engineer, Product (all levels)

Engineering Manager, Gifting 

Header photo by Cedric Letsch on Unsplash.

Maintaining a pleasant online ordering experience involves ensuring that large search indexes remain effective at scale. For DoorDash this was a particular challenge as the number of stores, items, and other data increased every day. Under this load, it could take up to a week to reindex all of the changes and update our search database. 

We needed a fast way to index all of our platform’s searchable data to improve product discovery, ensuring that we offered consumers all available ordering options. In addition, this project would also increase the speed of experimentation on our platform so we could improve our search performance more quickly. 

Our solution involved building a new search indexing platform that uses incremental indexing on our data sources. We based this platform on three open source projects, Apache Kafka, Apache Flink, and Elasticsearch.

DoorDash’s problem with search indexing 

Our legacy indexing system was not reliable or extensible, and it was slow. A reliable indexing system would ensure that changes in stores and items are reflected in the search index in real time. Incremental indexing helps refresh data faster, building fresh indexes to introduce new analyzers and additional fields in shorter amounts of time, which ultimately helps improve retrieval.

Teams from new business verticals within DoorDash wanted to build their own search experience but didn’t want to reinvent the wheel when it came to indexing the search data. Therefore, we needed a plug-and-play solution to improve new search experiences without slowing down development for these business vertical teams. 

Building an event-driven pipeline for indexing documents 

We solved these problems by building a new search indexing platform that provides fast and reliable indexing to power different verticals while also improving search performance and search team productivity. It uses Kafka as a message queue and for data storage, and Flink for data transformation and sending data to Elasticsearch.

High-level Architecture

Figure 1: The data pipeline in our new search index system uses Kafka for message queuing and data storage, and Flink for ETL and syncing with Elasticsearch.

Figure 1, above, shows various components in our search index pipeline. The components are grouped into four buckets:

  • Data sources: These are the systems which own CRUD operations on the data. We call them the source of truth for the data. In our stack we utilized Postgres as the database and Snowflake as the data warehouse. 
  • Data destination: This is the data store which has been optimized for search. In our case we chose Elasticsearch.
  • Flink application: We added two custom Flink applications to our indexing pipeline, Assemblers for transforming data and Sinks for sending data to the destination storage. Assemblers are responsible for assembling all the data required in an Elasticsearch document. Sinks are responsible for shaping the documents as per the schema and writing the data to the targeted Elasticsearch cluster (a simplified sketch of a Sink appears after this list).
  • Message queue: We used Kafka as our message queue technology. The Kafka 2 component, from Figure 1, above, uses log-compacted topics that are preserved indefinitely.
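
As referenced above, here is a simplified sketch of a Sink's core responsibility, shaping documents and bulk-writing them to the destination cluster; the index name, document shape, and cluster address are illustrative, not our actual schema:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # destination cluster (illustrative)

def sink_documents(assembled_events, index_name="stores"):
    """Shape assembled events into the index schema and bulk-write them."""
    actions = (
        {
            "_index": index_name,
            "_id": event["store_id"],
            "_source": {
                "name": event["name"],
                "items": event["items"],
            },
        }
        for event in assembled_events
    )
    bulk(es, actions)
```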

Bound together, these components comprise an end-to-end data pipeline. Data changes in the data sources are propagated to the Flink applications using Kafka. The Flink applications implement business logic to curate search documents and write them to the destination. Now that we understand the high-level components, let’s go through the different indexing use cases.

Incremental indexing

The indexing pipeline processes incremental data changes from two different sources. The first one captures the data changes as they happen in real time. Typically, these events are generated when human operators make ad hoc changes to stores or items. The second one is ETL data changes. Our machine learning models generate ETL data in a data warehouse. The indexing pipeline handles events from these two data sources differently.

Indexing change data capture (CDC) events 

DoorDash’s data about merchants gets created and updated continuously, and needs to be addressed by our index pipeline solution. For example, these updates can be anything from merchant operators adding tags to a store to updating menus. We need to reflect these changes on the consumer experience as quickly as possible or consumers will see stale data in the application. These updates to the platform are saved in data stores such as Postgres and Apache Cassandra. Iterative workflows also crunch the data in the data warehouse with daily cadence, powering things such as business intelligence applications.

To reliably capture these update events from a service’s database, we explored enabling change data capture (CDC) for Aurora/Postgres using the Debezium connector, a Red Hat-developed open source project for capturing row-level changes. The initial performance testing carried out by the storage team suggested that this strategy had too much overhead and was not performant, especially when the service uses the same database to serve online traffic. Therefore, we implemented save hooks in the application code paths that handle data update requests; these hooks propagate change events through Kafka whenever the underlying data store changes. We call this approach Application Level CDC.
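As a rough illustration of this pattern, the sketch below shows what an application-level save hook might look like. The class and interface names (StoreService, StoreRepository, ChangeEventPublisher) and the topic name are hypothetical stand-ins rather than our actual implementation, and the event carries only the entity ID, for reasons discussed below.

```java
// Hypothetical sketch of an Application Level CDC save hook.
public class StoreService {

    /** Stand-in for the persistence layer wrapping the service's Postgres writes. */
    public interface StoreRepository {
        void save(long storeId, String payload);
    }

    /** Stand-in for a thin wrapper around a Kafka producer. */
    public interface ChangeEventPublisher {
        void publish(String topic, String message);
    }

    private final StoreRepository repository;
    private final ChangeEventPublisher publisher;

    public StoreService(StoreRepository repository, ChangeEventPublisher publisher) {
        this.repository = repository;
        this.publisher = publisher;
    }

    public void updateStore(long storeId, String payload) {
        // Write to the source of truth first.
        repository.save(storeId, payload);

        // Save hook: notify the indexing pipeline that this store changed.
        // Only the entity ID is sent; the reasoning is explained below.
        publisher.publish("store-change-events", "{\"store_id\": " + storeId + "}");
    }
}
```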

With Application Level CDC, we could run into consistency issues. A distributed application has multiple instances, and two separate update calls may be served by two different instances. Including the updated values in the Kafka messages would not guarantee consistency, because multiple instances of the application can push events that update the same value, and consumers have no reliable way to tell which event reflects the latest state.

For example, if Application instance #1 sends an event, {store_id: 10, is_active=true}, and Application instance #2 sends an event, {store_id: 10, is_active=false}, there would be conflicts on the consumer side.

To ensure consistency, we send only the changed entity IDs in the Kafka events. Upon receiving an event, our Assembler app calls REST APIs on the owning application to gather the current information about the entities referenced in the event. The REST API calls ensure data consistency for the entity. The Assembler amalgamates this information into an event which it pushes to Kafka for the Sink app to consume. The Assembler implements a windowed dedupe, which prevents calling the REST APIs for the same entity multiple times within a specified amount of time. It also aggregates events in order to call the REST endpoints in bulk. For example, over a period of 10 seconds, it aggregates the item updates for a store and makes a single REST call for that store covering all of the deduped and aggregated items.
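To make the windowed dedupe and aggregation step concrete, here is a minimal sketch using Flink’s DataStream API that collects a deduplicated set of item IDs per store over a 10-second window. The event type and field names are illustrative rather than our production code; in practice the Assembler then hydrates each store’s batch with bulk REST calls downstream.

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.util.HashSet;
import java.util.Set;

public class WindowedDedupe {

    /** Illustrative CDC event carrying only IDs, never the mutated values. */
    public static class ItemChangeEvent {
        public long storeId;
        public long itemId;
    }

    /** Returns one deduplicated set of item IDs per store per 10-second window. */
    public static DataStream<Set<Long>> dedupedItemIdsPerStore(DataStream<ItemChangeEvent> events) {
        return events
            .keyBy(e -> e.storeId)
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
            .aggregate(new AggregateFunction<ItemChangeEvent, Set<Long>, Set<Long>>() {
                @Override
                public Set<Long> createAccumulator() {
                    return new HashSet<>();
                }

                @Override
                public Set<Long> add(ItemChangeEvent event, Set<Long> acc) {
                    acc.add(event.itemId); // the Set provides the dedupe
                    return acc;
                }

                @Override
                public Set<Long> getResult(Set<Long> acc) {
                    return acc;
                }

                @Override
                public Set<Long> merge(Set<Long> a, Set<Long> b) {
                    a.addAll(b);
                    return a;
                }
            });
        // A downstream operator can hydrate each emitted batch with a single bulk REST call.
    }
}
```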

To summarize, we use the Application Level CDC to capture data change events. We resolve consistency issues with simplified events and REST APIs. We use dedupe and window functions to optimize the event processing. 

Indexing ETL data

Many properties of the store and item documents that are critical to our retrieval process, such as scores and tags generated by ML models, are updated in bulk once a day. This data is either model generated, as when an ML model runs on the freshest data, or manually curated, as when our human operators tag items with “chicken” for a particular store. This data gets populated into tables in our data warehouse after a nightly run of the respective ETL jobs. 

Before our new search index platform, we did not have a reliable way of uploading data onto our index, instead using slow and imprecise workarounds. We wanted to improve our existing pipeline by giving our new search index platform the mechanism to reliably ingest ETL data into our index within 24 hours. 

The CDC patterns for the ETL use case are very different from the incremental update case described in the previous section. In the case of incremental updating, the merchant data stores are constantly being updated, resulting in a continuous stream of updates over the course of the day. On the other hand, for the ETL use case, the updates occur all at once when the ETL runs, with no other updates until the next run.

We decided not to use a variant of the Application Level CDC for the ETL sources because we would see large spikes in updates every time the ETL ran, and this spike could overly stress our systems and degrade performance. Instead, we wanted a mechanism to spread out the ETL ingestion over an interval so that systems don’t get overwhelmed.

As a way forward, we developed a custom Flink source function which periodically streams all the rows from an ETL table to Kafka in batches, where the batch size is chosen to ensure that the downstream systems do not get overwhelmed. 
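The sketch below shows roughly what such a source function could look like. The class name, the paging strategy, and the fetchBatch placeholder are assumptions made for illustration; the essential idea is emitting rows in bounded batches with a pause between them.

```java
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Collections;
import java.util.List;

/** Illustrative source that streams an ETL table to downstream operators in paced batches. */
public class EtlTableSource extends RichSourceFunction<String> {

    private final int batchSize;
    private final long pauseBetweenBatchesMs;
    private volatile boolean running = true;

    public EtlTableSource(int batchSize, long pauseBetweenBatchesMs) {
        this.batchSize = batchSize;
        this.pauseBetweenBatchesMs = pauseBetweenBatchesMs;
    }

    @Override
    public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
        long offset = 0;
        while (running) {
            List<String> rows = fetchBatch(offset, batchSize);
            if (rows.isEmpty()) {
                break; // every row for this ETL run has been streamed
            }
            synchronized (ctx.getCheckpointLock()) {
                for (String row : rows) {
                    ctx.collect(row); // a downstream operator forwards these to Kafka
                }
            }
            offset += rows.size();
            Thread.sleep(pauseBetweenBatchesMs); // pace the load so downstream systems are not overwhelmed
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    /** Placeholder for a paged query against the ETL table via JDBC or a warehouse client. */
    private List<String> fetchBatch(long offset, int limit) {
        return Collections.emptyList();
    }
}
```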

Sending documents to Elasticsearch

Once the Assembler applications publish data to destination topics, we have a consumer that reads the hydrated messages, transforms the messages according to the specific index schema, and sends them to their appropriate index. This process requires management of the schema, index, and cluster. We maintain a unique Kafka consumer group per Elasticsearch index so that consumers can maintain offsets for each index. To transform messages, we use DocumentProcessors, which take in a hydrated event from the destination topic and output formatted documents that are ready to be indexed. 
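Conceptually, a document processor can be thought of as the small interface sketched below; the name and shape are inferred from the description above rather than copied from our codebase.

```java
import java.util.List;

/**
 * Hypothetical shape of a document processor: it takes one hydrated event from
 * the destination topic and shapes it into documents matching the index schema.
 */
public interface DocumentProcessor<E, D> {
    List<D> process(E hydratedEvent);
}
```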

The Sink process utilizes the Flink Elasticsearch connector to write JSON documents to Elasticsearch. Out of the box, it has rate limiting and throttling capabilities, essential for protecting Elasticsearch clusters when the system is under heavy write load. The process also supports bulk indexing, where we gather all documents and the relevant operations over a time window and perform bulk requests. Any failure to index a document results in the document being logged and stored in a dead-letter queue that can be processed later.
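As an illustration, the sketch below wires a stream of JSON documents into Elasticsearch using the connector’s builder, which is also where bulk flushing and backoff behavior are configured. The host, index name, and the use of plain JSON strings as the document type are placeholders, and the exact builder options should be checked against the connector version in use.

```java
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.Collections;
import java.util.List;

public class StoreDocumentSink {

    /** Attaches an Elasticsearch sink to a stream of already formatted JSON documents. */
    public static void attach(DataStream<String> jsonDocuments) {
        List<HttpHost> hosts =
            Collections.singletonList(new HttpHost("es-cluster.internal", 9200, "http")); // placeholder host

        ElasticsearchSink.Builder<String> builder = new ElasticsearchSink.Builder<>(
            hosts,
            new ElasticsearchSinkFunction<String>() {
                @Override
                public void process(String doc, RuntimeContext ctx, RequestIndexer indexer) {
                    indexer.add(Requests.indexRequest()
                        .index("stores") // placeholder index name
                        .source(doc, XContentType.JSON));
                }
            });

        // Flush a bulk request once 500 documents accumulate; interval- and
        // size-based flushing plus retry backoff are configured on the same builder.
        builder.setBulkFlushMaxActions(500);

        jsonDocuments.addSink(builder.build());
    }
}
```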

Backfilling a new index quickly

Oftentimes, we might want to add a new property to our index, such as the market ID associated with a store or item, because it helps us with sharding. Likewise, we may need to rapidly recreate a new index, such as when we want to try out different index structures to run efficiency benchmarks. 

In the legacy system we relied on a slow and unreliable job that typically took a month to reindex all the store and item documents. Given the long indexing duration, it was difficult to properly estimate the error rate associated with the reindexing process. Thus, we were never certain of the indexing quality. We often got complaints about mismatches in store details between the index and the source of truth, which had to be fixed manually.

With our new search index platform, we wanted a process to rapidly recreate a new index or backfill a property in an existing index within 24 hours. For the process of bootstrapping, we needed a mechanism to rapidly recreate all the documents which needed to be indexed in Elasticsearch. This process involves two steps: 

  1. Streaming all entity IDs corresponding to the documents which need to be indexed in Elasticsearch 
  2. Mapping the entity IDs to their final form by making external calls before they are sent downstream for indexing. 

The pipeline for mapping the entity ID to the final form of the entity had already been established as part of our work on the online assembler, mentioned above. Therefore, all that was needed was to stream all the document IDs which needed to be indexed in Elasticsearch. Accordingly, we maintain an up-to-date copy of all the entity IDs which need to be indexed in bootstrap tables in our data warehouse. When we need to bootstrap, we use the source function described in the ETL section to stream all the rows from these bootstrap tables to Kafka. We encapsulate the logic to perform the above two steps in a single job.
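Putting the two steps together, the single bootstrap job can be sketched roughly as follows, reusing the illustrative EtlTableSource from the ETL section; the hydrate method is a placeholder for the external calls the online assembler makes to build the final document.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BootstrapJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: stream every entity ID from the bootstrap table in paced batches.
        DataStream<String> entityIds = env.addSource(new EtlTableSource(1000, 500L));

        // Step 2: map each ID to its final document form before sending it downstream.
        DataStream<String> documents = entityIds.map(BootstrapJob::hydrate);

        documents.print(); // stand-in for the Kafka sink feeding the destination topic

        env.execute("search-index-bootstrap");
    }

    /** Placeholder: real code calls REST APIs to assemble the full document for this ID. */
    private static String hydrate(String entityId) {
        return "{\"id\": \"" + entityId + "\"}";
    }
}
```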

If we run our incremental indexing pipeline at the same time as our bootstrapping pipeline, we run the risk of getting stale data in Elasticsearch. To avoid these issues, we scale down our incremental indexer every time the bootstrap is being run, and scale it back up once the bootstrap is complete.

Putting it all together, the steps we take to backfill and recreate the index are as follows:

  • Create the index and update its properties as needed, and update the business logic and configurations in the assembler and the sink to populate the new property.
  • Scale down the online assembler. 
  • Scale up the bootstrap job.
  • Once the bootstrap is complete, scale down the bootstrap job and scale up the online assembler. Once the online assembler’s consumer offset catches up to recent events, the backfill process is complete.

Enabling a forced reindexing function 

From time to time, some of our documents in Elasticsearch might have stale data, possibly because some events from upstream didn’t get delivered, or one of our downstream services took too long to respond. In such cases, we can force a reindex of any documents in question. 

To accomplish this task, we send a message with the ID of the entity to be indexed into the topic which the online assembler consumes data from. Once the message is consumed, our indexing pipeline described above gets kicked off, and each document is reindexed in Elasticsearch.

We annotate the messages sent in these one-off indexing tasks with unique tags, which give us a detailed trace of each document as it passes through the various stages of the indexing flow. In addition to guaranteeing that the document did indeed get indexed, the trace provides a wealth of debugging information that helps us validate the flow and uncover any bugs that might have prevented the document from being indexed in the first place.
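A minimal sketch of such a forced-reindex trigger is shown below: it publishes the entity ID into the topic the online assembler consumes and attaches a unique trace tag as a Kafka record header. The topic name, payload shape, and header key are illustrative rather than our actual values.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.UUID;

public class ForceReindex {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.internal:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String storeId = args[0];

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>(
                "store-change-events", storeId, "{\"store_id\": " + storeId + "}");

            // A unique tag lets us trace this document through every stage of the indexing flow.
            record.headers().add("reindex-trace-id",
                UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));

            producer.send(record);
        }
    }
}
```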

Results

Our new search indexing platform is more reliable. Faster incremental indexing means data is refreshed sooner and appears more promptly in our consumer applications. Faster reindexing enables fresh indexes to be built in a short amount of time to improve our retrieval: 

  • Reduced the time for backfilling our entire catalog of stores from one week to 6.5 hours
  • Reduced the time for backfilling our entire catalog of items from two weeks to 6.5 hours
  • Reduced the time to reindex existing stores and items on the platform from one week to 2 hours

Conclusion

Data lives at the heart of any organization. Moving data seamlessly and reshaping it for different use cases is an essential operation in our microservice architecture. This new search index platform lets other teams at DoorDash design search experiences for specific business lines without having to build a whole new search index architecture. Because the platform relies on open source tools, there is a wealth of accessible documentation online, and engineers who join our team are likely to already have expertise with these technologies. 

Generally, this kind of solution applies to any company with a large, growing online catalog that is focused on making changes to its search experience. By taking a similar approach to the one described above, teams can cut down on reindexing time, allowing faster iterations and fewer manual interventions while improving the accuracy of their index. Our approach is particularly beneficial to companies that have a rapidly growing catalog and multiple manual operators making changes that need to be reflected in the index.  

Header photo by Jan Antonin Kolar on Unsplash.