
Focusing on delivery allowed DoorDash to build a food search engine, but expanding beyond food with more SKUs and merchants requires a substantial upgrade. As DoorDash expands into new verticals such as groceries, convenience, alcohol, and pet food, new challenges arise that our food-driven legacy search wasn’t built to handle. 

As our business has grown, our data layer has become more granular, requiring tracking of discrete SKUs for every item offered rather than the nested menus we started with in food delivery. With our expansion into the convenience and grocery space, we need to index new verticals in our search database.

Simultaneously, our system needed to accommodate entirely new product domains, which required us to bootstrap data labels for domains such as grocery, alcohol, and pet food. The vast number of new SKUs creates scaling challenges for our data labeling. As query and document understanding expands, the relevance ranking of search results also needs to improve substantially. These new challenges required us to adopt machine learning-based approaches.

To solve these challenges, we needed to update our search infrastructure to handle the additional scale as well as to rank new search results appropriately with novel improvements including learning-to-rank and query-and-document-understanding features.

We rebuilt our search infrastructure to support this new scale in search, including the indexing, retrieval, and storage engine. We significantly shortened the time to refresh the search index with a technique we call active/nextgen indexing. We also rebuilt search to conduct federated search to intermix search results across such verticals as food, groceries, and alcohol. Finally, we demonstrated the value of new search storage engines for future search infrastructure improvements at DoorDash.

To extend query and document understanding beyond food, we worked on new taxonomies that help us understand new types of data in the search corpus in a sensible, structured way. We then shared our experience with machine learning labeling services that use human annotation to show how we bootstrapped our new taxonomies. Finally, we discussed how natural language processing can contribute to future search query understanding improvements at DoorDash.

We built learning-to-rank systems to make our search results more relevant and expanded our backend systems to define search evaluation and establish a framework to assess whether a result is “relevant” to the query. We showed how we successfully applied pointwise ranking and personalization to make search results relevant. To go deeper, such as moving beyond pointwise ranking to new models like pairwise ranking, our DSML teams plan to publish future blog posts dedicated to machine learning applications in search.

Rebuilding search infrastructure for new challenges

We needed to rebuild our search infrastructure from scratch to achieve our central mission of updating the search experience to handle scale and new verticals. This involved rebuilding the indexing, retrieval, and storage layers, as described in the following sections.

Implementing active/nextgen indexing

The first goal for our new search infra was to improve indexing speeds. Because there is a never-ending flow of new SKUs and merchants constantly update their information, indexing speed determines whether customers can get fresh and consistent search results. Earlier last year, we published a blog on search indexing in which we illustrated an architecture that allowed us to build search indexes faster, as shown here in Figure 1.

Figure 1: Previous indexing optimized for food search

Building on top of the existing system, the search platform team significantly enhanced the architecture of the next-gen indexing system, as shown in Figure 2.

Figure 2: Current indexing supporting multiple verticals

The new storage system creates a materialized view of search documents before inserting them into indexes. This facilitates even faster index rebuilds and also deterministic and reproducible indexing. The new system also features improved QoS – quality of service – for different types of document updates. For high priority updates whose staleness has a bigger user impact, we chose incremental indexing. This greatly reduces the head-of-line blocking effect during index rebuilds. We also created the ability to switch between old/new indexes with an automated control plane. This significantly reduces human errors, while the state-machine design in the controller makes it possible to reason about the state of the system by, for instance, showing which index is active.
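
To make the controller's state-machine design more concrete, here is a minimal, hypothetical sketch (not our actual control plane; the state names and transitions are illustrative assumptions) of how index states and transitions might be modeled so that the system's state, such as which index is active, is easy to reason about:

from enum import Enum, auto

class IndexState(Enum):
    BUILDING = auto()    # next-gen index being rebuilt from the materialized view
    VALIDATING = auto()  # rebuilt index checked before it can serve traffic
    ACTIVE = auto()      # index currently serving search queries
    RETIRED = auto()     # previous index kept briefly for fast rollback

# Allowed transitions; anything else is rejected by the controller.
ALLOWED = {
    IndexState.BUILDING: {IndexState.VALIDATING},
    IndexState.VALIDATING: {IndexState.ACTIVE, IndexState.BUILDING},
    IndexState.ACTIVE: {IndexState.RETIRED},
    IndexState.RETIRED: set(),
}

class IndexController:
    def __init__(self):
        self.states = {}  # index name -> IndexState

    def transition(self, index_name, new_state):
        current = self.states.get(index_name, IndexState.BUILDING)
        if new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current} -> {new_state} for {index_name}")
        self.states[index_name] = new_state

    def active_index(self):
        # Answering "which index is active?" is a simple lookup over controller state.
        return next((name for name, s in self.states.items() if s is IndexState.ACTIVE), None)

controller = IndexController()
controller.transition("stores_v2", IndexState.VALIDATING)  # hypothetical index name
controller.transition("stores_v2", IndexState.ACTIVE)
print(controller.active_index())  # -> "stores_v2"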

Now that we have updated our search indexing to be at the right speed, we can move our attention to federated search, an architecture that allows us to retrieve search results efficiently and effectively across verticals such as food, groceries, and alcohol.

Federated search is the retrieval-side counterpart to active/nextgen indexing. With this new design, we can retrieve search results efficiently and effectively across verticals by searching each type of data independently and then mixing the results together to form the final result set, as shown in Figure 3 below.

Figure 3: Federated search pipeline to support search across multiple verticals

Our previous system was tightly coupled with food delivery systems. As our previous engineering blog illustrated, we were primarily running a search engine for restaurants. The search team spent a significant amount of effort rearchitecting a federated search system to support many new verticals by design. Similar to other companies building their own search systems, we also adopted the federate-then-blend paradigm.

With the new system, onboarding a vertical no longer requires surgical code changes in core search systems. This greatly improves go-to-market velocity for new verticals. Additionally, the modularized nature of the new system allows each vertical to customize relevance according to its business. This makes it possible for discrete verticals at different stages of business development (e.g., entering the market, growing the user base, optimizing revenue) to pivot their business goals easily, for example to consider transaction volume vs. revenue.

The new system also allows us to set latency SLAs on every vertical; when there are production issues with certain indexes, we can achieve graceful degradation without incurring a global outage. For example, if one vertical is having issues the overall search does not need to error out. The blending stage performs global coordination among business verticals. This allows us to do many interesting things such as multi-objective optimization, exploit-explore trade-off, inserting ads, and merchandising.
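
As a rough illustration of the federate-then-blend paradigm described above, the sketch below (the vertical names, timeouts, and blending rule are illustrative assumptions, not our production logic) queries each vertical independently with its own latency budget, degrades gracefully when one vertical fails, and then blends the results:

import asyncio

async def search_vertical(vertical, query):
    # Placeholder for an RPC to that vertical's own index (assumed helper).
    # A real implementation would call the vertical's retrieval service.
    return []

async def federated_search(query, verticals=("food", "grocery", "alcohol")):
    async def guarded(vertical):
        try:
            # Each vertical gets its own latency SLA; a slow or failing vertical
            # returns nothing instead of erroring out the whole search.
            return await asyncio.wait_for(search_vertical(vertical, query), timeout=0.2)
        except Exception:
            return []

    per_vertical = await asyncio.gather(*(guarded(v) for v in verticals))

    # Blending stage: global coordination across verticals. Here we simply sort by a
    # per-result relevance score; production blending can also handle multi-objective
    # optimization, explore-exploit trade-offs, ads, and merchandising.
    blended = [result for results in per_vertical for result in results]
    return sorted(blended, key=lambda r: r.get("score", 0.0), reverse=True)

print(asyncio.run(federated_search("ice cream")))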

The motivation behind creating a new search storage engine

As our previous blog posts have illustrated, we have been using cloud-hosted ElasticSearch (ES) as our out-of-the-box storage engine. An open-source project, ES is commonly used as the storage system for search applications.

There are many great things about ES and cloud-hosted operating models. As DoorDash’s business grows and expands, however, there are new use cases that could benefit from a different storage engine. Among the limitations the team encountered with hosted ES are:

  • Difficulties with having computation close to storage. Without an extremely fast network – for example, hundreds of microseconds – certain complex retrieval and filtering algorithms would require that the search service make multiple network round trips to storage. A typical intra-datacenter network round trip between two containers through an operating system networking stack could take 10ms; fetching multiple rounds of data could easily become a latency bottleneck. For example, achieving two-stage search in an efficient way becomes challenging without full control of the storage.
  • Limited selection of first pass rankers. ES allows only a limited selection of first pass rankers through plugins. The idea is to apply a simple ranking algorithm to scan through storage for the first pass of candidate retrieval. But if we can’t push our code to storage, we can either opt into ElasticSearch’s plugin paradigm or pay the cost of network round trips to do post-processing in the search server’s memory.
  • Expensive for high capacity deployments. ElasticSearch becomes expensive as capacity increases. Even after many optimizations to shrink the clusters and with a discounted rate for committed purchase, ElasticSearch generates large cloud compute bills. 

Experimenting with other search paradigms

The search team started experimenting with in-house solutions using Apache Lucene, a mature and industry-leading open source technology, which is also what ElasticSearch is based on – we are going right under the hood. In our early results, we observed a several-fold reduction in end-to-end latency and an order-of-magnitude smaller computation cost. The benefits were myriad, including:

  • Tight retrieval algorithm control and the ability to push down exact logic close to storage
  • Fast experimentation with first-pass ranker algorithms and textual feature engineering with little operational overhead
  • Reduced query latencies through direct leverage of the early termination supported by Lucene, which is also an academic and industry standard method of query processing

In addition to Apache Lucene, the team continues to explore other search paradigms. For example, there are companies already employing deep learning-based embeddings to represent search objects and queries. Some companies are also trailblazing by embedding all things in their systems. In embedding searches, we convert a textual match problem into a high-dimensional spatial search problem. We convert a query to an embedding – for instance a 512-dimension numerical vector – and then conduct a nearest neighbor search among all document embeddings. Searching nearest neighbors for embeddings is a daunting task because of the curse of dimensionality. Fortunately, most of the time we do not need exact KNN — top-K nearest neighbors – and can instead approximate our search problem with ANN – approximate nearest neighbors. There are many researchers and practitioners in the industry who already have been working on this problem for decades. Some recent work includes Meta’s FAISS – Facebook AI Similarity Search – and HNSW, or Hierarchical Navigable Small World graphs and their implementations.
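
As a small illustration of approximate nearest-neighbor search over embeddings, the sketch below uses the open-source FAISS library with an HNSW index; the random 512-dimension vectors, corpus size, and parameters are illustrative assumptions only:

import numpy as np
import faiss  # Facebook AI Similarity Search

dim = 512  # embedding dimensionality, e.g. one vector per document or query
doc_embeddings = np.random.rand(10000, dim).astype("float32")  # stand-in corpus

# HNSW graph index for approximate nearest-neighbor (ANN) search.
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = number of graph neighbors per node
index.add(doc_embeddings)

query_embedding = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_embedding, 10)  # top-10 approximate neighbors
print(ids[0])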

Now that we have discussed how we rebuilt indexing, retrieval, and storage to support search across verticals, we can move on to the second challenge we faced: query and document understanding beyond food searches.

Improving query and document understanding

Having discussed scaling our infrastructure, we now turn to query and document understanding, where we needed to move beyond our taxonomy for dishes on restaurant menus to incorporate more verticals. That required defining and extending a new taxonomy to enrich search, leveraging machine learning labeling services through human annotation to bootstrap our new taxonomies, and using natural language processing for the new search experience. Here we demonstrate why it is important to continue investing in this technology.

Expanding on the food taxonomy

As illustrated in our previous blog, having a taxonomy improves our understanding of a consumer’s search intent. Since we adopted our initial taxonomy, we have standardized the methodology and process to design, curate, govern, and operate taxonomies. We now are working on:

  • A food taxonomy that includes such things as ingredients, nutrition, dietary attributes, regional cuisines, food preparation styles, and restaurant types
  • A retail taxonomy that includes such things as brands, products, nutrition, ingredients, and packaging
  • A query intent taxonomy that includes such things as brand, store name, verticals (food, grocery), modifiers (e.g. yellow/small/one-gallon), dietary, nutrition, and ingredients

With all the standardized knowledge about what DoorDash has in its inventory and what a query potentially is searching for, we can optimize our search system to return more relevant results. One challenge, however, is establishing the source of truth.

To a great extent, we have converted an open problem – understanding everything – into a more tractable one – understanding within an established framework. Even so, we still have far too many dishes and retail inventory items inside our databases. Our strategy to contend with the remaining scale is as follows:

  • Begin with an unscalable manual approach, which allows us to move fast, verify our assumptions, and iterate, scaling up to hundreds.
  • Streamline the manual approach to a standard operating procedure, or SOP, and let our business strategy and operations partners scale the process up to hundreds of thousands.
  • Build machine learning models with the data acquired in previous steps to further scale to millions and more.
  • Create an SOP for quality checks. Just as every public company is required to have third-party auditing of its financials, we conduct sampling-based quality checks against the data annotated in steps one through three. The audited quality evaluation is used to improve the business SOPs as well as to fine-tune machine learning models. Quality checks are essential to reduce systemic biases and improve the accuracy and coverage of taxonomy-labeled data.

Using human annotations to create labels 

Manual human labeling requires a framework to account for cross-rater variance. A standardized process minimizes such variance in our data. Google Search has developed excellent guidelines around this process. DoorDash’s search team has partnered with the business strategy and operations teams to come up with relevant annotation guidelines for our company. To delve more deeply into this topic, please review our DSML team’s work on related topics around item embeddings and tagging.

Even if every DoorDash employee were to work full-time following the SOP, we still would not get to all of the hundreds of thousands of annotation tasks. Human annotations must rely on vendors to scale. There are many companies across the world in the annotation business, including Amazon’s Mechanical Turk, Google Cloud’s AI Labeling Service, Scale.AI, Appen, and many more. These vendors take annotation tasks and parcel them out to affiliated human annotators across the world. Because of the human labor involved, vendor prices and capacities vary widely. As a result, properly managing vendors and capacity/budget allocations is a full-time job.

An annotation task usually does not and cannot rely on a single human judgment. A popular way to de-bias is to have a majority vote with a quorum. To ensure a reasonable level of accuracy, we would need three, five, or even more human annotators to conduct the same annotation task, which increases the demand for human labor and budget. Another factor to consider is the complexity of a task. It is important to have well-designed annotation guidelines so that annotators can conduct their work efficiently and effectively.
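
To make the majority-vote-with-quorum idea concrete, here is a minimal sketch (the labels and quorum size are illustrative assumptions) of how several annotators' judgments for one task could be aggregated, with low-agreement tasks held back for audit:

from collections import Counter

def aggregate_labels(annotations, quorum=3):
    """Return the majority label if enough annotators agree, else None (send to audit)."""
    if len(annotations) < quorum:
        return None  # not enough judgments collected yet
    label, votes = Counter(annotations).most_common(1)[0]
    # Require a strict majority of all collected judgments, not just a plurality.
    return label if votes > len(annotations) / 2 else None

# Example: five annotators label the same SKU with a (hypothetical) taxonomy node.
print(aggregate_labels(["alcohol.beer", "alcohol.beer", "alcohol.cider",
                        "alcohol.beer", "alcohol.beer"]))  # -> "alcohol.beer"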

After we collect annotations from vendors, we still need to conduct audits. Just as in any system, quality assurance is critical here. When we conducted in-house audits on vendor outputs, we found a few interesting themes:

  • Some vendors source their annotators from particular regions, engendering cultural differences that made annotators systemically misjudge a class of concepts.
  • Certain vendors trained/maintained better quality annotations than others.
  • For some types of labels, our annotation guidelines left too much room for ambiguity.
  • Vendors took significantly longer to ramp up and process some of our annotation tasks when we underestimated complexity.

Using natural language processing to enrich query context

In addition to using new taxonomies and labeled data to enrich query understanding, we further explored using Natural Language Processing (NLP) to enrich query context as a scalable approach. The team is just starting to partner with data scientists to apply NLP approaches to query understanding. Here we will focus on a specific application – Part-of-speech (POS) – to highlight NLP’s utility and challenges.

There were two key challenges. First, we needed a performant NLP model to give us good POS annotations. And second, we needed to figure out how to obtain those POS annotations efficiently at query processing time.

The model for POS annotation

POS annotation of a query helps us understand its linguistic structure. For example, “red pizza” is parsed as [JJ NN], in which “JJ” is the POS tag for an adjective and “NN” is for a noun. As a result, we know this query contains the structure “an adjective describing a noun”. During search, when a full match for “red pizza” is not available, we can relax the query to “pizza” by walking up the taxonomy hierarchy, which effectively achieves “query expansion” to increase recall.

Transformer models are very popular today in NLP, and there is a plethora of off-the-shelf open source libraries and models. We experimented and found spaCy to be a good library for its ease of use, active community updates, and rich set of models. NLP libraries offer far more than POS tagging; we are excited to continue experimenting and adding more NLP capabilities.
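
A minimal sketch of the POS tagging step with spaCy is shown below; the model name and the tiny relaxation rule are illustrative assumptions, and the production pipeline is more involved:

import spacy

# Small English pipeline; assumes `python -m spacy download en_core_web_sm` has been run.
nlp = spacy.load("en_core_web_sm")

def pos_tags(query):
    return [(token.text, token.tag_) for token in nlp(query)]

def relax_query(query):
    # If the full match fails, keep only the nouns, which approximates walking up
    # the taxonomy hierarchy (e.g. "red pizza" -> "pizza").
    nouns = [t.text for t in nlp(query) if t.tag_.startswith("NN")]
    return " ".join(nouns) or query

print(pos_tags("red pizza"))    # e.g. [('red', 'JJ'), ('pizza', 'NN')]
print(relax_query("red pizza")) # "pizza"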

Serving POS tags efficiently

Dealing with our second challenge of serving POS tags efficiently remains an ongoing effort. An unoptimized transformer model running in Python kills performance. Because of the Python per-process model, we would pay a very high latency penalty and would need a huge fleet of Python-serving containers just for online inference. We have several options to bring latencies under control:

  • Precompute POS tags offline for head and torso queries and load them into server-side storage such as Cassandra, Redis, or CockroachDB, eliminating the need for online inference for those queries. Precomputing for tail queries is impractical, though, given their unbounded cardinality.
  • Use a simpler NLP model for online inference. This involves a trade-off between model power and system performance. There are a few JVM-based NLP libraries; DoorDash runs microservices in Kotlin, so JVM libraries are a good fit for us. Two notable choices are Apache OpenNLP and Stanford CoreNLP. The former is open source under the Apache license, making it free to include. CoreNLP is under the full GPL, but its maintainers offer a commercial license.
  • Go deep into Python and neural network internals to solve the performance issue inside Python runtime. This is doable with in-depth engineering investment; Roblox recently showed that they were able to serve one billion online BERT inferences daily.

Because we are just starting to apply NLP techniques to search, we are only scratching the surface of this decades-old, well-researched field. We will publish more about our endeavor in this area as the team makes more progress.

Learning-to-rank

Our remaining challenge is bringing better relevance to search results. After we retrieve documents from the search storage systems, we need to rank them according to relevance. Below, we trace the evolution of learning-to-rank technology at DoorDash, show why it is critical to define a search evaluation framework, and describe which learning-to-rank techniques have helped us improve relevance.

Figure 4: Evolution of search ranking at DoorDash

We began with a simple heuristic ranker based on content popularity and lexical similarity – BM25 – (as described in our last blog post) as an out-of-the-box solution from ElasticSearch. Besides BM25, PageRank and probabilistic language modeling have also demonstrated their success in the information retrieval domain as prevailing “unsupervised” approaches.

Combining different ranking heuristics is not trivial, however, and eventually becomes unscalable when the business/product continues to grow and search behavior becomes more diverse. Performance also can be suboptimal and inconsistent. The heuristic ranker works most effectively when a consumer searches for a specific merchant name, which is called branded search; otherwise, the relevance pales. 

At the end of 2020, we started to introduce the learning-to-rank, or LTR, technique to improve search performance based on a machine learning framework. 

In moving from a purely rules-based string-match ranker to a more sophisticated one, we gathered some common learnings worth sharing. Chief among them: search evaluation is critical to establishing a principled framework for assessing search result relevance.

Human evaluation for relevance metrics and training data

Before the data/ML-driven approach kicks in, it’s important to set the first principle for search relevance: the ultimate consumer experience we are trying to deliver and the goal we are optimizing for. Once that is established, data plays a vital role in providing valuable information, first for evaluation and then for modeling.

Prior to the first LTR model, we put in significant effort to establish the online search evaluation framework and methodology up front. 

We established two categories of metrics: business and information retrieval. For business metrics, search conversion rate is the north star, along with a set of input metrics such as click-through rate, search usage, and first-click/converted rank position.

For information retrieval metrics, the first step is to define search relevance. From the beginning, we leaned primarily on consumer engagement as implicit feedback and assigned varying weights to different engagement types – for example, click and conversion – as a perceived relevance score. Accordingly, we built mean reciprocal rank (MRR) and normalized discounted cumulative gain (NDCG) metrics to evaluate search quality beyond materialized business impact.
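
For reference, here is a compact sketch of those two information retrieval metrics, computed from per-position relevance labels (the engagement-weighted labels shown are hypothetical):

import math

def reciprocal_rank(relevances):
    """1/rank of the first relevant result, 0 if none (averaging over queries gives MRR)."""
    for i, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / i
    return 0.0

def dcg(relevances):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Discounted cumulative gain normalized by the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for one query's ranked results
# (e.g. conversion = 2, click = 1, no engagement = 0).
ranked = [0, 2, 1, 0, 1]
print(reciprocal_rank(ranked), ndcg(ranked))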

The implicit feedback serves as an effective data signal for both evaluation and modeling in the short to mid term, especially for head and torso queries, for which consumer engagement information is rich and relatively cheap to acquire. To comprehensively measure any nuanced search quality improvement and drive performance for tail queries, however, we later introduced a human-in-the-loop method and leaned heavily on operational excellence to collect more reliable signals and slice and dice search quality from every angle. The team is currently revamping the next-gen evaluation system; we show only the current high-level conceptual flow in Figure 5 below. We will publish a future blog post to dive deeper into our next-generation quality evaluation system.

Figure 5: Current search evaluation high-level conceptual flow

Pointwise ranking

After defining the search evaluation framework, our next step was to build learning-to-rank models and apply them to production. We started with pointwise ranking, a well-studied learning-to-rank paradigm. Given that search conversion rate is our north star metric, we directly train an ML model to predict the view-to-conversion rate for all the stores retrieved by a specific search query.

We leverage search logs and store metadata to engineer four types of input signals, including:

  • Query features, which characterize the query type – branded vs. non-branded – query length, click-through rate (CTR) and conversion rate (CVR) of the query
  • Lexical similarity features, such as Jaccard similarity and string edit distance between query and store metadata
  • Store features, or historical CTR and CVR in such things as search, total sales, rating, price range, or ETA
  • <Query, Store> pairwise features, including historical store impressions, CTR, and CVR for a specific query.

The first LTR model was trained using gradient boosted decision trees and was evaluated on a separate dataset. We leveraged the Bayesian approach for hyperparameter tuning and area-under-the-curve (AUC) gain for offline model selection. 
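
As an illustration of this pointwise setup (the feature names, data, and the LightGBM choice are assumptions for the sketch; the production model, features, and tuning differ), a gradient-boosted-trees model can be trained to predict conversion and evaluated with AUC:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical <query, store> rows: [query_ctr, lexical_similarity, store_cvr, pair_ctr]
X = np.random.rand(5000, 4)
y = (np.random.rand(5000) < 0.1).astype(int)  # 1 = the impression converted

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Pointwise learning-to-rank: each <query, store> example is scored independently.
model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]    # predicted view-to-conversion rate
print("AUC:", roc_auc_score(y_test, scores))  # offline model selection metric

# At serving time, stores retrieved for a query are sorted by this predicted score.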

From heuristic ranker to having an ML-empowered LTR, we have observed an immediate big leap in all consumer engagement metrics with search. 

Behind the scenes, however, it took more than six months from model development to final deployment. Significant effort went into building the systems surrounding ML so that we could seamlessly build, deploy, iterate, and measure model performance. An end-to-end ML pipeline encompasses far more than building a performant offline model. ML technology at DoorDash also has improved significantly since then.

The second big leap happened when we introduced personalization features into the LTR model. At DoorDash, our search has two distinct usages:

  • Brand search, in which the consumer searches for restaurant names with navigational and highly focused intent, and
  • Non-brand search, in which the consumer searches for a certain cuisine type or food category, usually through cuisine filters. The intent to convert is largely driven by personalized preference.

When we introduced consumer-specific signals into the LTR model, we saw a substantial boost in core business metrics. Indeed, the improvements came mostly from non-branded searches, which account for a sizable chunk of the search volume.

Conclusion

To expand from a focused food delivery search engine to a general platform that allows searching for everything, we needed a holistic approach that touches almost every aspect of a search system including search infrastructure, query and document understanding, and learning-to-rank. We showed how to build a more reliable indexing architecture to shorten indexing time with real-time updates. We also showed how to federate searches across verticals for efficiency and effectiveness. We believe it’s worth pursuing new types of search storage engines instead of betting on third-party solutions. For query and document understanding, we showed practical considerations and lessons from defining and collecting data for a new taxonomy. We believe continuous investment in natural language processing will yield big gains for us. Finally, we discussed the evolution of learning-to-rank technologies at DoorDash and how we established our evaluation framework and leveraged specific ranking techniques.

We plan to publish future blog posts from our DSML teams dedicated to the machine learning details of search, focusing on pairwise ranking and deep learning. We are constantly making improvements to our search performance and relevance. We hope to meet you again in the near future in another blog post!

Acknowledgements

Many people from search and new vertical engineering teams have contributed ideas to make this blog post come to life: Aleksandr Sayko, Kostya Shulgin, Qihua Yang, Jimmy Zhou. We are also grateful to the DSML teams for their contributions in ML modeling. Finally, we would like to thank the leaders in engineering, S&O, analytics, product and DSML: Prasad Sriram, Tian Wang, Tarun Bansal, Melissa Hahn, Oliver Davies, Jessica Zhang, Akshad Viswanathan, Parag Dhanuka, Han Shu, and Kurt Smith.

At DoorDash, we make millions of predictions every second to power machine learning applications that enhance our search, recommendation, logistics, and fraud detection, and scaling these complex systems along with our feature store is a continual challenge. To make predictions, we use a microservice called Sibyl Prediction Service (SPS) written in Kotlin and deployed in Kubernetes pods with load balancing to serve all our prediction requests. These requests typically take features as inputs and produce a return value that drives a product decision. A typical feature value will capture the number of items in someone’s order (numerical), be an identifier of the market they are in (categorical), or be a representation of a menu item they are ordering (embedding). We can also represent more complex data types (such as images) as vectors.

These feature values are frequently loaded from our feature store, which is powered by a set of large Redis clusters (in-memory key-value stores). We use multiple TBs of memory in our Redis clusters, so this service is a significant source of cost and maintenance. Additionally, prediction requests are performed online and need to be served in a matter of milliseconds with strict integrity requirements.

We investigated the idea of adding a caching layer at the microservice level to achieve better request efficiency and ease the load on Redis, expecting it to make our system more scalable. We instrumented SPS with some logging to understand request patterns. After our initial analysis, we found that a large fraction of requests are repeated and wasteful, making a strong case for the efficiency of an in-process caching layer that we proposed.

Our feature store has certain optimizations in place, such as encoding, compression, and batching. We will explain these in detail so you can better understand our experiment setup. Once we decided on the instrumentation steps, we investigated various caching libraries to get the most performance out of the system.

To maintain system stability, we rolled out the caching solution in phases, with rigorous verification steps along the way. After a few iterations, we found that the real-world results closely match our simulations, achieving 70%+ cache hit rate.

How we use ML features to make predictions in Sibyl Prediction Service today

How do features end up in the feature store? Using a centralized method, we take ground truth data from our database and upload it to the feature store in a specific format.

For batch features (such as long-term statistical features), this happens periodically, for example daily, orchestrated by our internal integration called Fabricator. Real-time features, on the other hand, are uploaded continuously using Kafka queues orchestrated by Riviera, our feature engineering framework.

In both cases, the features are stored in a Redis store in a specific format, which will provide important context for understanding the design principles behind our feature caching solution.

Internally, we name features using a convention that encodes the aggregation type, a brief description, and the feature type. We then hash these feature names using xxHash to reduce the repetitive storage cost of long strings:

xxHash32(daf_cs_p6m_consumer2vec_emb) = 3842923820

We then associate each of these features with entity_ids, which are internal identifiers of consumers, Dashers (delivery drivers), merchants, deliveries, etc. To be able to retrieve multiple feature values for a given entity ID in one request, we create hashmaps in Redis keyed by the entity ID, as follows:

HSET entity_id xxHash32(feature_name) feature_value

During a prediction request, we retrieve one or more feature values for all of the entity IDs involved:

HMGET entity_id1 xxHash32(feature_name1) xxHash32(feature_name2) …
HMGET entity_id2 xxHash32(feature_name3) …
…

which returns the mapping

{
entity_id1: {feature_name1: feature_value1,
             feature_name2: feature_value2, …}
entity_id2: {feature_name3: feature_value3, …},
…
}
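
The request/response shape above can be reproduced with a short hedged sketch using the redis and xxhash Python packages; the connection, entity IDs, and feature names here are placeholders, and the production path is Kotlin with serialization and compression on top, as described next:

import redis
import xxhash

r = redis.Redis(host="localhost", port=6379)  # placeholder connection

def feature_field(feature_name):
    # Hash the long feature name down to a 32-bit integer field key.
    return xxhash.xxh32(feature_name.encode()).intdigest()

entity_id = "st_1426727"  # placeholder store entity ID
names = ["daf_cs_p6m_consumer2vec_emb", "daf_st_p7d_avg_order_size"]  # hypothetical names

# HSET entity_id xxHash32(feature_name) feature_value
r.hset(entity_id, mapping={feature_field(n): b"<serialized value>" for n in names})

# HMGET entity_id xxHash32(feature_name1) xxHash32(feature_name2) ...
values = r.hmget(entity_id, [feature_field(n) for n in names])
print(dict(zip(names, values)))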

Some nontrivial feature values (such as lists) can be very long, so we serialize them using protocol buffers. All of the traffic sent to the feature store is then compressed using the Snappy compression method.

The design principles behind our feature store can be found in more detail in an earlier post about our feature store.

The lifecycle of feature store requests

In a prediction request, SPS can receive multiple prediction queries; each of them takes as input a set of feature names and the corresponding entity IDs of interest. Some of these features are provided by the upstream client since it has the context (for example, the current search query term provided by our customer), but most features need to be retrieved from the feature store. Stored features are typically statistical features aggregated over a period of time, for example, the average order size at a merchant over the past seven days.

During the prediction request, the feature values get populated in the following order:

  1. Features passed in from the request
  2. Features retrieved from Redis
  3. Default feature value, if the feature is not in Redis

Figure 1: The process of feature retrieval throughout the lifecycle of a prediction request

We use a set of sharded Redis clusters, where we partition data by use cases. For example, we have a separate shard for the use cases of search and DeepRed (the system at the center of our last-mile logistics platform). This allows us to scale the storage cluster for each use case independently. We make batch requests to Redis for each entity ID, which typically have 10-100 feature names. We retrieve tens of millions of key/value pairs from our Redis clusters on average per second.

The feature store’s scalability problem

Our feature store is a massive and very costly database that needs to support the billions of requests we make each day to retrieve ML features. We store features for DoorDash customers, merchants, Dashers, and deliveries. These features can be either batch or real-time features and account for billions of rows in our feature store.

Our primary way of storing data is Redis, which fits our needs well since there is a strong requirement for low-latency predictions. Each prediction needs to pull in hundreds of feature values. At the same time, this cluster has to be highly available for our prediction requests, so we use a large number of instances to balance the load. Since Redis is an in-memory store, we need to host our data in servers with large amounts of memory. Having servers with lots of memory in turn drives up our compute costs.

Cost efficiency is an important aspect of any platform, and DoorDash’s prediction platform is no different. The feature store is a large contributor to our costs, so tackling this scalability problem has been the focus of our team. To this end, we set out to explore the possibility of using alternative storage systems for our features (ones that might be slower but more cost-effective) and to explore caching. We hypothesized that most requests to our feature store are repeated and therefore are wasteful, making caching an attractive way forward.

By implementing caching, we expected to significantly improve: 

  • Latency: the time it takes to perform predictions. Faster predictions can unlock more use cases and lead to a better end-user experience.
  • Reliability: Consistently performing predictions in a timely manner means that our predictions are higher quality since they fall back to a baseline less often.
  • Scalability: As the scale of DoorDash grows and we are using machine learning for more use cases, we can make predictions in a more cost-effective manner.

Ultimately, even if we don’t realize the above gains immediately, caching is a very promising path forward for us because it unlocks storage backends that are slower but more cost effective than Redis.

How caching can help us improve online prediction scalability and reliability

Before embarking on a large migration project, we wanted to gather more information to estimate the return on investment on our idea. Building out a functional testing environment in production was a larger investment and comes with the risk of negatively impacting production traffic, so instead we first set out to validate our idea using simulation.

Our main idea was to capture live production network traffic originating from our prediction microservice to our Redis cluster. We instrumented SPS in a nonintrusive way to get an idea of what requests are sent to Redis. We used the well-known tcpdump tool to capture network packets (pcap) and save them to a file, and then reassembled these packets into the correct order with a tool called tcpflow.

These TCP packets looked like this:

HMGET st_1426727 2496559762 1170383926 3383287760 1479457783 40892719 60829695 2912304797 1843971484

In this example, HMGET requests eight different feature values from the hashmap corresponding to the st_1426727 entity_id. The keys are xxHash values of feature names in 32-bit integer format.

Once we had a large enough sample of these HMGET commands, we ran a custom Python script to parse the requests and simulate various caching scenarios with the goal of estimating cache hit rate.

import sys
from functools import lru_cache

LRU_SIZE = 100000  # simulated capacity of the bounded LRU cache

hits = misses = 0  # counters for the infinite-memory simulation
cache = set()

@lru_cache(maxsize=LRU_SIZE)
def lru_get(x):
    # Body intentionally empty: we only use functools.lru_cache's bookkeeping
    # (cache_info) to simulate a bounded LRU cache.
    pass

def hit_rate(w):
    # Percentage of hits in a window of 0/1 hit indicators (used for windowed plots).
    if not w:
        return 0.0
    return sum(w) / len(w) * 100.0

with open(sys.argv[1], encoding="utf8") as f:
    for x in f:
        x = x.strip()
        parts = x.split()
        if not parts:
            continue
        if parts[0] != "HMGET":
            continue
        # parts[1] is the entity_id; parts[2:] are the hashed feature names.
        for i in range(2, len(parts)):
            key = (parts[0], parts[1], parts[i])

            lru_get(key)  # bounded LRU simulation

            if key in cache:  # infinite-memory simulation
                hits += 1
            else:
                cache.add(key)
                misses += 1

lru_info = lru_get.cache_info()
lru_hitrate = lru_info.hits / (lru_info.hits + lru_info.misses) * 100.0
print(f"Infinite memory: {hits}, Misses: {misses}, Cumulative Hit Rate: {hits/(hits+misses)*100.0}")
print(f"LRU with size {LRU_SIZE}: {lru_info.hits}, Misses: {lru_info.misses}, Cumulative Hit Rate: {lru_hitrate}")

With this framework, we were able to run various simulations offline to find out how the cache would behave.

Figure 2: Simulated partial cache hit rate for the last 10,000 requests over five minutes in a single pod, for varying LRU cache sizes

This experiment provided some valuable insights. In our benchmarks, we’ve seen that we can achieve close to 70% cache hit rate with a cache capacity of 1,000,000. We also discovered that the cache gets saturated after 15 minutes of production traffic.

To summarize, our simulation methodology was as follows:

  1. Run tcpdump on production hosts to gather outgoing traffic to Redis for 15 minutes.
  2. Parse Redis requests data to get the list of keys retrieved.
  3. Simulate caching by looking at repeated keys.
  4. Plot graphs that show the results of caching over time.

How we implemented and measured the effects of pod-local caching on prediction performance

As the first layer of caching, we set out to implement an in-memory, pod-local cache for our SPS microservice. The idea was to implement the cache logic around the existing feature retrieval logic – returning early if there is a value stored in the local cache.

Since we have billions of distinct feature values being processed through SPS, we cannot afford to store all of those values in each of our pods. We have to limit the size of our cache and, when it becomes full, evict some values (preferably the ones we don’t need). Therefore, we needed to come up with a cache eviction scheme. There are many different eviction schemes; for our use case, we chose LRU (least recently used) eviction.

SPS is implemented in Kotlin and so we sought out a way to integrate an LRU cache with our microservice. Since SPS is a multithreaded application serving a huge throughput of requests, we needed to have an implementation that can operate with high concurrency. Therefore, the cache implementation had to be thread-safe. There are different strategies to achieve thread safety, but implementations typically use read-write locks.

Another consideration was observability: We needed to know how our cache performs in a real-world environment and whether it matches our expectations set in the offline experiments. At DoorDash, we use Prometheus to log metrics, which can be viewed on Grafana graphs. For the LRU cache, we wanted to log latency, hit rate, cache size, memory usage, request count, and cache correctness.

Efficient caching solutions in Kotlin

We set up a base interface for our cache in Kotlin:

interface Cache<K, V> {
    // Returns the value associated with the key, or null if the value is not in the cache.
    operator fun get(key: K): V?

    // Inserts a key/value pair into the cache, overwriting any existing values
    fun put(key: K, value: V)

    // Deletes a specific key in the cache
    fun delete(key: K)

    // Deletes all keys in the cache
    fun clear()

    // Returns the estimated size of the cache
    fun size(): Int
}

A basic, inefficient, and unbounded implementation of a cache might look as follows:

class UnboundedCache<K, V> : Cache<K, V> {
    val store: MutableMap<K, V> = Collections.synchronizedMap(mutableMapOf())
    override fun get(key: K): V? {
        return store[key]
    }
    override fun put(key: K, value: V) {
        assert(value != null)
        store[key] = value
    }
    override fun delete(key: K) {
        store.remove(key)
    }
    override fun clear() {
        store.clear()
    }
    override fun size(): Int {
        return store.size
    }
}

We can add LRU caching to the class by using LinkedHashMap with a custom removeEldestEntry:

class LRUCache<K, V> (private val name: String, private val capacity: Int, private val loadFactor: Float = 0.75f) : Cache<K, V> {
    private var store: MutableMap<K, V>

    override fun size(): Int {
        return store.size
    }

    init {
        assert(capacity > 0)
        store = object : LinkedHashMap<K, V>(capacity, loadFactor, true) {
            override fun removeEldestEntry(eldest: MutableMap.MutableEntry<K, V>): Boolean {
                return size() > capacity
            }
        }
        store = Collections.synchronizedMap(store)
    }

    override fun get(key: K): V? {
        val value = store[key]
        return value
    }

    override fun put(key: K, value: V) {
        assert(value != null)
        store[key] = value
    }

    override fun delete(key: K) {
        store.remove(key)
    }

    override fun clear() {
        store.clear()
    }
}

In the above code, the Collections.synchronizedMap(store) clause makes the class thread-safe. However, it will not perform well in a high concurrency environment, because we lock the whole internal linked list on each lookup and insertion. Therefore, this code is purely for illustration purposes and should not be used in a production application.

For production use cases, we experimented with multiple open-source libraries and found the Caffeine library (written in Java) to work best. It provides a clean API and is highly performant. It has great benchmarks compared to other open-source libraries.

Integrating the cache library with SPS

Once we had the caching library set up, the next step was to modify the SPS feature retrieval logic. The idea was to intercept the feature retrieval call and check the cache for feature values before sending out the requests to Redis. We then retrieved the missing values from Redis and populated our cache before returning them to the caller.

Figure 3: Redis feature retrieval with caching.

While the majority of feature values are unchanged over short periods of time, there are some real-time features that get updated more frequently. We had to implement a cache eviction strategy to maintain prediction integrity. For batch features, we mark the features that get uploaded to Redis by their upload time and poll this information periodically in SPS. When a feature is uploaded, we evict all feature values for that particular feature. This eviction process can be wasteful in some cases; however, we found that it strikes a good tradeoff to maintain integrity.

Safely rolling out the live caching abstraction in a business-critical environment

Since SPS is a business-critical service, we had to pay special attention to making sure that introducing complexity through caching did not interfere with regular production requests. We had to be very careful not to increase latency or use incorrect features when making predictions, as a regression here can lead to a severely degraded user experience. Therefore, we thoroughly instrumented our service with metrics and used a dry-run rollout process. Internally, we track various metrics such as availability, latency, and prediction accuracy, which we extended with cache characteristics such as hit rate, cache size, and correctness.

As the first step of the caching experiment, we marked only a certain set of features as cacheable. This allow list allowed us to roll caching out feature by feature and not have to worry about real-time features initially.

We then added a code path parallel to the regular Redis requests that simulates cache population and retrieval and also compares the final result with what the production Redis request would yield. This parallel process was run in an asynchronous thread to not cause a latency overhead in the critical flow of requests. In our production setting, we found that Caffeine greatly outperformed Java’s LinkedHashMap implementation and Guava, consistent with our earlier benchmarks.

During this dry-run process, we observed how the cache size grew as we were receiving requests. The dry run was also helpful in determining what cache hit rate we could expect. Initially the cache hit rate was very low due to various implementation bugs, so our rollout precautions already paid off.

We compared expected and cached feature values over a longer period of time to make sure the feature values returned were as expected. These checks were immensely helpful in identifying gaps in feature value refreshing logic.

When we were confident enough that we could serve cached requests live, we still controlled the percentage of requests sent to the cache and limited the size of the in-memory cache. We followed this process so that we did not overload the production environment.

After this series of tests, we launched caching live and observed an improvement in latency and reliability. It’s also worth noting that the cache hit rate characteristics closely match what we saw in our simulation.

Figure 4: Cache hit rate for one of our SPS clusters. The sharp drop represents a deployment, while the smaller drops are due to feature refreshing. Steady cache hit rate is at around 78%.

Further improving overall feature retrieval performance in SPS

Implementing the local cache is the first step in a comprehensive caching strategy. To maximize the caching performance and correctness, we plan to implement additional improvements to address some inefficiencies:

Cold start problem: In a production environment, the microservice processes often get restarted. This can be due to a resource allocation change or an ordinary service deployment. When the process restarts, it starts with an empty cache, which can severely affect performance. Shadowing production traffic to a newly started process and “warming up” its cache can solve the cold start problem.

Deserialization overhead: We are repeatedly decompressing and deserializing the raw Redis responses in our cache. While the cost is generally small, it can still add up, especially for large objects such as embeddings. We can modify our workflow to instead store deserialized objects in memory:

Figure 5: Optimized caching workflow with deserialized objects stored in memory. The major difference is the elimination of the decompression step and the type of the saved object in the cache.

Cache sharding: Another potential optimization is to improve the cache hit rate by sharding the caches instead of storing data independently on each pod, effectively as replicas. A useful heuristic can be to shard requests by an application-specific ID. The intuition is that similar requests will overwhelmingly reuse the same features. For example, we can shard ETA (estimated time of arrival) prediction requests by the location of the merchant and expect a lot of the features for that merchant to be likely present in the cache associated with the shard of that merchant, as compared to a generic cache. This method, if implemented correctly, can effectively multiply the cache capacity by the number of pods we have.
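
A tiny sketch of that routing heuristic follows; the shard key, pod names, and hashing choice are purely illustrative assumptions:

import zlib

def pod_for_request(merchant_id, pods):
    # Route requests for the same merchant to the same pod so that its features
    # stay hot in that pod's local cache, effectively sharding the cache by merchant.
    shard = zlib.crc32(merchant_id.encode()) % len(pods)
    return pods[shard]

pods = ["sps-pod-0", "sps-pod-1", "sps-pod-2"]  # hypothetical pod addresses
print(pod_for_request("merchant_42", pods))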

Real-time features: Real-time features require special attention, since they are not uploaded in one big batch but rather gradually over time. To track updates of real-time features, we can have Riviera write the updated features to both our feature store and a separate Kafka queue. With that in place, SPS can read from that queue and update its cache.

Figure 6: Real-time feature cache refreshing with Kafka queues

Conclusion

At DoorDash, we successfully rolled out a caching layer for our gigascale prediction and feature store infrastructure. Caching methods have been successfully deployed in low-level (such as CPUs) and high-level (network request) environments for a variety of applications. The concept is simple, but implementing it requires domain knowledge and experimentation. We demonstrated that with the right abstractions and benchmarks, it can be gradually and safely rolled out in a large-scale production environment.

Feature stores are commonly used in machine learning operations and many teams are now facing scalability challenges. Caching is an effective way of improving scalability, latency, and reliability while maintaining integrity. By adding caching layers, the ground truth data can be stored at a more cost-effective and scalable level without sacrificing latency.

Acknowledgments

I would like to give special thanks to

  • Brian Seo for adding Fabricator upload status marking,
  • Kunal Shah for driving the development of Riviera and Fabricator,
  • Arbaz Khan for the development of our feature store and guidance on SPS
  • Hien Luu for general advice and support on our feature caching initiative.

Three key steps are of paramount importance to prevent outages in microservice applications, especially those that depend on cloud services: Identify the potential causes for system failure, prepare for them, and test countermeasures before failure occurs.

Because DoorDash’s complex infrastructure and applications are susceptible to breakage, we need to understand which system failures cause outages. This is ideally done by subjecting backend services to intentional simulated failures in a test environment, where the application’s behavior under failure can be observed. Only after completing this process can we design appropriate countermeasures to ensure that future failures do not become full-fledged outages that affect DoorDash’s customer experience.  

In this post, we discuss both nascent and more traditional resilience engineering approaches and their application to large-scale microservice applications. We also introduce a new approach – service-level fault injection testing – that was explored on the DoorDash platform during my recent research internship experience.

Enter Filibuster: An automated resilience testing tool  

As a PhD student at Carnegie Mellon University, I have been working for two years on developing an automated resilience testing tool called Filibuster in the Composable Systems Lab of the university’s Institute of Software Research. Filibuster’s design is based primarily on the study of grey literature – conference presentations, blog posts, white papers, and other material not produced by commercial publishers – to identify resilience bugs that have caused outages in order to better understand how they can be prevented in the future. While Filibuster was designed with practicality in mind, the ultimate goal of the academic research is to be able to adapt and apply research ideas in an industrial setting.  

As a result of DoorDash’s growing interest in improving platform reliability, I joined DoorDash as an intern during the summer of 2021 to test Filibuster’s applicability to the DoorDash platform. My work produced positive preliminary results along those lines, while also affording me an opportunity to extend Filibuster’s core algorithms and to implement support for new programming languages and RPC frameworks.  Access to DoorDash’s real-world industrial microservice application was extremely valuable both to ensure that Filibuster’s design matches how microservice applications are developed today and to influence future features that I would not have identified in a lab setting. Experiences like these are a win-win for both the research community and the host company!

Background: Microservices and their complexity

Backend services at DoorDash are written using a microservice architecture, a style that allows independent teams, focused on one specific component of the business, to deliver features at their own pace and then independently scale those services to meet demand. As is the case at DoorDash, microservice architectures typically are adopted in order to increase developer productivity and improve feature delivery as the business scales up. Microservice architectures typically are not adopted for technical reasons alone; they take a simplistic, easy-to-test monolithic application design and convert it into a harder-to-test, harder-to-reason-about distributed system.

Distributed systems are notorious for their complexity. Once a monolithic design has been broken apart into its constituent services, the handling of a single end-to-end user request from a mobile application may involve tens if not hundreds of different services working together in concert. This decomposition requires that developers now consider a new type of application complexity – partial failure, in which one or more of the services that a single process depends on happens to be in a failed state. More practically, it forces developers of these applications to ask questions like the following: What happens when one of the necessary services goes down or becomes unavailable? What happens if a needed service is slow? What happens if one of the required services deploys with a bug and starts returning errors when called?

Unfortunately, developers must have answers to all of these questions. Cloud providers do not provide 100% uptime guarantees because failures do occur; software engineers are not perfect and therefore they write bugs that sometimes get past manual testing processes. Even if it were possible to have perfect pre-production testing to ensure that bugs never made it into production, DoorDash relies on a number of third-party services where bugs may exist. In short: Outages happen.  

Failure is not something that can be avoided entirely. It is inevitable and must be anticipated and planned for.

Chaos engineering and when it can be used in production

Over the last 10 years, chaos engineering has emerged as the primary discipline to address this reliability problem. It first acknowledges that failure occurs and then subsequently focuses on the organizational and/or technical responses needed when services inevitably fail.  

Chaos engineering was pioneered by Netflix when it moved from physical servers to AWS. The first instantiation of chaos engineering was Chaos Monkey, a tool for automatically and randomly terminating EC2 instances to verify that Netflix could withstand such failures in production. It has since expanded to an entire suite of tools. For example, Chaos Gorilla simulates the failure of an entire availability zone in AWS and Chaos Kong simulates the failure of an entire AWS region. Netflix since has devoted significant time to automating fault injection; entire production clusters now can be set up to run a chaos experiment on a very small percentage of Netflix production traffic and then can be torn down automatically by the company’s ChAP and Monocle fault injection systems. Former Netflix employees recently formed Gremlin, a company for providing chaos-engineering-as-a-service, which already has been used by major e-commerce sites.

Netflix espouses the value of doing this style of chaos engineering directly in production. Some Netflix users might not care if they have to refresh the Netflix homepage when they happen to fall into a chaos engineering experimental group and encounter a bug. But some users do care; Netflix lets them opt out of the experimental group. DoorDash, however, can’t expect a customer trying to place an order to tolerate a random app failure, particularly not during a chaos engineering experiment. Rather than simply refreshing the page as in the Netflix case, that DoorDash user likely would become frustrated and stop using the platform. Additionally, while there are loyalty programs like DashPass, there is still a low switching cost for consumers, which is why providing a quality customer experience is essential to not losing business to competitors over the long haul.

As a result, chaos experimentation in production is impractical for applications and companies like DoorDash.  

What about chaos engineering in staging or during development?

Alternatively, the same style of chaos experiments could be run in a staging or local development environment. This is precisely what Gremlin and developers of local-first chaos engineering tools – for example, Chaos Toolkit – recommend as a starting point. Chaos experiments in staging can be used successfully to test an organization’s response to failure  – for example, on-call response or validation of runbooks. But from a technical perspective, using chaos engineering at this stage highlights some of the main downsides of the whole chaos engineering approach.

First, we are not only concerned with failures that render a service either unavailable or slow. We are also concerned with understanding the entire application’s behavior when a particular service begins returning an error. The new error may be caused by the failure of a dependent service further down the chain, or the error may have occurred because a dependent service was redeployed with a bug. These types of failures typically are harder to generate through traditional chaos engineering approaches that rely solely on exhausting memory of service instances, rendering them unavailable on the network, or simply crashing them. The failures we are concerned with occur at the application level, not the infrastructure level. If a bug is deployed in a downstream service, we wish to harden the implementation of individual services in order to make our overall application more resilient. For a fault injection approach to be maximally useful, it should also consider application-level failures.

Second, without taking a systematic approach to fault injection, it is difficult to guarantee overall application reliability; random experimentation may miss the one particular failure that will result in an outage. Most chaos engineering approaches and tools used today – for example, Gremlin or LinkedOut – rely on manually specified experiment configurations. This places the onus on the developer to devise and manually specify the possible failure scenarios they want to test, and it neither systematically nor exhaustively searches the space of possible failures. To guarantee that a service will act a particular way when a failure occurs, we need to test that service for that failure. Therefore, we believe that any fault injection approach should generate and execute the fault configurations automatically to provide these guarantees of resilience.

Chaos engineering as a discipline originated from concerns about infrastructure reliability and the technical and organizational responses to the inherent lack of reliability. But the reliability of microservice applications extends far beyond the infrastructure level to software defects in the application and ensuring resilience against those defects. While chaos engineering is an incredibly useful technique for identifying infrastructure-level defects and testing the organizational response, it is important to identify the potential application-wide effects of those defects early in the development process and to build services that remain reliable in spite of them.

Filibuster: Automating resilience testing of microservice applications

Filibuster is a tool designed to automatically test microservice applications for resilience.  Filibuster operates from the perspective that any low-level system resilience issues such as service failures or timeouts will manifest themselves in the application layer as errors or exceptions alongside any application-level failures when issuing interservice RPCs. To identify a microservice application’s resilience issues, it is therefore only necessary to enumerate these errors, synthesize them in the application, and then verify the application behavior under fault through systematic exploration.

To do this, Filibuster relies on a number of techniques that work together in concert:

  1. First, a static analysis identifies the possible errors visible – either thrown or returned – at each RPC call site within the application. 
  2. Next, Filibuster systematically injects these errors first one by one and then in combinations while repeatedly executing a functional end-to-end test of the application to ensure that it continues to behave desirably until the space of possible failures is exhausted. This approach avoids the overhead of requiring developers to write unit or integration tests manually that contain mocks for all failures and combinations. 
  3. Because developer-written functional tests typically do not contain a test oracle that considers the failure of one or more service dependencies, developers are forced to think about what the system should do when one of the dependent services inevitably fails and an assertion in their functional test fails. 
  4. Once developers know what the desired behavior under fault is, they can encode this behavior directly into the functional test using Filibuster-provided conditional assertions.  For example, the application returns 200 when all services are online; if service X is down, the application will return error code Y.
  5. You can think of this iterative process as resilience coaching. Through the use of Filibuster, developers are made aware of a failure scenario that they did not anticipate and are forced to think about what the application should do before writing an assertion to capture that behavior.

We like to think of this process as resilience-driven development.
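As a concrete illustration, a functional test with conditional assertions might look like the sketch below. The `was_fault_injected` helpers, module name, endpoint, and service names are illustrative placeholders standing in for whatever the fault injection framework actually provides, not Filibuster's exact API.

```python
import requests

# Placeholder helpers standing in for the fault injection framework's API;
# they report whether the current execution injected a fault, and for which service.
from resilience_helpers import was_fault_injected, was_fault_injected_on


def test_checkout_under_fault():
    # The fault injector re-runs this end-to-end test once per generated
    # fault configuration, so the assertions must cover every outcome.
    response = requests.get("http://localhost:8080/api/checkout/summary")

    if not was_fault_injected():
        # Fault-free run: the request should succeed.
        assert response.status_code == 200
    elif was_fault_injected_on("payment-service"):
        # Desired degraded behavior when the (hypothetical) payment service fails.
        assert response.status_code == 503
    else:
        # Any other single dependency failure should still degrade gracefully.
        assert response.status_code in (200, 503)
```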

In short, Filibuster can identify many of the technical resilience issues that exist in applications earlier, at the time of development, without testing in production as required in more traditional chaos engineering techniques.

Two key Filibuster features 

Because no single design suits all organizations, Filibuster was designed with configuration in mind. Without covering all of the possible Filibuster configurations, we highlight what we believe are two key features for organizations concerned with adopting resilience testing – deterministic replay, a technique that allows developers to write regression tests for outages or previous system failures, and runtime fault selection, a way to use Filibuster incrementally to increase resilience test coverage as code moves through a continuous integration pipeline.

  • Deterministic replay: When failures occur while testing with Filibuster, it outputs a counterexample file to the file system. Developers can use this file to replay a particular service failure step-by-step – for example, while walking through it with Java’s interactive debugger – and to write unit, integration, or functional tests that put the system back into the failed state, building a regression suite that simulates past outages.
  • Runtime fault selection: Filibuster can be fully configured so that certain faults are tested as part of a local development environment while other faults are tested as part of continuous integration. For example, to keep the local development environment fast, only common faults are tested for each service, and the full list of failures is used on each commit to a branch or as part of a nightly release build. All of this can be specified as part of a configuration file supplied to Filibuster to enable test environment flexibility, as sketched below.
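To illustrate the idea only – this is not Filibuster’s actual configuration format – a runtime fault selection policy could be expressed along the following lines, with a small fault set for local development and an exhaustive one for CI:

```python
# Hypothetical fault-selection policy, expressed as plain Python data purely for
# illustration; the real tool's configuration file format may differ.
FAULT_SELECTION = {
    "local": {
        # Keep the local development loop fast: only the most common faults.
        "faults": ["connection_error", "timeout"],
        "max_combination_size": 1,
    },
    "continuous_integration": {
        # Exercise the full fault space on each commit or nightly release build.
        "faults": ["connection_error", "timeout", "http_500", "grpc_unavailable"],
        "max_combination_size": 2,
    },
}
```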

Adapting Filibuster to DoorDash

At the start of the internship, Filibuster was designed to test only microservices implemented in Python, using Flask, that communicated strictly over HTTP for RPCs. DoorDash uses Kotlin on the JVM for service implementation and uses both HTTP and gRPC for interservice RPCs.  

To adapt Filibuster for use at DoorDash, the following had to be done:

  1. First, Filibuster had to be extended to support Kotlin and gRPC. While extension to gRPC was straightforward, extension to Kotlin and the JVM proved more difficult because of the concurrency primitives that both provide. In fact, this resulted in several modifications to Filibuster’s core algorithms to support RPCs that occur within concurrent code, which was not a significant concern in Python because of its lack of true concurrency and parallelism.
  2. Second, Filibuster relied on application-level instrumentation of services to support fault injection. But application-level code modification wasn’t a viable option for DoorDash, both because of the effort required and because of the overhead involved in maintaining a change set solely for resilience testing. Therefore, I worked with DoorDash engineers to devise a strategy to dynamically add this instrumentation at runtime, without code modification, to support Filibuster’s required fault injection; a conceptual sketch of RPC-layer fault injection follows this list.
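The snippet below sketches the general idea of injecting faults at the RPC layer without touching application code, using a Python gRPC client interceptor. DoorDash’s implementation targets Kotlin/JVM services through runtime instrumentation, so this is only a conceptual sketch of the technique; the method names to fail are placeholders.

```python
import grpc


class FaultInjectionInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Injects failures for selected RPC methods without modifying the caller."""

    def __init__(self, methods_to_fail):
        # e.g. {"/payments.PaymentService/Charge"} -- hypothetical method name
        self._methods_to_fail = set(methods_to_fail)

    def intercept_unary_unary(self, continuation, client_call_details, request):
        if client_call_details.method in self._methods_to_fail:
            # Simulate the dependency being unavailable instead of calling it.
            raise ConnectionError(
                f"injected fault for {client_call_details.method}"
            )
        # Otherwise, forward the call unchanged.
        return continuation(client_call_details, request)


def instrumented_channel(target, methods_to_fail):
    """Wrap a channel so every stub created from it is fault-injectable."""
    channel = grpc.insecure_channel(target)
    return grpc.intercept_channel(
        channel, FaultInjectionInterceptor(methods_to_fail)
    )
```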

To see how Filibuster can be used on a Java application, review this YouTube video, which demonstrates a concrete result of the internship.

DoorDash has only started its resilience journey with Filibuster. At the end of the summer, a prototype of Filibuster was working on two of DoorDash’s services running in the local development environment, demonstrating the approach’s viability for real-world microservice code. DoorDash plans to extend Filibuster’s reach in the coming months to continue improving the resilience of its application.

To learn more about Filibuster, watch this video overview of the service-level fault injection technique (or read our paper).

To continue learning more about fault injection research, follow Christopher Meiklejohn on Twitter or through his blog.

Do you find this work interesting and exciting? We’re hiring in the area of resilience engineering at DoorDash! Apply here.

An optimized merchant selection strategy has been one of the key factors that has enabled DoorDash to become an industry leader in US food delivery. DoorDash was founded later than many companies in this industry, but we have effectively onboarded high-value merchants to ensure the selection in every market matches customer demand. We were able to do this quickly by developing machine learning models that identify which merchant characteristics predict success on the platform while ensuring every market’s selection reflects customer food preferences and includes popular local merchants. These models provide intelligence to our merchant sales teams so that they can effectively find and onboard merchants that will delight customers and expand the total addressable market (TAM) for DoorDash by adding increased selection to the platform.

What a model needs to inform merchant selection 

To identify the potential market, we need to know what value off-platform merchants can bring to DoorDash consumers. To help our sales team recruit merchants that will provide a better selection, we need to rank the off-platform merchants accurately to achieve high efficiency with limited resources. Specifically, we need models that can discern  numerical answers to a number of abstract questions: 

  • The value of a restaurant – What characteristics make a restaurant a valuable addition to our platform?
  • How can we evaluate the addressable market in any given region? 
  • How can we determine what selection is missing and onboard merchants that offer those types of food for our consumers?

To address these areas of market intelligence, our next step is to figure out what kind of machine learning models we need in order to build our market intelligence system. Ideally, these models will understand what makes a merchant successful and what consumers want on the platform. 

Specifically, we need a system that: 

Allows us to compare merchants fairly at different time periods (based on their time using the platform): To measure merchant performance fairly, we need to collect success metrics in a given timeframe and label them properly so that we can train the model on what successful merchants look like.  

Builds features for model prediction: Once we have our labels, we need to develop the infrastructure to build features that will use the data to predict the above identified success metrics. Because our business is constantly expanding, this feature engineering needs to be scalable, allow for fast iterations, and work in global and local marketplaces. 

Validates model performance: To choose the best model that will be most predictive with our features, we need to set up criteria for the success metrics that will be most valuable to us. This may involve evaluating different tradeoffs to determine which is best for our use cases. 

Selection system design

To kick things off, we start with the overall architecture. As seen in Figure 1, we collect merchant data from various sources and create relevant features on a daily basis in our feature store. 

These features capture traits such as business type and whether a merchant is a local business or part of a chain. The features are then used to generate the labels and training datasets. After data preprocessing, we train a set of XGBoost models for different merchant segments to predict how successful merchants will be when they join the platform. Model performance metrics are fed into multiple monitoring dashboards to make sure the models don’t degrade over time. We also generate daily predictions for all off-platform merchants in our system based on their most up-to-date information. Databricks packages are used for model management and daily prediction. In addition to the model performance dashboard, we also provide dashboards for model interpretation and ad hoc debugging.

Figure 1. Merchant selection ML model workflow

Building the feature store

For each merchant, we built features to cover a variety of aspects, including:

  • Consumer intent: Consumer demand is a big component in this process. We learn from consumers based on their platform behavior and consumer intent signals. However, these types of features are typically sparse, which requires us to pick ML models that handle sparse features well.
  • Merchant information: To capture the differentiated quality of our potential merchants, we generate a collection of features such as geolocation, cuisine type, reviews, and business hours.

Training ML models to understand merchant values 

We begin by selecting model labels. To understand the value of off-platform merchants, we use data from existing merchants to discern what makes a merchant successful and valuable to the platform. In the process of label selection, business input and OKRs are among the most important factors that should feed into the ML process. 

The second important consideration is how we define the true success of merchants. For example, most merchants require time to ramp up their DoorDash operation; new merchants therefore don’t appear to be immediately successful because of the onboarding tasks involved. We need to develop labels calculated over the period of time when new merchants would most likely show their true potential.

As seen in Figure 2, when generating the labels for training datasets we use stable sales after activation. We ensure the aggregation window is consistent for all merchants to make them comparable. Some special considerations do have to be made for merchants who activated and deactivated multiple times.

Figure 2. Time window for label generation
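As a rough illustration of the idea – the column names, ramp-up period, and window length below are made up for this example, not our production values – label generation with a consistent aggregation window might look like this:

```python
import pandas as pd

# Hypothetical columns: merchant_id, activation_date, order_date, sales.
RAMP_UP_DAYS = 30   # skip the onboarding ramp-up period (assumed value)
WINDOW_DAYS = 90    # measure success over the same-length window for everyone


def build_success_labels(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate each merchant's sales over an identical post-ramp-up window."""
    orders = orders.copy()
    days_on_platform = (orders["order_date"] - orders["activation_date"]).dt.days
    in_window = days_on_platform.between(RAMP_UP_DAYS, RAMP_UP_DAYS + WINDOW_DAYS)
    return (
        orders[in_window]
        .groupby("merchant_id", as_index=False)["sales"]
        .sum()
        .rename(columns={"sales": "stable_sales_label"})
    )
```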

Ensuring model performance while building up the merchant base

In the early days of acquiring more selection for the platform, we suffered from a low sample size, which was one of the top reasons for poor model performance. There are many ways to augment training data in the machine learning world. In our case, we chose to include existing merchants who have been on DoorDash for a longer time in the training data to increase the sample size. 

Model training and selection based on different business sizes and DoorDash interaction histories

During the prior two steps, we generated the training dataset with appropriate labels and related features. At this point we need to pick the right model and corresponding loss function. Because the label we chose to predict is a continuous variable, we compared several standard regression models and tree-based regression models, including linear regression models and LightGBM models. 

We not only look at the aggregated model performance when choosing the best model, but also evaluate per-segment performance to ensure sales teams are able to operate at the lowest level of detail. Some segment examples include: 

  • Are they new to DoorDash? 
  • Are they chain-store merchants?
  • Are they local merchants? 

We found that discrete merchant segments demonstrated significant differences in the relationship between the success label and the feature sets, which led us to build a separate model for each segment. 

Finally, we picked XGBoost models with tuned hyperparameters. At this point, we concluded our model training steps and moved on to inference.
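The per-segment setup can be sketched as follows; the segment names, feature columns, and hyperparameters here are placeholders rather than our production configuration.

```python
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical segments; in practice these come from merchant attributes.
SEGMENTS = ["chain", "local", "new_to_doordash"]


def train_segment_models(df, feature_cols, label_col="stable_sales_label"):
    """Train one regressor per merchant segment and report holdout R-squared."""
    models = {}
    for segment in SEGMENTS:
        segment_df = df[df["segment"] == segment]
        X_train, X_test, y_train, y_test = train_test_split(
            segment_df[feature_cols], segment_df[label_col], test_size=0.2
        )
        model = xgb.XGBRegressor(
            n_estimators=300, max_depth=6, learning_rate=0.05  # placeholder values
        )
        model.fit(X_train, y_train)
        print(segment, "holdout R^2:", r2_score(y_test, model.predict(X_test)))
        models[segment] = model
    return models
```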

Generating valid daily predictions going forward

After training the models, we generate daily predictions for each off-platform merchant. Our business partners rely on these predicted values to understand the latest market trends. For example, some may wish to know the market potential for a new service we want to provide and whether it should be prioritized in the future. To ensure such market trends are captured, the models are typically retrained every week or month using our internal infrastructure based on Databricks MLflow. At prediction time, registered models are pulled automatically from a model store to perform predictions with features processed in Snowflake. Standard ML metrics are logged during training, including R-squared and ranking metrics. 
Business operations not only rely on the raw prediction but often utilize the reasoning behind a specific estimate. Therefore, we focused on model interpretability using Shapley values. Besides overall feature importance, one advantage of Shapley values is that they provide the direction of each feature’s impact. These are often used by the business to understand the market-level situation. This view is also helpful for monitoring the models over time: if there are big changes in important features, we dig into the training data to make sure there are no errors in the model training process.

Figure 3. Top 20 important features in the core model
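A minimal sketch of how Shapley-value explanations can be produced for a tree model with the shap library is shown below; `model` and `X` are assumed to be a trained XGBoost regressor and its feature matrix from the previous step.

```python
import shap

# `model`: trained XGBoost regressor for a segment; `X`: its feature DataFrame.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: mean absolute SHAP value per feature gives overall importance.
shap.summary_plot(shap_values, X)

# Local view: explain a single off-platform merchant's prediction, including
# the direction in which each feature pushed the estimate.
# (In a notebook, shap.initjs() may be needed for the interactive plot.)
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
```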

ML system outputs and their downstream application

The ML models generate two predictions for each off-platform merchant: its rank in the local market and its success value. These are the two most important metrics for the operations team to prioritize leads for sales and calculate DoorDash’s addressable percentage for each market. For example, Figure 4 shows how we rank leads in one of our markets. To correctly calculate the addressable market, this process sorts all local merchants’ values, including both off-platform and on-platform merchants.

Figure 4. Example of merchant ranking based on Mx values

Evaluating the business value of models

After we have a set of predictions, we need to set up model validation. This is particularly challenging in our case, as sales in many businesses usually have long closing cycles. To translate offline model performance into true business impact metrics, we created two types of metrics – weighted decile CPE and decile rank score – to measure the quality of the predictions and rankings, tailored to our actual sales lead allocations.

Weighted decile CPE is used to track the performance of model predictions. We calculate the percentage error by comparing the predicted sales and the actual sales for merchants with the same decile rank. Business inputs were collected to create additional weights during the calculation. 

Decile rank score measures whether the predictions are able to rank merchants correctly, which places less requirement on the accuracy of the predicted values. It uses a balance score table to calculate the difference between predicted ranks and actual ranks; the greater the difference, the higher the balance points. The table in Figure 5 shows the balance points associated with each actual vs. predicted decile rank combination. We calculate the weighted average score based on the merchant count in each cell.

Figure 5: Balance scores for actual decile rank vs. predicted decile rank difference

These metrics measure the model performance in two ways. The weighted CPE shows how accurate our predictions are, indicating how well we understand our potential market; it’s important for planning and for setting appropriate quarterly growth goals. The decile rank score measures how accurately we rank our off-platform merchants. As we prioritize which merchants we want to acquire each quarter, we want to pursue those who bring the most value to DoorDash customers. 
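Our exact balance-score table and business weights aren’t reproduced here, so the helpers below are only a rough sketch of the mechanics of decile-based evaluation, with hypothetical column names.

```python
import numpy as np
import pandas as pd


def decile_rank(values: pd.Series) -> pd.Series:
    """Assign decile ranks within a market (1 = highest value, 10 = lowest)."""
    return pd.qcut(values.rank(method="first"), 10, labels=False).rsub(9) + 1


def weighted_decile_pct_error(df: pd.DataFrame, weights: dict | None = None) -> float:
    """Average percentage error between predicted and actual sales per decile.

    `weights` is an optional per-decile weight dict supplied by the business.
    """
    df = df.assign(decile=decile_rank(df["predicted_sales"]))
    per_decile = df.groupby("decile").apply(
        lambda g: abs(g["predicted_sales"].sum() - g["actual_sales"].sum())
        / g["actual_sales"].sum()
    )
    w = pd.Series(weights) if weights else pd.Series(1.0, index=per_decile.index)
    return float((per_decile * w).sum() / w.sum())


def decile_rank_score(df: pd.DataFrame, balance_table: dict) -> float:
    """Average balance points over merchants, keyed by (actual, predicted) decile pair."""
    actual = decile_rank(df["actual_sales"])
    predicted = decile_rank(df["predicted_sales"])
    points = [balance_table[(a, p)] for a, p in zip(actual, predicted)]
    return float(np.mean(points))
```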

Other components offered by merchant selection ML platform

Merchant selection is a complex process with multiple steps. The above model is one of the most important steps. Throughout the process, our merchant selection ML platform also offers models to answer other business questions, such as: 

  • How can we ensure an off-platform merchant is an open restaurant? 
  • Can the merchant fill the demand gap for certain items in specific geographical locations? 
  • What’s the probability that the merchant will come to DoorDash? 

These are all important questions to consider when allocating our sales resources.

Conclusion

Many business-oriented applications need to forecast the value of potential merchants in order to allocate resources more appropriately – whether that means assigning sales resources, developing personalized promotion packages, or estimating potential product sales. We have seen great value from investing in such intelligence to improve operational efficiency.

Welcome Liangxiao to DoorDash!

Liangxiao is a powerhouse of knowledge, skills, perseverance, and empathy. Her extensive technical experience coupled with her ability to distill complex problems into simple and straightforward solutions is nothing short of extraordinary. Based in the Seattle office, Liangxiao came to DoorDash after having spent almost 10 years at Meta, where she most recently led efforts around Payments. This included enabling ad billing for businesses, creating a system to support creators in monetizing their videos and posts, and enabling Donation, a product for people around the world to contribute funds to causes that are important to them. She also launched Facebook Pay to empower everyone around the world to participate in the digital economy. Before leading on Payments, she spent nearly seven years building out the Meta ads product platform. This included building products that helped SMBs connect with their customers online and supported large enterprises in measuring and tracking global campaigns. 

For Liangxiao, DoorDash is the ultimate platform that has the power to connect people and businesses to everything they need locally at the touch of a button. She’s excited to build upon its existing products and lead DoorDash into new verticals in order to continue to transform and propel the future of local commerce. 

To get to know her better, here are six questions with Liangxiao:

  1. Why DoorDash and why now? 

There are three reasons why I chose DoorDash. First, I’m a strong believer in empowering local commerce. There’s so much potential in terms of how best to grow this sector but more importantly, the possibilities are endless when it comes to powering the local economy that supports people and gives back to businesses. Secondly, when I met with our company’s leaders, it was clear from the beginning that they had a strong vision and strategy for where they wanted to see DoorDash grow. This was very appealing to me. It showed me their confidence and enthusiasm in this space and the company. It was inspiring to see the clarity with which they created strategies and execution plans, and how they measured success. Lastly, I gravitate toward a work environment that is built on the value of bias for action – whether you’re building something from scratch or scaling a product. DoorDash is building so many cool products, and I’m excited to be here to help teams build transformational products, have fun along the way, and help streamline cross-functional alignment so that we can move effectively. 

  2. What excites you most about building and nurturing engineering teams? 

I’ve been lucky enough to have had many opportunities in building and scaling teams. At first, I was very reluctant to become a manager but one of my bosses saw potential in me and put me forth to take over his role as he was exiting the company. I learned from there that I enjoy connecting and nurturing talent. I like helping people find their passion, create a vision for themselves, and supporting them along the journey whatever that may be. I also find myself seeking out opportunities that can help further build diverse teams including being a site leader, leading on diversity and inclusion efforts, or mentoring other managers. 

  3. Can you tell us more about your grassroots D&I efforts and how you have rallied people within those organizations to become champions and allies? 

Growing up as an Asian American, I was taught to listen and follow leaders and authority figures. Furthermore, as a woman, I felt that I could never speak up until I was 100% confident in my opinions. The combination of my upbringing and a lack of confidence really slowed down my early career progression. It was only after I overcame these hurdles that I knew I wanted to help people like me grow in their own journey. 

Previously at Meta, I led the Seattle/Bellevue office’s D&I efforts. We started by creating different working groups to focus on building a diverse environment where people can find various communities that could nurture their talent and needs while working in a corporate environment. We also created trainings to educate people to be allies, learn how to amplify good ally behaviors, and so much more.

  4. What is a motto that you live by? 

“Opportunities don’t happen. You create them” – I believe that everything is possible as long as you make it a priority and have the courage to go for it. You need to be optimistic about the future you are creating. You have to develop the steps on how to get there and, of course, hold yourself accountable to achieve these goals.

  5. Outside of work, tell us about yourself! 

I have two boys so I spend a lot of time with them. I love to cook and garden. It brings me a lot of peace, joy, and patience! 

  6. What is your favorite thing to order on DoorDash? 

Sushi and organic milk

Taking full advantage of a large and diverse set of machine learning (ML) use cases calls for creating a centralized platform that can support new business initiatives, improve user experiences, enhance operational efficiency, and accelerate overall ML adoption.

For a hypergrowth company like DoorDash, building such a system from the ground up is no small task. As you can see from Figure 1 below, in a relatively short amount of time we have been able to quadruple the number of models and 5x the number of predictions that our system is able to handle. Among other things, this huge endeavor requires recruiting a high-performing team that can lead a thoughtful and intentional collaboration model with the data science community. In this article, we will share DoorDash’s journey of building a centralized ML platform that leverages the principles of “dream big, start small,” “1% better every day,” and “customer obsession” to support our ongoing growth, meet the needs of diverse ML use cases, and overcome the challenges of applying ML at scale.

Figure 1 – The growth of models in production and total predictions per week

What is an ML platform? 

At the highest level, an ML platform consists of infrastructure, services, tools, and libraries to support the end-to-end ML development process. That highly iterative process is a scientific endeavor that requires ongoing experimentation over the course of multiple steps, as depicted in Figure 2. The faster data scientists can go through this iterative process, the faster they can come up with solutions to business problems. 

Figure 2 – The machine learning development process involves many steps which ideally are sped up in a high-functioning ML platform

Many aspects of machine learning development are complex and technical. In order for data scientists to move through this iterative process quickly, they need software engineering solutions to abstract the underlying complexity, perform feature engineering, and speed up model development at scale. The ML platform centralizes these abstractions. For example, in the feature engineering step, the platform provides a declarative way of performing feature engineering logic, during which it figures out how to execute the logic, orchestrate the necessary computations, and secure the necessary compute resources. Similar abstractions are provided throughout the ML development lifecycle and are often featured in an ML platform. 
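As a purely hypothetical sketch – not DoorDash’s actual feature store API – a declarative feature definition could look something like the following, where the data scientist declares intent and the platform handles execution, orchestration, and compute provisioning.

```python
from dataclasses import dataclass


@dataclass
class FeatureDefinition:
    """Declarative description of a feature; the platform decides how to run it."""
    name: str
    entity: str          # e.g. "merchant" or "consumer"
    source_table: str    # warehouse table the feature is derived from
    aggregation: str     # e.g. "sum", "avg", "count"
    window_days: int     # lookback window for the aggregation


# The data scientist only declares intent; orchestration, computation, and
# resource provisioning are handled by the platform.
avg_order_value_30d = FeatureDefinition(
    name="merchant_avg_order_value_30d",
    entity="merchant",
    source_table="orders",
    aggregation="avg",
    window_days=30,
)
```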

The principles we applied to build and scale our ML platform 

Given the complexity of an ML platform, a principled approach is required to achieve success. At DoorDash, we used three key principles: 

  • Dream big, start small
  • 1% better every day
  • Customer obsession

These principles guided us to clarity in setting a direction and outlining a roadmap, anticipating the needs of our customers, delighting them with well-crafted components of the platform, and incrementally improving the infrastructure based on customer feedback and what we learned along the way. 

The following delves into each of our key principles and illustrates how following these principles has enabled us to support our data science users and scale our ML platform. 

How “dream big, start small” helped us navigate  

To realize our goals, we first established a clear vision of what the completed ML platform would look like. Establishing that big-picture goal gave us a north star by which we could navigate. To develop that dream, we studied industry-leading ML platforms such as Michelangelo from Uber, Pro-ML from LinkedIn, FBLearner from Facebook, and TFX from Google. With those in mind, we then gathered an understanding of DoorDash’s ML use cases and specific needs. Merging this research, we developed a product vision document that contained the ultimate vision for what we wanted, the north-star metrics to get there, a one-year roadmap, and the strategic bets we would have to place. What we discovered throughout this process was that, while the core capabilities of most ML platforms are quite similar, what tends to set them apart and helps with successful adoption is a set of strategic bets established going in. 

With that in mind, we established the following strategic bets:

  • Focus on platform velocity – We strongly believe in automation via such things as tooling and infrastructure to accelerate iteration speed and bring ML models from idea to production faster.
  • Building a machine learning platform-as-a-service – We believe providing a cohesive set of components that work in concert to automate the entire ML pipeline and manage ML artifacts will improve the platform’s user experience and general usability. 
  • Commitment to observability – Model predictive performance can decay with time or show unexpected results. We want our users to be able to know about decay, manage it, and take corrective actions quickly to resolve underlying issues for all models and features they build on the platform.

Focusing on these strategic bets does not imply that the ML platform’s inherent characteristics are not important. Scalability, reliability, usability, and other fundamental factors remain critical to success. Rather, the strategic bets act as guiding lights to help us stay on course throughout our journey toward building an ML platform best-suited to meet DoorDash’s unique and ever-growing needs.

What it means to start small  

After we pursued the “dream big” part of our working principles, we knew we needed to “start small.” Starting small encourages us to make meaningful progress and impact incrementally while remaining strategic about where we should double-down. In a fast-moving company like DoorDash, we don’t have the luxury of time involved in building an ML platform using a master plan with sequential steps. We needed to start creating value for our customers fast. 

Starting small with the Sibyl prediction service 

Rather than opting for either of the most common approaches to creating an ML platform – sequentially or slowly fleshing out a full but barebones system – DoorDash went a different route. We started small with a laser focus on building a single core component called prediction service, which we knew would bring meaningful results for our customers.

The logistics team was one of the first DoorDash teams to heavily utilize ML. Their ML use cases revolve around the order dispatch optimization problem, and their prediction service plays an integral part in solving it.

At the beginning of the COVID-19 pandemic, DoorDash food orders multiplied rapidly. The logistics team’s prediction service needed a facelift to keep up with the increased model prediction volume. We partnered with the team to better understand the scaling challenges, their ML model types, prediction latency, and feature volume. Then we married their needs with our long-term vision for the ML platform – supporting a diverse set of use cases – to create our Sibyl prediction service, which performs online predictions at high throughput and low latency. Among its notable capabilities are batch predictions, model shadowing, and feature fetching. After Sibyl was up and running, we worked closely with the logistics team to migrate their models onto the new service. That process had its own interesting challenges, which we have previously detailed in this blog post. The migration was completed successfully, with the new prediction service able to handle the logistics team’s scalability, throughput, and latency requirements.

While the product vision gives us a path toward building the ML platform, starting small, demonstrating progress, and then doubling down when an idea takes shape leads to meaningful business impact. Our success with onboarding impactful use cases first from the logistics team and then from the search and discovery team proves that the “dream-big-start-small” principle is an effective approach to building large and complex projects such as an ML platform.

1% better is about iteration not perfection

The “1% better every day” principle reminds us that constant and never-ending improvement will lead to sustainable and transformative change. As the ML platform adoption takes on more data science teams and use cases, it is imperative to monitor for needed improvements and address customer pain points and feedback. 

Operating at scale shines a light on inefficiencies

As the number of ML use cases increased, demand on the ML platform escalated to support billions of predictions per day and to store billions of features. The higher the demand, the more inefficiencies made themselves known, including feature store space usage, cost, and manageability.  

To detect surprises and make adjustments as needed, we regularly tracked the ML platform’s progress to ensure it was following its north star goals and that secondary metrics were showing progress. At one point, we noticed the feature volume was increasing at an alarming rate, which translates to additional cost and operational overhead. Once the reason for the increased feature volume was clear, we investigated how features could be stored more efficiently. We objectively assessed different storage solutions and optimization options via benchmarking them. The final optimization we implemented reduced costs three-fold and cut feature fetching latencies by 38%. The details of the benchmark and optimizations are described in detail in “Building a Gigascale ML Feature Store with Redis, Binary Serialization, String Hashing, and Compression.” The experience demonstrated how following the “1% better” principle, rather than striving for elusive perfection, results in constant improvements to our platform as it continues to expand to meet the needs of our customers.

Not all improvements require a technical solution

To us, customer experience is just as important as platform capabilities. As DoorDash grows, we’re bringing on more data scientists every month. Recently, our biannual customer survey revealed a need for a proper onboarding experience for new data scientists so they can be productive during their first three months at DoorDash. Each component of the ML platform had its own onboarding documentation, but they were not tied together to capture the big picture, such as best practices and how various components fit together. So the team leveraged the existing documentation to create more comprehensive onboarding content for the new hires. After the first onboarding workshop, we received positive feedback from the survey about the onboarding process and the data scientists’ level of comfort using the ML platform. Now, not only are new personnel more productive from the start, but our team receives fewer support requests to help get them up to speed.

Recognizing when an improvement is needed requires a clear picture of where things are and the direction they are going. That means continuous tracking of key measures and ongoing incremental investments in making improvements – the embodiment of the “1% better every day” principle.

Customer obsession keeps us ahead of customer needs

The precepts around customer obsession found in a retail environment also apply to meeting the needs of internal customers. By establishing the principle of customer obsession early on, we have been able to stay connected, create a delightful experience, and be one step ahead of our customers’ needs.

As detailed below, customer obsession is accomplished through understanding use cases and success metrics, applying the Golden Rule, and anticipating needs with what we call “french fry moments.”

Understanding customer use cases and their success metrics

Building a successful ML platform requires more than getting the technology right. It also requires meeting evolving customer needs over time. There are a few ways to learn about those needs, but one of the most effective approaches within DoorDash involves developing a one-pager – a report that details a customer project’s use case, its success metrics, and its estimated business impact.  Armed with this information, we can prioritize enhancements through a task stack rank process, keeping a close eye on overall business impact. Knowing what our customers need and why also gives our team perspective on how their work impacts DoorDash overall, motivating everyone to stay focused and ensure on-time delivery.

Applying the Golden Rule to support customers  

Customer support is one of the key ingredients of a successful ML platform, so we support our customers in a way that we would like to be supported. We also commit to providing customer support promptly and with respect and fairness. When a request has been fulfilled, we ensure satisfactory closure.

Customers come to us when they encounter problems while using our platform or when they are unsure of what to do in certain situations. We are mindful about the challenge of striking a balance between unblocking our customers and being overwhelmed with a high volume of support requests. As the platform’s capabilities expand and more customers use it, it is critical to evaluate the support load frequently and make any adjustments needed to address increased support issues. At the weekly team meeting, in addition to discussing the critical support issues, we also discuss customer support volume to better understand where the additional volume comes from.  As the data science team size increases, the support volume around the model deployment goes up. After we invested in automating the model deployment process, the support load for this area went down dramatically.  

To help balance good customer support against our limited bandwidth, we:

  • Incorporate customer support time into the quarterly planning process
  • Conduct weekly reviews of support issues to detect gaps and underlying problems 
  • Continuously update the FAQ wiki page to address repeated questions quickly with minimum effort
  • Organize group customer onboarding sessions to reduce volume of repeat questions 

Focusing on our customers and staying connected to them not only makes them happy, but also motivates our team members to build and deliver impactful solutions.

Delight customers with “french fry moments”  

Google’s phrase “french fry moments” refers to the concept of anticipating needs. The concept was created after an executive saw a scene on the sitcom 30 Rock in which Tracy Jordan’s character becomes outraged after he receives the burger he ordered but not the fries he did not order, prompting him to yell: “Where are the french fries I didn’t order? When will you learn to anticipate me?”

This concept motivates us to go beyond customer feedback and anticipate customer needs. We’ll discuss how to bring about these “french fry moments”with a few examples from our past work. 

During the initial release of the Sibyl prediction service, we noticed an important process was slow and manual. We had provided a way for users to test their models during the migration of existing models to Sibyl; the testing procedure involved creating a Python script to make gRPC calls to test and validate model predictions before deploying those models to production. As more data scientists joined DoorDash, however, we observed that this manual process was not scalable, slowing the ML development process and generating repeated questions about putting together the Python script. Without any prompting from our customers, we automated the model testing process by building a simple web application to enable data scientists to test their ML models easily using their browser and a few mouse clicks. The end result of this preemptive thinking: happy customers, proven productivity improvements, and a reduced support load for us. 

Sometimes french fry moments come simply from knowing what’s best for the customer. Because we have more access to performance data about our systems, we can own expected outcomes. When our systems are not working as intended, we can step in, improve our systems, and deliver a french fry moment without any direct user feedback prompting it. For example, when we first released our feature quality monitoring capability (as outlined in “Maintaining Machine Learning Model Accuracy Through Monitoring”),  we required an onboarding step to take advantage of the feature.  We saw that adoption was limited and became curious about why data scientists didn’t take advantage of it even though they knew this feature would help detect model prediction issues quickly. We discovered that the  onboarding step was actually a friction that hindered adoption of the monitoring tool we had built. So, in the second release of the feature quality monitoring capability, we enabled complete monitoring for all features, eliminating the onboarding step entirely. Our swift action delivered a french fry moment, streamlining processes and delighting customers without requiring that they say a word.

The french fry moment concept encourages us to tap into our creative thinking to delight our customers with solutions that don’t require prompting from them. Sometimes we end up benefiting from those solutions ourselves, creating a win-win scenario for everyone.

Future Work

Now that we have established a good ML platform foundation to build on, we are pursuing ambitious goals as we look toward the future. We plan to advance our platform to provide more value to our customers and to support more challenging use cases to meet expanding business needs.

  • Build feature engineering and model training at scale. Large and complex use cases – such as search and recommendations or ads and promotions – require continuous model training with billions of feature values to provide optimal predictive power. Creating and maintaining large feature pipelines and training large ML models require an efficient and scalable distributed computation and model training infrastructure.  
  • Double down on the ML portal. This is the web UI for data scientists to manage their machine learning workflow. As the ML platform capability expands, it is increasingly important to provide an easy-to-use self-service way for data scientists to automate their machine learning workflow as much as possible.
  • Create self-service ML observability. The more models that are onboarded to the ML platform, the more there are at stake. We would like to add advanced ML model monitoring and debugging capabilities so that data scientists can quickly identify and debug model prediction quality issues or quality degradation.
  • Enable model prediction flexibility and scalability. We anticipate there will be more image recognition and NLP-related use cases soon. As such, it is imperative to evolve the current ML model prediction infrastructure to be more scalable, more flexible to support both simple and complex use cases, and more efficient to meet business growth.

Acknowledgements

Without a doubt, building such a vast and complex infrastructure to support the ML development workflow for a diverse set of use cases requires more than just a team; it requires a whole village. First, I would like to acknowledge the following folks for bootstrapping the ML platform: Sudhir Tonse, Alok Gupta, Param Reddy, and Cody Zeng. Second, I would like to thank the infrastructure organization under the leadership of Matan Amir for its strong infrastructure support, as well as thank the data science community under the leadership of Alok Gupta for being a close partner and helping to shape the ML platform development. And lastly, I am so proud of and appreciate the entire ML platform team for all their hard work, dedication, and customer obsession – Arbaz Khan, Swaroop Chitlur, Hebo Yang, Kornel Csernai, Kunal Shah, Dhaval Shah, Nachiket Paranjape, Brian Seo, Ben Lerner, Steve Guo, Bowen Dan, and Siddharth Kodwani.

For DoorDash, being able to predict long-tail events related to delivery times is critical to ensuring consumers’ orders arrive when expected. Long-tail events are rare instances where time predictions are far off from actual times and these events tend to be costly. While we previously improved long-tail predictions (as explained in this previous post) by tweaking our loss function and adding real-time signals, we wanted to improve the predictions further by using a gamma distribution-based inverse sample weighting approach to give long-tail events more weight during model training. In this post, we will go into detail about how we used gamma distribution to improve model performance.

Significance of accurate long-tail predictions

Before we dive deeper, let’s quickly recap why the accuracy of DoorDash’s delivery estimates (ETAs) is so important. ETA (estimated time of arrival) is a promise to the consumer of the time when the order will arrive at their door. Under-prediction results in a really bad ordering experience (a late delivery) and over-prediction (giving a higher estimate) might result in consumers not placing an order or getting a delivery before they get home to receive it. Moreover, delivery times are difficult to predict, particularly because of the wide variability in order preparation times, road traffic conditions, availability of Dashers (delivery drivers), and time to find parking and navigate to residential addresses. The unpredictability of these components leads to costly long-tail events. To improve these estimates, it is important to model the distribution and tweak the loss function to better penalize long-tail events (both early predictions and late predictions). 

Finding the best distribution to model delivery times

In our previous post, we explained that we can tweak the loss function and add real-time signals to our machine learning (ML) model to improve accuracy. To iterate and improve on those tweaks, we wanted to employ the loss function that replicates our delivery time distribution as accurately as possible.

We explored a few distributions before finding the best fit, as shown in Figure 1. Let’s start with a brief review of the commonly used distributions:

  • Normal distributions are commonly used in the natural and social sciences to represent real-valued random variables whose distributions are unknown, because they are symmetrical, mathematically convenient, and fit a wide collection of events. You might know this distribution from the fundamental central limit theorem.
  • Log-normal distributions are used to model phenomena with fatter right tails and which take on only positive values (no left tail). For example, in highly communicable epidemics, such as SARS in 2003, the number of hospitalized cases follows a log-normal distribution. In economics, the income of 97%–99% of the population is distributed log-normally.
  • Gamma distributions are used to model phenomena with a long tail on the right side. For example, the sizes of insurance claims follow a gamma distribution, as does the amount of rainfall accumulated in a given time period.
Figure 1. Comparison between actual delivery time distribution and commonly seen distributions. 

To quantify the similarity of each candidate distribution to our empirical delivery time distribution, we used the Kolmogorov-Smirnov test (K-S test). The p-value produced by the K-S test indicates how similar two given distributions are: it ranges from zero to one, where values near zero mean the distributions are clearly different and values near one mean they are statistically indistinguishable. Typically, a p-value below 0.05 is used to reject the hypothesis that the two given distributions are the same. From the K-S test results in Table 1, we found that both the log-normal and gamma distributions fit our empirical distribution almost perfectly. In practice, these two distributions are often used to model the same phenomena. For our use case, we decided to use the gamma distribution to model delivery estimates. In the following section, we explain the fundamental characteristics of the gamma distribution and why they suit the problem we are trying to solve.

Distribution name           K-S test p-value
Normal                      0.512
Skew normal (asymmetric)    0.784
Log-normal                  0.999
Gamma                       0.999

Table 1. K-S test results for different distributions against the actual delivery time distribution.
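This comparison can be reproduced roughly with scipy, as sketched below; `delivery_minutes` and the input file are placeholders for the observed delivery durations, and the reported p-value is close to one when the fitted distribution is indistinguishable from the data.

```python
import numpy as np
from scipy import stats

# Placeholder: observed delivery durations, one value per line.
delivery_minutes = np.loadtxt("delivery_durations.csv")

candidates = {
    "norm": stats.norm,
    "skewnorm": stats.skewnorm,
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
}

for name, dist in candidates.items():
    # Fit the distribution's parameters to the data, then run a two-sided
    # K-S test of the data against the fitted distribution.
    params = dist.fit(delivery_minutes)
    statistic, p_value = stats.kstest(delivery_minutes, name, args=params)
    print(f"{name}: D={statistic:.3f}, p-value={p_value:.3f}")
```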

What is the gamma distribution?

Gamma distributions have been used to model continuous variables that are always positive and have skewed distributions. It is often used to describe the time between independent events that have consistent average time intervals between them, such as rainfalls, insurance claims, wait time, etc. In oncology, the age distribution of cancer incidence often follows the gamma distribution, whereas the shape and rate parameters predict, respectively, the number of driver events and the time interval between them. In transportation and service industries, the gamma distribution is typically used to estimate the wait time and service time.

The gamma distribution function has two parameters: a shape parameter and a rate parameter (Figure 2). The shape parameter α represents the number of independent events we are modeling. When the shape parameter (α) is equal to one, the gamma distribution becomes an exponential distribution. Thus, the gamma distribution is essentially the summation of several exponential distributions. In our case, a shape parameter (α) equal to four gives the best fit to the actual delivery time distribution curve. Since the delivery time is determined by numerous factors, this α value does not necessarily indicate that there are four independent events; it may imply that the underlying events driving the shape of the distribution are equivalent to four independent events. Keeping everything else constant, increasing α causes the peak of the probability density function (PDF) curve to move to the right. 

The rate parameter β represents the rate at which these events occur; its inverse, the scale parameter 1/β, is the average time between events. If we keep everything else the same, reducing the rate parameter (which means increasing the scale parameter) will cause it to take longer to observe the same number of events, resulting in a flatter PDF curve.

Figure 2. Gamma distribution PDF curves with various parameters. 

The PDF of the gamma distribution is given by equation (1):

f(x; α, β) = (β^α / Γ(α)) · x^(α−1) · e^(−βx),  for x > 0    (1)

Now that we understand the gamma distribution, the question arises about how we can use the gamma distribution to model sample weights in our training process.

Using gamma distribution to make better predictions

To make better predictions on the low-frequency long-tail events, we want to allow the model to learn more information from these events. This can be achieved by assigning more weight to these events during the model training process. We use a weight function to assign a weight to each data point. These were the steps we followed (a minimal code sketch follows the list):

  1. First, we find the best-fit gamma distribution for the delivery duration data. In our case, we found α=4 and β =1 to have the best fit.
  2. Next, we set the weight function inversely proportional to the PDF of the gamma distribution to allow the model to learn from the rare long-tail events. 
  3. We split the training data samples into two groups at the peak density (the Mode). The group on the left side of the peak density has a shorter delivery time, and the right side group has a longer delivery time. 
  4. The sample weights of the data points are tuned differently for samples on the left and right sides of the peak according to business constraints. This helps tune for earliness as well as lateness. 
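Here is a minimal sketch of this weighting scheme, assuming the fitted α = 4 and β = 1 from step 1, durations expressed in the same units used for that fit, and hypothetical left/right multipliers standing in for the business-tuned values.

```python
import numpy as np
from scipy import stats

ALPHA, BETA = 4.0, 1.0              # best-fit shape and rate from step 1
LEFT_SCALE, RIGHT_SCALE = 0.5, 2.0  # hypothetical business-tuned multipliers

# scipy parameterizes the gamma by shape and scale, where scale = 1 / rate.
gamma_dist = stats.gamma(a=ALPHA, scale=1.0 / BETA)
mode = (ALPHA - 1.0) / BETA         # peak density of the gamma PDF (alpha > 1)


def sample_weights(delivery_durations: np.ndarray) -> np.ndarray:
    """Weight samples inversely to their gamma density, tuned per side of the mode."""
    density = gamma_dist.pdf(delivery_durations)
    weights = 1.0 / np.clip(density, 1e-6, None)   # inverse-PDF weighting
    side_scale = np.where(delivery_durations < mode, LEFT_SCALE, RIGHT_SCALE)
    weights *= side_scale
    # Normalize so the average weight is 1, keeping the loss on a familiar scale.
    return weights / weights.mean()


# The weights can then be passed to the training call, for example:
# model.fit(X_train, y_train, sample_weight=sample_weights(y_train))
```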

Through this process, we reduced 20-minute lateness by 11% (relative). In particular, the super long-tail portions were better predicted (Figure 3). The gamma-weight-trained model produces twice as many long-tail predictions as the Asymmetric Mean Squared Error (AMSE) method. On top of these great results, certain characteristics of the gamma distribution enabled us to effectively tune our loss function to incorporate desired business outcomes.

Figure 3. Comparison of AMSE and gamma weight PDF on the actual delivery time distribution. 

Advantages of customizing gamma distribution

To achieve better business results, the model needs to make accurate predictions on the right long-tail events and make sure we are not over-predicting faster deliveries. The nature of gamma distributions helped us to accomplish both goals, allowing for wider distribution of prediction results and better coverage on earliness and long tail:

  • Earliness and lateness were better balanced by adjusting the weight according to business impact. 
  • The AMSE approach does help capture earliness and lateness nuances, but the gamma distribution clearly gives much longer-tail coverage, as you can see in the plot.
  • That said, there are some shortcomings of this methodology:
    • We need to tune an extra set of weights (left/right) during retraining each time.
    • In scenarios when you have continuous retraining through automated pipelines, keeping a constant set of left and right weights can cause degradations in model performance if these weights are not regularly fine-tuned as per the data.

Conclusion

We improved the prediction performance on long-tail events by tweaking the loss function with a customized sample weighting approach based on the gamma distribution. The AMSE is an asymmetric loss function designed to model a right-skewed distribution. It captures the long-tail events that follow a skewed quadratic distribution. However, the gamma-based weight function distribution is more similar to the actual distribution of delivery durations and has more advantages compared to AMSE:

  • Compared to the AMSE loss function, it decays slower on the long-tail part.
  • Separating the weight function at the mode of the distribution gives us more flexibility to tune both earliness and lateness. 

We have not stopped here, and we will continue exploring new options to improve the accuracy of our estimates. If you are passionate about building ML applications that impact the lives of millions of merchants, Dashers, and customers in a positive way, consider joining our team.

At DoorDash, our Parents@ Employee Resource Group (ERG) was designed to connect and support working parents and caregivers of every background, in every phase of their caregiving journey. Many of us personally know the immense benefit of having a community where we can share experiences, support, and feedback. To create this inclusive and desirable environment, our Parents@ ERG hosts events and connection opportunities throughout the year, including its unique twist on the annual Bring Your Kids to Work Day.

As many of us continue to work from home, DoorDash and our Parents@ ERG invited children to join their parents’ Zoom calls for a slew of events on Bring Your Kids to Zoom Day – a day designed to give kids an inside look into Life@ DoorDash.

We kicked things off with DoorDash leaders and parents coming together to discuss the experience of being a working parent at DoorDash and how they find support through the Parents@ ERG. Next, author Suzanne Lang joined us to read her book, “Grumpy Monkey”, followed by a coffee break with parents of young children who shared stories, swapped tips, and bonded over their experiences. Long-time Parents@ partner Superkin joined in on the festivities to present to the team on the hot topic of burnout: how to recognize its causes and symptoms, and how to avoid and recover from it. Finally, to close out the afternoon, we hosted an interactive scavenger hunt, where families engaged by searching for items commonly found in their homes.

In addition to supporting parents and caregivers through networking and events, DoorDash is proud to offer a variety of family benefits in the United States, including the following:

  • 16 weeks of paid bonding leave for birthing and non-birthing new parents, inclusive of adoption and fostering
  • Access to childcare referral network and discount for care
  • Flexible vacation days for salaried employees

To learn more about how we’re supporting parents of all backgrounds and to learn more about career opportunities at DoorDash, click here.

To safeguard our users’ accounts and prevent fraud, we sometimes ask users to verify their identity or confirm a transaction by completing a “user friction” such as two-factor authentication. User frictions, or verification steps designed to prevent fraud, are essential in combating fraudulent activity but are not always easy to implement quickly. At DoorDash, we sometimes find our suite of applications being abused and attacked due to unforeseen loopholes. In such scenarios, introducing a friction feature to those workflows is effective in preventing further damage. 

Our initial implementation of frictions focused only on the DoorDash consumer application. While that was effective in the short run, we soon realized the use cases extended beyond our consumer application and we needed to design the frictions with modularity and scalability at their core. To implement a more long-term solution, we opted to build a common library for all our frictions. This helps us easily integrate with different workflows and applications as we grow.

Why are frictions important and why a library for frictions?

User frictions are similar to checkpoints on a road: they are assigned to users our risk engine deems risky, and they effectively prevent fraud in applications before it happens. Most frictions, such as multi-factor authentication (MFA) or 3-D Secure (3DS), collect and verify user details to ensure authenticity. The information we collect might be something we already have on file to verify, or additional details. 

For example, we have a credit card re-entry friction that requires users to re-enter their saved card information at checkout when our risk engine detects a potentially fraudulent transaction. If there is an unauthorized account takeover, the fraudster (who does not have access to the original credit card) will not be able to proceed with a transaction. 

As DoorDash rapidly grows, it has become increasingly hard for us to implement and integrate new frictions quickly enough in response to fraud attacks. We determined we needed a common library for all our frictions so that different applications could be integrated quickly. This approach lets us:

  • Introduce friction to different workflows and applications
  • Increase ownership of the logic to render the frictions
  • Avoid the risk of bugs or disparity with multiple implementations
  • Have a common point of failure to debug and triage issues
  • Keep our code DRY (don’t repeat yourself)

Creating a library for risk frictions

We created our web library, component-risk, with interoperability and modularity in mind to reduce the time needed to integrate risk frictions into any workflow. Some examples of our frictions include multi-factor authentication (MFA), card re-entry, a second card challenge, and 3-D Secure (3DS).

These frictions are used across different workflows, including login, checkout, and edit profile, in different DoorDash and Caviar applications: consumer, Dasher (our name for a delivery driver), and merchant. Some of the workflows, such as login, are integrated as web views, thereby also serving mobile workflows.

Architecture of component-risk

Component-risk is a TypeScript library using React.js. As depicted in Figure 1, most of DoorDash’s web React applications have a BFF (backend for frontend) service that is used to communicate with our backend microservices. The applications consume the component-risk library, and the library can be configured to leverage each application’s BFF for frictions. For applications without a BFF, we use the old DSJ monolith or our in-house service called risk-bff.

 The tech stack for component-risk is:

  • React.js
  • TypeScript
  • Styled components
  • Apollo GraphQL
Figure 1: DoorDash’s web application architecture using the reusable web library component-risk

We use React Contexts for state management. Opting for a built-in state management solution like React Contexts allows us to keep the code simple and limit the bundle size. At its core, component-risk uses multiple contexts, as described in Figure 2, to maintain separation of concerns and manage state across the components that serve different frictions (a simplified sketch follows Figure 2):

  • Risk context: Externally exposed context that manages the state of component-risk at the application level. This is where state such as the Apollo client and URL proxy is maintained.
  • Layout context: Internal context that manages the design elements of the frictions. This includes setting up the CTA behavior for the modal, controlling the dismissible nature of the modal, and so on.
  • Verify account context: Internal context for frictions like MFA and phone/email verification.
  • Verify payment card context: Internal context for some of our card challenge frictions.
  • ThreeDSecure context: Internal context for maintaining the state variables for the 3DS friction.
Figure 2: Different contexts and their composition in component-risk
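
To make the composition concrete, here is a simplified, hypothetical sketch of how such contexts could be nested; the names and props are illustrative, not component-risk’s actual API.

import React from 'react'

// Hypothetical shapes for the externally exposed and internal contexts
const RiskContext = React.createContext<{ urlProxy?: string } | null>(null)
const LayoutContext = React.createContext<{ dismissible: boolean }>({ dismissible: true })

// The externally exposed provider wraps the internal contexts so each friction
// component only reads the state it needs
export const RiskProvider: React.FC<{ urlProxy?: string }> = ({ urlProxy, children }) => (
  <RiskContext.Provider value={{ urlProxy }}>
    <LayoutContext.Provider value={{ dismissible: true }}>
      {children}
    </LayoutContext.Provider>
  </RiskContext.Provider>
)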

Roadblocks and considerations with building the library

Even though building a library seemed to be the solution for quickly implementing and rolling out frictions, it had its share of challenges with design and implementation. Most of these challenges were around supporting tech stack customizations by different teams and addressing and embracing the disparities that arise due to such customizations.

Supporting a fragmented backend

The web packages at DoorDash are maintained individually by each team, and each team has the liberty to design them as they see fit. In addition to each team adopting different approaches, we also had a huge migration effort that saw our backend shift from a Django monolith to Kotlin-based microservices. Together, these created disparities in the way each application communicates with the backend and were one of the primary roadblocks for our library. We needed to support the following:

  • Applications using GraphQL-based BFF
  • Applications that use REST-based services (includes our Doorstep Django – DSJ monolith)
  • Applications that do not have a BFF and need to communicate with gRPC-based services

We had to use the same frontend code with the flexibility to interact with our varied backend infrastructure. After several considerations, we solved this problem by using the REST link from Apollo GraphQL. REST link allows us to call both GraphQL and RESTful endpoints from GQL queries and have all the data managed by Apollo Client. This is done by configuring Apollo to support multiple links, such as:

// Imports assume Apollo Client 3 and the apollo-link-rest package;
// headers, getBaseUrl, getBffUrl, and their arguments are defined elsewhere in the application
import { ApolloClient, ApolloLink, HttpLink, InMemoryCache } from '@apollo/client'
import { onError } from '@apollo/client/link/error'
import { RestLink } from 'apollo-link-rest'
import { camelCase, snakeCase } from 'lodash'

new ApolloClient({
  cache: new InMemoryCache(), // required by Apollo Client
  link: ApolloLink.from([
    // Log any network errors before they reach the terminating links
    onError(({ networkError }) => {
      if (networkError) {
        console.log(`[Network error]: ${networkError}`)
      }
    }),
    // RestLink handles operations annotated with the @rest directive
    new RestLink({
      headers,
      uri: getBaseUrl(proxyOverride),
      credentials: 'include',
      fieldNameNormalizer: (key: string) => camelCase(key),
      fieldNameDenormalizer: (key: string) => snakeCase(key),
    }),
    // HttpLink is the terminating link for regular GraphQL operations
    new HttpLink({
      headers,
      uri: getBffUrl(riskBffConfig),
      credentials: 'include',
    }),
  ]),
})

Then, when constructing the GQL operation, we specify whether it should be handled by the GraphQL-based service or by the RESTful service. This is done using the @rest directive.

1/ Mutation used by HttpLink:

gql`
  mutation verifyPasscodeBFF($code: String!, $action: String!) {
    verifyPasscode(code: $code, action: $action) {
      id
    }
  }
`
2/ Mutation used by RestLink: 

gql`
  mutation verifyPasscodeRest($code: String!, $action: String!) {
    verifyPasscode(input: { code: $code, action: $action })
      @rest(method: "POST", path: "/v1/mfa/verify", type: "VerifyPasscode") {
      id
    }
  }
`

Landing on a common design paradigm

Being a library of multiple frictions, we had use cases for supporting interchangeable frictions in a single workflow. For example, we have MFA, card re-entry, second card challenge, and 3DS all at checkout, and any user might get one or a combination of these based on their risk evaluations. 

To achieve this, we had to decide upon a common design paradigm for all our frictions, which allows us to reuse a lot of the boilerplate code and just change the context of each one. We went with the approach of using a modal for all the frictions. This allows us to:

  • Retain the state of the workflow since the friction is only a hard overlay
  • Not take the user away from what they’re working on
  • Keep the friction quick enough so the user experience isn’t hampered too much
Figure 3: Some of our challenges rendered using a modal

Adapting to different themes

We have different theming requirements for different applications, such as consumer, merchant, etc. We wanted our frictions to adopt the theming of the application rendering them, as can be seen in Figure 4. We achieved this by making all DLS (Design Language System) dependencies peer dependencies so that we could use the same theming context across the applications and the library.

Figure 4: MFA friction with different themes for our Caviar and merchant applications

Minimizing bundle size

One of the technical challenges with our library was to ensure that we kept our bundle sizes as small as possible. Since libraries do not go through webpack bundling, code splitting and tree-shaking weren’t really options, so we opted to use as much native JavaScript as possible and to rely on external libraries that are already commonly used in DoorDash applications. For example, we used the same analytics.js loader and created a separate i18n instance from the application. This was achieved by defining the respective dependencies as peer dependencies.

Enabling local development

The goal of component-risk is to allow seamless integration into our applications and improve developer velocity. When developing locally, it is important for component-risk to follow the same environment as the application. It would be hard to develop with the application hitting production while component-risk relied on staging. We solved this by leveraging the application’s webpack proxies and allowing the application to configure the library’s URL based on those proxies.
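
As an illustration only (the path, target host, and the URL override are assumptions, not component-risk’s actual configuration), an application’s webpack-dev-server proxy for local development might look like this:

import type { Configuration } from 'webpack-dev-server'

// Hypothetical dev-server proxy: friction calls made from localhost are
// forwarded to the same environment the application itself is targeting
const devServer: Configuration = {
  proxy: {
    '/risk': {
      target: 'https://staging.example.com', // illustrative staging host
      changeOrigin: true,
    },
  },
}

The application can then point the library at the proxied path, for example by passing a URL override such as "/risk" when initializing component-risk.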

The future: Looking beyond the web

As part of the fraud team, we wanted to develop solutions that transcend our current infrastructure and ensure risk mitigation is baked into every phase of the software development lifecycle. Having a common library such as component-risk helps us achieve this, since ease of integration keeps risk mitigation from becoming an afterthought. 

Our library now protects some of DoorDash’s most critical workflows, such as login and checkout. Almost 15 workflows across our consumer, logistics, and merchant applications have used component-risk in production for almost two years with no major breakdowns. 

Since creating our library, we have seen development time for most major initiatives drop from months to weeks. As an example, we were able to protect identity workflows such as login, signup, and forgot password with the MFA friction in a few months’ worth of effort. However, there is still room for improvement, and we are eager to extend the library’s use cases. Here is a compiled list of items for our library’s North Star:

  • Consolidate all our backend code in a single BFF, which will let us own and develop component-risk as an end-to-end solution.
  • Integrate with our identity service to authenticate backend calls without having the respective application’s BFFs as a middleman.
  • Split up each category of frictions into its own sub library to improve bundle size and also let our sub teams own the respective sub library.
  • Most frictions are extremely simple in their user experience and don’t necessarily require native solutions for mobile platforms. We see a huge opportunity in creating hybrid frictions that can also be used as webviews for iOS and Android platforms.

If you are interested in building solutions like these and in fighting fraud with us, consider joining our team!

Large e-commerce companies often face the challenge of displaying enticing product images while also ensuring fast loading speeds on high-traffic pages of their website. At DoorDash, we confronted this issue because our home page — the primary vehicle for our online marketplace’s success — was plagued with slow download speeds that were hurting the user experience and our SEO page rankings. 

Our solution was to implement server-side rendering for high-traffic pages, which involved overcoming numerous challenges. In this article, we enumerate those challenges and how we approached them in order to build successful server-side rendering and prove it was effective. 

Why server-side rendering (SSR) improves a site 

The DoorDash app was running on a client-side rendering system prone to slow page loads, poor SEO, and other problems. By moving to server-side rendering, we hoped to upgrade a number of key elements, including: 

  • Enhancing the user experience: We wanted to improve the user experience by shortening page-load times. This aligns with the recent introduction of Google’s web metrics that favor fast, lightweight pages on modest mobile devices. These metrics have significant influence on the page rank assigned by Google.
  • Enabling bundle size optimization: Our existing client-side rendered single-page app (CSR SPA) was becoming difficult to optimize because the size of the JavaScript and other resource bundles had become bloated.
  • Improving SEO: We set out to deliver optimal SEO metadata using server-side rendered content. Whenever possible, it is better to deliver fully formed web content to search engines rather than waiting for client-side JavaScript to render the content. One approach: Move API calls from the client browser (north-south) to the server-side (east-west), where performance typically is better than on a user’s device.

Common pitfalls to avoid when moving to SSR

We wanted to be careful to avoid common issues with SSR as we worked to achieve these benefits. Rendering too much content on the server can be costly and require a large number of server pods to manage traffic. Our high-level goal was to use the server only to render above-the-fold content or content required for SEO purposes. This required ensuring the state between server and client components was an exact match. Discrepancies between client and server will result in unnecessary re-renders on the client side.

Measuring performance to ensure meeting success metrics

We used webpagetest.org both to measure the performance of the pre-SSR pages and also to confirm the performance gains on the new SSR pages. This excellent tool allows page measurements across a variety of devices and network conditions while providing extremely detailed information about the multitude of activities that occur when a large and/or complex page loads. 

The most reliable way to get performance information is to test against real devices with realistic network speeds at geographical distance from your servers. Case in point: Website performance on a MacBook Pro is not a reliable predictor of real-world performance.

More recently, we added Google web vitals tracking (LCP, CLS, FID) to our observability dashboards to ensure that we are capturing and monitoring page performance across the entire spectrum of visitors and devices.

Customizing Next.js for DoorDash

Many engineers at DoorDash are huge fans of the Next.js team and Vercel. Vercel’s infrastructure was built for Next.js, providing both an amazing developer experience and a hosting infrastructure that make working with Next.js easy and maximally optimized.


In fact, we used Vercel to build out our initial SSR proof-of-concept that we then used for pitching stakeholders.

At DoorDash, however, we needed a little more flexibility and customization than Vercel could offer when it comes to how we deploy, build, and host our apps. We opted instead for the custom-server approach to serving pages via Next.js because it provided us more flexibility in how we hosted our app within our existing Kubernetes infrastructure.

Figure 1: A simplified view of how we used a reverse proxy to direct traffic in the A/B experiment for server-side rendering.

Our custom server is built with Express.js and leverages our in-house JavaScript server toolkit, which provides out-of-the-box functionality like logging and metrics collection.

At our Ingress layer, we configured a reverse proxy that directs requests using our in-house experimentation framework. This configuration allows us to use a percentage-based rollout for consumers in treatment. If consumers are not bucketed in treatment, we route their request to our pre-existing single-page application. This proxy setup gives us flexibility over the conditions under which we direct the traffic.

This proxy also handles additional scaling concerns such as logging, circuit breaking, and timeouts, which are discussed below.
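
As a rough sketch of the routing idea (using http-proxy-middleware and a stand-in bucketing function as assumptions; our actual proxy lives at the ingress layer and uses our in-house experimentation framework):

import express from 'express'
import { createProxyMiddleware } from 'http-proxy-middleware'

const SSR_ORIGIN = 'http://ssr-app.internal' // hypothetical upstream hosts
const CSR_ORIGIN = 'http://csr-app.internal'

// Stand-in for the experimentation framework: deterministically bucket a user
// into treatment for a given rollout percentage
const inTreatment = (userId: string, rolloutPct: number): boolean => {
  let h = 0
  for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0
  return h % 100 < rolloutPct
}

const app = express()
app.use(
  createProxyMiddleware({
    target: CSR_ORIGIN, // default target; router below overrides per request
    changeOrigin: true,
    // Route treatment users to the SSR app, everyone else to the legacy SPA
    router: (req) =>
      inTreatment(String(req.headers['x-user-id'] ?? ''), 10) ? SSR_ORIGIN : CSR_ORIGIN,
  }),
)
app.listen(3000)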

Implementing SSR without downtime 

Because DoorDash is growing rapidly and has many developers working at all times, we cannot afford any downtime, which would disrupt the customer experience and our developers working on other parts of the platform.

In other words, we needed to change the wheels of the car while we were barreling down the freeway at 65 miles per hour.  

We carefully considered how our migration efforts to Next.js would affect customers, engineers, and the DoorDash business as a whole. That meant resolving several key issues before we could proceed.

How to ensure new technology adoption without stopping or slowing down new feature development

It wasn’t feasible for us to halt new feature development — code freeze — while migrating our stack to Next.js because DoorDash is like a rocketship in terms of growth (see Figure 2).

Figure 2: This graph shows DoorDash’s cumulative lifetime orders delivered. DoorDash hit 1 billion cumulative orders on October 25, 2020, and doubled that to 2 billion less than one year later.

If we forced engineers to develop their new features only on the new Next.js stack, we risked blocking their rollouts and product launches; customers wouldn’t get to try new experience-enhancing features and our business wouldn’t be able to iterate as fast on new ideas.

Alternatively, if we required engineers to develop new features in both the old codebase and the new Next.js codebase, we would burden them with maintaining features in two separate runtime environments. Not only that, but our new Next.js application was itself in a rapid state of development, which would force developers to keep up with many significant changes throughout the development lifecycle.

Minimizing overhead and maintenance costs while both versions of a site are actively serving traffic

That left us to figure out how to maintain new features in both environments without requiring engineers to contribute solely to the new codebase. We wanted to ensure they were not slowed down or blocked by the migration to Next.js.

Having stacks live in a separate codebase was not ideal because we didn’t want to maintain a fork in a separate codebase, which would increase operational overhead and context-switching for engineers. Remember: DoorDash is growing fast, with new contributors from different teams and organizations hitting the ground running; any technical decision or constraint that affects how engineers operate has massive implications.

We therefore had both apps live together in the same codebase, maximizing code reuse wherever possible. To minimize overhead for developers, we enabled feature build-out without concern for how features would eventually integrate with Next.js and SSR paradigms. Through code review and enforced linting rules, we ensured that new features were written to be SSR-compatible. That process ensured that changes would integrate well with SSR regardless of changes made within the Next.js app.

When unifying the proof-of-concept Next.js codebase with our old application codebase, we needed to take care of some build configurations so that components written for one app were interoperable with the other app.

Some of this unification work involved build tooling changes, including updating our project’s TypeScript configuration to support isolatedModules, updating our webpack Babel configuration, and updating our Jest configurations so that code written for Next.js and code written for our existing app followed the same patterns.

All that was left at this stage was to migrate our app from CSR to SSR.

Incrementally adopting SSR on an existing page without re-writing every feature

We wanted to learn quickly and see big performance wins for our customers without going through a multi-quarter effort to migrate a large app with dozens of pages. Migrating an entire multi-page app to Next.js would have been a massive effort that was out of scope for what we were trying to accomplish. 

We therefore opted for a page-by-page incremental adoption approach in which we migrated one page to Next.js at a time.

We adopted a “trunk-branch-leaf” strategy, which involves focusing optimization efforts on components close to the top of the page or close to the top of the hierarchy of components on the page. For example, we completely rewrote the hero image component at the top of the homepage because it was above the fold and almost at the top of the component hierarchy of the page. Components lower down the page or lower in the hierarchy were left untouched. If these components contained references to objects not available on the server — such as window or document — then we either opted to lazy-load them on the client or simply performed a light refactor to remove the client-side dependency from them.
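
For components that reference browser-only globals, Next.js’s dynamic import with SSR disabled is one way to defer them to the client; the component name below is hypothetical:

import dynamic from 'next/dynamic'

// Hypothetical below-the-fold component that touches `window`/`document`;
// with `ssr: false`, Next.js skips it on the server and renders it only after
// hydration on the client
const BelowTheFoldCarousel = dynamic(
  () => import('../components/BelowTheFoldCarousel'),
  { ssr: false },
)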

To permit the symmetrical usage of components in SSR, either server side or client side, and the CSR SPA app, we introduced a context provider called AppContext. This provider gives access to common objects such as query string parameters, cookies, and page URL in a way that works transparently in any context. On the server, for example, cookies are available by parsing them from the request object, while on the client these are available by parsing the document.cookie string. By wrapping both the new SSR app and the existing CSR SPA app in this provider, we could enable components to work in either.

Abstracting implementation details and conditional behavior using app context

There are some differences between our old app and our new app that are critical:

  • Routing: React Router (SPA) vs Next.js-based routing
  • Reading Cookies: Reading directly from document vs. no document available during SSR
  • Tracking: Not firing tracking events during SSR vs. client-side

With the bridge pattern, we can decouple the implementation from the abstraction and change behavior at runtime based on the environment within which we’re running the app.

Here are some examples of how this can be done with some simplified pseudo code. We can create an app context which stores some metadata about our app and experience:

const AppContext = React.createContext<null | { isSSR: boolean }>(null)
 
const useAppContext = () => {
 const ctx = React.useContext(AppContext)
 if (ctx === null) throw Error('Context must be initialized before use')
 
 return ctx
}

Then our core dependencies can read from this global app-state to behave conditionally or swap out dependencies depending on the need as follows:

const useTracking = () => {
 const { isSSR } = useAppContext()
 return {
   track(eventName: string) {
     // do a no-op while server-side rendering
     if (isSSR && typeof window === 'undefined') return
     // else do something that depends on `window` existing
     window.analytics.track(eventName, {})
   },
 }
}

import { Link as ReactRouterLink } from 'react-router-dom'
import NextLink from 'next/link'
// Abstracting away React-Router leads to more flexibility with routing
// during migration:
const WrappedLink: React.FC<{ to: string }> = ({ to, children }) => {
 const { isSSR } = useAppContext()
 if (!isSSR) {
   return <ReactRouterLink to={to}>{children}</ReactRouterLink>
 }
 return <NextLink href={to}>{children}</NextLink>
}

Each application gets instantiated with this global state:

const MyOldCSRApp = () => (
 <AppContext.Provider value={{ isSSR: false }}>
   <MySharedComponent />
 </AppContext.Provider>
)
const MyNewSSRApp = () => (
 <AppContext.Provider value={{ isSSR: true }}>
   <MySharedComponent />
 </AppContext.Provider>
 )

Meanwhile, shared components remain blissfully unaware of their environment or the dependencies working under the hood:

const MySharedComponent = () => {
 const { track } = useTracking()
 return (
   <div>
     <p>Hello world</p>
     <WrappedLink to="/home">Click me to navigate</WrappedLink>
     <button onClick={() => track('myEvent')}>Track event</button>
   </div>
 )
}

Scaling and service reliability

We needed to ensure that our new application would work reliably without hiccups. To do that, we needed to understand our current implementation better and prepare the system to withstand any potential issues we might encounter as we increased traffic to our service. We accomplished a reliable rollout using the following approaches:

Measuring and benchmarking

Before rolling out our new service to production, we needed to know how much traffic it could support and what resources it required. We used tools like Vegeta to audit the current capacity of a single pod. After an initial audit, we saw that not all cores were being utilized to spread the processing load. As a result, we used Node.js’s cluster API to make use of all of a pod’s cores, which quadrupled the pod’s request capacity. 
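
A minimal sketch of the cluster approach (assuming Node.js 16+ and an Express server; the port and route are illustrative):

import cluster from 'node:cluster'
import os from 'node:os'
import express from 'express'

if (cluster.isPrimary) {
  // Fork one worker per available core and replace any worker that exits
  for (const _ of os.cpus()) cluster.fork()
  cluster.on('exit', () => cluster.fork())
} else {
  // Each worker runs its own server; incoming connections are distributed across workers
  const app = express()
  app.get('/healthz', (_req, res) => res.send('ok'))
  app.listen(3000)
}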

Falling back safely to mitigate service degradation

Because this service was new and not promoted to production yet, we realized that there were rollout risks that likely would need to be mitigated. We decided to ensure that, if the new service was failing requests or timing out, we could smoothly fall back to the old experience.

As mentioned earlier, we configured a proxy to handle routing traffic and bucketing users. To solve our concerns around a reliable rollout, we configured the proxy to send the request back to our old app experience if the new service’s request was failing.

Shedding load and circuit breaking

To prevent system overload or a degraded experience for customers, we needed mechanisms like circuit breaking in place. These ensure that we can handle requests that start to fail because of runtime issues or because requests begin to queue, degrading performance.

A limited timeout circuit breaker allows us to detect overloading on SSR servers and short-circuit — load shed — to fall back to CSR.

Our proxy was configured with a circuit breaker, opossum, to load shed if requests were taking too long to complete or were failing.

Figure 3 diagrams such a circuit breaker. According to the diagram’s famous author, Martin Fowler:

“The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.”

Figure 3 [Credit: https://martinfowler.com/bliki/CircuitBreaker.html]
A diagram showing the various states of a circuit breaker.
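
Here is a hedged sketch of how opossum can wrap a render call; the option values and the renderPageSSR function are illustrative, not our production configuration:

import CircuitBreaker from 'opossum'

// Stand-in for the server-side render call being protected
declare function renderPageSSR(url: string): Promise<string>

const breaker = new CircuitBreaker(renderPageSSR, {
  timeout: 1000,                // treat a render as failed if it exceeds 1s
  errorThresholdPercentage: 50, // open the circuit once half the calls fail
  resetTimeout: 30_000,         // probe again (half-open) after 30s
})

// When the circuit is open or a render fails, fall back to the CSR experience
breaker.fallback(() => '<!-- fall back to client-side rendering -->')

export const render = (url: string) => breaker.fire(url)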

Dashboards, monitoring RUM, and operational readiness

To ensure our service was behaving as expected, it was vital for us to build in full observability and monitoring of the system’s health.

We instrumented metrics and counters for things like request rates, failures, latencies, and circuit breaker status. We also configured alerting so that we would be notified immediately of any issues. Lastly, we documented operating playbooks so we could onboard on-call engineers to the service seamlessly to handle any alerts.

Dealing with gotchas

While our approach was good, it was by no means perfect, which meant that we had to deal with a few gotchas along the way. These included: 

Gotcha #1: Analytics and tracking success metrics

Even though we didn’t implement a full rewrite, partially rewriting components and loading pages with a new stack led to unexpected analytics readings. This gotcha isn’t specific to Next.js or SSR; it applies to any major migration involving some form of rewrite. It’s critical to ensure product metrics are collected the same way for both the old and new product.

Gotcha #2: Aggressive bundle splitting and preloading in Next.js

We used client-side rendering within Next.js as a way to improve server-side rendering performance, lazy load unneeded client-side features, and adopt components that were not ready to be rendered server-side. However, when lazy loading these bundles, which use the preload tag, we saw an increase in interactivity delays. The Next.js team is already aware of this potential performance issue and is working on more granular control over preloading of JavaScript bundles in a future release of Next.js.

Gotcha #3: Excessive DOM size when server rendering

Web performance best practices advocate keeping the DOM small: fewer than 1,500 elements total, a tree depth of less than 32 levels, and no parent with more than 60 child elements. On the server side, this penalty can be felt even more than in the browser, because the Time to First Byte is delayed by the additional CPU processing required to render the request. In turn, the user may wait longer than desired looking at a blank screen while the page loads, offsetting the expected performance gains that server-side rendering can provide. We refactored certain components and lazy-loaded others to reduce server-side rendering overhead and improve performance.

Results 

Migrating our pages to Next.js delivered page load time improvements of 12% on Home and 15% on Store. LCP (one of Google’s core speed metrics) improved 65% on Home and 67% on Store pages. The number of URLs Google flags as poor (LCP > 4s), a leading indicator, dropped by 95%.

Conclusion / Summary

To any engineers looking to migrate their stack to Next.js, we’d like to summarize our main takeaways from introducing Next.js to DoorDash:

  1. Performance: Adopting Next.js can lead to huge improvements in mobile web performance. Definitely use tools like https://www.webpagetest.org/ to audit existing performance before and after the rollout.
  2. Incremental Migration: To any team considering migrating their app to Next.js, we want to emphasize that an incremental approach can minimize full re-writes while allowing features to co-exist in both an old CSR and in new Next.js SSR apps.
  3. Rollout Strategy: We want to stress the importance of having a defined rollout strategy, and safe-guard fallback mechanisms to protect against site outages.
  4. Success Metrics: Lastly, having clearly defined success metrics and the right tracking in place is essential to confirm the migration was successful.