The DoorDash retail shopping experience mission seeks to combine the best parts of in-person shopping with the power of personalization. While shopping in a physical store has its advantages, a brick-and-mortar store cannot be personalized – the onus is on the consumer to navigate aisles to find what they need. Conversely, a digital shopping experience can be highly personalized. By understanding each consumer’s purchasing history, dietary restrictions, favorite brands, and other personalized details, we not only can recommend items that reflect a consumer’s unique shopping needs and preferences, but we can also streamline cart-building. Personalization goes beyond simply curating options for items already on a shopper’s list; it also brings a sense of serendipity by unveiling potential new favorites that consumers may not have considered before. Using the power of personalization to craft a delightful retail shopping journey fosters consumer retention by instilling trust that DoorDash truly understands a shopper’s needs and preferences. 

In this post, we show how we built a personalized shopping experience for our new business vertical stores, which include grocery, convenience, pets, and alcohol, among many others. Following a high-level overview of our recommendation framework, we home in on the modeling details, the challenges we have encountered along the way, and how we addressed those challenges. 

The challenges of building a recommendation model

Building recommendation models for our retail stores is a challenging task that requires a deep understanding of inventory, customer preferences, and shopping context. Unlike our restaurant business, where a typical merchant sells only a few dozen or at most hundreds of dishes or beverages, our new vertical business stores often carry hundreds of thousands of SKUs in thousands of categories. The inventory size and category variety require our recommendation systems to sift efficiently through a tsunami of choices to recommend relevant options to consumers. Moreover, grocery and retail shoppers tend to have more varied shopping habits and demands than restaurant consumers; while some customers prefer to reorder the same items every week, others may want to explore new products or purchase seasonal items such as Halloween costumes. Additionally, recommendation systems must adapt quickly to dynamic customer preferences that can change significantly depending on the shopping context, such as promotional sales, special events, or even the time of day.

Overall framework

As shown in Figure 1, upon landing on the homepage of a DoorDash retail store, consumers see a variety of themed collections — for example, “Organic Goods” and “Popular Deals” — displayed from top to bottom, each showcasing a selection of items arranged from left to right. If an item is of immediate interest, a consumer can click on the “+” button to add it to their cart. Those seeking more information can click on the item image to view further product details before deciding whether to add the item to the cart. If the initial collections don’t appeal to a shopper, they can scroll down vertically to view additional collections. Similarly, if the collection theme is compelling but the visible items are not of interest, consumers can swipe horizontally to see more items to the right.  

Figure 1: Example of themed collections on the homepage of a DoorDash retail store

Before the introduction of ML models, our operations team had to manually curate collections and determine both their vertical positions and the horizontal positions of items within each collection. As DoorDash’s vertical businesses grow, drawing more consumers to these pages, manual retrieval and ranking is no longer tenable, particularly because consumers’ personal needs cannot be taken into consideration. Instead, we built a new framework, as shown in Figure 2, to personalize recommendations for shoppers.

Figure 2: Overall framework to generate personalized recommendations for retail store homepages.

This framework consists of the six components below: 

I. Collection generation:

Our collections fall under three main categories, depending on how they are generated. 

  • Operator-generated collections: Manually curated by operators and usually contain popular items from a merchant, as well as seasonal items or items grouped by a specific theme.
  • Rules-based personalized collections: Items selected for each consumer based on their purchase history, for example, new items from a consumer’s top purchased brand or item category.
  • ML-based personalized collections: Item categories that ML models predict to be highly relevant to the consumer.

II. Collection retrieval:

When serving up a merchant’s page, it can be computationally expensive to fetch all available items in a store and then rank them across all collections. To avoid this hefty cost, we instead use a collection retrieval model to perform a first pass through our large group of collections to determine which ones to show consumers on the first page, second page, and so on. This streamlines fetching and ranking items to a single page at a time.

III. Horizontal item ranking:

After collections are retrieved,  we use an item ranker to place items horizontally within each collection; more relevant items appear to the left while less relevant items are pushed to the right.

IV. Item post-processing:

We apply business logic to adjust the models’ rankings. For example, items without photos are down-ranked because consumers are less likely to engage with them. Also down-ranked are items with a high probability of being out of stock, as predicted by a separate model, since such items are less likely to be fulfilled. Intra-collection diversity is also applied to avoid showing similar items in a row — for example, three types of apples in a produce collection. 

V. Collection ranking:

After items are ranked and adjusted within each collection, we carry out a second round of fine-ranking within the collections. This ensures that collections with higher average scores for their top K-ranked items appear higher than those with less appealing top K items.
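The exact scoring logic is internal to our ranker, but a minimal sketch of the idea, assuming each collection already carries its items' predicted scores and using a hypothetical K of 3, looks like this:

from statistics import mean

def rank_collections(collections, k=3):
    """Order collections by the average predicted score of their top-k items.

    `collections` is assumed to be a dict mapping collection_id to a list of
    item scores already produced by the horizontal item ranker.
    """
    def top_k_score(scores):
        top = sorted(scores, reverse=True)[:k]
        return mean(top) if top else 0.0

    return sorted(collections, key=lambda cid: top_k_score(collections[cid]), reverse=True)

# Example: "produce" ranks first because its top-3 items score higher on average.
ranked = rank_collections({
    "produce": [0.9, 0.8, 0.7, 0.1],
    "snacks": [0.95, 0.4, 0.3, 0.2],
})
print(ranked)  # ['produce', 'snacks']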

VI. Collection post-processing:

In a similar vein to item post-processing, we also apply business logic to finalize collections. One example is deduplicating items across collections so that consumers do not encounter highly similar items from one collection to another. We also enforce inter-collection diversity to avoid grouping collections that contain similar items.
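As an illustration only (the production logic is more nuanced), cross-collection deduplication can be as simple as keeping an item's first occurrence as we walk the vertically ranked collections:

def dedupe_across_collections(ranked_collections):
    """Remove items that already appeared in a higher-ranked collection.

    `ranked_collections` is an ordered list of (collection_id, [item_id, ...]) pairs.
    """
    seen = set()
    deduped = []
    for collection_id, items in ranked_collections:
        kept = [item for item in items if item not in seen]
        seen.update(kept)
        deduped.append((collection_id, kept))
    return deduped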


ML model deep dive

Collection retrieval

The collection retrieval model, as shown in Figure 3, is one of the key components of store page personalization. It determines which collections are shown to consumers on each page. The model objective is to predict the probability that a consumer will engage with a given collection, for example by clicking or adding items to the cart.

Figure 3: Collection retrieval model determines which collections are shown to consumers on each page.

The collection retrieval model considers the following features:

  • Popularity of collections, which can be determined in various ways, such as through a high click-through rate (CTR), a large number of clicks, or a high subtotal of orders from items in the collection, among other factors.
  • Consumer features, such as whether the consumer has a DashPass subscription, whether they are a new or power user, or how many orders they have placed previously.
  • Past consumer engagement with this collection, measured by metrics such as CTR, add-to-cart rates, conversion rates, and subtotals; these signals may indicate future engagement between the consumer and similar collections.
  • Past consumer engagement with items from this collection; consumers may interact with the same items from different stores or in different collections. Consumer item engagement from all surfaces — for example, clicks from search results or clicks from category pages — is used as input to the collection retrieval model.
  • Context features, including such things as time of day, day of the week, store type, and geolocation, among other factors.

Item ranking

An item ranking model determines the horizontal order of items within a collection. We started with a model that predicts CTR because click events contain rich information about consumer preferences and are highly correlated with our business North Stars, including add-to-cart and conversion. We quickly found, however, that optimizing for clicks had certain drawbacks. Models that optimize for CTR tend to up-rank niche items with high historical CTR that nonetheless appeal only to a small group of shoppers, as well as items that draw frequent clicks but are rarely added to the cart (a low click-to-ATC rate). These problems were greatly mitigated by applying higher weights to positive samples in which a click event is followed by adding the item to the cart and, ultimately, conversion.
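One way to express that weighting during training is to assign larger sample weights to clicks that travel deeper into the funnel; the sketch below uses made-up weight values rather than our production settings:

def sample_weight(clicked, added_to_cart, converted,
                  w_click=1.0, w_atc=3.0, w_conversion=5.0):
    """Return a training weight for an impression.

    Clicks that lead to an add-to-cart, and ultimately a conversion, are
    weighted more heavily so the ranker optimizes beyond raw CTR.
    The weight values here are illustrative, not DoorDash's actual settings.
    """
    if not clicked:
        return 1.0  # negative samples keep the default weight
    if converted:
        return w_conversion
    if added_to_cart:
        return w_atc
    return w_click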

Features of the item ranking model can be divided into three major categories:

  • Consumers’ past engagement on this item
  • Item attributes, including price, discounts, brand, product categories, and popularity
  • Consumer features, such as category preference, dietary restrictions, and price sensitivity

In addition to traditional numerical and categorical feature types, we also used consumer and item semantic embeddings developed by the DoorDash ML team, which offer a richer representation of our consumers and items beyond the dense features included above.

Addressing position bias 

As with other ranking models, DoorDash’s personalized rankers are affected by position bias. In fact, this problem becomes more significant because of the limited real estate in our consumer app. On most mobile devices, consumers can see only the first three items in each collection without having to scroll to the right. As shown in Figure 4, position bias causes CTR — number of clicks/number of impressions — to decline across those first three items. Because consumers must scroll manually to explore more items, overall item impressions drop sharply at the fourth position, leading to a significant CTR increase from the third to the fourth item in each collection (item card positions are 0-indexed).

Figure 4: Click-through-rate against item card position (0-indexed) within a collection

We incorporated item position as a key feature in our model to account for its impact on CTR. Because positions vary across different product surfaces, we included the product surface as an additional feature. During the training phase, the model learns how item positions and product surfaces collectively impact ranking. During inference, we set the item position value to 0, representing the first position, and the product surface to the actual surface where the model is called to make predictions.
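A simplified sketch of this inference-time convention, with hypothetical feature names, is shown below: the position feature is pinned to 0 while every other feature keeps its real value.

def build_inference_features(item_features, consumer_features, surface):
    """Assemble model inputs for scoring a single item at inference time.

    During training, `item_position` reflects where the item was actually
    shown; at inference we fix it to 0 (the first slot) so all candidates
    are scored as if displayed in the most visible position, while
    `product_surface` is set to the surface actually making the request.
    Feature names are illustrative placeholders.
    """
    features = {**item_features, **consumer_features}
    features["item_position"] = 0          # neutralize position bias
    features["product_surface"] = surface  # e.g., "store_homepage"
    return features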

Diversifying our recommendations 

Ordering items and collections based solely on model scores often leads to clusters of similar items horizontally and similar collections vertically because they exhibit similar model scores. This lack of diversity does not provide an optimal experience for shoppers, nor does it take full advantage of a store’s page to delight customers with fresh discoveries. To diversify our recommendations, we applied maximal marginal relevance to both items and collections after the ranking stage. Take item diversification as an example: Given the item set I, which includes all previously selected items (initially an empty set), we aim to find the next item j that maximizes the objective function O(j, I), which balances item score and similarity:

O(j, I) = Sj − ⍺ · sim(j, I)

where Sj is the predicted item score from the ranking model and the similarity metric sim(j, I) is defined based on item attributes such as categories and brands. The value ⍺ is determined via online experiments. This approach is similarly applied to collection diversification.
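A minimal greedy implementation of this objective might look like the sketch below, which assumes precomputed ranker scores and a pairwise similarity function and takes sim(j, I) as the maximum pairwise similarity to the already-selected set, one common convention rather than our exact production definition:

def diversify(candidates, scores, pairwise_sim, alpha=0.5, k=10):
    """Greedy maximal-marginal-relevance selection.

    At each step, pick the item j maximizing O(j, I) = S_j - alpha * sim(j, I),
    where sim(j, I) is computed here as the max pairwise similarity to the
    already-selected set I (an empty set contributes zero similarity).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def objective(j):
            sim_to_selected = max((pairwise_sim(j, i) for i in selected), default=0.0)
            return scores[j] - alpha * sim_to_selected
        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected

Here alpha plays the role of ⍺ above and, as noted, would be tuned via online experiments.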

In backend processing, this technique is applied as a post-ranking step following the horizontal and vertical ranking of collections. More specifically, horizontal diversification — within a collection — is carried out after items are ranked, with the similarity calculation applied at the product category level. Collections are initially diversified at the store level, after which pagination is used to determine which collections are currently served in view and then diversification occurs at the page level. Collection similarity is calculated by aggregating item taxonomy similarity per collection.

Future personalization goals

While we have detailed how ML solutions are helping DoorDash to recommend relevant and diverse items to consumers from a vast inventory spanning thousands of categories, our ML team also is incorporating restaurant order histories to inform grocery recommendations to individual consumers. For example, a frequent vegan restaurant patron might appreciate curated vegan selections in our grocery stores. We plan to use consumer behavior sequences as features to better capture users’ short-term and long-term interests. On the model architecture front, we are moving toward MTML (multi-task multi-label) architectures to adapt to multiple product surfaces and optimize for complex modeling objectives. Ultimately, we’re looking to implement real-time features capturing consumer behaviors within a session, for example items currently in the cart and search queries in the past few minutes, to make personalization more timely and context-aware.

Acknowledgments

Special thanks to Meng Chen, Shi Wang, Talia Stadtmauer, Vivek Paharia, Andre Jacobovitz, Yucong Ji, Jennifer Yunus, Sudeep Das, and Kurt Smith who all worked together to make this exciting work happen!

Many organizations use infrastructure-as-code (IaC) with pull request (PR) automation to provide a more secure, safe environment for making infrastructure changes. Despite the power and flexibility of IaC software, the lack of strong, secure defaults in PR automation software can make that sense of security a false one.

Infrastructure-as-code and pull request automation

IaC enables a declarative, reusable, and auditable way to manage configuration changes. At DoorDash, the primary platform for this is Terraform, operated by an account-isolated or specifically configured Atlantis instance running in ECS and backed by GitHub.

This type of configuration can be used to manage a myriad of infrastructure, such as Okta, Stripe, Chronosphere, or AWS. For the purposes of this article, we’ll focus on AWS.

A basic workflow for creating an AWS Account could be as simple as creating a new GitHub repository from a template and then issuing a PR against a repository containing the IaC for the account managing the AWS Organization. Atlantis automatically plans on the newly issued PR, and an admin, engineer, or other authorized personnel reviews and approves the proposed changes as appropriate. Upon approval, someone with access to comment on PRs, such as the author or an approver, can leave the comment “atlantis apply,” instructing Atlantis to execute the proposed plan and merge the PR upon success.

Because the Atlantis instance is isolated to the specific AWS Account and only executes the plan post-approval, one would assume that this is a safe setup. However…

Bypassing approval

By default, Atlantis dutifully executes terraform plan in directories where changes to specific files, for example *.hcl, have been made. terraform apply cannot be run unless the PR has been approved. Terraform, however, is a flexible and powerful tool. Terraform providers execute code at plan time and can be pulled from outside the public registry. A user with the ability to open PRs could host, fetch, and execute a malicious provider to circumvent PR approval requirements. In fact, such a user wouldn’t even need to host a malicious provider. An official provider, external, contains a data source which can be used to tell Atlantis to do pretty much anything.

The troubling fact is that the external data source can execute arbitrary code at plan time with the same privileges and in the same environment as Atlantis, allowing arbitrary changes to be made without any need for review or approval.

Plugging the leak

Atlantis has powerful server-side customization that allows customized default plan and apply workflows, provided it is not configured to allow repositories to provide their own configuration customization. This enables running tools such as Conftest against Open Policy Agent (OPA) policies that define an allowed list of providers before terraform plan is executed. Given the large number of providers available in the Terraform Registry and the means to use providers from unlimited sources, a strict allowlist of providers removes the ability to apply changes or leak environmental data at plan time.

To create such an allowlist, it’s important to let Terraform resolve its dependency graph instead of trying to parse required_providers because unapproved providers can be referenced by external modules and their transitive dependencies. Once the dependency graph is resolved with terraform init, all required providers can be found in the dependency lock file alongside version and checksum information. Here is an example server-side config validating an allowlist of providers against the dependency lock file:

repos:
- id: /.*/
  branch: /^main$/
  apply_requirements: [approved, mergeable]
  workflow: opa
workflows:
  opa:
    plan:
      steps:
        - init
        - run: conftest test --update s3::https://s3.amazonaws.com/bucket/opa-rules --namespace terraform.providers .terraform.lock.hcl
        - plan

A starter policy evaluating just the provider source address appears as follows: 

package terraform.providers

allowed_providers = {
       "registry.terraform.io/hashicorp/aws",
       "registry.terraform.io/hashicorp/helm",
       "registry.terraform.io/hashicorp/kubernetes",
       "registry.terraform.io/hashicorp/vault",
}

deny[msg] {
       input.provider[name]
       not allowed_providers[name]
       msg = sprintf("Provider `%v` not allowed", [name])
}

With version and checksum information available in the dependency lock file, OPA policies could enforce not just certain providers but also non-vulnerable versions and known checksums.

With these precautions, if a bad actor attempts to use the dangerous data source in their HCL, Atlantis will halt before planning:

FAIL - .terraform.lock.hcl - terraform.providers - Provider `registry.terraform.io/hashicorp/external` not allowed

1 tests, 0 passed, 0 warnings, 1 failure, 0 exceptions

The developer experience can be improved by adding a prescriptive error message and defining a process for expanding the provider allowlist. Additionally, a feature can be added to the custom workflow to allow authorized users or groups in GitHub to permit a dangerous plan anyway with a PR comment.

Note that the above implementation relies on the existence of the dependency lock file (.terraform.lock.hcl), which did not exist prior to Terraform 0.14. We recommend enforcing a minimum version of Terraform to prevent downgrade attacks. If you need to support older versions of Terraform, “terraform version” returns provider information starting in 0.11 with JSON output added in 0.13.

Alternative approaches to implementing provider validation include hosting an internal registry and using a network mirror or baking providers into your image and using -plugin-dir.


Reducing review fatigue

Such a workflow can require quite a few people to get anything done. Consider: An engineer simply wants to update a configuration property, but everything requires a review. This can grind productivity to a halt and make for an unpleasant work day waiting to do something as simple as increasing a memory limit on an EC2 instance. 

With Conftest and OPA, specific resources can be allow- or deny-listed, permitting some specific changes without needing approval while others would be specifically flagged for approval.

Additionally, approval for changes to specific properties can be delegated to non-specialized teams in GitHub by adjusting CODEOWNERS and writing the HCL in such a way that it reads the property values from non-Terraform files such as .txt files. For example:

locals {
  users = var.users != null ? var.users : (
    var.read_users_from_file == null ? [] : [
      for user in split("\n", chomp(file(var.read_users_from_file))) :
      user if trimspace(user) != "" && substr(trimspace(user), 0, 1) != "#"
    ]
  )
  set_users = toset(distinct(local.users))
}

The combination of these two techniques can pre-determine that a number of changes are explicitly safe, significantly reducing the need for review by a team member from security or infrastructure engineering.

Management nightmare

Recall the configuration of Atlantis. For safety, each AWS Account has its own instance of Atlantis so that a misconfigured or compromised instance in one account can’t make changes in another account. Each instance runs in Elastic Container Service (ECS) with separately configured containers. Every change to the workflow configuration currently requires a PR. In large AWS Organizations, this can result in a significant number of PRs creating a tedious process.

Presently, Atlantis is tedious to manage en masse. Simplifying this process is a priority, but requires planning. Some design changes can be made to help. For example, workflow configuration can come from a service or source control management system. Additionally, we can create limited-purpose cross-account AWS Identity and Access Management (IAM) Roles to permit updating of all Atlantis ECS Service Task Definitions and Services. Doing so, however, requires planning to limit unknown/unreviewed/unofficial images being used in the Task Definitions as well as monitoring of CloudTrail logs to reduce the chance of unauthorized changes.

Conclusion

Any sufficiently powerful tool is unlikely to come without risk, so it’s important to review the functionality of tools and systems in the critical path of a workflow. A misconfigured build environment could lead to remote code execution on a developer or continuous integration (CI) machine. A misconfigured PR automation system could lead to something similar or more unfortunate. Maintaining safe operations calls for addressing critical findings in reviews.

Simple roadblocks may provide security but often lead to fatiguing inefficiencies. Few people will continue to use a secure system that they don’t enjoy or that bogs down the entire process. Being mindful of this provides opportunities to explore ways to reduce inefficiency while maintaining excellent security, increasing developer velocity, and reducing fatigue.

Batten down the hatches, full steam ahead!

DoorDash’s Engineering teams revamped Kafka Topic creation by replacing a Terraform/Atlantis based approach with an in-house API, Infra Service. This has reduced real-time pipeline onboarding time by 95% and saved countless developer hours.

DoorDash’s Real-Time Streaming Platform, or RTSP, team is under the Data Platform organization and manages over 2,500 Kafka Topics across five clusters. Kafka is the pub-sub layer of the Iguazu pipeline, which provides real-time event delivery at DoorDash. Almost six billion messages are processed each day at an average rate of four million messages per minute, which sometimes peaks at double that rate. 

The RTSP team constantly looks for ways to speed up Iguazu onboarding. The slowest step in that process was provisioning Kafka Topics, which involved on-call engineer approval and was prone to failure, further increasing the on-call load. To improve this, RTSP partnered with DoorDash’s storage and cloud teams to automate Kafka resources creation by integrating with an internal infrastructure resource creation service.

Key terminology

Here are definitions and links to further documentation about the tools we used. We address how these tools are used and their pros and cons in the main article. 

  • Terraform: Infrastructure-as-code (IaC) platform. It uses the unique HashiCorp Configuration Language (HCL) for configuring infrastructure resources. To provision infrastructure, create an execution plan, called a Terraform plan, and then execute the plan through Terraform Apply.
  • Atlantis: A Terraform automation tool. Runs Terraform Plan and Apply. Merges Terraform pull requests on successful runs.
  • Pulumi: Similar to Terraform, this is also an IaC platform, but without HCL. Pulumi instead leverages existing programming languages to manage infrastructure.
  • Prometheus: A monitoring and time-series database. Designed for monitoring application and infrastructure metrics. Exposes query language PromQL for writing alerts on metrics.
  • Chronosphere: Cloud-native observability platform. Built atop Prometheus. 
  • Cadence Workflow: Fault-tolerant, stateful workflow engine capable of executing directed acyclic graphs (DAGs).

Understanding the legacy architecture

As shown in Figure 1 below, DoorDash’s legacy approach to Topic creation involved several steps within a Cadence workflow.

  1. Orchestrator service triggers an API call against the GitHub repo for Kafka Topics, creating a pull request, or PR, for a new topic and corresponding access control list (ACL) entry.
  2. Orchestrator service triggers Atlantis to run Terraform Plan against the topic. 
  3. On-call receives an automated email notification about the PR. 
  4. Orchestrator service polls the PR status to check for on-call approval. 
  5. Once approved and in a mergeable state, Atlantis Apply is triggered against the PR. 
  6. Orchestrator service monitors for a successful PR merge. In the event of failure, the PR is deleted and the process begins again from Step 1.
Figure 1: Legacy architecture for Kafka Topic Creation

Unfortunately, the Topic creation workflow often would fail for any of several reasons:

  • GitHub merge conflicts upon PR creation when multiple PRs are cut off the same commit
  • Terraform state drift against Kafka state
  • Atlantis sometimes would time out on clusters with hundreds of topics, completing in a nondeterministic amount of time
  • Atlantis state drift against Terraform state; for example, Terraform was applied but Atlantis did not merge the PR
  • Because reviewing and approving PRs is time-consuming, on-call sometimes would miss the email notification, causing time-outs. Note: The volume of new Topic-created PRs could exceed 20 per hour during product launches.

Furthermore, it is difficult to programmatically audit Kafka clusters and perform scale-up operations such as adding partitions or migrating to dedicated clusters without manual intervention. 

Developing a new architecture

Initially, we considered a number of potential approaches, including:

  • Creating durable in-memory state to track fine-grained progress and to synchronize between workflows. State would be recovered from disk on Orchestrator restart.
  • Using the online transaction processing database (OLTP) to persist the state mentioned above.
  • Writing a custom Terraform provider
  • Increasing workflow retries. 

All four solutions would have been duct-tape solutions unable to completely address the underlying issues: State synchronization across Terraform as hosted on Git, Atlantis, Cadence workflow, and Kafka. Although the first two might have solved some of the issues mentioned, they would have run the risk of further complicating state management by introducing new states to keep in sync. As the authoritative source of truth, Kafka must be consistent with any solution we choose. 

Capturing a small win: A use case for Kafka Super Users

While exploring these solutions, we identified that merge conflicts were occurring only in the ACL files for Iguazu users. Each consumer and publisher in the Iguazu pipeline has a separate Kafka user account. Upon each Topic creation, an Iguazu user’s ACL file was updated with an ACL entry for that topic. Eventually, the ACL files grew to have hundreds of permissions, significantly slowing Atlantis apply runs.

Our “eureka” moment came when we realized that this was a perfect use case for super-user accounts. Permissions-related pitfalls meant that we usually shied away from setting up super users. But if each Iguazu user — the REST proxy or upstream and downstream Flink jobs — needed access to every single topic in a cluster, it would be ideal to give these users full read or read-write access as needed, eliminating the ACL file and its related issues. Additionally, the existing workflow could be further improved, as we outline below.


Going for the big win: Streamlined Kafka resource creation 

Infra Service is an internal platform that provides an API to perform CRUD operations on infrastructure components. It is built and maintained by the Infrastructure organization at DoorDash. It replaces the traditional approach of using infrastructure-as-code and GitOps to provision infrastructure. Infra Service replicates the important features provided by Git and GitHub, including version control and change reviews. It’s also plugin-driven, allowing teams to add support for resources that they would like to manage programmatically. Most of the work required to implement an Infra Service plugin involves writing an underlying Pulumi program.

Infra Service uses Pulumi to handle infrastructure provisioning under the hood. Pulumi is an infrastructure-as-code tool similar to Terraform, but unlike Terraform, Pulumi allows for using general programming languages to define infrastructure. It has robust support for testing and an extensive provider catalog. Infra Service handles programmatically invoking Pulumi when a change is requested and propagating any outputs resulting from the Pulumi execution back to the end user.

To create Kafka and database resources, we’ve developed a component within Infra Service called Storage Self-Serve Platform. This is shown below in Figure 2.

Figure 2: New streamlined architecture for Kafka Topic creation

In Figure 2, still wrapped in a Cadence workflow, the Topic provisioning is reduced to a two-step process:

  1. Orchestrator service Minions fires a gRPC request to the Infra Service gateway to provision the Kafka Topic. Minions receives a synchronous response indicating whether the topic-creation request has been persisted. At this point, the topic might not yet have been created. From Minions’ perspective, everything behind the Infra Service gateway is a black box that handles dedupe, validation, and retries.
  2. Because a Topic is considered created when it shows up with non-zero disk usage, Minions continuously polls Chronosphere, our Prometheus-based metrics platform, for that state. All Topics, even those without any messages, include some metadata that is backed up to disk. We use Chronosphere for two reasons: First, it independently corroborates the state of the Infra Service black box; second, DoorDash runs Chronosphere at four nines (99.99%) availability, which means Chronosphere outages essentially don’t exist. If Kafka doesn’t report topic metrics for a few minutes, it is improbable that the delay will continue much longer unless there are bigger issues with Kafka. When the metrics eventually show up in Chronosphere, Minions picks them up. A simplified sketch of this polling loop appears after this list.
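The sketch below illustrates such a poll against a Prometheus-compatible query API; the metric name, endpoint, and thresholds are placeholders rather than our actual Chronosphere configuration.

import time
import requests

def topic_has_disk_usage(prometheus_url, topic, timeout_s=3600, interval_s=60):
    """Poll a Prometheus-compatible API until the topic reports non-zero disk usage.

    `kafka_topic_size_bytes` is a placeholder metric name; the real deployment
    queries Chronosphere for the equivalent per-topic metric.
    """
    query = f'sum(kafka_topic_size_bytes{{topic="{topic}"}})'
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(f"{prometheus_url}/api/v1/query", params={"query": query})
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        if results and float(results[0]["value"][1]) > 0:
            return True  # topic exists and has metadata/messages on disk
        time.sleep(interval_s)
    return False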

Savoring the win

This new architecture allows provisioning roughly 100 new topics every week without manual intervention. With this API-based topic workflow, we reduced Iguazu onboarding time by 95%. Previously, customers were guaranteed onboarding within two business days, or about 48 hours. Now onboarding completes within an hour of request submission and often within 15 minutes. And there’s a bonus: Manual on-call intervention has been reduced by about four hours per week.

Each topic created using the new architecture includes rich metadata on ownership, throughput expectations, and message size, which will ease enforcing reliability guardrails in the future.

Ultimately, by integrating with the standard Storage Self-Service Platform within Infra Service, we have access to admin controls including overriding topic configurations, retrieving user passwords, and developer friendly access to Kafka cluster state. 

Exploring a storage self-service future

Figure 3: Here, CRDB is CockroachDB and DBMesh is a data gateway service that interacts with all the supported storage technologies on behalf of users.

Building on the success of Infra Service and the Storage Self-Serve Platform, we plan to add the following features to improve our guardrails and customer experience. Figure 3 illustrates the high level architecture of the future design. 

  • Centralized validation logic, which will be maintained by the storage team. Such validation logic can be continuously tuned to match business needs.
  • Smart default values. Partition count and replication factor can be computed based on the customer’s request, simplifying the input needed to provision a topic; a rough sketch of one such default appears after this list.
  • Catch duplicate requests earlier in the provisioning process through deduping logic that is specific to Kafka. Return API errors to users.
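As a purely hypothetical example of such a smart default, the partition count could be derived from the requested throughput and a per-partition capacity target:

import math

def default_partition_count(expected_mb_per_s, per_partition_mb_per_s=5.0,
                            min_partitions=3, max_partitions=100):
    """Pick a partition count from a customer's expected throughput.

    The per-partition capacity and bounds are illustrative placeholders,
    not DoorDash's production values.
    """
    needed = math.ceil(expected_mb_per_s / per_partition_mb_per_s)
    return max(min_partitions, min(max_partitions, needed))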

Acknowledgments

This engineering win has been a team effort across several teams: Real-Time Streaming Platform, Storage, and Cloud. Special thanks to all the engineers who helped realize this: Roger Zeng, Luke Christopherson, Venkata Sivanaga Saisuvarna Krishna Manikeswaram Chaitanya, Basar Hamdi Onat, Chen Yang, Seed Zeng, Donovan Bai, Kane Du, Lin Du, Thai Pham, Allen Wang, Zachary Shaw and Varun Narayanan Chakravarthy.

It is amusing for a human being to write an article about artificial intelligence in a time when AI systems, powered by machine learning (ML), are generating their own blog posts. DoorDash has been building an internal Machine Learning Workbench over the past year to enhance data operations and assist our data scientists, analysts, and AI/ML engineers. In this article, we’ll explain how DoorDash has accelerated ML development velocity through constructing a streamlined environment for automating ML workflows. We also shed light on how we drove value by taking a user-centered approach while building this internal tool.

Importance of ML at DoorDash

ML is involved in a wide range of applications in the tripartite symbiosis of customers, Dashers, and merchants to whom DoorDash caters. From using the right image on merchant store pages to suggesting appropriate substitutes when Dashers are unable to find a suitable replacement for an out-of-stock item, there are opportunities aplenty for which manual solutions are inefficient, expensive, or implausible. 

As shown in Figure 1, data science intersects ML in multiple ways and is paramount to DoorDash’s success. Therefore, it’s critical for the data and engineering teams to have comprehensive support throughout the ML process. An internal ML workbench facilitates collaboration and information sharing between these teams and also speeds up  and streamlines execution of ML projects.

Figure 1: Euler Diagram showing the relationship between Computer Science and Data Science

The concept of an ML Workbench

Our vision for the ML Workbench was to create a centralized hub for accomplishing tasks throughout the machine learning lifecycle, such as building, training, tuning, and deploying machine learning models in a production-ready environment. The idea was to create a one-stop shop for users to collect data from different sources and then clean and organize it for use by machine learning algorithms.

Figure 2: Phases of machine learning lifecycle

Workbench Evolution

ML Portal motivation and backstory 

The ML platform team started by building a simple UI to automate the model testing process through a web application called the ML Portal. Data scientists could use ML Portal to test their models easily in a browser with a few mouse clicks.

This came from preemptive thinking after we observed that the manual testing process wasn’t scalable, slowing ML development and generating repeated questions about how to put together the Python test script. As we saw users readily adopt this simple automation, we realized that simple tools can help our customers increase model development velocity over the long term.

We soon started adding more functionality to this UI. Some of the initial features included:

  • Ability to view all models
  • Ability to test model predictions
  • Ability to view the features that constitute a model

We observed ML Portal’s utility as adoption grew and decided to double down on this effort. We continued iterating on our initial prototype, which we created using a Python Flask and HTML framework. 

ML practitioners told us that they perform a number of daily tasks that we decided to incorporate into the UI tool to accelerate and streamline their daily workflow. As we reached a critical mass of adoption, users started to put in feature requests for the UI; we knew we needed to improve both our technology stack and our information architecture to make meaningful incremental improvements to their workflows. 

At the same time, we were conducting user satisfaction surveys and gathering improvement reviews each quarter that verified how useful the ML Portal was becoming. All of this prompted creation of The ML Workbench: A Homepage for ML Practitioners at DoorDash. Setting an initial ambitious goal to drive model development velocity, we soon assembled a team that included both design and engineering.

Workbench goals

  • Internally grow a solution optimized to boost the productivity and velocity of DoorDash teams running ML-powered operations
  • Build a best-in-class internal tool that’s functional, usable, and aesthetically pleasing, and that integrates seamlessly into DoorDash’s growing internal tools ecosystem
  • Reduce reliance on third-party apps
Figure 3: Construction of the ML Workbench

Workbench development strategy

We took our usual crawl-walk-run product development approach, instilling design thinking to prioritize our sequence of operations:

Phase 1 (Q1-FY23)

  • Drive research to understand user pain points, current usage
  • Establish a product development process with cross-functional partners
  • Craft a short-term vision for the ML Workbench (MLW)

Phase 2 (Q2-FY23)

  • Design solutions for key experiences and friction areas identified during research 
  • Run user tests with the first few versions of engineering builds
  • Optimize workbench performance, aiming for better velocity and productivity

Phase 3 (Q3-FY23)

  • Develop a feedback mechanism through product surveys
  • Use feedback to inform long-term vision
  • Extend capabilities and capture more of the ML lifecycle through feature adds and enhancements

User research

Despite our ambitious goals, we quickly learned that we couldn’t have the workbench support all four phases (Figure 2) from the get-go. We conducted interviews across multiple teams, including Search, Ads, ETA, and more, focusing on each participant’s role, how they were using ML Workbench, their team’s goals, and their current pain points. We organized major user tasks using a jobs-to-be-done framework and categorized users into three buckets:

I. Admins (ML platform engineers)

  • Provide maintenance and support across ML platform 
  • View ML models and associated input variables — features — across predictors and use cases for quick debugging
  • Set up connectors that allow users to interact with other services on ML platform

II. End users (Data scientists, data analysts, other data users)

  • Develop ML models end-to-end and explore currently available datasets
  • Deploy shadow models
  • Monitor models in production
  • Make test predictions 
  • Track model data such as features, training runs, shadow models, and metrics

III. Operators (product managers, business leads)

  • Review key signals and metrics
  • Supervise ML team performance and efficiency 

Key findings

Based on our conversations with users and their use of working prototypes in their day-to-day workflows, we surmised:

I. Which pages received the most traffic

  • “I use it for looking up information on predictors, features and sometimes for testing and deployment — not for model training yet.”
  • “I frequently check Pipeline Runs and Sensor Ticks, but, often verify with Dagit.”

II. The phase of the ML lifecycle during which the workbench was most used

  • “We don’t touch ML Portal during feature development work. After the feature has been deployed to production and uploaded to Redis, we start using ML Portal to check the feature.”

III. The key issues in available capabilities

  • “I’ve never clicked into the fabricator source on ML Portal. I didn’t know all this source information was inside.”
  • “I love feature search. Would be really helpful to have a dropdown box as we’re typing feature search keywords (contextual search).”

As we spoke to users, we realized that this also was an opportunity for us to observe what DoorDash’s ML pipeline looked like. By capturing the complicated landscape better, we could identify where MLW could be most effective and perhaps slide in as an alternative to third-party tools.

Setting a vision and scoping out a launch-ready MLW v1

Our research guided us toward what we wanted to solve and crystallized into a vision of a full-scale ML Workbench, in the form of a design prototype that would be our north star. From here, we defined the first version and focused on:

  • Setting a strong foundation for a scalable workbench by building the front-end from scratch in React, consistent with Prism, our internal components and design system
  • Integrating MLW in the existing internal data tools suite that includes tools such as Experimentation Platform and Metrics Platform
  • Reducing time on-task for key experiences to speed velocity directly and to boost productivity through making MLW actions and capabilities easily discoverable 
  • Creating a 45-day concept-to-production timeline to iterate consistently on new and existing workbench capabilities 

Figure 5: Pyramid of product building approach – how we design at DoorDash


Use Case

Problem: Feature Upload Status

Model owners often perform daily checks to ensure feature freshness. The old flow involved a few too many steps using a command-line interface, as outlined below, to check if features were being uploaded on time to the chosen feature store.

Figure 6: Pre ML Workbench steps for upload status lookup
Figure 7: Development demo for looking up a pipeline upload status

Problem: Feature values serving lookup:

As fabricator adoption grew, data scientists and ML engineers needed to ensure that the features they created were correct. Even simple tasks such as a spot check for created values required going through a tedious process from their local machines to query the feature stores in production. 

Figure 8: Pre ML Workbench steps for feature value lookup
Figure 9: Pre ML Workbench demo for feature value lookup

Solution

By enabling MLW to integrate with the feature stores, we let users directly query production data via a simple user interface. This greatly reduced the operational overhead for ML practitioners to verify that the features they generate are correct. For feature upload status spot checks, we made the process much easier and quicker by enabling MLW to interact with the feature upload service and its tables directly from the UI.

Figure 10: Redesigned ML Workbench showing feature value lookup and upload status
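Under the hood, a feature value spot check boils down to a key-value read against the online feature store. The sketch below shows the kind of lookup the workbench performs on a user’s behalf; the key format, Redis client, and connection details are illustrative assumptions rather than our actual feature store schema.

import redis

def lookup_feature_value(host, feature_name, entity_id):
    """Fetch a single feature value from an online feature store backed by Redis.

    The key format `<feature_name>:<entity_id>` is a made-up convention for
    illustration; the real feature store uses its own serialization scheme.
    """
    client = redis.Redis(host=host, port=6379, decode_responses=True)
    return client.get(f"{feature_name}:{entity_id}")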

Testimonials

Since deploying ML Workbench, our engineering and data science teams have given great feedback about how it streamlined their processes and created a much better user experience.

“These improvements are huge! New platform is already saving me time because I can send it to my xfn to check features values (for pick score) and they can validate that the features are correct & make sense.”

– ML Engineer, New Verticals

“While technically this functionality may have existed in the old platform, the UI was so difficult to work with (that) I wasn’t able to use it as a tool to accelerate my own work or get extra eyes on it to improve the quality of my work.”

– Software Engineer, Consumer Growth

What’s next?

As we continue to scale our efforts with a customer-obsessed approach, we are looking into the following areas of focus:

  • Drive and diversify adoption: DoorDash’s ML Practitioners already need and actively use ML Workbench, but now we want to add more personas to its user base
  • Improve observability: As we head into 2024, we seek to leverage ML Workbench to improve feature and model observability to increase user confidence in the platform tools 

Traditionally, developing internal tools for developers has focused solely on automation, often at the expense of user experience. With ML Workbench, we challenged ourselves to develop user empathy and balance the goals of velocity and productivity with a focus on the user. Rather than limiting ourselves to niche workstreams, we wanted to create a positive impact on as many data users as possible. We took the time to understand the pain points that engineers and data scientists face, prompting us to create both a functional solution and one that our users would find easy and delightful to use. As we scale this tool to capture other phases of the ML lifecycle going forward, we’ll continue to prioritize our user-centric philosophy to drive adoption and propel ML development.

Acknowledgements

We would like to express our gratitude to Karan Sonawane and Prajakta Sathe for getting the UI development process going and making the project move from a blank slate to something tangible. We are also thankful to Andrew Hahn and Hien Luu for their guidance and assistance in the collaboration between Design and ML Platform. This was the first project between the two teams and their mentoring enabled us to navigate hurdles in the journey. We also appreciate  Hanyu Yang and Steve Guo for making sure the ML Workbench was complete at the time of launch. Lastly, we are thankful to Swaroop Chitlur and Kunal Shah for their input on the implementation approach, engineering structures, and ensuring the ML Platform delivers value where it matters.

Regularly releasing updates to the App Store and Play Store is more complex than might be expected, especially for teams at scale and even more so when there are multiple apps to ship. There are so many ways to thread through release complexities that no two teams will do everything the same way.

It’s intriguing to see how other teams work. Discerning similarities and differences between teams can help reveal potentially valuable new approaches. In that spirit – and with a six-year background in the release processes for both the DoorDash and Caviar consumer apps – here is a high-level overview of DoorDash’s mobile release management.

Note: At DoorDash, we ship multiple Android and iOS apps and each team handles releases differently. This article focuses on DoorDash’s consumer iOS team and its release processes.

A general overview of the release cycle

At a high level, our release cycles are similar to those found at many other companies. Here’s a quick look at how we manage things: 

  1. We ship a new app version every week, which means we have a release cut weekly.
  2. Testing begins when a release is cut. At that point, we have up to a week to fix any critical regressions, which we do by cherry-picking fixes into the release branch.
  3. We submit for App Store review after testing and any fixes are complete, ideally mid-week to create a buffer for any potential rejections or unexpectedly long review times.
  4. Once approved, the build waits for release until the end of the week.
  5. On release day, we begin a phased rollout to 1% of users to ensure no other major issues surface that might have been missed during testing.
  6. After closely monitoring release health, we accelerate the rollout to all users on Day 3.
  7. At this point, the subsequent release is already in the works, so it’s a matter of rinse and repeat.

The following sections drill down a bit deeper into our processes, particularly the collaboration and human aspects.

Release management responsibilities

A small team of volunteers rotates through the release manager role. Each on-duty release manager is in charge of making sure that the current release rolls out smoothly and that the subsequent release makes it through App Store approval before its scheduled release date.

We deliberately keep the number of release managers relatively low. We have enough managers to spread the load, but not so many that anyone gets out of practice. Keeping the numbers in check helps to keep decisions consistent, especially with inevitable judgment calls that may occur. Managers not only are kept up-to-date on recent process changes and improvements; they’re part of deciding how those processes should change in the first place.

Managing communication

To maintain clear communication and keep organized throughout releases and rollouts, we create a Slack channel specific to each release — for instance #ios-release-5-0-0. This centralizes release-specific status updates and conversations in one place, reducing noise in other shared channels and making it easy to look up details from a past release if needed.

The dedicated Slack channel is especially helpful when hotfixes enter the mix. Because we have a weekly release schedule, shipping a hotfix for an issue in production means that there are two separate releases in progress simultaneously. Isolating each release’s communication in its own channel prevents confusion. For instance, in the 5.0.0 example, anything related to the 5.0.1 hotfix would be discussed in #ios-release-5-0-0 while matters related to the upcoming 5.1.0 release would be addressed in #ios-release-5-1-0.

Testing release candidates

Our apps have grown so large that it’s impossible for any one person or even a few people to fully own release candidate testing. The team dedicated exclusively to higher-level QA can’t easily manage intensive weekly regression testing, too. Considering the volume and pace of continuous changes and new feature development throughout the app, it’s difficult to ensure that the right things are being tested — and tested correctly. The people actually building features and making changes are in the best position to know what’s new, what’s different, and how to test it all properly.

That’s why our release candidate testing relies on a group of engineers we call “component owners.” Each of them is responsible for testing their own component — a feature or area of the product — in the release candidate and fixing or delegating fixes for any regressions detected during testing. Components usually map one-to-one with our product teams — for example, the login team is responsible for running testing related to the app’s login component. Each component owner has specific tests that they must run before approving the component. And every single component must be approved before the release can be submitted for review.

Of course, making sure all component owners have signed off before submission and figuring out who is still testing can get complicated. We use a mobile release management platform called Runway to make collaboration easier throughout the week. It captures the current status of component testing and can also automatically remind component owners through Slack to complete their tests.


Handling regressions and hotfixes

Although we strive for seamless releases — test, approve, and release — that’s obviously not possible 100% of the time. Sometimes we catch regressions during testing that need to be fixed before we can release the app. Because it takes time to figure out what’s causing a bug and then fixing it requires modifying code — which introduces risk late in the cycle — it’s important to carry out release candidate testing as early in the week as possible. That way we have enough time to find the correct solution to the regression and thoroughly test the fix to make sure our changes don’t break something else. Given the scale and nature of what we build at DoorDash, even small issues can have a big impact, so we approach regressions and hotfixes with rigor and care.

If a component owner identifies a regression during release candidate testing, they work with their team to triage the problem and then develop a fix on our main branch via a regular pull request. After the fix is merged, it may — or may not — be eligible for cherry-picking over onto the release branch. Because late-arriving changes are so risky, we can’t simply allow everything into the release. We have strict requirements that must be met before any change can be added to a release. For example, we allow fixes for regressions or new bugs that have a measurable effect on the user experience, but we don’t allow integration of code to fix minor bugs that don’t affect the user — and we certainly don’t use this process to squeeze in new features that might have missed the release cut. We have developed a flow to facilitate escalating fixes for possible inclusion in a release; teams submit fix requests, including explanations and evidence, that are then reviewed by the release manager.

Post-release, the process for creating a hotfix is similar to that used for requesting a cherry-pick in-cycle, except the criteria for allowing a post-release hotfix are much more strict. Spinning up a hotfix requires much more work and could impact the normal release hot on its heels. If we find a bug late within a release cycle — say, between when the app was submitted for review and its release — the decision on whether it gets a fix depends on the same strict criteria used for post-release hotfixes. Although the update is not yet public, it may be waiting for review or even approved already; to implement a fix, we would have to reject the build and resubmit the app. Because this could delay the release, we evaluate whether a fix is merited and how to approach it on a case-by-case basis. We could, for example, either developer-reject the build, apply a fix, then resubmit, or we could let the release happen on schedule and immediately spin up a hotfix afterward. Alternatively, we may determine the bug isn’t impactful enough to warrant a hotfix, putting it in for a fix during the next release cycle.

Monitoring rollouts post-release

Post-release, we rely on component owners and their teams to keep an eye out for any issues that may arise in their areas of responsibility. Each team has a unique set of key metrics they own and monitor, which makes them best-equipped to understand what may be going wrong with their components in the new release.

But release managers aren’t completely off the hook. They must watch higher-level measures of health, like the new release’s crash rate and any trending issues. We use Sentry to keep track of the apps’ crashes. Because we can integrate it with Runway, we can create a single source of truth for observing app health closely. If a release manager sees something unusual, they can ask component owners to take a deeper look and make fixes as needed. But if no problems arise, we can automatically proceed to a full rollout.

Conclusion

As described here, releasing mobile apps at scale takes quite a bit of work and coordination. Keeping releases moving forward on schedule requires effort that involves stakeholders across many teams to test and safeguard quality; centralizing control with release managers ensures the process runs consistently and efficiently. The setup described here allows us to maintain a weekly release cadence across multiple apps while keeping quality high and team members happy.

DoorDash proactively embeds privacy into our products. As an example of how we do so, we delve here into an engineering effort to maintain user privacy. We will show how geomasking address data allows DoorDash to protect user privacy while maintaining local analytic capabilities.

Privacy engineering overview

To facilitate deliveries, users must give us some personal information, including such things as names, addresses, and phone numbers, in a Drive API request. This information is needed for Dashers to know where and to whom to deliver an order. Because this information can be used to re-identify an individual, it could be used by a bad actor to cause harm, including identity theft and doxxing.

That’s why we want to ensure that this personal data is redacted (erased or obscured) from our platform within a reasonable period of time after a delivery is completed. That way, even if a bad actor gains unauthorized access to our database, personal data will no longer be there, preventing it from being misused.

Merchant-defined policies 

On DoorDash Drive, merchants can configure redaction policies that determine when personal data related to their orders will be redacted.

If a merchant does not configure such a policy, we apply a default redaction time anyway; personal data is never allowed to persist indefinitely.

Asynchronous redaction

As is common practice in the industry, each user is associated with a unique user identifier. Personal data can propagate through many layers of a distributed system, so we associate instances of this data with an identifier.

As seen in Figure 1, an asynchronous redaction eligibility job combines various inputs to determine if a user is eligible for redaction, including merchant-defined policies, time at which drive orders were placed, and a mapping of where data has propagated. If the data is eligible for redaction, the job publishes a message to a Kafka topic, signaling to purge the data associated with this user. Each service that maintains a copy of this data will have a corresponding listener, which can kick off a process to redact the data from its tables.

Each process can be executed in parallel, bounding the overall redaction time to that of the longest individual process. The redaction is considered successful only after every process has signaled a success acknowledgement.

Figure 1: Synchronous flow of personal data into storage, with asynchronous process to execute redaction.
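
As a rough illustration of the publishing side, the eligibility check and fan-out could look something like the sketch below. The topic name, the default window, and the message shape are hypothetical, and the example uses the kafka-python client; the production job’s schema and tooling may differ.

import json
from datetime import datetime, timedelta, timezone
from kafka import KafkaProducer   # kafka-python client

DEFAULT_REDACTION_WINDOW = timedelta(days=30)   # hypothetical default policy

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def is_eligible_for_redaction(order_completed_at, merchant_window=None, now=None):
    """Data becomes eligible once the merchant-defined (or default) window has elapsed."""
    now = now or datetime.now(timezone.utc)
    return now - order_completed_at >= (merchant_window or DEFAULT_REDACTION_WINDOW)

def publish_redaction_request(user_id, services_holding_data):
    """Tell every service that holds a copy of this user's data to purge it."""
    producer.send(
        "personal-data-redaction",                      # hypothetical topic name
        value={"user_id": user_id, "targets": services_holding_data},
    )

Each listening service would consume from the same topic and run its own table-level deletion before acknowledging success.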

Address geomasking

As an example, we can delve into a specific piece of data involved in the high-level flow, the user’s delivery address. In some cases, there is a false dichotomy between being able to protect sensitive personal information and being able to leverage the data for analytical purposes, including improving recommendations and generating valuable inferences. Address data, for instance, could be used to inform merchants where most of their orders are coming from or to help make recommendations to customers who are ordering from the same area. If we redact the data completely, we lose opportunities to improve our platform. Our challenge is to find a middle ground where we are able to effectively eliminate the risk of customer address data being misused while maintaining our ability to analyze location data to generate business opportunities.

Gaussian perturbation

To achieve that middle ground, we can transform the delivery address into geographic coordinates — latitude and longitude — through a process called geocoding. As shown in Figure 2, we can then displace these coordinates in both direction and distance by a degree of randomness to mask the specific location.

Figure 2: Geomasking a given text address

The distribution of these geomasked points follows a bimodal Gaussian (normal) distribution. A regular Gaussian approach applies the same displacement to both latitude and longitude. A bimodal approach, on the other hand, applies different displacements to the latitude and longitude, which allows more precise control over the level of obfuscation and reduces the chance that the displacement is 0.

The new geomasked point can be modeled by the following equation:

(Xnew, Ynew) = (Xo + Dx * cos(θ), Yo + Dy * sin(θ))

Xo, Yo: the original latitude/longitude coordinates

Dx: the distance by which we displace the latitude, randomly selected from a Gaussian distribution with mean 0 and standard deviation σx

Dy: the distance by which we displace the longitude, randomly selected from a Gaussian distribution with mean 0 and standard deviation σy

θ: the angle between the original location and the geomasked location, randomly selected from the range [0, 2π]

This allows us to fine-tune the standard deviation (σ) of the Gaussian distribution to protect user privacy while also maintaining a sense of the original delivery area and the opportunity for location analytics.
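
For illustration, here is a minimal Python sketch of this displacement. The miles-to-degrees conversion and the default σ values are simplifying assumptions for the example; a production system would use a proper geospatial library.

import math
import random

def geomask(lat, lon, sigma_lat_miles=0.25, sigma_lon_miles=0.25):
    """Displace a geocoded point using the bimodal Gaussian approach described above."""
    d_lat = random.gauss(0, sigma_lat_miles)    # Dx: latitude displacement in miles, mean 0
    d_lon = random.gauss(0, sigma_lon_miles)    # Dy: longitude displacement in miles, mean 0
    theta = random.uniform(0, 2 * math.pi)      # θ: random direction

    # Rough miles-to-degrees conversion; longitude degrees shrink away from the equator.
    miles_per_deg_lat = 69.0
    miles_per_deg_lon = 69.0 * math.cos(math.radians(lat))

    new_lat = lat + (d_lat * math.cos(theta)) / miles_per_deg_lat
    new_lon = lon + (d_lon * math.sin(theta)) / miles_per_deg_lon
    return new_lat, new_lon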

Twitter case study

This process has been studied using Twitter location data as shown below in Figure 3:

Figure 3: Comparison of clustering characteristics across different standard deviations.

As the standard deviation increases, the cluster of tweet locations becomes more sparse, making it harder to pinpoint where the tweets originated.


Quantifying geomasking effectiveness

We need a way to quantify the proposed geomasking’s effectiveness: How much have we reduced the risk of re-identification? Spatial k-anonymity is the most widely used method for determining this.

Spatial k-anonymity produces a value “K” which measures the number of potential locations that could be identified as the “true location” of the user after geomasking. With this value, the probability of a bad actor selecting the true location is 1/K. The larger the value of K, the more effective the geomasking is at protecting a user’s actual location.

Figure 4: Example of spatial k-anonymity

Population density

The population density of a user’s location can have an impact on geomasking’s effectiveness. Consider this extreme case: A user orders from a rural area in which their house is the only one in a square mile. After applying geomasking, a new displaced point might still exist within the boundaries of this square mile. This would mean that a bad actor could easily re-identify the user’s home address. 

Instead, we could make the standard deviation used for our Gaussian distributions inversely proportional to local population density. In urban areas, our geomasked points would deviate less than those in rural areas.

The nature of DoorDash’s business means that orders primarily are delivered to suburban or urban areas, making the rural example much less of a concern. Nonetheless, it is an important consideration in any number of other applications.

Estimating k-anonymity

We can estimate the k-anonymity of any given geomasked point by multiplying the local population density by a circular ring approximation of the Gaussian probability distribution as shown in Figure 5.

Figure 5: Circular ring approximation that can be used to estimate k-anonymity

This is inspired by the methodology used in a 2006 paper studying geomasking for epidemiology purposes:

“Since 68.26% of patients should fall, on average, within the first standard deviation, σ miles in radius from where they were originally located, we can multiply the local population density by the area, πσ² and by the probability that the patient would have been moved into that region, 0.6826. We can add to this the next ring’s population density multiplied by its area and its probability that a patient would be transplanted into that area, 0.2718. Finally, we can add the area of the last ring multiplied by its local population density by its probability density, 0.0428.”

This description can be modeled as follows:

K = (.6826 * β * A1) + (.2718 * β * A2) + ( .0428 * β * A3) 

K: the degree of k-anonymity

β: local population density

σ: standard deviation in Gaussian distribution

A1: the area of the inner circle = πσ²

A2: the area of the middle circle = 4πσ² – πσ²= 3πσ² 

A3: the area of the outer circle = 9πσ² – 4πσ² = 5πσ²

As an example, let’s use β = 100 households per square mile and set σ = .25 mi.

This results in a K of approximately 34, meaning roughly 34 candidate locations among which the true location could lie.
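
As a quick sanity check, the calculation above can be reproduced with a few lines of Python; the function below simply encodes the ring formula, not production code.

import math

def estimate_k_anonymity(population_density, sigma):
    """Ring approximation of the Gaussian distribution, per the quoted 2006 methodology.

    population_density: households per square mile (β)
    sigma: standard deviation of the displacement, in miles (σ)
    """
    a1 = math.pi * sigma ** 2         # inner circle, radius σ
    a2 = 3 * math.pi * sigma ** 2     # ring between σ and 2σ
    a3 = 5 * math.pi * sigma ** 2     # ring between 2σ and 3σ
    return population_density * (0.6826 * a1 + 0.2718 * a2 + 0.0428 * a3)

print(round(estimate_k_anonymity(100, 0.25)))   # prints 34, matching the example above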

Finding an optimal K value

According to Khaled El Emam in “Protecting Privacy Using K-Anonymity”:

“Even though a minimum K value of 3 is often suggested, a common recommendation in practice is to ensure that there are at least five similar observations (K = 5). It is uncommon for data custodians to use values of K above 5, and quite rare that values of K greater than 15 are used in practice.”

Another study on the effectiveness of geomasking on epidemiology data used a value of K = 20 as an indicator of success.

Based on these examples, we propose striving for a K value between 5 and 20.

Tuning the distribution to a reasonable standard deviation

Once we set our desired K and have a reasonable approximation of the population density of the subject region, we can reverse-engineer the previous formula to find a reasonable standard deviation to use:

K = ( .6826 * β * A1) + (.2718 * β * A2) + (.0428 * β * A3) 
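
Because A1, A2, and A3 are all multiples of πσ², the formula collapses to K ≈ 1.712 * π * β * σ², which can be solved directly for σ. A minimal sketch, assuming β is a reasonable household-density estimate for the area:

import math

RING_WEIGHT = 0.6826 * 1 + 0.2718 * 3 + 0.0428 * 5   # ≈ 1.712

def sigma_for_target_k(target_k, population_density):
    """Solve K = RING_WEIGHT * π * β * σ² for σ."""
    return math.sqrt(target_k / (RING_WEIGHT * math.pi * population_density))

print(round(sigma_for_target_k(34, 100), 2))   # ≈ 0.25 mi, consistent with the earlier example

This is also how the density adjustment described earlier falls out naturally: holding K fixed, denser areas yield a smaller σ and sparser areas a larger one.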

Discerning distance of address displacement

As an example, let’s say we chose a standard deviation of σ = .25 mi. We can then derive how much displacement we can expect following the 68-95-99.7 rule.

Proportion of geomasked points | Maximum displacement
68% (σ) | .25 mi away from the original point
95% (2σ) | .50 mi away from the original point
99.7% (3σ) | .75 mi away from the original point

With this displacement, we can expect 99.7% of the displaced points to fall within three-quarters of a mile of the original point.

Conclusion

We have demonstrated an effective approach to obfuscating user address data in order to help prevent re-identification, while still giving sufficient context to be able to analyze the data to improve our platform and provide a better user experience.

This November, we’re celebrating Native American Heritage Month as part of our overarching efforts to create a workplace that engages people of all backgrounds, while fostering an environment of diversity, equity, and inclusion.

In partnership with our Indigenous@ Employee Resource Group, we’re celebrating and acknowledging the rich, resilient, and diverse cultures, traditions, and histories of Indigenous members of our company and communities.

To kick off a month of celebrating Indigenous people and their accomplishments, we have a series of engaging in-person and virtual events, starting with a kick-off dinner near our Tempe corporate hub. 

On Tuesday, we’ll offer professional headshots for Indigenous@ members followed by a weeklong event dubbed Rock Your Mocs, encouraging our members to rock moccasins, or any traditional regalia, to work, the store, pow wows, or anywhere. 

From there, we’ll host a breathwork meditation session with Deoné Newell, learn about diabetes prevention with Geriah Paohway of Native Health Phoenix, and wrap up the month with storytelling from an Indigenous Micronesian Master Navigator of the Pacific Ocean, Pwo Larry Raigetal.

We will continue to prioritize investing in and advancing opportunities for historically underrepresented people. Our success as a company is firmly rooted in our inclusive culture and in advancing diversity throughout DoorDash to ensure we reflect the global audiences we serve, with learning and development opportunities available to corporate team members in support of allyship and belonging. 

Our efforts to create a workplace where everyone feels welcomed, supported, and valued are among the reasons we were recently recognized as one of America’s Greatest Workplaces 2023 by Newsweek. To learn more about this award and how we’ve driven the culture our employees have experienced over the last year, click here.

At DoorDash, we value every chance to boost order conversions in the app. When users fail to complete a purchase after adding items to their carts, we send push notifications such as the one shown in Figure 1 to remind them that their orders are still pending. It has been difficult, however, to determine whether users actually have abandoned their carts or instead are simply browsing for more items or different merchants within the app.

Figure 1: An example of a real-time push notification to prompt purchase completion.

To polish the notification experience, we want to ensure that cart abandonment notifications are sent in a timely manner — but only when users truly stop adding to their carts and abandon the app. 

To do this, we built a large stateful Flink job to track user sessions using mobile/web client events. By analyzing user activity and detecting periods of inactivity in short time frames, we now can send notifications at more opportune moments. This new solution has yielded significant conversion improvements. We delve here into the different options we considered, why we believe our new design is most efficient, and what we have learned as we implemented this solution.

Deciding whether to trigger in the frontend or backend

Backend events typically are generated by internal services, while frontend events are created by the mobile app or web. If our goal is to send notifications to users who update their cart and then stay idle for X minutes, how do we let the system know when to trigger the workflow?

Cart abandonment V1

Our old design required constantly checking the cart update status by calling the backend cart service, kicking off the send workflow when there were no updates. 

Figure 2: Cart abandonment V1 workflow.

As shown in Figure 2, this design does not reflect the actual status of user activities. It blindly checks the backend services at a fixed interval and sends potentially unnecessary notifications to the user, reminiscent of my daughter repeatedly asking whether we’ve arrived yet whenever we’re driving somewhere.

Cart abandonment V2

In our new design, however, real-time session detection offers a smarter solution that eliminates redundant calls, as shown in Figure 3:

Figure 3: Cart abandonment V2 workflow.

In this design, all user activities in the mobile and web applications generate analytic events. The real-time session detection’s job is to receive all frontend events continuously and group them into sessions. This allows detection of when a session ends as a result of user inactivity, generating a signal to the notification service to trigger the workflow.

Trigger: Real-time session detection

When customers use DoorDash’s app and website, a number of analytics events are generated that yield information about a user’s experience and activities. Detecting these user sessions can give the downstream notification services insights about user activities, including start time and end time. This is an opportunity to develop an intelligent notification system that can send context-aware prompts at the appropriate time with relevant content.  

Sessionization traditionally is done as a batch process at DoorDash. But there are huge computational costs and latency involved in joining multiple events and detecting gaps, caused largely by having to load massive amounts of data from cold storage to memory prior to processing. Additionally, the hours required to complete a batch process are incompatible with real-time session detection. 

The real-time platform team leverages our streaming platform to create sessions by grouping and identifying various mobile and web events, as shown in Figure 4. We developed a sessionization platform that lets application teams easily define and evolve sessions with different input events and session gaps. Because stream processing is centered on in-memory computation, it eliminates the need to load data from cold storage; events are processed in memory as soon as they arrive over the network, reducing both latency and the cost of data loading. With the sessionization job, we can detect in real time when customers finish a session, allowing applications to act on these signals.

Figure 4: Real-time session detection data flow.
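
To make the session-gap idea concrete, here is a heavily simplified Python sketch of the core grouping logic. The real job runs on Flink with keyed state and timers (so a session also closes when the gap elapses without any new event arriving); the gap value and field names below are illustrative.

from dataclasses import dataclass, field

SESSION_GAP_SECONDS = 30 * 60   # illustrative inactivity gap

@dataclass
class Session:
    user_id: str
    start: float
    end: float
    events: list = field(default_factory=list)

# Plays the role of Flink's keyed state: one open session per user.
open_sessions: dict = {}

def emit_session_end(session: Session):
    """Downstream signal, e.g., to the notification service."""
    print(f"session ended for {session.user_id}: {len(session.events)} events")

def on_event(user_id: str, event_time: float, payload: dict):
    """Fold a frontend event into the user's current session, closing it if the gap expired."""
    session = open_sessions.get(user_id)
    if session and event_time - session.end > SESSION_GAP_SECONDS:
        emit_session_end(session)
        session = None
    if session is None:
        session = Session(user_id, start=event_time, end=event_time)
        open_sessions[user_id] = session
    session.end = max(session.end, event_time)
    session.events.append(payload)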

There are, however, a few challenges that need to be addressed when using real-time session detection. 

  • Infrastructure for large-state computing: To build sessions from input events, Flink needs to keep all of a user’s session events until the session ends. This in-flight data is treated as state in Flink and is managed in the RocksDB state backend. With average DoorDash user sessions lasting around an hour, we need to maintain an hour’s worth of activity for each customer as state, which means the state can reach hundreds of gigabytes at any moment. To use the RocksDB backend, we couldn’t rely on the existing infrastructure, which shares local storage across jobs. We worked with DoorDash’s infrastructure team to support this kind of stateful computation with a guaranteed persistent volume for each Flink task manager.

  • Job failure and recovery: Failure recovery happens at Flink’s framework level by restarting failed tasks from the last checkpoint, which is persisted in S3 for durability and scalability. The Flink job manager is set up in high-availability mode with two job managers, one as leader and the other standing by. If the leader fails, the standby job manager immediately assumes the lead role and resumes the job from the job manager state stored separately in S3.


Send: Cart abandonment V2 notification workflow

After the cart abandonment workflow is triggered, we will not send notifications to users if:

  • The cart is no longer available, for example, if the user already has checked out.
  • Minimum requirements for sending aren’t met, such as missing critical fields.
  • There are other reasons that would make a notification inefficient.

Notifications will reach consumers through the notification platform only if all eligibility checks have passed, ensuring that these messages are timely and meaningful.
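
To illustrate, the eligibility gate can be thought of as a simple predicate like the sketch below; the field names and the one-per-day cap shown here are illustrative stand-ins rather than the production schema.

def should_send_cart_abandonment(cart, session_signal):
    """Apply the eligibility checks described above; any failure suppresses the notification."""
    if cart is None or cart.get("checked_out", False):
        return False                                    # cart no longer available
    if not session_signal.get("consumer_id") or not session_signal.get("store_name"):
        return False                                    # missing critical fields
    if cart.get("notified_today", False):
        return False                                    # frequency cap: at most one per day
    return True

# Example: should_send_cart_abandonment({"checked_out": False}, {"consumer_id": "c1", "store_name": "Acme Mart"}) -> True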

Result: Outperforming other notifications

The new design for cart abandonment drives significantly more order volume and revenue. During our experiment, notification send speed increased six-fold. We believe that because triggering is more accurate and timely, notifications are more relevant to consumers. As a result, consumers engaged with notifications more, including a 40% higher send-to-open rate and a 2.6% boost in visit rates over the course of the experiment.

Future Work: Cart abandonment notifications with promotions and more

Our success with redesigning cart abandonment notifications has led to a new series of experiments, including:

  • Adding promotions to cart abandonment notifications when there are eligible items in carts, encouraging consumers to continue checkout at a cheaper price.

  • Controlling the frequency of cart abandonment notifications. By default, we now send no more than one notification to a user per day. But through a sophisticated new analytics effort, we are experimenting with improving visibility by selecting target consumers to receive more than one per day.

Figure 5: Real-time event sending workflow.

Because we own frontend events, adding other notifications such as merchant abandonment notifications seems to be a natural evolution. With this system in place, such an expansion should be simple because the pipeline already is in place to manage notifications associated with consumer behaviors. In fact, each new notification is only two steps away: 

  1. Define a real-time event. The sessionization platform can easily define and evolve sessions with different input events, session gaps, and session outputs. The real-time infrastructure team can build automation and UI for developers to easily onboard new real-time signals. That team also is exploring options to enable developers to define custom logic to process a session event – for example, filtering against clients who viewed merchants but didn’t place any order. 

  2. Combine real-time events with the sending workflow. By adding the necessary validations, the sending pipeline will skip signals that are missing critical fields, such as store name and consumer ID. Because the sending pipeline is completely configuration-driven, onboarding a new notification send is more flexible and much easier than updating code.

Advice for working with frontend events

By allowing new approaches that can’t be done by backend events alone, frontend events play a critical role in cart abandonment use cases. Before you adopt frontend events in your next design, keep in mind:

  • Frontend data from mobile and web activities is not always reliable or stable. Data could be corrupted or important fields may be missing. Verify data quality before proceeding (see the sketch after this list).

  • Various glitches may cause unexpected delays. During our experiment, some frontend events reported delays exceeding 24 hours. In some cases, delays were caused when the application was terminated before event syncing to services was completed; delayed events then weren’t reported until the application reopened. An application should take possible delays into account and have methods for handling them.

  • Make input more reliable by checking in regularly with the frontend team so that any issues are resolved quickly.

  • Use both frontend and backend data. Backend calls may be able to answer questions that the frontend can’t.  In the cart abandonment workflow, fields such as country code and store open status can be fetched from backend services. There is invaluable synergy between frontend and backend data. It not only enables more accurate decision-making, but also empowers an organization to deliver personalized and contextually relevant experiences to users.
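
As a rough example of the first two points, a minimal validation step might look like the following; the required fields, the two-hour staleness threshold, and the timezone-aware ISO-8601 timestamp format are assumptions for illustration.

from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(hours=2)                     # illustrative staleness threshold
REQUIRED_FIELDS = ("consumer_id", "event_time", "event_type")

def validate_frontend_event(event: dict):
    """Drop corrupted or badly delayed frontend events before they trigger notifications."""
    if any(not event.get(f) for f in REQUIRED_FIELDS):
        return None                                    # missing critical fields: discard
    event_time = datetime.fromisoformat(event["event_time"])   # assumes timezone-aware ISO-8601
    if datetime.now(timezone.utc) - event_time > MAX_EVENT_AGE:
        return None                                    # stale event (e.g., synced after app reopen)
    return event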

Conclusion

We are thrilled with the results of this project. It not only reduces the likelihood of customers leaving essential items in their carts, but also enhances DoorDash order processing efficiency, improving overall user satisfaction. Leveraging real-time data opens a pipeline to a multitude of future possibilities. For example, the app may suggest the related store or items based on the store view and search history in the sessions. We eagerly anticipate incorporating additional experiments and taking further strides toward elevating our app’s user experience.

Acknowledgements

Applause to team members who directly contributed to the project: Mengyang Cui, Eleanore Jin, Julie Xue, Sonic Wang, Allen Wang, Yanlin Peng, Nicole Lin, James Smith, Eric Bencomo Dixon, and Kristine Fernandez.

Special thanks to the following team members who closely collaborated with us: Karthik Thota, Michael Adaikalaraj, Xavier Hodges, Praveen Goparaju, Mike Pitre, Jeremy Pian, Chun-Wei Lee, Almer Bacallan, Abhijeet Bane, Keith Lehmann, Lisa Shang, Bronson Mirafuentes, Jason Lai, and Yun Huang.

At DoorDash, the growth and development of our engineering talent is critical to our success and ability to continue innovating. Apprenticeship has had a long history of successfully cultivating new generations of talent across many different industries. Tech is no different. Designed to identify and foster exceptional engineering talent within the company, DoorDash’s Apprentice Engineering Manager Program prepares engineers to transition into a people management role effectively and autonomously. Launched in 2019, the program is a mandatory part of preparing senior engineers for promotion. The shift from individual contributor to manager can be a significant step in an engineer’s career because it involves not only taking on the responsibility of overseeing the work of other engineers but also managing their career progression. Here we explore what makes this program successful, what we’ve learned so far, and our plans for the initiative’s continued growth.

Why we created the apprentice engineering manager program

The Apprentice Engineering Manager Program was created to help senior engineers who demonstrate an aptitude for people management to further develop these skills. The program also serves several other key goals:

  • Nurturing/retaining talent and developing leaders: Many senior engineers have already built skills such as leading through influence and delivering impact across a broad scope. The Apprentice Engineering Manager Program offers them an additional platform to hone leadership capabilities. Through structured training and mentorship, this program equips engineers with the core skills and knowledge needed to excel in management. By investing in developing internal talent, DoorDash can retain skilled engineers who might otherwise seek opportunities elsewhere to pursue their career growth. Our initial goal was to create a program for career development within a safe and familiar environment.

  • Building and scaling a solid management bench: DoorDash recognizes the importance of exceptional managers in the success of its engineering teams. This program ensures we have a robust and adaptable management team able to address dynamic challenges while giving us the strategic advantage of scaling our leadership capacity to meet the demands of our growing organization. By thoughtfully selecting and grooming potential managers, the company ensures effective leadership for every engineer.

  • Enabling empathetic leadership: When individual contributors (ICs) transition into engineering management roles, it often brings the team a valuable infusion of empathy and understanding leadership. With their deep IC backgrounds, apprentice engineering managers (AEMs) already understand the challenges their team members face, leading to more compassionate and supportive leadership. They tend to lead by example, naturally fostering a culture of collaboration and mentorship, ultimately enhancing team dynamics and performance.

Program eligibility 

Engineers seeking to enter the Apprentice Engineering Manager Program must meet specific criteria, including being a senior tech leader who has demonstrated experience leading impactful projects while maintaining good performance and demonstrating strong motivation to mentor other engineers. Designed to ensure candidates have the requisite technical and leadership skills to be successful managers, these criteria guide their managers toward nominating engineers for the program. Nominees are reviewed and then take part in interviews prior to acceptance. The process encourages early conversations about a candidate’s interest in moving into management and gives engineers the opportunity to understand the program’s depth while allowing them to make an educated decision about being nominated.

Mentoring and support

Program participants partner with high-performing engineering managers from other teams. This mentorship involves biweekly 30-minute sessions to help AEMs effectively integrate into their new roles. Managers who move direct reports to an AEM are expected to participate in the transition and mentorship process. Additionally, a Slack channel is created for each cohort, allowing AEMs to share ideas, ask questions, and receive updates on the training schedule. This community support enhances their shared learning experience.

Graduating from the program

Over time, AEMs are expected to align with the level descriptions documented in Management Levels Expectations. Feedback is documented throughout the program and provided to AEMs to help them understand their progress. After six months, a final 360 review is conducted during which a committee of key stakeholders decides whether the AEM will continue as a people manager, require a longer apprentice period, or transition back to an individual contributor role.

What we’ve learned so far

The Apprentice Engineering Manager Program has yielded valuable insights and lessons since its inception, which now serve as a foundation for DoorDash’s commitment to continuous improvement.

The program has highlighted the importance of identifying potential leaders within the organization early. By nominating engineers for the program, DoorDash encourages proactive discussions about career growth and leadership aspirations. Structured training sessions facilitated by senior leaders effectively equip AEMs with the necessary skills for their roles, ensuring that participants receive consistent guidance from experienced professionals. Managers of our AEMs provide continuous feedback throughout the program to track progress and encourage AEMs to sharpen their skills amongst peers.

“I appreciated how the program made the managerial world feel smaller,” said Engineering Manager Bryan Huang who works within our Launchpad vertical and graduated from the program in early 2023.

“It connected me to managers starting out just like me, to a manager mentor who had been managing for a while, and finally to experienced executives who ran our sessions. To this day, I understand the importance of connecting and empathizing with people across our org.”

– Bryan Huang

The value of collaboration also became clear to Bernard Ng, a Manager of Technical Program Management, who graduated in 2022. “I enjoyed my experience with the program as it provided a cohort of managers with similar experiences and situations,” he said.

“The content was practical and, more importantly, the exposure to senior presenters led to useful Q&A dialogue that went into more nuances. Real-life applied learnings include how I structure my hiring conversations and strategies for closing candidates. An assignment of an experienced manager mentor helped me navigate situations and set good foundations on my team.”

– Bernard Ng

The program and cohort participants foster a strong sense of community among AEMs. This platform not only facilitates knowledge sharing but also serves as a support network for participants and allows us to exchange insights and lessons with the overall engineering manager community. 

Future of the AEM program

With dozens of successful managers coming out of the program over the last few years, we’re eager to continue scaling it.  Through leveraging both quantitative and qualitative metrics such as attrition, overall manager effectiveness, and tenure, we have developed ambitious plans to enhance the program. Two key focus areas will be expanded programming for developing more practical skills and increased diversity to ensure our cohorts represent the full scope of our talent.

In short, DoorDash’s Apprentice Engineering Manager Program has continued to evolve with each new cohort. It is not just a training initiative; it’s a strategic investment in the company’s future. Through identifying and nurturing talent within our ranks, DoorDash is poised to maintain its competitive edge in the tech industry. With a commitment to continuous improvement and a focus on inclusivity, the program exemplifies DoorDash’s dedication to fostering leadership excellence and innovation within its engineering teams.
