As part of our ongoing efforts to enhance product development while safeguarding app health and the consumer experience, we are introducing metric-aware rollouts for experiments. Metric-aware rollouts apply established decision rules and automated checks on standardized app quality metrics to flag issues during the rollout of new features.

Every action DoorDash takes focuses on enhancing the consumer experience. Through deploying metric-aware rollouts, we aim to prevent performance degradation and other problems for our customers when we add new elements or products. This new feature allows us to experiment and innovate while continuing to safeguard app health and deliver a delightful user experience.

Navigating product development risks

Product development inherently involves risks. Experimentation and rapid product releases can lead to unexpected consequences. At DoorDash, we have occasionally seen such unintended effects in the form of increased app latency, higher error rates, and — in extreme cases — app crashes. Shipping experiments can degrade app latency, directly impacting the consumer experience, which is always at the forefront of our minds.

Before metric-aware rollouts, the key issues were awareness, standardization of metrics and processes, and traceability. Without app quality metrics, experiment owners were in the dark about app degradation; without standardized metrics, app health tracking was inconsistent. Problematic rollouts were not paused or blocked when metrics degraded, and metric owners and app page owners did not know which experiments were impacting their metrics, so identifying the cause of a degradation could take several weeks. This absence of a systematic detection mechanism for app quality degradation created significant challenges for both metric/product owners and experiment owners:

  • App quality metric/product owners face a labor-intensive process to identify which experiments are causing app quality degradation; this often involves manual tracking and analysis. Without a detection system, problematic experiments may reach full rollout, exacerbating issues.
  • Experiment owners may not know which app quality metrics to monitor and what actions to take when these metrics show statistical significance. Interpreting results can be challenging, leading to frequent consultations with metric owners.

To address these challenges, the automated guardrails introduced by metric-aware rollouts let us iterate and release products in a more controlled manner, balancing speed and quality through scalable, automated mitigation of app quality risks. With more measured, across-the-board tracking, we can ensure our experiments are both innovative and safe, ultimately enhancing the product development process and protecting the consumer experience.

Enter metric-aware rollouts

Metric-aware rollouts address these issues with automated checks on standardized app quality metrics. With this new process, we automatically pause rollout and alert the experiment owner when degradation is detected. This ensures consistent measurement and immediate response when deviations occur. Here’s how we achieved this:

(1)  Standardized metrics: App quality metrics are now uniform across all experiments with two layers of metrics for quality checking.

  • Composite metrics: A single composite metric per platform (iOS, Android, and web) measures the percentage of page load and page action events deemed “bad” — for example, events that encountered poor latency, an unexpected error, or a crash. By minimizing the number of app quality-related sub-metrics that the experimentation platform needs to check, we reduce the risk of false positives as well as computational cost and time.

However, in the case of statistically significant degradation in any of the three composite metrics, we add a second layer of supplementary checks to democratize the debugging and diagnostic process and provide a more granular view of our app quality status.

  • We break down app quality measurement into five sub-metrics, each with corresponding page- or action-level dimension cuts that let experiment owners easily pinpoint the exact issues within each experiment.
    • Page load duration P95
      • App: Measures the time from the empty screen display to the data being fetched, processed on the clients, and updated on the screen.
      • Web: Measures the time from a page request until its largest element is visible on the screen, i.e., the largest contentful paint (LCP) time.
      • Dimension cuts: All core T0 pages
    • Action load duration P95
      • App: Measures the time from the user interface receiving user input events to the action being successfully completed.
      • Web: Measures the time from initiating a GraphQL call to the client receiving a response within an action.
      • Dimension cuts: All core T0 actions
    • Unexpected page load error
      • App & web: Measures the percentage of unexpected errors encountered out of all page load events, where an unexpected error is one caused by the app and hence fixable on our end.
      • Dimension cuts: All core T0 pages 
    • Unexpected action load error
      • App & web: Measures the percentage of unexpected errors encountered out of all action load events.
      • Dimension cuts: All core T0 actions
    • Crash
      • App & web: Measures the percentage of crashes encountered out of all page load events without errors.
      • Dimension cuts: All core T0 pages 
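
To make these sub-metric definitions concrete, here is a small, illustrative computation of two of them over a handful of synthetic page load events; the event schema and values are invented for the example and are not DoorDash's real telemetry format.

```python
# Illustrative only: compute "page load duration P95" and "unexpected page load error"
# rate from a tiny batch of synthetic page load events.
import numpy as np

page_loads = [
    {"page": "store_page", "duration_ms": 820,  "unexpected_error": False},
    {"page": "store_page", "duration_ms": 1430, "unexpected_error": False},
    {"page": "store_page", "duration_ms": 3900, "unexpected_error": True},
]

durations = [event["duration_ms"] for event in page_loads]
p95_load_duration = np.percentile(durations, 95)   # page load duration P95
error_rate = sum(event["unexpected_error"] for event in page_loads) / len(page_loads)

print(p95_load_duration, round(error_rate, 3))     # 3653.0 0.333
```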

(2)  Automated analysis, tolerance configuration, and false positive control: App quality analyses are created automatically for each experiment to thoroughly examine whether it has led to any significant degradation in app quality.

The analysis follows the decision-making process outlined below. This automation streamlines complexity for users by considering the following factors:

  • Experiment robustness: All experiment health checks are taken into account to ensure the setup is robust and the results are valid.
  • False-positive control: Addressing false positives typically requires specific knowledge; however, our process handles this complexity through Multiple Hypothesis Testing correction and sequential testing.
  • Tolerance for degradation: We introduce the concept of tolerance because we recognize that some degree of regression on app quality metrics may be acceptable during feature development. This allows metric owners to configure tolerance levels for each app quality metric independently. We assess against these thresholds to detect significant degradation.

With the experimentation platform handling decision-making complexity, both experiment and metric owners can proceed confidently with their respective tasks. 

 Figure 1: Decision tree for the automatic detection of degradation

(3) Automatically disabled rollout or rollback: If degradation exceeds the set threshold, the rollout is paused or blocked to protect app health. Experiment owners receive alerts and can take immediate action.
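
As a rough sketch of how such an automated check could be wired together, the snippet below pauses a rollout when any app quality metric shows a statistically significant regression beyond its configured tolerance. The names, thresholds, and simplified decision rule are illustrative assumptions, not DoorDash's production implementation.

```python
# Hypothetical sketch of the rollout gating decision; illustrative only.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str                # e.g. "page_load_duration_p95"
    relative_change: float   # treatment vs. control; +0.03 means 3% worse
    p_value: float           # assumed already corrected for multiple testing / sequential looks
    tolerance: float         # max acceptable degradation configured by the metric owner

def should_pause_rollout(health_checks_passed: bool,
                         results: list[MetricResult],
                         alpha: float = 0.05) -> bool:
    """Pause the rollout if the experiment setup is unhealthy or any app quality
    metric shows a statistically significant degradation beyond its tolerance."""
    if not health_checks_passed:
        return True  # results can't be trusted; flag for manual review
    return any(r.relative_change > r.tolerance and r.p_value < alpha for r in results)

# Example: crash rate regressed 2% with a 0.5% tolerance -> rollout is paused.
results = [MetricResult("crash_rate", relative_change=0.02, p_value=0.001, tolerance=0.005)]
print(should_pause_rollout(True, results))  # True
```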

Approval process and budget framework

For DoorDash to be competitive, we know we must quickly iterate on ideas that are worth pursuing. During the experimentation phase, we do not expect teams to hyper-optimize performance; instead, we encourage them to prioritize fast learning and to invest in optimization only after the business value is proven. As a result, some experiments will negatively impact app quality.

We created a budget framework that allows experiment owners to roll out features with short-term negative app quality impacts; owners remain responsible, however, for repaying those negative impacts through a clawback plan.

When the metric-aware rollout system detects a quality regression that stays below the metric's tolerance threshold, the experiment team must prepare a clawback plan and obtain sign-off from the metric owner team before continuing the rollout. The clawback should be prioritized for action no later than the following quarter. If the regression exceeds the tolerance threshold, further rollout is blocked unless the experiment team obtains an override approval from leadership.

Impact on DoorDash and beyond

Implementing metric-aware rollouts provides significant benefits, ultimately enhancing the experience for our customers and our engineering team.

  • Enhanced visibility and diagnosis: By standardizing app quality metrics, we ensure that all consumer tests are visible, allowing for effective diagnosis of any issues. This means we can quickly identify and address problems, maintaining the high quality our customers expect.
  • Rapid response: With the ability to automatically pause or block experiment rollouts when issues are detected, the time from experiment start to rollback is significantly reduced. This proactive approach safeguards app health and ensures a consistently delightful experience for our customers.
  • Proactive management: Experiment owners can now monitor the impact of their projects on app quality in real time. They can identify the root causes of any degradation and take immediate action to rectify the issues, ensuring our app remains reliable and user-friendly.

Our commitment to metric-aware rollouts is not just about operational efficiency; it’s about delivering a seamless and enjoyable experience for our customers. By proactively managing and maintaining app quality, we ensure that every interaction with our app is positive and frustration-free, reinforcing our dedication to customer satisfaction.

Broader implications

Metric-aware rollouts mark a significant step forward in safeguarding app health and consumer experience. By addressing key challenges in our development process, we have created a scalable and automated solution that mitigates risks while maintaining our product development momentum.

While initially focused on app quality, our approach to metric-aware rollouts has the potential for broader applications. It provides a robust framework for managing risks across various metrics, potentially extending to other business guardrails. This standardization can streamline decision-making processes company-wide, ensuring consistent and reliable product development.

Looking ahead, we plan to expand this feature to additional company guardrail metrics, implementing smarter systems for metric association, monitoring, and reducing false alarms. Ultimately, we aim to enable auto-rollout and rollback of experiments, further enhancing our ability to innovate safely and efficiently.

This is just the beginning of what metric-aware rollouts can offer. As we continue to refine and expand this feature, we anticipate even greater improvements in app health and consumer experience, setting new standards for excellence in product development.

Acknowledgment

Without a doubt, building such a safeguard framework to support a diverse set of use cases requires collaboration between various teams, including design and build of the feature on the experimentation platform — Erich Douglass, Eric Liu, Jonathan Orozco, Baldomero Vargas, Yixin Tang, and Drew Trager; insights from the consumer platform team — Daniel Hartman and Akanksha Gupta; and Pradeep Devarabetta Mallikarjun from the Dasher team for enabling this feature on Dasher experiments.

During Veterans Heritage Month, all of us at DoorDash celebrate and honor veterans for their service. During the month of July, we are reminded of the brave individuals who have served in our armed forces in the US and Canada. 

In partnership with our Veterans@ Employee Resource Group, we’ve developed a blend of in-person and virtual programming that emphasizes the unique skills, perspectives, and experiences of our veterans workforce and communities. 

To kick off the month, our Veterans@ ERG participated in a group WeDash day followed by lunch at a veteran-owned local business in Tempe. Throughout the month, we hosted a veterans home volunteer event featuring donations and a painting social hour in Tempe, a food bank event in San Antonio, and a Habitat for Humanity event with Operation Playhouse to build a playhouse for a veteran family in Boston. 

To round out the month, our Able@, Black@, and Indigenous@ ERGs joined our Veterans@ for a game of trivia. 

To all the courageous active military, veterans, and to those who care for you: we at DoorDash thank you for your service, sacrifice, and bravery.

Click here to learn more about what it’s like to work at DoorDash and our Military Leave Policy, which we developed in partnership with our Veterans@ leadership team last year. 

As a proud alumnus of UC Berkeley, DoorDash CEO and Co-Founder Tony Xu recently delivered this year’s keynote address for the College of Engineering’s 2024 bachelor’s degree ceremony.

Tony shared his experience as a child of immigrants, his journey through school, and his non-linear path to co-founding a company. 

Check out the full speech below to hear more about Tony’s life, career, and the three important lessons he’s learned along the way:

DoorDash has redefined the way users explore local cuisine. Our highly interactive notification system has been an integral part of this experience by not only keeping users updated about deliveries but also by acting as a pathway to personalized restaurant recommendations.

Our notifications are meticulously designed to be an essential line of communication that keeps both utility and personalization in mind. Because we believe each meal should be an exploration and an opportunity to discover new culinary delights, we leverage personalized notifications to open unexplored avenues of cuisine for users to try. With each successful experience, users gain a stronger sense that DoorDash understands their preferences and values, fostering loyalty and engagement.

In this blog post, we discuss how we deploy Graph Neural Network (GNN) models to create personalized restaurant recommendations to share with users through push, email, and hub notifications, as shown in Figure 1.

Real-world data often exhibits a rich, interconnected structure that can be naturally represented as a graph. A graph is a data structure consisting of two components: nodes, often called vertices, and edges. For instance, a simple graph G can be defined as G = (V, E), where V is the set of nodes and E is the set of edges between them. In the DoorDash context, a user and a restaurant can be nodes, and an order can be an edge between them.
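
As a minimal illustration of this graph view, the snippet below builds a tiny user-restaurant graph with networkx; the node and edge attributes are invented for the example.

```python
# Toy user/restaurant graph: nodes carry attributes, an order becomes an edge.
import networkx as nx

G = nx.Graph()
G.add_node("user_1", node_type="user", lifetime_orders=42)
G.add_node("rest_9", node_type="restaurant", cuisine="thai", avg_rating=4.6)
G.add_edge("user_1", "rest_9", edge_type="order", subtotal=23.50)

print(G.nodes["rest_9"]["cuisine"])   # "thai"
print(list(G.neighbors("user_1")))    # ["rest_9"]
```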

GNNs are a class of neural networks designed specifically to work with graph-structured data. They leverage the relationships represented by edges, along with the attributes of nodes, to perform a range of machine learning tasks such as node classification and edge (link) prediction. Used broadly by companies such as Pinterest, LinkedIn, UberEats, Airbnb, and Meta, GNNs can analyze relationships between products and users to create a recommendation framework based on past behavior and interactions.

Figure 1: Notification content generation and user experience

Why GNNs?

Without manual feature engineering, traditional machine learning models struggle to discern relationships between entities such as users and restaurants and interactions such as orders. But because GNNs work with graphs that already model complex interactions and relationships, they can make recommendations based on the entire heterogeneous feature set of products and users. GNNs reduce the need for manual feature engineering to some degree by focusing on:

  • The interdependency of training examples that enhance each other;
  • Expressive and learnable aggregations; and 
  • Flexibility across tasks and training objectives.

GNN models also have the advantage of being able to learn collaborative filtering and combine it with node features and attributes to recommend new restaurants to try based on previous interactions. GNN models can transform and smooth node features by passing and transforming them around each node's neighborhood. The graph view of users and restaurants is powerful because it allows enriching an entity's representation with information from other connected entities. Moreover, this feature enrichment is learned, so it will be optimal for the given task.

DoorDash's Marketplace recommendation challenges, such as reaching inactive users through notifications and leveraging fragmented data, are ideally addressable by GNNs. GNNs excel at extracting personalized information by utilizing interconnected data from various sources such as users, restaurants, orders, and views. This makes it possible to generate relevant recommendations even when user interaction is minimal.

How GNNs work

GNNs learn from graph data. Our DoorDash data is a rich source for graph creation, with nodes such as users and restaurants connected by edges such as orders and views, as shown in Figures 2 and 3 below.

Figure 2: Our DoorDash Data in a graph
Figure 3: An example of data collected and used to build graphs

GNN mechanism

Figure 4: The above two pictures represent node classification using GNN. The left side picture shows that we want to classify the node A. The right side shows how the information is passed across the neighbors to help in classifying the node A. [Source: Professor Jure Leskovec’s slides]

There are multiple components in the GNN models:

  • Message passing: In message passing, each node in the graph sends a message to its neighboring nodes, typically relaying the node’s features, which are then processed by the receiving nodes.
  • Aggregation: Each node aggregates messages from neighboring nodes to update its representation. This step involves combining incoming messages in a meaningful way, such as summing or averaging them. The aggregated information is then used to update that node’s features to capture the influence of its neighbors.
  • Stacking of layers: In GNN, multiple layers of message-passing and aggregation are stacked up to capture more complex representations of the graph’s nodes. Each layer in the stack inputs node features from the previous layer, processes them through message passing and aggregation, then updates the node representations. As information passes through multiple layers, each layer learns to capture different levels of abstraction and relationships in the graph. The deeper the network, the more complex and abstract the features are that it can learn, allowing for more accurate predictions and better performance on graph-related tasks.
  • Learning objective: Ultimately, we define the task based on the learning objective, such as node classification, link prediction, graph prediction, or node clustering.
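
The toy snippet below sketches one round of message passing with mean aggregation and a second stacked layer. It is a bare-bones illustration of the mechanics described above, with the learnable parts reduced to a single weight matrix, rather than a production GNN layer.

```python
# Minimal message passing with mean aggregation; illustrative only.
import numpy as np

def message_passing_layer(node_features, adjacency, weight):
    """node_features: (num_nodes, dim); adjacency: (num_nodes, num_nodes) 0/1; weight: (dim, dim)."""
    messages = adjacency @ node_features                        # sum of neighbor features
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)   # avoid division by zero
    aggregated = messages / degree                              # mean aggregation
    # Combine each node's own features with its aggregated neighborhood, then transform.
    return np.tanh((node_features + aggregated) @ weight)

x = np.random.rand(4, 8)                       # 4 nodes, 8-dim features
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)
w = np.random.rand(8, 8)
h1 = message_passing_layer(x, adj, w)          # layer 1: 1-hop neighborhood
h2 = message_passing_layer(h1, adj, w)         # layer 2: stacking widens the receptive field to 2 hops
```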

Creating graph datasets

DoorDash uses data from many kinds of user<>restaurant interactions to build graphs and train GNNs. Heterogeneous graphs are created from one year's worth of collected data. The graph is defined from relational tables, as shown in Figure 5 below: each table defines a node type, each row in a table is a node, and matched nodes define an edge.

Figure 5: Data collected and used to build graphs in tabular form showing attributes of nodes and edges between them

Model training

There are three broad components for GNN model training:

  1. Encoder: A GNN's encoder encodes the graph structure and node features into a latent space representation that captures the relational information and dependencies among the graph's nodes. We encode the features into user and restaurant embeddings.
  2. Model: The model lies at the core of a GNN, processing the latent node representations generated by the encoder to make predictions. It typically involves additional layers of neural networks that process the node embeddings and produce the desired output. We obtained strong offline evaluation results with ID-GNN (identity-aware GNN), which is designed to leverage the structural and identity-specific features of graph-structured data. To embed a given node, an ID-GNN first extracts an ego network centered at that node and then applies message passing, where the messages from the center node and from the rest of the nodes are computed using different sets of parameters.
  3. Prediction: Ultimately, a GNN uses the learned model to make predictions on unseen or new data. We defined our modeling task as link prediction between user and restaurant. The model component predicts the probability that a user will order from a given restaurant within the next seven days.

We train the GNN model with the above-mentioned components and retrieve both user and restaurant embeddings.
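
A simplified sketch of that link-prediction step: score each user-restaurant pair by the similarity of their learned embeddings and squash it into a probability of ordering within the next seven days. The embeddings and helper below are illustrative stand-ins, not our trained model outputs.

```python
# Illustrative link-prediction scoring from learned embeddings.
import numpy as np

def order_probability(user_emb, restaurant_emb):
    """Dot-product affinity squashed through a sigmoid."""
    score = float(np.dot(user_emb, restaurant_emb))
    return 1.0 / (1.0 + np.exp(-score))

user_emb = np.array([0.2, -0.5, 0.9])
candidates = {"rest_9": np.array([0.1, -0.4, 1.1]),
              "rest_3": np.array([-0.8, 0.7, -0.2])}
ranked = sorted(candidates, key=lambda r: order_probability(user_emb, candidates[r]), reverse=True)
print(ranked)  # ['rest_9', 'rest_3']
```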

Offline evaluation

We measure the performance of the GNN model with evaluation metrics such as MAP@k and MRR@k, for values of k ranging from 1 to 10, against our production LightGBM model. The GNN model showed a significant lift in offline performance.
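
For reference, simplified single-query versions of these ranking metrics with binary relevance can be computed as follows; these are textbook definitions for illustration, not our internal evaluation code.

```python
# Simplified MRR@k and MAP@k for a single query with binary relevance.
def mrr_at_k(ranked, relevant, k):
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def map_at_k(ranked, relevant, k):
    hits, precision_sum = 0, 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / min(len(relevant), k) if relevant else 0.0

ranked = ["rest_9", "rest_3", "rest_5"]
print(mrr_at_k(ranked, {"rest_3"}, k=3))  # 0.5
print(map_at_k(ranked, {"rest_3"}, k=3))  # 0.5
```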

Integrating GNN recommendations in the ecosystem

We send notifications to users once per day, so we don't need our ML system to be online. We retrieve user and restaurant embeddings from the GNN model. Because we only want to send recommendation notifications for restaurants in a user's delivery area, we use geolocation to select restaurant candidates: based on the geohash and distance of restaurants from the user's recently used addresses, we identify up to N of the closest restaurants within the delivery range of each address. During the ranking stage, we use GNN-based user and restaurant embeddings to compute the likely affinity between the user and the candidate restaurants for each address. As shown in Figure 6 below, we then apply post-processing filters to introduce freshness, remove restaurants with low ratings, and deduplicate restaurants, for instance by eliminating multiple locations of the same franchise. These restaurant recommendations are then sent to our users via the notification platform.

Figure 6: Recommendation flow
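
Putting the retrieval, ranking, and post-processing steps together, a hedged sketch of this daily batch flow might look like the following; the function, fields, and filter thresholds are illustrative assumptions rather than DoorDash's internal APIs.

```python
# Illustrative daily recommendation flow: geo-filtered candidates -> embedding
# ranking -> post-processing (low-rating filter, franchise dedup).
import numpy as np

def recommend(user_emb, nearby_restaurants, top_n=5):
    # Callers are assumed to pass only restaurants within delivery range of the
    # user's recent addresses (e.g. via a geohash lookup).
    scored = sorted(nearby_restaurants,
                    key=lambda r: float(np.dot(user_emb, r["embedding"])),
                    reverse=True)
    seen_brands, results = set(), []
    for r in scored:
        if r["rating"] < 4.0 or r["brand"] in seen_brands:
            continue  # drop low-rated stores and duplicate franchise locations
        seen_brands.add(r["brand"])
        results.append(r["id"])
        if len(results) == top_n:
            break
    return results

nearby = [
    {"id": "rest_9", "brand": "thai_spot",  "rating": 4.6, "embedding": np.array([0.1, -0.4, 1.1])},
    {"id": "rest_3", "brand": "thai_spot",  "rating": 4.2, "embedding": np.array([0.2, -0.3, 0.9])},
    {"id": "rest_5", "brand": "burger_hub", "rating": 3.1, "embedding": np.array([0.9, 0.1, 0.0])},
]
print(recommend(np.array([0.2, -0.5, 0.9]), nearby))  # ['rest_9']
```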

Impact

After we began sending personalized recommendations with the heading “Try Something New,” powered by the GNN, we saw a 1.8% relative increase in input metrics such as push and hub engagement. We also saw a statistically significant lift in top-line metrics such as monthly active users. The impact was widespread across all user cohorts, validating that our users are satisfied with the quality of our recommendations.

Looking into the future

We plan to extend GNN applications to generate recommendations for more notification types, helping us reach our users with more diversified and unexplored content. We will continue to iterate our GNN model architecture to improve the personalization of recommendations for notifications.

Acknowledgments

Special thanks to Zidong Yang, Lei Xu, Yanjia Xu, Richard Tian, Mengyang Cui, and Juncao Li for making GNN possible at DoorDash.

I’m thrilled to introduce our brand-new DoorDash office in São Paulo, Brazil. We’re committed to fostering an environment where Brazilian engineers can thrive and make a significant global impact. Our office is more than just a workplace; it’s a hub of innovation, collaboration, and growth. 

With over 5 billion orders fulfilled, DoorDash is a technology company that connects consumers with their favorite local businesses. We build products and services enabling these businesses to innovate and grow. This year, DoorDash debuted on the 2024 Fortune 500 list among the top 500 US companies based on total revenue. We operate DoorDash and Wolt marketplaces in 30 countries, with a strong presence in the United States, Canada, Australia, Japan, New Zealand and Europe.

Life of an engineer at the São Paulo DoorDash office

São Paulo is not just the largest city in Brazil; it’s a burgeoning tech hub with incredible talent. By establishing a presence here, DoorDash is tapping into a rich pool of skilled engineers who bring diverse perspectives and innovative ideas.

Engineers who join our São Paulo office get to work on cutting-edge technology, while contributing to projects that have a worldwide impact. It’s an opportunity to collaborate with top-tier professionals, gain international experience, and grow your career in a dynamic and supportive environment.

We value in-person collaboration while also recognizing the importance of providing employees with the flexibility to work from home. For those who want that in-person connection, we meet a few days a week at our WeWork office on Av. Paulista and enjoy team lunches, snacks, coffee, and a friendly competition of foosball.

You’ll get the opportunity to work with professionals from around the world and travel to our other offices to meet some legendary figures from Silicon Valley. The problems our engineers get to solve push the boundaries in the dynamic world of logistics, and it’s one of the key motivators of working at DoorDash for software engineer Breno Silva: 

There is a lot of room for innovation here. Everything we do is designed to be done in the best way ensuring we deliver quality rather than just fulfilling a delivery and then having to redo it in a few months.

Breno Silva, Software Engineer on the Money Team

Come join us at our Engineering hub in Brazil!

We are helping to write the beginning of DoorDash’s story in Brazil, shaping the local company culture for the growth that is coming in the near future. 

The opportunity to work at DoorDash is truly unique. Here, I have the chance to work with the best technology professionals in the market, as well as interact with people from different backgrounds and cultures in a global environment. The daily challenges and opportunities for learning and professional growth are key motivators!

Marcela Garcia, BI Engineer on the Data Platform Team

We are looking to hire engineers for the following teams:

  • Data Platform
  • Money & Financial Insights
  • Trust & Safety
  • Risk Data
  • Ads

It is exciting to work in a diverse environment with people from various backgrounds and nationalities. DoorDash is a fast-paced company where you get to solve real-world challenges. For those who like working on a massive scale on an intriguing product, DoorDash will provide that!

  Lucas Lopes, Recruiter

Interested? Click here to see our open positions in the São Paulo office.

No matter what she worked on, Melissa Puskar was always drawn to solving problems. She was incredibly curious about the way the world works, and after achieving a degree in industrial engineering, she began to put the pieces together on how she could make an impact.

She had varied experiences early in her career, understanding how warehouse operations work, sitting in corporate functions, and engaging in both customer- and vendor-facing roles. In 2017, Puskar had an opportunity to join DoorDash at a pivotal point in the company’s journey as it was expanding across new states and developing relationships with some of its biggest partners.

For her, the interview process gave insight into the environment she was preparing to launch the next phase of her career.

“I was really inspired by every single person that I met in the DoorDash interview process, and I knew that if I took the leap and joined this team I was going to learn a ton,” Puskar says.

She joined the Midwest local operations team based in Chicago, where she had first-hand experiences on how the merchant, customer, and Dasher functions come together. Later, Puskar moved over to the Dasher Operations team and focused on creating a centralized function.

In 2019, she moved over to the DashMart business, a convenience and grocery store offering from DoorDash. As this new branch of the DoorDash business was forming, Puskar and the team also had to navigate the world turning upside down with the COVID-19 pandemic.

“Before we knew it, the whole entire world had changed. So not only were we trying to build DashMart, but we were trying to do it in a world that had gone fully remote. We found ourselves as essential workers at a time when we least expected it,” Puskar continues. “It was rewarding to see what we could accomplish when we absolutely needed to and inspiring to know that we were making such an impact for our customers.”

Since then, Puskar has been front and center as the DashMart business has grown. With every step along her path, it’s required taking an iterative approach to her work. Puskar talks of the importance of being able to jump into the unknown, while embracing the speed at which things change at DoorDash.

These development opportunities are grounded in the concept of embracing dynamic experiences.

“What’s really cool about working at DoorDash is that you’re going to get exposure to a lot of different types of problems and a lot of different types of opportunities. So in DashMart, one way that this manifests is that we have a variety of operations that come together under one roof to support our customers,” Puskar says.

“You have the opportunity to learn about fulfillment, warehouse stocking and sorting, and everything in between. You get to contribute to meaningful solutions and learn about where your passions are. This could take you deeper into a subject matter expertise path. It could take you towards a higher leadership position. It could even challenge you within your existing role to keep learning new things, and keep having new experiences in your day to day. There’s a variety of different solutions for folks that are all in different life places and have different goals.”

Puskar says she’s continued to find herself engrossed in harder problems as her career has evolved, always aiming to find ways to make an impact. And that’s what has led her to where she is today.

“What I’ve seen since joining is that the people that we work with here have such a high caliber, that I find myself being pushed to be a better version of myself day in and day out. That quiet inspiration is done in an incredibly supportive way, and it has been the best learning opportunity of my career. I feel incredibly lucky to be part of this team and in this culture,” Puskar adds.

“What’s really inspiring to me is that at its core, the way we solve problems is still the same, the pace of operations is still the same as it was back when no one knew what DoorDash was. As market leaders, we stay customer obsessed, we’re still focused on getting one percent better every day. We’re still focused on best in class operations, and moving as quickly as possible to get the outcomes that we want for our customers and for our teams.”

Ultimately, the DashMart business continues to grow thanks to people like Puskar, who are constantly looking to solve new problems and building innovative solutions to bring the neighborhood to customers around the country.

“I’m really inspired by not only the problems that we solve for consumers, but also the problems that we get to solve for our merchant partners, where we’re going to help them bring their products to customers in a new way. We’re creating more opportunities for dashers. And we’re also creating more opportunities for our on-site teams,” Puskar continues.

“I’m so inspired and always have been by our on-site populations. I think that it is really exciting to be able to build a business where we can deliver meaningful experiences for our hourly teams, where we can make sure that our teams have the best tools possible to get the job done. Creating an environment where our teams can bring their authentic selves to work is everything to me as a leader, and we’ve built the foundation for success in having a supportive and collaborative community. I fully believe that if we empower our employees to do their best work, if we care about them as people, if we give them the tools to set them up for success, then everything else that we accomplish is better. Having a strong culture is ingrained in the very fabric of what we’ve built.”

With nearly five years on the DashMart side of the business, Puskar isn’t slowing down. She says the most exciting days are to come.

“What we’ve built so far is absolutely incredible and it’s a testament to everyone on the team. There’s a lot of growth and a lot of white space in front of us,” Puskar says.

“When I think about DashMarts, we stand for fresh, local, reliable. And as I look forward, I am excited for the day where all customers know this. And ultimately, when they think about getting an item from anywhere, their first instinct is questioning if they can get it through DoorDash, because they know they’ll get it on time, in good condition, and in a reliable manner.”

At DoorDash, we are proud to offer an inclusive environment, where everyone can be their true, authentic selves. As part of our efforts to create a workplace that engages people of all backgrounds while fostering an environment of diversity, equity, and inclusion, we’re celebrating Pride Month this June alongside our Pride@ Employee Resource Group.

This year’s Pride Month theme is “Kaleidoscope,” celebrating the diversity and vibrancy within the LGBTQ+ community by emphasizing how race, gender, sexuality, ability, and other aspects intersect to form unique individual identities.

Throughout the month of June, in partnership with our Pride@ ERG, we’re hosting a slew of in-person and virtual events designed to build community among our LGBTQ+ peers and allies.

We’ll kick off the month with a special edition of Drag Queen Bingo followed by pride themed trivia. We’ll host an artist spotlight with Geo Neptune, an empowering mental health session, and sip and slay with our resident mixologist, Jay. We’ll wrap the month of celebration with a roundtable discussion covering topics related to LGBTQ+ identity in the workplace and a letter writing event  in our New York, Tempe, and San Francisco offices to support Point of Pride, an organization dedicated to providing financial aid and direct support to trans people in need of health and wellness care.

To encourage support during Pride Month and throughout the year, our Pride@ ERG also shared with our people a list of LGBTQ+ owned restaurants on DoorDash across our major office hubs.

We will continue to prioritize investing in and advancing opportunities for historically underrepresented people through our efforts as a company and by encouraging learning and development, such as Understanding and Supporting LGBTQ+ Employees and Becoming an Ally to All training opportunities.

When it comes to reducing variance in experiments, the spotlight often falls on sophisticated methods like CUPED (Controlled Experiments Using Pre-Experiment Data). But sometimes, the simplest solutions are the most powerful and most overlooked – like reducing or eliminating dilution. This unglamorous yet effective technique is free, easy to implement, and plays nicely with other variance reduction methods.

So what is dilution? Dilution happens when we track users in an experiment who could not possibly have been impacted by the treatment. To illustrate, imagine you’re testing a feature to improve conversion on the checkout page of your company’s website. You’d like to test the hypothesis that changing the web checkout button from blue (control group) to green (treatment group) will improve your conversion rate. You’ll randomly assign users to control and treatment groups using a hash function keyed on their user_ids.

Your product manager asks whether you can increase the power of the experiment by also randomly assigning unaffected users who visit the app into the experiment’s treatment and control groups. You won’t actually change the checkout button’s color in the app – only on the website. Nevertheless, you could assign app visitors to treatment and control using the same hash function and increase the experiment’s sample size.

Will simply assigning app users to control and treatment groups increase the experiment’s power, even though we are not making any changes to the app? Of course not! In fact, as we’ll show in the next section, it will reduce the power of the experiment. Despite the fact that including ineligible populations is harmful to experimental measurements, this is a common mistake made throughout the tech industry (as noted in Kohavi et al. and Deng et al.).

Numerical Example

Continuing the example from above, suppose we are changing the checkout button color on a web page, but otherwise making no change to our apps. Let’s say that 70% of users who land on the checkout page will convert to placing an order. The conversion can be represented by a Bernoulli random variable with p = 0.7 and standard deviation:

σ = √(p(1 − p)) = √(0.7 × 0.3) ≈ 0.458

Van Belle (2008) showed that for a metric with standard deviation σ and collected sample size n, the minimum detectable effect Δ we can measure with a confidence level of 95% and a power of 80% is approximately:

Δ = 4σ / √n

Equivalently, if we want to measure a change in signal Δ with a confidence level of 95% and a power of 80%, then the sample size we need is:

n = 16σ² / Δ²

Thus, to measure a 1% change in checkout conversion rate, we would need a minimum sample size of n = 16 × 0.21 / 0.01² = 33,600 users to land on our web checkout page.

But what if we accidentally include users who land on the app checkout page, which is not impacted by the experiment? For simplicity, let’s say that the app and web checkout pages have equal conversion rates of 70% and equal volumes of users. Let’s define the dilution factor d via n_diluted = d × n_eligible, so d = 2 in this case. As we dilute the experiment, the signal δ intuitively shrinks as δ_diluted = δ_eligible / d, while the minimum detectable effect (MDE), using the formula above, shrinks only as Δ_diluted = Δ_eligible / √d. Because the signal shrinks faster than the MDE, dilution from app checkout traffic — which is ineligible for the experimental treatment — will spoil our ability to measure a change in eligible web checkout traffic.

Put another way, dilution increases the tracked sample size by a factor of d, but increases the sample size required to maintain the same MDE by d². This relationship is plotted in Figure 1 below.

Figure 1: A simulated A/B test showing that dilution has a non-linear impact on experiment sensitivity. Experiments with even 25% dilution will experience a doubling in duration to maintain the same level of power.

With our accidental dilution from users landing on the app checkout page, that same 1% change in web checkout conversion now corresponds to a 0.5% change in overall web + app conversion, because half the conversions happen on the unchanged app. So for the diluted experiment, Δ = 0.005 and we would need n = 16 × 0.21 / 0.005² = 134,400 users, or four times as many users in the overall diluted experiment, which translates to twice as many users needed to land on the web checkout page.
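
As a quick sanity check on these numbers, the snippet below applies the same rule-of-thumb formula to the eligible and diluted cases; the helper is illustrative only.

```python
# Rule-of-thumb sample size n = 16 * sigma^2 / delta^2 (95% confidence, 80% power).
def required_sample_size(p, mde):
    variance = p * (1 - p)          # Bernoulli variance; 0.21 for p = 0.7
    return 16 * variance / mde ** 2

n_eligible = required_sample_size(0.7, 0.01)   # 33,600 users on the web checkout page
# With dilution factor d = 2, the observed signal halves (1% -> 0.5%) ...
n_diluted = required_sample_size(0.7, 0.005)   # 134,400 tracked users
# ... so the required tracked sample grows by d^2 = 4x, i.e. 2x more web checkout users.
print(n_eligible, n_diluted, n_diluted / n_eligible)  # 33600.0 134400.0 4.0
```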

Mistakes that drive dilution

Below are four common mistakes that cause dilution. 

Mistake 1: Evaluating everything everywhere all at once

A common cause of experiment dilution is using too broad a brush at the outset: evaluating all users for every experiment at the application’s entry point and logging all of those evaluations as real exposures. For example, as shown in Figure 2, an experimenter may create an evaluation class that assigns users to different experiments as soon as they engage with the app, without considering whether the changes being tested could affect the user experience. This may simplify the technical implementation of experiment assignments, but it sacrifices the sensitivity of the results. A possible resolution to the problem is shown in Figure 3.

Figure 2: This diluting approach to structuring experiment assignments within a codebase shows all experiments being evaluated and logged as a user enters the application flow. Evaluation results are then used downstream to decide which feature should be shown.
Figure 3: This is a corrected, undiluted design that performs and logs experiment evaluation only if a user reaches a particular feature trigger. For instance, if a user reaches the checkout page, we then can decide to show either the control or treatment version of that page.
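
As a rough before-and-after illustration of Figures 2 and 3, the sketch below uses a toy, hypothetical experimentation client (not DoorDash's real SDK) to show where the exposure event should fire.

```python
# Toy experimentation client; the point is *where* exposures get evaluated and logged.
import hashlib

class ExperimentClient:
    def __init__(self):
        self.exposure_log = []  # (experiment, user_id, variant)

    def evaluate_and_log(self, experiment, user_id):
        bucket = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 2
        variant = "treatment" if bucket else "control"
        self.exposure_log.append((experiment, user_id, variant))
        return variant

def handle_app_entry_diluting(user_id, client):
    # Anti-pattern (Figure 2): every user entering the app is evaluated and logged
    # for the checkout experiment, even if they never reach the checkout page.
    client.evaluate_and_log("green_checkout_button", user_id)

def render_checkout_page(user_id, client):
    # Corrected (Figure 3): the exposure fires only at the feature trigger.
    variant = client.evaluate_and_log("green_checkout_button", user_id)
    return "green_button" if variant == "treatment" else "blue_button"
```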

Mistake 2: Overriding treatment evaluations

This kind of dilution can occur when implemented features are conditional on other user attributes. For example, as shown in Figure 4, the product side might require changing the color of a “purchase” button at the end of a checkout page from blue (control) to green (treatment) only on the web, leaving the mobile checkout experience unchanged.

Figure 4: This example highlights a case where the treatment assignment evaluation is being overwritten by downstream business logic. In this example, dilution is created by including non-web users who fall in the control group.

Dilution occurs if mobile checkouts are included in experiment exposure events. Mobile users see no difference between control and treatment, so they can’t possibly contribute to the experiment’s signal, or mean treatment effect. They do, however, dilute the signal. Figure 5 shows a possible correct implementation.

Figure 5: This is a corrected implementation, where evaluation is done only where the feature change is implemented. Internally, our platform supports user segments, so we can perform the evaluation while smartly taking into account a variety of user attributes (e.g., app version, region, platform).
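
Reusing the toy client from the previous sketch, one hedged way to express the fix in Figure 5 is to check the segment condition before evaluating, so ineligible (non-web) users never generate exposure events.

```python
# Illustrative fix: evaluate the experiment only for the eligible (web) segment.
def checkout_button_color(user_id, platform, client):
    if platform != "web":
        return "blue"  # mobile experience is unchanged, so no exposure is logged
    variant = client.evaluate_and_log("green_checkout_button", user_id)
    return "green" if variant == "treatment" else "blue"
```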

Mistake 3: Not managing unintentional traffic 

Another potential pitfall in experimental design revolves around the inadvertent inclusion of unintentional traffic in a service’s usage data. This traffic typically falls into three categories: bots, load-testing traffic, and unintentional service integrations. Bots often are filtered out with security measures such as authentication checks; nonetheless, they may still dilute experimental results that rely on identifiers like device IDs, where no user login is required. Similarly, load testing can be managed by ensuring traffic is not logged or purging it downstream. 

The biggest challenge is how to handle traffic requests from unintentional service integration. One example might be a feed service that provides an API for retrieving a user’s favorite restaurants. While this service normally is accessed by the mobile app’s backend-for-frontend (BFF) after user authentication, which indicates legitimate user activity, it might also be used by a notification service from a different team seeking to craft personalized emails around a user’s favorite restaurants. That service would use user_ids from a much broader segment than those actively engaging with the app. If the feed service fails to distinguish between requests for legitimate user activity and those from the notification service, any experiments involving “GetFavorites” logic may be diluted because of that wider segment capture. Perhaps even more problematic, as shown in Figure 6, is that if the feed service triggers downstream requests to other services, all downstream experiments could be diluted.

Figure 6: In this diagram, the feed service normally gets requests from authenticated users via the BFF service. Nonetheless, it also has an integration with the notification service. Unfortunately, requests from the notification service don’t reflect legitimate user activity, diluting experiments involving both the feed service and other downstream services.

Mistake 4: Including all predictions from a Machine Learning test

Machine learning experiments often aim to measure the impact of small adjustments to a model, and they are often highly diluted unless the experimenter takes great care to avoid it. Small tweaks to features, slight recalibrations of models, or minor variations in algorithms can lead to performance improvements that, while potentially small individually, can compound to significantly better outcomes. The trouble comes when these subtle changes must be measured through severely diluted online tests.

Consider an example of two classification machine learning models that differ by the slimmest of margins. Visually, as represented in Figure 7, the decision boundaries imposed by these models might be very similar, leading to identical outcomes in the vast majority of evaluations. If the two models make the same prediction for a given point, then including that prediction in the experiment analysis dilutes the signal. Only the points where the two models make different predictions can contribute effectively to the signal.  In such a high-dilution scenario, the diluted treatment effect may be so small that a poorly designed online test to detect performance differences ultimately will fail.

Figure 7: In this figure, two ML models make similar predictions, and only a few data points have different predictions. If an experimenter ran an online A/B test that included all predictions, no difference would be detected.

Although dilution is most obvious in the case of classification models, experiments based on other model types often suffer from dilution as well. For example: an A/B test of different versions of a ranking model would suffer from dilution for any query where the two models produced the same ranked list (or when the user never viewed any differing items in the ranking). An experiment testing two different regression models could also be diluted if, for instance, we round the prediction before showing it to the end user, as we often do for models that predict the delivery duration. It’s up to the experimenter to determine the extent to which dilution will spoil the results of an experiment.
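
One hedged way to apply this idea in post-processing, assuming counterfactual predictions from both models have been logged (the column names below are invented), is to keep only the units where the two models actually disagree, since identical predictions cannot carry any treatment effect.

```python
# Illustrative post-filtering: keep only exposures where control and treatment
# models would have made different predictions.
import pandas as pd

exposures = pd.DataFrame({
    "user_id":        [1, 2, 3, 4],
    "control_pred":   [0, 1, 0, 1],   # counterfactual prediction of the incumbent model
    "treatment_pred": [0, 1, 1, 0],   # prediction of the challenger model
    "converted":      [1, 0, 1, 0],
})

undiluted = exposures[exposures["control_pred"] != exposures["treatment_pred"]]
print(undiluted["user_id"].tolist())  # [3, 4] -- only these rows can carry signal
```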

Resolving dilution

There are two ways to solve for known sources of dilution: 1) real-time adjustment or 2) post-processing adjustment. 

Real-time adjustment 

To prevent dilution from happening in the first place, we recommend placing your experiment exposure event — its triggering event — as late as possible in the code. This helps ensure that events are generated only for users who are eligible to see them. This best practice offers a few advantages:

  • Accuracy: It precisely captures the subset of users who are genuinely affected by the treatment change.
  • Clarity: It results in a clean dataset for analysis, which can reduce manual effort at the experiment analysis stage.
  • Efficiency and reliability: Because the code only evaluates one or a very small set of experiments, exposure logging footprints and time spent evaluating experiments can be reduced considerably. Within some codebases, DoorDash runs hundreds of concurrent experiments; evaluating too much at once can impact both latency and app crash rates.

While generally positive, real-time adjustment should be done with a few important considerations in mind:

  • Planning: Perform sufficient planning and QA at the start to ensure that the triggering event is incorporated correctly before the experiment begins.
  • Flexibility: Real-time adjustment may not always be feasible because of technical limitations. For example, it may be too costly or complex to evaluate counterfactual predictions from an ML model in real time or to evaluate multiple code paths in order to know which experiences are different.

Post-processing adjustment

In post-processing, we retroactively differentiate between impacted and non-impacted user interactions to apply filters on the exposures or metrics datasets. Post-processing provides a couple of advantages: 

  • Salvaging data: Post-processing can retroactively correct exposure events that were placed poorly in the codebase. This can save experiments that might otherwise have had to be re-run.
  • Adaptability: Post-processing can manage complex systems in which real-time adjustments are impossible. For example, there may be a separate evaluation pipeline to compute counterfactual predictions from ML models that are used purely for post-processing filtering.

You should balance those advantages, however, against some cautionary considerations: 

  • Log accuracy: Ensure that user-interaction logs are absolutely correct, because post-processing relies on that accuracy. 
  • Analysis complexity: Post-processing can add a layer of complexity to data analysis because it requires more sophisticated data processing techniques. In ML applications, for example, balanced experiments require evaluating both models for both groups. To use post-filtering in this case, be certain that the synchronous model does not impact the asynchronous model in any way; for instance, features must be taken from the same moment in time for both models. In practice, this can be a greater effort to implement. 

Handling dilution from unintended traffic

The two approaches above can help reduce known sources of dilution, but can’t protect a system from unintentional dilution. As we’ve mentioned, a service may be called by bots or by other internal services masquerading as real users. To manage this kind of dilution, two solutions can be implemented:

  • Distributed tracing: This technique tracks requests as they flow through various services and components of a distributed system and can be highly effective in identifying sources of non-real user traffic. Distributed tracing starts the moment a user interacts with an application; as the request flows through the system, the trace is tagged with a variety of metadata, for example, which upstream services were involved in the request. Usually, traces are used to understand traffic patterns and performance bottlenecks. But in the realm of experimentation, distributed tracing also serves as a reliable way to identify and filter traffic that should not be evaluated as part of experiments. For example, distributed tracing could prevent the logging of experiment evaluations that don’t originate from the BFF layer.
  • Authentication logs: A somewhat less ideal solution for managing unintentional dilution is to perform post-filtering of experiment evaluations based on authentication logs. For example, if a user present in an experiment’s evaluation logs shows no authentication event — in other words, no login or app visit — that’s a good indication that the user is an invalid entry triggered by interacting services, as shown in Figure 8.

Figure 8: In this example, users 2 and 3 have experiment evaluation logs but don’t exist in the authentication logs, indicating that they were created through an invalid path and are not real users visiting the app.
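
A minimal sketch of this filter, with made-up tables, keeps only exposure rows whose user appears in the authentication logs; everything else is assumed to come from service-to-service traffic rather than a real app visit.

```python
# Illustrative post-filtering of exposure logs against authentication logs.
import pandas as pd

exposures = pd.DataFrame({"user_id": [1, 2, 3, 4], "variant": ["t", "c", "t", "c"]})
auth_logs = pd.DataFrame({"user_id": [1, 4]})  # users with a real login / app visit

filtered = exposures[exposures["user_id"].isin(auth_logs["user_id"])]
print(filtered["user_id"].tolist())  # [1, 4] -- users 2 and 3 are dropped, as in Figure 8
```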

Filtering users based on authentication logs is less than ideal for a couple of reasons:

  • This method can’t reliably remove all sources of dilution. A user might have an authentication log but still be generating dilution because they don’t interact with the experimental feature.
  • Analysis becomes more complex because you must perform checks against real app user visits.

In the absence of better logging through tracing, however, this filtering approach is still valuable to detect unknown sources of dilution and salvage experiments with severely degraded sensitivity. 

Real-world examples of dilution at DoorDash

Example 1: Fraud targeting model 

As with most online businesses, DoorDash uses a model to score the probability that any given checkout is being made by a fraudster using a stolen credit card. If the probability exceeds a given threshold, we require additional verification from the user before completing checkout — for example, scanning an image of the physical credit card. When we release a new version of the checkout risk model, we tend to A/B test it against the incumbent model to make sure we get the expected reduction in fraud costs. If a user is in the control group, the incumbent model decides whether or not the transaction gets step-up verification; if the user is assigned to the treatment group, then the challenger model makes the decision.

Scoring two risk models synchronously before allowing the user to checkout would add too much latency to our system. Therefore, each user is assigned either to control or treatment first. After that, one model is scored synchronously while the other begins scoring asynchronously. Our risk checkout system is designed to handle this common use pattern, so we can guarantee that the features sent to the models will be identical whether they are scored synchronously or asynchronously.

As a result, our experiment is very diluted – all users receive an experiment assignment event at checkout, regardless of whether either model would have applied friction. Each model only applies step-up verification to a tiny fraction of checkouts, generating extreme dilution, as shown in Figure 9.

Figure 9: Ideally, we want our experiment to include the regions labeled 3, 4, 7, and 8. We can exclude regions 1 and 2 because neither model selected those users for step-up verification. We can also exclude regions 5 and 6 because both models selected those users, meaning there can be no possible treatment effect.

Here are some real numbers that show how powerful this technique is: 

The diluted experiment, which tracked all users eligible for step-up verification over the experiment period, included a total of 44.5 million users. The experiment readout of our primary fraud cost metric showed a reduction of -0.9677% [-8.4302%, +6.4948%], which was nowhere near statistical significance (t = 0.25, p-value = 0.799). But the post-filtered experiment included only 292,000 users and showed a statistically significant reduction in our fraud cost metric over the experiment’s time period of -9.74% [-15.7402%, -3.7397%] (t = 3.18, p-value = 0.001465). Removing the dilution allowed us to increase sensitivity by a factor of (3.18/0.25)² ≈ 160!

Note that the metric needs to be carefully re-scaled to represent the total global impact. We can do this easily, however, using the absolute changes in metric values (not shown) multiplied by the number of users included in the undiluted analysis.

Example 2: Supply-demand rebalancing

DoorDash’s Supply-Demand team aims to build levers to manage existing and future supply-demand imbalances. Historically, most regions managed by the team had good supply, so we found fewer and fewer instances where the system had to intervene. While this was a positive outcome from a product standpoint, it challenged the team in its effort to develop and evaluate new measures designed to act during supply shortages. As our systems kept improving, there were fewer and fewer opportunities to validate our ideas.

Fortunately, our experimental design benefited from advance knowledge about where specific levers would be applied, allowing us to distinguish between the user experiences in treated versus untreated region-time segments. For example, in one test we wanted to check whether we should adjust the timing of the peak pay incentives Dashers receive. Our hypothesis was that offsetting the incentive timing by 30 minutes would allow Dashers more time to get mobilized and on the road. Given that this lever was deployed in fewer than 15% of regions, removing the regions that could not possibly be impacted by our treatment intervention improved the sensitivity of our metrics by 10x to 15x.

Figure 10: Heatmap screen on the Dasher app that lets Dashers know about existing and future incentives. This lever is rarely deployed; removing dilution from untreated regions allowed us to increase sensitivity 10 to 15 times.

Conclusion

In closing, dilution poses a substantial threat to the integrity and effectiveness of running experiments. Through real-world examples and empirical evidence from within DoorDash, we’ve witnessed firsthand the detrimental consequence of neglecting to address dilution. Thankfully, we can remove dilution from experiments by either real-time or post-processing adjustments to recover the power of experimental measurements.

Throughout May, alongside our Asian, Asian American, and Pacific Islander (AAPI) Employee Resource Group, we’re celebrating AAPI Heritage Month in the United States and Canada. Traditionally a time dedicated for remembrance of the many contributions of the Asian, Asian American, and Pacific Islander community, we’re kicking off a series of virtual and in-person events aimed to continue building the community of AAPI peers and allies at DoorDash.

We’ll kick off the month with offsite volunteer opportunities, helping with food preparation and distribution to the elderly in New York, and supporting a garden cleanup effort in San Francisco. Throughout the month, our workplace teams will be stocking snack shelves with AAPI owned and founded products in our New York, Tempe, and San Francisco office spaces. We’ll also host a series of in-person social hours in our San Francisco, New York, Seattle, and Sunnyvale offices.

Our people will participate in an Ask Me Anything session with our AAPI@ executive sponsor, VP of Engineering Liangxiao Zhu, in conversation with VP of Product Rajat Shroff. As we close out the month, New York-based Korean dessert shop Rice Blossoms will host a live, virtual cooking class making Korean rice cakes (songpyeon).

Outside of official AAPI events, our people are encouraged to keep support for the AAPI community going by joining groups such as Asian Americans Advancing Justice, AAPI Women Lead, India Literacy Project, Chinese Progressive Association, and Asian Pacific Environmental Network.

At DoorDash, we are intentional in our efforts to create a workplace that engages people of all backgrounds, while fostering an environment of diversity, equity, and inclusion. We will continue to prioritize investing in and advancing opportunities for historically underrepresented people through our efforts as a company and by encouraging learning and development, such as Bystander Training and Understanding and Supporting Asian Employees training opportunities.

DoorDash’s retail catalog is a centralized dataset of essential product information for all products sold by new verticals merchants – merchants operating a business other than a restaurant, such as a grocery, a convenience store, or a liquor store. Within the retail catalog, each SKU, or stock-keeping unit, is represented by a list of product attributes. Figure 1 shows an example SKU and some of its attributes as it is stored in the retail catalog.

Figure 1: An example SKU and some of its attributes in the retail catalog
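
Since the figure itself is not reproduced here, a toy illustration of the idea is shown below; the field names and values are hypothetical, not the actual catalog schema.

```python
# Hypothetical SKU record: each product is a set of attribute key-value pairs.
sku = {
    "item_name": "Dove Silk Glow Body Wash 500 ml",
    "brand": "Dove",
    "category": "Body Wash",
    "size": "500 ml",
    "is_organic": False,
}
```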

Having high-quality, complete, and accurate product attributes for each SKU is a critical part of a first-class shopping experience, providing: 

  • Better selection & fulfillment: Customers can find an item on DoorDash easily, confident that what they order matches what they want. Dashers, the service’s delivery drivers, have comprehensive information to find the correct product in the store.
  • Better personalization: Product attributes allow DoorDash to group products based on commonalities, building a product profile for each customer around their affinities to certain attributes. These are the building blocks for providing highly relevant and personalized shopping recommendations.

When a merchant comes onboard at DoorDash, we add their internal SKU data (raw merchant data) to our retail catalog. SKU data from different merchants come in varying formats and quality; they may, for example, have missing or incorrect attribute values. To ensure our catalog’s quality does not degrade, we standardize and enrich raw merchant data. Historically, this SKU enrichment, which involves extracting and tagging attributes, has been a purely manual process led by contract operators. But outsourcing this task leads to long turnaround times, high costs, and so many inaccuracies that a second human must audit the results generated by the first. As our catalog expands, we are pursuing new machine learning-driven approaches to auto-enrich SKU data.

Extracting attribute-value information from unstructured data is formally known as named-entity recognition; most recent approaches model the extraction task as token classification. For instance, given the item name “Dove Silk Glow Body Wash 500 ml,” a token classifier would tag each entity in the item name as shown in Table 1.

Table 1: Classifying item name tokens to product attributes
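
As a rough sketch of what token classification looks like in practice, a fine-tuned model could be served through the Hugging Face pipeline API as shown below; the model name is a placeholder, not a real checkpoint.

```python
# Minimal sketch of token classification for attribute tagging.
from transformers import pipeline

tagger = pipeline("token-classification",
                  model="my-org/sku-attribute-tagger",  # hypothetical fine-tuned model
                  aggregation_strategy="simple")

for entity in tagger("Dove Silk Glow Body Wash 500 ml"):
    # Each prediction carries the attribute label, the matched text, and a confidence score.
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```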

Building an attribute extraction model 

Building an in-house attribute extraction/tagging model from scratch requires a significant amount of labeled training data to reach the desired accuracy. This is often known as the cold-start problem of natural language processing, or NLP. Data collection slows model development, delays adding new items to the active catalog, and creates high operator costs. 

Using LLMs to circumvent the cold-start problem 

Large language models, or LLMs, are deep-learning models trained on vast amounts of data. Examples include OpenAI’s GPT-4, Google’s Bard, and Meta’s Llama. Because of their broad knowledge, LLMs can perform NLP with reasonable accuracy without requiring many, if any, labeled examples. A variety of prompts can be used to instruct LLMs to solve different NLP problems.  

We will highlight here how we use LLMs to extract product attributes from unstructured SKU data, allowing us to build a high-quality retail catalog that delivers the best possible experience for users in all new verticals. In the following sections, we describe three projects in which we used LLMs to build ML products for attribute extraction.
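
For illustration, a zero-shot attribute extraction prompt might look like the sketch below, assuming the OpenAI Python client; the prompt and output schema are illustrative, not our production prompt.

```python
# Minimal sketch of zero-shot attribute extraction with an LLM.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract the brand, product type, and size from this item name. "
    'Return JSON with keys "brand", "product_type", and "size".\n'
    "Item name: {item_name}"
)

def extract_attributes(item_name: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(item_name=item_name)}],
    )
    return response.choices[0].message.content

print(extract_attributes("Dove Silk Glow Body Wash 500 ml"))
```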

Brand extraction

Brand is a critical product attribute used to distinguish one company’s products from all others. At DoorDash, a hierarchical knowledge graph defines a brand, including entities such as manufacturer, parent brand, and sub-brand, as shown in Figure 2.

Figure 2: Brand taxonomy breaks brands into entities such as manufacturer, parent brand, and sub-brand
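
One way to picture the hierarchy is a simple nested record, sketched below; the field names are illustrative, not the actual knowledge-graph schema.

```python
# Hypothetical brand hierarchy record mirroring the taxonomy levels in Figure 2.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Brand:
    name: str
    manufacturer: Optional[str] = None    # company that owns the brand
    parent_brand: Optional[str] = None    # None for a top-level brand
    sub_brands: list = field(default_factory=list)

example = Brand(name="ExampleBrand", manufacturer="ExampleCo",
                sub_brands=["ExampleBrand Classic", "ExampleBrand Zero"])
```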

Accurate brand tagging offers a number of downstream benefits, including increasing the reach of sponsored ads and the granularity of product affinity. Because the number of real-world brands is effectively unbounded, DoorDash’s brand taxonomy is never complete. As the product spectrum expands, new brands must be ingested to close any coverage gaps. Previously, brand ingestion was a reactive, purely manual process carried out to fulfill business needs. This limited the volume of new brands that could be added, often failed to address much of the coverage gap, and led to duplicate brands, making the taxonomy system difficult to manage.

To this end, we built an LLM-powered brand extraction pipeline that can proactively identify new brands at scale, improving both efficiency and accuracy during brand ingestion. Figure 3 shows our end-to-end brand ingestion pipeline, which follows these steps (a simplified sketch in code appears after the figure):

  1. Unstructured product description is passed to our in-house brand classifier
  2. SKUs that cannot be tagged confidently to one of the existing brands are passed to an LLM for brand extraction
  3. The extraction output is passed to a second LLM, which retrieves similar brands and example item names from an internal knowledge graph to decide whether the extracted brand is a duplicate entity
  4. The new brand enters our knowledge graph and the in-house classifier is retrained with the new annotations
Figure 3: LLM-powered brand ingestion pipeline
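
A simplified sketch of this routing logic is shown below; the classifier, LLM extraction, dedup, and knowledge-graph steps are passed in as placeholder callables rather than real services.

```python
# Minimal sketch of the four-step brand ingestion flow described above.
from typing import Callable, Tuple

def ingest_brand(sku_description: str,
                 classify: Callable[[str], Tuple[str, float]],
                 extract: Callable[[str], str],
                 is_duplicate: Callable[[str], bool],
                 add_to_graph: Callable[[str], None],
                 confidence_threshold: float = 0.9) -> str:  # hypothetical cutoff
    brand, confidence = classify(sku_description)   # step 1: in-house classifier
    if confidence >= confidence_threshold:
        return brand                                 # confidently tagged, done
    candidate = extract(sku_description)             # step 2: LLM brand extraction
    if is_duplicate(candidate):                      # step 3: LLM dedup against the graph
        return candidate
    add_to_graph(candidate)                          # step 4: new brand enters the graph
    return candidate                                 # classifier retraining happens offline
```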

Organic product labeling

Consumers care about dietary attributes when building their carts and are more likely to engage with a product if it is tailored to their personal preferences. Last year, we stood up a model to label all organic grocery products. The end goal was to enable personalized discovery experiences, such as showing a Fresh & Organic carousel to a consumer whose past orders showed a strong affinity toward organic products.

The end-to-end pipeline takes a waterfall approach, leveraging existing data where applicable to boost speed, accuracy, and coverage. This process can be broken down roughly into three buckets (a simplified sketch in code follows Figure 4):

  • String matching: We find exact mentions of the keyword “organic” in the product title. This approach offered the highest precision and decent coverage, but it missed cases where “organic” is misspelled, dropped, or presented slightly differently in the data.
  • LLM reasoning: We leverage LLMs to determine whether a product is organic based on available product information. This information could come directly from merchants or via optical character recognition extraction from packaging photos. This approach improved coverage by addressing the major challenges faced by string matching and offers better-than-human precision.
  • LLM agent: LLMs conduct online searches of product information and pipe the search results to another LLM for reasoning. This approach further boosted our coverage.

Figure 4 shows the LLM-powered pipeline for tagging our catalog SKUs with organic labels.

Figure 4: LLM-powered tagging
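
In code, the waterfall can be sketched as follows; the LLM steps are placeholder callables, and product information is reduced to the item title for brevity.

```python
# Minimal sketch of the three-bucket waterfall for organic labeling.
from typing import Callable, Optional

def label_organic(title: str,
                  llm_reasoning: Callable[[str], Optional[bool]],
                  llm_agent_search: Callable[[str], Optional[bool]]) -> Optional[bool]:
    # Bucket 1: exact keyword match -- highest precision, decent coverage.
    if "organic" in title.lower():
        return True
    # Bucket 2: LLM reasons over whatever product information is available.
    verdict = llm_reasoning(title)
    if verdict is not None:
        return verdict
    # Bucket 3: LLM agent searches externally and pipes results to another LLM.
    return llm_agent_search(title)
```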

By leveraging LLMs and agents, we overcame the challenge of insufficient data and answered inferential questions by searching and reasoning over external data. Enhancing coverage of organic labels enabled us to launch item carousels that target customers with strong organic affinity, which improved our top-line engagement metrics.

Generalized attribute extraction

Entity resolution is the process of determining whether two SKUs refer to the same underlying product. For example, Corona Extra Mexican Lager (12 oz x 12 ct) sold by Safeway is the same product as Corona Extra Mexican Lager Beer Bottles, 12 pk, 12 fl oz sold by BevMo!. We need accurate entity resolution to build a global catalog that can reshape the way customers shop while unlocking sponsored ads.

Figure 5: Entity resolution is the backbone of sponsored ads

Determining whether two SKUs refer to the same underlying product is a challenging problem. It requires validating that both SKUs match all attributes exactly, which means there must be accurate extraction of all applicable attributes in the first place. Products from different categories are characterized by different sets of uniquely defining attributes. For example, an alcohol product is uniquely defined by attributes such as vintage, aging, and flavor. Starting with limited human-generated annotations, we used LLMs to build a generalized attribute extraction model.
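
As a toy illustration of this matching step (the attribute names are hypothetical), exact matching over a category’s uniquely defining attributes might look like the sketch below; in practice, values such as “12 oz” versus “12 fl oz” also need normalization before comparison.

```python
# Minimal sketch of attribute-level matching for entity resolution.
def same_product(attrs_a: dict, attrs_b: dict, defining_attributes: list) -> bool:
    """Two SKUs match only if every uniquely defining attribute is present and equal."""
    return all(
        attrs_a.get(key) is not None and attrs_a.get(key) == attrs_b.get(key)
        for key in defining_attributes
    )

safeway = {"brand": "Corona Extra", "unit_size": "12 fl oz", "pack_count": "12"}
bevmo   = {"brand": "Corona Extra", "unit_size": "12 fl oz", "pack_count": "12"}
print(same_product(safeway, bevmo, ["brand", "unit_size", "pack_count"]))  # True
```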

We used LLMs and retrieval-augmented generation, or RAG, to accelerate label annotation. For each unannotated SKU, we first leverage OpenAI embeddings and the approximate nearest neighbors technique to retrieve the most similar SKUs from our golden annotation set. We pass these golden annotation examples to GPT-4 as in-context examples to generate labels for the unannotated SKU. Choosing examples based on embedding similarity is advantageous over random selection because the selected examples are more likely to be relevant to the task at hand, which also reduces hallucination. Ultimately, the generated annotations are used to fine-tune an LLM for more scalable inference.
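
A condensed sketch of this annotation loop is shown below, assuming the OpenAI Python client; the golden set, prompt format, and brute-force similarity search stand in for the production approximate-nearest-neighbor index.

```python
# Minimal sketch of RAG-assisted annotation with embedding-based example selection.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def nearest_golden(sku_text: str, golden: list, k: int = 5) -> list:
    """golden: dicts with precomputed "embedding", "text", and "labels" keys."""
    query = embed(sku_text)
    sims = [float(np.dot(query, ex["embedding"]) /
                  (np.linalg.norm(query) * np.linalg.norm(ex["embedding"])))
            for ex in golden]
    ranked = sorted(zip(sims, golden), key=lambda pair: -pair[0])
    return [ex for _, ex in ranked[:k]]

def annotate(sku_text: str, golden: list) -> str:
    examples = "\n".join(f"{ex['text']} -> {ex['labels']}"
                         for ex in nearest_golden(sku_text, golden))
    prompt = f"Label the attributes of this SKU, following the examples:\n{examples}\nSKU: {sku_text}"
    resp = client.chat.completions.create(model="gpt-4",
                                          messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```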

This approach enabled us to generate annotations within a week that would otherwise have required months to collect, allowing us to focus on the actual model development and de-risk our goal.

Downstream impacts

Attribute extraction not only allows us to better represent each product in the catalog but also empowers downstream ML models that improve a customer’s shopping experience. Attributes such as brand and organic tag are important features in our personalized ranking models, which recommend items that reflect a consumer’s unique needs and preferences. And attributes such as product category and size enable recommending more relevant substitutions when the original item is out of stock, giving customers a smooth fulfillment experience.

Looking into the future

So far, most of our attribute extraction models are built on top of text-based inputs. A challenge with this approach, however, is the presence of abstraction and abbreviations within written product descriptions. Fortunately, product image quality varies less across merchants. We are actively exploring recent advances in multimodal LLMs that can process text and images together; currently, we are experimenting with multimodal attribute extraction through Visual QA and Chat + OCR. Our Engineering team is also building foundational technologies and infrastructures to allow Dashers to take product photos so that we can perform attribute extraction directly on in-store items.

As we identify more areas where LLMs can be used, we are also working with our ML Platform team to democratize their use across DoorDash through a centralized model platform where anyone can easily prompt-engineer, fine-tune, and deploy LLMs. 

Acknowledgments

Special thanks to Aparimeya Taneja, JJ Loh, Lexi Bernstein, Hemanth Chittanuru, Josey Hu, Carolyn Tang, Sudeep Das, Steven Gani, and Andrey Parfenov, who all worked together to make this exciting work happen!