
At DoorDash, dashing is highly process-dependent. Dashers need a firm grasp of the end-to-end delivery process to complete orders successfully — and earn money. The first iteration of DoorDash support content needed to do more to explain how to dash, handle common delivery issues and pitfalls, and maximize the dashing experience.

Our recent effort to revamp the Dasher support hub built a system that allows Dashers to find support answers independently, reducing support costs. Before the revamp, the existing resources for common Dasher issues were:

  1. Hard to find: Resources were spread across numerous channels and sources, so Dashers had to know exactly where to look to locate answers.
  2. Often unhelpful: Content sometimes didn’t resolve a Dasher’s issue, resulting in long calls with a live agent to reach a resolution.

FAQ articles in the revamp needed to be concise and support rapid changes, such as adding screenshots of the app or upcoming guideline changes. Building a system to support this flexibility on each mobile platform would require duplicate development and would also require each native app to be updated when the content structure changes. Native applications can take weeks to roll out whereas web rollouts can be accomplished immediately. So, we decided to build a common web app that can be hosted within the native application to provide an in-app experience while also providing the flexibility to make rapid changes.

Architecture

Figure 1: The first iteration of our solution (MVP) followed this architecture: Mobile (iOS/Android) ⇒ WebView ⇒ Backend for Frontend (BFF) ⇒ Backend Service.

As shown in Figure 1 above, both mobile clients use a WebView, ensuring the same experience across clients. Some menus, such as in-app support chat, use mobile client-specific callbacks to bridge between the native app and the WebView experience.

The BFF is used to transform the request format from HTTP to gRPC to allow the backend service to parse. The response is similarly transformed from gRPC to HTTP. No business logic is handled at the BFF layer; it exists solely to format the data from the backend service into a format suitable for the web application.

The backend service stores and serves the content. Article contents are stored as JSON that holds not only the content but also the UI formatting. In the MVP, all articles, categories, menus, and snippets are stored as JSON fixtures. Each change to an article requires manually formatting the article, updating the JSON fixture, and deploying the backend service.

MVP scope

To minimize the scope of the MVP, we kept the content limited to the U.S. market. Localization with language- and country/region-specific content would have complicated the initial launch; while we kept these features in mind, we did not implement them in the MVP.

At the start, we kept the content stored in JSON fixtures rather than investing in a full database. The JSON fixtures allowed us to iterate quickly and repeatedly on the data format. A drawback to this approach: We couldn’t use a more robust search system. Instead, we fell back to a basic search over article titles alone, using Levenshtein distance to account for misspellings. All the tags, categories, menus, and article content shown within the FAQ had to be manually formatted into JSON and uploaded.
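For reference, the misspelling tolerance in such a title search can be sketched with a standard Levenshtein distance implementation (this is an illustrative sketch, not DoorDash’s actual search code):

```typescript
// Classic dynamic-programming Levenshtein distance: the minimum number of
// single-character insertions, deletions, or substitutions to turn a into b.
function levenshtein(a: string, b: string): number {
  // dp[i][j] = distance between a[0..i) and b[0..j)
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [
    i,
    ...Array(b.length).fill(0),
  ])
  for (let j = 0; j <= b.length; j++) dp[0][j] = j
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      )
    }
  }
  return dp[a.length][b.length]
}
```

A search over titles would then rank articles by distance between the query and each title, surfacing close matches even when the query is misspelled.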

Communication protocol

Many messages are sent to enable seamless transitions between the native applications and the FAQ feature running within a WebView. For example, the WebView may need Dasher context or an interaction with the UI may need to navigate back to a native view.

Both Android and iOS have methods to send messages to and from their WebViews. To send messages from the WebView to the native app, the native side must register handlers or use delegation. For example, the native side may register a handler called “WebViewModuleCallbacks.” On the WebView side, this resolves to a property on the global window object.

// Android: a handler registered via addJavascriptInterface is exposed as a global object
WebViewModuleCallbacks?.postMessage(...)
// iOS: handlers registered on the WKWebView configuration appear under webkit.messageHandlers
webkit?.messageHandlers?.WebViewModuleCallbacks?.postMessage({...})

To send messages from the native side to the WebView, JavaScript must be evaluated in the form of a string. For example, on iOS:

WebView.evaluateJavaScript("WebViewModule.completeCall(1234, { payload: \"somePayload\" })")

It should be noted that the evaluated JavaScript must reference something that is already available at runtime within the web application. For this reason, a global instance that could respond to messages was implemented for the FAQ feature (i.e. “WebViewModule”).

We faced a few challenges with this messaging pattern:

  1. By default, the receiving side of a message does not acknowledge it or return a result.
  2. We needed a way to manage the complexity of the different message types as feature requirements evolved.
  3. Features and client versions change constantly; a newly introduced message will not be available in older application versions.

A pattern based on JavaScript promises was implemented to solve for the first identified challenge. To do this, an instance within the web application maintained a collection of ongoing message promises. These message promises could be referred to at a later point by an identifier. When the web application prepares to send a message to native, it creates a new promise in the collection along with an identifier. The identifier is then passed alongside the message contents from web to native. To complete the message, the native side evaluates JavaScript and passes back the identifier and any results, fulfilling the promise.
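The flow above can be sketched as a small registry of pending promises. This is an illustrative sketch; aside from completeCall and the WebViewModule name mentioned earlier, the shapes here are assumptions:

```typescript
// Maintains the collection of in-flight message promises keyed by identifier.
class WebViewModule {
  private pending = new Map<number, (result: unknown) => void>()
  private nextId = 0

  // Web side: create a promise and an identifier, then hand the id plus the
  // message contents off to native.
  sendMessage(name: string, payload?: Record<string, unknown>): Promise<unknown> {
    const id = ++this.nextId
    const promise = new Promise<unknown>((resolve) => {
      this.pending.set(id, resolve)
    })
    const message = { id, name, payload }
    // In the real app this would be posted to the registered native handler,
    // e.g. WebViewModuleCallbacks?.postMessage(message)
    void message
    return promise
  }

  // Native side completes the message by evaluating JavaScript such as:
  // WebViewModule.completeCall(1234, { payload: "somePayload" })
  completeCall(id: number, result?: unknown): void {
    const resolve = this.pending.get(id)
    if (!resolve) return // unknown or already-completed identifier
    this.pending.delete(id)
    resolve(result)
  }
}
```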

To resolve the second challenge, a typed message contract is formed to manage the complexity of handling multiple message types:

export type ScriptMessage = {
   name: string
   payload?: Record<string, unknown>
}

The “ScriptMessage” type carries a name that identifies the message’s intent. This name allows the native and web applications to switch on the specific message handling. The message’s receiver should know how to unpack the payload’s contents.
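A minimal sketch of dispatching on the message name follows; the handler results are hypothetical placeholders for the real native or web behavior:

```typescript
// Mirrors the ScriptMessage contract from the article.
type ScriptMessage = {
  name: string
  payload?: Record<string, unknown>
}

// Switch on the message's name to select handling; unknown messages are
// ignored rather than treated as errors.
function handleScriptMessage(msg: ScriptMessage): string {
  switch (msg.name) {
    case 'minimizeArticle':
      return 'minimized'
    case 'navigateBack':
      return 'navigated'
    default:
      return 'unhandled'
  }
}
```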

We’re still working to resolve the third challenge — the moving target of client versions — through incremental changes. Considering the promise-based pattern described earlier, if the web application sends a message that is not supported by the native side, it could stall the application while it waits for the promise fulfillment. To prevent this, the native applications currently forward their version in an initial configuration message. The web application maintains known versions of the native clients and compares against them when handling messages.
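One way to implement such a version gate on the web side is sketched below; the message names, version numbers, and variable names are all hypothetical:

```typescript
// Minimum native client version known to support each message.
const MIN_VERSION_FOR_MESSAGE: Record<string, number> = {
  minimizeArticle: 2, // hypothetical: supported from client version 2 onward
}

// Populated from the initial configuration message sent by the native client.
let nativeClientVersion = 0

// Returns false when sending would stall: the native side would never call
// completeCall for a message it does not recognize.
function canSendMessage(name: string): boolean {
  const required = MIN_VERSION_FOR_MESSAGE[name]
  // Messages with no recorded minimum are assumed safe on all versions.
  return required === undefined || nativeClientVersion >= required
}
```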

WebView

To a web engineer, a WebView is simply another application with some embedded protocols that allow for bi-directional communication between the mobile and web layers.

For the WebView tech stack, we leverage DoorDash’s existing frameworks and libraries, which include React, React Query, and our internal design system. We chose React Query for data fetching and state management because it has cache and cache-invalidation capabilities out of the box and works well with our BFF through REST (representational state transfer) API calls.

WebView entry and exit points 

Our application exposes multiple entry points based on the user’s origin within the Dasher app; for example, entering from account settings opens the WebView pre-filtered to account FAQs. This eliminates the need to look through all of the content. The protocol to deeplink from mobile to the WebView looks something like this:

// The entry point is specified by the caller of the FAQ module
enum EntryPoint {
    case dashing
    case earnings
    /* ... */
}
// Based on the above entry point the filter query param will be populated
webView?.load(
    URL(string: "https://<app-base-url>?filter=\(entryPoint.rawValue)")!
)

We also use the messaging protocol to minimize an article. That way, when a user navigates the mobile app after viewing article content and playing around with the features, they won’t lose access to the article they were just viewing. A handler is attached to a minimize icon; on click, it sends a message that the native mobile client listens for.

Message Handler Definition

const postMessageToNative = (args) => {
  if (webViewHost === 'iOS') {
    setTimeout(() => {
      webkit?.messageHandlers?.webViewModuleCallbacks?.postMessage(args)
    }, 0)
  } else if (webViewHost === 'Android') {
    webViewModuleCallbacks?.postMessage(JSON.stringify(args))
  } else {
    // Web code: do nothing
  }
}

Server-driven UI content

To support a more robust content-based system, we moved all of our FAQ article content to the backend. We then fetch the data through the WebView. This gives us a single source of truth and makes updates easier through a configuration tool. Additionally, if we ever decide to expose this app as a standalone web app, we will still have the content.

From this project, we developed a prototype content configuration tool that creates article content rather than adding it to the client as a fixture. The tool produces the complex JSON that describes the FAQ article structure, enabling future knowledge base expansions without development effort.

Here is a sample of the content structure. The article page’s anatomy is partially server-side driven, partially client-side driven:

Figure 2: Page layout of an article, partially client side driven, and partially server side driven

This tool allowed us to build a component library that maps the response object into text node components and exports a renderer utility function for the Dasher WebView to consume.

We believe this tool will pay tremendous future dividends when we need to expand the capabilities to other server-driven UI initiatives, such as a screen builder for page layouts.

Development and initial rollout challenges

There are additional challenges in development, debugging, and validation when creating a feature in a native application as presented within a WebView. To understand if something truly works — and if it doesn’t, where it broke — requires knowledge of the native and web stacks.

For example, in developing the FAQ feature there was a defect in the promise-based communication protocol. To debug this properly, we needed to run the web application locally, emit a source map for the transpiled TypeScript, allow and direct the native application to the locally running web application, mark the WebView as inspectable within iOS or Android, and attach to the running WebView from Safari or Chrome. This could be challenging to a lone web developer or native developer, so we often practiced pairing.

Results and future work

There were many noticeable benefits from developing the Dasher support hub as a WebView, notably:

  • Quick rollouts: Web deployments are quick and simple, with no friction for user adoption. Mobile app releases, on the other hand, take roughly a week to deploy and around two weeks to reach meaningful adoption percentages. WebView changes require less than a day.
  • Content updates: Because content is stored on the server, it can be updated quickly and often, sending the most recent version to the client without the need for mobile releases.

This work also enabled additional projects that will further aid in creating support tools for our customers and agents, and that can be extended, including:

  • Content configuration tool: We can build out this tool further to configure other web flows, reducing development effort to create similar flows in our mobile and web products.
  • Component library for server-driven UI content: These components handle rendering the complex JSON, which can also be extended to handle different layouts.

Conclusion

Creating an improved Dasher support hub in the app was crucial for providing Dashers better access to information. Dashers need clear, concise, and readily available content that provides solutions to problems quickly, without needing to go through live agents. The information also needs to be easily updatable, requiring a unique solution to avoid long deployment times or duplicate work on Android and iOS. Building a common web app allowed us to circumvent the challenges of updating native applications, massively reducing rollout time and giving Dashers access to the most up-to-date information available.


As DoorDash’s microservices architecture has grown, so too has the volume of interservice traffic. Each team manages their own data and exposes access through gRPC services, an open-source remote procedure call framework used to build scalable APIs. Most business logic is I/O-bound because of calls to downstream services. Caching has long been a go-to strategy to improve performance and reduce costs. However, the lack of a uniform approach to caching has led to complications. Here we explain how we have streamlined caching through a Kotlin library, offering backend developers a fast, safe, and efficient way to introduce new caches.

Boosting performance while supporting business logic

In the world of DoorDash microservices, the focus lies more on implementing business logic than on performance optimization. While optimizing I/O patterns in the code could improve performance, rewriting business logic to do so would be time-consuming and resource-intensive. The problem, then, becomes how to boost performance without overhauling the existing code. 

One orthodox solution is caching — the practice of storing copies of frequently accessed data in close proximity to where it is needed to improve speed and performance for subsequent requests. Caching can be added transparently to business logic code simply by overloading methods used to retrieve data. 

The most common caches at DoorDash are Caffeine for local caching and Redis Lettuce for distributed caching. Most teams use Caffeine and Redis Lettuce clients directly in their code.

Because caching presents a common set of problems, many teams were running into similar issues while implementing their own independent approaches:

  1. Cache staleness: While implementing caching for a method is straightforward, it’s challenging to ensure that the cache remains updated with the original data source. Resolving issues that arise from outdated cache entries can be complex and time-consuming.
  2. Heavy dependency on Redis: Services frequently encountered a high rate of failure whenever Redis was down or experiencing issues.
  3. No runtime control: Introducing a new cache can be risky because of the lack of real-time adjustments. If the cache encounters issues or requires tuning, changes require a new deployment or rollback, which consumes both time and development resources. Additionally, a separate deployment is required to tune cache parameters such as TTL.
  4. Inconsistent key schema: The absence of a standardized approach for cache keys complicates debugging efforts. Specifically, it’s difficult to trace how a key in the Redis cache corresponds to its usage in Kotlin code.
  5. Inadequate metrics and observability: The absence of uniform metrics across teams resulted in a lack of critical data, such as cache hit rates, request counts, and error rates. 

  6. Difficulty in implementing multilayered caching: The previous setup didn’t easily support the use of multiple caching layers for the same method. Combining a local cache with a more resource-intensive Redis cache could optimize results before resorting to fallback.

Dream big, start small

While we ultimately created a shared caching library for all of DoorDash, we started with a pilot program to tackle caching problems for just one service — the DashPass backend. We wanted to battle-test and iterate on our solution before adopting it elsewhere.

At the time, DashPass was experiencing scaling challenges and frequent brownouts. DoorDash was growing rapidly and seeing increasing traffic every week. DashPass was one of the heaviest users of our shared Postgres database, on which almost all of DoorDash relied; if it went down, customers would not be able to place orders.

Simultaneously, we were also rapidly developing new features and use cases for DashPass, so the developer bandwidth for performance tuning was low. 

With all of this critical activity occurring alongside pressure to stabilize the service — even as most engineers were busy managing business-related features — we decided to develop a simple caching library that could be integrated transparently and with minimal disruption.

Single interface to rule them all

Each team used a different caching client, such as Caffeine, Redis Lettuce, or plain HashMaps, so there was little consistency in function signatures and APIs. To standardize this, we introduced a simplified interface for application developers to use when setting up new caches, as shown in the following code snippet:

interface CacheManager {
    /**
     * Wraps fallback in cache layers.
     * key: Instance of CacheKey.
     *      Subclasses of CacheKey define a unique cache with a unique
     *      name, which can be configured via runtime.
     * fallback: Invoked on a cache miss. The return value is then cached and
     *           returned to the caller.
     */
    suspend fun <V> withCache(
        key: CacheKey<V>,
        fallback: suspend () -> V?
    ): Result<V?>
}

/**
 * Each unique cache is tied to a particular implementation of the key.
 *
 * CacheKey controls the cache name and the type of unique ID.
 *
 * The name of the cache is the class name of the implementing class;
 * all implementations should use a unique class name.
 */
abstract class CacheKey<V>(
    val cacheKeyType: String,
    val id: String,
    val config: CacheKeyConfig<V>
)

/**
 * Cache-specific config.
 */
class CacheKeyConfig<V>(
    /**
     * Kotlin serializer for the return value. This is used to store values in Redis.
     */
    val serializer: KSerializer<V>
)

This allows us to use dependency injection and polymorphism to inject arbitrary logic behind the scenes while maintaining uniform cache calls from business logic.

Layered caches

The simplified interface also makes it easier for teams that previously used only a single cache layer to adopt a multi-layered caching system. Multiple layers boost performance because some layers, such as the local cache, are much faster than layers that involve network calls, such as Redis, which in turn is faster than most service calls.

In a multi-layer cache, a key request progresses through the layers until the key is found or until it reaches the final source of truth (SoT) fallback function. If the value is retrieved from a later layer, it’s then stored in earlier layers for faster access on subsequent requests for the same key. This layered retrieval and storage mechanism optimizes performance by reducing the need to reach the SoT.

We implemented three layers behind a common interface as shown in Figure 1:

  1. Request local cache: Lives only for the lifetime of the request; uses a simple HashMap.
  2. Local cache: Visible to all workers within a single Java virtual machine; uses a Caffeine cache for heavy lifting.
  3. Redis cache: Visible to all pods sharing the same Redis cluster; uses Lettuce client.

Figure 1: Multi-layer cache request flow
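The layered retrieval-and-backfill flow described above can be sketched as follows. This is an illustrative TypeScript version, not the actual Kotlin library; the interface and function names are assumptions:

```typescript
// A single cache layer: returns undefined on a miss.
interface CacheLayer {
  get(key: string): Promise<unknown | undefined>
  set(key: string, value: unknown): Promise<void>
}

// Walks the layers in order (fastest first). On a hit in a later layer, the
// value is backfilled into earlier layers; on a total miss, the source-of-truth
// fallback is invoked and every layer is populated.
async function withLayeredCache(
  layers: CacheLayer[],
  key: string,
  fallback: () => Promise<unknown> // source-of-truth lookup
): Promise<unknown> {
  for (let i = 0; i < layers.length; i++) {
    const hit = await layers[i].get(key)
    if (hit !== undefined) {
      for (let j = 0; j < i; j++) await layers[j].set(key, hit)
      return hit
    }
  }
  const value = await fallback()
  for (const layer of layers) await layer.set(key, value)
  return value
}
```

In the real library the three layers would be the request-local HashMap, the Caffeine cache, and Redis, each with its own TTL.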

Runtime feature flag control

Various use cases may call for different configurations or turning entire caching layers off. To make this much faster and easier, we added runtime control. This allows us to onboard new caching use cases once in code, then follow up via runtime for rollout and tuning.

Each unique cache can be controlled individually via DoorDash’s runtime system. Each cache can be:

  • Turned on or off. This can be handy if a newly introduced cache strategy has a bug. Instead of doing a rollback deployment, we can simply turn the cache off. In off mode, the library invokes fallback, skipping all cache layers entirely.
  • Reconfigured for an individual time to live (TTL). Setting a layer’s TTL to zero will skip it entirely. 
  • Shadowed at a specified percentage. In shadow mode, a percentage of requests to cache will also compare cached value against the SoT.
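A sketch of how such runtime controls might gate the layers is shown below; the configuration shape is hypothetical, and TypeScript is used for illustration even though the library itself is Kotlin:

```typescript
// Hypothetical per-cache runtime configuration.
type CacheRuntimeConfig = {
  enabled: boolean            // off mode: invoke fallback, skip all layers
  ttlSecondsByLayer: number[] // a TTL of zero skips that layer entirely
  shadowPct: number           // percent of reads compared against the SoT
}

// Returns the indices of the layers a request should consult.
function activeLayers(cfg: CacheRuntimeConfig): number[] {
  if (!cfg.enabled) return []
  return cfg.ttlSecondsByLayer
    .map((ttl, i) => (ttl > 0 ? i : -1))
    .filter((i) => i >= 0)
}
```

Because the configuration is read at runtime, a buggy cache can be disabled instantly instead of waiting for a rollback deployment.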

Observability and cache shadowing

To measure cache performance, we collect metrics on how many times a cache is requested and how many times requests result in a hit or miss. Cache hit ratio is the primary performance metric; our library collects hit ratio metrics for each unique cache and layer.

Another important metric is how fresh cache entries are compared to the SoT. Our library provides a shadowing mechanism to measure this. If shadowing is turned on, a percentage of cache reads will also invoke fallback and compare cached and fallback values for equality. Metrics on successful and unsuccessful matches can be graphed and alerted on. We can also measure cache staleness, the latency between cache entry creation and when the SoT was updated. Measuring cache staleness is critical because each use case has a different staleness tolerance.
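Shadowing can be sketched as a wrapper around a cached read. This is a TypeScript illustration; the sampling and comparison details are assumptions rather than the library's actual implementation:

```typescript
// Reads the cached value; for a shadowPct fraction of calls, also reads the
// source of truth and reports any mismatch for staleness metrics.
async function readWithShadow<T>(
  readCached: () => Promise<T>,
  readSoT: () => Promise<T>,
  shadowPct: number,                 // 0..100, controlled at runtime
  onMismatch: (cached: T, fresh: T) => void,
  roll: () => number = Math.random   // injectable for deterministic testing
): Promise<T> {
  const value = await readCached()
  if (roll() * 100 < shadowPct) {
    const fresh = await readSoT()
    // Compare by value; the real library also logs which path in the object
    // differs between cached and original values.
    if (JSON.stringify(fresh) !== JSON.stringify(value)) onMismatch(value, fresh)
  }
  return value
}
```

Note that the caller always receives the cached value; shadow reads exist purely to emit match/mismatch metrics.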

In addition to metrics, any misses also generate error logs, which itemize the path in the objects that differs between cached and original values. This can be handy when debugging stale caches.

Providing observability into cache staleness is key for empirically validating a cache invalidation strategy.

Example usage

Let’s go over an example and dive deeper into library API.

Each cache key has four main components:

  1. Unique cache name, which is used as a reference in runtime controls.
  2. Cache key type, a string representing the key’s type of entity, allowing categorization of cache keys.
  3. ID, a string that refers to a unique entity of the cache key type.
  4. Configuration, which includes default TTLs and a Kotlin serializer.

To standardize key schema, we chose the uniform resource name (URN) format:

urn:doordash:<cache key type>:<id>#<cache name>
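A small helper can produce keys in this schema (illustrative only; the function name is hypothetical):

```typescript
// Builds a cache key URN in the schema urn:doordash:<cache key type>:<id>#<cache name>
function cacheUrn(cacheKeyType: string, id: string, cacheName: string): string {
  return `urn:doordash:${cacheKeyType}:${id}#${cacheName}`
}

// cacheUrn('user', '123', 'UserProfileRepositoryGetUserProfileKey')
// → 'urn:doordash:user:123#UserProfileRepositoryGetUserProfileKey'
```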

The library provides a CacheManager instance, which is injected and has a `withCache` method that wraps a fallback or another Kotlin suspend function to be cached.

For instance, if we have a repository UserProfileRepository with a method GetUserProfile that we want to cache, we could add the following key:

class UserProfileRepositoryGetUserProfileKey(userId: String) : CacheKey<UserProfile>(
    cacheKeyType = "user",
    id = userId,
    config = CacheKeyConfig(serializer = UserProfile.serializer())
)
...
suspend fun getUserProfile(userId: String): UserProfile =
    cacheManager.withCache(UserProfileRepositoryGetUserProfileKey(userId)) {
        ... <Fetch user profile> ...
    }.getOrThrow()

A key for the user with ID “123” would be represented as a URN as follows:

urn:doordash:user:123#UserProfileRepositoryGetUserProfileKey

Note that any other CacheKey that uses “user” as the cache key type will share the same prefix as UserProfileRepositoryGetUserProfileKey. 

Standardizing how keys are represented is great for debugging and observability, and opens up unique opportunities for pattern-matching keys.

Use case guidance

Once we created and battle-tested the library in DashPass, the next step was to get it to developers and help them integrate it into their work as seamlessly as possible. To do so, we gave high-level guidance on when and how to use caching — and, just as importantly, when not to use it.

When to use caching

We can break up caching use cases by eventual consistency constraints. 

Category 1: Can tolerate stale cache

In certain use cases, it’s acceptable to have a few minutes of delay for updates to take effect. In these situations, it’s safe to use all three caching layers: request local cache, local cache, and Redis layer. You can set the TTL for each layer to expire in several minutes. The longest TTL setting across all layers will determine the maximum time for the cache to become consistent with the data source. 

Monitoring cache hit rates is crucial for performance optimization; adjusting the TTL settings can help improve this metric. 

In this scenario, there’s no need to implement shadowing to monitor cache accuracy.

Category 2: Cannot tolerate stale cache

When data is subject to frequent changes, stale information could adversely affect business metrics or user experience. It becomes crucial to limit the maximum tolerable staleness to just a few seconds or even milliseconds. 

Local caching should generally be avoided in such a scenario because it can’t be invalidated easily. However, request-level caching might still be suitable for temporary storage. 

While it is possible to set a longer TTL for the Redis layer, it’s essential to invalidate the cache as soon as the underlying data changes. Cache invalidation can be implemented in various ways, such as deleting the relevant Redis keys upon data updates or using a tagging approach to remove caches when pattern-matching is difficult. 

There are two main options for invalidation triggers. The preferred method is to use Change Data Capture events emitted when database tables are updated, although this approach may involve some latency. Alternatively, the cache can be invalidated directly within the application code when data changes. This is faster but potentially more complex because data changes can originate from multiple code locations.
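The second option, invalidating directly in application code after a write, can be sketched as follows. TypeScript is used for illustration; the db and cache interfaces here are hypothetical stand-ins:

```typescript
// After the database write completes, delete the corresponding cache entry so
// the next read repopulates from the source of truth.
async function updateUserProfile(
  db: { write: (id: string, profile: object) => Promise<void> },
  cache: { del: (key: string) => Promise<void> },
  userId: string,
  profile: object
): Promise<void> {
  await db.write(userId, profile)
  // Key follows the URN schema from the article.
  await cache.del(`urn:doordash:user:${userId}#UserProfileRepositoryGetUserProfileKey`)
}
```

With a standardized key schema, every write path can derive the exact key (or key pattern) to delete, which is what makes this approach tractable.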

It is crucial to enable cache shadowing to monitor staleness because this visibility is vital for verifying effectiveness of the cache invalidation strategy.

When not to use caching

Write or mutation flows

It’s a good idea to reuse code as much as possible so that your write endpoint may reuse the same cached function as your read endpoints. But this presents a potential staleness issue when you write to the database and then read the value back. Reading back a stale value may break business logic. Instead, it’s safe to turn off caching altogether for these flows while reusing the same cached function outside of the CacheContext.

As a source of truth

Do not use the cache as a database or rely on it as a source of truth. Always be mindful of expiring caching layers and have a fallback that queries the correct source of truth.

Conclusion

DoorDash’s microservices faced significant challenges as a result of fragmented caching practices. By centralizing these practices into one comprehensive library, we dramatically streamlined our scalability and bolstered safety across our services. With the introduction of a standardized interface, consistent metrics, a unified key scheme, and adaptable configurations, we have now fortified the process of introducing new caches. Moreover, by offering clear guidance on when and how to employ caching, we’ve successfully staved off potential pitfalls and inefficiencies. This strategic overhaul has positioned us to harness caching’s full potential while sidestepping its common pitfalls.


DoorDash is a career-defining place to work, and we’re proud to have been named one of America’s Greatest Workplaces 2023 by Newsweek. The list was built following a large-scale independent survey with more than 389,000 completed reviews by employees who work for companies throughout the U.S.

Organizations that are featured in the 2023 list earned this recognition based on employee satisfaction with touch points across flexibility in work, career development opportunities, benefits, and more.

DoorDash is committed to growing and empowering inclusive communities within our company, our industry, and our cities. Through our work, we bring business to local merchants, convenience to customers, flexible earning to Dashers, and fulfilling opportunities to our employees across the globe.

We foster a culture of innovation, tackling new challenges across our business to quickly build, test, learn, and drive impact. Employees are encouraged to own their journey at DoorDash, with flexibility across teams while they navigate career growth and manage their work-life integration.

Read more about how we’ve driven the culture our employees experience over the last year:

DoorDash earns Equality 100 in the Human Rights Campaign’s Corporate Equality Index — the highest score an inclusive company can receive — for the second consecutive year

One Team, More Flex: DoorDash commits to flexible work

DoorDash launches new Employee Resource Group aimed at celebrating physical and neurological diversity

DoorDash kicks off Veterans Appreciation Month by rolling out new Military Leave Policy

Teams at DoorDash are creating space to recharge, reset, and return to work even stronger

DoorDash invests in Employee Resource Groups by compensating leaders

Across September and October, we celebrated National Hispanic Heritage Month alongside our Unidos@ Employee Resource Group. As our ERG leaders began building out the inclusive and engaging slate of events to celebrate this heritage month, many team members expressed an interest in learning from Hispanic leaders in the industry on how they’ve navigated and explored various career paths, identified mentors, and built relationships. 

In partnership with our Chief People Officer and the Unidos@ executive sponsor, Mariana Garavaglia, innovative Hispanic leaders across tech, retail, and e-commerce joined us to share their respective experiences across two fireside chats.

Speakers included:

– Mariana Garavaglia, our Chief People Officer and Unidos@ executive sponsor
– Anna Fabrega, Chief Executive Officer at Local Bounti
– Ariel Dos Santos, VP of Product at Redfin
– Cris Rivera, Regional President at Cresco Labs
– Andres Gonzalez, VP of eCommerce Operations at Walmart
– Jerry Guerrero, Director of Diversity, Equity, and Inclusion at Roku
– Andrea Perez, SVP Global Marketing – Product, Brand and Services at Sony 

Because each of our speakers is at a different stage in their career, from Director to CEO, no one leader had the same story or advice to share. The common thread all leaders highlighted was the importance of being intentional about what employees seek in their role and which mentors to seek out, and the impact those mentors and relationships had on their careers.

“As a proud Latina and the executive sponsor of Unidos@ DoorDash, I’m deeply honored to celebrate Hispanic Heritage Month with our vibrant community. ‘Somos Poderosos’ was more than just a fireside chat; it was a platform for inspiration, connection, and growth, highlighting the incredible journeys of Hispanic leaders in the tech industry and igniting our employees’ aspirations. 

“Investing in their growth and development through events like this is an investment in our collective future while building a more inclusive, thriving, and empowered community. Together, we stand proud and united during Hispanic Heritage Month and every month.”

– Mariana Garavaglia, our Chief People Officer and Unidos@ executive sponsor

Through our company culture, community efforts, and engagement with merchants, we’ll continue our support of Hispanic and Latin merchants, as well as the communities we serve, year-round.

Our efforts to create a workplace where everyone feels welcomed, supported, and valued is one of the reasons we were recently recognized as one of America’s Greatest Workplaces 2023 by Newsweek. To learn more about this award and how we’ve driven the culture our employees experience over the last year, click here.

Experimentation isn’t just a cornerstone for innovation and sound decision-making; it’s often referred to as the gold standard for problem-solving, thanks in part to its roots in the scientific method. The term itself conjures a sense of rigor, validity, and trust. Yet as powerful as experimentation is, its integrity can be compromised by overlooked details and unforeseen challenges. One of these challenges is sample ratio mismatch, or SRM. 

SRM represents one of the most egregious data quality issues in A/B tests because it fundamentally compromises the basic assumption of random assignment. For example, if two reasonably sized groups are expected to be split 50/50, but instead show a 55/45 split, the assignment process likely is compromised. This means there is a strong possibility that any experimental results and decisions based on them will not be valid.
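As a concrete illustration, this basic check can be run as a chi-square goodness-of-fit test against the configured split. This sketch uses SciPy and is not our internal implementation:

```python
from scipy import stats

def srm_pvalue(n_control: int, n_treatment: int,
               expected: tuple = (0.5, 0.5)) -> float:
    """Chi-square goodness-of-fit test comparing observed group
    counts to the configured split; a tiny p-value signals SRM."""
    n = n_control + n_treatment
    observed = [n_control, n_treatment]
    expected_counts = [n * expected[0], n * expected[1]]
    return stats.chisquare(observed, f_exp=expected_counts).pvalue

# A 55/45 split on two reasonably sized groups fails decisively,
# while a small wobble around 50/50 passes.
print(srm_pvalue(55_000, 45_000))  # vanishingly small p-value: SRM
print(srm_pvalue(50_100, 49_900))  # p ≈ 0.53: consistent with 50/50
```

The test asks how surprising the observed counts would be if the configured split were true, which is why even a seemingly small deviation becomes damning at large sample sizes.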

At DoorDash, we constantly innovate and experiment. To do so effectively, we had to find ways to reduce our SRM rate. In this post, we explore some of the common examples of SRM failures we experienced, the solutions we’ve implemented to solve these issues, and how we raised awareness of these solutions internally to dramatically reduce our SRM rate. 

Figure 1: If two groups are expected to have a 50/50 distribution, the SRM check should pass when that 50/50 split is indeed observed. We should be concerned, however, if we instead observe a split of 55/45.

Cautionary tales of faux gains and real losses

Example 1: The $10 Million Mirage

Imagine that your target is to improve weekly revenue per user. After setting up the experiment with a 50/50 split between control and treatment groups, you run the experiment for a week and see that revenue has improved 2% — a $200,000 weekly incremental revenue impact. Annualized, this results in more than $10 million per year. High-fives are exchanged and your team is already setting its sights on the next big project.

Fortunately, a more observant experimenter has noticed that the SRM check failed. Instead of a 50/50 split between treatment and control, there actually was a 49.5/50.5 split — more than enough to trigger an SRM failure. Careful examination shows that the root cause for the failure is that all employees were exposed to the treatment. Almost every customer-focused company has an internal practice of dogfooding in which internal employees get the latest features by default. Because employees engage with the product much more frequently than outside users, the ~1% contribution to the total sample was enough to skew the metrics. The team’s jubilation over a $10 million win was tragically premature.

| Segment | Experiment Group | Number of samples | 7-day revenue value/user | Reported Incremental Impact |
| --- | --- | --- | --- | --- |
| All Users | Control | 500,000 | $10 | $0 |
| All Users | Treatment | 500,000 | $10 | $0 |
| Employees | Treatment | 10,000 | $30 | $200,000 |

Table 1: In this experiment, there is a 49.5/50.5 split between control and treatment groups. The imbalance is accidental, driven purely by including employees in the treatment group. Because employees engage with the product more, they skew the revenue impact by 2%, leading to the reported weekly $200,000 impact. When the employee segment is excluded, the real incremental impact is $0.

Note that a small change in the absolute size of the groups (1%) can introduce a much larger change in the experiment metric (2%), which means that the size of the SRM doesn’t set a ceiling on its impact on the metric readout. Although the example above is fictitious, we have observed that, on our platform, experiments with SRM have twice as many statistically significant metrics. Simply put, by turning a blind eye to imbalance, teams could erroneously double their rate of “statistically significant” findings, leading to potentially faulty conclusions.
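The arithmetic behind Table 1 can be reproduced in a few lines (the numbers are the fictitious ones from the table):

```python
# Fictitious numbers from Table 1
control_users, control_rev_per_user = 500_000, 10.0
treated_users, treated_rev_per_user = 500_000, 10.0
employee_users, employee_rev_per_user = 10_000, 30.0

# SRM: treatment ends up with 510k of 1.01M users, a 49.5/50.5 split
treatment_total = treated_users + employee_users
treatment_share = treatment_total / (control_users + treatment_total)

# The apparent lift comes entirely from the employee segment
treatment_mean = (treated_users * treated_rev_per_user
                  + employee_users * employee_rev_per_user) / treatment_total
weekly_incremental = treatment_total * (treatment_mean - control_rev_per_user)

print(f"{treatment_share:.3f}")       # 0.505
print(f"${weekly_incremental:,.0f}")  # $200,000
```

Excluding the 10,000 employees makes both groups identical at $10/user, and the $200,000 "win" vanishes.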

Example 2: The bugfix bias

Bug fix handling is another area in which users can inadvertently introduce SRM. Imagine, for example, that there’s a bug in the implementation of a recently introduced new treatment. An engineer detects that bug and fixes it mid-experiment. They subsequently adjust the experiment’s start date so that it does not include metric data collected prior to the bug fix. 

Figure 2: SRM imbalance introduced after mid-experiment bug fix 

Figure 2 shows the time lapse of exposure allocation change. Because users don’t forget history, the introduced bug leads to an uneven distribution that deviates from the expected sample ratio; low-intent users likely have left the platform while remaining treatment users have decided to stay. This imbalance leads both to SRM and pre-experimental bias in how results from this experiment can be interpreted.

The solution to this problem is to reshuffle the exposure for both control and treatment after the bug gets fixed. Restarting the experiment through reshuffling resolves the imbalance and brings control and treatment groups back to the same starting point.

Figure 3: Reshuffling experiment after bug fix resolves SRM

These are just two examples of how SRM can slip into experimentation. Unfortunately, there is general agreement across the industry that SRM is easy to detect but excruciatingly hard to diagnose and correct, even for seasoned experimenters. SRM can be driven by data quality issues, experiment configuration problems, incorrect filtering when joining data, improper rollout/rollback procedures, interaction between experiments, inconsistent definitions for ConsumerIds (UUID vs. incremental ids), and any number of additional issues. Our challenge has been to create solutions to help our experimenters avoid, identify, and correct SRM.

Solutions from the trenches

At DoorDash, we have pursued several approaches to reduce the platform’s rate of SRM, including innovating on methodology for how we diagnose SRM, improving our real-time observability and alerting, and focusing on education and awareness. 

Statistical approaches for identifying imbalance

The most common approach for identifying SRM is to use a chi-square test, which can quickly detect that something is wrong. These tests, however, can’t help identify why an imbalance has happened. So, as a follow-up, some platforms allow experimenters to perform “eyeball statistics”: segmenting the data to understand which attribute might be driving the imbalance. For example, data collected at randomization time could be segmented into Android and iOS to allow a visual check for irregularities.

An improvement on such ad hoc segmentation would be to run a chi-square test repeatedly on segment subpopulations or to run permutation tests against contingency tables. The latter approach bests a chi-square test because it can flag which user segments are driving the imbalance and provides inferential statistics for use in making decisions. That said, there are three problems with running permutation tests or ad hoc chi-square tests:

  • They don’t generate orthogonal effects. Figure 4 below shows an example in which country and platform segments are analyzed separately, leading to imbalances in both. For example, the imbalance might be caused by Android exposures, yet because the U.S. has more Android users than other parts of the world, the country attribute will also be flagged as an imbalance. Lack of orthogonal effects is the biggest disadvantage of current methods. 
  • They don’t provide a good tradeoff between false positives and the power to detect SRM. Running permutation tests and chi-square tests against dozens of segments requires aggressive p-value corrections, thus reducing the sensitivity of the SRM check.
  • They are computationally inefficient. Running permutation testing at scale with tens of thousands of daily checks can quickly generate a very inefficient infrastructure footprint. 
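For reference, the repeated per-segment chi-square check described above might be sketched as follows; the column names and synthetic data are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats

def srm_by_segment(df, segment_col, group_col="is_treatment",
                   expected=(0.5, 0.5)):
    """Run a chi-square SRM check within each segment value.
    This flags which segments look imbalanced, but the effects
    are not orthogonal across correlated attributes."""
    p_values = {}
    for value, sub in df.groupby(segment_col):
        counts = sub[group_col].value_counts().reindex([0, 1], fill_value=0)
        n = counts.sum()
        f_exp = [n * expected[0], n * expected[1]]
        p_values[value] = stats.chisquare(counts.to_numpy(), f_exp=f_exp).pvalue
    return p_values

# Synthetic data: Australia has an 80/20 split, everywhere else 50/50
rng = np.random.default_rng(7)
df = pd.DataFrame({"country": rng.choice(["USA", "AUS", "CAN"], size=90_000)})
df["is_treatment"] = rng.integers(0, 2, size=len(df))
aus = df["country"] == "AUS"
df.loc[aus, "is_treatment"] = rng.choice([1, 0], size=aus.sum(), p=[0.8, 0.2])

p = srm_by_segment(df, "country")
# p["AUS"] is astronomically small; the other countries pass
```

Note the per-segment checks also illustrate the multiple-comparisons problem: with dozens of segments, p-values must be corrected, which costs sensitivity.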

Instead, we wanted an approach that allows us to generate orthogonal effects, scales well, and doesn’t sacrifice power.

Figure 4: To demonstrate why orthogonal effects are important, note that Android is the root issue leading to imbalance. But because the U.S. has more Android users than other countries, an experimenter might mistakenly assume that the problem can be attributed to the user country. 

Our approach: Regression is all we need

When we randomly assign users to treatment and control groups, we assume that nothing but randomness drives those assignments. As illustrated in Figure 5 below, if we were to use an estimator to check for any relationship between user attributes collected before randomization and treatment assignment, we should find zero correlation and zero regression coefficients.

Figure 5: Attributes collected prior to randomization should have no impact on treatment assignment. 

One estimator that provides orthogonalization properties and generates simple-to-interpret statistics for verifying whether something is related to treatment assignment is linear regression.

We will use two dimensions to more clearly explain how to use linear regression to identify imbalance. Let’s assume that we have two attributes we collect at the time of randomization: 

  • Country: USA, Canada, Australia
  • Platform: Web, iOS, Android

We will test our approach with three scenarios:

  • No imbalance
  • Imbalance due to country=Australia
  • Imbalance due to country=Australia OR platform=Android

Figure 6: In Scenario 1, there is no imbalance. In Scenario 2, there is an imbalance driven by having a split of 80/20 instead of 50/50 in Australia. In Scenario 3, there is an imbalance driven by both the Australia segment and the Android segment, which also has an 80/20 split.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
def generate_data(n,
                  platforms=["iOS", "Android", "Web"],
                  countries=["USA", "AUS", "CAN"]):
    np.random.seed(42)
    expected_distribution = [0.5, 0.5]
    experiment_groups = [1, 0]
    df = pd.DataFrame(
        {
            "user_id": range(1, n + 1),
            "platform": np.random.choice(platforms, size=n),
            "country": np.random.choice(countries, size=n),
        }
    )
    # Scenario 1: No imbalance
    df["scenario_1"] = np.random.choice(experiment_groups, size=n, p=expected_distribution)
    # Scenario 2: Australia 80/20 imbalance
    df["scenario_2"] = df["scenario_1"]
    mask_2 = df["country"] == "AUS"
    df.loc[mask_2, "scenario_2"] = np.random.choice(experiment_groups, size=sum(mask_2), p=[0.8, 0.2])
    # Scenario 3: Australia or Android 80/20 imbalance
    df["scenario_3"] = df["scenario_1"]
    mask_3 = (df["country"] == "AUS") | (df["platform"] == "Android")
    df.loc[mask_3, "scenario_3"] = np.random.choice(experiment_groups, size=sum(mask_3), p=[0.8, 0.2])    
    return df

In this code snippet, we generate data with three randomization options: one with completely random assignment, one where the distribution skew is driven by country (the Australia segment), and one where the skew is driven by country or platform (the Australia or Android segments).

Fundamentally, if we want to understand what attributes are related to treatment assignment, we simply have to fit a regression with the following form:

is_treatment ~ country + platform

The code below allows us to run this regression.

def run_model(df, scenario_name):
    # center the outcome variable around expected ratio
    df['is_treatment'] = df[scenario_name] - 0.5
    formula = "is_treatment ~ 1 + platform + country"
    # fit the regression 
    m = smf.glm(formula, data=df).fit(cov_type="HC1")
    # get the p-values for the main effect using a Wald test
    wald_p_values = m.wald_test_terms(scalar=True).table
    return wald_p_values

Here we run a regression to explain treatment assignment as a function of the platform and country variables. Because we are interested only in the main effects of those variables (e.g., “Is there a main effect of platform on treatment assignment?”), we follow the regression with a Wald test to get a p-value for each main effect. Figure 7 shows the Wald test output for each of the three scenarios. We can immediately draw these conclusions:

  • In Scenario 1, none of the attributes are related to treatment assignment.
  • In Scenario 2, we can see that country has a very low p-value. There is a main effect from country, but we don’t know yet which specific country segment is responsible for the imbalance.
  • In Scenario 3, we can see that both platform and country are drivers of imbalance. 

Figure 7: The Wald test results show: In scenario 1, none of the attributes are related to treatment assignment. In scenario 2, only country is related to treatment assignment (p<0.0001) and in scenario 3 both platform and country are predictive of treatment assignment (p<0.0001). 

Note that our internal implementation is more complex. The example process described so far only allows us to find the main effects. Internally, we apply a recursive algorithm to eliminate subsets of the data that contribute most to imbalance, similar to what an experimenter would do in the process of salvaging data from SRM. Moreover, we apply a correction to p-values when we perform multiple checks and handle a variety of edge cases, such as when a segment has zero variance or the regression is not invertible because of perfect multicollinearity. 
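We don’t detail our exact p-value correction here; as one standard option, a Holm step-down correction over a batch of imbalance checks looks like this sketch:

```python
def holm_significant(p_values, alpha=0.05):
    """Holm step-down correction: test the smallest p-value at
    alpha/m, the next at alpha/(m-1), and so on, stopping at the
    first failure. Returns a parallel list of booleans."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # all larger p-values also fail
    return significant

print(holm_significant([0.001, 0.04, 0.03]))  # [True, False, False]
```

Holm controls the family-wise error rate while being uniformly more powerful than a plain Bonferroni correction, which matters when dozens of segment checks run per experiment.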

Optimizations

As shown in the code snippet below, an important optimization relies on using weighted regression for the computation. Frequency weights let the algorithm scale efficiently even on large datasets. With this approach, we not only perform the regression computation more efficiently, but also minimize network transfer costs and latencies, because much of the aggregation that produces the inputs can be pushed down to the data warehouse.

df_aggregated = df.groupby(['country', 'platform', 'scenario_1'], as_index=False).size()
# center the outcome around the expected ratio, as before
df_aggregated['is_treatment'] = df_aggregated['scenario_1'] - 0.5
formula = "is_treatment ~ 1 + platform + country"
model1 = smf.glm(formula, data=df_aggregated, freq_weights=df_aggregated['size']).fit(cov_type="HC1")

In this example, we aggregate the data to the platform, country, and experiment group level. This aggregation allows us to reduce the data size from millions to hundreds of rows, making the SRM computation many orders of magnitude more efficient. This aggregation is done on the data warehouse side.

Extensions

An extension of this methodology would be to use the regression approach to correct for SRM, thus salvaging the collected data. This should be done only after the causes of the imbalance have been clearly identified and ruled out. To apply the correction, you simply need to fit a regression with this form:

metric_outcome ~ is_treatment + country + platform

Adding the two regressor variables not only corrects results for SRM, but could also contribute to variance reduction because the additional covariates could explain some of the variance in the metric outcome. 
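A minimal sketch of this correction on synthetic data, in which platform drives the metric, treatment assignment is imbalanced on Android, and the true treatment effect is zero (all names and numbers are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "country": rng.choice(["USA", "AUS", "CAN"], size=n),
    "platform": rng.choice(["iOS", "Android", "Web"], size=n),
    "is_treatment": rng.integers(0, 2, size=n),
})
# Android users are overrepresented in treatment (80/20) and
# have higher outcomes; the true treatment effect is zero.
android = df["platform"] == "Android"
df.loc[android, "is_treatment"] = rng.choice([1, 0], size=android.sum(), p=[0.8, 0.2])
df["metric_outcome"] = np.where(android, 12.0, 10.0) + rng.normal(0, 1, size=n)

# Naive comparison of means is inflated by the imbalance
naive_lift = (df.loc[df["is_treatment"] == 1, "metric_outcome"].mean()
              - df.loc[df["is_treatment"] == 0, "metric_outcome"].mean())

# Covariate-adjusted estimate: the platform and country terms
# absorb the imbalance (and reduce variance)
m = smf.ols("metric_outcome ~ is_treatment + country + platform", data=df).fit()
adjusted_lift = m.params["is_treatment"]
# naive_lift is spuriously large; adjusted_lift is near zero
```

The same covariates that diagnose the imbalance salvage the readout: once platform is in the model, the treatment coefficient reverts to the true effect.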

Observability

DoorDash also has explored ways to give users better observability on experiment exposures beyond the statistical methods mentioned previously. Experiment exposures are one of our highest volume events. On a typical day, our platform produces between 80 billion and 110 billion exposure events. We stream these events to Kafka and then store them in Snowflake. Users can query this data to troubleshoot their experiments. However, there are some issues with the current tools:

  1. There is a lag on the order of tens of minutes between when the exposures are created and when they are available for querying in Snowflake.
  2. The queries take a long time to run on Snowflake, even after we apply partitioning. Running complex queries takes even longer.

We want to give users an easy-to-consume dashboard that lets them monitor and observe experiment exposures in real time, giving them immediate insight into the performance and health of their ongoing experiments. As an added benefit, the dashboard reduces our reliance on Snowflake for troubleshooting experiments, thereby reducing overall operational costs.

We needed to do two things to create the dashboards. First, we had to aggregate the exposure stream across different dimensions. We included dimensions like experiment name, experiment version, variant, and segment, which represents the population group for the sample as defined when setting up the experiment. For this task, we used Apache Flink, which supports stream processing and provides event-time processing semantics. Supported internally at DoorDash, Flink is used by many teams to run their processing jobs on streaming data. We use Flink’s built-in time-window-based aggregation functions on exposure time. We then send this aggregated data to another Kafka topic. 
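The production job runs in Flink, but the core tumbling-window aggregation logic can be illustrated in plain Python (the event fields here are hypothetical):

```python
from collections import Counter, defaultdict

def aggregate_exposures(events, window_seconds=60):
    """Group exposure events into tumbling windows keyed by
    (experiment, version, variant, segment), mirroring the
    time-window aggregation the Flink job performs on event time."""
    windows = defaultdict(Counter)
    for e in events:
        window_start = e["ts"] - e["ts"] % window_seconds
        key = (e["experiment"], e["version"], e["variant"], e["segment"])
        windows[window_start][key] += 1
    return windows

events = [
    {"ts": 3, "experiment": "exp1", "version": 1, "variant": "control", "segment": "all"},
    {"ts": 45, "experiment": "exp1", "version": 1, "variant": "control", "segment": "all"},
    {"ts": 70, "experiment": "exp1", "version": 1, "variant": "treatment", "segment": "all"},
]
agg = aggregate_exposures(events)
# window 0 holds two control exposures; window 60 holds one treatment exposure
```

Each emitted window is a compact count record rather than raw events, which is what makes the downstream Kafka topic and Pinot ingestion tractable at tens of billions of exposures per day.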

Next, we had to save the data that is aggregated by the time-window into a datastore. We wanted an analytics OLAP datastore with low latency. For this we used Apache Pinot. Pinot is a real-time distributed OLAP datastore which can ingest data from streaming sources like Kafka, scale horizontally, and provide real-time analytics with low latency. We rely on Pinot’s built-in aggregation functions to produce the final results, which are fed into the user dashboards to provide various visualizations. Figure 8 below shows the high-level overview of our solution:

Figure 8: Here we summarize our real-time stream processing.

We added another layer of transparency by embedding these dashboards into our experiment configuration platform. With this tool, users can quickly troubleshoot a number of issues associated with SRM, including:

  • Did I launch one treatment variant sooner than another?
  • Are there more exposures in one group versus another?
  • Are there any anomalies present in the time series of exposure logs?

Below are sample charts from our dashboards.

Figure 9: This shows the timeline of exposure count by variant. A user can access this data within minutes after launching an experiment.

Figure 10: Visualizations of the distribution of exposures by each variant allow a user to check for any major irregularities. 

Real-time insights not only help in diagnosing issues with experiments but also generate greater confidence that a rollout is proceeding as expected.

Alerting

To further minimize the rate of SRM occurrences, users can subscribe to an experiment health check alert system that notifies them quickly — often within 24 hours — if an imbalance is detected within their experiment. This allows for timely, proactive adjustments that can virtually eliminate the need to discard otherwise valuable data down the line due to invalidated results.

Figure 11: When setting up an experiment, users can subscribe to our alert system. 

Education is key: The role of awareness in reducing SRM

In our quest to reduce the incidence of SRM on the platform, we’ve explored and implemented a variety of technical solutions — from real-time monitoring systems to new algorithms that identify imbalance sources. While these advancements have been crucial in minimizing SRM, we found that human intervention through awareness and education remains indispensable and moves the needle most. Recognizing this gap, we initiated a multi-pronged educational approach, including:

  • Training sessions: We organized an internal bootcamp focused on best practices for experiment configuration, highlighting how to prevent imbalance due to simple configuration problems.
  • Documentation: We provided comprehensive guides with case studies that a non-technical person can understand. We even renamed the terminology internally from “Sample Ratio Mismatch,” which is a technical mouthful, to “Imbalance Check.”
  • Stronger language: We changed documentation language and how we communicate SRM failures to be more in line with the size of impact that it has on decision making. Although there are rare cases in which SRM failures can be overlooked, the revised language emphasizes that imbalanced experiment results can’t be trusted. 
  • Proactive user engagement: The reactive nature of problem-solving poses a challenge to minimizing SRM. Users may only become aware of SRM after they encounter the problem, which often leads to delayed actions. Instead of waiting for users to join our training session or open the documentation and diagnostic tools, we engage them early through team-specific knowledge share sessions.

Sometimes the best solutions aren’t just about building a better mousetrap, but about ensuring that everyone knows how to use the new tools effectively. For us, education and awareness have made all the difference. Writing this blog post is itself an attempt to push for greater awareness and education.

Results

Within six months of starting our work, we saw a 70% drop in SRM incidents on the platform. This means that hundreds of experiments which might have been plagued by incorrect conclusions instead led to legitimate results. Beyond the numbers, there has been a palpable shift in team dynamics. With heightened awareness, A/B tests are set up and reviewed more thoroughly and executed more successfully. Teams no longer must expend valuable resources and experimental capacity on restarting tests or reconciling unexpected outcomes caused by imbalance failures.

Future Work

Although we have made great progress toward reducing the incidence of SRM at DoorDash, we believe even more improvements can be made through real-time observability, automatic correction, and standardization.

  • Real-time observability can be improved by integrating more tightly with the algorithms used in diagnostic checking. It is computationally inexpensive to run Wald tests and weighted regression on count data, so we would like to run them on the query outputs from Pinot whenever a user examines real-time exposures.
  • Automatic correction will allow us to fix common SRM problems and adjust experiment results without compelling the user to take any additional action. As shown previously, if we can identify the source of imbalance, we can often salvage the analysis result by adding additional covariates to our estimator.
  • Standardization offers a safeguard against common pitfalls, thereby reducing the likelihood of user errors. For example, if a user fixes a bug and relaunches an experiment, our system could proactively identify potential repercussions of the changes and adjust the strength of warnings or guidelines accordingly.

Through such measures, we can further elevate the robustness and credibility of experimental results.

Acknowledgements

Many thanks to Drew Trager, Sharon Weng, Hassan Djirdeh, Yixin Tang, Dave Press, Bhawana Goel, Caixia Huang, Kevin Tang, and Qiyun Pan, who have been instrumental in their feedback, execution, and collaboration across many of the initiatives outlined above. Finally, many thanks to the Eng Enablement team: Janet Rae-Dupree, Robby Kalland, and Marisa Kwan for their continuous support in reviewing and editing this article. 

If you’re passionate about building innovative products that make positive impacts in the lives of millions of merchants, Dashers, and customers, consider joining our team.

Stay Informed with Weekly Updates

Subscribe to our Engineering blog to get regular updates on all the coolest projects our team is working on

As part of our inaugural ERG Summit, leaders from our Employee Resource Groups and our Diversity, Equity, and Inclusion team joined more than 100 other visionaries from across the tech industry. 

The concept for the ERG Summit began as an internal event dedicated to bringing our employee leaders together in person. Those who serve in these voluntary roles help shape the type of workplace they’d like to experience, and the ERG Summit was designed to help deepen community and connect with executive stakeholders for a full day of learning and development. But when Kondja Kamatuka, our Program Lead on the DEI team, connected with others in the DEI community, the clear need to broaden the summit into an industry-wide event became evident. 

“I met with Chanise Simms-Robinson, who owns Out Of Office events, and they do a lot of impactful DEI work, specifically in the Hispanic and Black communities. The more I talked to people and the more Chanise talked to people in her network, it was so clear that this is a need industry-wide within this space,” Kondja said.

“Being able to live our values like having a bias for action in launching this event and make room at the table for others in the industry, I feel like we’re moving the needle forward. That’s the biggest driver and the thing I feel the most proud of. The agenda is great, and I’m so proud to innovate in this space in a way that feels really special.”

Kondja helped build the framework for DoorDash’s first-of-its-kind initiative, bringing together inclusive speaker panels, hands-on engagement opportunities, learning & development, and keynote speakers. Our employee leaders took part in engaging panel discussions on topics ranging from the state of DEI and how to advocate for your community to building your personal brand. Beyond bringing like-minded professionals together, the Summit offered an opportunity to share unique learnings, perspectives, and ways to unlock further ERG engagement and personal growth.

The Summit ended with an in-depth conversation with Tina Knowles that touched on the importance of uplifting youth and pushing them beyond the boundaries of what they think is possible.

This is just one of the many ways we’re constantly working to empower and support our people.


Check out how our ERGs are continuing to drive the culture at DoorDash:

DoorDash invests in Employee Resource Groups by compensating leaders

DoorDash launches new Employee Resource Group aimed at celebrating physical and neurological diversity

DoorDash kicks off Veterans Appreciation Month by rolling out new Military Leave Policy

During National Disability Employment Awareness Month (NDEAM) this October, we’re taking time to recognize and honor the invaluable contributions of individuals with disabilities to our workforce and, most importantly, to our society as a whole alongside our Able@ Employee Resource Group.

This year’s theme of ‘Advancing Access and Equity’ underscores the essential role that people with disabilities play in shaping a diverse and inclusive workforce, one that mirrors the rich tapestry of American society. Just as we embrace diversity in all forms, NDEAM calls on us to eliminate any remaining barriers to employment and champion equal opportunities for all.

As we bring the proper attention to NDEAM, we’re kicking off the month of inclusive programming with a Q&A session with Samantha Hiew, the founder of ADHD Girls on neurodiversity, its impact in the workplace, and ways to support a neurodiverse workforce. 

From there, we’ll host a virtual painting party, hear from Jose Rosario, who will share his personal experiences navigating systemic inequality and stigma to contextualize the connection between identity and mental health, and we’ll have a session that aims to demystify benefits and accommodations for those with disabilities in the workforce. 

Our efforts don’t stop after October, as we’re always encouraging employees to engage in training around supporting workers with disabilities, psychological safety in the workplace, hiring and supporting neurodiversity in the workplace, and more. 

Creating a workplace where everyone feels welcomed, supported, and valued is one of the reasons we were recently recognized as one of America’s Greatest Workplaces 2023 by Newsweek. To learn more about this award and how we’ve driven the culture our employees experience over the last year, click here.

As a tech company, it’s our products and platform – and the engineers who build them – that power what DoorDash is able to offer our Consumers, Dashers, and Merchants every day. We thrive on tackling challenging technical problems and creating opportunities for customers, and take a lot of pride in what we do (and, as regular readers know, periodically share details about our work on this blog).

For our engineering teams to build both great software and meaningful careers, we as an org must periodically ask for feedback, revisit our vision for what makes us successful together and individually, and evolve our culture towards it. As our VP Ryan Sokol regularly says, we want engineers to build with us for the long term, and part of making that possible is to clarify what success looks like at each stage of an engineering career.

It’s for this reason that in Q2 2023, we revisited our performance expectations for all engineers at DoorDash. We started by gathering a group of engineers to discuss which existing expectations were still relevant, and which ones were no longer serving us. We defined what we see as the traits of our most successful engineers at each level based on our three pillars: Business Outcome (how engineers deliver impact based on our direction and goals), People (how well we collaborate as a team and invest in each other’s development and success), and Engineering Excellence (the quality of our products and systems, how fast we can move, and how efficiently our systems use resources). Starting at the Staff Engineer level, engineers have the opportunity to transfer to the engineering management ladder, where Staff Engineer ⇔ Manager, Senior Staff Engineer ⇔ Senior Manager, etc. (see this blog post for how we think about great engineering management).

For each of these pillars, we provide a tangible set of expectations and potential growth ladder for our engineers. During the performance evaluation process, eng managers review the guidelines with their reports and provide feedback based on team goals over the past year (for example, one engineer’s role may have mainly been devoted to improving Eng Excellence, while another may be mostly focused on Business Outcome). Our goal is to balance the importance of business needs, people strengths and engineering excellence in context, and ensure that our expectations are seen as guidelines (rather than a list of requirements) to grow as an engineer.

Now that we have a revised version, we’ll be building out additional tailored resources and support for engineering development with the aspiration that every engineer at DoorDash can have a rewarding, meaningful career. We also decided to share the expectations publicly, so that any engineer can have a deeper understanding of our culture and career development path should they choose to build with us at DoorDash. 


  • Business Outcome comprises two subcategories: Delivery and Direction. Our expectation is that engineers will create impact by executing projects effectively, and providing technical or product direction for themselves and others to do so. While Directional impact is not as crucial for engineers at the E3 and E4 levels, we do anticipate that engineers at the E5+ levels will contribute to the Direction of their scope as part of their overall responsibility for the Delivery of an area’s success.

  • People includes three subcategories: Collaboration & Alignment, Peer Growth & Community, and Interviewing & Recruiting. Collaboration & Alignment assesses how effectively an engineer works within their team, as well as across teams and functions, to enable and empower others and promote a one team, one fight culture. Peer Growth & Community defines the expectations for engineers in terms of supporting their peers’ growth and the Engineering community as a whole. Interviewing & Recruiting outlines expectations for engineers to assist in recruiting talented individuals for the company (based on hiring velocity).

  • Engineering Excellence encompasses three subcategories: Quality, Velocity and Efficiency. Quality refers to the craft of engineering, delivering software that is reliable, efficient, accurate, scalable, extensible and secure. Velocity refers to how an engineer can facilitate speedy development for their colleagues by creating tools, designing an extensible architecture or providing reusable infrastructure, among other things. Efficiency refers to building software that meets intended functions with appropriate resources through efficient code, optimization and reusing existing platforms. By focusing on Quality, Velocity and Efficiency, engineering teams can deliver products that meet the needs and expectations of their customers and ensure long-term success and continued innovation.

Software Engineer (E3)

Tl;dr: You are a Productive Coder, translating clear requirements into high-quality code that is rolled out safely. You follow engineering best practices while making contributions to engineering excellence through on-call and incremental fixes. You know how to unblock yourself and ask for help when needed.

Business Outcome

Delivery

Understands how assigned tasks are related to the overall feature work.

Translates functional and non-functional requirements into code, with some guidance from peers.

Ensures on-time completion of work by asking peers for help if they have researched their problem and are still blocked. 

Uses developer tools to successfully roll out changes to production.

Keeps track of progress of assigned tasks and communicates status effectively with stakeholders.

Direction

Unblocks self in completing tasks via available internal and external documentation.

People

Collaboration & Alignment

Works effectively with immediate team members to solve problems and deliver features. Participates in planning and regular team updates.


Peer Growth & Community

Engages in team building efforts.


Interviewing & Recruiting

If nominated, obtains and retains interview certification status, which includes conducting interviews for new grad and intern candidates, providing quality and timely feedback, and, as needed, participating in hiring debriefs.

Eng Excellence

Quality

Aware of business and product areas affected by quality objectives.

Monitors quality metrics, including success and check metrics for all applicable production changes. 

Tests and dog-foods the products they work on.

Promptly addresses customer and partner feedback.

Debugs and fixes issues in development, test and production.

Solicits code review in accordance with the DoorDash Code Review Process.

Provides timely code reviews for peers.

Serves as first responder with minimal oversight.

Follows OnCall playbooks to mitigate issues in a timely manner.

Velocity

Effectively uses tools to minimize manual tasks.

Constantly learns about available tools and infra support to improve productivity.

Proactively improves existing documentation as needed.

Efficiency

Follows cost efficiency best practices that will improve CPU/memory utilization.

Software Engineer (E4)

Tl;dr:  You are a Problem Solver who can design and implement solutions to defined problems independently with high quality. You collaborate with partners and peers to clarify problem definitions and make adjustments to them when necessary. Your expertise in areas that your team supports enables you to effectively help partners and peers make progress toward their goals. You help the team uphold the engineering excellence bar through insightful code reviews and guide your peers through execution.

Business Outcome

Delivery

Articulates key customer problems and technical decisions related to assigned projects.

Breaks down assigned feature work into tasks, and provides accurate cost estimates.

Writes Technical Design Docs to document feature-level technical decisions.

Independently translates functional and non-functional requirements into code.

Proposes changes to Product Requirement Docs and/or Technical Design Docs to improve outcome of deliverables and/or unblock progress.

Keeps track of progress of own work and dependencies, adjusting estimates and priorities of tasks when needed.

Direction

Uses data analysis and customer/partner feedback to identify opportunities for feature-level improvements.

Contributes to team planning and retrospectives.

People

Collaboration & Alignment

Unblocks other team members by proactively answering questions, giving tech talks about areas within scope, etc.

Works effectively with partners from various functions or teams to unblock self on development.

Effectively communicates project updates and blocking issues to stakeholders.

Independently builds effective relationships with key partners. 

Peer Growth & Community

Engages in team building efforts.

Acts as a role model for other engineers, demonstrating and championing behaviors that exemplify company values through day-to-day work and interactions.

Shares constructive feedback and guidance frequently and empathetically.

Interviewing & Recruiting

If nominated, obtains and retains interview certification status, which includes conducting interviews for candidates, providing quality and timely feedback and actively participating in hiring debriefs.

Eng Excellence 

Quality

Upholds and improves quality objectives and processes within the team, such as by articulating the business and product impact of quality metrics in trade-off discussions, providing code-review feedback on conformance to DoorDash and industry standards, and improving metric quality (e.g. fixing logging and pipeline issues).

Consistently applies architecture principles and design patterns to the software they build and maintain.

Writes high-quality code that is concise, efficient, durable and easily understood, and that adheres to DoorDash standards.

Serves as first-responder OnCall independently.

Proactively prevents and mitigates issues to ensure quality compliance, such as by updating OnCall playbooks as needed.

Velocity

Contributes to initiatives such as tooling and automation that improve the development experience of team members.

Contributes to documentation that enables other engineers to move faster.

Contributes to removing tech debt.

Efficiency

Ensures cost efficiency best practices in design and code reviews.

Plays an active part in system efficiency development practices, such as infra provisioning. 

Senior Software Engineer (E5)

Tl;dr: You are a Project Owner who can understand customer problems and translate objectives into well-defined and scoped sub-problems for yourself and peers. You are capable of making long-term vs. short-term trade-offs and designing high-quality, durable solutions. You set a high bar for craftsmanship, reliability, and scalability, and lead by example in execution. You have effective project management skills and can manage moderate levels of ambiguity. You take ownership of areas within your team’s charter and make improvements that reduce toil for your customers and peers. You identify gaps and provide recommendations to your cross-functional partners with regularity. You are a role model within your own team, showing up as an empathetic and influential teacher and leader. You collaborate with your partner teams with empathy and can exert influence on their execution to provide the best solutions for DoorDash.

Business Outcome

Delivery

Delivers one or more OKR projects, either independently or by leading a group of engineers.

Makes and owns most technical decisions within scope without oversight.

Actively participates in defining key customer problems.

Breaks down projects into feature level work that can be executed in parallel, identifies dependencies and provides accurate cost estimates.

Writes Technical Design Docs to document project-level technical decisions and how sub-features interact with each other.

Produces exemplary code for core functionalities of key technical components.

Sets up and maintains processes (e.g., scrums, status reports) to ensure effective coordination among Eng peers and/or cross-functional partners.

Keeps track of project-level dependencies and risks, identifies opportunities for load balancing and assignment changes within the project and escalates to stakeholders whenever necessary to drive for on-time delivery.

Direction

Uses data analysis and customer/partner feedback to identify opportunities for product/technical areas within team charter.

Actively incorporates feedback from customers and partners into team planning.

Creates project proposals and researches feasibility and risks.

Directly influences team roadmaps by adding, triaging and prioritizing project backlogs.

People

Collaboration & Alignment

Ensures quality of project intake by asking questions and sourcing information to ensure best possible outcomes.

Identifies key stakeholders and ensures they are included in the decision making process at the appropriate time.

Unblocks other teams by conducting code and design reviews, proactively answering questions, giving tech talks, connecting peers with potential solutions, etc.

Peer Growth & Community

Engages in team building efforts.

Acts as a role model for other engineers, demonstrating and championing behaviors that exemplify company values through day-to-day work and interactions.

Shares constructive feedback and guidance to teammates with empathy.

Consistently highlights and recognizes working partner contributions in project and/or Eng win reports. 

Interviewing & Recruiting

If nominated, obtains and retains interview certification status and conducts interviews and/or sell calls for candidates, actively participating in hiring debriefs.

Based on hiring needs, participates in hiring events, such as publishing an Eng blog post, delivering tech talks or participating in panels at recruiting events.

Eng Excellence 

Quality

Enforces team-wide quality objectives, such as by calibrating metrics to ensure accurate representation of software quality, driving accountability and leading by example with incident mitigation, improving logging and monitoring, and leading design reviews.

Proactively gathers customer feedback and monitors bug trends, such as from support channels and internal groups, promptly addressing issues with engineers and cross-functional partners.

Influences system and team planning and prioritization by articulating the impact of quality initiatives to business outcome.

Serves effectively as first-responder and/or incident commander; leads by example via building and modeling best practices.

Velocity

Drives initiatives such as tooling, automation and documentation that improve the development experience and productivity of team members.

Designs solutions that are reusable, extensible, and self-serviceable.

Reduces tech debt where possible. 

Efficiency

Considers cost efficiency as part of key technical designs. Strikes a balance between business objectives, reliability and cost.

Identifies and drives architecture improvements that will significantly improve system efficiency within the team.


Staff Engineer (E6)

Tl;dr: You are a Technical Leader and trusted advisor to your manager and peers within and outside of your team. You are a key influencer in the roadmaps of your team and area of expertise. You are able to manage high levels of ambiguity, identify opportunities and translate them into objectives for your team. You are responsible for the formation and delivery of your team’s OKRs. You drive impact through outsized direct contributions, being a force multiplier, eliminating single points of failure (including yourself) and/or uplifting your team members’ skill set. You collaborate with partner teams, influencing their roadmaps and delivering on joint goals. You set the engineering excellence bar for your team and consistently identify areas for improvement. 

Business Outcome

Delivery

Leads entire team on solving mission-critical customer and business problems, filling in any Eng and/or cross-functional gaps when needed.

Works independently to tackle critical problems for the team and creates technical solutions that can be easily leveraged by other engineers.

Makes and owns decisions independently, even in ambiguous situations.

Sets up and maintains processes (e.g., project reviews, scrums of scrums) to ensure effective coordination among ongoing projects.

Proactively builds alignment with key stakeholders, anticipating and mitigating risks across orgs to drive for on-time delivery.

Direction

Uses data analysis and customer/partner feedback to shape (and reshape, as needed) product direction and technical thinking.

Creates vision for the team that defines key customer problems and outlines major long-term technical and product investments.

Creates technical roadmaps for the team that have clear milestones.

Proactively aligns team vision/roadmaps with partner teams, building joint goals and escalating when necessary.

People

Collaboration & Alignment

Builds strong communication channels to ensure delivery of quality feedback and diversity of opinions.

Consistently demonstrates listening, curiosity and empathy when working with team members and stakeholders to build alignment and roadmaps.

Balances engineering principles with business objectives for the best long-term outcome when addressing conflicts arising within teams or with cross-functional partners.

Consistently seeks and flags redundancies and duplication of effort across teams.

Serves as a virtual team leader for cross-org projects, promoting an engineering culture of One team One fight while encouraging deep thinking, innovation and excellence.

Peer Growth & Community

Supports their manager in proactively identifying and closing gaps in experience and belonging of team members.

Participates in activities that contribute to the larger engineering community (e.g., training facilitation and design, tech talk/summit organization, site leadership, etc.).

Acts as a role model for other engineers, demonstrating and championing behaviors that exemplify company values through day-to-day work and interactions.

Proactively invests in the growth of their teammates by sharing constructive feedback and guidance frequently and with empathy.

Interviewing & Recruiting

If nominated, obtains and retains interview certification status and conducts interviews and/or sell calls for candidates, actively participating in hiring debriefs.

Based on hiring needs, participates in recruiting efforts, such as publishing an Eng blog post, delivering external tech talks, or participating in panels at recruiting events.

Acts as champion for Eng candidate experience by leading interview training, improving tooling and/or actively overseeing calibration of interviewer skills and question bank quality.

Eng Excellence 

Quality

Improves quality objectives within the team and for partners, such as by ensuring cross-team alignment, defining and operationalizing metrics that effectively represent the quality of work by the team, and driving adoption of best practices that prevent issues from reaching production.

Defines the requirements for a high bar of craftsmanship, reliability and scalability, and upholds objective standards for evaluation and tracking.

Ensures longevity of deliverables and removes single-points-of-failure (including themselves) by defining onboarding processes, runbooks, migration paths, etc.

Sets up processes for continuous improvements to team Oncall, holds others accountable, and drives sustained success in Oncall compliance. 

Velocity

Drives initiatives such as tooling, automation and documentation that improve the development experience and productivity of team members and partners.

Identifies and drives architecture improvements that accelerate developer efficiency for the team and partners.

Ensures sustained investment in velocity for their team (e.g. team backlog triages, velocity resource allocation in team planning, etc.).

Threads the needle to push the business forward without over-building or under-building.

Efficiency

Identifies and drives architecture improvements that will increase efficiency for their team and partners.

Operationalizes cost metrics for the team and proactively monitors trends in relation to target.

Incorporates cost projections into team planning.

Senior Staff Engineer (E7)

Tl;dr: You are a Technical Leader and trusted advisor to your org lead. Your scope is dramatically increased from E6. The complexity of your problems span multiple teams, potentially in different organizations. Influence becomes an essential skill given the number of systems and people you are tasked with aligning. Your influence is scaled through documents more than code, but you still write software, prototype larger solutions and debug hard technical problems to maintain credibility and a level of detail. Your influence is beyond technical solutions – you’re able to shift an org’s product or business direction by providing long-term technical vision and strategy. You make the teams around you better, but not dependent upon you, through teaching, guidance and followership.

Business Outcome

Delivery

Leads entire org on solving complex, mission-critical customer and business problems, filling in any Eng and/or cross-functional gaps when needed.

Works independently to tackle critical problems for the org and makes foundational technical contributions that can be easily leveraged by other engineers.

Proactively builds alignment with key stakeholders, anticipating and mitigating risks across orgs to drive for on-time delivery.

Direction

Uses data analysis and customer/partner feedback to influence business and product strategy for the org.

Creates vision for the org that defines key customer problems and outlines major long-term technical and product investments and milestones.

Proactively aligns org vision with partner orgs by building shared objectives and principles.

Advises org leads on org structure in support of business and people growth and sustainability.

People

Collaboration & Alignment

Effectively collaborates with leaders from other functions or orgs.

Consistently demonstrates listening, curiosity and empathy when working with team members and stakeholders to build alignment and roadmaps.

Clarifies responsibilities across orgs to resolve conflicts and build alignment on technical vision, direction, priorities, timeline and execution plan in a way that balances the needs of different teams for the best long term outcome. Ensures we don’t simply build based on org structure.

Serves as a virtual team leader for cross-org projects, promoting an engineering culture of One team One fight while encouraging deep thinking, innovation and excellence. Drives the adoption of platform, infra and framework across the company.

Proactively builds effective communication channels across teams to ensure delivery of quality feedback and diversity of opinions.

Peer Growth & Community

With their manager, drives people programs and initiatives for their own organization to improve experience and belonging.

Drives and participates in activities that contribute to the larger engineering community (e.g., training facilitation and design, tech talk/summit organization, site leadership, etc.).

Acts as a role model, teacher, guide and influencer for other engineers, demonstrating and championing behaviors that exemplify company values through day-to-day work and interactions.

Proactively invests in the growth of other engineers by sharing constructive feedback and guidance frequently and with empathy.

Interviewing & Recruiting

If nominated, obtains and retains interview certification status and conducts interviews and/or sell calls for candidates, actively participating in hiring debriefs. Proactively reviews interview feedback and helps calibrate and improve interviewer skills across the company.

Based on hiring needs, participates in hiring events, such as publishing an Eng blog post, delivering tech talks or participating in panels at recruiting events. 

Acts as champion for Eng candidate experience by leading interview training and/or actively overseeing calibration of interview questions.

Eng Excellence 

Quality

Drives alignment of quality objectives for all teams within the org and with partners.

Defines a high quality bar and ensures consistent application within org and with partner teams.

Identifies and drives org-wide improvements of common deficiencies in systems, tools, processes and product areas.
Velocity

Identifies and drives architecture improvements that accelerate developer efficiency for their org and partners.

Identifies and drives solutions to improve large organization processes in support of velocity (e.g. design review process, knowledge sharing sessions, etc.).

Proactively reviews investment areas for the entire org to avoid over-building.

Creates operating principles in support of independent and effective decision-making across the org.

Efficiency

Identifies and drives architecture improvements that will increase efficiency for their org and partners.

Influences product direction to achieve better ROI.

Identifies and drives cost efficiency best practices for the org, such as evaluating feasibility of adopting external technical solutions, and advises org leads on buy vs. build decisions on long-term investments.

Influences org-level cost targets for the next 6-12 months and creates technical roadmaps to achieve them.

Principal Engineer (E8)

Tl;dr: You are a Creator and Driver of best practices Eng-wide. You are a high-level thinker who defines and drives OKRs for the company, and you are often directly responsible for their success. You achieve that success by 1) defining and implementing a technical vision across multiple domains, 2) influencing the business and product strategies at the highest level, 3) driving change through end-to-end execution, some of which will be your direct contribution, and 4) elevating engineering culture and excellence org-wide. You utilize your vast experience, judgment and communication skills to summarize hard problems, influencing leadership and all of engineering to write better software in service of our customers.

Business Outcome

Delivery

Leads company-wide efforts that require complex coordination across multiple product surfaces or platform/infra systems in different orgs, filling in any cross-functional gaps when needed.

Works independently to tackle critical problems across orgs and makes foundational technical contributions that can be easily leveraged by other engineers.

Serves as the tiebreaker for technical decisions when alignment cannot be reached.

Direction

Uses data analysis and customer/partner feedback to influence business and product strategy for the company.

Creates long term vision for company-level Eng excellence (e.g., reliability, velocity, efficiency) and outlines major long-term technical and product investments.

Proactively aligns company-level Eng roadmaps with key stakeholders across all orgs and ensures cascading OKRs at the org level.

Advises Eng staff on org structure for best results.

Creates operating principles and best practices for the entire company and proactively builds alignment across orgs.

People

Collaboration & Alignment

Collaborates efficiently with leaders from other functions or orgs.

Clarifies responsibilities across the company to resolve conflicts and build alignment with company leaders on technical vision, direction, priorities, timeline and execution plan.

Serves as a virtual team leader for cross-company projects, promoting a culture of One team One fight. Drives the adoption of platforms, infra and frameworks across the company.

Peer Growth & Community

Main contributor in defining, establishing and supporting the culture of DoorDash engineering via engineering-wide or company-wide people programs/OKRs.

Sponsors external-facing efforts to advocate DoorDash culture more broadly.

Drives and participates in activities that contribute to the larger engineering community (e.g., training facilitation and design, tech talk/summit organization, site leadership, etc.).

Acts as a role model, teacher, guide and influencer for other engineers, demonstrating and championing behaviors that exemplify company values through day-to-day work and interactions.

Proactively invests in the growth of other engineers by sharing constructive feedback and guidance frequently and with empathy.

Interviewing & Recruiting

If nominated, obtains and retains interview certification status and conducts interviews and/or sell calls for candidates, actively participating in hiring debriefs. Proactively reviews interview feedback and helps calibrate and improve interviewer skills across the company.

Based on hiring needs, participates in hiring events, such as publishing an Eng blog post, delivering tech talks or participating in panels at recruiting events. 

Acts as champion for Eng candidate experience by leading interview training and/or actively overseeing calibration of interview questions.

Eng Excellence 

Quality

Evolves company-wide quality objectives to meet company goals.

Evolves company-wide quality bar to be top tier in the industry.

Identifies and drives company-wide improvements of common deficiencies in systems, tools, processes and product areas.

Identifies and drives solutions to shape and sustain a company culture that values quality.

Velocity

Creates operating principles in support of independent and effective decision-making across the company.

Identifies and drives solutions to shape and sustain a company culture that values velocity.

Efficiency

Identifies and drives cost efficiency best practices company-wide.

Influences company-wide cost targets for the next 12+ months and creates technical roadmaps to achieve them.

Identifies and drives solutions to shape and sustain a company culture that values efficiency.

As an Engineering org, we are tremendously proud of our accomplishments throughout the history of DoorDash, particularly in recent years as we’ve grown and scaled in service of our customers. This success has been heavily influenced by the strength and leadership of our Eng management team; great companies are built by great people who have great managers. We believe that defining what great Engineering management looks like for DoorDash is critical for our success as a business and, ultimately, for the success of all the Consumers, Dashers and Merchants we serve.

Our agility and bias for action are some of the defining characteristics of our Engineering and company culture, yet even as we grow, we must reflect from time to time and define what has made us successful to date and what will make us successful going forward. It’s for this reason that in Q2 2023, we revisited our performance expectations for all engineers at DoorDash, including our Engineering managers (see this blog post for how we think about fostering meaningful engineering career development).

First, we spent time talking to current Engineering managers and identifying which existing expectations were still relevant, and which ones were no longer serving us. We articulated what we see as the traits of our most successful managers at each level of our management structure based on the culture we are cultivating as an org. We also wanted to provide guidance on how we think about our three management pillars of Business Outcome (how managers set direction and drive impact based on our strategic goals), Team (how managers support individuals, build team culture and partner with other teams) and Engineering Excellence (the quality of our products and systems, how fast we can move, and how efficiently our systems use resources). Now that we have a revised version, we’re leveraging it to provide better support to our Eng managers and demonstrate our company value of getting 1% better every day.

Once we published them internally, we could have stopped there. Yet we know that many other companies, and Eng orgs in particular, are in the same boat with us, navigating a complex working environment as we grow and scale as a business. We also believe in transparency, and want to give Eng manager candidates a clear idea of what kind of management we value in DoorDash Eng. We’ve therefore decided to share them externally, in the hopes that other people and companies can benefit from our perspective on great Engineering management, and leverage our work to shape expectations that feel supportive, fair and transparent in our constant pursuit of a strong culture within Engineering. 


  • Business Outcome comprises two subcategories: Direction and Delivery. We expect our Engineering Managers to drive business impact by providing technical or product direction to enable engineers to execute projects effectively.

  • Team includes three subcategories: Team Building, Career Growth & Coaching and Collaboration & Alignment. Team Building measures how effectively we are building and shaping our teams. Career Growth & Coaching focuses on developing employees and providing consistent ongoing feedback that allows them to grow. Collaboration & Alignment measures how Managers work across different teams and cross-functional partners to drive impact and exemplify our One Team, One Fight mentality. 

  • Engineering Excellence encompasses three subcategories: Quality, Velocity, and Efficiency. Quality refers to the craft of engineering, delivering software that is reliable, efficient, accurate, scalable, extensible and secure. Velocity is the speed and agility of engineers in developing, testing and deploying code to reduce time to market while ensuring confidence. Efficiency refers to building software that meets intended functions with appropriate resources through efficient code, optimization and reusing existing platforms. Engineering Managers must understand the value of engineering excellence, be adept at making effective tradeoffs and establish a culture of excellence that can scale across the organization. By focusing on Quality, Velocity and Efficiency, engineering teams can deliver products that meet the needs and expectations of their customers and ensure long-term success and continued innovation.

Manager (M1)

Scope: Directly supports an engineering team responsible for the charter and success of a product or platform primitive.

Business Outcome

Delivery

Consistently delivers against team’s OKRs.

Maintains up-to-date knowledge of the team’s execution status. Maintains sufficient technical know-how and product sense to identify risks and remove roadblocks, and keeps stakeholders informed.

Minimizes randomization and keeps the team focused with proper processes or tools.

Direction

Defines a clear team vision with cross-functional partners on a 6-12 month time horizon.

Understands strategies and OKRs of the business unit or platform charter that the team belongs to, and communicates strategic context with their team.

Drives their own team’s strategies and OKRs.

Drives project prioritization and adjusts priorities based on projected impact, new learnings or business needs.

Enables the team to identify opportunities for impact through roadmap planning.

Team

Team Building

Accountable for the outcome of team health measurements, e.g., EngageMe, upward feedback, hiring goals, people OKRs.

Builds toward a team structure based on projects or stack, and minimizes single points of failure on the team.

Able to hire engineers when needed, considering diversity and a balanced composition of skills and seniority within the team. 

Sets up a safe and inclusive environment for team members to voice their concerns and feedback. 

Participates in activities that contribute to the larger engineering community, e.g., interviews, learning and sharing.

Career Growth & Coaching

Sets clear expectations, gives timely feedback and fosters career discussions with growth plans for all team members.

Supports internal mobility for eligible team members.

Adheres to performance management standards.

Collaboration & Alignment

Builds alignment with cross-functional partners or partner teams on team strategies, OKRs and roadmaps. Supports the success of partner teams.

Works with the team to identify dependencies and stakeholders, builds alignment with them on ownership and priorities, and communicates unresolved misalignment and ambiguities.

Engineering Excellence

Quality

Defines, measures and drives quality OKRs, e.g., SLOs, and adopts best practices.

Builds alignment with cross-functional partners or partner teams to balance quality and speed.

Maintains a deep understanding of how to build quality software. Ensures the team adheres to quality standards on architecture, coding, testing, deployment, reliability, security and compliance.
Velocity

Identifies areas that slow down velocity, e.g., tech debt, tools and documentation, process friction.

Proactively optimizes team velocity with 3-6 month solutions.

Makes tradeoffs and drives decisions between adopting existing tech vs. building new.

Drives velocity outcomes with measurable results.
Efficiency

Tracks the team’s hardware and service footprint. Drives best practices and identifies opportunities to reduce costs with quality and velocity tradeoff considerations.

Collaborates with partner teams on adapting technologies and processes to improve efficiency.

Senior Manager (M2)

Scope: Supports multiple products or platform primitives with a coherent charter. The charter typically consists of 3-5 teams supported through M1s and senior individual contributors, with exceptions in specialized areas.

Business Outcome

Delivery

Consistently delivers against the OKRs of all sub-teams.

Maintains checkpoints on teams’ execution status, and proactively identifies and closes execution gaps.

Enables capacity reallocation across sub-teams for greater impact.
Direction

Defines a clear and cohesive team vision with cross-functional partners on a 6-12 month time horizon.

Contributes to strategies and OKRs of the business unit or platform charter that the team belongs to, and brings their teams onboard.

Drives their own team’s strategy and defines OKRs that ladder up to those of the business unit or platform charter.

Drives prioritization and deprioritization across sub-teams based on short-term and long-term impact, new learnings or strategy changes of the larger organization.

Ensures effective roadmap planning on sub-teams.

Team

Team Building

Accountable for the outcome of team health measurements.

Creates a team structure with clear ownership. Develops autonomous teams through effective delegation to M1s and senior individual contributors, and minimizes single points of failure for critical roles on the team. 

Able to hire E6+ and M1s when needed, and builds toward a balanced composition of skills, seniority and diversity. 

Proactively identifies and closes gaps in experience and belonging of team members. 

Participates in activities that contribute to the larger engineering community.
Career Growth & Coaching

Assesses manager effectiveness in their career growth/coaching responsibilities. Helps identify cross-sub-team projects to facilitate senior individual contributor growth.

Advocates for internal mobility outside their own team.

Maintains a consistent bar on level expectations and performance assessment across sub-teams. Identifies and closes gaps in performance management.
Collaboration & Alignment

Independently drives cross-functional, cross-team alignment on team strategies, OKRs and roadmaps. Contributes to the success of partner teams.

Proactively manages dependencies and stakeholders, actively resolving ambiguities and misalignment on ownership and priorities.

Engineering Excellence

Quality

Identifies gaps and strategizes quality OKRs across teams. Advocates for these OKRs with cross-functional partners, balancing quality with business needs.

Drives collaboration and alignment across functions and teams to deliver against quality OKRs. Defines principles to guide the team to make tradeoffs between quality and speed.

Maintains checkpoints on software quality, minimizes regressions, and ensures the longevity of software deliveries.
Velocity

Maintains a holistic view of their team’s velocity status. Proactively prioritizes work and adjusts processes to improve team velocity.

Anticipates business needs and technical tradeoffs on a 12 month time horizon, and defines principles to guide the team to make tradeoffs between short- and long-term velocity gains.

Drives investments in software building blocks or platforms with velocity outcomes, and connects them with customer impact.
Efficiency

Understands the team’s cost factors and trajectory. Defines efficiency metrics and drives ROI for technology decisions.

Drives and promotes abstractions and platforms agnostic to specific technologies.

Strategically invests in technologies, e.g., adoption of platforms, retirement of systems, to simplify architecture and improve overall efficiency.

Director (D1)

Scope: Supports the engineering organization of a business unit or a platform charter that directly impacts the company’s success. The charter typically consists of 5-8 teams supported through M2s, M1s and senior individual contributors.

Business Outcome

Delivery

Consistently delivers against the business unit strategies and OKRs, or platform OKRs with clear business outcomes.

Creates processes to assess teams’ execution status and promote visibility. Sets up decision making frameworks to resolve ambiguities.

Enables capacity reallocation across organizations for greater impact.
Direction

Defines a clear and cohesive vision for a business unit or platform charter on a 1-2 year time horizon.

Understands company strategies and OKRs and brings their organization onboard through proactive communication.

Drives strategy and OKRs of their own business unit or platform charter and aligns those of sub-teams accordingly.

Drives prioritization of their own team, and contributes to new investment or divestment decisions based on short-term and long-term impact, new learnings and strategy changes of the company.

Drives roadmap planning process of their own organization.

Team

Team Building

Accountable for the outcome of team health measurements of their organization.

Creates an organizational structure with clear accountability on team OKRs. Develops autonomous teams through effective delegation to M1s/M2s and senior individual contributors, and creates a succession plan for critical roles in the organization.

Able to hire senior leaders at E6+/M1/M2 level when needed. Accountable for a balanced composition of skills, seniority and diversity.

Drives people programs for their own organization to improve experience and belonging.

Contributes to engineering-wide or company-wide people programs/OKRs.
Career Growth & Coaching

Proactively improves manager effectiveness in their organization. Accountable for growth plans for the whole organization.

Advocates for internal mobility and growth opportunities across the company. Mentors managers across organizations. 

Drives performance calibration of their own organization that adheres to company standards. Identifies and closes gaps in performance management for leadership roles.
Collaboration & Alignment

Drives cross-functional, cross-org alignment on strategies and OKRs of the business unit or platform charter to serve all end customers. Contributes to the success of partner organizations.

Creates processes to help the team identify and manage dependencies and stakeholders. Provides systematic decision making frameworks or working models, e.g., shared goals and guardrails.

Engineering Excellence

Quality

Creates processes to assess and improve teams’ quality OKRs, drives accountability across the organization and measures business outcomes.

Empowers senior individual contributors to define a technical vision, on a 1-2 year time horizon, to evolve product and engineering quality.

Accountable for the adoption of company standards and introduces technologies and methodologies to improve software quality.
Velocity

Creates processes to assess and improve teams’ velocity.

Creates and maintains strategic tech investments on velocity on a 1-2 year time horizon.

Owns velocity Eng OKRs and drives accountability across the organization.

Accountable for customer impact from velocity gains.
Efficiency

Owns org-wide efficiency goals and drives accountability across the organization. Achieves business outcomes with cost efficiency.

Develops a strategic 1-2 year roadmap on technology investment and direction with efficiency considerations.

Maintains awareness of state-of-the-art technologies and makes principled decisions on build vs. buy.

Senior Director (D2)

Scope: Supports multiple organizations that directly impact the company’s success with a coherent charter. The charter typically consists of 3-5 organizations supported through D1s, M2s and senior individual contributors.

Business Outcome

Delivery

Consistently delivers against multiple business units’ strategies and OKRs or multiple platform OKRs with clear business outcomes.

Maintains awareness of teams’ execution status. Proactively identifies and closes gaps that impact the assessment and visibility of execution and decision making.

Enables capacity reallocation across business units or platform charters for greater impact.
Direction

Defines a clear and cohesive vision on a multi-year time horizon, for multiple business units or platform charters, for the greater success of the company.

Contributes to company strategies and OKRs, and brings their organization onboard by fostering context sharing across business units or platform charters.

Drives their own strategy and OKRs that ladder up to those of the company.

Drives new investment or divestment decisions based on new learnings, market dynamics or technology trends.

Drives roadmap planning process across organizations.

Team

Team Building

Accountable for the outcome of team health measurements of their organization.

Creates an organizational structure with clear accountability for business outcomes. Develops autonomous organizations through effective delegation to M2s/D1s and senior individual contributors, and creates a succession plan for critical roles in the organization. 

Able to hire senior leaders at E7+/M2/D1 level when needed. Accountable for a balanced composition of skills, seniority and diversity. 

Identifies systemic patterns in experience and belonging across organizations; amplifies strengths and addresses weaknesses, sharing learning across organizations.

Drives engineering-wide or company-wide people programs/OKRs. 
Career Growth & Coaching

Maintains awareness of and creates systematic programs for manager effectiveness in their organization. Creates processes for talent identification and growth planning.

Identifies internal mobility and growth opportunities across the company, contributes to engineering-wide people growth programs or frameworks. 

Drives performance calibration principles and processes in their own organizations.
Collaboration & Alignment

Drives alignment across business units or platform charters to contribute to the success of all end customers.

Extracts patterns from ambiguities and misalignments and drives changes to decision making frameworks or working models.

Engineering Excellence

Quality

Maintains awareness of teams’ quality OKRs and drives accountability across multiple organizations.

Proactively identifies quality opportunities to drive business outcomes.

Collaboratively builds, adopts and promotes principles, frameworks, processes and technologies to improve quality across the engineering org on a multi-year time horizon.
Velocity

Maintains awareness of velocity across organizations.

Creates a vision and defines velocity Eng OKRs that incentivize investments in velocity on a multi-year time horizon.

Provides principles for build-vs.-buy decisions that endure product and technology evolution.

Evolves velocity Eng OKRs and correlates them with lasting customer impact.
Efficiency

Defines and owns efficiency goals across multiple organizations. Drives efficiency OKRs and accountability across organizations.

Collaboratively builds, adopts and promotes principles, frameworks, processes and technologies to improve efficiency. Makes bold bets to accelerate efficiency and reduce cost.

Defines a multi-year efficiency strategy, including vendor selection, technology adoption, and budget planning and tracking.

As part of our intentional approach to engaging people of all backgrounds while fostering an environment of diversity and inclusion, we’re celebrating National Hispanic Heritage Month alongside our Unidos@ Employee Resource Group. 

Through our company culture, community efforts, and engagement with merchants, we support and empower Hispanic and Latin merchants and the communities we serve throughout the year. 

Beginning September 15 through October 15, we’re hosting a series of virtual and in-person events to celebrate the vibrant cultures and enduring contributions from Spain, Mexico, the Caribbean, and Central & South America. 

We’ll kick off the month of inclusive programming with a fireside chat with Annie Leah, founder of I<3Chamoy, followed by a cooking session with Omi Hopper, a home cook turned ‘Next Level Chef’ and social media influencer. Our restaurant partners Sergios, Sanguich, and SuViche will join us for an engaging panel discussion, followed by a traditional Loteria game hosted by drag queen Gubsy Daisy. Charles Schwab will host a financial literacy discussion.

Across two days, we’ll also be joined by innovative Hispanic leaders across tech, retail, and e-commerce. In partnership with our Chief People Officer and the Unidos@ executive sponsor, Mariana Garavaglia, our panelists will come together through a series of fireside chats to share how they’ve navigated and charted career paths, built strong mentor relationships, and more.

Speakers will include:

– Mariana Garavaglia, our Chief People Officer and Unidos@ executive sponsor
– Anna Fabrega, Chief Executive Officer at Local Bounti
– Ariel Dos Santos, VP of Product at Redfin
– Cris Rivera, Regional President at Cresco Labs
– Andres Gonzalez, VP of eCommerce Operations at Walmart
– Jerry Guerrero, Director of Diversity, Equity, and Inclusion at Roku
– Andrea Perez, SVP Global Marketing – Product, Brand and Services at Sony 

To close out the month, our Mexico City office will host dance lessons to experience the rhythm and grace of being Hispanic. 

As we celebrate NHHM year-round, our Learning and Development team at DoorDash will continue to create learning opportunities, including sessions on how to succeed as a Latina in a global work environment, how to develop your authentic self, and more.

Through our company culture, community efforts, and engagement with merchants, we’ll continue to support and empower Hispanic and Latin merchants, as well as the communities we serve.

Our efforts to create a workplace where everyone feels welcomed, supported, and valued are one of the reasons we were recently recognized as one of America’s Greatest Workplaces 2023 by Newsweek. To learn more about this award and how we’ve shaped the culture our employees have experienced over the last year, click here.