Terraform lets DoorDash programmatically manage access across our infrastructure. However, the lack of a Terraform provider for Splunk, the tool we use to search, analyze, and visualize our data, made it difficult to manage user access programmatically and at scale.

The efficiency and security gained by integrating access through Terraform led us to create a Terraform provider for Splunk.

Terraform providers, which currently include such diverse products as Akamai, GitHub, and MySQL, integrate services with Terraform. Before creating our Terraform provider for Splunk, we had to manage Splunk access separately, a process requiring more work for administrators without any improvement in data security.

We concluded that writing a new Terraform provider for Splunk was worth the effort for DoorDash, and the process worth documenting for the engineering community at large. Terraform provides a flexible and easy way to manage existing and custom services as we will see in this post. Furthermore, we intend to open source our Terraform provider so other organizations can benefit from this work and run more efficient, secure infrastructure.

What are custom Terraform providers? 

Terraform is a popular tool used for building, changing, and managing infrastructure in a safe, repeatable way by using files for configuration instead of manually editing service settings.

Terraform is logically divided into two parts: Terraform Core and Terraform Plugins: 

The Terraform Core component communicates with Plugins using RPC (remote procedure calls) and is responsible for discovering and loading plugins.

The Terraform Plugins component provides a mechanism for managing a specific service (i.e. Splunk, in this article). These service specific plugins are known as custom Terraform providers.

The primary responsibilities of custom Terraform providers are:

  1. Initialization of any included libraries used to make API calls
  2. Authentication with the Infrastructure Provider
  3. Definition of resources that map to specific services

Each Terraform provider can manage one or more resources where a resource represents a component within the provider that has a set of configurable attributes and a CRUD (create, read, update, delete) lifecycle. 

While it is possible to write a plugin in a different language, almost all Terraform plugins are written in Go to take advantage of the Terraform provided helper libraries. These plugins are then distributed in the form of Go binaries.

How do you write a custom Terraform provider? 

As we saw in the previous section, custom Terraform providers are nothing more than Go binaries written to manage certain resources within a service. In this section, we will provide step-by-step guidance on how to create your own Terraform provider.

Before you get started, make sure that you have access to an API client library for managing Splunk roles and SAML Groups. According to Terraform’s documentation, client library logic should not be implemented in the provider. Instead, Terraform should consume an independent client library which implements the core logic for communicating upstream. 

Listed below are the steps to create your own Terraform provider:

  1. Define the provider entrypoint.
  2. Define the provider itself.
  3. Define the resources that the provider will control.
  4. Build and test your Terraform Plugin binary.

The Terraform framework automatically manages all provider operations once these are defined. These components utilize a Terraform-provided helper library that offers a high level interface for writing resource providers in Terraform.

Defining the provider entrypoint

The first step in creating a custom Terraform provider is to define the provider entrypoint. As mentioned above, custom Terraform providers are distributed as Go binaries. Terraform requires each of these binaries to be prefixed with “terraform-provider-”. Hence, when we create a directory in our GOPATH it should be named “terraform-provider-splunk”. Inside of this GOPATH directory, we specify the entrypoint to our provider in the main.go file. Below is a code sample that demonstrates this entrypoint to the provider.


package main


import (
	splunkplugin "github.com/doordash/terraform-provider-splunk/splunk"
	"github.com/hashicorp/terraform-plugin-sdk/plugin"
	"github.com/hashicorp/terraform-plugin-sdk/terraform"
)



func main() {
	plugin.Serve(&plugin.ServeOpts{
		ProviderFunc: func() terraform.ResourceProvider {
			return splunkplugin.Provider()
		},
	})
}

The code sample above shows three libraries being imported into main.go – two of them are standard libraries and the third is custom. Below are the two standard libraries:

  • “github.com/hashicorp/terraform-plugin-sdk/plugin” library: this acts as the Terraform provider’s gRPC server. It executes any resource logic based on the received, parsed configuration and provides Terraform Core with the updated state of a resource, or appropriate feedback in the form of validation or other errors.
  • “github.com/hashicorp/terraform-plugin-sdk/terraform” library: this interface must be implemented by any resource provider. It creates and manages resources in a Terraform configuration.

The custom library contains the provider implementation, which we will cover in the next section.

Defining the provider itself

In this section, we will cover the implementation details of the provider. A provider is responsible for understanding API interactions with a service and exposing its resources. The starting point of the provider along with all provider properties are defined in this section. These properties are used to initialize the API client, declare resources that will be controlled, and define any configuration details. These properties may consist of the following:

  1. Schema: This is an optional schema for the configuration of this provider. The helper schema standard library is used to define the schema for the provider.
  2. ResourceMap: This property declares the list of available resources that this provider will manage.
  3. ConfigureFunc: This is an optional function for configuring the provider. This can be used to authenticate to the service.

import (
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)



func Provider() *schema.Provider {
	return &schema.Provider{
		Schema: map[string]*schema.Schema{
			"hostname": &schema.Schema{...
			},
			"port": &schema.Schema{...
			},
			"username": &schema.Schema{...
			},
			"password": &schema.Schema{...
			},
		},
		ResourcesMap: map[string]*schema.Resource{
			"splunk_role": resourceRole(),
			"splunk_saml_group": resourceSamlGroup(),
		},
		ConfigureFunc: providerConfigure,
	}
}

Defining the resources that the provider controls

In this section, we will define the Splunk resources and their attributes that we want the provider to control, i.e. Splunk roles and Splunk SAML Groups. Splunk roles determine the level of access that users can have and tasks they can perform, whereas Splunk SAML Groups allow authorized groups on your SAML server to login to your Splunk instance. A major component of Splunk authentication and authorization is managed through these two resources and their attributes, which we will define below. 

The resource schema from the helper library provides an abstraction that leaves management of only the CRUD operations to us while off-loading operations such as validation and diff generation to this library.

Each resource consists of the following:

1) Define Create, Read, Update, and Delete functions.

  • Each function definition gets passed the schema resource data
  • The Create and Update functions set the resource data ID (using SetId). A non-blank ID tells Terraform that a resource was created and is the value used to read the resource again.
  • The Delete function sets the ID to a blank string, as in the example below:
func resourceRoleDelete(d *schema.ResourceData, m interface{}) error {
	apiClient := m.(*splunkclient.Client)

	err := apiClient.DeleteRole(d.Get("rolename").(string))
	if err != nil {
		return err
	}
	d.SetId("")
	return nil
}

2) Resource schema: the resource’s attributes. For each attribute, you can define the following (among others): 

  • ForceNew: forces Terraform to delete and recreate the resource when this attribute changes.
  • Required/Optional
  • Type
  • Description
  • Default
import (
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

func resourceRole() *schema.Resource {
	return &schema.Resource{
		Create: resourceRoleCreate,
		Read:   resourceRoleRead,
		Update: resourceRoleUpdate,
		Delete: resourceRoleDelete,

		Schema: map[string]*schema.Schema{
			"rolename": &schema.Schema{
				Type:        schema.TypeString,
				Required:    true,
				Description: "The name of the role, also acts as its unique ID. Required.",
				ForceNew:    true,
			},
			"capabilities": &schema.Schema{...
			},
			"cumulativertsrchjobsquota": &schema.Schema{...
			},
			"cumulativesrchjobsquota": &schema.Schema{...
			},
			"defaultapp": &schema.Schema{...
			},
			"srchindexesallowed": &schema.Schema{...
			},
			"srchindexesdefault": &schema.Schema{...
			},
			"srchjobsquota": &schema.Schema{...
			},
			"rtsrchjobsquota": &schema.Schema{...
			},
			"srchdiskquota": &schema.Schema{...
			},
			"srchfilter": &schema.Schema{...
			},
			"importedroles": &schema.Schema{...
			},
		},
	}
}

Build and test your Terraform Plugin binary

Now that all components of our custom Terraform provider are defined, let us build and test the binary on our local machines. Build the Go binary (which is the Terraform provider Plugin) by running `go build -o terraform-provider-splunk`. This command outputs a binary named `terraform-provider-splunk`.

To test the binary on your local machine, place it in $HOME/.terraform.d/plugins (on macOS), then follow these steps:

1) Initialize the provider and define a SAML Group resource and a role resource. A sample Terraform file looks like the following code sample:

provider "splunk" {}

resource "splunk_role" "testrole" {
  rolename = "testrole"
  capabilities = [
    "search"
  ]
  srchindexesallowed = [
    "*"
  ]
  srchindexesdefault = [
    "test"
  ]
  cumulativertsrchjobsquota = 1
  cumulativesrchjobsquota   = 2
  defaultapp                = ""
  srchjobsquota             = 3
  rtsrchjobsquota           = 4
  srchdiskquota             = 5
  srchfilter                = "*"
}

resource "splunk_saml_group" "test" {
  samlgroupname = "testgroup"
  rolenames     = ["testrole"]
}

2) Initialize Terraform by running `terraform init`, then run `terraform plan`. You should see the following output:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # splunk_role.testrole will be created
  + resource "splunk_role" "testrole" {
      + capabilities              = [
          + "search",
        ]
      + cumulativertsrchjobsquota = 1
      + cumulativesrchjobsquota   = 2
      + id                        = (known after apply)
      + rolename                  = "testrole"
      + rtsrchjobsquota           = 4
      + srchdiskquota             = 5
      + srchfilter                = "*"
      + srchindexesallowed        = [
          + "*",
        ]
      + srchindexesdefault        = [
          + "test",
        ]
      + srchjobsquota             = 3
    }

  # splunk_saml_group.test will be created
  + resource "splunk_saml_group" "test" {
      + id            = (known after apply)
      + rolenames     = [
          + "testrole",
        ]
      + samlgroupname = "testgroup"
    }

Plan: 2 to add, 0 to change, 0 to destroy.

3) Apply the change with `terraform apply`, or discard the local plan.

Conclusion

By now you hopefully have a good understanding of how to build a custom Terraform provider for Splunk. In the future, we hope to open source this Splunk Terraform provider so it can be downloaded as a Go binary. Using such a provider will help your company increase visibility, auditability, and scalability in a standardized and secure manner.

Once you build and implement this solution, every new permission change to a Splunk role as well as every Splunk SAML Group mapping can be done through code. All changes will be tracked. The provisioning and deprovisioning of these managed resources will be simple and fast. Any new individual taking over management of Splunk authorization will be able to do so quickly and easily.

In addition, you can leverage the steps covered in this article to write a Terraform provider of your own for another system.

The challenge of building machine learning (ML)-powered applications is running inferences on large volumes of data and returning a prediction over the network within milliseconds, which can’t be done without minimizing network overheads. Model serving that is not fast enough will slow down your applications, jam up requests upstream, degrade user experience, and hamper operational efficiency. 

At DoorDash, we confronted similar issues in our gRPC-based model serving setup for search ranking, with network overheads taking up to 50 percent of service response time. After investigating the issue, we were able to reduce our overall response time by 50 percent and the network overheads by 33 percent. In this article we discuss how we managed to minimize network overheads related to model serving by enabling client-side load balancing, payload compression, and transport-level request tracing under gRPC’s Kotlin framework.

Background

Before we dive into how we reduced latency and network overheads, let’s quickly review some key context, including a short primer on gRPC and why it’s important to us. We are also going to briefly discuss our model serving setup at DoorDash before defining network overheads.

What is gRPC and why do we use it?

gRPC is a modern-era server-client communication framework built over the remote procedure call (RPC) paradigm. To get a quick understanding of gRPC, we’d recommend checking out its FAQ section (also look at its design principles if you want to take a deeper dive).

DoorDash migrated from REST to gRPC as the standard for communication between microservices. The major benefits of gRPC in the context of efficient network communication are its HTTP/2-based transport, which multiplexes many calls over long-lived connections, and its compact Protocol Buffers serialization.

What exactly are network overheads?

The difference between the time taken on the client-side to get a response (response time) and the time taken on the server-side to process a request (processing time) can be defined as network overhead. Different factors that go into network overhead include message serialization, deserialization, and time spent in the transport layer waiting to be put in the application layer (request/response queuing). This metric is critical, as it was the primary one we were aiming to reduce.

Figure 1. Serialization, deserialization, and network queuing are all factors that add to network overhead.

How do we realize model serving at DoorDash?

All of the machine learning use cases at DoorDash, such as search and recommendation, fraud detection, optimizing Dasher dispatch, and predicting delivery times, are served through Sibyl, our prediction service cluster. The cluster is deployed as a Kubernetes service in AWS, with each pod in the cluster composed of a Kotlin gRPC server.

A majority of model serving use cases are an essential part of the DoorDash user experience. All of these use cases mean we have to ensure our system can handle approximately one million predictions a second with high reliability. In our recent blog article, Meet Sibyl, we talked about the various optimizations that went into building the service. Overall, we optimized for performance over design complexity. These optimizations were:

  • Running models in their native format using a C++ integration to mitigate computational overheads
  • Batching prediction requests in pre-tuned batch sizes for optimal throughput
  • Partially passing feature data with the requests to save on feature-fetching time at the server

Why is it hard to make model serving efficient at scale?

Next, we will highlight some of the key challenges specific to model serving. We will also highlight the need to address network overheads by elaborating upon their negative impact. Lastly, we will talk about scale in the context of DoorDash to get a sense of the prevailing challenges.

Characteristic challenges of model serving

Now we will go over some of the key challenges of model serving, including how large payloads can slow down model serving, how heavy computations increase response time, and why a faster response time is expected from an ML service in comparison to an independent service.

Large payloads slow down model serving

Model services characteristically work with large request/response sizes. The size of a request sent to a model service is determined by the number of entities involved and features associated with each entity.

For example, in the restaurant ranking use case, hundreds of stores are included in a single request to be ranked by a search query (search ranking). Furthermore, each store can have several dozen features passed in with the request, such as ratings, has_photos, and consumer_store_category_match. With this, the raw payload size can be in the range of megabytes (a 100 Mbps connection takes 80 ms to transfer a 1 MB file).

Heavy computations increase response time 

The size of a feature matrix and the prediction algorithm govern the complexity of a single model computation. For example, our order recommendation use case runs a PyTorch model over embeddings of menu items, each of which has more than 100 dimensions.

Additionally, there can be multiple model computations within a single request. For example, to optimize Dasher dispatch, we need multiple predictions such as estimating the time for a Dasher to reach a restaurant, the time to prepare food by the restaurant, and making corrections on times quoted by merchants.

Expectation of a faster response time versus an independent service

As you would expect, a slow service degrades the end user experience. More often than not, ML services are not the only services involved in meeting a business use case. For example, to render DoorDash’s homepage, a request flows from your device to a server component at DoorDash, which then makes a call to the feed-service, which fetches content such as a list of stores, carousels, and offers. To get a list of restaurants, the feed-service calls the search-service, which gets the list and ranks it using the sibyl-prediction-service.

Figure 2: Meeting a machine learning business case might involve multiple service calls before getting results back to the user. Here, we show a call stack for a restaurant ranking model.

As a result, we require a faster response time for these services (for optimal user experience) in comparison to an independent service, such as the menu-service, returning menu items for a given restaurant. 

Note: The expectation of a faster response time holds true for any service which gets called by another service in serving a business use case.

Pseudo-traffic from experiments

Analytics and Data Science teams conduct shadowing and A/B experiments in production to improve an existing model or explore new AI use cases. These experiments increase traffic and resource use by a factor equal to the number of live experiments and the number of queries sent by these experiments.

Network overheads are not nominal

Typically, application services work with small request payloads and don’t feel the effect of network overheads. But as payload size increases, requests become increasingly slower to transport, serialize, and deserialize. In Figure 3, below, we share one of our use cases in which network overheads superseded request processing time.

Figure 3. Application services with large payload sizes can cause situations where network overheads (response time – processing time) are larger than service processing times.

Furthermore, a service whose throughput falls below its rate of incoming requests will suffer from request queuing. In such an event, processing of requests will begin much later than when the requests were originally sent, thereby contributing to network overheads. The request queue consumes service resources and bottlenecks the throughput of upstream services.

For example, in Figure 2, referenced earlier, if the search-service has a request queue, calls made to its parent, the feed-service, will suffer significant delays and will, in turn, affect the feed-service’s throughput. Thus, request queuing may impede the request throughput of the overall system in a superlinear fashion and should not be ignored.

Challenges of scale at DoorDash

At DoorDash, all ML use cases combined pose high-volume challenges for model serving. To give a sense of the volume, our search ranking use case alone requires us to handle 600k predictions per second, and we currently serve over a dozen use cases.

If the rest of our application services operate at a request rate of X, model serving needs to be able to operate at 10X to 20X because for each user request, there are multiple ML use cases that need to be served. Our overall handling-of-scale expectations currently fall in the range of 1 to 10 million predictions/second. With the increase in use cases and consumer growth, these requirements keep rising.

How do we optimize our gRPC setup for model serving at DoorDash?

Here, we will talk about how we use gRPC to minimize network overheads, helping to limit the impact of problems we talked about above. We will explain why you would need client-side load-balancing and how it can be implemented with a simple server configuration. Next, we will cover the importance of payload compression and how it improved our setup. Finally, we will elaborate upon how you can use gRPC to study your network overheads using transport-level tracing. Since most of this is not covered as developer-ready documentation in gRPC, we will include code-snippets to help materialize our suggestions.

Using client-side load-balancing to evenly distribute request-load

It is imperative that a service should use load-balancing to evenly distribute its available resources to request-load. It is even more important with gRPC because a connection is reused for multiple calls. The heavy compute nature of making predictions could quickly overwhelm a server. One possible implementation is to employ an external load-balancer to route traffic evenly. We instead are able to use Kubernetes cloud-native service discovery, DNS, and client-side load balancing to achieve the same thing with the benefits outlined below. 

Why use client-side for load-balancing?

  1. It prevents a request hop to the external load-balancer.
  2. Your load-balancer may not support gRPC traffic because of its HTTP/2 protocol. For instance, the AWS network load-balancer doesn’t support HTTP/2 traffic.

The cost of using client-side load-balancing is added code complexity at the client-side. Next, we illustrate how we avoid that cost by using a simple implementation. 

An easy method to enable client-side load-balancing

We use short-lived connections to the server that reset a client connection every 10 seconds to realize load-balancing. This can be implemented using a server configuration that can be set using gRPC’s NettyServerBuilder and maxConnectionAge as below:

val server = NettyServerBuilder
   .forPort(ServiceConfig.PORT)

    ...
   .maxConnectionAge(ServiceConfig.MAX_CONNECTION_AGE, TimeUnit.SECONDS)
   .build()

With this configuration and MAX_CONNECTION_AGE set to 10 seconds, we get a decent resource distribution. Figure 4, below, shows a graph demonstrating how our 30-pod cluster uses CPU almost homogeneously over time under (periodically) varying request loads:

Figure 4. Our server configuration results in nearly collinear CPU utilization over time for 30 pods in a cluster.
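
On the client side, no special balancing logic is strictly required with this approach: a plain channel pointed at the service address is enough, since each forced reconnect establishes a fresh connection that can be routed to a different pod. Below is a minimal sketch; the target address is hypothetical, and how new connections are spread across pods depends on your service discovery setup.

import io.grpc.ManagedChannel
import io.grpc.ManagedChannelBuilder

// Plain client channel; with maxConnectionAge set on the server, the connection is
// reset roughly every 10 seconds and each reconnect can land on a different pod.
val channel: ManagedChannel = ManagedChannelBuilder
   .forTarget("dns:///sibyl-prediction-service:50051") // hypothetical service DNS name and port
   .usePlaintext()
   .build()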

What are alternative ways to implement client-side load-balancing?

If the above configuration doesn’t work for you, the ideal solution involves setting up a service mesh by using a sidecar container on the client-side. The sidecar container is responsible for routing traffic emanating from a client to a service in a load-balanced manner. Linkerd is a Kubernetes-specific solution, while Istio is a platform-independent option designed to achieve load-balancing.

Using payload compression to reduce network packet sizes 

As payload sizes grow, compression is the natural solution to help reduce the packet sizes. Smaller packets are faster to transport but compression adds overheads of serialization and deserialization. That’s why applicability and choices of a compression algorithm need to be evaluated before putting them to use.

We did not settle for gRPC’s default compression

After running benchmark tests using a few algorithms on one of our characteristic requests, we found that the Zstandard (zstd) algorithm performs the best in terms of end-to-end service latency. The reduction in latency from zstd over gzip (the default compression option from gRPC) was over 12 percent. Below is a snippet from our benchmark test comparing raw latency (in milliseconds) for LZ4, Snappy, gzip, zstd, and no compression.

Figure 5a. In comparison among five compression algorithms, zstd showed the lowest end-to-end latency.
Figure 5b. After instituting zstd (green line), we see significantly reduced latency in our production cluster.

How to enable compression with gRPC?

Since the gRPC framework itself provides very little documentation on this, here are a few of our code snippets for enabling zstd compression to help you get started:

import com.github.luben.zstd.ZstdInputStream
import com.github.luben.zstd.ZstdOutputStream
import io.grpc.Codec
import io.grpc.CompressorRegistry
import io.grpc.DecompressorRegistry
import java.io.InputStream
import java.io.OutputStream

// Implement io.grpc.Codec interface for zstd compression using zstd library
class ZStdGrpcCodec : Codec {
   override fun getMessageEncoding(): String {
       return "zstd"
   }

   override fun decompress(inputStream: InputStream): InputStream {
       return ZstdInputStream(inputStream)
   }

   override fun compress(outputStream: OutputStream): OutputStream {
       return ZstdOutputStream(outputStream)
   }
}

// Register zstd codec to a factory enabling gRPC to automatically use it for compression/decompression
object CompressorRegistryFactory {

   fun newCompressorRegistry(): io.grpc.CompressorRegistry {
       val registry = CompressorRegistry.getDefaultInstance()
       registry.register(ZStdGrpcCodec())
       return registry
   }

   fun newDecompressorRegistry(): io.grpc.DecompressorRegistry {
       return DecompressorRegistry.getDefaultInstance()
               .with(ZStdGrpcCodec(), true)
   }
}
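
For the server to advertise and use the codec, these registries presumably also have to be attached to the server builder. Below is a minimal sketch reusing the NettyServerBuilder shown earlier; the exact wiring in our service is not shown in this post.

import io.grpc.netty.NettyServerBuilder

val server = NettyServerBuilder
   .forPort(ServiceConfig.PORT)
   // make the zstd codec available for decompressing requests and compressing responses
   .decompressorRegistry(CompressorRegistryFactory.newDecompressorRegistry())
   .compressorRegistry(CompressorRegistryFactory.newCompressorRegistry())
   .build()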

At the client-side, to be able to send compressed requests, declare CompressorRegistryFactory in a similar manner and then initialize the stub using it:

// the factory registers the zstd codec into the default compressor registry as a side effect
compressorRegistryFactory.newCompressorRegistry()

// then ask the stub to compress outgoing requests with the registered codec
val stub: GreeterCoroutineStub = GreeterCoroutineStub(channel)
   .withCompression(environment.sibylPredictionCompressorName())

How to find the right algorithm for your needs?

Since effects of compression are dependent on the input, you should run benchmarks on your requests to determine the best algorithm. One can quickly test performance on a given file using lzbench to shortlist algorithms for benchmark tests. Then, you should run benchmark tests on your services to get accurate comparison measures.

Using transport-level tracing to investigate network overheads

gRPC’s tracing capabilities aim to bring service developers closer to the transport layer. It is possible to intercept a message and know exactly when a client call entered the application layer or when a response was sent to the transport layer.

Why should a service developer worry about transport?

Network overheads can add significant delays to service latency. Instrumentation at the application layer is not sufficient to investigate these delays. For instance, a response may be generated by the service in 20ms but sent on the wire after 100ms. Or it could be sent on the wire almost immediately but received by the client after 100ms. If you only rely upon time measurements inside your method call, it won’t be possible to tell which service (caller or callee) to investigate to mitigate latency slowdowns.

How do we use gRPC tracing for visibility into a request flow?

We use all the available hooks in gRPC tracing to get visibility into transport at client-side and server-side for our production traffic. These tracing measurements are then reported to Wavefront, our observability tool. Figure 6, below, illustrates these hooks in action at server-side reporting the cumulative latency breakdown at a high level:

Figure 6. Using Wavefront, we get a snapshot of cumulative latency

With this information, we are able to get a rough estimate of different components of the network overhead:

  • Network delay + request queuing ~ inboundMessage – requestCreated
  • Request deserialization ~ inboundMessageRead – inboundMessage
  • Response serialization ~ outboundMessageSent – predictionsMade
  • Network delay + response queuing ~ clientReceivedResponse – outboundMessageSent

Caution: For request deserialization, gRPC’s documentation says that the event inboundMessage is fired as soon as the stream knows about the message, but doesn’t have a further guarantee as to whether the message is deserialized or not.

From the metrics shown in Figure 6, for example, we can tell that transporting and deserializing a request (inboundMessageRead – serverCallStarted) took ~2ms, and it took another ~2ms to serialize the response and send it over the wire (outboundMessageSent – predictionsMade). In all, it took us 5.6ms to make predictions, while the client sees the response 11ms after making the request. Thus, our network overhead is ~4.4ms.

Some sample code-snippets below can help you get started:

// define a factory to help create request tracers for each request
class ServerRequestTracerFactory() : ServerStreamTracer.Factory() {
   override fun newServerStreamTracer(fullMethodName: String?, headers: Metadata?): ServerStreamTracer {
       return ServerRequestTracer(fullMethodName, headers)
   }
}

// override ServerStreamTracer to add hooks to gRPC transport-level events
class ServerRequestTracer(
   private val fullMethodName: String?,
   private val headers: Metadata?
) : ServerStreamTracer() {
   private val startTime = Clock.defaultClock().tick - requestInitializedTime()

   private fun requestInitializedTime(): Long {
       // the client communicates the time the request was created via a header;
       // read it here at the server-side
       val requestInitializedTime = headers?.get(RequestInitializedTimeHeader)?.toLongOrNull() ?: return 0L
       val timeDiffMs = System.currentTimeMillis() - requestInitializedTime
       // factor in inaccuracies of currentTimeMillis or time-sync issues
       return max(TimeUnit.NANOSECONDS.convert(timeDiffMs, TimeUnit.MILLISECONDS), 0L)
   }

   // report the named event to a real-time monitoring tool using startTime and the current timestamp
   private fun reportEvent(event: String) { /* ... */ }

   override fun serverCallStarted(callInfo: ServerCallInfo<*, *>) {
       reportEvent("serverCallStarted")
   }

   override fun outboundMessage(seqNo: Int) {
       reportEvent("outboundMessage")
   }

   override fun inboundMessage(seqNo: Int) {
       reportEvent("inboundMessage")
   }

   // similarly implement other gRPC tracing hooks such as outboundMessageSent, etc.
 
   companion object {
       // header definition
       private val RequestInitializedTimeHeader: Metadata.Key<String> = Metadata.Key.of(
               "x-request-start",
               Metadata.ASCII_STRING_MARSHALLER
       )
   }
}
// Initialize the server with the trace factory defined above
val server = NettyServerBuilder
   .forPort(ServiceConfig.PORT)
   ... 
   .addStreamTracerFactory(ServerRequestTracerFactory())
   .build()

At the client-side, you merely have to pass the header x-request-start with its value set to the time when the request was created.
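
Here is a rough sketch of how a client might stamp that header using a gRPC ClientInterceptor; the interceptor name is ours for illustration, and the header key mirrors the server-side snippet above.

import io.grpc.CallOptions
import io.grpc.Channel
import io.grpc.ClientCall
import io.grpc.ClientInterceptor
import io.grpc.ForwardingClientCall
import io.grpc.Metadata
import io.grpc.MethodDescriptor

// Hypothetical client-side interceptor that records the request creation time in the
// x-request-start header so the server-side tracer can compute network overhead.
class RequestStartTimeInterceptor : ClientInterceptor {
   private val requestStartKey: Metadata.Key<String> =
       Metadata.Key.of("x-request-start", Metadata.ASCII_STRING_MARSHALLER)

   override fun <ReqT, RespT> interceptCall(
       method: MethodDescriptor<ReqT, RespT>,
       callOptions: CallOptions,
       next: Channel
   ): ClientCall<ReqT, RespT> {
       return object : ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
           next.newCall(method, callOptions)
       ) {
           override fun start(responseListener: ClientCall.Listener<RespT>, headers: Metadata) {
               headers.put(requestStartKey, System.currentTimeMillis().toString())
               super.start(responseListener, headers)
           }
       }
   }
}

// Attach it when building the client channel, for example:
// val channel = ManagedChannelBuilder.forTarget(target).intercept(RequestStartTimeInterceptor()).build()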

There are similar gRPC hooks available at the client-side to infer time spent in request serialization and response deserialization. These hooks can be enabled via a ClientInterceptor as illustrated in this gRPC CensusTracingModule.java documentation.

gRPC tracing led us to reduce our overall response time by 50 percent

When we initially rolled out our sibyl-prediction-service, we saw large unexplained network overheads. There was a sizable gap (~100ms) in latency reported at client-side vs latency reported at server-side. Using gRPC tracing, we noticed that most of this time (~50ms) was spent between predictionsMade and outboundMessageSent.

This observation made it evident that the issue was server-side, and upon closer investigation we discovered one of our logging calls outside predictionsMade scope was slowing us down. After removing the logging call, we cut the gap between predictionsMade and outboundMessageSent by 94 percent. This result led us to restructure our logging throughout the code-base, which caused our overall response time to drop by 50 percent.

Additionally, gRPC tracing also helped us in latency optimization experiments, such as comparing compression algorithms with a finer granularity since it can report time taken during serialization/deserialization.

Conclusion 

Through the testing and experiences described above, we learnt that network overheads can significantly slow down services and that it’s important to address them. The impact is even more profound with the large payloads typical of ML model serving, large traffic volumes, and the expectation of a response within milliseconds.

We used DoorDash’s model serving setup to elaborate upon how network overheads can be minimized with the gRPC framework. We discussed how to efficiently and evenly distribute load in a service cluster. We also discussed the impact of compression in reducing network delays.

Last but not least, we elaborated upon gRPC tracing and its benefits in providing a fine-grained understanding of the time spent in a service call and in evaluating latency optimization experiments. We also learnt that gRPC empowers application developers by bringing them closer to the transport layer, giving them novel opportunities for optimization.

As you might have noticed, these learnings extend beyond model serving and can be applied to any general setup of internal services. Setting a small connection age is sufficient to get you started on load-balancing. Using the right algorithm for compression can help you work with large request sizes or remove overheads when batching requests. gRPC’s transport-level tracing abilities can help you identify request queuing, serialization, and deserialization overheads.

Future work

In the near future, as we expand our scale of operations and work with multiple other machine learning use cases, we will encounter new challenges related to performance and we hope to explore other avenues for optimization.

Acknowledgements:

Thanks to Param Reddy and Hien Luu for your involvement in discovering these findings.

As companies utilize data to optimize and personalize customer experiences, it becomes increasingly important to implement services that can run machine learning models on massive amounts of data to quickly generate large-scale predictions. At DoorDash, our platform is utilizing data to power models that curate search results, assign dashers, recognize fraud, and more. This would only be possible with a robust prediction service that can apply our models to the data and serve our various microservices that rely on data-driven insights. 

We recently implemented DoorDash’s next generation prediction service, and named it Sibyl prediction service, after the Greek oracles who were known to utter predictions in an ecstatic frenzy. In this blog post you will learn about the ideation, implementation, and rollout of Sibyl, and the steps we took to build a prediction service that can handle massive numbers of calls per second without breaking a sweat. If you are interested in how models are integrated with the service, check out this article. While Sibyl itself may be unique to DoorDash, the learnings and considerations we used to build it can be applied to almost any high throughput, low latency service. If you are looking to learn about the greater ML platform, check out this blog post: DoorDash’s ML Platform – The Beginning.

Ideation and Requirements: The prediction service’s role in Machine learning (ML) infrastructure

Before starting any actual coding, we took some time to consider exactly what role Sibyl would have in DoorDash’s ML infrastructure ecosystem, as well as all the required functionalities that this new prediction service needed. 

In terms of its role, we wanted the Sibyl prediction service to handle all realtime predictions, and to focus on predicting, leaving other components, such as feature calculation, model training pipelines, and storage of features and models, in separate services/stores. The first thing we considered was scalability and latency, given that we were expecting hundreds of thousands of predictions per second, and that in order to convince other services to call our service for predictions, we needed to make them so fast that calling Sibyl would be better than the alternative of each individual service making predictions itself.

How Sibyl prediction service fits into the ML infrastructure ecosystem:

Figure 1: Sibyl prediction service high level flow

As you can see by Figure 1, all predictions come in from other services as gRPC requests, and Sibyl retrieves both models and features from independent stores (with the option of outputting predictions to Snowflake for offline evaluation). Possible future expansion includes a separate model evaluator, which can be used for pure prediction computation needed for complex models. For V1, however, this will be included within Sibyl.

In terms of required functionalities, here were some highlights:

  • Batch predictions: We should allow each prediction request to contain any variable amount of feature sets to predict on (note: a “feature set” is a set of feature values we want to predict on. A simple way of thinking about this is that a feature set is just an input we want to predict on). Batch predictions are essential, as they allow client services to send and retrieve 1 to N predictions at once, greatly reducing the number of calls needed.
  • Shadow predictions: In addition to making predictions and sending them back to client services, the option to make shadow predictions asynchronously in the background was essential. Oftentimes, before finalizing and clamping down on one particular model, teams may have multiple candidate models, and want to test multiple models at once on the same data. Allowing teams the option to use one model for official predictions, and to asynchronously make predictions on the same data with different models, gives them the flexibility and power to analyze the efficacy of various candidate models.
  • Feature and model fetching: As mentioned above, Sibyl needs to be able to fetch both features and models from their respective stores. For features, they would be fetched during each prediction, and for models, we could save time and computing power by first fetching all models when the service starts up, then caching them in-memory, eliminating the need to load them for each request.

Implementation: General service overview and decision highlights

To get a general understanding of how the service works, as well as a brief overview of the moving parts of the service, here’s what the lifecycle of a typical request looks like:

Figure 2: Lifecycle of a typical request

Referencing Figure 2:

  1. The request arrives.
  2. For each model identified in the request, we grab both the model, as well as the model config (which contains info on that model, such as all required features, default fallback values for features, and model type) from an in-memory cache. 
  3. Then we iterate through the values provided in the request, and find out if there are any missing feature values that were not provided. We do this for all models and all feature sets at once, and store values in a map for easy lookup.
  4. For all missing features, we attempt to retrieve feature values from the feature store, which is a Redis cache of feature values. If they still cannot be found, we set the feature values as the provided default value in the model config.
  5. Now that we have all features and all feature values required for predictions, we asynchronously make a prediction for each feature set. For each shadow model, we also launch an asynchronous coroutine, but don’t wait on the results to finish before continuing.
  6. With all predictions made, finally, we construct a response protobuf object, and populate it with the predictions, returning the protobuf back to the client.
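
As an illustration of steps 3 and 4 above (the function and parameter names here are ours, not Sibyl’s actual code), the missing-feature lookup with default fallbacks looks roughly like this:

// Hypothetical sketch: resolve feature values for one model, preferring values passed
// in the request, then the Redis-backed feature store, then the model config's defaults.
fun resolveFeatures(
   requiredFeatures: List<String>,
   providedValues: Map<String, Double>,
   fetchFromFeatureStore: (List<String>) -> Map<String, Double>,
   defaultValues: Map<String, Double>
): Map<String, Double> {
   val missing = requiredFeatures.filter { it !in providedValues }
   val fetched = if (missing.isEmpty()) emptyMap() else fetchFromFeatureStore(missing)
   return requiredFeatures.associateWith { name ->
       providedValues[name] ?: fetched[name] ?: defaultValues.getValue(name)
   }
}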

Now I want to highlight some decisions/details we’ve made that were noteworthy:

Optimizing prediction speed using native API calls

The two ML frameworks that Sibyl uses, LightGBM and PyTorch (if you are curious as to why we chose these two, give DoorDash’s ML Platform – The Beginning a quick read), have API frameworks implemented in a variety of different programming languages. However, in order to optimize for speed, we decided to store models in their native format and make prediction calls to these frameworks in C++. Working in C++ allowed us to minimize the latency of making the actual predictions. We used the Java Native Interface (JNI) so that the service, implemented in Kotlin, can make the LightGBM and PyTorch prediction calls, implemented in C++.
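
As an illustration of the pattern (the library and function names below are hypothetical; Sibyl’s actual bindings are internal), a JNI bridge from Kotlin to the C++ glue code looks roughly like this:

// Hypothetical JNI wrapper: the prediction itself runs in a C++ shared library
// that calls the native LightGBM / PyTorch APIs.
object NativePredictor {
   init {
       // loads libsibyl_native.so / .dylib built from the C++ glue code (hypothetical name)
       System.loadLibrary("sibyl_native")
   }

   // Implemented in C++ and exposed through JNI: takes a flattened feature matrix
   // and returns one prediction per row.
   external fun predict(modelHandle: Long, features: FloatArray, numRows: Int): FloatArray
}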

Coroutines, coroutines, and more coroutines

Due to the demanding scalability and latency requirements, one of our top priorities was to make sure that all predictions were being conducted in parallel, and that when waiting for features to be fetched from the feature store, threads would actually be performing computations and calculations (instead of just waiting). Thankfully, developing in Kotlin gave us much needed control over threads via its built-in coroutine implementation. Kotlin’s coroutines aren’t tied to a specific thread, and suspend themselves while waiting, meaning that they don’t actually hold the thread, allowing the thread to perform work on something else while waiting. While it is possible to implement similar behavior in Java using callbacks, syntactically, creating and managing Kotlin coroutines are far cleaner than Java threads, making multithreaded development easy.
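
A rough sketch of this pattern (the types and helper names are ours, for illustration only): official predictions for every feature set are awaited in parallel, while shadow predictions are launched on a separate long-lived scope and never block the response.

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

// Hypothetical placeholder types; the real request/response types are protobufs.
class FeatureSet
class Prediction
class Model

suspend fun predict(model: Model, featureSet: FeatureSet): Prediction = Prediction()

suspend fun predictBatch(
   officialModel: Model,
   shadowModels: List<Model>,
   featureSets: List<FeatureSet>,
   backgroundScope: CoroutineScope
): List<Prediction> = coroutineScope {
   // shadow predictions are fired on a long-lived background scope and never awaited,
   // so they add no latency to the response
   featureSets.forEach { fs ->
       shadowModels.forEach { model -> backgroundScope.launch { predict(model, fs) } }
   }
   // official predictions run concurrently; coroutines suspend (rather than block a thread)
   // while waiting on the feature store, and all results are awaited before responding
   featureSets.map { fs -> async { predict(officialModel, fs) } }.awaitAll()
}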

Rolling Out: First load testing, then introducing into production

Conducting a test run

We decided to test Sibyl’s prediction capabilities on one of DoorDash’s most in-demand services. DoorDash’s search service has many responsibilities, one of which includes calculating which restaurants to show you when you visit doordash.com. You may not realize it, but every single restaurant you see on DoorDash needs to be scored and ranked beforehand, with the score being used to personalize your experience on the site, providing different restaurants in different positions on the site for different people (Figure 3). 

Figure 3: The search service’s ranking illustrated

Currently, the search service’s ranking logic is done in-house and within the service itself. So we decided that whenever the search service was about to rank a restaurant, it would spawn an asynchronous thread that would also call Sibyl. Doing this gave us the ability not only to verify that the prediction service works as intended, but also to accurately load test the service. Furthermore, the asynchronous calls ensured that any calls to Sibyl would not slow down the search service’s endpoints.

Sibyl ended up handling over 100,000 predictions per second, wow! This test run demonstrated that Sibyl was now ready to handle the throughput required for our production services, and that services at DoorDash could start migrating their models to call the service for any predictions.

Toggling request batch size and other tuning to optimize the service

One configuration we played around with was the batch size for each prediction request. Since there are potentially hundreds of stores that can appear on your store feed, the search service actually ranks hundreds of stores at once. We were curious to see how much faster each request would be if, instead of sending all stores at once to Sibyl, we split the request into sizable chunks, so that instead of predicting on ~1,000 stores at once, Sibyl predicted on 50 stores in 20 separate requests.

We found that the optimal chunk size for each request was around 100-200 stores. Interestingly, smaller chunk sizes, such as chunks of 10 and 20 stores, actually made latency worse. Therefore, there was a nice middle ground, illustrating that while the number of stores per request mattered, the service performed better when the chunks were decently large in size. The hypothesis is that if chunk sizes are too small, then the number of requests increases substantially, resulting in request queuing and higher latencies. On the other hand, if predictions contain all 1,000 stores, then the amount of data to send and receive balloons, and the propagation delay between client and service becomes our bottleneck. This finding was actually encouraging for us, as it demonstrated that we efficiently implemented Sibyl to run predictions in parallel, and that at a large scale, the service is able to make substantial batch predictions without a hiccup.
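
A sketch of the client-side chunking pattern follows; the chunk size and the scoring call are illustrative, with `scoreChunk` standing in for the actual gRPC stub call.

import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Hypothetical sketch: fan a large ranking request out to Sibyl in chunks of ~150 stores,
// score the chunks concurrently, and stitch the results back together in order.
suspend fun <Store> scoreInChunks(
   stores: List<Store>,
   chunkSize: Int = 150,
   scoreChunk: suspend (List<Store>) -> List<Double>
): List<Double> = coroutineScope {
   stores.chunked(chunkSize)
       .map { chunk -> async { scoreChunk(chunk) } }
       .awaitAll()
       .flatten()
}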

Besides chunking, request compression was looked into as well. As mentioned above, one issue with these batch requests is that the sizable amount of data being sent results in large propagation delay times. With hundreds of stores and their feature values included in each request, it made sense to try to compress requests in order to reduce the number of packets in the network layer that would need to be sent to Sibyl.

Finally using the service’s predictions in production

When load testing, although Sibyl was called every single time a store was ranked, the result returned by the service was never actually used. The next step was to actually use these calculated values and to officially integrate the service’s predictions into the production workflow for various models. While handling requests from our search service was good for load testing, its very strict latency requirements meant it would not be among the first models migrated over. Furthermore, added time would need to be spent migrating all feature values from the search service to Sibyl’s feature store. We decided to start with some fraud and Dasher pay ML models, the reasons being that the estimated QPS would be far lower and the latency requirements were not as strict. Fraud detection and Dasher pay do not need to be calculated nearly as quickly as loading the home page. Starting in March of this year, both fraud and Dasher pay models now officially use Sibyl for predictions. Following the rollout, one big win observed was a 3x drop in latency (versus our old prediction service), a testament to Sibyl’s efficacy.

Concluding rolling out: the end of the beginning

Following the successful rollout of the fraud and Dasher pay models using Sibyl for predictions, over the past couple of months the ML platform team has been continuously adding more models to the service, and the migration of models to Sibyl is nearing completion. All but five models have been migrated and are now calling Sibyl for predictions. To learn more about the migration, check out this new blog post.

The team is continuing to add support for different feature and model types. For example, support for embedded features, which are used primarily in neural networks, has been added. Composite models, models that consist of a chain of submodels and expressions, called compute nodes, have also been added. Although Sibyl’s role as the predictor for DoorDash has just begun, it has already been an exciting and active one!

Acknowledgements:

Thanks to Param Reddy, Jimmy Zhou, Arbaz Khan, and Eric Gu for your involvement in the development of the Sibyl prediction service.

On the front lines of addressing Stay-At-Home orders caused by the COVID-19 pandemic, DoorDash’s Merchant team has been busy ensuring that merchants have what they need to offer delivery, get onboarded onto the platform and that everything is scaled to maximize reliability. In this interview we sat down with Varsha Dudani, an engineering manager in the Merchant Organization to talk about the challenges she and her teams are tackling, how they are helping merchants and how everything is scaling with this huge increase in demand.

Varsha Dudani, Engineering manager for the DoorDash Merchant team

1)  Tell us a little about what teams you lead, and why DoorDash?

A little bit about me: I am a Sr. Engineering Manager with almost 20 years of experience in the industry and about 10 years in engineering management. I have been with DoorDash for more than 2.5 years, and in my time at DoorDash I have led a number of teams across the Merchant organization, which handles the B2B side of the business. As part of the merchant organization, I currently manage four teams:

  • The Order Platform team, which creates and maintains the Order Service that routes, schedules, and maintains every DoorDash order
  • The Developer Platform team, which is in charge of the developer platform and portal used by third-party engineers to integrate with DoorDash for core features that DoorDash offers its merchants
  • The Merchant Selection team, which locates the best merchants in every local market and works with the sales team to help bring them onto the platform
  • The Merchant Finance team, which ensures that every merchant gets paid and can clearly see the value that DoorDash brings to their business

When I think “why DoorDash,” the answer is rooted in the company’s mission and the impact my teams and I can create to achieve it. DoorDash’s mission is to help local economies succeed, and I feel passionately and positively about being a part of that. I have always found that getting things done is essential to one’s success. As a female engineering manager, I have felt extra pressure to prove myself, but rarely have I been in a position to make such a large impact on the business so quickly and to help the merchants my team serves. Despite the recent uncertainty caused by COVID-19, we have still been able to assist hundreds of businesses, and yet I am even more excited about the myriad of opportunities in front of us to make an even larger impact.

It’s not just me that feels so passionately about DoorDash and our mission. The environment at DoorDash overall has a strong sense of optimism and positivity, which I think is reflected in how we’ve seen DoorDash grow from being an underdog to the market leader in food delivery. 

2) What are some of the challenges your teams are working on?

Many of the challenging engineering problems that my teams are working on are on scaling the systems to adapt to a growing business. These are coupled with building some awesome product features that while fun and impactful, can be challenging to execute.

For us to scale effectively, we needed to implement the correct Service Oriented Architecture (SOA) that would allow for quick iterative development. The Merchant organization has been focused on redoing various parts of the system since last year, while continuing to roll out new product features. The most evident challenge here is to change the brakes while the car is still running – making sure we’re making architectural changes to the system, while not slowing down the business or affecting the product. It’s always exciting to work on these types of scale and reliability challenges, which are hard engineering problems, but you end up learning a lot. 

On the product side, our main challenge is to continually offer new features and updates that will help merchants improve their businesses and make it easier for them to use the platform. We look at all our merchant initiatives with a product mindset and are committed to continuously making small improvements every day with new features and updates that aim to satisfy the needs of our wide variety of merchants, from mom and pop shops to enterprises. For example, we recently offered merchants more promotions options they could sign-up for, as well as helping small businesses create their online ordering pages more quickly and efficiently. The rapid rollout of these new initiatives plus the constant improvement of other existing features keeps us busy in our mission to empower the merchants we serve.

3)  What are the DoorDash product and engineering teams doing for Merchants to help them during the COVID-19 pandemic?

Back in March when Stay-At-Home orders first started getting rolled out, we immediately put our heads together as a team and company, and built a list of product initiatives and relief programs that could help our Merchants in these difficult times. We also recognized that for many restaurants having a digital delivery experience was now a priority and they would need access to a platform like ours to attract and retain customers.

Our first and most important task was to make our products and technology more accessible to new and existing merchant partners. We offered a free 30-day trial to new merchants onboarding onto the platform and reduced our commission fee to 0%. Additionally, we offered a 0% commission on all pickup orders and cut our commission by 50% for all small businesses. Given that merchants’ cash flow was impacted, we changed our payout period from weekly to daily.

On the engineering side, our systems needed to scale, remain reliable and continue to fulfill orders in a timely and precise way. We pulled together a cross-functional team to tackle these issues, by building a plan to scale our infrastructure and resolve performance bottlenecks.

It was, and still is, amazing to see the product and engineering teams come together virtually from their homes and deliver on all of these initiatives in just a couple of weeks. While we continue to iterate and improve, we achieved our main goal of pushing out the new product features to make an impact as soon as possible and ensuring our systems were resilient and ready for all incoming traffic.

4) How have we helped merchants onboard during COVID-19?

With the world sheltering in place, and delivery being an essential service, we needed to ensure our platform was fast and easy for new merchants to join. Onboarding a merchant takes a lot of effort and time to set up correctly, and involves setting up the merchant’s store, uploading their menus, and gathering store data. Many merchants also require a DoorDash tablet to partner with us, and shipping these can take some time.

At the outset, our operations team needed help to efficiently manage the demand from merchants looking to onboard onto the platform. In true start-up style, we went all-hands-on-deck to help the merchant operations team. We had employees from every organization within the company help create menus, which can be one of the most time-consuming tasks in the onboarding process, or answer support calls addressing merchant concerns and questions. I witnessed first hand the power of dividing work and multiplying success. The positivity and willingness to lean in and make things happen was electrifying.

Product and Engineering came up with solutions to onboard merchants onto the platform without a tablet. We adjusted our product to take note of a merchant’s phone number when they signed up; this allowed them to receive orders via phone as soon as they were set up on DoorDash, while they waited for a tablet to arrive. Our engineering team really showed a bias for action, one of our core company values, and I am super proud to be a part of this team.

5) What are some future challenges the Merchant team will face post-COVID-19?

Off-premise is here to stay, and I'm excited about the future opportunities for DoorDash. Here are a few examples of areas I believe the teams will be focusing on in the future, including ones we have already started work on:

  • Build scalable and reliable systems: Reliability is our #1 feature. As I already mentioned, we have started the journey to build the correct service-oriented architecture (SOA) at DoorDash. We need to finish this work with scalable and reliable microservices.
  • Bring in the best selection in record time: We hope to use more analytics and ML to find which merchants will give DoorDash the best selection and work to get them on the platform. The dream is that a merchant who wishes to onboard with DoorDash can do so within an hour with the best self-serve products possible.
  • Empower Merchants: We wish to empower our merchants by providing them with great products and technology solutions that let them grow their business and run their operations more efficiently. We are consistently working on new initiatives to improve the merchant experience: white-labeling solutions, merchant promotions, more customer engagement control and visibility, clearer reports and financial numbers, and a world-class developer platform and portal for them to systematically integrate with DoorDash.
  • Empower the Merchant Operations team at DoorDash: We wish to empower our internal Sales and Ops teams by providing them with easy-to-use products and tools to do their jobs. The scale of operations is growing, and it's important to understand how technology can ease the increasing workload on our ops teams and help them out with the best tech solutions.

I am looking for talented frontend and backend engineers who feed off challenging engineering problems, who are owners, doers, leaders, and learners, and who believe in team spirit. If you think that you've got what it takes to build the next world-class company, then come join me on this exciting journey. To learn more about Varsha and her experience leading innovative engineering teams, click here.

In this post, we introduce a method we call CUPAC (Control Using Predictions As Covariates) that we successfully deployed to reduce extraneous noise in online controlled experiments, thereby accelerating our experimental velocity. 

Rapid experimentation is essential to helping DoorDash push key performance metrics forward. Whether improving our store feed ranking, optimizing the efficiency of our logistics system, or evaluating new marketing campaigns, it’s critical for DoorDash to maintain a robust experimentation methodology to ensure new releases improve business metrics in production. Without such rigorous validation, the practice of “test and learn” can quickly devolve into “ship and pray.”

To complicate matters, the metrics we care most about are often very noisy. One such metric is the total time it takes to prepare and deliver an order, which we call ASAP. ASAP is a key metric for us to monitor as delivery speed drives both customer satisfaction and retention. ASAP is very noisy as it varies by merchant type (e.g. quick-service restaurants prepare food quicker than steakhouses), the customer’s distance from the merchant, and current traffic conditions. Such variation lowers the probability of detecting improvements (i.e. the power of the test) driven by new product features and models in an experiment. This makes it difficult for us to conclude whether observed changes in ASAP are real or are merely fluctuations driven by random chance.

To mitigate this issue we developed and deployed CUPAC. CUPAC is inspired by the CUPED methodology pioneered at Microsoft (Deng, Xu, Kohavi, & Walker, 2013), extending it to leverage machine learning predictions built using inputs unaffected by experiment intervention. This approach has proved powerful in practice, allowing us to shorten our switchback tests by more than 25% while maintaining experimental power.

What is Variance Reduction?

Strategies that attempt to reduce the variance of a target metric are known as variance reduction techniques. Common variance reduction techniques include stratification, post-stratification, and covariate control. CUPAC falls into the covariate control category. To more clearly illustrate what variance reduction seeks to accomplish, consider the distributions for the test and control observations in Figure 1 below:

Figure 1: Variability makes it more difficult to detect differences between treatment and control. These distributions are purely illustrative

In this example we see substantial overlap between the treatment and control distributions of ASAP. All else held equal, such overlap will make it difficult to detect whether a product change meaningfully reduces delivery time. 

In Figure 2, we can see that if we were able to explain away a portion of the variation in ASAP using factors that have nothing to do with our experimental intervention, this overlap would decrease. Such an improvement in the signal-to-noise ratio makes it much easier to identify the treatment effect:

Figure 2: Reducing variability makes experimental changes easier to detect. As before, these distributions are only illustrative

In the case of ASAP, the features we can use to explain such irrelevant variability include historical dasher availability, historical merchant food preparation times, and the expected travel time between the merchant and the consumer given typical traffic conditions. 

Reducing Variance using Linear Models

The standard t-test for a difference in population averages can be generalized to a regression setting using an approach known as ANOVA (Analysis of Variance). In the most basic version of this approach, the outcome variable Y is regressed on the treatment indicator T. The estimated value of the coefficient on T, typically denoted as β̂, is then compared to its own standard error. We conclude that the treatment effect is statistically significant if this ratio is sufficiently large.

This regression approach can then be extended to include control variables that help explain the variation in Y not due to the treatment T. The extension is straightforward and involves adding a vector of covariates X to the regression of Y on T. Note that for our measurement of the treatment effect to remain valid under this extension, each of these control variables must be independent of our treatment T.

To make this concrete, let's again consider measuring changes in ASAP times. A potential control variable for such a test would be the travel time between the restaurant and the consumer (as estimated at the time of order creation). Under the assumption that our treatment is assigned randomly and that it does not meaningfully affect road conditions, this variable should be independent of our treatment T. In addition, it should have significant explanatory power over the ASAP time, Y. As a result, the standard error on β̂, the coefficient on T, will be significantly smaller after the introduction of this covariate, making it easier for our test to achieve statistical significance.
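As a toy illustration of this covariate adjustment (a minimal sketch on simulated data; the variable names, effect sizes, and noise levels are invented for illustration and are not taken from DoorDash data), regressing a noisy outcome on the treatment alone versus on the treatment plus a travel-time covariate shows how the covariate shrinks the standard error on the treatment coefficient:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

# Simulated deliveries: travel time is independent of the (random) treatment assignment.
travel_time = rng.gamma(shape=4, scale=5, size=n)            # minutes
treatment = rng.integers(0, 2, size=n).astype(float)
asap = 15 + travel_time - 1.0 * treatment + rng.normal(scale=8, size=n)

# Regress Y on T alone, then on T plus the travel-time covariate.
fit_t = sm.OLS(asap, sm.add_constant(treatment)).fit()
fit_tx = sm.OLS(asap, sm.add_constant(np.column_stack([treatment, travel_time]))).fit()

print("SE on T without covariate:", fit_t.bse[1])
print("SE on T with covariate:   ", fit_tx.bse[1])   # noticeably smaller
```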

Using Predictions as Covariates in Linear Models

CUPED (Controlled-experiment Using Pre-Experiment Data) delivers variance reduction through the inclusion of control variables defined using pre-experiment data. The key insight is that pre-experiment data is collected before the randomly assigned treatment is introduced, so variables built from it must be uncorrelated with the treatment assignment and are therefore permissible to include as controls.

Building on this idea, we defined a variance reduction method we call CUPAC that uses the output of a machine learning model as a control variable. CUPAC involves using pre-experiment data to build a model of our outcome variable Y using observation-level features. As long as these features are uncorrelated with the treatment T during the experimental period, this estimator ŷ, as a function of these features, will also be uncorrelated with T. As a result, it is permissible to include such an estimator as a covariate.

The amount of variance CUPAC removes scales with the square of its out-of-sample partial correlation with the outcome variable Y, given the other control variables. When improving model performance (hyperparameter tuning, feature engineering, etc.), we therefore recommend aiming to maximize the partial correlation between the prediction covariate (CUPAC) and the target metric.
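A minimal sketch of this idea on simulated data (the feature names, effect sizes, and model settings are illustrative assumptions, not DoorDash's production setup): fit a gradient-boosted model on pre-experiment deliveries, score the experiment-period deliveries, and include that prediction as a covariate alongside the treatment indicator.

```python
import numpy as np
import lightgbm as lgb
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_pre, n_exp = 5000, 5000

def simulate(n, treatment=None):
    # Three features unaffected by treatment (e.g. travel time, prep time, dasher availability).
    X = rng.normal(size=(n, 3))
    effect = 0.0 if treatment is None else -2.0 * treatment   # a 2-minute improvement
    y = 30 + 5 * X[:, 0] + 3 * X[:, 1] + effect + rng.normal(scale=4, size=n)
    return X, y

# 1. Train the control model on pre-experiment data only.
pre_X, pre_y = simulate(n_pre)
control_model = lgb.LGBMRegressor(n_estimators=200)
control_model.fit(pre_X, pre_y)

# 2. During the experiment, use the model's prediction as the CUPAC covariate.
treatment = rng.integers(0, 2, size=n_exp).astype(float)
exp_X, exp_y = simulate(n_exp, treatment)
cupac = control_model.predict(exp_X)

naive = sm.OLS(exp_y, sm.add_constant(treatment)).fit()
adjusted = sm.OLS(exp_y, sm.add_constant(np.column_stack([treatment, cupac]))).fit()
print("SE on T, no covariate:   ", naive.bse[1])
print("SE on T, CUPAC covariate:", adjusted.bse[1])   # smaller, so smaller effects become detectable
```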

Using an ML-based covariate represents an improvement over competing control strategies for multiple reasons. First, an ML-based encoding of our outcome variable generalizes CUPED to situations where the relevant pre-experiment data is not clearly defined. For example, in logistics experiments at DoorDash, pre-experiment data does not exist at the delivery level in the same way that it does for a market- or customer-level test. While there are multiple pre-experiment aggregates that can be used as proxies (e.g., the average value of the outcome Y during the same hour the week before), all such averages can be viewed as simple approximations to an ML-based control model specifically designed to maximize the partial correlation with the outcome variable.

Second, an ML-based encoding of our outcome variable will capture complex relationships between multiple factors that a set of linear covariates will miss. In nature, nonlinearities and complex dependencies are not the exception, but the rule. ML models such as ensembles of gradient-boosted trees are uniquely suited to capture such complex interaction effects.

Finally, an ML-based approach to control can reduce the computational complexity of variance reduction. Prior to implementing CUPAC, we had used multiple categorical covariates in our experiment analysis. Because regression analysis requires categorical variables to be one-hot encoded, excessive cardinality can make the required calculations computationally expensive. By replacing a large number of variables with a single ML-based covariate, we are able to significantly reduce runtime.

Increasing Experimental Power using CUPAC

In offline simulations, CUPAC consistently delivers power improvements across a variety of effect sizes. The graphic below shows how CUPAC reduces the time required to detect a 5 second ASAP change with 80% power relative to a baseline model with no controls. We include results for each of the 4 random subsets of markets we currently use for switchback testing. On average, CUPAC drives a nearly 40% reduction in required test length vs. baseline and a 15-20% improvement when compared to alternative control methods. While the magnitude of these effects varies across market groups, in each instance CUPAC proves itself to be the most powerful control method.

Figure 3: CUPAC reduces the time required to detect a 5s ASAP change with 80% power

In Figure 4, we build upon the above analysis by plotting simulated confidence intervals for an A/A switchback test run on similar DoorDash data. As above, we see that CUPAC leads to lower uncertainty in our estimation of the experimental effect than a model with one-hot encoded regional features. However, the difference between the confidence intervals of the two methods looks smaller than the improvement in Figure 3 might suggest. This is because the amount of data required to measure a given effect varies approximately as the inverse of the sample variance, whereas the width of the confidence interval varies as the sample variance's square root. Such a nonlinear relationship implies that variance reduction can greatly accelerate the rate of testing even when its impact on confidence interval width is modest.
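To make this scaling concrete (treating the roughly 40% reduction in required test length above as an illustrative input), note how the required sample size n and the confidence interval half-width depend on the residual standard deviation σ, where δ is the effect size we want to detect:

$$
n \;\propto\; \frac{\sigma^2}{\delta^2},
\qquad
\text{CI half-width} \;\propto\; \frac{\sigma}{\sqrt{n}}
$$

So cutting the required sample size by 40% at fixed power corresponds to a standard deviation ratio of √0.6 ≈ 0.77, that is, a confidence interval only about 23% narrower at a fixed sample size.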

Figure 4: CUPAC delivers tighter confidence intervals than a model with one-hot encoded region features

Conclusion

As DoorDash grows, being able to measure incremental improvements in our logistics system becomes increasingly important. Through variance reduction, CUPAC allows us to detect smaller effect sizes and conclude our experiments more quickly.  We are currently using CUPAC on all of our experiments with our dispatch system and are looking to expand this to additional areas of the business. 

In our next post, we’ll deep dive into important aspects of building and maintaining such models, such as identifying and handling feature endogeneity and monitoring model deterioration. Stay tuned!

Acknowledgements

We wish to thank Sifeng Lin and Caixia Huang for their significant contributions to this project. We are also grateful to Alok Gupta, Raghav Ramesh, and Ezra Berger for their detailed feedback on drafts of this post. 

References

Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. WSDM 2013.

Huizhi Xie and Juliette Aurisset. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. KDD 2016.

Simon Jackson. How Booking.com increases the power of online experiments with CUPED. Booking.com, 2018.

David Kastelman and Raghav Ramesh. Switchback Tests and Randomization under Network Effects at DoorDash. 2018.

Yixin Tang and Caixia Huang. Cluster Robust Standard Error in Switchback Experiments. 2019.

Rajat Shroff, VP of Product at DoorDash

Hi Rajat, tell me about what you do at DoorDash.

I run Product and Design. The teams are responsible for developing product strategy and executing visions to help DoorDash achieve its mission of helping local communities and businesses reach their full potential.

What does the DoorDash Product team work on?

There are two aspects of the product team's work. One is based on the audience group we support: Consumers, Merchants, and Dashers. The Consumer team is focused on helping consumers get their food faster and at a more affordable price. The Merchant team is all about helping merchants grow their business and operate more efficiently. The Dasher team enables our Dashers to earn more on the DoorDash platform with flexibility.

Another set of teams is based on the new vertical products that touch on all three audiences in the DoorDash ecosystem. Some examples are Catering, Convenience, and Grocery.

What made you decide to join DoorDash 3 years ago?

Over the past 10 years, I worked on several products that helped small businesses find ways to grow. It's often harder for small businesses to capitalize on the ROI of online systems like SEO and ads. Most existing solutions require merchants to heavily discount or advertise their products, and merchants don't have the flexibility they'd like to attain specific goals.

I’ve been looking for a company that lives by a merchant-first approach and found that DoorDash was exactly that. DoorDash provides the flexibility to merchants to solve their unique challenges, whether it’s about bringing new customers, creating loyalty programs, introducing new products, expanding to catering, building their own marketing channels, etc.

You've led product at many other tech companies in the past. What makes our product org stand out?

First of all, we made an intentional decision to keep the product and design team small, so that each of us has a tremendous amount of leverage. The impact each of us makes should be visible from space.

Secondly, we don't think about what we build as a "product." Rather, we think about what we build as a "service." We know that our customers come to DoorDash not because of the app itself, but because they need a service. For consumers, it means getting their food on time from their favorite merchants. For merchants, it means getting more orders from more customers and growing their businesses. For Dashers, it means earning more money as efficiently as possible.

To deliver a service-driven product, you have to collaborate closely with the Sales, Support, and Operations teams. Every PM and designer needs to understand operational challenges and then use technology to scale the solution. And a lot of times, this doesn't happen while sitting behind a desk. Many of our teams spend time with each audience in the field. For example, our merchant team would work behind the counter or spend days in the kitchen to observe a merchant's operational challenges. Our Dasher team would ride along with Dashers to understand the challenges around parking, pickup, and dropoff processes.

Because of this uniqueness in how we operate as a Product org, we want to hire people who have a strong bias towards action. They need to have ambition, but they should be able to break up problems into small parts to test their hypotheses faster. Curiosity is another important trait, as they need to know the ins and outs of their customers. As we are tightly knit with cross-functional teams on all the problems we solve, the ability to collaborate is also critical.

How would you describe the culture of DoorDash Product and Design?

First and foremost, a strong sense of ownership. The leadership team provides goals, and from there, how each team achieves its goals truly comes from the bottom up. PMs and designers obsess about their audience, define the strategy, and drive the solution.

Another unique aspect of our culture is speed, breakneck speed. We spend a lot of time prioritizing so that the execution itself can go fast. We create projects fast, launch them fast, and also kill them fast if they don't work. Agility and resilience have been in our DNA from day one.

How do you work with the Design team day-to-day?

I'm quite closely plugged into the design process end to end, from the product briefs where we vet problem statements as a team, to the design reviews where we iterate on design strategy and execution, to the ship reviews where we ensure the quality of the product we're shipping. I also work closely with the research team to understand customer insights.

Besides those recurring reviews, I also enjoy strolling around the design team's area in the office. This is when I hear a lot of interesting new ideas from designers in more informal settings. I'm often impressed by the team thinking beyond what's on the roadmap to bring more delight to our customers, and I feel excited to help bring those visions to life.

What do you think the Design’s role is in the product organization?

The design team members are the keepers of customer delight and customer love. They are the ones closest to what the customer touches at the end of the day. I believe Design's role is to keep the rest of the org honest in building the right user experience as our product advances.

At DoorDash, designers are highly encouraged to bring a strong opinion and challenge their cross-functional partners. I value some healthy friction in debates as it uncovers new areas and broadens our perspectives. I also look to the design team to show the rest of the company how to apply customer love to drive a delightful experience end to end.

Looking back a year, what excites you about the Design team at DoorDash?

Over the past year, the team has literally tripled in size, and we have brought a lot of new talent into DoorDash. Now there's a robust process that enables us to move faster while keeping the design standard high. The design team has also found a strong voice within the company, and they feel more empowered to bring their perspectives into decision-making processes.

All of these make me very excited about where the Design team is now as well as what it’ll become in the future.

Do you have any advice for Design/Content/Research candidates who are considering joining DoorDash?

The problems we’re solving at DoorDash are difficult and complex, involving multi-sided audiences, online and offline. And we provide lots of high leverage opportunities for folks to come and invent new things and make an impact that’s visible from space. Adding to that, our mission of uplifting our local communities is very relevant for the current climate we are in. If you’re ready to learn a lot from amazing design leaders and make a massively outsized impact, this will be the best team to join!

Thanks, Rajat for your time! 🙂


Please learn more about other leaders at DoorDash:

Christopher Payne — Chief Operating Officer
Radhika Bhalla — Head of UX Research
Tae Kim — UX Content Strategist Lead
Will Dimondi — Manager for Merchant Design

Tae Kim, UX Content Strategist at DoorDash

Hey Tae, how’s it going?

It’s going well. How are you doing this Friday?

I’m well, glad we’re able to chat today. So what do you do at DoorDash?

I sit on the design team as a UX content strategist, and my role is to help partners build better products. I achieve this by thinking about their goals, specifically from the user experience standpoint, especially as it relates to the content. First, I partner with product and design peers to help understand and define the heuristics of the product we're trying to create. It's also about understanding the potential impacts these ideas are going to have on the content side. If a product is very complex and it's going to require a lot of explanation in the form of things like tooltips, landing pages, and FAQs, I try to help identify those needs as early in the ideation process as possible. Second is copywriting, which is a secondary aspect of my job. That's one distinction I try to make: copywriting is only 10% of my responsibilities. Ultimately, my role is really to help build the product, and I think that's a misconception that a lot of people have about my position. They think we're just writers, but we're ultimately product people.

Interesting. So from what I understand, you previously ran a business, right?

I did! I had a small business. It was a tiny little coffee bar in the Inner Sunset neighborhood of San Francisco.

And so how did that unfold and lead you to becoming a content strategist?

That was kind of a departure from my initial career, which started in games journalism. I was an editor for a gaming magazine for a long time. I then moved into other roles around audience engagement. The desire to run a small business stemmed from needing a break from what I was doing at the time. My parents were also restaurateurs who had a series of restaurants in San Francisco and Oakland, so I just wanted to see that side of the family business.

And how it relates to my career as a content strategist is very interesting. Building user experience is essentially what small business owners do. We think of them as people who provide a service, but what they offer are experiences, right? They think about areas of the service layer, like what type of experience the customer is having when they come through the door and what kind of products they’re offering.

The content strategy role is very similar in that, again, we’re helping to build products from the user experience perspective. Thinking about what happens when a customer enters the product, what kind of experience and expectation do they have, what are we offering to them, and what kind of impressions are we leaving them with?

The best latte art Tae ever poured.

That's a great perspective to have.

Running a business of my own also helped me better understand the needs of small business owners and how they think, and that sense of empathy helps me serve the merchants who use the DoorDash platform.

Absolutely, and that's been hugely beneficial to my team and other folks who work on the merchant side. That said, what ultimately made you decide on DoorDash?

The opportunity to work across the three-sided marketplace that we serve. There are fascinating and differentiated problems — whether you’re talking to a customer, one of our Dashers, or one of our merchants.

DoorDash’s mission is also something that enticed me as well. When I was in the interview phase, I thought a lot about my parents and how their business might have benefited from a service like DoorDash, and that made me excited to work on a platform that could have a positive impact on merchants.

That’s great to hear. With that in mind, what makes DoorDash unique from places you’ve previously worked or tech companies that are in a similar space?

First, I love the open and collaborative culture that we have. I think a lot of companies claim that everyone has a voice and a seat at the table, but DoorDash sincerely lives by that principle. Second, it’s the high caliber level of talent we have across the board. It’s been great for me to come in and have this top bar of excellence to shoot for, and to be surrounded by so many amazing partners that I can learn from.

I recently finished my first year at DoorDash, and I feel I've really had opportunities to improve my skill set and knowledge base, and to learn how to operate as a successful individual contributor, which has been great.

Because you mentioned the collaborative nature of DoorDash, how do you think the peers closest to you would describe your working style and their relationship with you?

I would hope that they would say that I’m a very collaborative and patient partner and note the genuine sense of curiosity that I bring to every project. I ask a lot of questions to help clarify what we’re trying to do, but also to gain the context that I need.

Lastly, I would also hope they say that I’m a firm advocate for the user, but that I also try to have a comprehensive perspective and help ensure that we’re meeting the business objectives as well.

I would agree with that statement! You’re also in a somewhat unique position in comparison to some other teams on the design side because you’re the sole content strategist on the team. How do you remain so diligent being an army of one?

It’s tough! I think one skill that I’ve rapidly developed is my ability to ‘context switch’ very rapidly. We always have a lot of projects going in the different parts of the three-sided marketplace, so the ability to seamlessly transition from thinking about one unique set of problems to another has been vital. Also, being smart about resourcing and bandwidth has been key. Shameless plug here, but this is one of the big reasons why we’re trying to hire for my team aggressively!

So given that, what type of person are you looking to hire to bring onto your team?

That’s an interesting problem we’re trying to solve. The term ‘content strategist’ applies to a lot of different roles and leads to confusion on the type of person we’re looking to hire.

For me, it's a very product-focused role, so I'm looking for someone who understands product and has a good eye for how we can tailor solutions to our users. Having a great design sensibility is useful as well because it ensures our team can evaluate a design approach and understand how design decisions impact the user experience from a content perspective.

Excellent collaboration skills are also essential. This role entails a lot of discussion, strategy, and ultimately working with partners to take a lot of different inputs and meld them together into an overall solution.

This person also needs the ability to articulate their position and why they're advocating for certain things, such as product specs or design changes. This is especially key because of the perception problem I noted earlier about content strategists being viewed as just writers.

Finally, just being a very qualified and creative content person is very necessary. At the end of the day, when we create words that appear in the product, you do need that creative streak to be able to find the right words and angles that are the best for the user and the business.

Really good to know. What is some advice that you would give potential candidates that are looking to apply for a content strategist role on your team?

Something I struggled with, and have been trying to get better about, is having a great portfolio. Documenting your work and telling a comprehensive narrative about the problem you were trying to solve, your strategies, and the final results is key.

Also, ensuring that you understand how products are built and how they work to engage users. Your work shouldn’t just focus on the words but should show attention to the overall user experience as a whole.

That said, what type of culture would you like to instill amongst your newly founded team?

Well, I’m fortunate enough to have a great blueprint to follow. Helena, who leads our Design team, is building the exact culture that I would try to foster — one that’s very open, collaborative, supportive, and empathetic.

I also really appreciate that we're a very flat org, and we're not title obsessed. No one thinks about seniority or deferring to other people just because they might have been here longer or their level is higher. Everyone is treated as an equal, and that's key for me as well.

Glad to see how you’ll distill this to the team, but enough about DoorDash. Tell me a little bit about what you like to do on the personal side. You’re into ice hockey and video games, right?

Yeah, hockey is a tremendous passion of mine. I particularly enjoy it because as you get older, your life tends to revolve around things that you’re good at, like your career and interpersonal relationships with your family. So finding opportunities to improve at something becomes more difficult. I’m also a diehard fan of the San Jose Sharks, so it was a natural extension of my fandom.

Video games are also another passion of mine because of my previous experience in the games industry. They offer some amazing narratives and they’re just fun to play.

Interesting. So, random question, which I think you're the only one who could ever answer. What has faster momentum, a game of ice hockey or a day working at DoorDash?

*laughing* You know, what’s funny is they’re very related in this weird way. Both require you to think about so many different factors on the fly and juggle many things. DoorDash usually has tight timelines, so you have to move very smartly and quickly, and hockey is the same. You have to think about your skating, your stick, the puck, and what everyone else is doing on the ice, and what you’ll be doing three moves from now.

I think of the two, I’m much better at my job, but that’s only because I’m terrible at hockey right now. But they’re also very rewarding in their unique way.

That’s an intriguing way to put it! Also, what’s your favorite food, and why?

The obvious answer I’m going to give you is anything my mom makes for me. She ran restaurants for 30-plus years and I’m still fortunate enough to live close to her so I can go over and get a great home-cooked meal every so often. Her kimchi is one of my favorite things in the world by far.

In terms of going out, I would say a burrito is the platonic ideal of the perfect food. It’s this self-contained thing that has everything you could ask for in one convenient package.

So, this might be a little bit of a controversial topic, especially in San Francisco, but who makes the best burrito here?

My favorite place to go to for burritos is Taqueria Cancun in the Mission District.

Good call, even though I’m a fan of Farolito 🙂 Alright, one last question. What is something that someone would learn about you as they get to know you, but wouldn’t necessarily know from meeting you at first?

Oh dang, that’s a great question. I’m a pretty open book, and I don’t hold anything back. I genuinely try and be the same person on day one as day 100, so I don’t have a lot of secrets.

Well, I would argue that, as one of your design counterparts, it is that you’re way more knowledgeable of UX than most content strategists.

So, all that credit is due to the design partners who’ve taught me everything I know about UX. Working with very talented designers at companies like DoorDash and Groupon, I get an excellent education in design principles that most people would pay tens of thousands of dollars for.

What a great way to end this interview! Is there anything else you’d like to add?

Yes, I’m hiring! So if you are a content person who understands product, and you want to help build something amazing and work with some talented people, apply through this link!


For those readers who are interested in joining Tae’s awesome content team, please check out this link:

Content Strategist

To learn more about other design leaders at DoorDash:

Radhika Bhalla — Head of UX Research
Will DiMondi — Manager for Merchant Design

DoorDash uses machine learning (ML) in many places, such as inputs to Dasher assignment optimization, balancing supply and demand, fraud prediction, search ranking, menu classification, and recommendations. As the usage of ML models increased, there grew a need for a holistic ML Platform to increase the productivity of shipping ML-based solutions. This kick-started an effort to build an ML Platform for DoorDash.

The ML Platform consists of two critical pieces: first the infrastructure needed for ML to work at scale, and second a productive environment for engineers and data scientists to build their models. Scalability and productivity are the key driving factors in the decision making process for us.

Scenarios and Requirements

As we dug into ML usage at DoorDash, the following key scenarios for ML emerged:

  • Online models – This is the scenario where we make predictions live in production in the critical path of the user experience. In this scenario the models and frameworks need to be performant and have a low memory footprint. This is also where we need the deepest understanding of both the modeling frameworks and the service frameworks. Consequently, this is where the restrictions on which ML frameworks to support and how complex models can be are the most stringent. Examples at DoorDash include food preparation time predictions, quoted delivery time predictions, search ranking, etc.
  • Offline models – These predictions are used in production, but are not made in the request/response path. In this scenario runtime performance is secondary. Since these predictions are still used in production, we need the calculations to be persisted in the warehouse. Examples at DoorDash are demand predictions, supply predictions, etc.
  • Exploratory models – This is where people explore hypotheses, but the model and its output are not used in production. Use cases include exploring potential production models, analysis for identifying business opportunities, etc. We explicitly do not place any restrictions on frameworks here.

These brought out the following requirements for the ML Platform:

  • Standardizing ML frameworks: Given the number of ML frameworks available, for example LightGBM, XGBoost, PyTorch, and TensorFlow, it is hard to develop deep expertise within a company for many of them. So there is a need to standardize on a minimal set of frameworks that covers the breadth of use cases typically encountered at DoorDash.
  • Model lifecycle: Support for the end-to-end model lifecycle, consisting of hypothesizing improvements, training the model, preserving the training scripts, offline evaluation, online shadow testing (making predictions online for the sole purpose of evaluation), A/B testing, and finally shipping the model.
  • Features: There are two kinds of features we use. One kind is request-level features, which capture request-specific information, for example the number of items in an order, the request time, etc. The second kind is environmental features, which capture the environment in which DoorDash is operating, for example average wait times at a store, the number of orders in the last 30 minutes at a store, the number of orders from a customer in the last 3 months, etc. Environmental features are common across all requests. We need a good way to compute and store environmental features.

Standardizing on Supported ML Frameworks

The first step towards an ML Platform was to standardize the ML frameworks which will be supported.

Supporting any framework requires a deep understanding of it, both in terms of the API it provides and its quality and performance tuning. As an organization we are better off knowing a few frameworks deeply than many in a shallow fashion. This helps us run better services for ML as well as leverage organizational know-how.

The goal was to arrive at the sweet spot where we make appropriate tradeoffs in selecting frameworks. For example, if some pre-trained model exists in a framework we do not currently support, and building an equivalent is going to take considerable effort, it makes sense to support that additional framework.

After completing an internal survey of currently used model types and how they might evolve over time, we concluded that we need to support one tree-based model framework and one neural-network-based modeling framework. Also, given the standardization of DoorDash's tech stack on Kotlin, we needed something with a simple C/C++ API at prediction time that could hook into the Kotlin-based prediction service using JNI.

For tree-based models we evaluated XGBoost, LightGBM, and CatBoost, measuring the quality of the models (using PR AUC) and training/prediction times on production models we already have. The accuracy of the models was almost the same for our use cases. For training, we found that LightGBM was fastest. For predictions, XGBoost was slightly faster than LightGBM, but not by a huge margin. Given that the current set of models was already in LightGBM, we ended up selecting LightGBM as the framework for tree-based models.

For neural network models, we looked at TensorFlow and PyTorch. Here again, for our use cases we did not find a significant difference in the quality of the models produced by the two. PyTorch was slower to train on CPUs compared to TensorFlow; however, on GPUs both had similar training speeds. For predictions, both had similar predictions-per-minute numbers. We then looked at the API sets for TensorFlow and PyTorch for both training and prediction time and concluded that PyTorch offered a more coherent API set. With the launch of TorchScript C++ support in PyTorch, we had the right API set needed to build the prediction service using PyTorch.
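As a minimal sketch of the serving path this enables (the model architecture and file name below are illustrative, not DoorDash's production model), a PyTorch model can be exported as a TorchScript artifact in Python and later loaded through the C++ API (torch::jit::load) behind a JNI bridge in a Kotlin service:

```python
import torch

# A toy model standing in for a real prediction model.
class PrepTimeModel(torch.nn.Module):
    def __init__(self, num_features: int = 8):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_features, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = PrepTimeModel().eval()
scripted = torch.jit.script(model)          # compile to TorchScript
scripted.save("prep_time_model.pt")         # artifact the C++/JNI prediction path can load

# Sanity check: the scripted module produces the same output as the original.
example = torch.randn(1, 8)
assert torch.allclose(model(example), scripted(example))
```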

Pillars of the ML Platform

After the ML framework decision, based on the prediction scenarios and requirements, the following four pillars emerged:

  1. Modeling library – A Python library for training and evaluating models, creating model artifacts that can be loaded by the Prediction Service, and making offline predictions.
  2. Model Training Pipeline – A build pipeline where models are trained for production usage. Once a model training script is submitted into the Git repo, this pipeline takes care of training the model and uploading the artifacts to the Model Store. The analogy here is that if the modeling library is the compiler that produces the model, then the model training pipeline is the build system.
  3. Features Service – To capture the environment state needed for making predictions, we need feature computation, feature storage, and feature serving. Feature computations are either historical or in real time.
  4. Prediction Service – This service is responsible for loading models from the Model Store, fetching features from the Feature Store, evaluating the model upon getting a request, generating prediction logs, and supporting shadowing and A/B testing.

Architecture of the DoorDash ML Platform

Based on the above, the architecture for the online prediction flow (with brief descriptions of the components) looks like this:


ML platform high level architecture

Feature Store – A low-latency store from which the Prediction Service reads the common features needed for evaluating a model. It supports numerical, categorical, and embedding features.

Realtime Feature Aggregator – Listens to a stream of events, aggregates them into features in real time, and stores them in the Feature Store. These are features such as the historical store wait time over the past 30 minutes, recent driving speeds, etc.
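As a rough illustration of what such an aggregation could look like (a minimal sketch only; the Redis-backed store, key names, and event handler here are hypothetical, not DoorDash's actual implementation), a rolling 30-minute average of store wait times can be maintained from a stream of events:

```python
import time
import redis  # standing in for a hypothetical low-latency feature store

feature_store = redis.Redis(host="feature-store", port=6379)
WINDOW_SECONDS = 30 * 60

def on_wait_time_event(store_id: int, wait_seconds: float) -> None:
    """Consume one wait-time event and refresh the store's 30-minute rolling feature."""
    events_key = f"store:{store_id}:wait_events"
    now = time.time()

    # Keep recent observations in a sorted set scored by timestamp.
    feature_store.zadd(events_key, {f"{now}:{wait_seconds}": now})
    feature_store.zremrangebyscore(events_key, 0, now - WINDOW_SECONDS)

    # Recompute the rolling average and publish it as an environmental feature.
    members = feature_store.zrangebyscore(events_key, now - WINDOW_SECONDS, now)
    values = [float(m.decode().split(":")[1]) for m in members]
    avg_wait = sum(values) / len(values) if values else 0.0
    feature_store.hset(f"store:{store_id}", "wait_30m", avg_wait)
```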

Historical Aggregator – Runs offline to compute features that are longer-term aggregations, such as over one week or three months. The results are stored in the Feature Warehouse and also uploaded to the Feature Store.

Prediction Logs – These store the predictions made by the Prediction Service, including the features used at prediction time and the ID of the model that made the prediction. They are useful for debugging as well as for training data for the next model refresh.

Model Training Pipeline – All production models are built with this pipeline. The training script must be in the repository. Only this training pipeline has access to write models into the Model Store, which generates a trace of changes to the Model Store for security and audit purposes. The training pipeline will eventually support periodic auto-retraining of models as well as auto-deployment and monitoring. This is the equivalent of a CI/CD system for ML models.

Model Store – Stores the model files and metadata. The metadata identifies which model is currently active for certain predictions and defines which models are getting shadow traffic.

Prediction Service – Serves predictions in production for various use cases. Given a request with request features, context (store ID, consumer ID, etc.), and a prediction name (optionally including an override model ID to support A/B testing), it generates the prediction.
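To tie the components together, here is a hypothetical sketch of how such a request might be served. The real prediction service described above is Kotlin-based; this Python version, and every name in it, including the feature keys and model file, is an illustrative assumption rather than DoorDash's actual API:

```python
import lightgbm as lgb
import numpy as np
import redis  # hypothetical backend for the Feature Store

feature_store = redis.Redis(host="feature-store", port=6379, decode_responses=True)
model = lgb.Booster(model_file="prep_time_v3.txt")   # artifact produced by the training pipeline
FEATURE_NAMES = ["num_items", "subtotal", "store_wait_30m", "store_avg_prep_7d"]

def predict(prediction_name: str, context: dict, request_features: dict) -> float:
    """Merge request-level features with environmental features, then score the active model."""
    store_key = f"store:{context['store_id']}"
    environmental = {
        "store_wait_30m": float(feature_store.hget(store_key, "wait_30m") or 0.0),
        "store_avg_prep_7d": float(feature_store.hget(store_key, "avg_prep_7d") or 0.0),
    }
    row = {**request_features, **environmental}
    features = np.array([[row[name] for name in FEATURE_NAMES]])
    prediction = float(model.predict(features)[0])
    # In the described architecture, the features, model ID, and prediction would also
    # be written to the prediction logs here.
    return prediction
```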

We are just starting to execute on this plan, and there is still a lot of work to do in building, scaling, and operating it. If you are passionate about building the ML Platform that powers DoorDash, do not hesitate to reach out to us.


Acknowledgments:

Cody Zeng, Cem Boyaci, Yixin Tang, Raghav Ramesh, Rohan Chopra, Eric Gu, Alok Gupta, Sudhir Tonse, Ying Chi, and Gary Ren

Kevin Fu joined DoorDash in 2015 on the Bay Area operations team. Now, 5 years later, he is the Director of Product leading the Logistics and Dasher team. With a background in both operations and product management, Kevin has experience working across deeply cross-functional teams to drive improvements in delivery quality and Dasher efficiency. We interviewed Kevin to share what it’s like to ship products in a fast-paced environment, and solicit his advice on how to accelerate your career in product.


You’ve had an impressive career trajectory here at DoorDash – starting on the Ops team to now leading a critical product area as the head of Logistics. Tell us a little about your tenure here and how it has led you to where you are today.

I started my career here on the operations team. We were a small, scrappy team, which was a big change from the work I had done previously in consulting. On my first day, I built an Excel model to prioritize which restaurants our sales team should sell. I'll always remember my manager looking at the work and telling me, "You're working with one sales representative, you should go out there and sell restaurants yourself" – this has stuck with me to this day. I spent almost all my time as an operator doing things that I knew were solving a Dasher, customer, or merchant problem, whether that was texting Dashers to let them know it was a busy time for deliveries or making hundreds of menus so we could onboard merchants more quickly.

In 2016, I worked on a product feature aimed at improving quality on our larger catering orders. Initially, we had been assigning Dashers to these orders in real time, and soon realized that some Dashers didn't have the right size vehicle or the equipment to handle these high-volume orders. We decided to start manually assigning our top and best-equipped Dashers to these catering orders the day before, via a sign-up sheet on Google Sheets. We eventually productized that workflow and the feedback loop from a few early merchants, and seeing this product feature we built continue to increase quality was what first got me hooked on product work.

Since then, I’ve managed products on the Support, Dispatch and Dasher teams, and most recently am leading the Logistics org. From the earlier days to now, I still think about my job in a similar way – staying close to our audiences and solving their problems. 

What are some of the similarities and differences between being an operator and being a Product Manager (PM)?

Both operators and PMs need to be incredibly goal oriented. We set and hit aggressive goals and have to adapt as those goals shift beneath our feet – currently, our goals have shifted as we're working to better support customers, merchants, and Dashers during the COVID-19 outbreak. As a PM, you need to focus on hitting those goals, but how you get there matters just as much.

For example, we're working to reduce the frequency of cancellations on restaurant orders. One reason cancellations happen is that soon after placing an order, customers change their mind about an item or two. They aren't able to adjust their order and are instead forced to cancel, a notably awful customer experience. A quick way to improve this is to delay every order by around a minute, so that if a customer wants to add, remove, or modify an item, they're easily able to. But delaying every order can turn into another bad experience for the customer, so we're investing in tools that give consumers the ability to modify orders all the way up until pickup time without interfering with merchant order protocols and POS integrations. This route might take longer and is more complex, but it ultimately achieves the goal the right way: customers get their order just as fast, they can modify the order seamlessly, and all the complexity of making it happen is on the backend.

Why is the Logistics team an exciting team to join?

The most exciting thing for me is the engagement we get from the Dashers. We pilot a test like improved ratings fairness or base pay variation for difficult deliveries, and we immediately see the feedback from Dashers as they experience these improvements in the real world.

Another example of how our team’s work has an immediate and meaningful impact on Dashers is the launch of Dasher Rewards. We believe that our top rated Dashers should be rewarded with unique perks, which is now especially imperative to our Dasher community during the COVID-19 pandemic. Now more than ever, there is a real sense of mission in doing what we do and it is very exciting to see our work having an immediate impact on Dashers nationwide during a time of need. We have a reddit page if you want to check out the many problems we help Dashers solve on a daily basis.

What’s the long term product vision for your team?

First, to be such a reliable, affordable and comprehensive service that DoorDash serves as a true utility to our customers. This means every delivery needs to be seamless and on time, all the way down to the correct sides and sauces. We also need to be hyper-efficient, with the potential to deliver 5+ separate orders in an hour – leading to a more affordable cost per delivery.

Second, to provide financial stability and mobility for Dashers. An individual with no bank account should have the ability to apply and start dashing within a few hours. Another focus from that point is helping Dashers achieve their earnings goal, whether that’s on an hourly or annual level. We have a lot of product work to be done to get there. 

What’s the culture of the Product org at DoorDash? What do you think sets the DoorDash Product apart from its competitors?

The most unique thing about DoorDash is how fast we move. Our product teams sometimes go from problem identification, to MVP, to shipped in an A/B test in the span of a day. Moving fast is a mentality we have: how do we work most efficiently together to get something done today?

Our executive team has intentionally set a “bottom up” culture, where most decisions are entrusted to the working teams so they’re able to move at their own pace, and work autonomously. We strongly believe that failure and bad decisions can be okay, so long as you learn and apply those learnings going forward. That level of trust from a management team is hard to find.

In your opinion, what makes a great PM and a great PM leader?

The PMs I’ve worked with in leadership positions all exhibit this one essential trait – they elevate everyone around them. What you’ll observe is people around these leaders are also performing at an incredibly high level whether it’s on engineering, design, operations, marketing, analytics or finance. Naturally, part of that is building amazing teams, but another portion is how a PM works with others. 

A good PM might be sharp and be able to appropriately delegate and prioritize cross-functionally. A great PM is someone who would take a step further, and ask their cross-functional stakeholders what they need from them to tie everyone’s work together. A PM has the opportunity to act as a guide and an amplifier – this leads to the best work output as a team.


If you’re interested in joining Kevin and the Product team, check out our Career Page. We’re hiring!

Radhika Bhalla, Senior Manager of UX Research

Hi Radhika! What do you do at DoorDash?

I lead a small but mighty UX Research team at DoorDash. My team works very closely with cross-functional partners to discover product opportunities, gather customer insights and evaluate the product/design decisions.

Tell me about your background:

I grew up in India. My family moved every 2–3 years because my dad was in the Air Force. I went to a boarding school for 9 years where I lived with students from diverse backgrounds speaking different languages. This was my introduction to observing human behavior.

I studied Architecture in my undergrad, determined that I’d solve important issues like low-cost housing for the world. When I got an internship at a government org, I found that many decisions were driven by the architects themselves without understanding the clients’ real needs, and that’s when I knew that this wasn’t the right career for me.

It’s always interesting to hear the stories of people who had a different start from their current career. How did you become a UX researcher?

I found my way to become a UX researcher through practicing a hybrid of design and research.

When I felt a bit lost as an architecture student, a mentor of mine who worked at one of the few design research companies in India introduced me to a research position. I got the taste of research for a year and really liked it.

In 2007, I came to the US to go to a grad school for Industrial Design at Art Center College of Design. There were many good grad school options in India, but they were mostly focused on Indian grassroots, and I wanted my learning to be more global for my future career.

Art Center had a program called Design Matters, and it partnered the design students with nonprofit orgs. One of the alumni initiated a fellowship around this program, and I was one of the first fellows. For a year, I partnered with a social enterprise called Sustainable Health Enterprises whose mission was to locally manufacture biodegradable sanitary pads for women in Rwanda. My job was to productize the prototype that MIT students had built. It was almost a mission impossible because in Rwanda, access to everything is expensive and people can’t afford to buy a biodegradable pad. We spent a lot of time lobbying the government to remove the luxury tax that’s put on the sanitary pads. It was an amazing opportunity to investigate what the women in Rwanda can do in their lives if manufacturing the sanitary pad isn’t the job for them.

What was your career journey like after school?

After graduation, it was time to get a real job, and I always had a heart for working on social causes. I joined Hot Studio, which had a social innovation department, as a Design Strategist. A lot of projects in my early career were centered around civic innovation or healthcare. Because of my design background, my job stayed a hybrid of research and design. When Hot Studio was acquired by Facebook, that was when I decided to fully commit to the research track.

Since then, I worked at Facebook for over 4 years and led research for International Experience at Google for 2.5 years. While at Google, I was grateful to be able to spend a lot of time in my home country, India, driving some meaningful change.

Having been a researcher for a decade now, what is your own ultimate career goal as a researcher?

Firstly, I’m interested in defining what building a research team culture looks like in the world we live in today. In almost every company I worked at, products start very US-centric and then hack into global. I want the teams to re-envision what a good global product looks like from day one.

Second, coming from an architecture and design background, I love building stuff. I believe that the role of the researcher has been evolving. I'm interested in changing companies from being data-driven to being "insights-driven." The insights can come from big data or small qualitative research.

Third, people usually think that research slows things down. I’d love to find a path for the research team to change the perception. Many researchers want to be absolutely confident about the result they share and it takes a long time to perfect it. They need to learn how to scale back to give just enough insights to form a decision and be ok with some ambiguity to move faster. At Google, I learned that having a structured process helps move things faster. Similar to their well-known five-day design sprints, researchers also tried getting quick results in five days.

Through a lot of trial and error, I also learned to develop my own voice vs. simply reporting research results. Having a perspective on suggestions makes a researcher an even more valuable partner in the cross-functional org.

Now you're at DoorDash as the Head of Research. We're very lucky to have you join us. I'm curious, what's been unique about DoorDash that you didn't experience at other companies?

Unlike other consumer tech companies like Facebook and Google where the product is the business, DoorDash stands for much more than just product. It’s about operationalizing the entire experience across offline and online.

In the Venn diagram of three circles — Business, Technology, and Human, every company had a different size circle. Google had the biggest circle in Technology while Facebook had the biggest in Users (through data). DoorDash seems like one of few companies with all three circles in equal size!

Also, I have never been in a company that is so into writing — it’s joyful to find so many docs and memos in detail. As you walk into meetings, there are reads on printouts, and people take 10–15 mins to read and ask very thoughtful questions to bring everyone to the same page. It’s interesting to think about how this culture influences the way the researchers should present the insights. It challenges how succinctly the researchers should write insights so that the large group can consume the content easily, but with just enough details.

Let’s switch the gear a bit and talk about your management experience. You have managed different teams in your career. How do you think your team will describe you?

I believe that you need to be a different manager to different people depending on what they need at different moments. And I’m proud that each person I’ve managed would have a different description of me. More junior researchers would say I provided a mix of teaching and mentoring. I take a more hands-on approach to solve the problems together. More senior researchers would say I was open to hearing their ideas on the research approach and constantly inspire them to stretch goals.

Another thing they will say is that I get my hands dirty and help them whenever needs arise. And of course, sometimes it bites me back on the work-life balance front! 🙂

I also hope they would say I put the team before myself. My favorite managers and mentors gave their teams a boost onto their shoulders and propped them up. So I hope to do the same.

As you build the research team, what qualities are you looking for in a researcher?

It depends on the level of the individual, but since we're currently focused on hiring more senior talent, I can speak to what I'm looking for in senior researchers. "Senior researcher" doesn't just mean having many years of experience under your belt. It means having proactively led teams to scope out a research roadmap and having acted as a strategic partner while being deeply embedded in cross-functional teams. Because we're a small team and we train the cross-functional teams to self-run research, it's important that you can mentor others. Also, I'm looking for someone who has experience transforming the culture of a research team. Again, because our team is small, every member shapes the culture of the team.

What kind of tips would you provide to the researcher candidate who might be interested in applying to DoorDash?

This isn’t limited to the research candidate — it’s important that you’re able to articulate why you’re applying to this position. In other words, why are you excited about DoorDash? For senior researchers, it’ll definitely get our attention if you can highlight your experience in mentorship or proactive research leadership in the cover letter.

Now on the personal side… What do you do for fun?

I adopted the cutest Labradoodle puppy back in January last year. His name is Frisco. I love taking a walk with him every morning and evening. And of course, he changed our life completely. Now my husband and I are on the lookout only for dog-friendly hiking places or drivable getaways.

Another hobby of mine is boxing. Yeah, real boxing! I discovered my interest in it 4–5 years ago when I lived in NY and met with a great trainer. He taught me the principles and strategy of defense in boxing, and interestingly those learnings were useful in some stressful situations with stakeholders at work. I haven’t found a good trainer yet here in the SF Bay Area (if you have a recommendation, please give me a holler), but my husband got me Fight Camp gear — I should get back into it.

I’m also into reading, especially historical fiction these days. I’d recommend the last book I enjoyed reading: Lost Girls of Paris. It’s about a small group of women spies during World War II.


For those readers who are interested in joining Radhika’s awesome team, please check out these links:

Consumer Experience Research Lead | Dasher Experience Research Lead

Please learn more about other design leaders at DoorDash:

Will Dimondi — Manager for Merchant Design