Improving Subgroup Analysis with Stein Shrinkage

October 28, 2021 | David Kastelman

DoorDash is often interested in knowing not only the average effect of an intervention, but also the more granular effect of that intervention on specific cohorts. To estimate such subgroup effects, it is common for practitioners to simply filter experimental data by subgroup and then note the observed difference between treatment and control groups in the filtered dataset. This approach is intuitive and unbiased, but the variance for each segment can quickly escalate. Beyond affecting error size, variance creates additional problems when the sample sizes of subgroups are unequal and we are interested in the groups that are performing best or worst. Because of their larger variances, smaller subgroups will be overrepresented in the tail results, as famously explicated in Howard Wainer’s 2007 American Scientist article “The Most Dangerous Equation.” By failing to account for these issues, practitioners may be constructing false theories of change, iterating less effectively, or mistargeting their interventions. 

To resolve these issues at DoorDash, we utilized a canonical method from the field of Empirical Bayes called the James-Stein estimator (JSE). At a high level, the JSE shrinks subgroup-level estimates toward the global mean in proportion to their variance. The JSE produces better composite predictions than classic segmentation while mitigating the overrepresentation of smaller subgroups in our most extreme results. We’re hopeful that a high-level explanation and a DoorDash case study will help practitioners conduct subgroup analysis more effectively in the face of challenges posed by variance. 

The benefits of subgroup analysis 

It is common to break down experimental results by subgroup. At DoorDash, we will often zoom in and look at distinct consumer, Dasher (our name for delivery drivers), and merchant groups and their specific results. This partitioning is useful for tracking the impact on important subgroups, for example, local merchants. Segmentation is also often leveraged to identify patterns in the subgroups that deviate most from the mean. These patterns frequently allow us to generate hypotheses around the causal mechanism and inspire iterations geared toward improving the experience for poorly performing subgroups. We also use segmentation to target our intervention exclusively to the subgroups that are estimated to be most positively affected. This targeting can be particularly helpful in these circumstances:

  • When the overall results are significant, we can often make our interventions more efficient by rolling them out to only the most positively affected segments.  
  • When overall results are not significant, we can still frequently find certain segments for which there is a positive impact. We discuss this in detail in this blog post.

Overall, segmentation allows us to go beyond estimating the single average treatment effect (ATE) of an experiment or new feature on the entire user base. Instead, we signal our interest in something deeper by breaking up the ATE into a series of estimated treatment effects, each conditional on a specific cohort. These cohort-specific effects are known as conditional average treatment effects (CATEs).

The challenges posed by variance for typical subgroup analysis 

An intuitive and straightforward approach to estimating the CATEs involves segmenting experimental data and separately estimating the treatment effect in each segment; it even results in an unbiased estimate. At the same time, variance creates several concerns for this type of subgroup analysis. 
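To make the setup concrete, here is a minimal sketch of this naive approach in Python (the DataFrame layout and column names, such as segment, treatment, and metric, are hypothetical):

```python
import pandas as pd

def naive_cate(df: pd.DataFrame) -> pd.DataFrame:
    """Estimate the CATE for each segment by filtering the experiment data
    and differencing treatment and control means within that segment."""
    rows = []
    for segment, seg_df in df.groupby("segment"):
        treat = seg_df.loc[seg_df["treatment"] == 1, "metric"]
        control = seg_df.loc[seg_df["treatment"] == 0, "metric"]
        rows.append({
            "segment": segment,
            "cate": treat.mean() - control.mean(),
            # Sampling variance of a difference in means: var_t/n_t + var_c/n_c
            "var": treat.var(ddof=1) / len(treat) + control.var(ddof=1) / len(control),
            "n": len(seg_df),
        })
    return pd.DataFrame(rows)
```

Each segment's estimate is unbiased, but its sampling variance grows as the segment's sample size shrinks, which is exactly the problem explored next.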

Larger Errors 

First, as the number of subgroups increases, the variance of our CATE estimates will also increase, resulting in larger errors. This degradation follows from what is often called the “most dangerous equation” — the inverse proportionality between the sampling variance and the sample size.

Formula 1:

$$\sigma_{\bar{x}} \;=\; \frac{\sigma}{\sqrt{n}}$$

Here, sigma-x-bar is the standard error of the mean (and its square is the sampling variance), sigma is the standard deviation of the sample (and its square is the variance of the sample), and n is the sample size. Additionally, because sampling variance follows a non-linear 1/n relationship with the sample size n, each further split of the data degrades the sampling variance, and therefore the estimate error, at an accelerating rate, as shown below.

Figure 1: Above, we plot how sampling variance scales with sample size (here the variance of the sample is normalized to 1). You can see that sampling variance goes up a lot more when the sample is cut from four to two than when it is cut from six to four.
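A few lines of Python make this scaling tangible (with the sample's variance normalized to 1, as in the figure):

```python
import numpy as np

sigma = 1.0  # standard deviation of the (normalized) sample
for n in [2, 4, 6, 10, 100]:
    standard_error = sigma / np.sqrt(n)   # formula 1
    sampling_variance = sigma**2 / n      # its square
    print(f"n={n:>3}  standard error={standard_error:.3f}  sampling variance={sampling_variance:.3f}")
```

Cutting the sample from four to two adds 0.25 to the sampling variance, while cutting it from six to four adds only about 0.08.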

Overrepresentation of small segments in extreme results 

Beyond general concerns around error size, variance concerns are particularly acute when: 

  •  Segments are of different sizes 
  •  We’re interested in identifying the groups that have the most extreme results 

Because small subgroups will have the largest sampling variance, they will be overrepresented whenever we look for the groups in the tails (see figure 2 below). 

Figure 2: Above, we plot two sampling distributions around a mean of 0; the small sample has a variance of three, and the large sample has a variance of two. When we focus on the tail results region with a value greater than five, there are many more instances from the small sample than from the large sample.
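This overrepresentation is easy to reproduce in a quick simulation. In the hypothetical sketch below, every segment has exactly the same true effect; the small segments differ only in sample size, yet they dominate the extreme tail:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, sigma = 0.0, 10.0        # every segment shares the same true effect
n_small, n_large = 50, 500            # per-segment sample sizes
n_segments = 2_000                    # number of segments of each size

# Observed segment means: same truth, different sampling variance (sigma^2 / n).
small = rng.normal(true_effect, sigma / np.sqrt(n_small), n_segments)
large = rng.normal(true_effect, sigma / np.sqrt(n_large), n_segments)

# Look at the most extreme 5% of all observed segment effects.
cutoff = np.quantile(np.concatenate([small, large]), 0.95)
print("small segments in the top 5%:", int((small > cutoff).sum()))
print("large segments in the top 5%:", int((large > cutoff).sum()))
```

The small segments crowd out the large ones at the top of the list even though nothing real distinguishes them.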

These statistical realities often create misleading results when looking for extreme values. For example, education activists were once motivated to champion smaller schools based on the observation that small schools were overrepresented among the highest-achieving schools in the US. However, when researchers studied the relationship between school size and performance in the whole dataset rather than only in the highest-performing tail, they found a positive correlation between school size and performance, debunking the notion that smaller schools are better (as illustrated in The Most Dangerous Equation article). This study is not unique; "best" lists of things like cities or hospitals often disproportionately cite smaller entities, failing to account for how higher variance makes these smaller examples more likely to revert to average performance in future observation periods.

How the JSE mitigates subgroup variance issues 

Fortunately, there is a way to estimate our CATEs while mitigating the variance concerns by using the JSE. At a high level, the JSE shrinks a group of estimates toward the global mean in proportion to the variance of each estimate and a modulating parameter. Formally, the JSE can be written as:

Formula 2:

$$z \;=\; \bar{y} + c\,(y - \bar{y})$$

where z is the JSE, y-bar is the global mean, y is the group mean, and c is a modulating parameter that depends partly on the sampling variance of the group estimate. 
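The post leaves the exact form of c unspecified. For intuition, here is a minimal sketch assuming the classic equal-variance setting, in which c = 1 - (k - 3)σ² / Σ(y - ȳ)², with σ² the common sampling variance of the group means and k the number of groups:

```python
import numpy as np

def james_stein(y: np.ndarray, sampling_var: float) -> np.ndarray:
    """Shrink the group means y toward their global mean (equal-variance sketch)."""
    k = len(y)
    y_bar = y.mean()
    # Classic shrinkage factor: the noisier the estimates relative to their spread,
    # the closer c gets to zero and the harder we shrink.
    c = 1.0 - (k - 3) * sampling_var / ((y - y_bar) ** 2).sum()
    c = max(c, 0.0)  # positive-part variant: never shrink past the global mean
    return y_bar + c * (y - y_bar)

# Example: four noisy group means (sampling variance 4) get pulled toward their mean of 18.
print(james_stein(np.array([10.0, 12.0, 20.0, 30.0]), sampling_var=4.0))
```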

A synonym for the "shrinkage" that the JSE performs is "regularization." All regularization can be interpreted as a Bayesian approach that shrinks coefficients toward some prior value. Widely popular forms of regularization like the Ridge and Lasso techniques, for instance, shrink parameters toward zero using the L2 and L1 norms, respectively. Stein shrinkage is not just Bayesian, but also a canonical technique from the burgeoning set of Empirical Bayes methods. The JSE is "empirical" in constructing its Bayesian prior from the data at hand; it pools the data and uses the global average for shrinkage rather than using an invariant prior value like zero.

The JSE addresses issues with traditional segmentation-based estimates of CATEs in two ways: 

  • By shrinking noisy estimates toward the global mean, it reduces the aggregate error of the CATE estimates relative to using the raw subgroup averages. 
  • Because the shrinkage is proportional to each estimate's variance, small, high-variance subgroups are pulled hardest toward the mean, mitigating their overrepresentation among the most extreme results.

When the JSE works best 

Two notes about the expected difference in MSE between the JSE and using subgroup observed averages: 

  • The size of the improvement depends on how noisy the individual estimates are relative to the spread of the true group means; the larger the sampling variance, the more shrinkage helps. 
  • For three or more parameters, the JSE always achieves a lower expected total MSE than the observed averages, but only in aggregate, not necessarily for any individual parameter.

The second point — that JSE always dominates observed averages for three or more parameters, but only in aggregate — underscores the primary insight behind the approach. If we have to predict a number of parameter values and we only care about the composite prediction accuracy, then there are better ways to predict those values in combination than how we would predict them in isolation. 

For example, if you have estimates of three completely unrelated parameters, such as wheat yield in 1993, number of spectators at Wimbledon in 2001, and the weight of a candy bar, then using the JSE and shrinking your estimates toward the global mean of all three parameters would still dominate using your individual estimates. However, for no single parameter would you expect the JSE to outperform the individual estimate; the dominance is truly an emergent property, existing only when considering the combined prediction accuracy. If we really care about the individual predictions such that we would regret degrading one of our estimates, even if we were successful in improving the combined accuracy of all estimates, then the JSE is not appropriate. 
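A small Monte Carlo, using the equal-variance sketch above with ten hypothetical group means (rather than the three-parameter example), makes this emergent property visible: the group whose true mean sits far from the others tends to fare worse individually under shrinkage, yet the total squared error across all groups comes out lower.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.concatenate([np.zeros(9), [10.0]])  # one group is a genuine outlier
sigma = 1.0                                         # common sampling noise
k, n_sims = len(true_means), 50_000

raw_sq_err = np.zeros(k)
js_sq_err = np.zeros(k)
for _ in range(n_sims):
    y = rng.normal(true_means, sigma)
    y_bar = y.mean()
    c = max(1.0 - (k - 3) * sigma**2 / ((y - y_bar) ** 2).sum(), 0.0)
    z = y_bar + c * (y - y_bar)   # James-Stein shrinkage toward the grand mean
    raw_sq_err += (y - true_means) ** 2
    js_sq_err += (z - true_means) ** 2

print("total MSE, raw averages :", raw_sq_err.sum() / n_sims)
print("total MSE, James-Stein  :", js_sq_err.sum() / n_sims)
print("per-group MSE, James-Stein:", np.round(js_sq_err / n_sims, 2))
# Expect: the outlying group's per-group MSE exceeds 1 (the raw MSE), while the
# total MSE across all ten groups lands below the raw total of 10.
```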

Using JSE for consumer promotions at DoorDash

DoorDash uses promotions to ensure affordable prices for consumers while keeping those promotions financially sustainable for the business. We can model this goal as a formal optimization problem whose objective blends the business's top-line metrics (like revenue) with bottom-line metrics (like net income). The optimization problem also features various constraints such as price consistency and algorithmic fairness.

For this use case, we were interested in a particular consumer promotion and we were able to estimate the relevant parameter values through a combination of experimentation, historical data, and simulation. Moreover, we specifically aimed to estimate the parameter values for each and every “submarket,” which is a high-level geographic area. Note that because submarkets are based on geography, they can vary widely in the number of orders they comprise. 

The challenges variance poses for promotion optimization

Unfortunately, our consumer promotion optimization setup has all the complications of variance previously mentioned: we have a large set of segments (submarkets), which can lead to larger estimate errors for our parameters of interest. We thus needed to improve the composite accuracy of submarket-level parameter estimation to improve the business performance of our optimization. Moreover, the sample sizes used for estimation differ across submarkets, and we're interested in the extreme values: for instance, we would like to know the most sensitive areas where it's best to invest in offering the promotions.

Without an approach like JSE, we'd be poised to mistarget our interventions, calling out and taking action on our smaller submarkets more than our larger submarkets. Note also that one of our constraints for optimization is price consistency; we know that consumers value predictable promotions and there is a cost to frequently changing promotion levels. With naive segment analysis, we'd be poised to change promotion investment frequently in small submarkets because their estimated parameters, like sensitivity, fluctuate wildly. But that would stunt the development of our smaller, often newer, submarkets and keep them from reaching their long-term potential. Given these factors, it's clear why Stein shrinkage is a good fit for this problem space.

Perhaps most importantly, however, the large variance in small submarkets could have led to suspicious recommendations that undermined trust in the whole system of optimized promotional recommendations. As optimization matures at DoorDash, there is still quite a bit of room for manual operational intervention. A lack of trust in our models can undermine the value of those models, and that certainly would have been the case here. Manual interventions likely would have produced the reverse of Stein shrinkage: individual situations would often have improved (making each intervention seem warranted), but combined performance would likely have been worse, eating into the large gains we were able to achieve with this optimization. Therefore, to build trust in our recommendations, it was essential that we had credible recommendations even in our smallest, highest-variance submarkets.

Details on Stein shrinkage implementation 

We implemented Stein shrinkage by making a few practical alterations to the ordinary JSE approach outlined above. We started with the shrinkage estimator shown in formula 3.

Formula 3:

$$\tilde{\beta}_b \;=\; \hat{\bar{\beta}} \;+\; \left(1 - \frac{\lambda\,\widehat{\operatorname{Var}}(\hat{\beta}_b)}{\bigl(\hat{\beta}_b - \hat{\bar{\beta}}\bigr)^{2}}\right)\bigl(\hat{\beta}_b - \hat{\bar{\beta}}\bigr)$$

Here, subscript b represents submarket b, beta-tilde is the JSE, beta-hat is the observed average, beta-bar-hat is the observed global average, var-hat is the observed sample variance, and lambda is a tunable parameter. 

This estimator also shrinks a group of estimates toward the global mean in proportion to the variance of each estimate and a modulating parameter. This method, however, uses the observed variance because the actual variance is unknown. The estimator is also optimized to handle unknown and unequal variances, which fits our use case when estimating parameter values from data.
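Here is a minimal sketch of this shrinkage step, matching the reconstruction of formula 3 above; the DataFrame layout and column names (beta_hat, var_hat) are assumptions for illustration, not DoorDash's actual schema:

```python
import pandas as pd

def stein_shrink(df: pd.DataFrame, lam: float) -> pd.Series:
    """Shrink submarket-level estimates toward the observed global average.

    Assumed (hypothetical) columns:
      beta_hat - observed estimate for each submarket
      var_hat  - observed sampling variance of that estimate
    """
    global_avg = df["beta_hat"].mean()
    deviation = df["beta_hat"] - global_avg
    # Shrinkage factor per submarket: the larger the observed variance relative to
    # how far the submarket sits from the global average, the harder we shrink.
    factor = 1.0 - lam * df["var_hat"] / deviation.pow(2)
    factor = factor.clip(lower=0.0)  # safeguard: never shrink past the global average
    return global_avg + factor * deviation
```

Small, noisy submarkets end up close to the global average, while large submarkets with precise estimates are left nearly untouched.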

A note on “sensitivity” and derived metrics

We used Stein shrinkage to estimate a number of different parameters of interest in the promotion optimization. One of those parameters, sensitivity, is actually a formula derived from the difference between groups rather than a metric that can be averaged across the units of randomization. This formulation of sensitivity makes traditional HTE modeling inapplicable and complicates the calculation of sample variance. While we could use the bootstrap or the delta method to estimate the sampling variance of sensitivity, we instead used a simple variance proxy calculated from sample size and fee differences in each submarket as a first pass.
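For readers curious what the bootstrap option might look like, here is a hedged sketch for a metric defined as a difference between group means within a single submarket (the column names are hypothetical, and the actual sensitivity formula is not spelled out in this post):

```python
import numpy as np
import pandas as pd

def bootstrap_variance(seg_df: pd.DataFrame, n_boot: int = 1_000, seed: int = 0) -> float:
    """Bootstrap the sampling variance of a difference-in-means style metric
    within one submarket (columns treatment and metric are hypothetical)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(seg_df), size=len(seg_df))  # resample rows with replacement
        resampled = seg_df.iloc[idx]
        treat = resampled.loc[resampled["treatment"] == 1, "metric"]
        control = resampled.loc[resampled["treatment"] == 0, "metric"]
        stats[i] = treat.mean() - control.mean()
    return float(stats.var(ddof=1))
```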

Hyperparameter tuning for Stein shrinkage

Let’s talk about our choice of a loss function for hyperparameter tuning of lambda. For our use case in industry, we decided the most business-relevant loss function was weighted mean absolute error (wMAE). We selected MAE rather than MSE because there was no particular reason to overweight large errors in this setup; we’ve observed fairly linear sensitivities and our promotions are ultimately clipped to a fairly narrow range. We then weighted the error in each submarket according to historical delivery volume. Using wMAE, rather than MAE, is common when one is concerned most with aggregate business impact.
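A hedged sketch of what such tuning could look like: grid-search lambda and score the shrunken estimates against a later holdout period's observed averages using a volume-weighted MAE (the holdout setup and column names are assumptions; the post does not spell out the exact procedure):

```python
import numpy as np
import pandas as pd

def weighted_mae(pred: pd.Series, actual: pd.Series, weights: pd.Series) -> float:
    """Absolute errors weighted by (hypothetical) historical delivery volume."""
    return float(np.average(np.abs(pred - actual), weights=weights))

def tune_lambda(train: pd.DataFrame, holdout: pd.DataFrame,
                grid=np.linspace(0.0, 5.0, 51)):
    """Pick lambda by grid search. Both frames are assumed to be indexed by submarket,
    with hypothetical columns beta_hat and var_hat (train) and beta_hat and volume (holdout)."""
    best_lam, best_score = None, np.inf
    global_avg = train["beta_hat"].mean()
    deviation = train["beta_hat"] - global_avg
    for lam in grid:
        factor = (1.0 - lam * train["var_hat"] / deviation.pow(2)).clip(lower=0.0)
        shrunk = global_avg + factor * deviation
        score = weighted_mae(shrunk, holdout["beta_hat"], holdout["volume"])
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score
```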

The results from using Stein shrinkage 

Stein shrinkage improved the wMAE for all of our parameters of interest. For any single parameter, like sensitivity, the gains were moderate; however, summing up all the improvements resulted in a noticeable lift to optimization. This was especially the case because the wMAE improvements were larger among our smaller submarkets, resulting in a better ordering of optimal submarkets in which to invest. This allowed us to avoid mistargeting too much of our limited promotion budget to smaller, more variable submarkets. 

The larger wMAE gain in smaller submarkets was also critical in avoiding the long-term risk of frequently changing promotion levels on our smallest submarkets, which would have deterred consumers and hurt the development of these submarkets. Last, and most importantly, Stein shrinkage allowed us to use the most business-relevant loss function of wMAE while still ensuring that our smallest submarkets had high-quality estimates. By making plausible recommendations in even our smallest submarkets and optimizing for aggregate business results, we successfully built trust in the recommendations, generated large gains from the optimization, and generated excitement for continued use of the system. 

Conclusion

When estimating the treatment effect for different segments, variance poses a number of issues. It increases error size and can lead to a misleading ordering of the most affected segments, with smaller segments exhibiting more extreme effects because of their larger variances. To address these issues at DoorDash while optimizing promotions, we had notable success using Stein shrinkage. Stein shrinkage is a form of regularization that accepts some bias in order to constrain variance by shrinking estimates toward the global mean in proportion to their sampling variance. 

Zooming out, we want to offer two concluding thoughts on Stein shrinkage. First, the JSE was actually the original innovation in using regularization to reduce variance at the expense of bias; widely popular shrinkage techniques like Ridge and Lasso were all preceded by the JSE. Despite still being broadly useful, the JSE is not as universally known or taught today as other forms of regularization. Second, regularization and bias-variance tradeoffs are well-known considerations in predictive modeling, but these techniques are not as widely used when analyzing experiment data. By convention, experiment analysis to this day is conducted primarily according to strictly frequentist approaches that aim for unbiasedness. There are almost certainly times when this convention is unhelpful, and it instead makes sense to minimize the error of our estimates.

Acknowledgements 

Thank you to the DoorDash Causal Inference Reading Group (including Abhi Ramachandran, Yixin Tang, Lewis Warne, Stas Sajin, Dawn Lu, and Carol Huangci) for contributing to the ideas in this post. Thanks also to Alok Gupta and Han Shu for reviewing the blog post. Lastly, an enormous thank you to Ezra Berger for the many improvements and iterations on the draft of the blog post.
