CUPED is a method for variance reduction being widely implemented by major online experimentation platforms (Deng et al. 2013). The idea was recently extended to a general augmentation framework (Deng et al. 2023), which is a special case of
Chen, Y.-H., and Chen, H. (2000), A Unified Approach to Regression Analysis Under Double Sampling Design, Journal of the Royal Statistical Society, Series B, 62, 449–460.
The extended CUPED is a good application case of
multipleOutcomes
. Instead of following Deng’s new paper, I
use Chen’s approach to elaborate on the details of extended CUPED, as
this will better facilitate the explanation of its underlying
assumptions and failure conditions.
Consider a two-arm randomized controlled experiment of a targeted metric \(Y\). Let \(A\) be the arm assignment. If randomization is successful, the treatment effect can be assessed through \(\hat{\Delta}\), the estimate of \(\Delta\) in the linear regression model
\[Y = \alpha + \Delta A + \text{error},\]
where error is unnecessary to be normal. Note that a logistic regression model may be more appropriate if \(Y\) is binary.
Obviously, \(\hat{\Delta}\) is consistent and asymptotic unbiased under good randomization. We are of interest in alternative estimates for \(\Delta\) with variance lower than that of \(\hat{\Delta}\). Consider an estimate
\[\tilde{\Delta} = \hat{\Delta} + c \times \hat{\delta} \] where \(\text{E}(\hat{\delta}) = 0\). Thus, \(\tilde{\Delta}\) is also consistent and asymptotic unbiased to the treatment effect \(\Delta\), and the variance of \(\tilde{\Delta}\) is no greater than \(\hat{\Delta}\) (worst case: \(c=0\)). As long as \(\hat\Delta\) and \(\hat\delta\) are correlated, we can find a \(c\) yielding a more efficient estimate of \(\Delta\). This simple idea can be further extended for more than one \(\hat\delta\):
\[\tilde{\Delta} = \hat{\Delta} + c_1 \times \hat\delta_1 + c_2 \times \hat\delta_2 + \cdots + c_K \times \hat\delta_K.\]
To conduct \(\hat\delta\) of zero expectation, one may collect additional pre- or in-experiment metrics \(X_1, X_2, \cdots, X_K\) for each of the samples. If \(X\) are well balanced in the two arms, we have \(\delta_1 = \delta_2 = \cdots = \delta_K = 0\), where \(\delta_k\) is the slope of a working model
\[X_k = \alpha_k + \delta_k A + \text{error}.\] Here, a working model means that the relationship between the variables specified by the model is unnecessarily true; we simply choose to fit it as is. As a result, the error term can be anything and an unreasonable link function may be adopted without affecting the validity of extended CUPED. For example, one can fit a linear regression against a binary \(X\); the slope \(\delta = 0\) as long as \(X\) is balanced. Furthermore, note that for any transformation \(h(\cdot)\), \(h(X)\) is balanced given that \(X\) is balanced, one can fit the following working model as well
\[h_k(X_k) = \alpha_k + \delta_k^h A + \text{error}.\] A more natural model, the logistic regression model also works
\[\text{logit}\ \text{Pr}(X_k = 1\mid A) = \alpha_k + \delta_k^\text{logit}A.\] It is interesting that mutiple \(\hat\delta\) that corresponding to the same \(X_k\) can be integrated into the extended CUPED estimate simultaneously. In the example above, \(\tilde\Delta\) can be defined as
\[\tilde\Delta = \hat\Delta + c_k \hat{\delta}_k + c_k^h \hat{\delta}_k^h + c_k^\text{logit}\hat{\delta}_k^\text{logit} + \hat\delta \text{ based on other } X\] The variance of this estimate is no greater than an extended CUPED estimate with less \(\hat\delta\) from the same metric \(X_k\) (e.g. \(\hat\Delta + c_k \hat\delta_k\)).
In summary, the extended CUPED is valid (consistent and asymptotic unbiased) when
The extended CUPED is more efficient when
The optimal weights \(c\) depend on
the covariance matrix of \((\hat\Delta,
\hat\delta)\), not the original observation of \((Y, X)\), thus, tuning on \(c\) will not inflate the type I error of
inference. multipleOutcomes
can return the
covariance matrix for extended CUPED.
The extended CUPED remains valid even if
In fact, adding multiple \(\hat\delta\) based on different transformations on the same \(X_k\) into the extended CUPED estimate may lower its variance further in some cases. A variance stabilize transformation may be considered for metrics of certain parametric families of distributions or highly skewed metrics.
## generating data
library(multipleOutcomes)
library(mvtnorm)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
genData <- function(seed = NULL){
set.seed(seed)
n <- 500
sigma <- matrix(.7, 5, 5)
diag(sigma) <- 1.
x <- rmvnorm(n, sigma = sigma)
x[, 1] # pre-experiment data of the targeted metric
x[, 2] # continuous metric
x[, 3] <- log(abs(x[, 3])) # continuous and skewed metric
x[, 4] <- ifelse(x[, 4] > .2, 1, 0) # binary metric
a <- rbinom(n, 1, .5) # 1:1 randomization
# y is post-experiment data of the targeted metric
y <- x[, 5]
y[a == 0] <- y[a == 0] + rnorm(sum(a == 0), sd = .1)
y[a == 1] <- y[a == 1] + rnorm(sum(a == 1), mean = .1, sd = .1) # Delta = .1
## x5 and x6 are metrics (random noise) but uncorrelated with the targeted metric
data.frame(
y, a = a,
x1 = x[, 1], x2 = x[, 2], x3 = x[, 3], x4 = x[, 4],
x5 = rnorm(n), x6 = rbinom(n, 1, .5))
}
dat <- genData(1)
head(dat) %>% print()
#> y a x1 x2 x3 x4 x5 x6
#> 1 0.40263988 0 -0.1619335 0.2817750 -1.2855327 1 0.6414685 0
#> 2 0.07511838 0 -0.2599780 0.4563869 -0.5211990 1 1.6552190 1
#> 3 0.63327183 0 0.8814712 0.2669605 -1.2488553 0 -0.8596355 0
#> 4 1.04650293 1 0.6195346 0.6352780 0.1493731 1 0.6173840 0
#> 5 0.52152316 0 0.6172003 0.5422494 -1.8662895 0 1.3358366 1
#> 6 -0.23932112 0 -0.5193212 -0.5739109 0.2578483 0 -0.5340088 1
xtabs(~x4, dat) %>% print()
#> x4
#> 0 1
#> 299 201
plot(
dat %>%
select(-x4, -x6, -a) %>%
pivot_longer(everything(), names_to = "indicator", values_to = "value") %>%
ggplot(aes(value)) +
geom_histogram() + facet_wrap(~indicator)
)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## function to compute CUPED estimate
## ...: formulas for each of the metrics
## family: a vector of family for each of the formulas
## data: a data frame of experiment data
cuped <- function(..., family, data){
fit <- multipleOutcomes(..., family = family, data = data, data_index = NULL)
id <- sapply(fit$id_map, function(x){x['a']})
id1 <- id[1]
id2 <- id[-1]
Delta <- coef(fit)[id1]
delta <- coef(fit)[id2]
mcov <- vcov(fit)
opt_c <- as.vector(- solve(mcov[id2, id2, drop = FALSE]) %*% mcov[id2, id1, drop = FALSE])
estimate <- Delta + sum(delta * opt_c)
stderr <- (
mcov[id1, id1] +
2 * as.vector(t(opt_c) %*% mcov[id2, id1, drop = FALSE]) +
as.vector(t(opt_c) %*% mcov[id2, id2, drop = FALSE] %*% opt_c)
) %>% sqrt()
list(estimate = estimate, stderr = stderr)
}
## An example
## y ~ a: treatment effect of post-experiment data for the targeted metric
## x1 ~ a: delta (= 0) using pre-experiment data for the targeted metric
## x2 ~ a: delta (= 0) using a continuous pre-experiment metric
## x3 ~ a: delta (= 0) using a skewed pre-experiment metric
## abs(x3) ~ a: delta (= 0) using a transformation of a pre-experiment metric
## x4 ~ a: delta (= 0) using a binary pre-experiment metric
## x5 ~ a: delta (= 0) using a metric that is a random noise
## family: linear regression model is fitted for all working models (even if x4 is binary)
cuped_fit <-
cuped(
y ~ a,
x1 ~ a,
x2 ~ a,
x3 ~ a,
abs(x3) ~ a,
x4 ~ a,
x5 ~ a,
family = 'gaussian',
data = dat)
## CUPED estimate and its standard error
print(cuped_fit)
#> $estimate
#> [1] 0.06196927
#>
#> $stderr
#> [1] 0.05886572
## treatment effect on targeted metric (post-experiment data only)
summary(lm(y ~ a, dat))$coef
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.11394931 0.06572757 1.7336608 0.0835972
#> a -0.03789196 0.09221800 -0.4108955 0.6813259
The example above illustrates a complex case of CUPED using
mutipleOutcomes
. The standard error of estimated treatment
effect is substantially reduced (0.059 vs 0.092).
We carry out simulations to verify the key points in Conclusions. Specifically,
n_simu <- 1000
all_estimate <- NULL
all_stderr <- NULL
for(i in 1:n_simu){
dat <- genData(i)
# use post- data only
post_only <- summary(lm(y ~ a, dat))$coef
# add pre- data of targeted metric
pre_target <- cuped(y ~ a, x1 ~ a, family = 'gaussian', data = dat)
# add more pre- data
pre_more <- cuped(y ~ a, x1 ~ a, x3 ~ a, x4 ~ a, family = 'gaussian', data = dat)
# multiple delta (transformation) for the same metric
# transform_delta <- cuped(y ~ a, x1 ~ a, x3 ~ a, exp(x3) ~ a, x4 ~ a, x4 ~ a, family = c('gaussian', 'gaussian', 'gaussian', 'gaussian', 'gaussian', 'binomial'), data = dat)
# add all correlated pre- data
all_corr <- cuped(y ~ a, x1 ~ a, x2 ~ a, x3 ~ a, x4 ~ a, family = 'gaussian', data = dat)
# add noise data
add_noise <- cuped(y ~ a, x1 ~ a, x2 ~ a, x3 ~ a, x4 ~ a, x5 ~ a, x6 ~ a, family = 'gaussian', data = dat)
# noise only
noise_only <- cuped(y ~ a, x5 ~ a, x6 ~ a, family = c('gaussian', 'gaussian', 'binomial'), data = dat)
extractEstimate <- function(mod){
mod$estimate
}
extractStderr <- function(mod){
mod$stderr
}
estimate <-
data.frame(
post_only = post_only['a', 'Estimate'],
pre_target = extractEstimate(pre_target),
pre_more = extractEstimate(pre_more),
# transform_delta = extractEstimate(transform_delta),
all_corr = extractEstimate(all_corr),
add_noise = extractEstimate(add_noise),
noise_only = extractEstimate(noise_only)
)
stderr <-
data.frame(
post_only = post_only['a', 'Std. Error'],
pre_target = extractStderr(pre_target),
pre_more = extractStderr(pre_more),
# transform_delta = extractStderr(transform_delta),
all_corr = extractStderr(all_corr),
add_noise = extractStderr(add_noise),
noise_only = extractStderr(noise_only)
)
all_estimate <- rbind(all_estimate, estimate)
all_stderr <- rbind(all_stderr, stderr)
}
## mean of estimates of simulation (true Delta = .1)
apply(all_estimate, 2, mean) %>% print(digits = 3)
#> post_only pre_target pre_more all_corr add_noise noise_only
#> 0.101 0.100 0.101 0.101 0.101 0.102
## empirical standard deviation
apply(all_estimate, 2, sd) %>% print(digits = 3)
#> post_only pre_target pre_more all_corr add_noise noise_only
#> 0.0916 0.0654 0.0627 0.0585 0.0585 0.0918
## theoretical standard error
apply(all_stderr, 2, mean) %>% print(digits = 3)
#> post_only pre_target pre_more all_corr add_noise noise_only
#> 0.0898 0.0643 0.0617 0.0576 0.0575 0.0896