The Delta Method in A/B Testing - When Randomization and Analysis Units Differ

A/B tests are pretty standardized in industry today, and everyone agrees that they are very useful. However, despite that, there are still many misconceptions and teams that lack expertise are often making same mistakes over and over again, compromising the efforts to make data-driven decisions without even knowing it.

This article will go through one mistake that usually leads to overly optimistic results in experiments.

Randomization and Analysis Units

A key component of any randomized controlled trial (the real name of A/B test) is - surprise - randomization. Meaning, assigning units to different groups randomly, where each group has a different treatment applied to it. This allows us to isolate the effect of a treatment from effects coming from other variables. One of the reasons for this is because it provides a property that is central to most statistical analysis - independent and identically distributed observations, the famous i.i.d property.

Given we have an i.i.d sample, and an average metric we want to compute over it, it is easy to find the properties of the sampling distribution¹ of the metric. Firstly, we know that it must be normal, as it is guaranteed by the central limit theorem². Second, by the same theorem, we know that the mean and variance of the sampling distribution are closely related to the sample mean and variance. More specifically, they are:

\mu(\bar{Y}) = \frac {\sum_{i}^{N} Y_i} {N}

\sigma^2(\bar{Y}) = \frac {\sum_i^N {(Y_i - \mu(\bar{Y}))^2}} {N}

If the $Y_i$ that we use in the analysis (analysis unit) are the exact same units as the ones we used for randomization (randomization units), then we are safe to assume that the i.i.d property is satisfied³.

However, what if we have the need to use a different kind of unit in our analysis? For example, we could randomize on users, but need to calculate a metric that has sessions as an analysis unit. Then users have multiple sessions, but sessions from the same user are more often than not correlated, which breaks our i.i.d assumption.

While the calculation of the average of the sampling distribution happens to remain valid, the sampling distribution variance, which we need for confidence intervals and inference, does not. This is because it depends on the sample variance, and for the sample variance calculation to be correct, a key assumption is i.i.d of the sample observations.

This article will show one possible method to deal with this when we are using ratio metrics in our A/B test. Let's go through an example and see some simulations that can illustrate the problem a bit more intuitively.

An Example - Click-Through Rate

Let's consider a realistic use case - we have a website and we serve ads on it. Say we want to test weather it's better to show an ad at the top of the page, or on the sidebar. Now, obviously, whenever a visitor clicks on an ad, our website generates revenue. Since revenue is important to us, we want to make sure our choices result in the most clicks on the ads as possible.

This metric is usually known as click-through rate, and we can define it mathematically like this:

CTR = \frac {C} {N}

where $N$ is the number of user sessions in which an ad was shown (often called impressions), and $C$ is the number of those impressions that resulted in the user clicking on an ad.

Now, as we can see from the formula, we compute it by using impression data. This means our analysis unit is impression. The easiest way to design the experiment correctly would be to have impression be the randomization unit as well, i.e. every time we have an impression, we randomly choose which ad position we would want to show.

However, this would mean that the same user can see both the position at the top, and the position on the side, which could result in a weird customer experience, but it could also potentialy have confounding effects on it's own. And we don't want to take the chances with that, so we decide to randomize on users.

Now our randomization unit is user, but our analysis unit is impression, and since one user can have multiple impressions, the i.i.d. assumption that we need for our analysis is broken.

Let's see this in an example.

The Problem - Computing the Sampling Variance

First, let's define a probabilistic model for the CTR on our website. The model will give us the ability to sample infinite amount of times from the model, and every sample we get is a simulation of a sample we would get in the real world. Then, we would be able to generate the true sampling variance given our choosen sample size.

This will enable us to compare our estimates to the ground truth in case we choose to ignore the violation of i.i.d assumption.

Let's create a model of user impressions and clicks that is believable enough. I choose here to use the pyro⁴ python package and create a Bayesian network that we can easily understand. Even though data from reality is likely more complex, it does not really create a problem for the argument here - if we cannot estimate the sampling variance on a simplified model of reality, we are probably going to perform even worse on real-life data.

So, I propose the following model:

import pyro
import pyro.distributions as dist

import torch

import pandas as pd

num_users = 10000

user_group_table = pd.DataFrame(
    {
        "p": [1/3, 1/2, 1/6],
        "Num Impressions 𝜆": [2., 5., 30.],
        "Click Prob": [.3, .5, .8]
    },
    index=pd.Index(data=[1,2,3], name="User Group")
)

def model():
    user_group_probs = torch.tensor(user_group_table.p.tolist())
    user_group = pyro.sample("User Group", dist.Categorical(probs=user_group_probs), sample_shape=[num_users])
    
    user_group_indicator = torch.nn.functional.one_hot(
        user_group, num_classes=user_group_table.shape[0]
    ).type(torch.FloatTensor)
    
    group_poisson_lambdas = torch.tensor(user_group_table["Num Impressions 𝜆"].tolist())
    group_p_ctr = torch.tensor(user_group_table["Click Prob"].tolist())
    
    poisson_lambda = user_group_indicator.matmul(group_poisson_lambdas)
    p_ctr = user_group_indicator.matmul(group_p_ctr)
    
    num_impressions = pyro.sample("Num Impressions", dist.Poisson(poisson_lambda)) + 1 # Exclude users with 0 impressions.
    num_clicks = pyro.sample("Num Clicks", dist.Binomial(total_count=num_impressions, probs=p_ctr))
    
    return num_impressions, num_clicks

pyro.render_model(model, render_distributions=True)

svg

The inspiration for this model came from Deng et al (2018)⁵, where the assumption for the simulation model is that there exist clusters of users with similar behaviour - a fair assumption to make. The same assumption is applied here, though the model is not exactly the same - it's a little simplified and we are not performing the same simulation as in the paper.

The model consists of three main distributions:

User Group - a categorical distribution⁶ choosing one of three user clusters, with different probabilities, that model users with different behaviours, including the power users. The clusters parametrize the distributions for Num Impressions and Num Clicks in following way:

from IPython.display import Markdown, display

Markdown(user_group_table.to_markdown())

User Group	p	Num Impressions 𝜆	Click Prob
1	0.333333	2	0.3
2	0.5	5	0.5
3	0.166667	30	0.8

Num Impressions - a Poisson distribution parametrized by $\lambda$ from the given user group modelling the number of impressions a user had for the duration of the trial.
Num Clicks - a Binomial distribution parametrized by the probability to click for a given user group (i.e. the CTR), as well as the number of impressions.

Now we can simulate a number of datasets (for example 1000), compute the CTR for every one of them, which gives us the sampling distribution for our model. The variance of this distribution is the sampling variance that we would like to estimate for a sample of 10000 users, and the square root of that is the standard error estimate that we actually end up using for inference in our A/B test. So let's see what the empirical standard error is:

import numpy as np
import seaborn as sns

ctrs = []
for _ in range(1000):
    session_per_user, clicks_per_user = model()
    
    ctrs.append((clicks_per_user.sum() / session_per_user.sum()).numpy())
    
ctrs = np.array(ctrs)
real_se = ctrs.std()

sns.histplot(ctrs, bins=50, kde=True)
print(f"True Standard Error (SE): {real_se:.4f}")

True Standard Error (SE): 0.0028

png

As expected, the sampling distribution is approximately normal (guaranteed by the central limit theorem), and the true standard error (SE) is estimated to be:

SE = 0.0028

Now that we know the true standard error, let's simulate another sample and calculate the SE in a naive way, ignoring the correlation between impressions from the same user:

sample_num_impressions, sample_num_clicks = model()
sample_num_impressions, sample_num_clicks = sample_num_impressions.numpy(), sample_num_clicks.numpy()

sample_p = sample_num_clicks.sum() / sample_num_impressions.sum()
sample_var = sample_p * (1 - sample_p)
n_impressions = sample_num_impressions.sum()

naive_se = np.sqrt(sample_var / n_impressions)

print(f"Naive SE: {naive_se:.4f}")

Naive SE: 0.0016

We can see that the naive SE is ~45% less relative to the real SE. This can have serious implications, because it increases the chance of a false discovery.

Let's now move to the solution, and give a short background on it.

The Delta Method - Background

We'll dive deeper for a moment into the delta method. This part is a bit mathematical, in case you'd like to get a better understanding of what's going on, but feel free to skip it if you just want to see how it performs in practice.

The delta method itself comes from calculus, and it's simply a Taylor approximation of a function of a random variable. In this case, the random variable is asymptoticaly (standard) normal, and in a following way:

\sqrt{n} (T_n - \theta) = \mathcal{N}(0, 1) \tag{1}

where $n$ is the sample size, $T_n$ some estimator, like the average, and $\theta$ some estimand. The expression converges to a normal distribution as $n \to \infty$ . To avoid overloading the equations with needless limit notation, let's just assume that everywhere the limit $n \to \infty$ applies.

Now, given a function $\phi$ , we can use Taylor approximation to derive:

\phi(T_n) - \phi(\theta) = \phi'(\theta)(T_n - \theta) + O((T_n - \theta)^2)

Note that $(T_N-\theta)^2$ is on it's own $O(\frac {1} {\sqrt{n}})$ , so as $n \to \inf$ , this term vanishes to 0. This leaves only:

\phi(T_n) - \phi(\theta) = \phi'(\theta)(T_n - \theta) \tag{2}

Then, if we remember again that $T_n - \theta$ converges to a normal from (1), then:

\sqrt{n} \phi(T_n) - \phi(\theta) = \mathcal{N}(0, \phi'(\theta)^2)

as the variance of a random variable times a constant follows the rule: $Var[C * X] = C^2 Var[X]$ . We can also write this as:

\phi(T_n) - \phi(\theta) = \mathcal{N}(0, \frac {\phi'(\theta)^2} {n}) \tag{3}

Now we know that this function of a random variable is still asymptotically normal, but how do we relate this to the example we have here?

Well, though it's easier said than done, we have to find the right $\phi$ , $T_n$ and $\theta$ to link it to our problem.

In the case of a ratio metric like CTR, we can define the following:

$T_n = (\bar{C}, \bar{N})$ , i.e. a vector of average clicks per user and average impressions per user for a sample of n users.
$\theta = (\mu_{\bar{C}}, \mu_{\bar{N}})$ , meaning a vector of real average clicks per user and average impressions per user for a sample of n users.
$\phi(\theta) = \frac {\mu_{\bar{C}}} {\mu_{\bar{N}}}$ , which also implies $\phi(T_n) = \frac {C} {N}$ .

Now, we can plug this into (2), but first we need to make a slight change to account for the multivariate case, since now we are working with vectors:

\phi(T_n) - \phi(\theta) = \nabla\phi \cdot (T_n - \theta) \tag{4}

This then becomes:

\frac {\bar{C}} {\bar{N}} - \frac {\mu_{\bar{C}}} {\mu_{\bar{N}}} = \frac {1} {\mu_{\bar{N}}} (\bar{C} - \mu_{\bar{C}}) - \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}} (\bar{N} - \mu_{\bar{N}}) \tag{5}

We know that (5) is asymptotically normal, we have shown that. Now we need to find the expression for it's variance, which is what we ultimately want.

This is not difficult to compute, if we first decompose equation (5) to the user level:

\frac {1} {\mu_{\bar{N}}} (\bar{C} - \mu_{\bar{C}}) - \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}} (\bar{N} - \mu_{\bar{N}}) = \\ \bar{C} \frac {1} {\mu_{\bar{N}}} - \frac {\mu_{\bar{C}}} {\mu_{\bar{N}}} - \bar{N} \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}} + \frac{\mu_{\bar{C}}} {\mu_{\bar{N}}} = \\ \sum_i^n{\frac{C_i} {n} \frac {1} {\mu_{\bar{N}}} - \frac{N_i} {n} \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}}} = \\ \frac {1} {n} \sum_i^n{C_i \frac {1} {\mu_{\bar{N}}} - N_i \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}}} =

Since $C_i$ and $N_i$ are user level metrics, and user is our randomization unit, we assume that the samples of these metrics are i.i.d. And that makes it much easier to compute the variance. However, keeping in mind that $C_i$ and $N_i$ are not independent, we get:

Var[\frac {1} {n} \sum_i^n{C_i \frac {1} {\mu_{\bar{N}}} - N_i \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}}}] = \\ \frac {1} {n^2} n Var[C_i \frac {1} {\mu_{\bar{N}}} - N_i \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}}] = \\ \frac {1} {n} (Var[C_i \frac {1} {\mu_{\bar{N}}}] + Var[N_i \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}}] - 2Cov[C_i \frac {1} {\mu_{\bar{N}}}, N_i \frac {\mu_{\bar{C}}} {\mu^2_{\bar{N}}}]) = \\ \frac {1} {n} (\frac {1} {\mu^2_{\bar{N}}} Var[C_i] + \frac {\mu^2_{\bar{C}}} {\mu^4_N} Var[N_i] - 2 \frac {\mu_{\bar{C}}} {\mu^3_{\bar{N}}} Cov[C_i, N_i]) = \\ \frac {1} {n\mu^2_{\bar{N}}} [\sigma^2_{\bar{C}} + \frac {\mu^2_{\bar{C}}} {\mu^2_{\bar{N}}} \sigma^2_{\bar{N}} - 2 \frac {\mu_{\bar{C}}} {\mu_{\bar{N}}} \sigma_{\bar{C}\bar{N}}] \tag{6}

This is ultimately the estimand of variance yielded by the delta method, and we can estimate it using the sample means, variances and covariances of the two metrics that make up the ratio metric we want to compute.

In case of CTR, this is average clicks per user (C), and average impressions per user (N). Note that the CTR estimate is pretty straightforward from user level metrics:

CTR = \frac {\bar{C}} {\bar{N}} = \frac {\frac {\sum_i^n {C_i}} {n}} {\frac {\sum_i^n {N_i}} {n}} = \frac {C} {N}

Testing the Solution

So let's see how this works on the same sample we have computed the naive SE on.

sample_ctr = sample_num_clicks.sum() / sample_num_impressions.sum()
clicks_mu = sample_num_clicks.mean()
impressions_mu = sample_num_impressions.mean()
clicks_var, cov_clicks_impressions, _, impressions_var = np.cov(sample_num_clicks, sample_num_impressions).flatten()
delta_var = (
    1 / impressions_mu**2 * clicks_var 
    + clicks_mu**2 / impressions_mu**4 * impressions_var
    - 2 * clicks_mu / impressions_mu**3 * cov_clicks_impressions
)
delta_se = np.sqrt(delta_var / num_users)
print(f"Delta Method SE: {delta_se:.4f}")

Delta Method SE: 0.0028

It seems the Delta Method has computed the correct SE, the same one that we had in the simulated data (up to 4 decimals, of course)!

Let's have a final look at the comparison:

Markdown(
    pd.DataFrame(
        {
            "Naive": [naive_se],
            "Delta Method": [delta_se],
            "Real": [real_se]
        },
        index=["SE"]
    ).round(4).T.to_markdown()
)

	SE
Naive	0.0016
Delta Method	0.0028
Real	0.0028

Conclusion

This article was meant to illustrate how correlated samples can invalidate the standard computation of sampling variance and standard error, and how this can be solved using the delta method. Not accounting for this usualy leads to high false discovery rate, and it is interesting that most A/B testing tools do not really account for experiments that could contain correlated samples. Instead, it is left for the experimenter to make a mistake, or perform the analysis correctly outside of the tool UI.

The model used here is fairly simple, and it's safe to bet that reality is more complex and that the users are way more diverse than the three clusters presented here. The more differences between user clusters, the worse estimate we would get if we use the naive estimator.

It would be interesting to test out different models of increasing complexity and diversity of clusters, and connect them to real use cases, but I would leave that for a future article. Cheers!

https://en.wikipedia.org/wiki/Sampling_distribution ↩
https://en.wikipedia.org/wiki/Central_limit_theorem ↩
There is more to this topic. In case you are interested, the following paper argues how even having analysis and randomization units being the same still doesn't guarantie i.i.d, but comes close enough most of the time. ↩
https://pyro.ai/ ↩
Deng, Alex et al. “Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018): n. pag. (Check section 3.3 - Numerical Examples) ↩
https://en.wikipedia.org/wiki/Categorical_distribution ↩