Context
Picture this: you set up your experiment to send 50% of users to variant A and 50% to variant B. When the experiment ends, you have 44,000 users in A and 45,000 in B. Is this a problem?
Most likely. If your assignment worked properly, the probability of seeing this imbalance (or a larger one) is less than 0.1%.
When the observed split differs significantly from the expected split, you have a Sample Ratio Mismatch (SRM). Unless you understand why it happened, you should not analyse the results of the experiment: your setup may be flawed, which would invalidate any conclusions.
SRM can be detected with a statistical test: the chi-squared goodness-of-fit test. It compares the observed frequencies (44,000 and 45,000 in the example) to the expected frequencies (usually an even split: 44,500 and 44,500).
When you run a chi-squared test, you get a chi-square value: a non-negative number that measures how far the observed counts are from the expected ones. The larger the value, the larger the mismatch between reality and expectations.
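Concretely, the chi-square value sums (observed − expected)² / expected over the variants. For the 44,000 vs 45,000 example, with 44,500 users expected per variant, the arithmetic works out as:

```latex
\chi^2 = \frac{(44{,}000 - 44{,}500)^2}{44{,}500} + \frac{(45{,}000 - 44{,}500)^2}{44{,}500}
       = \frac{2 \times 500^2}{44{,}500} \approx 11.24
```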
Together with the degrees of freedom (the number of variants minus one), the chi-square value gives us a p-value.
The p-value is a number between 0 and 1 that indicates how surprised you should be to see your distribution (or one even more extreme) if your experiment setup was working correctly. The smaller the p-value, the more surprised you should be.
A perfectly even distribution gives a chi-square value of 0 and a p-value of 1: you shouldn't be surprised at all. A p-value of 0.50 means an imbalance at least this large would happen about 50% of the time by chance alone; nothing to worry about. A p-value of 0.0008 (as in our 44,000 vs 45,000 example) means it would happen less than 0.1% of the time, so it's unlikely to be random.
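With only two variants there is a single degree of freedom, and the chi-square tail probability then has a closed form, p = erfc(√(χ²/2)). The sketch below uses only the Python standard library to reproduce the example's numbers; the function name `two_variant_srm_p_value` is hypothetical, and it assumes a 50/50 target split.

```python
import math


def two_variant_srm_p_value(count_a, count_b):
    """Chi-square goodness-of-fit test for a 50/50 split (1 degree of freedom)."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # With 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


chi2, p = two_variant_srm_p_value(44_000, 45_000)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

For 44,000 vs 45,000 this gives a chi-square value of about 11.24 and a p-value of about 0.0008. With three or more variants or an uneven target split, the general incomplete-gamma calculation is needed instead.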
p-values below 0.05 are evidence of a potential SRM issue, and values below 0.01 are strong evidence that something is wrong with your experiment setup.
Calculations
When you enter two or more observed frequencies (i.e. how many users/samples landed on each variant), the calculator does the following:
- Calculates expected frequencies based on your distribution choice:
  - For equal distribution: each variant gets the same expected count (total samples ÷ number of variants)
  - For custom ratio: your specified ratio is normalised (so it sums to 1) and multiplied by the total sample count
- Computes the chi-square statistic using the formula:
  $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
  where $O_i$ is the observed count for variant $i$, $E_i$ is its expected count, and $k$ is the number of variants
- Determines the degrees of freedom as $\mathrm{df} = k - 1$, where $k$ is the number of variants
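- Calculates the p-value from the chi-square statistic and degrees of freedom using the regularised incomplete gamma function:
  $$p = 1 - P\left(\frac{\mathrm{df}}{2}, \frac{\chi^2}{2}\right) = Q\left(\frac{\mathrm{df}}{2}, \frac{\chi^2}{2}\right)$$
  where $P(a, x) = \gamma(a, x) / \Gamma(a)$ is the regularised lower incomplete gamma function and $Q = 1 - P$ is its upper counterpart.
  This involves two numerical approximation techniques: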
- Interprets the result:
  - p < 0.01: strong evidence of sample ratio mismatch
  - 0.01 ≤ p < 0.05: possible sample ratio mismatch
  - p ≥ 0.05: no evidence of sample ratio mismatch
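The steps above can be sketched end to end in Python. This is a minimal, standard-library-only sketch, not the calculator's actual source: all function names are hypothetical, and for the regularised incomplete gamma function it assumes a widely used pairing of a power series (when x < a + 1) and a continued fraction evaluated with the modified Lentz method (otherwise). The calculator's own two approximation techniques may differ.

```python
import math

# Convergence settings for the two incomplete-gamma approximations (assumed values).
EPS = 1e-14        # stop when the relative change drops below this
FPMIN = 1e-300     # floor that keeps Lentz's method from dividing by zero
MAX_ITER = 10_000  # hard cap so no loop can run forever


def expected_counts(observed, ratio=None):
    """Even split by default; otherwise normalise `ratio` to sum to 1 and scale by the total."""
    total = sum(observed)
    weights = ratio if ratio is not None else [1] * len(observed)
    weight_sum = sum(weights)
    return [total * w / weight_sum for w in weights]


def chi_square(observed, expected):
    """Pearson's statistic: the sum over variants of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def _log_prefactor(a, x):
    """log(x^a * e^-x / Gamma(a)), shared by both approximations."""
    return a * math.log(x) - x - math.lgamma(a)


def _lower_series(a, x):
    """Regularised lower incomplete gamma P(a, x) via its power series (x < a + 1)."""
    term = total = 1.0 / a
    for n in range(1, MAX_ITER):
        term *= x / (a + n)
        total += term
        if abs(term) < abs(total) * EPS:
            break
    return total * math.exp(_log_prefactor(a, x))


def _upper_continued_fraction(a, x):
    """Regularised upper incomplete gamma Q(a, x) via a continued fraction,
    evaluated with the modified Lentz method (x >= a + 1)."""
    b = x + 1.0 - a
    c = 1.0 / FPMIN
    d = 1.0 / b
    h = d
    for i in range(1, MAX_ITER):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < FPMIN:
            d = FPMIN
        c = b + an / c
        if abs(c) < FPMIN:
            c = FPMIN
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < EPS:
            break
    return h * math.exp(_log_prefactor(a, x))


def p_value(chi2, df):
    """Upper-tail probability of the chi-square distribution: Q(df/2, chi2/2)."""
    a, x = df / 2.0, chi2 / 2.0
    if x <= 0.0:
        return 1.0
    if x < a + 1.0:
        return 1.0 - _lower_series(a, x)
    return _upper_continued_fraction(a, x)


def interpret(p):
    """Map a p-value to the three verdicts listed above."""
    if p < 0.01:
        return "strong evidence of sample ratio mismatch"
    if p < 0.05:
        return "possible sample ratio mismatch"
    return "no evidence of sample ratio mismatch"


def srm_test(observed, ratio=None):
    """Full check: expected counts, chi-square, degrees of freedom, p-value, verdict."""
    if len(observed) < 2:
        raise ValueError("need at least two variants")
    if ratio is not None and len(ratio) != len(observed):
        raise ValueError("ratio must have one entry per variant")
    expected = expected_counts(observed, ratio)
    chi2 = chi_square(observed, expected)
    df = len(observed) - 1
    p = p_value(chi2, df)
    return chi2, df, p, interpret(p)


if __name__ == "__main__":
    chi2, df, p, verdict = srm_test([44_000, 45_000])
    print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.4f}: {verdict}")
```

Running `srm_test([44_000, 45_000])` should report a chi-square value of about 11.24 with 1 degree of freedom and p ≈ 0.0008, which falls in the "strong evidence" band. A custom split is passed as unnormalised weights, e.g. `srm_test([30_000, 70_000], ratio=[3, 7])`. Switching between the series and the continued fraction at x = a + 1 keeps each approximation in the region where it converges quickly.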
All calculations are done locally; your data never leaves your device.
The calculator has been tested by comparing its results to those of an established statistical package in Python (scipy.stats). You can verify that your device returns the expected numbers, within a 0.01% margin of error, by adding ?test to this calculator's URL.