Question 1

What confidence level should I use?

Accepted Answer

95% is the industry standard and works for most cases. Use 99% for high-stakes decisions (pricing, major redesigns) where false positives are costly. 90% is acceptable for low-risk tests or when you need faster results and can accept more uncertainty.
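To see what each level demands in practice, the sketch below converts a confidence level into the Z-score a result has to exceed. It assumes a two-sided test; `critical_z` is a name made up for this illustration, not something the calculator exposes.

```python
from statistics import NormalDist


def critical_z(confidence: float) -> float:
    """Two-sided critical Z value for a confidence level such as 0.95.

    Assumption: the test is two-sided, so alpha is split across both tails.
    """
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: |Z| must exceed {critical_z(level):.3f}")
```

The higher the level, the larger the threshold, which is why a 99% test needs more data than a 90% test to flag the same difference.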

Question 2

Why isn't my test reaching significance?

Accepted Answer

Three common reasons: (1) Not enough traffic — you need more visitors to detect small differences, (2) Effect is too small — the actual difference between variants may be negligible, (3) High variance — your conversion rate fluctuates a lot. Calculate required sample size and run until you reach it.
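The link between traffic and detectable effect can be sketched roughly as follows. This is a textbook normal approximation that uses the baseline variance for both variants, assumes a two-sided test, and uses a hypothetical helper name (`detectable_relative_lift`); treat the output as a ballpark, not as what this calculator computes.

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(baseline: float, visitors_per_variant: int,
                             confidence: float = 0.95,
                             power: float = 0.80) -> float:
    """Approximate smallest relative lift detectable at the given traffic.

    Assumptions: two-sided test, equal traffic per variant, and the baseline
    variance used for both variants (reasonable for small lifts only).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    std_error = sqrt(2.0 * baseline * (1.0 - baseline) / visitors_per_variant)
    absolute_lift = (z_alpha + z_beta) * std_error
    return absolute_lift / baseline


for visitors in (2_500, 10_000, 40_000):
    lift = detectable_relative_lift(0.05, visitors)
    print(f"{visitors:>6} visitors per variant: about {lift:.1%} relative lift")
```

Quadrupling traffic only halves the detectable lift, which is why small effects can take a long time to confirm.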

Question 3

Can I stop a test early if results look significant?

Accepted Answer

No. Checking repeatedly and stopping at the first significant result pushes your false positive rate well above the nominal alpha. If you've calculated that you need 10,000 visitors, run until 10,000, even if results look significant at 3,000. The exception is a sequential testing method designed for early stopping.
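A small simulation makes the inflation visible. The sketch below runs A/A tests (both variants share the same true rate, so every "winner" is a false positive) and compares stopping at the first significant look against checking once at the end. The run count, batch size, 10% rate, and seed are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided threshold at 95% confidence


def is_significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Pooled two-proportion Z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_error = sqrt(2 * pooled * (1 - pooled) / n)
    if std_error == 0:
        return False
    return abs(conv_b - conv_a) / n / std_error > Z_CRIT


def simulate(runs: int = 500, looks: int = 10, batch: int = 200,
             rate: float = 0.10, seed: int = 1) -> tuple[float, float]:
    """Return (false positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    flagged_any_look = flagged_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        seen_significant = False
        significant_now = False
        for look in range(1, looks + 1):
            # Both variants convert at the same true rate: there is no winner.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            significant_now = is_significant(conv_a, conv_b, look * batch)
            seen_significant = seen_significant or significant_now
        flagged_any_look += seen_significant
        flagged_at_end += significant_now
    return flagged_any_look / runs, flagged_at_end / runs


peeking_rate, fixed_rate = simulate()
print(f"False positives when stopping at the first significant look: {peeking_rate:.1%}")
print(f"False positives when checking once at the end: {fixed_rate:.1%}")
```

The single final check should stay near the nominal 5%, while the peeking strategy typically lands well above it.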

Question 4

What's the difference between confidence and statistical power?

Accepted Answer

Confidence (1 − alpha) is the probability of not declaring a winner when no real difference exists, which is what guards against false positives. Power (1 − beta) is the probability of detecting a real effect when one exists, which guards against false negatives. Standard values: 95% confidence, 80% power.
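The two quantities can be tied together numerically. The sketch below estimates the power of a two-sided two-proportion Z-test under a normal approximation; `approximate_power` and the example rates are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(rate_a: float, rate_b: float, visitors_per_variant: int,
                      confidence: float = 0.95) -> float:
    """Approximate probability that the test flags the difference.

    Assumptions: two-sided test, equal traffic per variant, normal
    approximation with the unpooled standard error.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    std_error = sqrt(rate_a * (1 - rate_a) / visitors_per_variant
                     + rate_b * (1 - rate_b) / visitors_per_variant)
    shift = (rate_b - rate_a) / std_error
    # Chance of landing beyond either critical value, given the true shift.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


print(f"Real lift 5% -> 6%, 8,000 per variant: {approximate_power(0.05, 0.06, 8_000):.1%}")
print(f"No real lift, 8,000 per variant: {approximate_power(0.05, 0.05, 8_000):.1%}")
```

With no real lift, the "power" collapses to alpha: the test still declares a winner 5% of the time, which is exactly the false positive rate that the confidence level controls.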

Question 5

My test is significant but the lift is small. Should I implement?

Accepted Answer

Statistical significance doesn't equal practical significance. A 0.1% conversion lift may be statistically significant with enough data, but not worth implementation effort. Consider: implementation cost, maintenance burden, and whether the lift is meaningful for your business.

Question 6

How do I handle multiple metrics?

Accepted Answer

Define one primary metric before the test starts; that metric determines success. Secondary metrics provide context but shouldn't change your decision. Each additional metric you test raises the chance of at least one false positive, so if you must test several primary metrics, apply a Bonferroni correction (divide alpha by the number of metrics tested).
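A Bonferroni correction is simple enough to sketch in a few lines. The metric names and p-values below are made up for illustration.

```python
def bonferroni_verdicts(p_values: dict[str, float],
                        alpha: float = 0.05) -> dict[str, bool]:
    """Flag each metric as significant only if p < alpha / number of metrics."""
    threshold = alpha / len(p_values)
    return {metric: p < threshold for metric, p in p_values.items()}


# Hypothetical p-values for three primary metrics.
observed = {"signup": 0.012, "purchase": 0.030, "add_to_cart": 0.200}

for metric, significant in bonferroni_verdicts(observed).items():
    print(f"{metric}: {'significant' if significant else 'not significant'}")
```

With three metrics the per-metric threshold drops from 0.05 to about 0.0167, so `purchase` (p = 0.030) would pass an uncorrected test but not the corrected one.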

Question 7

What is statistical significance in A/B testing?

Accepted Answer

Statistical significance indicates that the observed difference between variants is unlikely to have arisen from random sampling variation alone. Operationally, it means the p-value falls below your pre-specified alpha threshold (typically 0.05 for a 95% confidence level). A significance calculator like this one quantifies that probability so you can decide whether to act on the result. Significance does not measure the magnitude or business value of the effect — only the strength of evidence against the null hypothesis of equal conversion rates.
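For readers who want to see the arithmetic, here is a minimal sketch of a two-proportion Z-test. It assumes a pooled standard error and a two-sided p-value; this page does not say which variant of the test the calculator uses, so the function name and those choices are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(visitors_a: int, conversions_a: int,
                          visitors_b: int, conversions_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Assumption: pooled standard error under the null of equal rates.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 500 of 10,000 convert on control, 570 of 10,000 on variant.
z, p_value = two_proportion_z_test(10_000, 500, 10_000, 570)
print(f"z = {z:.2f}, p = {p_value:.3f}, significant at 0.05: {p_value < 0.05}")
```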

Question 8

Is this a free A/B test calculator?

Accepted Answer

Yes. This is a fully free A/B testing calculator with no signup, no email gate, no rate limit, and no vendor account. All computation runs in your browser; no test data is transmitted to any server. You can use it for unlimited experiments, including production decisions. If you compare it to other free options like Evan Miller's tool, the main differences are configurable confidence levels (90/95/99%) and the inclusion of Z-score in the output.

Question 9

How do I interpret p-value vs confidence?

Accepted Answer

They are complementary representations of the same evidence. The p-value is the probability of seeing your result (or more extreme) if the variants were truly identical. Confidence, as used in this tool, equals (1 − p-value) × 100%. A p-value of 0.03 corresponds to 97% confidence. Lower p-values mean stronger evidence against equality. Neither value is the probability that your variant "really" wins — that requires Bayesian inference and a prior. For frequentist A/B testing, treat them as evidence strength against the null, not as probabilities of being right.
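The conversion described above is a one-liner. The sketch below applies it and checks the same p-value against two alpha thresholds; both helper names are made up for illustration.

```python
def confidence_percent(p_value: float) -> float:
    """Confidence as defined on this page: (1 - p-value) x 100."""
    return (1.0 - p_value) * 100.0


def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """A result is significant when the p-value falls below alpha."""
    return p_value < alpha


p = 0.03
print(f"p = {p} -> {confidence_percent(p):.1f}% confidence")
print(f"Significant at alpha 0.05 (95% level): {is_significant(p, 0.05)}")
print(f"Significant at alpha 0.01 (99% level): {is_significant(p, 0.01)}")
```

The same result clears the 95% bar but not the 99% bar, which is why the threshold should be fixed before the test starts.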

Question 10

A/B test vs split test vs multivariate — what's the difference?

Accepted Answer

A/B test and split test are synonymous: two variants compared on one variable. An A/B/n test extends the same logic to three or more variants of one element. A multivariate test (MVT) varies multiple elements simultaneously — for example, headline × image × button color — to estimate main effects and interactions. MVTs require substantially larger sample sizes because each cell in the factorial design needs adequate power. This calculator handles two-variant comparisons; for multivariate experiments you need a multivariate test calculator with factorial design support, or a dedicated experimentation platform.
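To get a feel for why MVTs need so much more traffic, the sketch below enumerates the cells of a full factorial design. The element names, level counts, and the 5,000 visitors per cell are hypothetical placeholders, not recommendations.

```python
from itertools import product
from math import prod

# Hypothetical elements and their variations.
elements = {
    "headline": ["short", "long"],
    "image": ["photo", "illustration", "none"],
    "button_color": ["green", "orange"],
}

# Every combination of levels is one cell in the full factorial design.
cells = list(product(*elements.values()))
cell_count = prod(len(levels) for levels in elements.values())

visitors_per_cell = 5_000  # assumed requirement per cell, for illustration
total_visitors = cell_count * visitors_per_cell

print(f"{cell_count} cells, for example {cells[0]}")
print(f"Total traffic at {visitors_per_cell:,} per cell: {total_visitors:,}")
```

A two-variant test with the same per-cell requirement would need 10,000 visitors; this 2 × 3 × 2 design needs six times that.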

Question 11

Do I need to calculate sample size before starting?

Accepted Answer

Yes. Sample size planning is non-negotiable for valid inference. Without a pre-specified target, you risk peeking and an inflated false positive rate. The required sample depends on your baseline conversion rate, the minimum detectable effect (MDE) you want to catch, and your chosen alpha and power levels. Use our sample size calculator to determine visitors per variant before launching, then run until that target is reached. The sample size reference table gives ballpark figures at 95% confidence and 80% power.
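The four inputs combine roughly as in the sketch below, which uses a common textbook normal approximation for a two-sided two-proportion test. The function name is hypothetical, and the formula is not necessarily the one behind this page's sample size calculator or reference table, so expect figures in the same ballpark rather than an exact match.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_mde: float,
                         confidence: float = 0.95, power: float = 0.80) -> int:
    """Approximate visitors needed in each variant.

    Assumptions: two-sided test, equal traffic split, MDE expressed relative
    to the baseline (0.20 means a 20% relative lift).
    """
    target = baseline * (1.0 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = baseline * (1 - baseline) + target * (1 - target)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (target - baseline) ** 2
    return ceil(n)


for mde in (0.10, 0.20, 0.50):
    needed = visitors_per_variant(0.05, mde)
    print(f"5% baseline, {mde:.0%} relative MDE: about {needed:,} per variant")
```

Halving the MDE roughly quadruples the required sample, so it pays to be realistic about the smallest lift worth detecting.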

Question 12

Why does my A/B test calculator say "not significant"?

Accepted Answer

"Not significant" means the observed difference could plausibly have occurred under the null hypothesis of equal conversion rates. Three common causes: (1) the test is underpowered, with too few visitors to detect the true effect; (2) the actual effect is smaller than your design assumed; (3) variance is high due to seasonality, traffic-source mix, or noisy conversion definitions. Verify that you've reached your pre-calculated sample size. If you have and results remain non-significant, any true difference is probably smaller than your minimum detectable effect; treat the variants as practically equivalent and prioritize larger-effect hypotheses.
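A confidence interval for the difference often makes a non-significant result easier to read than the verdict alone. The sketch below uses an unpooled normal approximation (a Wald interval); the function name and the example counts are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(visitors_a: int, conversions_a: int,
                        visitors_b: int, conversions_b: int,
                        confidence: float = 0.95) -> tuple[float, float]:
    """Approximate interval for (variant rate - control rate).

    Assumption: Wald interval with the unpooled standard error.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    std_error = sqrt(rate_a * (1 - rate_a) / visitors_a
                     + rate_b * (1 - rate_b) / visitors_b)
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    diff = rate_b - rate_a
    return diff - z_crit * std_error, diff + z_crit * std_error


# Hypothetical data: 100 of 2,000 convert on control, 112 of 2,000 on variant.
low, high = difference_interval(2_000, 100, 2_000, 112)
print(f"95% interval for the difference: {low:+.2%} to {high:+.2%}")
print(f"Interval contains zero: {low < 0 < high}")
```

When the interval spans zero, the data are compatible with both a small loss and a small gain, which is what "not significant" is telling you.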

Question 13

Can I use this calculator for non-binary outcomes?

Accepted Answer

No. This A/B test calculator applies a two-proportion Z-test, which assumes binary conversion outcomes (converted / did not convert). For continuous metrics like average order value, revenue per visitor, or session duration, use a t-test or Mann-Whitney U test instead. For count data (e.g., page views per session), Poisson or negative-binomial regression is more appropriate. Mismatching test type to outcome type produces unreliable p-values regardless of sample size.
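For a continuous metric, a rough standard-library sketch looks like the following. It computes Welch's t statistic (unequal variances) and, as a stated shortcut, takes the p-value from the normal distribution, which is only reasonable for large samples. For small samples you would want a proper t distribution, for example via SciPy's `scipy.stats.ttest_ind` with `equal_var=False`. The order values are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_large_sample(control: list[float],
                       variant: list[float]) -> tuple[float, float]:
    """Return (Welch t statistic, approximate two-sided p-value).

    Assumption: samples are large enough that the normal distribution is an
    acceptable stand-in for the t distribution when computing the p-value.
    """
    std_error = sqrt(variance(control) / len(control)
                     + variance(variant) / len(variant))
    t_stat = (mean(variant) - mean(control)) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Hypothetical order values, 200 observations per variant.
control_orders = [10.0, 12.0, 11.0, 13.0, 9.0] * 40
variant_orders = [11.0, 13.0, 12.0, 14.0, 10.0] * 40

t_stat, p_value = welch_large_sample(control_orders, variant_orders)
print(f"t = {t_stat:.2f}, significant at 0.05: {p_value < 0.05}")
```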

Tool	Outputs	Confidence Levels	Free	Privacy
This calculator	Uplift, confidence, p-value, Z-score	90 / 95 / 99%	Yes	Browser-only
Evan Miller’s calculator	Significance verdict, p-value	Fixed (one threshold)	Yes	Browser-only
AB Tasty calculator	Confidence, uplift	Fixed	Signup required	Server-side
VWO calculator	Confidence, uplift	Limited	Vendor lock-in	Server-side
Optimizely	Full Stats Engine	Configurable	Paid platform	Server-side

Sample size reference (95% confidence, 80% power)
Baseline Rate	10% MDE	20% MDE	50% MDE
1%	190,000	48,000	7,700
3%	62,000	15,500	2,500
5%	36,400	9,100	1,500
10%	17,200	4,300	700
20%	7,700	1,900	310

A/B Test Calculator

What is A/B Testing?

Why This A/B Test Calculator

What Confidence and p-value Really Mean

The Math Behind It

Sample Size Requirements

Best Practices

Common Mistakes

Frequently Asked Questions