Calculate statistical significance of your experiments
This A/B test calculator evaluates whether the difference between two variants is genuine or attributable to sampling noise. Enter visitor counts and conversions for each cell, and the tool returns relative uplift, confidence, p-value, and Z-score. As a free statistical significance calculator, it covers the standard practitioner workflow — proportion comparison, hypothesis testing at 90/95/99% thresholds, and a two-tailed verdict — without requiring registration. Whether you call it an ab testing calculator, split test calculator, or a/b test calculator, the underlying math is identical: a two-proportion Z-test against the null hypothesis of equal conversion rates. Computation runs entirely in your browser. No data leaves your device.
🧪 What is A/B Testing?
A/B testing (also called split testing) is a method of comparing two versions of a webpage, email, or other marketing asset to determine which one performs better. You split your audience randomly between version A (control) and version B (variant), then measure which version achieves more conversions.
The key challenge is determining whether the difference in performance is statistically significant or just due to random chance. This calculator uses a two-proportion z-test to determine if your results are reliable enough to make a decision. For methodology, hypothesis design, and program-level practices, our A/B testing pillar guide covers the full lifecycle from idea backlog to post-test analysis.
🎯 Why This A/B Test Calculator
Several free split test calculators circulate online, but their feature sets diverge meaningfully. Some require account registration, others offer only a single fixed confidence level, and a few transmit your data to third-party servers. This ab test calculator was built around four design constraints: methodological rigor, configurability, privacy, and zero friction.
First, the tool computes conversion rates in real time as you type, then runs a two-proportion Z-test using the pooled standard error. Second, it surfaces the full output set — relative uplift, confidence, p-value, and Z-score — rather than abbreviating to a single “winner” verdict. Third, it lets you toggle between 90%, 95%, and 99% confidence thresholds, which matters because the appropriate alpha depends on the cost of a false positive in your context. Fourth, all calculations execute client-side in JavaScript; no telemetry, no signup, no vendor account.
Compared to commonly cited alternatives, the trade-offs look like this:
| Tool | Outputs | Confidence Levels | Free | Privacy |
| --- | --- | --- | --- | --- |
| This calculator | Uplift, confidence, p-value, Z-score | 90 / 95 / 99% | Yes | Browser-only |
| Evan Miller’s calculator | Significance verdict, p-value | Fixed (one threshold) | Yes | Browser-only |
| AB Tasty calculator | Confidence, uplift | Fixed | Signup required | Server-side |
| VWO calculator | Confidence, uplift | Limited | Vendor lock-in | Server-side |
| Optimizely | Full Stats Engine | Configurable | Paid platform | Server-side |
For quick decisions and exploratory analysis, a free statistical significance calculator is sufficient. For continuous experimentation programs across many concurrent tests, dedicated platforms add value through traffic allocation, sequential testing, and audit trails. The two are complements, not substitutes. If you need to size a study before launch, see our companion sample size calculator.
📊 What Confidence and p-value Really Mean
Confidence and p-value are commonly conflated, yet they carry distinct definitions in frequentist inference. The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one computed from your data. It is not the probability that the null hypothesis is true, nor the probability that your variant “really” wins. This distinction is emphasized in standard references such as Wasserman’s All of Statistics and Casella & Berger’s Statistical Inference.
Confidence, in this calculator, equals 1 minus the p-value, expressed as a percentage. A 95% confidence threshold corresponds to alpha = 0.05, the conventional Type I error rate adopted from R. A. Fisher’s early work. Reaching 95% confidence does not mean the observed uplift will replicate 95% of the time; it means that, if the variants were truly identical, results this extreme would occur in only 5% of repeated experiments.
Two implications follow. First, statistical significance is binary against an arbitrary threshold, but the underlying evidence is continuous: p-values of 0.049 and 0.051 are practically identical. Second, significance does not imply practical relevance. A trivially small uplift can reach significance with a large enough sample. Therefore, always pair the p-value with the effect size and a confidence interval around the difference.
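As a rough illustration of pairing the p-value with an interval, the sketch below computes a Wald confidence interval for the absolute difference in conversion rates. The function name and parameters are hypothetical, and the Wald interval is one common choice rather than something this calculator is documented to report.

```python
from math import sqrt
from statistics import NormalDist


def difference_confidence_interval(visitors_a, conversions_a,
                                   visitors_b, conversions_b,
                                   confidence=0.95):
    """Wald interval for (rate_b - rate_a), in absolute terms (illustrative)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Unpooled standard error: each arm contributes its own variance.
    std_error = sqrt(rate_a * (1 - rate_a) / visitors_a
                     + rate_b * (1 - rate_b) / visitors_b)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z_crit * std_error, diff + z_crit * std_error


# Example: 5.0% vs 5.6% conversion with 10,000 visitors per arm.
low, high = difference_confidence_interval(10_000, 500, 10_000, 560)
print(f"95% CI for the absolute difference: [{low:.4f}, {high:.4f}]")
```

An interval that spans zero, as in this example, tells the same story as a non-significant p-value, but it also shows how large or small the true difference could plausibly be.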
📐 The Math Behind It
Z-Score Formula
Z = (p₁ – p₂) ÷ √[p(1-p)(1/n₁ + 1/n₂)]
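Where p₁ and p₂ are the observed conversion rates of the two variants, n₁ and n₂ are their visitor counts, and p is the pooled proportion: total conversions divided by total visitors across both variants.
The p-value is the probability of observing a difference at least this large if the two variants truly had the same conversion rate. It is not the probability that the difference occurred by chance. A p-value below your threshold (e.g., 0.05 for 95% confidence) indicates statistical significance.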
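The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name, argument names, and the sign convention (positive Z when the variant leads) are assumptions, and it expects both arms to have at least one conversion.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-tailed, pooled two-proportion Z-test (illustrative sketch)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled proportion under the null hypothesis of equal conversion rates.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    # Assumed sign convention: positive z means the variant (B) is ahead.
    z = (rate_b - rate_a) / std_error
    # Two-tailed p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "uplift": (rate_b - rate_a) / rate_a,  # relative uplift vs. control
        "z": z,
        "p_value": p_value,
        "confidence": (1 - p_value) * 100,     # as defined on this page
    }


result = two_proportion_z_test(10_000, 500, 10_000, 560)
print({name: round(value, 4) for name, value in result.items()})
```

With 500 versus 560 conversions on 10,000 visitors each, the relative uplift is 12%, Z is about 1.89, and the p-value is about 0.058. That clears the 90% threshold but not 95%.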
📏 Sample Size Requirements
The sample size you need depends on your baseline conversion rate and the minimum detectable effect (MDE) you want to measure.
| Baseline Rate | 10% MDE | 20% MDE | 50% MDE |
| --- | --- | --- | --- |
| 1% | 190,000 | 48,000 | 7,700 |
| 3% | 62,000 | 15,500 | 2,500 |
| 5% | 36,400 | 9,100 | 1,500 |
| 10% | 17,200 | 4,300 | 700 |
| 20% | 7,700 | 1,900 | 310 |

*Visitors per variant, at 95% confidence and 80% statistical power. MDE = Minimum Detectable Effect (relative change).
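For baselines or MDEs not listed in the table, a standard normal-approximation formula gives a comparable estimate. The sketch below is one common textbook approach with hypothetical function and parameter names. The page does not say which formula produced the table, so its output will not match the table exactly.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-tailed test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)     # rate if the MDE is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


print(sample_size_per_variant(0.05, 0.10))
```

For a 5% baseline and a 10% relative MDE, this approximation returns roughly 31,000 visitors per variant, the same order of magnitude as the table's 36,400. Treat either figure as a planning ballpark.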
💡 Best Practices
✓ Determine Sample Size First
Calculate required sample size before starting. Running until you see significance leads to false positives.
✓ Run Full Business Cycles
Include weekdays and weekends. Behavior varies by day, so run tests for at least 1-2 full weeks.
✓ Test One Variable
Change only one element at a time. Testing multiple changes makes it impossible to know what worked.
✓ Document Everything
Record hypothesis, metrics, dates, and segments. Build organizational knowledge from every test.
Once a winning variant is identified, quantify its commercial impact with our marketing ROI calculator to translate uplift into incremental revenue.
⚠️ Common Mistakes
✗ Peeking at Results
Checking results daily and stopping as soon as they look significant inflates the false positive rate from 5% to 30% or more.
✗ Underpowered Tests
Running with too few visitors means you’ll miss real effects and may wrongly conclude “no difference.”
✗ Testing Too Many Variants
Each variant needs its own traffic. With 5 variants, you need roughly 5x the per-variant sample size in total traffic to keep the same statistical power for each comparison.
✗ Ignoring Segments
Overall results may hide that variant B wins for mobile but loses for desktop. Check key segments.
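The cost of peeking can be checked with a small simulation. The sketch below runs A/A experiments, where both arms share the same true rate, and compares how often a "significant" result appears at any interim look versus only at the planned end. All names and default values are hypothetical choices for illustration, and the exact inflation depends on how many looks are taken.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_peeking(n_experiments=500, visitors_per_arm=2000, looks=10,
                     true_rate=0.05, alpha=0.05, seed=42):
    """Estimate A/A false positive rates with and without peeking."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = visitors_per_arm // looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Same true rate in both arms, so any "win" is a false positive.
            conv_a += sum(rng.random() < true_rate for _ in range(step))
            conv_b += sum(rng.random() < true_rate for _ in range(step))
            n = look * step
            pooled = (conv_a + conv_b) / (2 * n)
            std_error = sqrt(pooled * (1 - pooled) * 2 / n)
            significant_now = False
            if std_error > 0:
                z = (conv_b - conv_a) / n / std_error
                significant_now = abs(z) > z_crit
            significant_at_any_look = significant_at_any_look or significant_now
        peeking_hits += significant_at_any_look  # stopped at first "win"
        fixed_hits += significant_now            # verdict at the final look only
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking_rate, fixed_rate = simulate_peeking()
print(f"peeking: {peeking_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate comes out several times higher. Increasing `looks` pushes it higher still.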
❓ Frequently Asked Questions
What confidence level should I use?
95% is the industry standard and works for most cases. Use 99% for high-stakes decisions (pricing, major redesigns) where false positives are costly. 90% is acceptable for low-risk tests or when you need faster results and can accept more uncertainty.
Why isn't my test reaching significance?
Three common reasons: (1) Not enough traffic — you need more visitors to detect small differences, (2) Effect is too small — the actual difference between variants may be negligible, (3) High variance — your conversion rate fluctuates a lot. Calculate required sample size and run until you reach it.
Can I stop my test early if the results look significant?
No. Early stopping inflates your false positive rate dramatically. If you’ve calculated 10,000 visitors needed, run until 10,000 — even if results look significant at 3,000. The exception is if you’re using sequential testing methods designed for early stopping.
What is the difference between confidence and statistical power?
Confidence (1 minus alpha) is the probability of NOT declaring a winner when there isn’t one (avoiding false positives). Power (1 minus beta) is the probability of detecting a real effect when it exists (avoiding false negatives). Standard values: 95% confidence, 80% power.
My result is significant. Should I always implement the winner?
Statistical significance doesn’t equal practical significance. A 0.1% conversion lift may be statistically significant with enough data, but not worth implementation effort. Consider: implementation cost, maintenance burden, and whether the lift is meaningful for your business.
Which metrics should I use to judge a test?
Define one primary metric before the test starts — that’s what determines success. Secondary metrics provide context but shouldn’t change your decision. Testing multiple metrics increases false positive risk; apply Bonferroni correction if you must test several primary metrics.
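As a small illustration of the Bonferroni correction mentioned above, the sketch below divides alpha by the number of metrics tested and checks each p-value against the stricter threshold. The function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values stay significant after a Bonferroni correction."""
    # Each of the m tests must clear alpha / m instead of alpha.
    adjusted_alpha = alpha / len(p_values)
    return [p <= adjusted_alpha for p in p_values]


# Three primary metrics: only p = 0.01 clears the adjusted threshold of 0.0167.
print(bonferroni_significant([0.01, 0.03, 0.04]))  # [True, False, False]
```

Bonferroni is conservative. It keeps the overall false positive risk at or below alpha, at the cost of some power on each individual metric.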
What does statistical significance mean?
Statistical significance indicates that the observed difference between variants is unlikely to have arisen from random sampling variation alone. Operationally, it means the p-value falls below your pre-specified alpha threshold (typically 0.05 for a 95% confidence level). A significance calculator like this one quantifies that probability so you can decide whether to act on the result. Significance does not measure the magnitude or business value of the effect, only the strength of evidence against the null hypothesis of equal conversion rates.
Is this A/B test calculator free?
Yes. This is a fully free ab testing calculator with no signup, no email gate, no rate limit, and no vendor account. All computation runs in your browser; no test data is transmitted to any server. You can use it for unlimited experiments, including production decisions. If you compare it to other free options like Evan Miller’s tool, the main differences are configurable confidence levels (90/95/99%) and the inclusion of Z-score in the output.
What is the difference between p-value and confidence?
They are complementary representations of the same evidence. The p-value is the probability of seeing your result (or more extreme) if the variants were truly identical. Confidence, as used in this tool, equals (1 − p-value) × 100%. A p-value of 0.03 corresponds to 97% confidence. Lower p-values mean stronger evidence against equality. Neither value is the probability that your variant “really” wins — that requires Bayesian inference and a prior. For frequentist A/B testing, treat them as evidence strength against the null, not as probabilities of being right.
What is the difference between an A/B test, a split test, and a multivariate test?
A/B test and split test are synonymous: two variants compared on one variable. An A/B/n test extends the same logic to three or more variants of one element. A multivariate test (MVT) varies multiple elements simultaneously — for example, headline × image × button color — to estimate main effects and interactions. MVTs require substantially larger sample sizes because each cell in the factorial design needs adequate power. This calculator handles two-variant comparisons; for multivariate experiments you need a multivariate test calculator with factorial design support, or a dedicated experimentation platform.
Do I need to calculate sample size before running a test?
Yes — sample size planning is non-negotiable for valid inference. Without a pre-specified target, you risk peeking and inflated false positive rates. The required sample depends on your baseline conversion rate, the minimum detectable effect (MDE) you want to catch, and your chosen alpha and power levels. Use our sample size calculator to determine visitors per variant before launching, then run until that target is reached. The reference table above gives ballpark figures at 95% confidence and 80% power.
What does it mean if my result is not significant?
“Not significant” means the observed difference could plausibly have occurred under the null hypothesis of equal conversion rates. Three common causes: (1) the test is underpowered — not enough visitors to detect the true effect; (2) the actual effect is smaller than your design assumed; (3) variance is high due to seasonality, traffic-source mix, or noisy conversion definitions. Verify you’ve reached your pre-calculated sample size. If you have and results remain non-significant, treat the variants as practically equivalent and prioritize larger-effect hypotheses.
Can I use this calculator for revenue or other continuous metrics?
No. This a/b test calculator applies a two-proportion Z-test, which assumes binary conversion outcomes (converted / did not convert). For continuous metrics like average order value, revenue per visitor, or session duration, use a t-test or Mann-Whitney U test instead. For count data (e.g., page views per session), Poisson or negative-binomial regression is more appropriate. Mismatching test type to outcome type produces unreliable p-values regardless of sample size.