Is the Bayesian approach to A/B testing overrated?
In this post, I will discuss some statements I keep hearing about why the Bayesian approach to analysing an A/B test is superior to frequentist methods. A/B testing provides evidence that enables a business to make data-driven decisions. However, improper analysis of an A/B test leads to wrong conclusions, defeating the purpose of running the test in the first place.
No Sample Size Planning Required
The first statement is that Bayesian A/B tests require no sample size planning ahead of the test, whereas frequentist hypothesis testing does. On the surface this seems correct: there is no such procedural requirement on the Bayesian side. However, this prerequisite is exactly how frequentist methods control type I and type II errors, which the Bayesian approach does not. By strictly following the hypothesis testing procedure (checking the result only after the pre-defined sample size has been reached, then choosing the variant by consulting the p-value), the result is bounded by the pre-defined type I and type II error rates. Quantifying and controlling the rate of incorrect discoveries means fewer mistakes in the long run for teams and businesses.
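To make the planning step concrete, below is a minimal sketch of the standard normal-approximation sample size formula for a two-proportion test. The function name and the example numbers (a 10% baseline and a 2-point absolute lift) are my own illustrations, not taken from any particular library:

```python
def sample_size_per_variant(p_base, mde, z_alpha=1.959964, z_beta=0.841621):
    """Approximate sample size per variant for a two-proportion z-test.

    Uses the normal-approximation formula; the default z values
    correspond to a two-sided alpha of 0.05 and 80% power.
    """
    p_b = p_base + mde  # expected rate of the treatment variant
    variance = p_base * (1 - p_base) + p_b * (1 - p_b)
    n = ((z_alpha + z_beta) ** 2) * variance / mde ** 2
    return int(n) + 1

# To detect a 2-point absolute lift on a 10% baseline:
n = sample_size_per_variant(0.10, 0.02)
print(f"required visitors per variant: {n}")
```

With these inputs the formula asks for a few thousand visitors per variant; running until that count is reached, and only then testing, is what keeps the promised error rates.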
Bayesian methods make no claim to control the type I / II error rates. They infer the probability of a hypothesis from a prior belief, updated with the observed data. The Bayesian interpretation remains correct regardless of the chosen decision rule. However, a correct Bayesian inference is not the same thing as a correct decision (choosing the right variant).
In layman's terms, the result of a Bayesian analysis is like a weather report: say, tomorrow has an 80% chance of rain. If the next day turns out sunny, would you say the weather report was incorrect? No, it was still correct, because there was a 20% chance of no rain. A Bayesian interpretation works the same way.
Bayesian analysis relies on a stopping/decision rule to choose the variant, and in some situations the choice of rule considerably increases the chance of making the wrong decision (Sanborn, 2013). Among four popular stopping rules (NHST, Bayes factor, HDI with ROPE, and precision), none claims to control the type I / II error rate. Frequentist methods, however, have a straightforward decision rule (reject the null hypothesis if the p-value is below the pre-defined alpha) that is simple yet powerful, and it controls the type I / II error rates.
Consider the weather report again to illustrate the choice of decision rule. One person decides to carry an umbrella if the chance of rain is greater than 50% (that individual's decision rule). Another brings an umbrella only when the probability of rain exceeds 80% (the rule is already inconsistent from one individual to the next). The first person then makes an incorrect decision, carrying an umbrella on a sunny day, yet this does not invalidate the accuracy of the Bayesian model. The inference is correct, but the decision may not be.
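The gap between valid inference and error control can be shown with a small simulation. The sketch below runs A/A tests (both variants identical) and applies a common Bayesian decision rule, declaring a winner as soon as P(B > A) leaves [0.05, 0.95] at any interim look. The batch size, threshold, and flat Beta(1, 1) priors are illustrative assumptions of mine, not from the cited paper:

```python
import random

random.seed(1)

def prob_b_beats_a(x_a, n_a, x_b, n_b, draws=500):
    """Monte Carlo estimate of P(p_B > p_A) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        p_a = random.betavariate(1 + x_a, 1 + n_a - x_a)
        p_b = random.betavariate(1 + x_b, 1 + n_b - x_b)
        wins += p_b > p_a
    return wins / draws

# A/A test: both variants share the same true rate, so ANY confident
# winner declared by the rule "stop when P(B > A) leaves [0.05, 0.95]"
# is a wrong call.
p_true, batch, max_n, threshold = 0.10, 200, 2000, 0.95
trials, wrong_calls = 200, 0
for _ in range(trials):
    x_a = x_b = n = 0
    while n < max_n:
        x_a += sum(random.random() < p_true for _ in range(batch))
        x_b += sum(random.random() < p_true for _ in range(batch))
        n += batch
        p = prob_b_beats_a(x_a, n, x_b, n)
        if p > threshold or p < 1 - threshold:
            wrong_calls += 1  # confidently declared a winner that doesn't exist
            break

print(f"wrong confident calls in A/A tests: {wrong_calls / trials:.2f}")
```

Because the variants are identical, every confident call is wrong, and in runs like this the rate of such calls typically lands well above the 5% that a fixed-sample frequentist test would allow. The posterior at each look is still perfectly valid; it is the repeated-looking decision rule that inflates the error rate.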
Stopping an experiment early
The second overstated claim about Bayesian methods for A/B testing is that they allow stopping an experiment early. This illusory benefit stems from the fact that frequentist methods suffer a wild increase in the false positive rate when the results are "peeked" at prematurely. But that increase is simply the consequence of an incorrect analysis procedure. In a frequentist hypothesis test, the expected false positive rate is only bounded once the required sample size has been reached (plus a few other cautions). If the correct procedure is followed, the result is as sound as promised.
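A simulation makes the peeking effect visible. The sketch below runs A/A tests and compares two procedures on the same data: rejecting if any interim look crosses z = 1.96, versus consulting only the final, pre-planned look. All parameters here are illustrative choices of mine:

```python
import random

random.seed(2)

def z_stat(x_a, n_a, x_b, n_b):
    """Absolute z statistic of a two-proportion z-test."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    return 0.0 if se == 0 else abs((x_a / n_a - x_b / n_b) / se)

# A/A test: any rejection is a false positive. Compare peeking after
# every batch with checking once at the pre-defined sample size.
p_true, batch, max_n, z_crit = 0.10, 200, 2000, 1.96
trials, peeking_fp, fixed_fp = 1000, 0, 0
for _ in range(trials):
    x_a = x_b = n = 0
    hit_on_a_peek = False
    while n < max_n:
        x_a += sum(random.random() < p_true for _ in range(batch))
        x_b += sum(random.random() < p_true for _ in range(batch))
        n += batch
        if z_stat(x_a, n, x_b, n) > z_crit:
            hit_on_a_peek = True  # a peeker would stop here and "win"
    peeking_fp += hit_on_a_peek
    fixed_fp += z_stat(x_a, n, x_b, n) > z_crit  # final look only

print(f"false positive rate, peeking:      {peeking_fp / trials:.3f}")
print(f"false positive rate, fixed sample: {fixed_fp / trials:.3f}")
```

On the very same data, the fixed-sample procedure stays near the nominal 5% while repeated peeking drives the false positive rate noticeably higher. The inflation comes from the procedure, not from any defect in the frequentist test itself.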
While Bayesian models give an always-valid inference, that does not guarantee the correctness of the result. The decision is what interests the experimenter, not the internal state of the Bayesian model. The benefit of being able to stop an experiment early is overstated.
Answering Business Questions

The last claim I will discuss is that Bayesian statistics give the right answers to business questions, while frequentists struggle to explain results in terms of the null hypothesis. This claim has a rather blurry focus, as different roles in a business often expect different levels of detail in the answer. For instance, product managers (semi-technical roles) expect to understand the impact each variation in an A/B test has on the product. Frequentist analysis is well equipped for this task: it computes a confidence interval, which reveals the impact between an upper and a lower bound. Bayesian methods compute a credible interval (e.g. the HPDI), which is analogous to the frequentist confidence interval; its bounds are treated as fixed, and the impact is estimated as a probability over the values inside them. Both kinds of interval can be charted, presented, and understood. I would even suggest that the confidence interval is easier to explain for a business question than the Bayesian HPDI.
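To show how similar the two intervals look in practice, here is a sketch computing both for the same made-up data (120 conversions out of 1,000 visitors). For simplicity it uses an equal-tailed credible interval from posterior draws rather than a true HPDI, and a flat Beta(1, 1) prior:

```python
import random
from math import sqrt

random.seed(3)

x, n = 120, 1000  # 120 conversions out of 1,000 visitors (made-up data)

# Frequentist 95% confidence interval (normal approximation)
p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian 95% credible interval under a Beta(1, 1) prior, estimated
# from sorted posterior draws (equal-tailed rather than HPDI)
draws = sorted(random.betavariate(1 + x, 1 + n - x) for _ in range(20000))
cri = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])

print(f"confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"credible interval:   ({cri[0]:.3f}, {cri[1]:.3f})")
```

With this much data and a flat prior, the two intervals nearly coincide; either can be put on a chart for a product manager, which is why neither side has a monopoly on answering the business question.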
Some might be interested in the chance that the treatment is better than the control variant, which Bayesian analysis answers directly. That number, however, is an interpretation of the underlying Bayesian model given the chosen stopping rule and priors, and it aligns poorly with the actual impact each variant has on the business. On top of that, one needs to understand unintuitive terms such as priors and Bayes factors before being able to make such judgements. Georgi (2017) suggests that the question it solves translates to:
“Given that I have some prior knowledge or belief about my variant and control, and that I have observed data X0, following a given procedure, how should data X0 change my knowledge or belief about my variant and control?”.
Answering that question demands a fair level of statistical sophistication, no less than the frequentist alternative does. (See the references below.)
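For completeness, the direct Bayesian answer is easy to compute. Below is a sketch with made-up data, assuming flat Beta(1, 1) priors so that each posterior is Beta(1 + x, 1 + n - x):

```python
import random

random.seed(4)

# Observed data (made-up numbers for illustration)
x_a, n_a = 100, 1000   # control: 10.0% conversion
x_b, n_b = 120, 1000   # treatment: 12.0% conversion

# Estimate P(p_B > p_A) by sampling both Beta posteriors
draws = 20000
wins = sum(
    random.betavariate(1 + x_b, 1 + n_b - x_b)
    > random.betavariate(1 + x_a, 1 + n_a - x_a)
    for _ in range(draws)
)
print(f"P(treatment beats control): {wins / draws:.3f}")
```

The one-line answer ("the treatment is probably better") is appealing, but note everything baked into it: the prior, the model, and the stopping rule under which the data were collected.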
This is not meant as a blanket criticism of Bayesian methods. There are Bayesian approaches that mitigate some of the problems above, and the same holds for frequentist approaches. There are also cases where a team decides to go Bayesian for reasons outside the scope of this analysis. For those use cases, both sides of the Bayesian methods should be explored, ideally with the help of simulations. Applying either kind of analysis without fully understanding it can be dangerous to the business: it wastes resources and sets the team on the wrong course.
References
- Sanborn, A. N., & Hills, T. T. (2013). The frequentist implications of optional stopping on Bayesian hypothesis tests. Psychonomic Bulletin & Review, 21. doi:10.3758/s13423-013-0518-9
- 5 Reasons to Go Bayesian in AB Testing – Debunked
- A/B Testing Rigorously (without losing your job)
- Optional stopping in data collection: p values, Bayes factors, credible intervals, precision