You're probably AB testing wrong.

Hey Guys!

Long-time lurker, figured I'd start contributing. I've seen a fair bit of stuff crop up lately about AB testing, and it annoys me every time I see people screwing it up.

So let's have a chat about Bayesian statistics, multi-armed bandits and AB testing.

If you understand this joke, you can probably skip this or call me out on my mistakes.



Source: xkcd: Frequentists vs. Bayesians

The ELI5 of this is that frequentist statistics only looks at the data in the experiment. Bayesian statistics can be informed by prior knowledge.

What else does Bayesian Statistics offer?
Firstly, you need to understand that you're answering a fundamentally different question.
With a frequentist approach you're generally testing a null hypothesis. Here the null hypothesis is that there is no difference between the conversion rates of variation A and variation B. To answer this you look at a p-value, which is (roughly) the probability of observing a result at least as extreme as yours purely by random chance, assuming the null hypothesis is true. If the p-value is less than 0.05, you typically reject the null hypothesis and declare a winning variation.
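
As a concrete sketch (the counts here are hypothetical, not from any real test), the standard frequentist check is a two-proportion z-test; in Python, statsmodels does the work:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: A converted 120/2000 visitors, B converted 150/2000.
conversions = np.array([120, 150])
visitors = np.array([2000, 2000])

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.3f}")  # declare a winner only if this is < 0.05
```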

With Bayesian statistics you generally phrase questions along the lines of "What's the probability that variation A is better?" or "What's the probability that variation A is the best, or within x% of variation B?". These are far more valuable when making business decisions: you can weigh the opportunity cost against the probability of a negative outcome, which lets you value the speed of iterating quickly versus waiting for near-certainty. You do have to worry about priors, but you can either use an uninformative (uniform) prior or build one from a previous week or so of conversion data. An informed prior helps in the early stages of the AB test while you're waiting for enough data to come in.
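
Here's a minimal Bayesian sketch on the same hypothetical counts, using Beta posteriors and Monte Carlo draws. The Beta(1, 1) prior is uniform; swap in counts from last week's data if you want an informed prior:

```python
import numpy as np

rng = np.random.default_rng(42)

# Same hypothetical counts: A converted 120/2000, B converted 150/2000.
# Beta(1, 1) is a uniform prior; an informed prior would be
# Beta(1 + past_conversions, 1 + past_misses) from historical data.
post_a = rng.beta(1 + 120, 1 + 2000 - 120, size=100_000)
post_b = rng.beta(1 + 150, 1 + 2000 - 150, size=100_000)

print("P(B beats A):", (post_b > post_a).mean())
print("P(A is within 5% of B):", (post_a > 0.95 * post_b).mean())
```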


Now let's go through the two main ways of AB testing.
There are multi-armed bandits and what I call vanilla splits.

In a vanilla split you take all the incoming traffic and divide it between the variations at a fixed ratio. You then look at the results using Bayesian statistics and decide which variation won.

In a multi-armed bandit (GOOGLE DOES THIS BY DEFAULT) you start out with a split, and then the split varies depending on which variation is winning. This aims to send more traffic to the likely 'winner' and reach significance quicker. The one issue is that the premises underlying this are pretty munted in the real world. Anyone who's played the internet marketing game seriously knows that traffic quality varies significantly. Some is good, some is bad.
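
To make the reallocation concrete, here's a sketch of one common bandit scheme, Thompson sampling (which is, as far as I know, roughly what Google's tool uses); the day-one counts are made up:

```python
import numpy as np

rng = np.random.default_rng(7)

def thompson_split(conv_a, n_a, conv_b, n_b, draws=10_000):
    """Fraction of traffic Thompson sampling sends to A: the probability
    that a draw from A's posterior beats a draw from B's (uniform
    Beta(1, 1) priors assumed)."""
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)
    return float((a > b).mean())

# Made-up first day where B is clearly ahead: nearly all traffic
# gets routed to B from here on.
print(thompson_split(conv_a=100, n_a=1000, conv_b=200, n_b=1000))  # ~0.0
```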

Let's take a look at what happens when you have a multi-armed bandit with varying traffic quality. Think of a 'set' as a day of traffic.



Say the true conversion rate of Variant A is 0.1 and Variant B is 0.2.

For ease of explanation we'll do the allocation at one discrete interval per set, proportional to the observed conversion rates at the time. The allocation algorithm could of course be different; this problem exists for every allocation algorithm I'm aware of.

Set 1: 2k traffic comes in and it converts as expected.
Set 2: 2k traffic comes in, we allocate it according to the conversions so far, and it converts as expected.
Set 3: A spammer has a field day and we get 20k of traffic, but it converts terribly. Crucially, because the bandit is now favouring B, Variant B ends up with a higher proportion of shit traffic to good traffic than Variant A.

When we look at the overall results, the conversion rate of B is now worse than the conversion rate of A, even though we know the true conversion rate of B is double that of A.
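
Here's a toy simulation of the three sets. All numbers are assumed: 2k visitors on each good day, then a 20k spam day converting at 1% of normal, and I've made the set-3 split greedier than strictly proportional (90% to B) so the reversal is stark, which is roughly what a bandit that has converged on B would do anyway:

```python
import numpy as np

rng = np.random.default_rng(0)

true_rate = {"A": 0.10, "B": 0.20}   # B's true rate is double A's
visits = {"A": 0, "B": 0}
convs = {"A": 0, "B": 0}

def run_set(n, split_a, quality=1.0):
    """Send n visitors, a fraction split_a of them to A; quality scales
    the true conversion rate (1.0 = normal traffic, near 0 = spam)."""
    for arm, share in (("A", split_a), ("B", 1.0 - split_a)):
        n_arm = int(n * share)
        visits[arm] += n_arm
        convs[arm] += rng.binomial(n_arm, true_rate[arm] * quality)

def observed(arm):
    return convs[arm] / max(visits[arm], 1)

# Set 1: 2k visitors, even split, normal quality.
run_set(2000, split_a=0.5)

# Set 2: 2k visitors, allocated proportionally to observed conversion.
split = observed("A") / (observed("A") + observed("B"))
run_set(2000, split_a=split)

# Set 3: 20k spam visitors that barely convert; the bandit is heavily
# favouring B by now, so B soaks up most of the junk (assumed 90/10).
run_set(20000, split_a=0.10, quality=0.01)

for arm in ("A", "B"):
    print(f"{arm}: {visits[arm]} visits, {convs[arm]} conversions, "
          f"rate {observed(arm):.4f}")
# A's observed rate comes out on top even though B's true rate is double.
```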

Apply some Bayesian statistics to the above and you'll see Variation A declared the winner with essentially 100% confidence.
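
Run the Bayesian comparison from earlier on those pooled (toy) totals and the posterior is emphatic, and emphatically wrong:

```python
import numpy as np

rng = np.random.default_rng(1)

# Approximate pooled totals from the toy simulation above (assumed numbers).
post_a = rng.beta(1 + 168, 1 + 3666 - 168, size=100_000)
post_b = rng.beta(1 + 503, 1 + 20333 - 503, size=100_000)
print("P(A beats B):", (post_a > post_b).mean())  # ~1.0
```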

This is a simple example with a very obvious anomaly, which arguably we could detect and account for. (It's essentially Simpson's paradox: each variant saw a different mix of traffic quality, so the aggregate rates reverse the per-day ones.)

The question I pose to all my fellow warriors is: can we confidently account for more subtle variations in traffic quality, and is it worth it just to reach significance a little quicker? If you don't think we can, then everyone needs to quit using multi-armed bandits in Google and wherever else they're running tests.

Hope you enjoyed this, guys. Hit me up with any questions and I'll try to get around to answering them when I find time.
  • ANDREIS:
    I think you're overcomplicating this a bit. Every offer and landing page has its own conversion rate depending on what the offer is and what the traffic is. Conversion rate can only be established by testing and then testing some more.
  • smiket:
    That is why developing a solid measurement plan is crucial. Segmentation will help you spot the 'shit traffic', be it by traffic source, keywords, age/gender, referral, interests, etc., and once you do that you can compare everything apples to apples instead of basing your decisions on campaign-wide averages. As I have been repeatedly told, you don't just look at your data and try to derive meaningful insights from it; you have to ask a question first, such as "how can I reduce wasted spend?" or "what segment converts best/worst and why?", and then dig into your data for the answer. That, combined with your in-depth knowledge of statistics, should make you a feared opponent when it comes to online marketing.
  • ozki:
    My experience with A/B testing is that it often leaves real game-changing modifications out.
    Shocking? Not really.
    The results tend to be MEDIOCRE incremental 'improvements' rather than stuff worth writing home about.
    Personally, what works is a truly elemental approach: a systematic and methodical way of 'evolving' on-page and off-page signals to maximize conversions.
    Surprisingly, this isn't expensive. In fact, it's free if you have the time.
    • oompaloompa (replying to ozki):
      Nothing stops you from running an AB test on a radically different page. You sacrifice some learning about the specifics of a hypothesis, but you can go for a new global maximum, which I think is what you're referring to as game changers. In fact I think this is often the better approach, especially in the beginning. Try radically different pages; once you've done a bunch you'll start to get a feel for what works and be able to construct more and more informed tests, which is where the incremental improvements come into their own. That 1% to 2% increase at the start of your funnel just doubled the number of people entering it.
