Fundamentals of A/B & Multivariate Testing
So the title of this presentation is “Fundamentals of A/B and Multivariate Testing.” I chose that title because the real title is too long to fit in the marketing documents.
The actual title of this presentation is “Fractional Factorial Testing and Polynomial Regression Analysis with Orthogonal Arrays for Dummies.” Testing is fundamental to process improvement, and there are a lot of tools out there that can help you get the job done, several of which are offered by Adobe. But it’s really important to understand how those tools work so that you can understand how to get the most value out of them.
In this presentation I’m not going to be pitching one particular tool, and I’m not really even going to mention the tools that we offer. This presentation is just to help you understand the fundamentals of testing so that you can be a better and more successful analyst.
So A/B testing is important because it allows us to constantly improve, to iteratively improve on our KPIs. Some of the things we can test are functionality, promotions, graphics, headlines, layouts, etcetera. This is pretty basic, and frankly, if you understand this, it’s part of the reason why you’re here.
Just to talk about the problems of testing, the first thing we need to do is make observations. We’re going to take a doctored, semi-real-world example. This is the B&H Photo website. Now, we look at this website, we take a look at the data, and we make a couple of observations. The first thing that we observe is that the More Deals section, the set of nine products down at the bottom of the page, converts really well, and that’s really cool.
But we also notice that the More Deals section represents a relatively low percentage of overall revenue, and that’s something that we want to improve. We want to leverage the conversion power of that section to drive our bottom line. Now we’re looking in our analytics report and we notice that the average user to our page sees only a certain amount of the actual site. This red box represents the visible area in the typical browser. As you can see, the More Deals section is actually below the fold, so most people who come to our site don’t even see it. So we have a hypothesis. Our hypothesis is that if we move More Deals above the fold, we’ll make more money.
Now I’ll be honest with you, this is a rookie mistake, right? One of the things that early analysts would do is say, “Hey look, we’ve got our hypothesis, we’ve got our idea, let’s go make this change. Problem solved. Done.” I mean, there were a lot of consultants in the early days at Omniture who would say, “Hey, pay us our money, we’ve now earned our value.” But the fact is we won’t know if we’ve actually added value until we run the test. So in this case we’ve taken the More Deals section and moved it up, but now it’s important to actually look at our revenue, look at our conversions, and see if there’s an impact.
In this example, I’m going to pretend that we’ve run the analysis: we’ve looked at our revenue over a 26-week period, we’ve looked at the last six weeks, and that jump you see there at week 20 is the period in which we ran the test. This, in our case, is a time-series test. We had one version of the site at a particular point in time, we changed it, and we see how the change impacts the bottom line. Now, looking at the 26 weeks of revenue, we’ve got an average, and we’ve run a statistical analysis at the eightieth percentile, or 80 percent confidence interval, to determine the high and low ends that our revenue should fall between to be within normal variation.
In the last six weeks there, we run an average, and you can see that the bright green line is above the dark green line that represents the upper bound of our 80 percent confidence interval. So we can say that making this change was statistically significant, and we actually added to the bottom line.
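As a sketch of that significance check, here it is in Python with made-up weekly revenue numbers standing in for the chart (the real analysis would use your actual data): we compute the baseline average, an 80 percent confidence band around it, and then check whether the test-period average lands above the band.

```python
import statistics
from math import sqrt

# Hypothetical weekly revenue in thousands; the test change lands at week 21.
baseline = [100, 98, 103, 97, 101, 99, 102, 100, 96, 104,
            99, 101, 98, 102, 100, 97, 103, 99, 101, 100]
test_period = [110, 112, 109, 111, 113, 110]  # the last six weeks

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# 80 percent confidence interval around the baseline mean (z ~ 1.2816).
z = 1.2816
margin = z * stdev / sqrt(len(baseline))
lower, upper = mean - margin, mean + margin

test_mean = statistics.mean(test_period)
significant = test_mean > upper  # the "bright green line" above the band
print(f"baseline mean={mean:.1f}, 80% CI=({lower:.1f}, {upper:.1f}), "
      f"test mean={test_mean:.1f}, significant={significant}")
```

With these made-up numbers the six-week average sits well above the upper bound, which is the situation described in the slide.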
Now we’re going to talk about iterative efficiency, because A/B testing is pretty straightforward, but it can get cumbersome. There’s a formula for understanding how many tests, how many iterations, we have to run in order to find the optimal configuration. The formula is the number of variations per element raised to the power of the number of elements we’re testing.
So if we have one element on a page that we want to test, with two different versions of that element (two variations, one element), then we need to run two different versions of the page in order to determine the optimum configuration. If we have two elements on the page and two variations per element, then we need to run four different versions of the page in order to find the revenue-maximizing, KPI-optimizing configuration.
Now you can see where this is going: if we have three elements on the page, we have to run eight tests. Four elements, 16 tests. Five elements, 32. Six, 64. Then 128, 256, 512, 1,024, 2,048. So what this is saying is that if we’ve got 11 elements on this page that we want to test, with two variations each, we have to literally create 2,048 different versions of this page to find the optimum configuration. I don’t have to tell you that if you were to go to your IT department and ask them to create 2,048 different versions of your home page, they would have a cow. Not only that, but if you were to take your traffic in any given period and divide it by 2,048, you’d have to run this test for a couple of years to reach statistical significance.
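The doubling above is just exponentiation, so as a quick sketch:

```python
# Full-factorial test count: variations raised to the number of elements.
def versions_needed(variations: int, elements: int) -> int:
    return variations ** elements

# Two variations per element, 1 through 11 elements:
for elements in range(1, 12):
    print(f"{elements} elements -> {versions_needed(2, elements)} page versions")
```

With 11 two-level elements that last line prints 2,048, which is exactly the blow-up the talk describes.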
This is what I call the iterative efficiency problem, but it’s solved by using some advanced techniques that we’re going to show you. Using, for example, the Taguchi method, we can find the optimum combination of all these different variations of elements in only 12 tests. We’re going to show you how you do that. The way you do that is, of course, with something called orthogonal arrays.
The methodologies that we’re discussing right now were created, or pioneered, by a gentleman by the name of Genichi Taguchi, an engineering and operations scientist from Japan, working back in the 1940s and 50s. Brilliant man, and you should really read his story, but we don’t have time to go into that right now. We’re going to talk about how he impacted the industrial complex of Japan.
After World War II, the Americans took it upon ourselves to rebuild Japan. But there was a problem: Japanese components, electronics, machinery, etc., were very low quality. So a group of American scientists and analysts, working with local experts like Genichi Taguchi, took it upon themselves to help the Japanese improve the quality of their manufactured products. This story is described in Ranjit Roy’s book, A Primer on the Taguchi Method, which I recommend all of you go and check out from Amazon.
This is what they recommended. They figured that taking American manufacturing processes and just tossing them into the Japanese industrial complex wasn’t going to cut it. In America, we optimized our manufacturing by doing inspections. So, for example, you would have an optimal parameter for a particular manufactured product, and you’d have a tolerance range.
You’d say that this widget has to be five centimeters wide, but it could be anywhere from four centimeters to six centimeters; optimally, it’s five centimeters. What would happen is we would manufacture the widget, and at the end of the production line somebody would do a spot inspection. They would determine whether or not the widget fit within the tolerance, and if it didn’t, they would throw it in the discard bin and move on.
So processes in American engineering were designed to minimize the number of discarded components. What that resulted in is a distribution of specifications for different products that looked like this. As you can see, there are very few components outside of the tolerance range. So we’re minimizing the number of discarded components, but the actual component itself could have a specification that was away from optimal. That was okay, because we weren’t throwing things away; we weren’t discarding components.
What Taguchi and other operations scientists in Japan figured out is that by utilizing advanced statistical techniques, they could get more and more products manufactured close to the optimal specification. And that’s what they focused on. They said, “Let’s try to get these components closer to optimal, not just inside tolerance.”
Now, the long-term impact of this can be seen in, for example, the Japanese auto industry. This is an example of the 1984 Honda Civic. I used to have one of these. Great car, I put 200,000 miles on it. Frankly, I only got rid of it in ’93 when my mother forced me to. This is a 1984 Plymouth Horizon, also known as the Dodge Omni. Now, most people don’t recognize this car. You don’t see very many of them on the road today. My dad had one. You frankly couldn’t get in the driver’s side door; you had to wiggle your way through the passenger door because the driver’s side door handle didn’t work.
This is an example of the quality difference between the Japanese and American auto industries up until the 1980s. The difference you observed in the quality of these kinds of products is specifically due to the influence of Taguchi and his optimization methods for production: making sure that the product was as close as possible to the optimal specification, not just inside a tolerance range.
Now, one of the key tools that Taguchi used to achieve this is something called an orthogonal array. I’m not going to go into a lot of detail about what an orthogonal array is or means. My understanding is that most of the orthogonal arrays necessary for the kind of testing work that we do were developed by Franciscan monks in the 17th century, but just follow me.
If we’re going to take two elements and we want to understand the optimal arrangement of two elements that each have two possible variations, say a test state and a control state, we need to understand not only which state is better, but how the elements interact with each other. There are only four possibilities that we can observe: control-control, control-test, test-control, and test-test.
So if we have a control state and a test state for two elements, there are only four different combinations that we can observe. An orthogonal array is an array that allows us to test the control state and the test state of each of our elements, and to test the interaction between any two elements. In this case, the rows are the tests and the columns are the elements. Here we have three different elements, say in an engineering process, that we want to test.
We have a control state and a test state for each. You will observe that for every pair of these three elements, we test the control-control, control-test, test-control, and test-test states of those two elements.
We’ll do that again for another pair of elements. You’ll see again the control-control, control-test, test-control, and test-test states. So every single pair of elements in here is being tested in all four combinations. That’s important to understand. What’s also important to understand is where the efficiency comes from: we’re not testing every combination of every three or every four elements on the page. We’re only testing every combination of every two elements on the page.
So this is your orthogonal array broken out, and you can see the variations we’re not testing. We’re not testing control-control-test or test-control-control, because it’s not necessary: we know all of the interactions between every pair of elements, and we know how every single element behaves.
The next thing that’s important to understand, and this is a larger orthogonal array with seven elements: every single element is tested in its control state and its test state half the time each, the same number of times.
So here we’ll go ahead and look at a pair of columns, and you’ll see we’ve got control-control, control-test, test-control, and test-test each appearing twice. Every single state is tested the same number of times. If we take another pair of columns, we’ll see exactly the same thing: control-control, control-test, test-control, and test-test.
So with an orthogonal array, every single element is tested in its test state and its control state half the time, and every single possible interaction between two elements is tested the same number of times. And fortunately, all of the orthogonal arrays you’ll ever need have already been worked out; you don’t have to mess around with it or try to derive your own.
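To make those balance properties concrete, here’s a sketch in Python that checks the standard L8(2^7) array, the seven-element, eight-run array discussed above: every column is half control and half test, and every pair of columns covers all four combinations exactly twice.

```python
from itertools import combinations
from collections import Counter

# Taguchi's L8(2^7) orthogonal array: 8 runs, 7 two-level elements
# (0 = control state, 1 = test state).
L8 = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]

# Each element appears in control and test the same number of times (4 each).
for col in range(7):
    counts = Counter(row[col] for row in L8)
    assert counts[0] == counts[1] == 4

# Every pair of elements covers control-control, control-test, test-control,
# and test-test exactly twice.
for a, b in combinations(range(7), 2):
    pairs = Counter((row[a], row[b]) for row in L8)
    assert all(pairs[p] == 2 for p in [(0, 0), (0, 1), (1, 0), (1, 1)])

print("L8 is balanced: 8 runs cover all pairwise interactions of 7 elements")
```

Eight runs instead of 2^7 = 128 is the same kind of saving as the 12-run array for 11 elements mentioned earlier.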
Now, say we run a test using the Taguchi method and we determine that we’ve optimized the configuration of the different elements on the page. The next thing you want to know is: was this statistically significant? Did we have a large enough sample size? You can use tools for that.
So this is an example of a tool that I’ve created that you can use. Let’s say we want to look at this with 90 percent confidence: our baseline conversion rate is 28 percent, and after we run the test it looks like it’s up to 30 percent. We want to know if that’s real. Can we trust that? How many conversions do we have to look at in order to determine if this is statistically significant?
This is just a tool, and here’s the link: acu.men/stats/ is a tool that you can use to determine the sample size you need to reach in order to know whether the results you’re seeing are statistically significant. And that link up above, neilsloane.com, is a list of all the orthogonal arrays you would ever need to design your tests.
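I can’t speak for exactly what the acu.men calculator computes under the hood, but a simplified normal-approximation version of that sample-size math looks like this. It’s a sketch, not a full power analysis:

```python
from math import ceil

def sample_size(p_baseline: float, p_test: float, z: float = 1.645) -> int:
    """Rough per-variation sample size needed to distinguish two conversion
    rates at the confidence implied by z (1.645 ~ 90 percent confidence).
    Simplified normal approximation; no power term."""
    d = abs(p_test - p_baseline)
    variance = p_baseline * (1 - p_baseline) + p_test * (1 - p_test)
    return ceil(z ** 2 * variance / d ** 2)

# The example from the talk: 28 percent baseline vs. an observed 30 percent,
# at 90 percent confidence.
print(sample_size(0.28, 0.30))
```

The small 2-point lift is what drives the sample size up: halving the detectable difference roughly quadruples the visitors you need.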
Not all of the variations that you’re going to want to observe or test for can be controlled. We also need to understand how to test for changes, elements, or variables that you can’t necessarily just go in and tweak on your own. That requires something called polynomial regression. But let’s back up a minute and first review what linear regression means.
This is a chart showing Adobe’s stock price over the last 6 or 12 months or something like that. We have all the different stock prices in there, and we’ve run a line through the middle of it. That line is a regression line: the best-fit line through all of these data points.
Now, there’s one thing that’s very important to note here. It’s called the R-squared, and it’s in bold here: .9441. What that means is that 94.41 percent of the variance in this series of data can be explained by that line, by the trend implied by that line. So 94 percent is pretty darn good. That means if we just apply the line, 94 percent of the time we’re explaining the variation. The remaining five and a half percent or so of the variation is just error, something we don’t understand.
This is an example of a distribution with a .31 R-squared. You can see there’s a whole bunch of values out there, and there’s a regression line through it, the best line through it, but that line only explains 31 percent of the variation. You can just sort of look at it here and observe that a higher R-squared is better. One is perfect, but you’ll never get to one. So this is linear regression. Now, the cool thing about this is that you’re creating a model to understand one particular variable and how it impacts your KPI.
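Here’s what that best-fit line and its R-squared look like in plain Python, using made-up (x, y) data points rather than the actual stock prices:

```python
# Least-squares fit of a line through made-up, roughly linear data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept of the best-fit (regression) line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# R-squared: the share of the variance the line explains.
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
```

Because this toy data hugs the line closely, the R-squared comes out above .99; scatter the points more and it falls toward the .31 case shown on the slide.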
But you can run this in multiple dimensions. This is an example of a regression with two independent variables in three dimensions. It’s kind of like a line, except the line is extended into another dimension, so you’ve got sort of a plane here. What’s really exciting is that you don’t have to stop at three dimensions. You can only visualize three dimensions because we’re human, but you can go into the 4th, 5th, 6th, 18th, 20th, or 1,024th dimension if you really want to. That’s called multiple regression.
Once you break the multiple regression model down into a formula, this is roughly what it’s going to look like. The Y over there is the KPI, the number that you’re trying to predict; let’s say it’s conversions or revenue or what have you. Then each variable, you can see one here circled in green, each variable has a coefficient next to it.
So let’s say you’re trying to calculate the probability of voting, and you have age or income as variables. X would be either age or income: X1 would be one variable, X2 would be the other. The coefficients are the numbers that weight each variable, that explain how that variable contributes to your Y, your probability of voting or your revenue or your conversion. So this is what the multiple regression formula ends up looking like.
You can do this yourself in Excel. Frankly, there’s a little thing called the Analysis ToolPak that you can install in Excel, and it creates a little Data Analysis button inside your Data tab. So you’ve got a whole bunch of data; in this example, column B in this Excel spreadsheet is my Y variable, the probability of voting, columns C through J are binary indicators of whether you voted in one of the previous eight elections, and age is column K.
You can go into Excel and tell it: my dependent variable is column B, my independent variables are C through K. Come up with a formula, create a model to predict this. The end result will end up looking like so.
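The ToolPak’s Regression tool is doing least squares under the hood, and the same computation is a few lines in Python. This sketch uses made-up stand-in data (a hypothetical past-vote flag and an age column), not the actual spreadsheet:

```python
import numpy as np

# Hypothetical data standing in for the spreadsheet: y is the outcome,
# x1 is a 0/1 voted-last-election flag, x2 is age.
rng = np.random.default_rng(0)
n = 200
x1 = rng.integers(0, 2, n)                  # voted in the last election
x2 = rng.uniform(18, 80, n)                 # age
y = 0.2 + 0.4 * x1 + 0.005 * x2 + rng.normal(0, 0.05, n)  # noisy outcome

# Design matrix with an intercept column, solved by least squares --
# the same math the ToolPak's Regression output reports.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ coef
r_squared = 1 - resid.var() / y.var()
print("intercept, b1, b2 =", np.round(coef, 3))
print("R^2 =", round(r_squared, 3))
```

Because we generated the data ourselves, the fitted coefficients land close to the true 0.2, 0.4, and 0.005 used to build it, which is a handy sanity check on the method.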
Now you can see here we’ve got the R-squared. This is an actual model, by the way: it’s forecasting voter turnout probability in the 2018 general election based on the previous eight elections over the last four years, from 2014 until this coming November.
The R-squared here is 43 percent. That means the model Excel created is explaining 43 percent of the variation in voter turnout probability. And then you can see over here we’ve got something called the T-stat. Remember that linear regression line I showed you, with all those values bouncing around the line? Basically, how far they bounce away from the line is calculated and brought together into a T-stat. The higher the T-stat, the lower the probability that this result is chance, and the P-value to the right of it tells you the actual probability. We’re looking for T-stats of at least two: above two means there’s roughly a 95 to 99 percent chance that this is not just chance.
So we’ve got values here: three, five, six. You can see 23, 22. What this is telling me is that number eight there is the last election, number seven is the last primary election, and number five is the general election before that. It’s telling me that those three elections are very indicative of whether someone’s going to vote in the next election. The other elections are still statistically significant, but not as much. Altogether, this model explains 43 percent of the variation in voter turnout probability.
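If you want to turn a T-stat into a P-value yourself, the large-sample normal approximation is only a couple of lines. This is a rough sketch, not what Excel does internally (Excel uses the exact t distribution, which matters for small samples):

```python
from math import erfc, sqrt

def p_value_from_t(t: float) -> float:
    """Two-sided p-value for a t statistic via the normal approximation,
    which is fine when the sample is large."""
    return erfc(abs(t) / sqrt(2))

# A T-stat of about 2 corresponds to roughly p = 0.05, the usual cutoff,
# which is why we look for T-stats of at least two.
for t in [1.0, 2.0, 3.0, 5.0]:
    print(f"t={t}: p={p_value_from_t(t):.4f}")
```

You can see the pattern the talk describes: as the T-stat climbs past two, the probability that the coefficient is just chance drops below 5 percent and keeps shrinking fast.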
Now, if you apply this to web analytics, what this allows you to do is say: all right, let’s just throw a whole bunch of variables into the mix. Let’s throw in IP address, country, time zone, operating system. Let’s take data from our CRM, from our call center, from our ticketing system. Let’s take time of day, recency. Let’s take our affiliate ID, campaign IDs, all sorts of various information, put them together in sort of a digital fingerprint, and let’s try to come up with a model that forecasts, that predicts, the likelihood that someone is going to convert, that someone is actually going to purchase our product. Or maybe on the negative side, the likelihood that someone is going to churn, or that someone’s going to reach out to our call center. So using multiple regression, polynomial regression, allows us to look at factors that we can’t control and use them to forecast behavior.
Now, this is a lot of work, and most people don’t want to play around with orthogonal arrays and multiple regression in Excel by hand. So, unfortunately, this is the plug portion: Adobe does have tools that help you do this. Adobe Target is our tool for both A/B and multivariate testing, and it also allows you to do forecasting and run up-sell, cross-sell, and merchandising campaigns, and push offers to individuals in a way that we think will maximize conversion.
So it takes into account as many environmental, psychographic, and behavioral factors as possible and uses that to say: if you show this individual this offer, we think it will maximize their response. And it finds that optimal R-squared, that formula that explains the most of their propensity to buy or convert, automatically. That’s awesome, it’s magical, and that’s what Adobe offers.
But at least now you sort of have a glimpse into what’s going on behind the scenes and hopefully you’ll understand a little bit better how these tools work, so that you can make more informed decisions about how to take advantage of them.