Screenshot of Brian Christian’s April 2012 Wired magazine article, “The A/B Test: Inside the Technology That’s Changing the Rules of Business”
In this class, we’ve been interested in the influence of industry on academia and vice versa. In particular, we started this semester discussing how the term “data science” is used colloquially in tech companies and start-ups, and considered what functions and skills a “data scientist” needs to be successful in those jobs.
Oftentimes, the data scientist will use classical techniques or methods. Take logistic regression, a statistical model with origins in mathematics and statistics dating back to the 19th century. Logistic regression, as you know, is used pervasively at modern-day tech companies such as Google, Facebook, Yahoo, you name it, to predict probabilities of events with binary outcomes: ad clicks, friend adds, and so on. These companies obviously didn’t invent logistic regression; rather, they found new applications for it. And it’s not just tech companies that use it. We’ve discussed credit rating models before: will someone default on a loan or not? There are countless other examples.
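To make this concrete, here’s a minimal sketch in Python of fitting a logistic regression to predict click probabilities. Everything in it (the features, the coefficients, the data) is synthetic and invented for illustration; it is not any company’s actual model.

```python
# A minimal sketch, NOT any company's real model: logistic regression on
# synthetic data to predict a binary outcome (click / no click).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Two invented user features.
time_on_page = rng.exponential(scale=30.0, size=n)   # seconds
past_clicks = rng.poisson(lam=2.0, size=n)

# Assumed "true" log-odds of a click, used only to simulate labels.
logit = -3.0 + 0.02 * time_on_page + 0.4 * past_clicks
p_click = 1.0 / (1.0 + np.exp(-logit))
clicked = rng.binomial(1, p_click)

X = np.column_stack([time_on_page, past_clicks])
model = LogisticRegression().fit(X, clicked)

# Predicted click probability for a hypothetical new user.
new_user = np.array([[45.0, 3.0]])
print(model.predict_proba(new_user)[:, 1])
```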
Similarly, experiments, and in particular the design of experiments, have their origins squarely back in the 18th century (at least). The goal of an experiment is to understand the impact of some intervention or treatment on some population (human beings, plants, animals, children, men over the age of 65…), isolating the effect of the treatment itself while controlling for the rest of the variation in the population. As Ori Stitelman pointed out in lecture last week, the true “gold standard” would be if you could have the same exact human beings in both your treatment group and your control group. But you can’t have both realities at once: give someone a drug and not give them the drug. Why is this the gold standard? Because then the two groups would be EXACTLY the same, and you don’t know what’s going on in individuals’ bodies or lives that could also be influencing the outcome.
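To see why you’d want that time machine, here is a toy sketch with invented numbers: each person has two potential outcomes, one under treatment and one under control, and we only ever get to observe one of the two.

```python
# A toy illustration of the "same people in both groups" ideal: every
# person has two potential outcomes, but we observe only one of them.
# All numbers are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 5

y_control = rng.normal(loc=10.0, scale=2.0, size=n)  # outcome with no drug
y_treated = y_control + 3.0                          # outcome with the drug (true effect: +3)

treated = rng.binomial(1, 0.5, size=n)               # which reality we get to see
observed = np.where(treated == 1, y_treated, y_control)

for i in range(n):
    counterfactual = y_control[i] if treated[i] else y_treated[i]
    print(f"person {i}: treated={treated[i]}, observed={observed[i]:.1f}, "
          f"unobserved counterfactual={counterfactual:.1f}")
```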
So instead we do the next best thing: randomized experiments, which people call the “gold standard” because they are possible to achieve, as opposed to the simultaneous alternate realities that Ori described as requiring a time machine. You take two random samples; one set of people will be treated, and the other set will be the control group. Here we are making the assumption that these two groups may as well be the exact same groups, in the sense that the underlying distributions of their covariates (sex, age, weight, smoking or not, …) will be the same. But in practice the two groups won’t be exactly the same; there will be some variation. There are of course books and books and papers and papers written to address how to properly design experiments and derive estimators to do the best you can.
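Here is a minimal simulation, under invented assumptions (one covariate, a linear outcome model, a true effect of 2.0), of why randomization works: random assignment balances the covariate across the two groups, so a simple difference in means recovers the treatment effect.

```python
# A minimal simulation of a randomized experiment. Random assignment
# makes the covariate distributions match on average, so a difference
# in means estimates the treatment effect. All quantities are invented.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

age = rng.normal(loc=40.0, scale=12.0, size=n)   # a background covariate
treated = rng.binomial(1, 0.5, size=n)           # coin-flip assignment

# Outcome depends on the covariate AND the treatment (true effect = 2.0).
outcome = 0.1 * age + 2.0 * treated + rng.normal(size=n)

# Randomization balances the covariate across groups...
print("mean age, treated:", age[treated == 1].mean())
print("mean age, control:", age[treated == 0].mean())

# ...so the difference in means recovers the true effect of ~2.0.
ate_hat = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print("estimated treatment effect:", ate_hat)
```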
So again, experiments have been going on inside and outside of universities for a long time, with methods and techniques coming from biostatistics and statistics departments, but also from outside academia. (R.A. Fisher, for example, developed much of the theory of the design of experiments, still used today, by doing agricultural experiments.) Experiments have been done in engineering companies for a while, and are now done inside companies such as Google and m6d (where Ori works); the industry term for this is “A/B testing.” In A/B testing the goal is to understand the impact of some change, in the website UI, the underlying algorithm, or a marketing campaign, for example, on users or customers. The Wired magazine article above explains how the Obama campaign used it. So again, here’s industry using classical methods, but in a new context. Something to point out, though, is that it’s not simply that the context is novel: these companies now experiment on a massive scale, so we are faced with new methodological, design, algorithmic, and estimation problems, as well as new infrastructure problems, and new research possibilities open up that are then solved to some extent inside industry. See for example this article by some of my colleagues at Google, Overlapping Experiment Infrastructure: More, Better, Faster Experimentation, or from the Facebook Data Science team, Social Influence in Social Advertising: Evidence from Field Experiments.
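As a toy illustration of the analysis side of an A/B test, here is a sketch comparing click-through rates between two variants with a two-proportion z-test. The counts are made up, and real systems like those described in the papers above involve far more machinery than this.

```python
# A made-up A/B test: did variant B's change to the UI move click-through?
from statsmodels.stats.proportion import proportions_ztest

clicks = [310, 370]            # hypothetical clicks for A (control) and B
impressions = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"CTR A: {clicks[0] / impressions[0]:.4f}, CTR B: {clicks[1] / impressions[1]:.4f}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```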
Now you might think we’re all set: companies have control over their product, so they can make changes to it and randomly give some users one version and other users another, at virtually no cost: just flip a bit in the code. And this is often the case. But things get complicated when it’s not feasible, ethical, or practical to decide whether someone receives a treatment. Outside the realm of tech companies, the clearest example is cigarette smoking. If you want to isolate the impact of cigarette smoking on the outcome, lung cancer, you can’t conduct an experiment where you randomly assign some people to smoke and others not to. In the context of online behavior, while it’s not dangerous to “make someone” navigate to a particular website the way it is to “make someone” smoke, the people who choose to visit the Banana Republic website, for example, may be a particular type of person, so if you randomly forced some people to go there and forced others to stay away, you wouldn’t really be measuring the effect accurately or analyzing the right population.
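Here is a toy simulation of that selection problem, with all quantities invented: a confounder (prior interest in the brand, say) drives both who visits the site and how much they spend, so the naive observational comparison overstates the effect of visiting.

```python
# A toy simulation of selection bias: the "type of person" who visits
# also differs in ways that affect the outcome, so a naive comparison
# of visitors vs. non-visitors is biased. All quantities are invented.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Confounder: prior interest in the brand.
interest = rng.normal(size=n)

# People with more interest are more likely to visit (self-selection).
visited = rng.binomial(1, 1 / (1 + np.exp(-interest)))

# Spending depends on interest AND on visiting; true visit effect = 1.0.
spend = 2.0 * interest + 1.0 * visited + rng.normal(size=n)

# The naive comparison mixes the visit effect with the selection effect.
naive = spend[visited == 1].mean() - spend[visited == 0].mean()
print("naive estimate:", naive, "(true effect is 1.0)")
```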
So this is where causal modeling and observational studies come in. Of course, there are entire semesters of material on this, and we only spent two classes on it, but hopefully you now have the intuition and some of the statistics behind it. For homework, you are asked to read Andrew Gelman’s paper, Experimental Reasoning in Social Science, linked to in his blog post.
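And to close the loop on the toy example above, here is a sketch of one simple observational-study tool, regression adjustment: if the confounder is measured, including it in a regression can recover the effect. This is only a sketch under the same invented setup; real observational studies require far more care, which is what those semesters of material are about.

```python
# Regression adjustment on the same invented setup as above: regressing
# spend on visited AND the measured confounder recovers the true effect.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
interest = rng.normal(size=n)
visited = rng.binomial(1, 1 / (1 + np.exp(-interest)))
spend = 2.0 * interest + 1.0 * visited + rng.normal(size=n)

# Ordinary least squares with an intercept, the treatment, and the confounder.
X = np.column_stack([np.ones(n), visited, interest])
coef, *_ = np.linalg.lstsq(X, spend, rcond=None)
print("adjusted estimate of visit effect:", coef[1])  # ~1.0, the true effect
```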