Experiments, A/B Testing and Causal Modeling

Screenshot of article by Brian Christian from Wired magazine, The A/B Test: Inside the Technology That’s Changing the Rules of Business from April 2012

Dear Students,

I want to address explicitly why Causal Modeling and Experiments are part of this course. The last two lectures have addressed observational studies and causal modeling and a bit on experiments.

In this class, we’ve been interested in the influence of industry on academia and vice versa. In particular, we started this semester discussing how the term “data science” is used colloquially as a term in tech companies and start-ups, and considered what functions and skills a “data scientist” tends to have to be successful in those jobs.

Oftentimes, the data scientist will use classical techniques or methods: for example, logistic regression, a statistical model with origins in mathematics and statistics dating back to the 19th century. Logistic regression, as you know, is used pervasively at modern-day tech companies such as Google, Facebook, Yahoo, you name it, to predict probabilities of events with binary outcomes such as ad clicks, friend adds, and so on. These companies obviously didn’t invent logistic regression, but rather found a new application for it. Further, it’s not just tech companies that use it: we’ve discussed credit rating models before, where the outcome is default or not. And there are countless other examples.

Similarly, experiments, and in particular the design of experiments, have their origins back in the 18th century (at least). The goal of an experiment is to understand the impact of some intervention or treatment on some population (human beings, plants, animals, children, men over the age of 65…). You want to isolate the effect of the treatment itself and control for the rest of the variation in the population. As Ori Stitelman pointed out in lecture last week, the true “gold standard” would be if you could have the same exact human beings both in your treatment group and your control group. But you can’t simultaneously have both realities: give them a drug, not give them a drug. Why is this the gold standard? Because ideally the two groups of people would be EXACTLY the same, since you don’t know what’s going on in individuals’ bodies or lives that could also be influencing the outcome.

So instead we do the next best thing, randomized experiments, which people call the “gold standard” because it’s possible to achieve, as opposed to the simultaneous alternate realities that Ori described as requiring a time machine. You take two random samples: one set of people will be treated, and the other set of people will be the control group. Here, we are making the assumption that these two groups may as well be the exact same group, in that the underlying distributions over their covariates (sex, age, weight, smoking/not,…) will be the same. But in practice, the two groups won’t be exactly the same; there will be some variation. There are of course books and books and papers and papers written to address how to properly design experiments and derive estimators to do the best you can.
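To make the "same distribution, but not exactly the same groups" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the population size, the single covariate (age, drawn uniformly between 18 and 80), and the helper name `randomize`.

```python
import random
import statistics

def randomize(population, rng):
    """Shuffle a copy of the population and split it into two equal-sized arms."""
    shuffled = population[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical covariate: one age per person, uniform between 18 and 80.
ages = [rng.uniform(18, 80) for _ in range(10_000)]

treatment, control = randomize(ages, rng)
gap = statistics.mean(treatment) - statistics.mean(control)

print(f"mean age, treatment: {statistics.mean(treatment):.2f}")
print(f"mean age, control:   {statistics.mean(control):.2f}")
print(f"difference:          {gap:.2f}")
```

With groups this large the two means should land close together without matching exactly, which is the residual variation that experimental design and estimation methods try to account for.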

So again, experiments have been going on inside and outside of universities for a long time, with methods and techniques coming from biostatistics and statistics departments, but also from outside of academia. (R.A. Fisher, for example, developed much of the theory of the Design of Experiments, still used today, by doing agricultural experiments.) Experiments have been done in engineering companies for a while, and are now done inside companies such as Google and m6d (where Ori works), where the industry term is “A/B Testing”. In A/B Testing the goal is to understand the impact of some change in the website UI, underlying algorithm, or marketing campaign, for example, on users or customers. The Wired magazine article above explains how the Obama campaign used it. So again, here’s industry using classical methods, but in a new context. But it’s not simply that the context is novel: they are now doing it on a massive scale, and therefore we are faced with new methodological, design, algorithmic and estimation problems, as well as new infrastructure problems. So new research possibilities open up, which are then solved to some extent inside industry. See for example this article by some of my colleagues at Google, Overlapping Experiment Infrastructure: More, Better, Faster Experimentation, or from the Facebook Data Science team, Social Influence in Social Advertising: Evidence from Field Experiments.
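As a rough illustration of how the outcome of a simple A/B test might be analyzed, here is a sketch of a pooled two-proportion z-test. This is one textbook approach, not necessarily what any of the companies above use. The function name `two_proportion_z_test` and the conversion counts are made up for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 the best estimate of the shared rate pools both arms.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 10,000 users per arm, 500 conversions for A and 560 for B.
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The normal approximation is only reasonable when each arm has plenty of conversions and non-conversions. At the scale discussed above, running many overlapping experiments at once raises further issues that this sketch ignores.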

Now you might think we’re all set: companies have control over their product, they can make changes to their product and they can randomly give some users one version and other users another, at virtually no cost– just flip a bit in the code. And this is often the case. But what complicates things is when it’s not feasible, ethical or practical to decide whether someone receives a treatment or not. Outside of the realm of tech companies, the clearest example is cigarette smoking. If you want to isolate the impact of cigarette smoking on the outcome, lung cancer, you can’t conduct an experiment where you randomly assign some people and force them to smoke, and others not. In the context of online behavior, while it’s not dangerous to “make someone” navigate to a particular website, as it is to “make someone” smoke, the type of person who might navigate to the Banana Republic website, for example, might be a certain type of person, and so if you randomly forced some people to go there, and randomly forced other people not to go there, you wouldn’t really be accurately measuring the effects or analyzing the right population.
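A small simulation can show why simply comparing people who chose to visit a site with people who didn't is misleading. In a simulation, unlike in practice, we can also run the coin-flip version and compare. All numbers here are invented for illustration: the 30% share of "interested" users, the purchase probabilities, the visit probabilities, and the assumed true lift of 0.02.

```python
import random

TRUE_EFFECT = 0.02  # assumed lift in purchase probability caused by a visit

def purchase_rate_gap(n, assign_visit, rng):
    """Purchase rate of visitors minus purchase rate of non-visitors."""
    visits = {True: 0, False: 0}
    purchases = {True: 0, False: 0}
    for _ in range(n):
        interested = rng.random() < 0.3  # hidden trait of the user
        visited = assign_visit(interested, rng)
        p_buy = 0.05 + 0.20 * interested + TRUE_EFFECT * visited
        visits[visited] += 1
        purchases[visited] += rng.random() < p_buy
    return purchases[True] / visits[True] - purchases[False] / visits[False]

def self_selected(interested, rng):
    # Interested users are far more likely to navigate to the site on their own.
    return rng.random() < (0.8 if interested else 0.1)

def coin_flip(interested, rng):
    # Random assignment ignores the hidden trait entirely.
    return rng.random() < 0.5

rng = random.Random(1)
observational = purchase_rate_gap(200_000, self_selected, rng)
randomized = purchase_rate_gap(200_000, coin_flip, rng)
print(f"observational gap: {observational:.3f}")
print(f"randomized gap:    {randomized:.3f} (true effect {TRUE_EFFECT})")
```

Under these assumptions the self-selected comparison mixes the effect of the visit with the effect of being the kind of person who visits, so its gap comes out much larger than the lift built into the simulation.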

So this is where causal modeling and observational studies come in. Now of course, there are entire semesters of material on this, and we only spent two classes on it, but hopefully you now have the intuition, and some of the statistics behind it. For homework, you are asked to read Andrew Gelman’s paper, Experimental Reasoning in Social Science, linked to in his blog post.

Yours, Rachel



  1. You know what story I’d like to write, Rachel? Why are so many of the field’s brightest minds so bloody focused on getting us to click on ads for hotels?

    1. Go for it, Jed! Jeff Hammerbacher, formerly of the Facebook Data Science Team, now at Cloudera, expressed similar sentiments: ‘The best minds of my generation are thinking about how to make people click ads… That sucks.’

    2. Locke and Demosthenes, qua advocatii diaboli

      What other domain has the detailed data and the ability to continuously monitor effectiveness? For once, I’m not being abstruse, I really don’t know.

      1. Glad you asked. Constant monitoring, under the name “Statistical Quality Control” or “Statistical Process Control”, happened in manufacturing during World War II and at companies such as Western Electric. The father of statistical quality control is said to be Walter Shewhart. In this Wikipedia article, you can see mention of the use of experiments. Shewhart was at Western Electric, which eventually became Bell Labs.

        We keep coming back to Bell Labs. Or maybe I force us to! John Tukey, father of Exploratory Data Analysis, was there. Mark Hansen, our guest speaker, Director of the Brown Institute for Media Innovation and visualization expert extraordinaire, was a member of the technical staff there. My current manager at Google, Daryl Pregibon, was there. Andrew Gelman, professor in the Statistics department and my advisor, interned there one summer when he was a student. David Madigan, Chair of the Statistics Department and one of our guest speakers, was there. Bill Cleveland, who wrote the proposal for a new field called “Data Science” ten years ago, was there. We need to have this sense of and appreciation for history.

    3. James McNiece

      It could be worse. They could be working on nuclear weapons or securities of ever-increasing complexity like in decades past. The arms race in advertising technology seems rather benign in comparison.

    4. I think it comes back to something Claudia touched upon when she was going over her data scientist profile in class the other day. In advertising, there is less to lose if a model underperforms, doesn’t generalize well, or is completely wrong. Statistics is basically the art of calculating what and how much you can learn and how confident you can be about your findings.

      Generally when companies release new versions of software, the amount of testing/QA it goes through is normally determined by risk assessments. A/B Testing is successful and feasible at someplace like Google, but probably not in airplane software where it’s not unheard of for the software QA to involve more time/talent/money than the development of the software did.

      I would argue that this is the same. Advertising doesn’t normally result in lower sales. (And if it does, I would imagine it is probably an issue with the product or creative, not the execution.) It is basically a game where companies try to spend their money as efficiently as possible (highest ROI), and the ROI is not going to be negative. Most likely no one will die and the economy will not fall apart. As such, there’s an opportunity to put more focus on the sensitivity (noticing an effect when there is one) than the specificity (not noticing an effect when there is not one), which is not necessarily the case elsewhere.

      So even if the field’s brightest minds were evenly distributed across fields, they might have more freedom to innovate and do things that are noticed in a field where there is less risk, in the same way the flight simulator game on your phone can release more complicated updates far more quickly than the company that makes the software used to fly the plane you are playing the game on can.

    5. Fair enough. While I somewhat agree with you, Jed, let’s consider that some great discoveries were developed from seemingly unsubstantive applications. Regarding experiments, I just read a passage on how Fisher started thinking about experimental designs (the passage is in the book “The Lady Tasting Tea”). In the late 1920’s, a lady in a group of university dons insisted that the tea tasted different depending upon whether the tea was poured into the milk or vice versa. The passage’s author writes: “I can just hear some of my readers dismissing this effort […] These great minds should have been putting their immense brain power to something that would benefit mankind”. But fortunately, Fisher was in that group, and then devoted a chapter of his “The Design of Experiments” to discussing an experimental design to test the lady’s hypothesis.
      Maybe we should not take for granted that some applications are irrelevant for the advancement of science. (But again, something makes me agree with you).

  2. Locke and Demosthenes, qua advocatii diaboli

    A more pressing problem, as I see it, is the wilful ignorance of confounding variables, especially time. The Wired article points out that companies see an experiment-supported change in the sort of content that attracted visitors, and react without understanding the reason for the change. That’s fine if you have the resources to constantly run the tests, but what happens when the trend exhibits some seasonality every hour? It’s not trusting the machine, or ending up in a local maximum, that worries me first.

    Rather, it’s the fact that these experiments aren’t randomised: the sample is visitors in the past, and the population to which you’re generalising is visitors in the future. These people clearly vary, but how much?

    1. Excellent point!

      A while ago I was talking to a senior researcher and faculty member at Columbia who also works with big data, of the log-record type. He was saying that he thinks our statistical needs have changed recently, because unlike before, it often happens that what we have is not a “sample” of the data but its entirety, and that we need different ways to analyze the data than the traditional methods, which are all based on estimation and sampling.

      I thought about his comment for a while. I couldn’t agree more with the motivation behind it, but I didn’t find myself in agreement with the statement itself. It may be true that we have “all” the data, in the dimension of people accessible at a snapshot. But we are always sampling in time. And we generally assume that things remain steady over time, while in reality the only thing that doesn’t change is change itself!

      A very typical example of this is what Cathy talked very briefly about a while ago in class: stock market predictions! Imagine you can create a completely causal model using observational data which tells you the effect of various factors (from the outside world to the behavior of buyers and sellers) on stock prices. Awesome, right?! Nope! You CANNOT create a causally independent model like that EVER. Why? Because the moment you use this model, you are changing it! When you predict the price is going up, and you buy a lot of that stock, you are increasing demand, and hence raising the price yourself!

      That is the reality; no matter how much we sugarcoat it, observational data is NOT causal. No matter how many confounders we think of and adjust for, the reality is that there are more confounders out there that we don’t know about or cannot adjust for.

      Randomization can help there, because it likely distributes all the other factors equally in both groups; therefore you can assume the effect you see is causally related to the only factor that differs between the two arms, the exposure. However, as Rachel said above, randomization does not guarantee equal groups, and there are books and books about how to assess whether the groups are equal except for their controlled exposure of interest. Books which I’m not sure how many data scientists have read.

    2. Although this issue is often neglected in industry, I would like to think that this will get more attention as data science gains popularity. From my experience I’ve found that many people dislike dealing with confounds often because it brings up uncertainty. However, not looking at confounds increases the inaccuracy of our calculations and predictive models. It’s hard to determine the reason why this is so often neglected. This behavior could be a reaction and thereby causal (i.e. it could be that people choose not to look at it based on personal preference) or it might be that it doesn’t help to factor this in (i.e. the effect of including this into certain models could have such a small impact that it is considered unnecessary) or perhaps it’s as simple as people being unaware of causality and confounds. In any case, I would think that this should be a requirement (rather than an option) when designing any experiment.

  3. Though the bulk of the A/B testing article focuses on A/B testing and its usefulness from the companies’ side, I would like to comment on what it means to us as users. Human decision making is a complicated process, based on a lot of factors other than what is ultimately rational and optimal. Powerful as the lesson “don’t judge a book by its cover” is, it is easy, if not natural, for us to do exactly that, not because we are irrational but because we are human. This seemingly irrational human nature might get in the way and produce unintended effects, especially whenever persuasion is involved: dating, selling, a job interview, or clicking an ad. For example, users might simply be turned off by “the inferior site revamp” rather than the content. What A/B testing allows companies to do is to iteratively eliminate such potential adverse effects and create an environment conducive to users taking whatever actions the companies are trying to elicit. I think this data-driven approach is extremely effective and powerful, bringing the “best” version of products to the customer’s table. I do feel this technique might potentially be exploited, like an orator using fancy rhetoric without much substance, but that is another topic of discussion.

    1. From what I understand, a prominent former Googler was a principal driver of iterative design using data collected from A/B-type testing. Thinking about using statistics of user responses to various forms of a website got me wondering how much you can truly understand from A/B testing. What I mean by this is, perhaps you can say that the data tells you having the sidebar on the left side is far more appealing than the right side (unconfirmed example), but A/B testing cannot determine the mechanism behind this preference. I remember when Google Chat was first integrated as a part of the Gmail interface. The functionality was great; except, like many of my friends, I have an extensive collection of Labels, which pushed my Google Chat box outside of the generally viewable range. When the Gmail Labs function to move the Google Chat box to the right side of the page arrived, my friends and I instantly spent hours IMing each other on Gmail. So the intervention, allowing the Google Chat box to be moved to a different panel, was highly successful. I can positively say that the intervention was directly responsible for my increased usage, but can someone just looking at the numbers know why? I’m not convinced. There could be a multitude of reasons (“I just liked my chats to start on the right side”, “I have too much clutter on the left side of my Gmail interface”, “My screen is rather small, so having the chat bar on the right panel is better”, etc.), but A/B testing does not reveal this kind of information.

      This kind of information is what explains whether we can replicate behavior. In cognitive psychology, the emphasis is not only on whether an intervention causes behavioral change, but on whether we can explain the mechanism for the behavioral change. So a website can be well-designed, but we have no idea why. This is not to say that A/B testing is bad, but that A/B testing answers the question of what causes change, without answering the question of why there is a change. The latter question should be just as important as the former.

      1. Silis,
        Interesting about the Google sidebar. I also tried that, only to collapse my labels and move the chat bar back to the left side when it made the formatting on my emails all wonky. I think your response is actually a great example of irrationality when it comes to software testing and “what the consumer wants.” You said that your ability to move a chat bar to the right-hand side encouraged you to chat more, and that this would not necessarily be obvious in A/B testing. I would argue that the proper way to look at it from a development standpoint is that a broken UX made you unable to chat normally, and that Google Labs is a sort of self-selecting A/B testing that allows Google to track which features users fiddle with the most, which by an extension of logic are the features that are most likely to be broken. In fact, you and thousands of others like you using Labs to switch your chat to the right side probably encouraged Google to implement the current feature of chat that hides your labels unless you slide your mouse over them, giving full vision to chat.

  4. Chaoran Liu

    The rationale of the A/B test seems quite simple from an academic point of view. It treats real users as the test subjects and applies a treatment to them. Then we get the results and learn whether the new variant makes us better off. In politics, it looks like the A/B test has done a great job. I do think the A/B test has a bright future in marketing.

    As we know, conducting an A/B test is not like running a regression, where we only know that relations exist between the factors but do not know what has prompted the change. So before we start an A/B test, we have to identify which factors we want to test. (This relates to causal modeling and observational studies.) For example, Emploi, a new dress brand from NY, wants to increase its brand awareness and sales online. It has to first figure out where the consumers are coming from and what they expect from Emploi. Some customers are looking for sales, so we want to test whether our customers are discount-sensitive. (Some brands that have an image of quality and value may not see much of a boost in sales from discounts.) Then Emploi could run a test on the website with a homepage focused on sales. This is how I view causal modeling and observational studies together with A/B tests. I think this methodology is worth trying in my future career.

  5. In my opinion, one of the major flaws with standard experimental techniques within the social sciences is that they study people in extremely contrived situations, far removed from normal, everyday life. As a result, although the findings may not be biased due to differences between test groups, the testing environment itself may introduce biases.

    One notable example was Coca-Cola’s introduction of “New Coke,” a product which failed within three months of its introduction. According to Malcolm Gladwell, after seeing the rising success of Pepsi, they ran a series of blind taste tests to arrive at a new flavor which consumers would prefer to Pepsi. They failed to take two issues into account. First of all, their taste tests were really sip tests: subjects took small sips of two beverages and said which one they preferred. However, that is not how consumers would really experience the drinks. They would buy a bottle and drink a few cups at a time. New Coke was very sweet, and subjects consequently preferred it for a sip or two, but had they been asked to drink a few cups of it, they would likely have reversed their preference. Second of all, experimental design demands that the subject not be aware of which soda they are tasting. However, consumers are never forced into this situation: they have certain associations with the brand and the packaging of the beverage which impact their perceptions of the taste. Many other experimental designs force subjects into situations which do not accurately reflect how they would behave in real life.

    Of course, as discussed, observational data can be very flawed as well, particularly if one is looking to prove causality.

    This shows the power of A/B testing. It allows you to create two identical test groups and (theoretically) discover the impact of a single variable. However, these two groups respond to that variable in a natural setting – not in a laboratory somewhere but in the course of their regular web surfing. In this way, it eliminates the flaws of both observational data and normal experimentation.

  6. This ethical issue also exists in healthcare (and was discussed at length by Prof. David Madigan and in the week 10 blog post). Clinical trials need to compare two populations but it would be unethical to provide treatment to only one group. This is especially true where the disease or disorder in question is life-threatening. To circumvent this, the current standard of care can be used instead of a placebo, and the patients chosen randomly to minimize confounders (randomized clinical trial rather than observational studies). This also allows for the experimental drug/treatment to be directly compared to an already-existing drug. I suppose this is a form of A/B testing, and these side-by-side comparisons are also valid evidence that one drug is superior to the other, which matters tremendously to the FDA.

    1. Eurry Kim

      David Madigan’s lecture also left a mark on me. I think that the fact that randomized trials are considered the gold standard in experimental research shows us that social science has a long way to go. Are we forgetting that medical studies are also a branch of social science? Madigan’s impressive research showed us all that despite randomized trials, you still can’t take out the social messiness from medical studies. So, it’s really frightening to know that statistical analyses can give a researcher opposing health outcomes based just on the medical database (and some parameters too?)! Gelman mentioned that he thought it inappropriate to borrow from medical studies to explicate experiments in social studies. Honestly, I could not see what was wrong with it because aren’t they kind of the same thing? Experimental data in medical studies are still observational data, no?

      To tie medical studies to A/B testing, I echo jonathan’s sentiment that they’re similar. You’re trying to predict an outcome and tie it causally back to the treatment. But I am afraid of the implications of these types of experiments. In both cases, the users of a “tx” website and the patients of a “tx” medicine are self-selecting into the treatment group. The users go to the website because they were users before or something about it attracted them. And the article made a good point about the dangers of A/B testing in that it caters to its existing users. What about potential users? Future users? Rare users? They’re not going to use the website in the same way as other users. Depending on the time frame of the A/B test, enough observations may not be able to represent these users well if at all. Similarly, the patients of a particular medical regimen are self-selecting because they ALREADY have the disease. In both cases, we can’t observe the counterfactual. How can we truly attribute the “improvement” of the website or the patient’s health status without being able to see what would happen otherwise? As Madigan’s research pointed out, we are MISSING something.

      My overall view is a bit pollyanna and naive, but I hope that we can and want to shoulder the heavy need to UNDERSTAND why something works and something doesn’t… at least to try. Rather than studying the treatment variables, why not the pre-treatment variables? Ask “what causes the disease?” instead of “what cures the disease?” because we can’t do the latter very well anyway!

      1. I agree that David Madigan’s lecture was eye opening. My naive understanding of epidemiological research left me with the assumption that the professionals conducting drug studies were able to deal with the problems of causality through a strict use of experimental design. I knew that social science research must often use observational as opposed to experimental research, but I had thought that the hard sciences, where the stakes are much higher, had better solutions for determining causality.

        The most interesting finding from Madigan’s research is the common success of self-controlled case series (SCCS) methods in the databases that he evaluated. In the field of psychology the analogous experimental design is referred to as a repeated measures design. Instead of separating participants into treatment and control groups based on a stratified or random sample, all of the participants go through both a control and a treatment phase. Using this method, the difference between treatment and control is measured for the same individual.

        Based on Madigan’s work I would like to augment Gelman’s argument. Gelman is concerned by the simplicity of the statistical methods used by experimenters and the inability to assign causality in the sophisticated statistical work done using observational studies. He argues that randomized experiments are considered a gold standard because they can add value to models built on observational studies. I would argue that a repeated measures design should be added to the gold standard of experimental design. With a randomized sample, we implicitly argue that the randomness of our sampling method will account for differences between the character of the control and treatment groups. The randomness argument holds as long as the samples are large enough to account for intervening variables. When it is possible to use the repeated measures design, we can directly measure the difference produced by adding the treatment.

      2. I was left with similar impressions after David Madigan’s lecture. Thinking about randomized trials, social messiness and ethics in experimental research made me think also of the randomized control trials (RCTs) used for impact evaluation in international development. Esther Duflo and her colleagues at J-PAL (Jameel Poverty Action Lab) are pioneers of this method in the international development field. The aim is to use random assignment to allocate resources, run programs, or apply policies as part of the study design (in a sense formal field experiments described by Gelman), and use randomized evaluations to determine not only whether it has an impact but especially to quantify how large that impact is (a form of both ROI and ROC). It offers the allure to social scientists of finally overcoming the flaw of real-world research – the counterfactual – and establish causal effect. It enables questions of ‘what would have happened had there been no direct intervention?’ to be posed, and answered. There is much debate in the development field over the merits of monitoring and evaluation, but initiatives such as the Millennium Villages that do not use controls have been vehemently criticized.

        Problems arise however as the picture is not so clear; there may be spill-over effects and difficulty in divorcing the effect of treatments from the influence of other factors. Ethical issues prevail where one village gets help and another does not, even if it might benefit from the same interventions in the future. And some influences are hard to randomize, like geographical attributes and organizational culture.

        Arguably, this is a case in which ‘What cures the disease?’ (in this case poverty) does matter as much as (if not more than) ‘What caused the disease?’ and the use of randomized control trials to figure out what works and what doesn’t.

  7. After reading about A/B testing and, coincidentally, having just had a lecture in another class on theory and methods used in Human Computer Interaction, it seems to me that A/B testing is just an extreme version of the minimum viable product (MVP) method of product development/usability testing. This leads me to agree with the main argument in Gelman’s paper, that observational researchers and experimental researchers have much to learn from each other. In some instances, it is okay to understand that one thing is better than the other, without knowing why a person made that choice (e.g., when you are only concerned with clicks on a web ad). However, when you venture into more sensitive matters such as health (e.g., when you are using ads on a webpage to drive users to a health intervention), it becomes more important to understand the context and other variables that influence the behavior (including time).

    The combination of these articles has me thinking of possible applications of A/B testing in real world situations. Like HS mentioned earlier, there are many in the world of biomedical informatics and health care technology in general who are calling for new methodologies and approaches to analysis of big data in healthcare. I also agree with the motivation behind this comment, but I think that these “new” methods already exist and will have to be used in tandem with traditional methods.

  8. The use of A/B testing strikes me as one of many possible approaches in understanding user behavior on the internet. Its utility lies in near real time deployment on a massive scale at a relatively low cost, a convenient practice in the age of big data. It’s a beautiful illustration of the scientific method in practice: formulate a hypothesis, make a prediction, and refine or refute the hypothesis on the basis of observed outcomes.

    As far as data-driven decision making is concerned, the randomized experiment setup of the A/B test seems to me the best possible approach. There is no longer a need to rely on human judgement alone, and the HIPPO effect is certainly reduced. It is not ideal, however, as we’re exposing some users to a treatment that is probably not convenient for them. My main concern with A/B testing, as it’s currently used, is over-dependence. The UI and design communities are among the primary adopters of A/B testing, and use it widely to test new designs, layouts, user interfaces and other features on the fly. However, their widespread and wholesale use of the A/B test is something I perceive as a “hammer and nail – if all you have is a hammer, every problem looks like a nail” approach to problem solving. In this light, there is a need to understand its limitations, as Rachel articulated so well in her blog post.

    An A/B test is only as good as the A and B variants devised, and should be used as a technique for comparing two competing ideas. Moreover, it answers the question of what works rather than the more important question of why it works. In the field of software design, where A/B testing is so widely used, it needs to go hand in hand with product (re)design, which is a fundamentally human task. In other words, there is no replacement for a great product, and the human and scientific approaches need to inform one another. Finally, there is no replacement for soliciting user feedback on websites, as it answers the fundamental questions of why something is happening in the first place, and how it can be improved.
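
The "formulate a hypothesis, observe outcomes, refine or refute" loop described in this comment can be made concrete with a small calculation. The following is only a minimal sketch, assuming the outcome is a click/no-click count per variant and that a pooled two-proportion z-test is an acceptable way to compare them; the function name and the counts are hypothetical and are not taken from the Wired article or from Gelman's paper.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for H0: both variants share one click rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis of no difference
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant A is the current page, variant B the new design
z, p = two_proportion_z_test(clicks_a=200, views_a=10_000,
                             clicks_b=260, views_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here only says that B's click rate probably differs from A's. As the comment notes, it says nothing about why.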

  9. James McNiece

    I’d like to offer a few thoughts on A/B Testing based on my experience working for a tech company.

    First, we view A/B Testing as a method for optimizing site layouts, button colors, text of marketing emails, site performance and other similar characteristics. We do not consider it to be remotely useful in defining a “Minimum Viable Product”, or site which meets users’ needs most efficiently. For that, we rely on face-to-face interviews and usability tests with real users. Once we have built the capabilities our users require, then we employ A/B testing to optimize how those capabilities are presented to the user. So, to use a mathematics analogy, A/B Testing gets us to a local optimum based upon existing site capabilities, but not necessarily a global optimum. Then we talk to our users again to figure out if there are more features they need and the cycle repeats itself.

    Second, I think Eric hit the nail on the head in the last paragraph of his comment. The real value of A/B Testing is that it is the closest you can get to a truly random experiment in a business environment, since the experiments are being performed on users unbeknownst to them. This is extremely important, since people often make key product decisions based on their intuition about what works best, rather than hard evidence. It’s truly remarkable how a little hard evidence can end longstanding disagreements and improve the decision-making process.

  10. A/B tests give numbers and a clear, easy-to-see winner. They get execs and stakeholders excited, because such improvement is easy to spot. But my concern is: are we doing the right thing when we give so much attention to these tests of trivial design changes? For example, for online shopping platforms, click counts have lots of problems as a measure of success, because they focus purely on the pressing of the purchase button. They don’t measure whether the users are happy with that purchase or whether they are delighted with the product they finally received and the way they received it. It’s easy to over-optimize click counts while sacrificing a great experience.

  11. Sometimes researchers only focus on the data they have, ignoring what is behind the data, or what is not in the data. There are many such cases in research. For example, we intuitively think that patients with serious diseases should die faster than healthier patients. But, in the medical records data we have, it is actually the “healthiest” patients who died fastest. What’s the problem? Those patients died before anyone entered their diseases into the system, and after their death it did not seem necessary to do so either. Here is the problem: the medical records only show the information recorded for admitted patients at the time they are admitted. Are they healthy when they are absent from the records? Or have they gone to some other place? In either situation, it is risky to draw any conclusion from the data on hand.

    Also, for observational data, there are many practical issues such as non-response, unavailability, data collection, treatment interaction, etc. In the example section, Gelman showed a case in which giving a gift lowered the response rate. I recall a similar marketing case. A restaurant wanted to promote its new soup, so the usherette gave away a pair of lovely socks to each table as a gift for purchasing the soup. In the end, however, the promotion turned out to be a failure. A follow-up survey analysis showed that the gift created a bad image in guests’ minds. I think this is a good case in which the data is quite contrary to our intuition. Therefore, we should be careful with practical data when doing inference.

  12. An attempt to relate A/B testing to Andrew Gelman’s article.

    A/B testing allows us to tap into the power of experimentation, serving as an example of an almost perfect experimental design; however, it’s not realistic so far to extend this approach beyond the online environment.

    The reality is that a large portion of the data available is observational in nature. This complicates the analysis, increases the probability of drawing wrong conclusions from the data, etc. As a result, experimental design is sometimes viewed as completely inapplicable to reality.

    For data scientists, however, the value of knowing and understanding experimental design lies in being forewarned and forearmed about the possible caveats of data analysis outside of safe experimental environment. Just keeping in mind that the results drawn from observational data are not completely clean and robust is sometimes the best thing data scientists can do, given the limitations of the data we have to work with.

    1. I think this post highlights both the good and the bad of experimentation. You can go completely wrong with even seemingly randomized experiments, even if you’ve taken the time to design them specifically for your needs. Andrew Gelman’s article highlighted this in his example of the survey experiment. After the fact, he used that data as observational, because even in the experimental design setting, that is what it really was. Even as the “gold standard”, a randomized experiment might not really fit the situation, or it might not be what you think it is at all. But even when they go wrong, they can be more telling than you hoped! The Google example in the A/B article demonstrates this; in an attempt to get a randomized experiment that went wrong, the observational data they collected told them something else equally, or even more important.
      James pointed out a situation where it is perfectly applicable and helpful, though. As in the article, using A/B testing on website features is a great way to possibly gain an advantage in whatever your goal is. It is not the be-all and end-all, and the caution is that it shouldn’t be. The article rightfully pointed out that, inevitably, overhaul designs are sometimes necessary, and unlike the minor tweaks A/B testing can decide and design, we still need some sort of intuition to choose them. There’s nothing wrong with doing both, though, and if you didn’t, you wouldn’t be using all the resources you had. Like we’ve talked about before, though: it’s okay to use the data to find new trends in marketing or in genetic components of disease, but unlike marketing, when we’re talking about health, you had better feel morally compelled to look beyond the data and figure out what is actually going on.

    2. I totally agree with you Yegor. In my professional experience, it is almost never feasible to conduct an experiment. As an observational data consumer, probably the best reason to learn about experiments is what you pointed out: “For data scientists, however, the value of knowing and understanding experimental design lies in being forewarned and forearmed about the possible caveats of data analysis outside of safe experimental environment”.
      From my point of view, the progress of many quasi experimental methods follows this logic.

  13. I would like to initiate a debate on a case that is, to the degree that it is politically charged, diametrically opposed to the optimization of online-ad clicks: the search-and-frisk policy in New York City. How would we measure its impact on criminality? On the one hand, we could design an observational study, trying to discern whether the introduction of search and frisk, controlling for other independent variables, had a significant impact on the crime rate. On the other hand, we could carry out an experiment: implement search and frisk in only half of the NYPD precincts. Such an experiment, however, would be problematic since criminals might adjust their locations accordingly. So, in this case, is an experiment only the silver standard? And, more fundamentally, is it ethically defensible for a data scientist/statistician to analyze the impact of search-and-frisk on the crime rate without talking about the broader consequences of the policy, such as the intrusion of the police into the private sphere of citizens and, most importantly, racial and ethnic profiling?

  14. The fact that simple changes, such as a different color scheme or text placement on a website, can produce drastic differences in how users react demonstrates the pure usefulness of A/B testing. Although it may seem devious in the sense that web designers are intentionally trying to manipulate the user into clicking or even hovering over interesting ads, it is also interesting that these same programmers are able to study user behavior and learn how people react to different visuals. As a campaign manager, these things are not new to me, nor have I ever really bothered to think about the reality that I am imposing my own ideals and ambitions onto other users by innately forcing them to click on my ads. Now that I really think about it, is it really the same as not telling someone to smoke but hinting that smoking may lead to an improvement in, let’s say, social life?
    Judging from just the data, there is never enough information to fully answer the question of what we should do to be able to fully satisfy every user. Simply because everyone is so different, there is no standard way to treat them all. In a sense, is there really a golden standard in which we can treat everyone the same way and receive a standardized response?

    1. Urban planners and architects have ways to gather feedback with tools like charrettes, and I believe that an A/B testing model could help the field a lot. Rob Lane from Regional Planning Association came to my studio to talk to us about a scenario approach to help audiences make more informed choices and exhibit preferences on development patterns in suburban Massachusetts. Part of the problem with charrettes is that you cannot make full-scale models because of the exorbitant cost and impracticality of deploying life-sized developments.
      I think another limitation that planners face is the same as the medical profession. Planners cannot coerce residents to move to a specific development or force people to take on certain jobs. The golden standard seems still quite far away and some fields seem relegated to certain research designs for now.

  15. Chaoran referred to the future contributions of A/B testing in marketing. In line with that, I would mention that there are already dozens of experiments in that area. As far as I can remember, there are many of them in digital marketing, whether for human factors design implications, product development, or even developing ad campaigns. Maybe a better way of putting it is that A/B testing is a way to measure the distance from intuitions to insights. That being said, it seems that this area is safe for split testing, especially in HCI studies or user experience design.

    Apart from this discussion, when I was reading the articles I was thinking that the issues with observational studies, and our desire to use a model to design experiments, do not arise only in empirical social science or because of ethical concerns. In the literature of other behavioral sciences, such as cognitive psychology, this has also been a problem. There, we try to design experiments, yet we are unable to determine a handful of ‘metacognitive’ activations in our brain. This seemed interesting to me since I have recently been reading about such processes. Although I doubted how closely it relates to our discussion of the Wired article and the paper by Prof. Gelman, since those are mostly about data-driven analyses, I thought it might be worth mentioning the implications in other sciences as well. In these cases, the main issue is that when an affect or behavior is attributed to sources, even with different types of experimental designs and A/B tests under no limitations (e.g., ethical ones), we may not be able to correctly attribute causes to what should be their cognitive sources.

  16. Yige Wang (yw2511)

    I totally agree with Rachel’s comment on the barriers and difficulties of conducting A/B testing in reality. Besides ethical issues, I can come up with two more possible difficulties that challenge the feasibility of conducting an A/B test.

    First, multicultural factors should be taken into consideration. Especially in a country such as the US, where diverse cultures mingle with one another, people with different cultural backgrounds have different ways of perceiving the world. Moral standards, lifestyles, preferences and habits develop from and are largely influenced by culture. As a result, if there is a huge cultural gap between two groups of people, then the comparison and conclusion derived from A/B testing are biased, since people are basing their judgment on different grounds. Therefore, it is crucial to nail down clear target groups which, to the largest extent, share the same cultural values. For instance, when evaluating the effectiveness of a new advertising campaign, we should target a particular audience group (in the case of the US, either White, African American, Hispanic or Asian) and think about which virtues people within the target group value the most and what is (in)appropriate to include in the ads.

    Secondly, let’s take a step back and say that we have two perfectly randomized groups that are close to equal. The next challenge is to determine how to actually conduct the A/B test. In other words, which factors and elements should we change in order to test the difference in effectiveness? Rachel’s article reminds me of another class I am taking, “Digital Marketing”, a field in which A/B testing is widely used to evaluate the effectiveness of a landing page. However, there are a million things that you can do to change your landing page: color, font size, titles, images, etc. What makes it even more complicated are the more subtle things. For instance, should I use “Order Now” or “Order Today” as the action word on my landing page, and where should I place it? The good news is that there are many examples showing that all these small changes eventually led to a much higher CTR and revenue, but the bad news is that these changes are usually too subtle for us to find the correct pattern. And that’s why we need data scientists to resolve these problems.

  17. Instead of medical and political applications, I am more interested in the financial application. The topic reminds me of a very interesting book by Nassim Nicholas Taleb. In his book Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets, he depicted an experiment that uses both the A/B testing (split testing) method and the concept of random sampling to make people believe in his ability to predict stock market trends. He designed the experiment by randomly splitting people into two groups and sending an email to one group saying the stock price would increase the next day and to the other saying it would decrease. The next day, he picked the group to which he had sent the correct prediction and did the same splitting again. After a few rounds, there must be some people who always got the correct prediction and believe in his expertise. However, that group is a small minority. Do we care what fraction of the entire sample reaches the result we want?

    Andrew thinks social scientists who use medical analogies to explain causal inference are borrowing the scientific and cultural authority of that field for their own purposes. I am trying to think about whether the argument applies to financial “scientists” as well. I think the answer is yes. To quote Nassim Nicholas Taleb: “Bullish or bearish are terms used by people who do not engage in practicing uncertainty, like the television commentators, or those who have no experience in handling risk. Alas, investors and businesses are not paid in probabilities; they are paid in dollars. Accordingly, it is not how likely an event is to happen that matters, it is how much is made when it happens that should be the consideration.” Unlike social scientists or politicians, finance guys have little interest in the probability or percentage of good outcomes. To them, even a single person can contribute.
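
The arithmetic behind the email scheme described at the start of this comment is easy to check. Here is a minimal sketch, assuming each round splits the remaining recipients exactly in half and the market moves up or down like a fair coin flip; the function name, the seed and the starting list size are illustrative assumptions, not details taken from Taleb's book.

```python
import random

def run_newsletter_scheme(n_recipients, n_rounds, seed=0):
    """Return the number of recipients still seeing only correct calls after each round."""
    rng = random.Random(seed)
    recipients = list(range(n_recipients))
    history = [len(recipients)]
    for _ in range(n_rounds):
        # Randomly split the remaining recipients into two halves
        rng.shuffle(recipients)
        half = len(recipients) // 2
        told_up, told_down = recipients[:half], recipients[half:]
        # Assumption: the market direction is a fair coin flip
        market_went_up = rng.random() < 0.5
        # Keep only the half that happened to receive the correct prediction
        recipients = told_up if market_went_up else told_down
        history.append(len(recipients))
    return history

print(run_newsletter_scheme(10_240, 10))
# → [10240, 5120, 2560, 1280, 640, 320, 160, 80, 40, 20, 10]
```

Whichever way the market moves, half of the list survives each round, so ten people out of 10,240 see ten correct calls in a row without any predictive skill being involved.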

  18. Bianca RM

    The following sentence in Andrew Gelman’s article, “In particular, I am concerned that ‘experiment’ is taken to be synonymous with ‘randomized experiment’”, raised a particularly interesting question for me to consider. I was really impressed and somewhat shocked when David Madigan presented his analysis of the failure (or at least great lack of reliability) of significant bodies of medical research. He then presented the example of a ‘randomized experiment’. Patients coming into a hospital with heart attacks are given one of two treatments (both of which were ethically vetted as being ‘equally’ viable) and the resulting outcomes are analyzed. While this makes sense to me, it raised some interesting questions – I’ll focus on the ethical one first. This experiment was designed to determine which of the two treatments has higher success, and yet it must ethically begin with the assumption that both treatments are equally effective (to the extent of the medical community’s knowledge). I would be curious to know what that assumption is based on. Firstly, I assume the treatments are likely to have been approved and proved ‘safe’ to some extent, so some experimentation has likely already been done; or perhaps this is not the case and the randomized experiment is part of the approval process for both procedures, though I think this is unlikely. So – if some data exists – this seems like an ideal situation where causal modeling and experimentation could work more closely together. On the other hand, causal modeling might reveal possible differences between the treatments that could pose an ethical barrier to the randomized experiment taking place, even if the causal trends may be driven by confounding factors. I think the big ethical question to ponder here is: if observational data exists, is it ethically okay to pursue a randomized experiment without first considering causal models from the observational data?
    The answer probably depends on the implications of the experiment (i.e. heart attacks vs. ad clicks), but it also offers an opportunity for closer work between field statisticians and more empirical statisticians, as Gelman encourages. It also brings to light the potentially ‘unrandom’ aspects of this experiment. To some extent, the efficacy of these treatments for heart attacks is known, or has been considered. The mostly random analysis is that which compares one treatment to the other. Anything that considers the efficacy of the individual treatment is not random, and this is a distinction that Gelman rightly makes. Randomized experiments can only consider very small problems at a time; observational and causal models are helpful in making broader statements that are easier to translate into policy or other action. Seeking to combine them then provides a good opportunity to increase the reliability of the answers to the really important questions (which treatment saves lives more often?) and connect them to the broader questions that people will inherently ask in seeking to justify the answer to the first question (why? For what types of patients? Etc…).

  19. As someone who spent a great deal of my pre-Columbia studies focused on experimental cognitive psychology research, I found that Gelman’s article served as a return to a frequently encountered debate in social science research. Political science and much of the social sciences rely heavily on observational research, tending toward inference and after-the-fact statistical methods that control for differences. Cognitive psychology operates on a more acute level, looking to examine and understand the psychological sub-processes that aggregate into thought and behavior, almost always relying on strict experimental control. Given my cognitive background, I have always found myself favoring a reductionist approach, one in which it is necessary to understand the smallest factors at a very precise level before extrapolating to larger patterns. I appreciate the usefulness of observational research, but I’ve had a hard time reconciling myself to the certainty with which observational research is often expressed. Ori’s presentation to the class placed a large emphasis on establishing causality: considering all sources of differing effects and inserting accurate measurement alongside random sampling. I felt a definite affinity for Ori’s perspective and was very interested when I heard of his efforts toward inserting experimental design into the collection and interpretation of online advertising data. I was very happy to see further proof that this type of debate is occurring within data science and industry.

    Returning to challenges in methodology and observational research, it seems clear that separate domains of research will encounter their own distinctive sets of shortcomings, often through imprecision or a diminished range of applicability. Gelman makes a great point that complex, well-specified statistical methods should be used in social science to better account for irregularities in observational data. He states that it doesn’t make sense to collect expensive data and then simply apply basic regression models. Gelman’s work applying hierarchical statistical models to better define distributions and variability is a great example of putting this sentiment into practice. I thought another strong point was that domains which have traditionally sustained separate methodologies could do well to adopt some of the strengths of other disciplines. This has been done, and continues to be done, through incorporating new methods like natural experiments in political science research or increasingly advanced modeling of traditional experimental research data. Given these arguments, there seems to be even more reason to support the integration of cross-disciplinary perspectives and methods. Data science seems to be accomplishing this at an unprecedented level. Perhaps the newness of data science methods has been partially responsible: there has been less opportunity for the disciplines to separate and become entrenched. Data scientists need to respond to progress as it is occurring, so everyone is somewhat united in their search for rapid progress.

  20. Both Ori Stitelman, in lecture the week before last, and Andrew Gelman, in the paper, pointed out that we do randomized experiments because we want to have the same kind of people in both our treatment group and our control group. In reality, as mentioned by Rachel, people are not exactly the same and the experimental environment is never ideal. My point here is that even if this ideal world existed, in which we have exactly the same people in both the treatment and control groups, the result might still be biased because of the experimental design in many cases. A paper written by Feldman and Lynch indicates how easily results are biased by the poor design of surveys or experiments in which participants are led to think in ways that they don’t usually think.

    The “A/B testing” discussed in the article is a good way to measure behavior online, but it is hard to apply offline. As Rachel pointed out, it would violate ethical standards or may just be impractical in some situations, like examining the effect of cigarette smoking or the side effects of things like medical products. Observational studies have to be used on these occasions, even though one might say that the population in the study is not randomized. As I mentioned above, even in randomized experiments the populations are not totally identical; we have to be realistic that the populations are never going to be identical.

  21. My first reaction upon reading Gelman’s piece is that, at least at the undergraduate level in this particular university’s Sociology Department, there is hardly any formal discussion of study design. Maybe I have spoken too soon, because I have yet to take my required methods class, which I will be taking next semester, but I doubt that the two weeks devoted to quantitative studies in the syllabus will be sufficient to fully explore the important issues of study and experimental design. Part of this lack of focus on experimental reasoning might be due to the nature of Columbia College and its emphasis on liberal arts education over teaching students the craft of social science research at the undergraduate level.
    Another interesting point that Gelman brings up is the “natural experiment.” The vast majority of natural experiments that I see discussed in sociology literature (I can’t speak for the other social sciences) are generally more qualitatively focused. Natural experiments in sociology, though potentially the basis for quantitative studies, seem to be generally used to ground qualitative and theoretical findings rather than claims supported by statistical tests.
    Finally, a last point that I’d like to discuss is the rising usefulness of A/B testing in policy situations, which seems to be an avenue toward fulfilling Gelman’s suggestion that “we should be doing more field experiments.” I can see an increase in situations in which policy interventions can be tested using A/B testing to somewhat depoliticize things like the disbursement patterns of welfare benefits or government-run savings programs. Instead of using A/B tests only to measure marketing interventions in web environments, we can bring this methodology to the real world. With more bureaucratic processes and transactions being automated, there are lower barriers to testing out new policies and strategies. To find where the possibility for setting up A/B tests exists, we should look for processes that are run or supported by rules-based engines and whose outcomes are recorded. I see opportunities in the educational space, social services, police staffing, and government services. If this is a strategy that takes off, though, it will become increasingly important that social science and policy researchers have both the programming and technical ability to understand how they might set up these A/B tests and the knowledge that this kind of testing is even a possibility.

  22. I truly enjoyed reading Andrew Gelman’s article. Luckily, I have had the opportunity to contemplate the use of formal experiments vs. observational studies because we discussed a similar topic in another class, Research Method at the Business School. I truly believe that the ideal for researchers is to take advantage of both techniques in a complementary manner. Formal experiments teach us things that we could never observe from passive observation or informal experimentation, and observational studies offer us broad ways of grasping aspects that researchers in formal experiments usually restrict. If researchers choose only one or the other, they are forgoing entire aspects of their topic.

    Personally, I am more familiar with formal experiments, which have questions whose answers we seek, than with observational studies, which gather discrete bits of understanding about pieces of the puzzle. This article and the earlier lectures on observational studies, confounders, epidemiology and estimating causal effects will impact how I design my future research. For example, I am planning to conduct new research about social networks for my master’s thesis starting in December. When designing my research, I will place observational studies first, in order to grasp broad perspectives of my subjects and come up with unanticipated insights to develop more focused formal experiments. I believe that the more focused and informed my formal experiments become, the better I will be at proving my theories or hypotheses, or addressing my topic.

  23. zaiming yao

    A/B testing does give us a good scientific framework for measuring the difference between products, rather than relying purely on intuition. However, I’m rather concerned about whether it is really worth it to test every trivial change. A very simple example: what if I changed the font of the byline of an article? Who would behave differently? Maybe no one, because it’s too trivial to be noticed. Maybe it’s time to come up with different metrics that measure how content the user feels, rather than the cold number of the CTR?

  24. When I first learned about the differences between randomized experiments and observational studies, Professor Andrew Conway, a Senior Lecturer in the Department of Psychology at Princeton University, said that one of the most important differences is that in randomized experiments you assign treatments at random, as the name suggests, which means that you can manipulate your independent variables. In observational studies, however, you can only analyze the relationships among the variables. You cannot manipulate your predictors, and it’s usually impossible to run the inverse experiment. Therefore he would prefer to call a “causal relationship” found in observational research a “correlational relationship”. For example, if you find that most patients who broke their bones tend to lack calcium, you can conclude that there is a high correlation between lack of calcium and fractures. However, since you cannot run an experiment that asks people to lower the calcium in their bodies to see whether they break their bones more easily, you cannot conclude that a lack of calcium causes bones to fracture easily. So it’s safer to regard this relationship as “correlation” instead of exact “causality”.

    But as the real world shows, and as Professor Andrew Gelman mentioned, even though he agrees that randomized experimentation is the gold standard, almost all of his research uses observational data, because it’s hard to keep things EXACTLY the same. So it’s quite reasonable to do observational research instead of spending a great deal of money and time on a randomized experiment. But when I saw the A/B test, I thought it was really interesting. It reminds me of Darwin’s theory of evolution, in which whatever fits best survives the most. As a company, you can really manipulate your product, which amounts to manipulating your predictors. And as long as you have enough data, you can ignore the outliers whose reactions may be due to personal reasons.

    The other thing I want to mention concerns Jed’s question: “Why is our whole semester focused on getting customers to click ads?” I totally agree with the sentiment. As a customer, it’s really annoying to be steered into clicking ads. But that reminds me of a story. A man went into a retail store to buy some small item. The salesman talked with him and learned that he was buying it for his wife. Then the salesman suggested that the man buy some other things to take better care of his wife. The man felt the salesman had simply reminded him to buy something really important. So he bought the additional items happily, and he really thought the salesman had been a great help. In that story (sorry, I read it so long ago that I forgot the details), the man is happy to buy the extra things. So that’s what every company pursues, rather than just passively waiting for customers.

  25. A/B testing, or split testing, compares the effectiveness of two versions of a web page, marketing email, or the like, in order to discover which has the better response rate or sales conversion rate. A classic direct mail tactic, this method has recently been adopted online to test things such as banner ads, emails, landing pages or even entire websites. This article reminds me of an essay I read before. Jason Cohen, in his post titled Out of the Cesspool and Into the Sewer: A/B Testing Trap, argues that A/B testing produces the local minimum, while the goal should be to get to the global minimum. We can understand the difference between local and global minima (or maxima) by thinking of the conversion rate as a function of the different elements on a page. It’s like a region in space where every point represents a variation of your page; the lower a point is in space, the better it is.
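
    To make the local-versus-global picture concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the one-dimensional `cost` function, the step size, and the helper `greedy_ab_descent` are hypothetical and do not come from Cohen’s post. The idea is only that repeatedly testing the current page against its nearest variations and keeping the winner can settle into whichever dip is closest to where you started.

```python
def cost(x: float) -> float:
    """Hypothetical 'badness' of page variation x (lower is better).

    This made-up curve has a shallow dip near x = 1.1 and a deeper one
    near x = -1.3, standing in for a local and a global minimum.
    """
    return x**4 - 3 * x**2 + x


def greedy_ab_descent(x: float, step: float = 0.1, max_rounds: int = 1000) -> float:
    """Compare the current variation with its two neighbours and keep the best.

    Each round mimics one small A/B test: only nearby variations are tried,
    so the search stops as soon as neither neighbour is an improvement.
    """
    for _ in range(max_rounds):
        best = min((x - step, x, x + step), key=cost)
        if best == x:
            break  # no neighbouring variation is better
        x = best
    return x


# Two hypothetical starting designs end up in different dips.
from_right = greedy_ab_descent(2.0)
from_left = greedy_ab_descent(-2.0)
print(f"start at  2.0 -> x = {from_right:.1f}, cost = {cost(from_right):.2f}")
print(f"start at -2.0 -> x = {from_left:.1f}, cost = {cost(from_left):.2f}")
```

    In this toy setup the search that starts on the right stops in the shallower dip even though a better design exists on the left, which is one way to read the argument that incremental testing alone may not reach the global minimum.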

  26. Luyao Zhao

    I read somewhere else that one example of A/B testing in use is what the Obama team did for their website during the campaign. There was a page about free Obama T-shirts. If someone donated a specific amount of money on the website, he/she would get a free T-shirt. Obviously, if the picture of the T-shirt was attractive, they could get more donations. Therefore, the team worked hard on the design of the picture, using A/B tests. They had 4 kinds of pictures for the treatment group, used Google Website Optimizer (or some other similar tool) to analyze the behavior of the users, and then selected the best one. (See graphs below)
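
    As a rough illustration of how “selecting the best one” among several pictures might be checked, here is a small Python sketch using only the standard library. The visitor and donation counts are invented for the example and are not the campaign’s data, the variant names are hypothetical, and the two-proportion z-test is just one common choice; it is not a description of what Google Website Optimizer does internally.

```python
import math

# Hypothetical counts: variant name -> (visitors, donations). NOT real campaign data.
variants = {
    "picture_A": (10_000, 800),
    "picture_B": (10_000, 850),
    "picture_C": (10_000, 920),
    "picture_D": (10_000, 790),
}


def conversion_rate(name: str) -> float:
    visitors, donations = variants[name]
    return donations / visitors


def two_proportion_z_test(n_a: int, conv_a: int, n_b: int, conv_b: int):
    """Return (z, two-sided p-value) for H0: both variants convert equally."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


best = max(variants, key=conversion_rate)
print(f"highest observed conversion rate: {best} ({conversion_rate(best):.1%})")

# Compare the apparent winner with every other picture.
for name in variants:
    if name == best:
        continue
    z, p = two_proportion_z_test(*variants[name], *variants[best])
    print(f"{best} vs {name}: z = {z:.2f}, p = {p:.4f}")
```

    With several pictures compared at once, some differences will look significant by chance, so in practice one would usually tighten the threshold (for example with a Bonferroni correction) before declaring a winner.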

  27. The article on A/B testing was fascinating, but it showed that the gap between real-world practice and academic intuition (as outlined in Prof. Gelman’s paper) has not been successfully bridged. The responsibility that falls on a social scientist to produce valid and clear results requires an extensive amount of planning and checking of assumptions and methods. A/B testing, on the other hand, asks for a streamlined, crowd-sourced answer to smaller challenges. I found it encouraging that Gelman cautiously welcomed the practice of blending methodology in design, but caveated it with some basic ground rules (minimum characteristics for a formal experiment).

    A/B testing seems aimed at resolving issues of less gravity than a social scientist would care about, but it does provide a source of data that can be used to understand user behavior. As the article indicated, most engineers/companies do not care why a pattern of like/dislike is happening; it’s more important to adapt. However, this brings me back to our original thought experiment revolving around the end of the Scientific Method.

    By not investing in understanding the subject, are we truly “learning” anything of significance? Once again, I see the two worlds operating in close parallel as opposed to merging. Running an A/B test will help decide smaller performance issues, but fine-tuning a product that has an end-point or finite life-span does not prepare a company for big-picture shifts in the market. Website optimization has been served well by A/B testing, but the practice keeps its power only within the narrow scope of the online world.

    As has been said many times here and throughout the semester, the implications of a causal relationship need to be stated as such and explored beyond the direction of the data. Where the engineers maintain a narrow scope, the data scientists involved in such practice need to keep an eye on the world/industry they are impacting.

  28. Andrew Gelman’s post is interesting as it reflects broader tendencies to shroud unreliable information in the cloak of science and data. This is often seen in political opinion writing. Numbers are often used to make a case based on their presence in the argument alone rather than what conclusions should actually be drawn from them. I thought of this recently when reading this post at the Atlantic (linking to a New Yorker article): http://www.theatlantic.com/politics/archive/2012/11/the-obama-mandate-in-context/265643/

    Getting back to the specifics: the recent guest lectures on observational studies did discuss the limitations of such studies (i.e. the inability to draw causal inferences). The challenge is that these often come as qualifiers after, I suspect, the message has already been interpreted by the person hearing it or reading it. Statisticians, scientists, engineers and similar professionals might be attuned to listening for such nuance, but this likely does not extend to the general population. I suppose I might worry that this has the dual impact of those presenting observational studies neglecting to fully explain their limitations and those receiving such presentations not receiving the message as nuanced.

  29. It is interesting to note Prof. Gelman’s proclamation of experimental data as the gold standard of knowledge in the social sciences. However, it is worth debating whether experimental data are really devoid of biases, especially given that, as Prof. Gelman notes, field experiments are expensive and have small sample sizes. For an experimental study, there is more than money at stake, by which I mean human capital. Under these circumstances, I wonder how immune academia, and Bayesian statisticians in particular, are from introducing inadvertent biases into their sample space.

    I am not sure how it has worked for the social sciences, but in finance, which in theory believes in the randomness of experimental data as the best source of information, things have gone awry. Signals were processed where none existed, and experiments were fashioned to model observational data, which is an irony in itself. In this respect, I believe something like Fisher’s theory of the design of experiments could go a long way toward defining standards for experiments. It is for this reason that I have a very positive outlook on the methodology of A/B testing. In my opinion, the evolution of this kind of experiment on the internet offers a high degree of randomness and little bias, as there is no direct interaction involved.

  30. I just saw this article, might be interesting.
    Obama Campaign – 240 a/b tests, 49% increase in donation conversion rate

  31. There is another article about A/B testing and the “Obama Startups” in the Economist that I just saw on HackerNews:

