This is a guest post from our course’s teaching assistant, Ben Reddy. Ben is a second-year PhD student in the statistics department. He’s had the chance to view the class from a unique angle, so I wanted us to get his perspective. Also, thank you, Ben, for your support this semester!
We’re nearing the end of the semester, so I thought it a good time to reflect on some things I’ve learned and thought about while TAing the class. The students have come a long way over the course of the semester. Some of them couldn’t load a dataset into R (ah, the memories), and now they’re training classification models on text features for the class Kaggle competition. Hopefully they feel good about the progress they’ve made and the things they’ve learned. Now that they’ve taken a (very) big step towards becoming data scientists, some perspective is in order.
Here’s the short version:
- Data science is rooted in computer science and statistics.
- Coding is a prerequisite for getting started as a data scientist.
- Doing statistics is not the same as understanding statistics.
- Data scientists work on important problems.
- Data scientists should make understanding statistics a high priority.
- Statistics as a field has a reputation in the data science community (and elsewhere) as outdated and stodgy. It is neither.
- Statisticians should do a better job of communicating with the non-statistics world.
- Statistics departments need to update their curriculum and pedagogy.
Here’s the long version:
Foundational fields of data science
One of the central themes we’ve revisited throughout the class is the question, “What makes a good data scientist?” On the first day, we all took an inventory of our skills, and each guest speaker has opened their talk with their own inventory. The skills of a data scientist were broken down into machine learning, visualization, stats, math, CS, domain expertise, and communication. Whether those are all- or over-encompassing is the subject of another discussion, but there seems to be general agreement that these skills are a good starting point, at the very least.
A related, but different, question is, “What disciplines contributed to making data science what it is today?” A decent starting point is this. Reading through, it becomes clear pretty quickly that data science has two parents in traditional academia: statistics and computer science. This shouldn’t surprise anybody, and surely I’m not the first person to make such a claim.
Recognizing one’s intellectual parents is well and good, but it doesn’t mean anything if a data scientist doesn’t have a good grounding in CS and stats. There seems to be pretty widespread recognition that data scientists need to be able to write code, quite possibly for the practical reason that if you can’t at least use R or NumPy or something similar, you’re not going to be able to work with real data. Even getting real-world data into a form where you can analyze it often requires some combination of command-lining, sed/grep/awk, and Python/Perl/whatever-you-like. And let’s not get started on the truly massive data sets, which require another set of skills (SQL/Pig, MapReduce, etc., etc.) that takes us into the magical realm of data engineers. For someone just getting started in data science, it’s pretty obvious where the learning priority is. You need to code to work with data, and you need to work with data to get experience, which is a lot of the reason there has been such an emphasis on coding in this class. The intention was to get the students on their feet as quickly as possible so they could learn through experience — there’s no point in taking a class on data science if you can’t actually practice data science.
Barriers and bumps (and Doers and Understanders)
Hopefully we all agree that inability to code is a practical barrier to entry (though as Ian Wong from Square noted, coding and coding well are very different things). Poor old statistics, on the other hand, doesn’t have a similar barrier. It’s more like a speed bump, if that. Load your data, press a few keys, and voila, you’re doing statistics. But this should be the starting point. Anyone who can type t.test(data) into R (not to mention lm(), knn(), gbm(), etc.) can “do” statistics, even if they misuse those methods in ways that William Sealy Gosset wouldn’t approve of on his booziest days at the Guinness brewery. It’s an entirely different thing to really understand what’s happening when you type those commands, and, more importantly, to use that knowledge to drive each stage of your analysis.1 I can’t stress enough how important this is. The biggest thing I would change about this class would be to place more of an emphasis on understanding the statistics involved in our models. But we were time-constrained, Rachel (and Jared and I) wanted to get the students operating at a basic level quickly, and as we all found out, there is A LOT of material. Plus, it’s the course’s maiden voyage. Lessons for next time.
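To make the speed-bump point concrete, here’s a minimal sketch of the one-liner in Python (scipy’s ttest_ind standing in for R’s t.test), on data I’ve simulated purely for illustration:

```python
# "Doing" a two-sample t-test: one keystroke away, assumptions unchecked.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=3.0, size=50)  # deliberately unequal variances

naive = stats.ttest_ind(a, b)                   # default pools the variances
welch = stats.ttest_ind(a, b, equal_var=False)  # Welch's unequal-variance test

# An Understander knows the default assumes equal variances, which is the
# wrong model for data like these, and that the two p-values can differ.
print(naive.pvalue, welch.pvalue)
```

Both calls run without complaint; only one of them matches the data-generating process, and nothing in the output warns you which.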
This is not to say that every data scientist needs to be an expert in statistics, or that you can’t be a data scientist if you didn’t major in stats, or even that you need to have taken a stats class in school. However, there is a big difference between I-kind-of-get-logistic-regression-and-I-can-implement-it-in-R (Doer) knowledge on the one hand, and knowing that logistic regression is part of a more general class of models, knowing how the coefficients and confidence intervals are estimated, etc., on the other. Let’s call the latter I-understand-generalized-linear-models (Understander) knowledge. I submit that Understander-type knowledge is important, especially when your logistic regression starts spitting out crap results, or your estimation routine doesn’t converge, or, God forbid, you have weird data and have to write your own estimation routine because there isn’t an R package for what you want to do. This is to say nothing of the possibility of re-inventing an inferior wheel for something that smart people developed good methods for a while ago.
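As a sketch of what’s under the hood when you call the one-liner (the kind of thing an Understander can write when no package fits), here’s logistic regression fit by Newton’s method on simulated data with made-up coefficients, with confidence intervals read off the Fisher information:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
# True model (invented for illustration): logit(p) = -0.5 + 1.2*x
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept

# Newton-Raphson (iteratively reweighted least squares), the standard
# maximum likelihood fit for generalized linear models.
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                # variance of each Bernoulli outcome
    info = X.T @ (X * W[:, None])  # Fisher information matrix
    beta = beta + np.linalg.solve(info, X.T @ (y - p))

# Standard errors and 95% confidence intervals from the inverse information.
se = np.sqrt(np.diag(np.linalg.inv(info)))
ci = np.column_stack([beta - 1.96 * se, beta + 1.96 * se])
print(beta, ci, sep="\n")
```

Roughly twenty lines, and every one of them is a place where a Doer-level analysis can silently go wrong: the link function, the weights, the convergence, the source of the standard errors.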
Data science is important. So is being an Understander
In an ideal world, not really understanding stats (i.e., consistently operating at a Doer level) would eventually catch up with you as a data scientist, but it’s often hard to catch someone out on this without a detailed examination of their analysis, and I’m not sure there is enough transparency and self-policing for that to happen all that often. Statistics has a somewhat deserved reputation for being easily manipulated to express a certain point of view (though the abuse of statistics is more the realm of, say, politicians than of statisticians), which says something about how difficult it is to see when someone is building faulty statistical models, especially when their intentions are good and even more so when their results look pretty. Unintentionally faulty statistics are every bit as misleading as intentionally faulty statistics. David Madigan talked a bit about this in class. When, more often than not, the data scientist is the expert in the room, it’s up to them to understand statistics and to make sure they’re building sound models.
But really, who cares if someone at an online advertising company implements a wacky model? Other than some awkwardness when I try to show my family a YouTube video of a chimpanzee riding a Segway (still one of my all-time favorites) and an ad for lingerie pops up, does it matter if a data scientist gets it really, really wrong? With data scientists working on more and more problems that impact public welfare, the answer has to be a resounding, “Yes!”
A student asked Claudia Perlich a great question in class: “Why do so many smart data scientists go into marketing?” Claudia gave a couple of reasons, but the one that stood out for me was when she compared working in marketing to doing research in breast cancer detection, which she had talked about earlier in the class in the context of winning the 2008 KDD Cup. She said that it doesn’t matter so much if you get it wrong in marketing, but getting it wrong in either direction in cancer detection carries much more serious consequences, which would keep her up at night. This doesn’t mean that data scientists are cowards or shouldn’t work on those types of problems, but it does mean that if you’re going to work on a problem that affects others, you should damn well do your best to make your model every bit as sound and robust as it can be. It means knowing every assumption and every conceptual detail involved in your model, and how they might affect the people on the other side of it.
Now let’s pretend you’re a full-fledged data scientist but you never really made the effort to understand statistics beyond the Doer level, and you’re hired to work on the breast cancer detection problem. You might fit a support vector machine (SVM) model, carefully choosing the basis functions and tuning the parameters. Nice. Feel good about yourself. It likely performs well from a prediction perspective. Your model is estimating whether, given the data in the images, the probability of a patient having breast cancer is greater than 50%.2 A statistician or, even better, a data scientist with an Understander grasp of statistics, wouldn’t be satisfied by this, though, and might fit a penalized logistic regression model (PLR) instead. Why? Not only does this model estimate the probability of a patient having breast cancer, it also enables inference: Is that estimated probability significantly different from 50%? Can we trade some bias for lower variance to get a better predictor? What are the properties of the estimator when we have unlimited data? These types of questions, not to mention things like sampling, experimental design, analysis of confounders and causality, sample size, etc., have been the bread and butter of statistics for a long time.
There’s something more here. Given the Understander’s assumption on the distribution of the data (namely, that conditional on the data in the image, the outcome — cancer or no cancer — has a binomial distribution, which we’re pretty much doing implicitly in the SVM case) and their prior knowledge/beliefs about the model parameters (which takes the form of the penalization term), they’re estimating the model that is most likely to explain the data. That is why it’s called maximum likelihood estimation (MLE), which is one of the most important concepts in statistics. (Nevermind that PLR is doing maximum a posteriori (MAP) estimation — it’s just the Bayesian equivalent of MLE, and anyone still passionately fighting the Bayesian vs. Frequentist war can kindly take that elsewhere.)
Breast cancer detection is clearly a very difficult problem, and I’m not claiming one method is better than the other. There are certainly situations where prediction is all that’s needed and it’s efficient to just scrap the inference. I’m using this example to highlight the value of approaching the problem like a statistician. The approach is the core issue. Claudia said that she thinks things as important as breast cancer diagnoses should be done by human experts, i.e., doctors, and that machines can serve to provide more information or to raise a flag if they see something the expert doesn’t. I generally agree with her. In that setting, wouldn’t it be better to have a system report a probability, with a corresponding confidence interval, of having identified cancer rather than just a binary prediction? When important decisions are made based on that information, it seems obvious to me which is more valuable.
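What “a probability with a confidence interval” might look like in code, with both numbers below hypothetical stand-ins for a real model’s output for one patient: build the 95% interval on the logit scale, then map it to the probability scale.

```python
import math

logit_hat, se = 0.8, 0.35  # hypothetical fitted log-odds and standard error

def expit(z):
    # Inverse logit: maps log-odds back to a probability.
    return 1 / (1 + math.exp(-z))

lo, hi = logit_hat - 1.96 * se, logit_hat + 1.96 * se
print(f"P ~= {expit(logit_hat):.2f}, 95% CI ({expit(lo):.2f}, {expit(hi):.2f})")
# Transforming the endpoints keeps the interval inside (0, 1), which a naive
# interval computed directly on the probability scale wouldn't guarantee.
```

A doctor reading “about 0.7, but it could plausibly be anywhere from the mid-0.5s to the low 0.8s” is far better informed than one reading “flagged.”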
Scientists of all kinds, including data scientists, work on important problems. When those problems intersect with human lives — in the case of the data scientist, this is almost all the time — the work becomes even more important, and the results have a big impact. Data scientists are working in healthcare, education, city planning, politics, and yes, marketing — the list goes on. Their work is highly relevant and impacts the lives of many people, and as such, data scientists should carefully weigh the consequences of their analyses and make them as informative as possible. Being able to provide information about the precision and accuracy of their estimates is often just as important as the estimates themselves.
Stodgy statistics is dead (or dying)
Given the clear advantages of approaching these problems with a strong understanding of statistical methodology, I would expect statistics to be a highly valued area of expertise in the data science community, but somehow the hacker mentality prevails and statistics often gets cast as a cranky, crotchety old man who should be ignored by us rebellious youth. I’m imagining some young, stylish data scientist bursting into a statistics classroom full of bored-to-death students listening to a crusty old professor talking about t-tests and yelling, “Throw off the chains of your oppressive statistical models! Data science will liberate you!” Cue music as the students throw their papers in the air and run out the door laughing and shouting.
Sure, there’s some truth to statistics being dull and uninspiring. For a long time, most of statistics as a discipline was too rigid, wed to classical approaches with elegant solutions disconnected from actual data, and many times not particularly interested in solving real-world problems. But that view is itself largely outdated. While there is still a non-trivial percentage of academic statisticians practicing stodgy statistics, that era is largely over. In whatever sense that data science is cool, so is statistics — at least I think so. I nerd out over things like this and this and this. (Then again, I also like stuff like this and this, so I guess cool might be the wrong word.)
Even if you don’t find the theory behind the methods particularly thrilling, it’s clear that a lot of statisticians are working on interesting real-world problems. Much like data scientists, actually. So much so that you might be tempted to say that data science is just statistics done poorly, which I strongly disagree with. First of all, plenty of data scientists know statistics well enough that they could call themselves statisticians and nobody would raise an eyebrow. Secondly, while statistical analysis may be at the heart of the work a data scientist does, it’s usually only one step in the process. We’ve seen a lot of good examples of this from the guest speakers in class, whether it’s visualization, fraud detection systems, or building statistical analysis back into improvements to a website. So while there’s significant overlap between the two, statistics and data science are not the same thing, and they usually have different objectives.
Statistics is cool — spread the word
If statistics actually is interesting and relevant, why, then, the lingering odor of outdatedness? I consider this a failure on the part of statistics as a field. It has a history of isolationism when it comes to working on methodology across fields, and as a result has been slow to move, both intellectually and pedagogically, into many areas of real current interest. Thankfully, there have been statisticians who realized this a while ago and have been important in bridging some of the gaps between statistics and other fields like computer science and machine learning.
Even so, what we’ve got here is a failure to communicate. There’s a thin line between telling the outside world that you came up with something useful and shameless self-promotion, but statisticians have rarely done either. When the typical new data scientist comes from a field other than statistics and only has time to pick up whatever stats knowledge they deem useful (i.e., Doer-type knowledge), I can hardly blame them. They’re solving a constrained maximization problem in which their objective function doesn’t seem to increase much by amassing Understander-type knowledge. Statisticians need to do a better job of communicating the importance of statistics — of showing how much better a data scientist who could moonlight as a statistician really is.
I keep thinking of the talk that Mark Hansen gave to the class a few weeks ago. In one part he described his work at the New York Times on Project Cascade, which, by the way, is pure viz bliss. As much as I like the viz, though, I find the methods being used to drive it way more interesting. Here’s what got recorded on the class blog about it: “There were of course data decisions to be made: a loose matching of tweets and clicks through time, for example. If 17 different tweets have the same url they don’t know which one you clicked on, so they guess (the guess actually seemed to involve probabilistic matching on time stamps so it’s an educated guess).” Cathy is extremely thorough with these kinds of things, so I’m comfortable saying that Mark didn’t actually say how they made the educated guess. Luckily for me, I saw Jake Porway talk about Project Cascade last spring, and he described how they did it: Bayesian inference done with Gibbs sampling. Unluckily for the students of this class, this really interesting bit of modern statistics got swept under the viz. So now when they think about how they might do something similarly ambitious, the list of skills is: manage a massive data set; make an educated guess; design and program something that looks really, really awesome in Processing. But something’s missing, something worth learning and understanding, and the students aren’t even aware that it’s missing. To be fair, Mark’s talk wasn’t about statistics, per se, but even a brief mention of the methodology would have been better than nothing. This seems like a missed opportunity to inspire curiosity about statistics, and statisticians need to do a better job of just that.
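Since the matching details weren’t covered in class, here is a purely hypothetical toy version of “probabilistic matching on time stamps,” under assumptions that are mine, not Cascade’s: a uniform prior over the candidate tweets and exponentially distributed click delays. It computes the posterior for a single click directly; the real system reportedly used Gibbs sampling over the full multi-click model.

```python
import numpy as np

tweet_times = np.array([0.0, 5.0, 9.5])  # minutes; three tweets sharing one URL
click_time = 10.0
rate = 1 / 3.0  # assumed mean click delay of 3 minutes (invented)

delays = click_time - tweet_times
# Likelihood of the click under each candidate tweet; a tweet can't cause a
# click that happened before it was posted.
likelihood = np.where(delays > 0, rate * np.exp(-rate * delays), 0.0)
posterior = likelihood / likelihood.sum()  # uniform prior over candidates
print(posterior)  # most of the mass lands on the most recent prior tweet
```

Even this toy version is a more interesting and more honest answer than “they guess,” and it’s exactly the kind of thing that could have made students curious about the statistics.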
Catching up the curriculum
How to teach statistics (or anything, really) in ways that are consistently inspiring is a difficult problem, but a good place to start is by taking the theory classes, especially at the undergraduate level, out of the vacuum. Statistics has been taught for too long as if it’s disconnected from the world. Students will be more motivated to learn the math and the details, to become Understanders, if they know how interesting and rich the potential applications are. Keep teaching MLE, and show how it might be useful in detecting breast cancer. Keep teaching Bayesian inference, and show how it might be used to match tweets to clicks. Iterate. Unless a student has their heart set on being a probabilist, their interest in statistics will be driven to a large extent by applications of statistics to real-world problems — like those encountered by data scientists. The failure to adapt curriculum is as much responsible for statistics’ reputation as boring and outdated as is anything else.
I’m continually amazed by the new and important ways in which people are using data. In doing so, they’re enriching the lives of many others. As data scientists continue to explore and innovate, I hope they develop an appreciation and deep understanding of what statisticians have been doing for all these years. At the same time, in order to thrive as a field, statistics needs to continue its descent from the ivory tower and adapt its research, teaching, and communication. I’m committed to holding up my end of the bargain as a statistician. Hopefully the data scientists in this class (and elsewhere) are convinced that it’s a good deal.
1. This actually applies to any field using statistical methodology, particularly those that don’t necessarily keep up with current statistics research.↩
2. You can get approximate estimates of the probability using SVMs, but that’s not really what they’re good at and anyway, that’s not the Doer’s approach here.↩