Yesterday Gil Press featured our class in his Forbes column, Big Data News of the Week. I’d like to take the opportunity to briefly discuss the ideas he raises in his article, as well as my broader struggles with the celebrification of data scientists.
Our class in the press
I call this to your attention because I am proud that we have the power to impact the wider dialogue around Data Science, and I want you to know that you are a part of it. We have the power to steer the conversation toward substantive issues: data intuition, values and integrity, the human role in Data Science, ethics and hubris, and not just the technical and algorithmic aspects, which are important too, but not particularly new.
Most importantly, Press calls attention to last week’s blog post, Data and Hubris, by our in-class historian (AKA Columbia Professor Matt Jones, who has been auditing the class). Press then goes on to connect it to Josh Wills’ “big data economic law”, which Josh mentioned to us in his guest lecture last week: individual records are not worth much on their own, but collectively they can have real value. Press then describes how Josh’s law is being discussed and debated elsewhere in the press.
The value of Big Data
Let me address the debate briefly. How much information can be extracted from data has long been a concern of statistics, and is related to sample size calculations, the granularity level of the analysis, and the estimation or prediction problem one is trying to solve. Ultimately it comes back to what kind of question are you trying to answer or what kind of problem are you trying to solve? Are you building an algorithm to predict or a model to explain? At what level of granularity do you need to be? How accurate do you need to be? How much uncertainty are you willing to withstand? And so on. This is what the field of statistical inference is all about.
Ariel, a student in the class, asked: why can’t we just sample? Great question. In many cases we can. As Josh pointed out, it’s in cases where the long tail matters (i.e., very rare events or sparse data sets) that sampling, even from massive amounts of data, loses information. In search this means “rare queries”. In health care or marketing, it could mean rare health conditions or characteristics, or people in sparsely populated geographic regions or demographics (people between the ages of 72 and 84 who bought an iPod yesterday and have already downloaded more than 100 songs to it, for example).

Bayesian methods such as hierarchical (multi-level) modeling make it possible to borrow strength from less sparse regions of the data set to make inferences in sparser regions where we don’t have as much data. (And this can happen even in very large data sets: at a granular level there may not be much information.) Makoto, one of my colleagues at Google, was in the classroom that day, and in our conversation afterwards he reminded me of the unsolved sampling issues around network data, including very large network data sets. Even sampling is tricky.
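As a toy illustration of what “borrowing strength” means (this is my own minimal sketch, not anything from the lectures, and the segment names and counts are invented), here is a simple partial-pooling estimator that shrinks each group’s observed rate toward the global rate. A group with lots of data keeps roughly its own rate; a long-tail group with only a handful of observations gets pulled toward the overall estimate:

```python
def partial_pool(groups, prior_strength=50.0):
    """Shrink each group's observed rate toward the global rate.

    groups: dict mapping group name -> (successes, trials)
    prior_strength: pseudo-count controlling how strongly sparse
                    groups are pulled toward the global rate.
    """
    total_successes = sum(s for s, n in groups.values())
    total_trials = sum(n for s, n in groups.values())
    global_rate = total_successes / total_trials
    pooled = {}
    for name, (s, n) in groups.items():
        # Weighted average: large groups dominate their own estimate,
        # sparse groups are dominated by the global rate.
        pooled[name] = (s + prior_strength * global_rate) / (n + prior_strength)
    return pooled

# Hypothetical example: one dense segment, one long-tail segment.
groups = {
    "dense_segment": (5200, 100000),  # plenty of data: estimate barely moves
    "sparse_segment": (3, 10),        # raw rate of 0.30 is extremely noisy
}
estimates = partial_pool(groups)
```

This is only a caricature of a real hierarchical model (which would learn the amount of shrinkage from the data rather than fixing `prior_strength`), but it captures the core idea: the sparse segment’s estimate ends up between its noisy raw rate and the global rate, rather than trusting ten observations at face value.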
All this to say that these are actually quite complicated questions and problems, and there is a lot of research still to be done. And while soundbites and tweets can sometimes capture the essence of ideas quite well, they often make it all seem much simpler than it is, and diminish the tremendous amount of work that high-caliber researchers across universities and industry have been doing on these problems for decades, since well before “Big Data” became a buzzword. There is no quick one-line answer, other than “It’s complicated”, and we must build our work on solid foundations.
I should add that the new types of data we have create interesting new research problems as well, specifically around time and streaming data, text and other forms of unstructured data, and what inferences and predictions can be made from them. There are also interesting problems around what inferences or predictions can be made at the individual or user level, even with very large data sets. I view these all as open, non-trivial problems.
Celebrification of Data Scientists
On the one hand, I’m excited that data gets so much attention, because that’s why our class is getting attention and we are discussing important ideas that deserve it. On the other hand, I struggle with the fact that beyond our class, the discourse around data and data science takes place in vapid tweets and sensationalized media reports. Oftentimes bloggers, tweeters and the press are not qualified to distinguish fact from fiction, or what is actually new from what others have been laboring over for years, and they end up celebrating people who are merely standing on the shoulders of giants, or worse, people who do not know what they are talking about in the least.
My concern is: how are readers to know what is real and what is fake? This problem could be re-framed as a machine learning problem, but that’s another story.
As the course comes to an end, lingering with me are still some of the same struggles and questions I’ve had with the broader “Data Science ecosystem” since before the course began.
Celebrification of Data Scientists and Data Science Posers is disturbing. We are in a world where self-appointed experts with very little experience or expertise get celebrated in the press with no due diligence or peer review process; where falsehoods spread like wildfire on Twitter; where self-promotion carries more weight than actually Doing Data Science; where people with no credentials pontificate on subjects they know nothing about, and do not have the strength of character to say “You know, you shouldn’t be asking me about this. I am not an expert”, but instead take the opportunity to climb their way up “best-of” lists and industry expert panels on the basis of a weak foundation and lies, while real statisticians, analysts, researchers, software engineers and computer scientists labor away relatively anonymously, actually doing the work.
This troubles me deeply because the consequences could be devastating. If it remains possible for virtual charlatans to pass themselves off as “scientists”, then the models and algorithms they build could do damage through the products and public policies based on their analyses. Further still, if they don’t have the integrity to say “I am not qualified to talk about this”, then they certainly don’t have the integrity to evaluate the consequences of their work.
In class, I asked Josh what he thought about the celebrification of data scientists, and he said it didn’t really bother him. The term “Data Scientist” put a name to what he was already doing, and he’s benefiting from it. In general, he just doesn’t have the same level of angst over the whole thing that I do.
Cathy O’Neil and I discussed Data Science Posers a lot over the summer. I wanted to create a class that as Cathy said was a “no bullshit zone”.
“Big Data” and “Data Science” is steeped in so much marketing hype and attention-seeking, self-promoting behavior, that it was essential to create a classroom where ideas and truth-seeking trumped hype and sensationalism. I think we achieved this with the wonderful guest lecturers we had throughout the semester who were honest about the struggles they have with building models, algorithms, evaluation metrics, feature selection and so on, as well as the ethical dilemmas they faced, and how they bring their humanity with them to problem solving.
But still this course comes up right against the fact that we live in a society where ideas don’t matter anywhere near as much as who tweeted them first. Statistics classes never faced this problem. Then again, I brought this on — I created the class in the first place to address this very problem.