Data Science is an emerging field in industry, yet it is not well defined as an academic subject. This is the first course at Columbia that has the term “Data Science” in the title. Recently Allen Bernard, a freelance journalist working on an article for CIO.com about the emerging role of the data scientist, asked me three questions, and in my answers I feel we’re getting closer to The Case for Data Science. Here are the questions, and my answers:
- Why did you put this course together?
- Why now?
- What is a Data Scientist?
Why did you put this course together?
I proposed this course in March, 2012. There are three primary reasons. The first will take the longest to explain.
Reason 1: In short, I wanted to give students an education in what it’s like to be a data scientist in industry and give them some of the skills data scientists have.
I was working on the Google+ Data Science team with an exciting interdisciplinary team of PhDs. There was me (a statistician), a social scientist, an engineer, a physicist and a computer scientist. We were part of a larger team that included talented data engineers who built the data pipelines, infrastructure and dashboards, as well as the experimental infrastructure (A/B testing). Our team had a flat structure. Together our skills were extremely powerful, and we were able to do amazing things with massive data sets, including predictive modeling, prototyping algorithms and unearthing patterns in the data that had a huge impact on the product.
We provided leadership with insights for making data-driven decisions, while also developing new methodology and novel ways to understand causality. Our ability to do this was dependent on top-notch engineering and infrastructure. We each had a solid mix of skills including coding, software engineering, statistics, mathematics, machine learning, communication, visualization, exploratory data analysis, data sense and intuition, as well as expertise in social networks and the social space. No one of us excelled at all those things, but together we did, and we recognized the value of all those skills; that’s why we thrived. What we had in common was integrity and a genuine interest in solving interesting problems, always with a healthy blend of skepticism as well as a sense of excitement over scientific discovery. We cared about what we were doing and loved unearthing patterns in the data.
I live in New York and wanted to bring my experience at Google back to students at Columbia University because I believe this is stuff they need to know, and I enjoy teaching. I wanted to teach them what I had learned on the job. And I recognized that there was an emerging Data Scientist community in the New York Tech scene, and I wanted students to hear from them as well. So one of the aspects of the class is we have guest lectures by awesome Data Scientists, each of whom has a different mix of skills. We hear a diversity of perspectives, which contributes to a holistic understanding of data science. See more about the guest lecturers here.
Reason 2: Data Science has the potential to be a deep and profound research discipline impacting all aspects of our lives. Columbia University and Mayor Bloomberg announced the Institute for Data Sciences and Engineering in July, 2012. This course creates an opportunity to develop the theory of Data Science and to formalize it as a legitimate science.
Reason 3: I kept hearing from data scientists in industry that you can’t teach data science in a classroom or university setting and I took that on as a challenge. The students I have right now can certainly be turned into top-notch data scientists. They’re an extremely impressive and interesting group. I’m thinking of my classroom as an incubator of awesome data science teams.
Why now?
Today human civilization has access to massive amounts of data about many aspects of our lives, and, simultaneously, an abundance of inexpensive computing power. A lot of human behavior (shopping, communication, reading news, listening to music, finding out information, expressing our opinions) went from offline to online. What’s true of our online behavior that isn’t true of our offline behavior is that we have data about it. So now we have all this information about how humans behave, and there’s a lot to learn about who we are as a species right there. It’s not just internet data, though: it’s finance, government, medical, pharma, bioinformatics, social welfare, education, retail; the list goes on. Across many sectors there is data. In some cases it’s Big, in other cases it’s not.
It’s not only the massiveness that makes data interesting or poses challenges. It’s that the data itself, often in real time, becomes the building blocks of data products. On the internet, this means Amazon recommendation systems, friend recommendations on Facebook, film and music recommendations, and so on. In finance, this means credit ratings, trading algorithms and models. In education, this is starting to mean dynamic personalized learning and assessments coming out of places like Knewton and Khan Academy. In government, this means big data impacting policy and intervention. The result is a feedback loop in which our behavior changes the product and the product changes our behavior. Technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This wasn’t true a decade ago.
Bernard politely interrupted me here and said “And don’t forget about the Internet of Things.” True! We mustn’t forget about the Internet of Things. Things and Humans. Let’s investigate that line of thought another time.
What is a Data Scientist?
Let me start with academia because that’s quicker. Then industry.
In Academia: No one calls themselves a Data Scientist yet in universities. There are 60 students in my class from across disciplines. I thought when I proposed the course it would be statisticians, applied mathematicians and computer scientists who showed up. Actually it’s them plus sociologists, journalists, political scientists, biomedical informatics students, students from NYC government agencies and non-profits related to social welfare, someone from the architecture school, environmental engineering, pure mathematicians, business marketing students, and students who already work as data scientists. Am I missing someone? They’re all interested in figuring out ways to solve important problems, often of social value, with data.
For the term Data Science to catch on in academia at the level of the faculty, the research area needs to be more formally defined. I see a rich set of problems that could be many PhD theses. My current working definition is that a Data Scientist in this setting is a Scientist (from social scientists to biologists) who works with large amounts of data, and must grapple with computational problems posed by the structure, size, messiness and nature of the data, while simultaneously solving a real-world problem. Across academic disciplines, the computational and deep data problems are the same. So if researchers across departments join forces, they can solve multiple real-world problems from different domains.
In Industry: It depends on the level of seniority and whether you’re talking about the internet industry in particular. The role of data scientist need not be exclusive to the tech world, but that’s where the term originated, so for the purposes of this conversation, let me say what it means there:
A Chief Data Scientist should be setting the data strategy of the company which involves a variety of things: setting everything up from the engineering and infrastructure for collecting data and logging, to privacy concerns; deciding what data will be user-facing, how data is going to be used to make decisions, and how it’s going to be built back into the product. She should manage a team of engineers, scientists and analysts and she should communicate with leadership across the company including the CEO, CTO and product leadership. She’ll also be concerned with patenting innovative solutions, and setting research goals.
More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning and munging data, because data is never clean. This process requires persistence, statistics and software engineering skills, which are also necessary for understanding biases in the data and for debugging logging. Once she gets the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense. She’ll find patterns and build models and algorithms, some with the intention of understanding product usage and the overall health of the product, and others to serve as prototypes that ultimately get baked back into the product. She may design experiments, and she is a critical part of data-driven decision making. She’ll communicate with team members, engineers, and leadership in clear language and with data visualizations, so that even if her colleagues are not immersed in the data themselves, they will understand the implications.