This is my last blog post for Statistics 4242, Introduction to Data Science at Columbia University. All final projects have been turned in; grades have been given; the semester is over. I reserve the right to start blogging again at a later date.
From the beginning, this course viewed Data Science simultaneously in two (seemingly) contradictory ways:
(1) a set of best practices
(2) a (potentially) deep scientific discipline, not yet well-defined.
So we were set up right there with a tension between the practical and the profound. Allow me to elaborate further.
We explored both these perspectives throughout the semester, and this blog has documented some of that, though not whatever has been going on in your heads.
Question-askers and Problem-solvers
On the last day, I mentioned that what matters to me is that you become problem solvers, and cultivate habits of mind that allow you to figure out what to do in situations, or when faced with problems, where you don’t know what to do. To struggle with the ill-defined, with the unstructured, and with the nebulous. To be curious, creative problem-solvers and question-askers, to use data to make the world better, not worse.
Recently, DJ Patil, who with Jeff Hammerbacher first introduced “Data Scientist” as a job title in Silicon Valley, spoke about data scientists, and two key takeaways for me were:
(1) When deciding upon this new job title, they considered the job title “Data Artist”
(2) Two key traits, in his view, are curiosity and story-telling
Observe that many people may have these two traits but lack the prerequisite “best practices”: the ability to write code, to build models and algorithms, to select features; in short, the statistician-software engineer hybrid. But the reverse is also true: it is possible to get many of the technical skills in other courses (machine learning, applied statistics, multiple CS courses), so I wanted this course to cultivate the more artistic side of data science. The effective data scientist is adept at the technical, while simultaneously embracing their humanity.
Gain Mastery over Technique so you can Create
Art departments, architecture and journalism schools, and engineering schools have a history of teaching subjects in which Mastery over Technique and human expressions of creativity co-exist. Degree programs in these disciplines emphasize a set of skills one must master, while simultaneously allowing students to create products of their own ingenuity: works of art, designs for new buildings, newspaper and magazine articles, software and code, robots. Let’s add data visualizations and data products to the list.
I attempted to embrace this duality in the design of this course, giving you a set of practical skills through the labs, problem sessions and homework assignments, while simultaneously throwing you into projects where your own ingenuity and creativity would be called upon, culminating in the Kaggle competition and the Think Piece. While this was only version 1.0 of Introduction to Data Science, I feel comfortable saying I was on the right track with this, though I recognize that it was uncomfortable for many of you.
Think Piece Accepted to O’Reilly Radar
Shall I start with proof of success? The Think Piece has already been accepted to O’Reilly Radar, and will be published later next semester. The piece investigates the paradigm shift going on in industry and universities with respect to data. This assignment caused some amount of chaos in the class. I know you didn’t all like it. But I did warn you about simulated chaos early on in the semester. You were transformed from Data Scientists to Data Journalists: question-askers and story-tellers (see Patil above). Some might argue, though, that it was not the transformation itself that was initially bothersome, but the lack of structure (see Habits of Mind above for why this made sense pedagogically, as well as Data Science for Change for structure provided).
Data Science and Data Journalism
I have come to think of Data Scientists and Data Journalists as duals of each other. For non-mathematicians (from Wikipedia): “a duality, generally speaking, translates concepts, theorems or mathematical structures into other concepts, theorems or structures, in a one-to-one fashion, often (but not always) by means of an involution operation”. Another analogy to consider is that of gene expression. Consider a collection of skills and traits: coding, ability to work with data, data wrangling, data visualization, story-telling, curiosity, question-asking, statistics, machine learning, writing, communicating… To do the job of a data scientist, you need these. To do the job of a data journalist, you need them as well. They will express themselves differently in different people.
Use Data Science to Solve Humanity’s Big Problems
Similarly, to do anything that could be classified as scientific research, or more narrowly research that involves data, one must have cultivated these skills and traits. My hope is that some of you will go on to use Data Science to solve humanity’s big problems. I believe that the key to this will be the formation of teams of people who collectively have a set of skills and habits of mind that include both the Data Scientist Skill Set and Domain-specific knowledge. The students in this course came from across disciplines, and so collectively you have the potential to solve problems in environmental engineering, sociology, political science, urban planning, epidemiology, bioinformatics; the list goes on.
Data Science Curriculum and Pedagogy; Cause a Better Future
I have set out in this course to develop a data science curriculum and pedagogy that embraces the complexity of a field only now emerging. One that cultivates skill sets, habits of mind, and a philosophy and set of ethics around building models and algorithms that will not just Predict the Future, but Cause the (a better) Future. One where we embrace our humanity, while simultaneously building up a technical foundation that involves working with machines, to ultimately solve important problems. More on my advocating for using data for a better humanity, and embracing one’s own humanity, can be found in my TEDxWomen talk, which I gave on December 1st. It has not yet been posted as of this writing, though I expect it will be in the next week. It is partially dedicated to you (the students). [Updated on Jan 8, 2013: here is the TEDx talk]
Dear Students, thanks for going on this investigative journey with me. I think we made a lot of progress together in understanding the potential of Data Science, and in prototyping a Data Science course. Ideally this wouldn’t have been a single course, but several courses; you got the intensive version! Keep in touch, and remember that you are Next-Gen Data Scientists. I have high hopes for you!