Languages are used for communication, and the better we are at communication, the more effectively we can solve problems. I want to revisit the issue of language in the following senses: programming languages, the languages that people in various disciplines (or domains) speak, and the language of Data Science. I also want to raise the issue of religious wars over preferred language, the danger this poses for scientific advancement, and how, as a class, we can make progress towards overcoming it.
Also, I want to anoint you the Next-Gen Data Scientists, with all the responsibility and hope this entails! I’m inspired to do so after a conversation with Greg Wilson, the lead instructor of this weekend’s boot camp, and Columbia Statistics Department’s Professor Victoria Stodden, and after reading Cathy O’Neil’s blog post yesterday.
Background: Software Carpentry
To provide some immediate context, we organized an (optional) two-day computational skills workshop (or Boot Camp) on Friday-Saturday for students in the class. It was offered through Software Carpentry, a non-profit run by Greg Wilson, with the following stated mission:
Our mission is to help scientists be more productive by teaching them basic computing skills. Our approach combines short, intensive workshops with self-paced online instruction. The benefits are more reliable results and higher productivity: a day a week is common, and a ten-fold improvement isn’t rare.
Greg and his team of instructors (scientists themselves) use evidence-based instructional methods, and when I met the students at the end of the workshop on Saturday, there was an evident level of enthusiasm for what they had learned, ranging from the very practical to what could perhaps be described as finding new religion. Some topics covered:
— Pair programming. In addition to Greg’s recommendation, anecdotally, Ian Wong at Square is an advocate (he’ll be coming in a few weeks to speak to us), and Princeton Professor Brian Kernighan (the “K” in AWK) is also a proponent. Here’s an interesting discussion of pair programming by Stuart Wray.
— Time management (the brain can only be productive for an hour at a time, so chunk coding tasks into hour-long pieces)
— Code reviews (chunking code to make it easier for the reviewer; fresh eyes help debug)
— Version control
— Unit tests
— Creating organized, readable code
— Building a solid foundation of software engineering skills
Although I wasn’t in the room most of the time, from speaking to Greg and the students, and observing small portions, it was evident that Greg infused the workshop with a lot of his hard-earned wisdom.
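To make two of the topics above a little more concrete (unit tests and organized, readable code), here is a minimal sketch in Python. The function `mean` and the two test functions are hypothetical examples invented for illustration; they are not taken from the boot camp material.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        # Fail loudly instead of silently dividing by zero.
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_of_known_values():
    # A case small enough to check by hand.
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_rejects_empty_input():
    # The edge case is part of the function's contract, so test it too.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    test_mean_of_known_values()
    test_mean_rejects_empty_input()
    print("all tests passed")
```

Each test checks one behavior and has a name that says what it checks, so when something breaks later the failing test tells you where to look. The same idea carries over to R.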
Students who were there, feel free to add comments if I missed something or got something wrong, or if you agree.
A lot of these practices and this culture are baked into Google, and I learned them on the job. But rather than having students wait to learn them on the job or, worse yet, never learn them at all, Greg and his team (and, I hope, my team and I) are helping build the foundations now. Now remember! The students in this class are primarily not computer scientists or aspiring software engineers; you are Scientists (social scientists, biologists, physicists,…). A lot of you are going to be dealing with data sets specific to your domain, and the stronger you are in computational, coding and engineering skills, the easier it’s going to be for you to solve the science problems in your domain. Greg asked how many students in the room actually liked programming, and a fair number raised their hands. Greg said this was unusual for scientists. But you are “Data Scientists”! [Next-Gen Data Scientists who are going to save the world! No pressure.]
There was some selection bias here. Greg readily admitted that he lost a large number of students at some point because of installation/implementation issues, and this is something he’d want to improve upon going forward. It’s also something I need to find out more about. Students weren’t required to take this boot camp, so we probably started with about 1/2-2/3 of the class on the first day, and by the end it seemed to be down to about 1/4-1/3. I need to find out why those who left did so. (Please feel free to email me privately or talk to me after class if you have feedback. It would be extremely helpful.)
Language Religious War: R versus Python
When I walked into the classroom towards the end of Day 1, there seemed to be a little bit of a religious war (or maybe I instigated it?) over R versus Python. A lot of statisticians in Columbia’s Statistics Department code in R, and that’s primarily what we’ve used in the course until this point. Greg and his team teach Python (which I also want students to learn, and which is why I brought in Greg). The reason Greg hasn’t taught R thus far is that he doesn’t believe rigorous coding practice is part of the culture that surrounds R. I buy this in the sense that students generally learn R as needed in the classroom and don’t take an “Intro to R” course. But that’s the point of Jared’s Lab that we’re offering. Just because good coding practice may not be part of R’s culture in academia doesn’t mean that it’s not possible to code well in R. We want students to code well in R. More importantly, we want students to code well. (Greg mentioned they found a good instructor for R going forward, and are always looking for more.)
R and Python are both good choices for scientists who deal with data. You don’t need to choose; you can use both. Developers and contributors are constantly introducing new versions, packages and functionality, so there’s always more to keep learning. But whichever language you start with, you can pick up other languages and the peculiarities of their syntax later on as you need them. Don’t get too attached to tools, languages and methods; use what gets the job done. Be versatile.
Silos of Academia (Domains) [Big Data Domain Surfing (Part 2)]
This whole discussion of choice of language reminds me of my unfinished discussion of domain-specific languages a couple weeks ago. Key points I was making in an obscure way:
(1) Different domains (bioinformatics, sociology, finance) use fairly specific vocabulary that people outside the domain don’t understand.
(2) This can be alienating.
(3) It’s important, as a data scientist, not to be alienated, but instead to ask lots of questions so as to get to the data problem underneath. Don’t let the fact you don’t know certain words make you think you can’t understand the underlying problem. You just have to be honest with yourself about what you don’t know and be willing to ask questions of the person (domain expert) who does know.
(4) Scientists across domains are solving the same problems, just in different contexts. The spam classification problem is “data science equivalent” (DSE, I made that up) to the suicide note classification problem which is DSE to the ad-click prediction problem at M6D. The structure of the data is the same, the computational aspects are the same, the method is the same.
(5) If Scientists have to spend all their time struggling with the underlying engineering and infrastructure, getting their computational skills up to speed, learning all the algorithms, struggling with massive data sets, etc., then they’re not going to have much time to Save the World. Students and Scientists are going to spend their entire PhD program just learning the computation, when really they ought to be finding a cure for leukemia, unlocking the secrets of autism, and finding ways to help foster children.
(6) The silos of academia are a problem. There needs to be solid engineering infrastructure and support (the kind that I benefit from at Google), along with experts at understanding the structure of data and Machine Learning and Statistics, so that Scientists (Data Scientists) across domains can all rely on that same data engineering support and in fact collaborate across domains. Even though they may be focusing on solving *different* important real world problems, they are in fact trying to solve the *same* data problems; and they need to be able to get past their differences in domain-specific language and talk to each other using the language of Data and Data Science.
(7) I’m confident, even in my brief time so far with the students in this class, that some (if not all) of you are motivated and capable of doing this. And that’s why you’re the Next-Gen Data Scientists = [Real] Scientists who work with Data.
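Point (4) can be sketched in code. The following is a minimal illustration, assuming a bag-of-words naive Bayes classifier as the shared method; the post does not say which method was used in any of these problems. The function names (`train_naive_bayes`, `classify`) and the tiny data sets are invented for this example and are not real data. What matters is that the same two functions run unchanged on both problems, and only the rows and labels differ.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of examples
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the most probable label for text under the fitted model."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data for two different "domains".
spam_examples = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
ad_examples = [
    ("running shoes sale today", "click"),
    ("sale on trail running gear", "click"),
    ("quarterly tax form reminder", "no_click"),
    ("tax filing deadline notice", "no_click"),
]

# Same code path for both problems; only the data changes.
spam_model = train_naive_bayes(spam_examples)
ad_model = train_naive_bayes(ad_examples)

print(classify(spam_model, "claim your free prize"))
print(classify(ad_model, "running gear sale"))
```

The domain expert is still needed to say what the rows mean and which labels matter, but the data structure (labeled rows of features) and the computation are shared.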
What Languages Do I Use at Google?
Some students have asked what language I program in at Google. At various points I’ve used C++, R, Python, Sawzall, Dremel and shell scripts. I use them with varying degrees of sophistication and have code checked into production in some of them. I learned programming in introductory C/C++ classes as an undergrad at the University of Michigan (1994-1997) and as a grad student at Stanford (1998-1999). That got me fairly far in learning good coding practice, and helped me bring that to coding in R, which I learned as a PhD student at Columbia (2005-2009). I learned Python, Sawzall and Dremel on the job. Sawzall and Dremel were developed at Google, and each has a roughly comparable tool in the open-source world. Sawzall is used for mapreduce, which we’ll get to later in the semester; a comparable open-source tool is Pig. Dremel is a SQL-like query system internal to Google, and a comparable open-source tool is Hive. R is widely used by statisticians (AKA “quantitative analysts” or maybe now, “data scientists”) at Google. My approach is to use whatever helps me get the job done.
I started my career before the term Data Scientist existed. Next-Gen Data Scientists ought to gain solid computational skills to the extent possible in the classroom. Or, to put it another way, universities need to find ways to support this aspect of Scientists’ education.
Language of Data Science
Updates to our vocab list are welcome!