Here’s a list of 10 important ideas we’ve explored this semester so far.
10. Interdisciplinary Data Science Teams My experience at Google, along with DJ Patil’s piece on Building Data Science Teams, informs my understanding of how important interdisciplinary teams are. The students who showed up to take this class come from departments and disciplines across the university. I want you to build on your individual strengths, and also find ways to collaborate effectively with those whose strengths complement yours; this will be a critical component of your success going forward.
9. Democratization of Machine and Statistical Learning Algorithms Machine Learning algorithms were once used primarily within Computer Science and Statistics departments. Now, with the proliferation of new kinds of data sets, these algorithms are being applied across academic disciplines and in companies across sectors. With this democratization, it becomes imperative that the people using the algorithms understand what they mean and what impact they can have.
8. Build a solid foundation of good coding practices Scientists need support in building a solid foundation in writing code, and in practices such as pair programming, code review, debugging, and version control. We brought in Software Carpentry for a bootcamp this semester, and we also have the lab sections. Next semester, Ian Langmore and I will offer a new course, Applied Data Science, that puts this idea at the center.
7. Data Strategy For data scientists taking leadership positions in start-ups and industry, thinking in terms of a data strategy is a useful paradigm. Data strategy involves figuring out what data to collect or log; how to store it; the legal and space constraints; the pipelines built on top of it; how the data will be used as part of the company’s core business; and how decisions will be made from it.
6. Little Data In addition to working with massive data sets and the engineering and infrastructure built to analyze and process them, we also still work with Little Data at Google. Andrew Gelman gave a talk recently about the relevance of Little Data. David Huffaker mentioned the small-scale surveys and user-experience interviews he uses at Google to supplement the analysis we do on much larger data sets. Oftentimes we sample from Big Data, creating a Little Data set that can be used to explore and prototype.
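One simple way to turn Big Data into Little Data is uniform sampling. A minimal sketch, using reservoir sampling (the stream and sample size here are hypothetical), draws a fixed-size uniform sample in a single pass without needing to know the data's size in advance:

```python
import random

def little_data_sample(stream, k, seed=0):
    """Reservoir sampling: draw k uniformly random items from a stream
    of unknown length, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical "Big Data" stream of a million records, sampled down to 100.
sample = little_data_sample(range(1_000_000), k=100)
```

The resulting Little Data set is small enough to plot, inspect by hand, and prototype models on before scaling back up.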
5. The Space between the Data Set and the Algorithm Many people go straight from a data set to applying an algorithm. But there’s a huge space in between filled with important work. It’s easy to run a piece of code that predicts or classifies; that’s not the hard part. The hard part is doing it well.
One needs to conduct exploratory data analysis, as I’ve emphasized, and feature selection, as Will Cukierski emphasized. Brian Dalessandro emphasized the infinite number of models a data scientist has to choose from, constructed by making choices about which classifier, features, loss function, optimization method and evaluation metric to use. Huffaker discussed the construction of features or metrics: transforming variables with logs, constructing binary variables (e.g., did the user take this action more than 5 times?), and aggregating and counting.
Because it seems trivial, all of this is often overlooked, when in fact it’s a critical part of Data Science. It’s what Dalessandro called the Art of Data Science.
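The feature constructions Huffaker described can be sketched in a few lines. This is a toy example with an invented event log and an invented threshold, not any particular production pipeline:

```python
import math
from collections import Counter

# Hypothetical raw event log: (user_id, action) pairs.
events = [("alice", "click"), ("alice", "click"), ("bob", "click"),
          ("alice", "click"), ("alice", "click"), ("alice", "click"),
          ("bob", "purchase"), ("alice", "purchase")]

# Aggregating and counting: total actions per user.
action_counts = Counter(user for user, _ in events)

# Constructing features from the raw counts.
features = {}
for user, count in action_counts.items():
    features[user] = {
        "n_actions": count,                    # raw count
        "log_actions": math.log(1 + count),    # log transform to tame skewed counts
        "heavy_user": int(count > 5),          # binary variable: did this action > 5 times
    }
```

Each row of `features` is now something a classifier can consume, and each choice (the log, the threshold of 5) is a modeling decision worth examining rather than an automatic step.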
4. Being Human We now have tons of data on user (human) behavior. The data scientist brings with her not just a set of machine learning tools but her humanity, to interpret and find meaning in data and to make ethical, data-driven decisions. Dalessandro mentioned our humanity in the context of knowing our own limitations.
3. Causation or Causality, Correlation and Experiments Just as Little Data remains important, so do the classical statistical concepts of causation, correlation and experiments. Experiments and causal inference (e.g., propensity score modeling) are important parts of an engineer’s and statistician’s toolkit at Google. We’ll be exploring these more in November.
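To see why correlation alone misleads on observational data, here is a toy simulation with invented numbers (not Google data). A confounder drives both treatment uptake and the outcome, so the naive treated-vs-untreated difference overstates the true effect; stratifying on the confounder recovers it. Stratification is the simplest case of propensity-score adjustment, since here the propensity score depends only on that one variable:

```python
import random

rng = random.Random(42)

# Simulated observational data: "power users" both adopt the feature more
# often AND convert more often, confounding the naive comparison.
data = []
for _ in range(20000):
    power_user = rng.random() < 0.5
    p_treat = 0.8 if power_user else 0.2       # propensity to be treated
    treated = rng.random() < p_treat
    base = 0.6 if power_user else 0.3          # baseline conversion rate
    outcome = rng.random() < base + (0.1 if treated else 0.0)  # true effect = 0.1
    data.append((power_user, treated, outcome))

def mean_outcome(rows):
    return sum(o for _, _, o in rows) / len(rows)

# Naive comparison: correlation between treatment and outcome, biased upward.
naive = (mean_outcome([r for r in data if r[1]])
         - mean_outcome([r for r in data if not r[1]]))

# Stratified estimate: compare within each confounder stratum, then
# average the per-stratum effects weighted by stratum size.
adjusted = 0.0
for c in (True, False):
    stratum = [r for r in data if r[0] == c]
    effect = (mean_outcome([r for r in stratum if r[1]])
              - mean_outcome([r for r in stratum if not r[1]]))
    adjusted += effect * len(stratum) / len(data)
```

The naive estimate lands well above the true effect of 0.1, while the stratified estimate lands close to it; real propensity score modeling generalizes this idea to many confounders at once.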
2. Feedback Loop The data generated by user behavior become the building blocks of data products, which are simultaneously used by users and influence user behavior. We see this in recommendation systems, ranking algorithms, friend suggestions, etc. And we will see it increasingly across sectors including education, finance, retail and health. Cathy O’Neil described this feedback loop beautifully in class. Keep the financial meltdown in mind as a cautionary example.
1. Causing the Future Prediction and Causation are two important themes in statistics, machine learning and data science. Much is made of Predicting the Future (see Nate Silver), Predicting the Present (see Hal Varian), and exploring causal relationships in observed data from the Past (see Sinan Aral). The next logical concept, then, is models and algorithms capable not only of Predicting the Future but also of Causing the Future.