I received the following questions from students:
(1) How is this class different from the Machine Learning and Data Mining class?
Good question. I can understand why you ask this, given the appearance of many of the same books on the syllabus. So let me give you some insight into how I developed this course. When I proposed this class earlier in the summer, I faced the issue of there being no well-defined data science curriculum (by which I mean a textbook). This reflects the fact that data science is new, and not necessarily a well-defined term or field, as we discussed in class. I gathered together some of the best data scientists I know (who are among the speakers for this class) and brainstormed topics and themes we wanted addressed in the course. We needed some underlying structure, and tools/models/methods seemed like a decent backbone. We also agreed that the other threads were equally important in our experience: data visualization, good communication, asking good questions, coding, the messiness of getting data into good shape, real-world context (case studies), etc. [I by no means claim we were the first group of people to be having these conversations, but in our experience, these were things we wish students were taught.]
So for this first iteration of the course, ML/DM seems like a useful organizing principle. But the other themes of the course are equally important and will be a large component of the class. Each guest speaker will focus on a specific ML/DM topic and will use that as a jumping-off point to address all the other course themes.
In addition to machine learning and data mining, we have components of classical statistics (exploratory data analysis, experiments and observational causal modeling), updated to modern massive data sets; and social network analysis and data journalism/visualization, which are non-standard topics.
An important aspect is that we will be hearing many different lecturers’ perspectives on what data science is and how it’s done in practice. We’ll explore these various angles, and see whether it is possible to teach and learn data science in the classroom setting. One of the features of data science is that it’s learned by doing, and it’s done under messy circumstances. Can we simulate that experience in the relatively sterile environment of a classroom? Can we provide sufficient structure to learn while also instilling in you the sense of resourcefulness and creativity required to be a good data scientist?
(2) Why did you put data processing and engineering at the end of the class? Doesn’t that happen before everything else?
Yes, data processing and engineering does need to be done to produce data and process massive amounts of data before you can do anything with that data. I come to data science as a statistician, so I came to learn about the analysis of data before learning about processing and munging data at a large scale. In school, I learned statistics, and then later at work, I learned the engineering side of things by working closely with engineers. We’ll meet some of those engineers later on in the semester.
“I learned it that way, so you should too” is not a strong basis for developing a course. A more valid one is that I think it’s a pedagogically sound way to do things. I can’t throw everything at you at once: messy data sets, large data sets, new models and methods, and a new programming language (R). So we’ll start with cleaner data sets and move towards messier ones. We’ll learn the algorithms, models and methods on smaller data sets and then figure out how to do them at scale. We’ll have a lab running in parallel to build you up with the coding and programming language (taught by Jared) and we’ll have our TA, Ben, supporting you with the underlying math and statistics.
In addition, one shouldn’t be bothered with the implementation level before being very familiar with the concept and the logic behind it. The how-to comes afterwards.
(3) Will this help me get a job as a data scientist?
I don’t know if it will help you get a job, but I hope it will help you do such a job well. I certainly designed this course to be useful, but this is my first time implementing it, so the outcome is unknown. It’s an experiment! While I understand the impulse to be career-focused, I do hope it will be an interesting intellectual experience for you.
(4) You brought up examples in class that were about the internet and tech companies. What about financial engineering and pharma? Will this be relevant if I want to go into those industries?
Knowing how to work with data will be relevant across sectors, and we will explore that in this course.
The type of data being generated on the internet and within tech companies is a new, exciting type of data and opens up a lot of possibilities for us to understand human behavior. DJ Patil and Jeff Hammerbacher understood the novelty of the data itself being the building blocks of tech companies such as Google, Facebook, Amazon and LinkedIn. This is a deep insight. While data may be analyzed to inform product development in the pharmaceutical industry, the data itself isn’t an ingredient that goes into the drugs.
Given that this is the cultural context in which we find ourselves, our approach will be to begin here in our exploration of Data Science, and then expand to other applications in the non-profit world, genomics, finance, etc. Our goal by the end of the course will be an expansive definition of Data Science across sectors.
(5) I don’t really get inspired by tech companies or the internet. I care about completely different kinds of data. Should I still take the class?
First, see (4). Second, internet companies and the internet are interesting because we quickly reach the boundaries and end up at interesting research problems across many disciplines. Third, we’ll use a variety of data sets, and you’ll build some of your own.
(6) But one of the homework questions was about coming up with a data strategy for an internet company. I don’t care about that. Should I still take the class?
No one’s forcing you to take the course! First, this is one example of being on the frontier. Second, you need to know where your data comes from. But more generally, the homework assignments will vary a lot, and will be designed to build up many of your skill sets at once. If you find aspects of the course hard, but interesting, you should still take the class. If you think it’s boring, then don’t take the class.