This is the version published at the beginning of the semester. Logistics and content may be modified slightly throughout the semester; the most current information can be viewed on the About the Class and the Guest Speakers pages.
Introduction to Data Science (W4242)
Professor: Dr. Rachel Schutt
Location: 503 Hamilton Hall
Lab Instructor: Jared Lander
Labs: Mondays 6:10-7:25pm
Location: 503 Hamilton Hall
Teaching Assistant: Benjamin Reddy
Problem Sessions: Thursdays 7:45-9:15pm
Office Hours: Tuesdays, time TBD
Location: Statistics Department, 1255 Amsterdam Ave, School of Social Work Building, 10th floor
Prerequisites: Some linear algebra, prior exposure to probability and statistics, and some programming experience are ideal.
Goals of the Course:
1) Learn about what it’s like to be a data scientist
2) Be able to do some of what a data scientist does
I’ll teach the first two weeks to build up sufficient background and foundation. After that, each class will be divided into two parts: (1) review of previous material and introduction of any new material necessary to understand the guest lecture, and (2) a guest lecturer teaching new algorithms, methods, or models, giving case studies, showing their actual code, and describing their role as a data scientist, emphasizing the course themes.
Course Themes: machine learning and data mining algorithms; statistical models and methods; prediction vs. description; exploratory data analysis; communication; visualization; data processing, munging, and engineering; big data; coding; ethics; asking good questions
Course Schedule and Topics
September 5: Introduction: What is Data Science?, Getting started with R, Exploratory Data Analysis, Review of probability and probability distributions, Bayes Rule
September 12: Supervised learning, regression, polynomial regression, local regression, k-nearest neighbors
September 19: Unsupervised Learning, Kernel density estimation, k-means, Naive Bayes, Data and Data Scraping (Guest Lecturer: Jake Hofman, Microsoft Research)
September 26: Classification, ranking, logistic regression (Guest Lecturer: Brian Dalessandro, Media 6 Degrees)
October 3: Ethics, time series, advanced regression, finance (Guest Lecturer: Cathy O’Neil)
October 10: Decision trees, best practices, feature selection (Guest Lecturer: William Cukierski, Kaggle). Kaggle competition (final project) announced.
Applying data science in a hybrid research environment (Guest Lecturer: David Huffaker, Google)
October 17: Recommendation engines, dimensionality reduction, indexing large-scale data, and implementing / optimizing machine learning algorithms. (Guest Lecturer: Matt Gattis, eBay)
October 24: Data visualization, data journalism, dashboards? (Guest Lecturer: Mark Hansen, Columbia)
October 31: Social network analysis (Guest Lecturer: John Kelly, Morningside Analytics)
November 7: Sampling, Stratification, Experimental design, pharma (Guest Lecturer: David Madigan, Columbia)
November 14: Observational causal modeling (Guest Lecturer: Ori Stitelman, Media 6 Degrees)
November 19*: Sampling, data leakage, data incest (Guest Lecturer: Claudia Perlich, Media 6 Degrees)
*Scheduled for Monday because Wednesday, November 21 is the evening before Thanksgiving
November 28: Data engineering, sharding, Hadoop, MapReduce, and protocol buffers (Guest Lecturer: Josh Wills, Cloudera)
December 5: Data engineering (Guest Lecturer: David Crawshaw, Google)
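Bayes' Rule appears in the very first session; as a small taste of the kind of computation the course covers, here is a minimal Python sketch (the course uses R and Python; the diagnostic-test numbers below are hypothetical, chosen only for illustration):

```python
# Bayes' Rule: P(H | E) = P(E | H) * P(H) / P(E)
# Hypothetical example: probability of having a disease given a
# positive test, from the test's sensitivity and specificity.

def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' Rule."""
    # P(positive) by the law of total probability
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# With a 1% prior, even a fairly accurate test yields a modest posterior:
print(round(posterior(prior=0.01, sensitivity=0.99, specificity=0.95), 3))  # 0.167
```

The counterintuitive result (a 99%-sensitive test but only a ~17% posterior) is exactly the kind of reasoning the probability review is meant to make routine.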
Recommended Texts and Readings
As this is an emerging field, there is no single good textbook for it yet.
I will be drawing from some of the following texts:
Data Mining and Machine Learning:
The Elements of Statistical Learning: Data Mining, Inference and Prediction, Trevor Hastie, et al.
Pattern Recognition and Machine Learning, Christopher Bishop
Bayesian Reasoning and Machine Learning, David Barber
Programming Collective Intelligence, Toby Segaran
Data Mining with R: Learning with Case Studies, Luis Torgo
Data Mining: Practical Machine Learning Tools and Techniques, Ian H. Witten, et al.
Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
Introduction to Machine Learning (Adaptive Computation and Machine Learning), Ethem Alpaydin
R in a Nutshell: A Desktop Quick Reference, Joseph Adler
Learning Python (O’Reilly), Mark Lutz and David Ascher
The Art of R Programming: A Tour of Statistical Software Design, Norman Matloff
Hadoop: The Definitive Guide, Tom White
The Elements of Graphing Data, William Cleveland
Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Nathan Yau
Statistics for Experimenters: Design, Innovation, and Discovery, George E. P. Box, et al.
A First Course in Probability or Introduction to Probability Models, Sheldon Ross
Course Requirements and Grading
Homework Assignments (40%)
Final Project (40%)
Final In Class Exam (15%)
Attendance / Participation (5%)
You are encouraged to discuss problems with other people, but the write-up and code must be your own. Please include a copy of your code, formatted in Courier font. No late assignments will be accepted.
The final project will be a Kaggle-style competition; you will form teams and work together. The competition will be announced October 10 and the deadline will be in December. More details to come in October, but feel free to check out Kaggle in the meantime.
A Note on Programming Languages
Most of my instruction will involve either R or Python. Guest lecturers may give examples using different languages but will explain what the code means. Homework assignments will generally require R or Python. If you feel you can complete them successfully in a different language, you may, but we won't necessarily be able to help you if you get stuck.