Exploring the Data Science Universe

Dear Students,
We’ve now had six weeks of blog posts, guest lectures, labs and homework assignments that have brought up a vast number of topics and issues across multiple dimensions, together covering some subspace of Data Science.

Finding your own way of understanding Data Science
I hope you are finding your own ways of figuring out how these all fit together, and organizing it all in your minds. Perhaps now it feels like an overwhelmingly large space of ideas. I’m sure after the course is over, certain topics and themes will emerge for you as more important than others. I’ll use this post to map out the universe/time-space I think we’ve explored so far, and identify ten important concepts (Why ten? Because we have ten fingers).

Dimensions of the Data Science Universe
Here’s one way to organize the course topics and ideas we’ve explored so far. Think of the different dimensions of that space, and then think of examples or a range of possible values within each dimension. Let me do that here:

Types of Data:

  • time-stamped event logs
  • text or content (essays, email, articles)
  • graph/network (nodes & edges; user-user; bipartite: users & movies, users & content)
  • time series (e.g. returns)
  • user-level (e.g. NYT ad click data)

Size of Data Set:

  • David Huffaker’s surveys of hundreds of users
  • millions of GetGlue event logs
  • data sharded into multiple pieces (NYT homework 1)

Algorithms and Models:

  • k-nearest neighbors (week 2)
  • Regression (week 2 and 5)
  • Naive Bayes (week 3)
  • k-means (week 4)
  • Logistic regression (week 4)
  • Decision Trees (week 6)
  • Random Forests (week 6)

Machine Learning Concepts:

  • over-fitting
  • bias-variance tradeoff
  • cost or loss functions
  • training set, test set
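
To make the training set/test set and over-fitting ideas concrete, here is a minimal sketch in plain Python. It is not taken from the course labs; the synthetic data, the 20% label noise, the 150/50 split and the helper names (`make_data`, `knn_predict`, `accuracy`) are all assumptions made for illustration. It fits k-nearest neighbors with k=1 and k=15 and compares accuracy on the training set with accuracy on held-out test data.

```python
import random


def make_data(n, rng, noise=0.2):
    """Points on [0, 1) labelled 1 if x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:  # hypothetical noise level, chosen for illustration
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k should be odd)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return int(2 * votes > k)


def accuracy(train, data, k):
    """Fraction of points in `data` whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(200, rng)
train, test = data[:150], data[150:]  # hold out 50 points as the test set

results = {}
for k in (1, 15):
    train_acc = accuracy(train, train, k)
    test_acc = accuracy(train, test, k)
    results[k] = (train_acc, test_acc)
    print(f"k={k}: train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect by construction, noise included. That is why the score on the held-out test set, not the training score, is the one to look at when judging whether a model has over-fit.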

Goals:

  • Prediction
  • Classification
  • Establishing causal relationships
  • Recommendation/ranking
  • Data-driven decision-making

Domains:

  • On-line advertising (Brian, md6)
  • Finance (Cathy)
  • Education (Will, Kaggle; the theme of the final project)
  • Entertainment (tv & movies, GetGlue)
  • On-line social networks (Google+, GetGlue)
  • Real estate market (Real Direct, NY Housing market data)
  • Museums (Data Science of Art post)
  • Astronomy (weekly data viz #3)
  • Olympics (Jake’s study)
  • Content classification (Spam filter + NYT article classification)

Data Products:

  • Spam classifier
  • Personalized tv show recommendations
  • Algorithm for trading stock
  • NYT article classifier
  • Google+ circles
  • Google+ privacy settings, …

Ideas and Concepts:
This is a space of its own that exists in our minds. I’m making it into its own post: 10 important ideas. Some I’ve been thinking about since before I proposed the course, and others have emerged as important through listening to the guest lectures and having conversations with guest lecturers, friends and you.

Yours, Rachel
