Each week Ethan Rouen, a student in the class, will post on a topic of his interest based on class lectures. Ethan is a Ph.D. student in accounting at Columbia Business School and a columnist for Fortune.com.
Props to the professors for the timing of the Kaggle competition announcement on Wednesday night. I’m sure I wasn’t the only one who was so beaten down by homework 1 that the challenge of this project didn’t quite settle in until I woke up the next morning and realized that none of the last week was a dream (because in order for it to have been a dream, I would have had to sleep).
After terrifying us with the description and competitive nature of the project, Kaggle-er Will Cukierski threw us a delicious lifeline. At first, it sounded like there was an R package called Burrata, named after the most delicious cheese in the world.
But Boruta, named after a forest demon from Slavic mythology, may be even tastier!
As Will stressed in class, feature selection (choosing independent variables) is one of the most important parts of the data science process. According to Kursa and Rudnicki, who brought us Boruta from on high, “modern data sets are often described with far too many variables for practical model building.”
The obvious disadvantage to throwing everything at your dependent variable is that more features means more computing power means swirling rainbow wheel of death. Even worse, the stats wizards have found that too many variables may actually decrease the accuracy of your algorithms.
There has been a lot of focus on developing “minimal-optimal” algorithms that find a small feature set giving the best results.
This strategy, though, may leave out attributes that are still relevant to our problem. Finding every relevant feature, not just a minimal set, is known as the all-relevant problem.
Enter the Slavic demon of the forest.
Boruta, created by Kursa and Rudnicki, attempts to identify all features that are relevant to the problem at hand, regardless of redundancy, giving us better insight into the problem itself and into which factors can play a role in solving it.
So how does Boruta work, and does it involve leaving an offering of fresh Italian cheese on the altar of CRAN?
The package uses a random forests (get it? forest demon) approach to identifying relevant variables. A random forest builds many decision trees, each trained on a bootstrap sample of the data drawn with replacement. At each node of each tree, only a random subset of the features (the same number of features each time, but not necessarily the same features) is considered for the split, and a feature's importance is estimated by averaging its contribution across all the trees. (For more information in layperson's terms on random forests, check out this great post.)
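To make those two sources of randomness concrete, here's a minimal sketch in Python (the package itself is in R; the toy data and numbers below are made up for illustration): each tree gets a bootstrap sample of rows drawn with replacement, and each split only gets to choose among a random handful of features.

```python
import random

random.seed(0)

# Toy dataset: 8 rows, 4 features (columns).
rows = list(range(8))
features = ["f1", "f2", "f3", "f4"]

n_trees = 3
m = 2  # number of candidate features considered at each split

for t in range(n_trees):
    # Bootstrap: sample rows WITH replacement for this tree.
    sample = [random.choice(rows) for _ in rows]
    # At a given node, only a random subset of features competes for the split.
    candidates = random.sample(features, m)
    print(f"tree {t}: rows={sample}, split candidates={candidates}")
```

Because each tree sees different rows and each split sees different features, no single strong variable can dominate every tree, which is what makes the averaged importance scores informative.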
Boruta, which deliberately ignores redundancy (that's important for our analysis!), adds "shadow" features: shuffled copies of the real features which, by construction, carry no information about the response. A real feature is deemed relevant only if its importance beats that of the best-scoring shadow feature, which creates a boundary for feature acceptance.
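Here is a stripped-down sketch of that shadow-feature trick in Python. To keep it self-contained, absolute correlation with the target stands in for the random-forest importance that Boruta actually uses, and the data is a made-up toy example: copy each real feature, shuffle the copy so it carries no signal, and accept a real feature only if it outscores the best shadow.

```python
import random

random.seed(42)

def importance(feature, target):
    """Toy importance proxy: absolute Pearson correlation with the target.
    (Boruta itself uses random-forest importance scores instead.)"""
    n = len(feature)
    mx, my = sum(feature) / n, sum(target) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(feature, target))
    sx = sum((x - mx) ** 2 for x in feature) ** 0.5
    sy = sum((y - my) ** 2 for y in target) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

# Toy data: y tracks x1 closely; x2 is junk with no real pattern.
y  = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2]   # informative
x2 = [5, 1, 4, 2, 5, 1, 4, 2]                   # junk
real = {"x1": x1, "x2": x2}

# Shadow features: shuffled copies of the real ones, so any importance
# they show is pure chance.
shadows = {name: random.sample(vals, len(vals)) for name, vals in real.items()}

# The best shadow score sets the bar a real feature must clear.
threshold = max(importance(v, y) for v in shadows.values())
accepted = [name for name, vals in real.items()
            if importance(vals, y) > threshold]
print("shadow threshold:", round(threshold, 3))
print("accepted features:", accepted)
```

The shadows act as a built-in null model: instead of picking an arbitrary importance cutoff, the data itself tells you how important a truly irrelevant feature can look by luck alone.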
Sounds easier than deleting “>” from a messy dataset, no?