Each week Ethan Rouen, a student in the class, will post on a topic of his interest based on class lectures. Ethan is a Ph.D. student in accounting at Columbia Business School and a columnist for Fortune.com.
With Andy Lehren and Steve Lohr speaking to the class this week, I finally have the opportunity to write about a topic I know a bit about (Arg! I’m still stuck in the trees with Boruta).
Both Times reporters offered some fascinating insights into how data is shaping journalism and our world, but Andy’s talk offered me an aha moment:
A journalism professor once told me that the best journalists are those who love reporting. You can write with enough style to impress E.B. White, but if you don’t have the meat behind that writing, the article is worthless.
Andy clearly exemplified this practice. The stories he shared with us weren’t about flowery language or flashy graphics. They were about good old-fashioned shoe leather (or whatever the digital version of shoe leather is; maybe number of MacBooks destroyed). It’s easy to wonder how many people cheat in the New York City Marathon, but without the data provided by New York Road Runners, he would have been left wondering instead of writing a front-page story about dozens of cheaters. Timothy Thomas would have been just a career criminal who got shot running from cops if traffic ticket data hadn’t been available so Andy could uncover that all his offenses were minor.
As we trudge through Kaggle and the start of our projects, it is becoming apparent that reporting may also be the most important aspect of data science and, in part, is what separates data scientists from statisticians or field-specific researchers.
Our current assignment to visualize the Kaggle data drives that point home.
Until this week, my Kaggle strategy was just to log, multiply, divide and pray. For loops let me throw more and more at my model with smaller and smaller returns. But up until I had to think about visualizing the data, I hadn’t really thought at all about what that data was telling me. In order to use the data effectively, we need to interview it just as a reporter peppers a source with questions. Don’t get the right answer? Perhaps you need to ask the question differently.
Asking questions and developing tests are important steps in the process of doing data science, but the data that we are able to collect forces us to hone our questions and structure our tests. We’ve been warned that the majority of a data scientist’s job is obtaining clean data (oh UFO data, how I miss you not), but as I listened to Andy speak about searching through shipping import lists and FOI’ing traffic tickets, it seemed that, for a lot of us, it will also be the most important part of the job.
This collection process brings us even closer to journalism. We are lucky to be working at a time when massive amounts of data are available and easy to find (for a story I did in 2006, a police department charged me 50 cents/page for a 2,000-page document… and they refused to give me a digital copy!).
But this opportunity comes with great responsibility. Chances are, the answers to our questions are out there. Maybe they are available on data.gov, but maybe not. Maybe we have to cold call strangers and beg for the data or become fluent in the FOIA and go head-to-head with government lawyers. Whatever our jobs, as Andy made clear, the most successful among us will be eager to step away from our computers and think bravely and creatively about how to track down the clues that will answer our questions.