I spent some of yesterday morning at Strata NYC, the big industry event on big data. First, to be clear, this is not an academic conference. But I do find it interesting as a sociological phenomenon, and as a way to understand how people are talking about Big Data. A lot of the emphasis was on enterprise software and how to use Hadoop, which is natural given this was Strata and “Hadoop World”. There were also talks by some NYC data scientists including Claudia Perlich, who is coming to our class in a few weeks. So some of the talks align more naturally with the emphasis of this class, and others were more about how to handle data on an enterprise scale, which is not our concern at the moment. The keynotes yesterday reflected this mixture. Let me just call out three of the more relevant ones.
Since many of you have probably not been to a conference before, and especially not an industry conference, you should picture these keynotes being held in a big ballroom with a stage. I was told there were about 3000 people in the room. This wouldn’t happen at an academic conference: there were actually flashing lights and loud music as they introduced the speakers, including our own Cathy O’Neil! (Just can’t imagine them playing “Empire State of Mind” during a math conference.) Each speaker only had 10 minutes.
Joseph Hellerstein: The talk that got my attention the most was from Berkeley professor Joseph Hellerstein, who was talking about cleaning data, something we’ve chatted about in this course a lot. He quoted DJ Patil as saying he spent 80% of his time processing and cleaning data, and then suggested that DJ was over-qualified to be spending his time doing this and that it should be automated. His analogy was the washing machine: clothes washing used to be done by the river and was time-consuming (and mostly done by women, he pointed out), and now we have washing machines, which are automatic and much less time-consuming. I like this analogy in retrospect. In the moment, it wasn’t all sinking in. Then he mentioned the launch of a new company of his, with a really impressive board, that will be focused on automating these “mundane” tasks.
As a related aside, Tamraparni Dasu at AT&T Labs has some papers on data cleaning and the importance of preserving the statistical properties of the data set, which is hard to do when “cleaning”. I want to spend more time reading these. This is an example of something being “not sexy” and therefore getting much less attention, when it’s actually an important part of the process with many interesting research problems embedded in it. She also has a book, Exploratory Data Mining and Data Cleaning. (She wasn’t a speaker, but I’m bringing her work up to demonstrate there is precedent for academic rigor in this area.)
Cathy O’Neil: This was set up as a “fireside chat” (with no fireplace, they joked) with Julie Steele, an O’Reilly editor who co-authored a recent book on data visualization, Designing Data Visualizations. Their chat was about the difficulties of teaching data science in a university setting because of existing constraints around credentialing, interdisciplinary collaboration across departments, and internal politics. I saw several people come up to Cathy afterwards to tell her how much they loved her talk. She has important ideas around this.
Samantha Ravich: This was interesting in that it connected data to policy, which is something we’ve chatted about in class. She is a very senior security expert and was in the White House during the Bush Administration making quick, critical decisions about foreign policy and security issues. She mentioned Kahneman’s book, “Thinking, Fast and Slow” (which I’m reading now, and Eurry also mentioned the other day). She brought this up to explain that when experts make decisions, what some people call “intuition” is actually the ability to very quickly sift through information. And apparently it’s getting difficult for policy makers to do this in the age of the data deluge.
She used King Midas as an analogy. He was given the gift of turning anything he touched into gold, and eventually killed his daughter by mistake. And then someone else was given the gift of monetizing it. There can be too much of a good thing, including data, she claims. Apparently in the Situation Room in the White House, too much data led to some poor decisions being made over poppies (think opium). I agree with the premise that we need to find ways for policy makers to understand the data, and to help them make decisions without getting overwhelmed in the face of too much of it.