Exploratory Data Analysis (EDA) is often relegated to Chapter 1 (by which I mean the “easiest” and lowest-level chapter) of standard introductory statistics textbooks and then forgotten about for the rest of the book. Notable examples of textbooks used in statistics curricula that do embrace EDA are Andrew Gelman’s books (which are by no means introductory). I was privileged to have Andrew as my thesis advisor, so he’s been a tremendous influence on my approach to practicing statistics and working with data. I am also fortunate to now work alongside two former Bell Labs/AT&T statisticians, Daryl Pregibon and Diane Lambert, who are in the same vein of applied statistics, and I’ve learned from them to make EDA part of my best practices.
Yes, even with very large Google-scale data, we do EDA. In an internet/engineering company, EDA is done for some of the same reasons it’s done with smaller data sets, but there are additional reasons to do it with data that has been generated from logs.
Standard reasons *anyone* working with data should do EDA:
(1) Gain intuition about the data
(2) Make comparisons between distributions
(3) Sanity check: make sure the data is on the scale you think it is and in the format you think it should be
(4) Find out where data is missing or where there are outliers
(5) Summarize the data
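As a minimal sketch of reasons (3) through (5), the snippet below counts missing values, computes a five-number summary, and flags outliers for a single column. It assumes the 1.5 × IQR rule for outliers (commonly attributed to Tukey) as one reasonable choice among many. The function name `summarize` and the session durations are hypothetical and made up for illustration.

```python
import math
import statistics


def summarize(values):
    """Basic EDA summary of one numeric column that may contain None or NaN."""
    # Treat None and NaN as missing; everything else is kept as-is.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    # statistics.quantiles needs at least two data points.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "mean": statistics.mean(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical session durations in seconds; one is missing, one is suspicious.
durations = [12, 15, 14, 13, None, 16, 15, 14, 900, 13]
summary = summarize(durations)
for name, value in summary.items():
    print(f"{name:>10}: {value}")
```

With this made-up data the single value of 900 drags the mean to roughly 112 while the median stays at 14, which is exactly the kind of discrepancy that should send you back to look at the raw rows (is 900 in milliseconds? a stuck session?).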
In the context of data generated from logs (e.g. internet-type data), EDA also helps with:
(1) Debugging the logging process. (“Patterns” you find in the data could actually be something wrong in the logging process that needs to be fixed. If you never go to the trouble of debugging, you’ll continue to think your patterns are real.) The engineers I’ve worked with are always grateful for help in this area.
(2) Making sure the product is performing as intended
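One simple way to start debugging a logging process is to count events per hour and look for hours with no events at all. The sketch below does that under strong assumptions: the timestamps are hypothetical, the helper names `hourly_counts` and `empty_hours` are invented for illustration, and a silent hour is only a hint that logging may have dropped out, not proof.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(timestamps):
    """Count events per clock hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def empty_hours(counts):
    """Return every hour between the first and last observed hour with no events."""
    if not counts:
        return []
    hour, last = min(counts), max(counts)
    gaps = []
    while hour <= last:
        if counts.get(hour, 0) == 0:
            gaps.append(hour)
        hour += timedelta(hours=1)
    return gaps


# Hypothetical event timestamps parsed from a log.
events = [
    datetime(2024, 1, 1, 9, 5),
    datetime(2024, 1, 1, 9, 40),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 1, 12, 30),
    datetime(2024, 1, 1, 12, 45),
]
counts = hourly_counts(events)
gaps = empty_hours(counts)
for hour in gaps:
    print("no events logged during the hour starting", hour.isoformat())
```

In this toy data the 11:00 hour is empty. Whether that reflects real user behavior or a logging outage is a question for the engineers who own the logs, which is the point of doing this check before modeling anything.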
Exploratory Data Analysis is distinct from Data Visualization in that EDA is done towards the beginning of an analysis, while data visualization is done towards the end to communicate one’s findings. I personally find EDA relaxing because it’s ok to make mistakes and it’s just me and the data. “Long before worrying about how to convince others, you first have to understand what’s happening yourself.” (Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, p. 551)
With EDA, you can also use the understanding you gain to inform and improve the development of algorithms. I gave specific examples of this in class. Plotting data and making comparisons can get you extremely far, and doing so is far better than getting a data set and immediately running a regression just because you know how.
The father of Exploratory Data Analysis, John Tukey, also ultimately influenced the development of S (at Bell Labs), the language from which R descends. R is now the preferred programming language of many statisticians.
(Thanks to Chris Wiggins for recent conversations about this.)
Some references to understand best practices and historical context:
(1) Exploratory Data Analysis, John Tukey, 1977
(2) The Visual Display of Quantitative Information, Edward Tufte, 1983
(3) The Elements of Graphing Data, William S. Cleveland, 1994
(4) Statistical graphics for research and presentation, Appendix of Data Analysis Using Regression and Multilevel/Hierarchical Models, Andrew Gelman and Jennifer Hill, 2007
(5) Exploratory Data Analysis for Complex Models, Andrew Gelman, American Statistical Association, 2004