Each Tuesday, Eurry Kim, a student in our class, will pick one example of data visualization to share with us.
I decided to go with something a little different for the viz of the week this week. A serendipitous posting from Flowing Data coincided with our second homework assignment of text mining:
I’m going to start with Eurry’s email, and use that to point out ten things.
(1) Flowing Data is an important visualization reference: Eurry got the original reference from flowingdata.com, which, for those of you new to data visualization, is a blog by Nathan Yau, a UCLA student who wrote the book Visualize This, which is on our syllabus. There’s a ton of useful stuff on Flowing Data, so I encourage you to spend time browsing it. If you’re feeling a bit lost about visualization, start with How to visualize and compare distributions. There are a bunch of other tutorials there that I would also recommend spending time with.
(2) Visualizing Conditional Probabilities: Flowing Data links to the Scientific American article, which then links to Robert Simpson’s blog, orbitingfrog.com, which shows a visualization of words appearing together in astronomy journals. More accurately, it visualizes P(word 1 appears | word 2 appears), which is empirically estimated as (# of times word 1 and 2 both appear)/(# of times word 2 appears).
He calls these “correlations” but they’re not Pearson correlations. Note that the heat map is not symmetric because P(A|B) is not equal to P(B|A).
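To make the counting concrete, here is a minimal sketch in Python of the estimate described in (2). It assumes that "appears" means "appears at least once in a document", and it uses a tiny made-up corpus; the function name conditional_probabilities and the example documents are illustrative only, not Simpson's code or data.

```python
from itertools import permutations

def conditional_probabilities(documents, words):
    """Estimate P(w1 appears | w2 appears) for every ordered pair of words.

    Assumption: a word "appears" if it occurs at least once in a document.
    """
    doc_sets = [set(doc.lower().split()) for doc in documents]
    # Number of documents in which each word appears.
    count = {w: sum(w in d for d in doc_sets) for w in words}
    probs = {}
    for w1, w2 in permutations(words, 2):
        # Number of documents in which both words appear.
        both = sum(w1 in d and w2 in d for d in doc_sets)
        if count[w2] > 0:  # skip pairs where the conditioning word never appears
            probs[(w1, w2)] = both / count[w2]
    return probs

# Made-up corpus, for illustration only.
docs = [
    "galaxy cluster redshift",
    "galaxy redshift",
    "galaxy star",
    "star planet",
]
p = conditional_probabilities(docs, ["galaxy", "redshift"])
print(f"{p[('galaxy', 'redshift')]:.2f}")  # P(galaxy | redshift) = 2/2 -> 1.00
print(f"{p[('redshift', 'galaxy')]:.2f}")  # P(redshift | galaxy) = 2/3 -> 0.67
```

The two numbers differ, which is exactly why the heat map is not symmetric.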
(3) Heat Maps: Heat maps can be a good way to visualize Pearson correlations, covariances, co-occurrences or, in this case, conditional probabilities. In all cases, the heat map is the visualization of a matrix.
(4) Scaling and Normalization: One needs to be careful of issues of scale. For example, word 1 might occur only once, and that one time word 2 also appeared, so (# of times word 1 and 2 appear together)/(# of times word 1 appears) = 1/1 = 100%. Compare this to word 1 appearing 100 times, with word 1 and word 2 appearing together all 100 times: 100/100 = 100%. Both of these would appear the same in the matrix and visualization, even though the strength of the relationship is not necessarily the same: the first rests on a single observation, the second on a hundred.
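One hedged way to keep this in view is to carry the denominator along with the ratio, so that a 1/1 and a 100/100 can be told apart. The sketch below is only an illustration; the helper name ratio_with_support and the cutoff MIN_SUPPORT are hypothetical choices, not something taken from Simpson's tool.

```python
def ratio_with_support(n_both, n_given):
    """Return the estimated conditional probability and the count it is based on."""
    return n_both / n_given, n_given

MIN_SUPPORT = 10  # hypothetical cutoff; the right value depends on the data

rare = ratio_with_support(1, 1)        # word 1 seen once, with word 2
common = ratio_with_support(100, 100)  # word 1 seen 100 times, always with word 2

for name, (prob, support) in [("rare", rare), ("common", common)]:
    flag = "ok" if support >= MIN_SUPPORT else "too few observations"
    print(f"{name}: {prob:.0%} based on {support} occurrence(s) ({flag})")
```

Both ratios come out to 100%, but only the second passes the (assumed) support threshold. In a heat map, one could grey out or drop the cells that fail it.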
(5) Color and shading: He used color and shading to convey information: color as a way to visually cluster “similar” words (which he knew from domain expertise), and shading to capture the strength of the relationship. The color and shading help him find patterns, but his choice of color also imposes his existing beliefs about the relationships. One needs to be careful about this.
(6) Order: One also needs to be careful about the order in which one puts the words. If we rearranged the order of the words, our perception of the relationships might change. Clusters can be artifacts of the chosen (perhaps arbitrary) order.
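A small sketch of the ordering point, assuming the heat map is stored as a square matrix with the same word list on rows and columns. The numbers below are invented purely for illustration, and reorder is a hypothetical helper; the point is that permuting rows and columns together changes which cells sit next to each other without changing any value.

```python
def reorder(matrix, labels, new_order):
    """Permute rows and columns of a square matrix to follow new_order."""
    idx = [labels.index(w) for w in new_order]
    return [[matrix[i][j] for j in idx] for i in idx]

labels = ["galaxy", "planet", "redshift", "star"]  # alphabetical order
# Invented values: entry [i][j] stands for P(labels[i] | labels[j]).
matrix = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.0, 0.8],
    [0.6, 0.0, 1.0, 0.1],
    [0.3, 0.7, 0.1, 1.0],
]

new_order = ["galaxy", "redshift", "planet", "star"]
reordered = reorder(matrix, labels, new_order)

for word, row in zip(new_order, reordered):
    print(f"{word:>8}", row)
```

In alphabetical order the large values are scattered; in the second order they gather into two blocks along the diagonal, which the eye reads as clusters. The data are identical in both cases.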
(7) Word selection: A human chose which words to include. Could or should this have been automated? Does it introduce bias or is domain expertise sufficient? Are there relationships we might miss?
(8) Exploratory Data Analysis for Scientific Discovery. What Simpson is doing when looking at a single heat map is what I would call exploratory data analysis. He wants to understand the relationships between words in the hope that this helps him generate hypotheses. (I’m not entirely sure what the nature of these hypotheses is, are you?) Scientific American calls it a “hypothesis generator”; Yau calls it “Visualization for Interactive Exploration”. This is because Simpson automated it, more or less, and created a tool that could load different data sets. (Building a data product?) It’s worth reading the Scientific American article for more insight on this concept.
(9) Counting. The underlying math is simple. It involves counting and dividing. You can get pretty far with counting and dividing.
(10) Sets and their intersections. Conditional probabilities involve counting the number of elements in sets and their intersections. I bring this up because right after I got Eurry’s email, I left work and was on the subway heading home, thinking about visualization, when I noticed this ad on the train, so I snapped a photo. This Venn Diagram bothers me. It makes no sense! Look at this and think about what information it conveys: