Archive FM

Data Skeptic

Let's Kill the Word Cloud

Duration:
15m
Broadcast on:
01 Jan 2016
Audio Format:
other

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.

(upbeat music) - Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism. (upbeat music) - Welcome to the first episode of Data Skeptic for 2016 and Happy New Year, everyone. As I mentioned in last week's episode, shows that are released on major holidays have notably lower downloads. I thought it would be a disservice to my upcoming guests to have them on for a day when I know the show will get less exposure. Yet, I also feel compelled to keep my weekly format. So going forward, on major holidays (Thanksgiving, Christmas, New Year's, and Arbor Day), I'll be releasing special episodes. A while ago, I did a special episode called Proposing Annoyance Mining, which featured me alone, standing on a soapbox, and which garnered some positive feedback. So I've decided to do something similar for this occasion, on a day when I presume only the truest of Data Skeptic fans are listening. So first off, thank you for continuing to join me week after week, and I hope this break in format keeps your attention. In fact, I'm actually glad that only my truest fans might be listening now, because I'm going to ask you to become conspirators with me. I'm going to propose a murder of sorts that I hope you'll help me execute. In fact, it's more than a murder. I'm actually hoping for an extinction-level event. Data Skeptic listeners, it's a new year, a time that we use culturally to turn over new leaves, try new things, and put bad habits behind us. So I implore you to join me in committing homicide this year. Let's kill the word cloud. Just in case any listeners aren't clear on what a word cloud is, I'll give a brief description. A word cloud is a visualization of text data that shows a jumble of words or phrases in various sizes, where the font size of each element indicates its relative importance. Generally, these are just a soup of terms in a roughly oval assemblage. So why do people make word clouds? 
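To make the description concrete, here is a minimal sketch of the kind of data a word cloud typically encodes: term frequencies normalized into relative weights, which the cloud then maps onto font size. This is an illustration, not the episode's show-notes code; Python and the frequency-based weighting are both assumptions.

```python
from collections import Counter
import re

def term_weights(text, top_n=5):
    """Count word frequencies and normalize them to (0, 1] weights --
    the underlying data a word cloud encodes as font size."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    peak = counts[0][1]  # most frequent term gets weight 1.0
    return [(word, count / peak) for word, count in counts]

# Tiny hypothetical corpus for illustration.
sample = "data viz data skeptic data word cloud word"
print(term_weights(sample, top_n=3))
```

Everything the cloud will show is already in this short list of (phrase, weight) pairs; the visualization question is how faithfully those weights reach the reader.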
I think the answer is perhaps because it's easy. There are simple, free tools to automate it, and admittedly, word clouds could be considered sort of pretty. But the truth is, they're really lazy. In my opinion, they convey the illusion of insight without providing any actual insight. I think they're one of the worst forms of data visualization, possibly even less useful than a 3D donut chart. Data Skeptic hasn't gotten too deep into data visualization topics. I'd like to recommend everyone check out the Data Stories podcast, which focuses on data viz. It's one of my favorite shows and complements Data Skeptic nicely. But let's take a moment and discuss at a high level what data visualization is all about. Data viz is the process of creating a graphical representation of data to convey a finding or communicate information more clearly and efficiently than presenting the data alone. Why would a data scientist create a data visualization? Broadly, I would say there are two reasons. First, to help them explore the data. Histograms and scatter plots quickly convey information that tabular data does not. When doing data exploration, knowing how to rapidly explore your data visually is the key to moving quickly. Second, data visualization is done to share a finding or insight with others. Thus, it's fair to say that the purpose of data visualization is to convey information more clearly, quickly, or efficiently. For a long time, I thought I understood what data viz was all about. Then I discovered Edward Tufte's The Visual Display of Quantitative Information, from which I learned that my understanding of data viz had only just begun. So how would I apply the things I learned from Tufte in criticism of word clouds? One of the first tenets put forward is: above all else, show the data. OK, you could show all the words in a document in a word cloud. But is that the data? No. 
The data being displayed are not just the selection of words and phrases, but also a weight, presumably indicating their relative importance, expressed via font size. There's a seminal work by Cleveland and McGill titled "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods." I think it might not be terribly controversial to declare that this paper is to data viz what Newton's Principia is to physics. In it, the authors describe an experiment in which they present data in a variety of forms, ask readers to report the data they viewed, and measure the difference between the actual data and the reported observations. You really should read this paper. But to summarize, the order from most to least effective that they identify was to present your data using position, then length, angle and slope, area, volume, and lastly, color. That's right, color is least effective of all. You hear that, heat maps? You get a pass for this episode, but you walk a fine line with me. So how is the data conveyed in a word cloud? Well, it's definitely not positional. There are two lengths present: height and width. Width is a product of the phrase and is therefore actually a bit misleading. Height correlates with, but does not necessarily match, the font size. So I suppose word clouds rely on length in the form of the height. But let's review the others to be sure. Next comes area. Yeah, I think area is also relevant, but due to the width issue, it could do as much to mislead as it does to inform. This ambiguity highlights another flaw: the use of non-monospace fonts, kerning, and even typographical thickness all convey an impression of weight that may or may not be appropriate, being a product of the label, not the data. Getting back to Tufte, he also suggests that one maximize the data-ink ratio. For every pixel that isn't part of the background, determine if it's providing the viewer with data or not. 
The number of data-conveying pixels divided by the total number of non-background pixels is the data-ink ratio, and one would like this to be as near 1.0 as possible. There are some brilliant breakdowns in The Visual Display of Quantitative Information in which Tufte strips away the chartjunk: elements that convey no data, and oftentimes only serve to distract, distort, or damage the data. Tufte goes on to suggest one erase non-data ink and redundant data ink. That box around the legend? Useless, get rid of it. The legend itself, can we get rid of it? Maybe we could take those text labels and just position them near the data they actually describe, removing the need for the legend altogether. Perhaps strategic positioning could further pull the reader's eye towards a specific aspect of a series you wish them to take notice of. How does a word cloud score on the data-ink ratio? Hmm, I guess by this metric it's fine. Every pixel does convey information. However, I would argue that while there might not be much non-data ink, there's a tremendous amount of redundant data ink. Every word appearing in large type does seem wasteful. But I suppose the truest criticisms aren't necessarily rooted in Tufte's practices. Instead, let's ask how effectively the reader of the word cloud is able to read and retain the underlying data. For a moment, consider a photograph of a fulcrum weighing scale. That's the kind where two buckets or trays hang from a beam with a pole in the center of the beam, the image we often associate with courts. Picture, on the left scale, Yoshi, the lilac-crowned Amazon parrot and official Data Skeptic mascot. On the right side, a stack of apples, just enough to balance it. What data does this contain? It contains the weight of Yoshi as measured in units of apples. This is admittedly a unit no metrologist would approve of, due to the lack of a standard. It's also an unnecessarily complex way to share a single rational number. 
Yet, stay with me for the sake of argument. Now, picture a similar setup, but instead of apples, to balance the weight, there are a few handfuls of gravel and wood chips. This visualization is notably inferior. The reader can count apples, and likely has an intuitive, if approximate, sense of the weight of an apple. The gravel and wood chips are not easily countable, nor is their weight intuitive to most readers. Someone interested in data visualization should familiarize themselves with the Weber-Fechner law. It describes the relationship between the magnitude of a stimulus and the amount it must change for a typical human to notice. In other words, if you have two dumbbells, one weighing 10 kilograms and the other weighing 10 plus X kilograms, how big does X have to be before you can distinguish which dumbbell is heavier? The law states that the just noticeable difference between two stimuli is proportional to the magnitude itself. Thus, if the X in our scenario is 0.1 kilograms, meaning the second weight has to be at least 10.1 kilograms before one can notice the difference, then proportionally we'd expect to need a 0.2 kilogram change to notice the difference in a 20 kilogram weight. I am unaware of any empirical measurement of the just noticeable difference for font size. If such a study exists, please point me in the right direction. If you conduct such a study, this is an open invitation to be a guest on Data Skeptic to present your findings. For fonts, in their typical unit of points, the general range of sizes seems to be about 8 to 72 points. Based on a completely anecdotal finding in a non-blinded study of N equals one (i.e., I tested on myself), I found I could not reliably distinguish between text of 68 point and 72 point type, implying that for 68 point type, the just noticeable difference is at least a five point change. Your mileage may vary. 
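The proportionality at the heart of the Weber-Fechner law is simple to state in code. A minimal sketch, assuming Python and an illustrative Weber fraction of 1% chosen only to reproduce the dumbbell numbers above (real Weber fractions vary by stimulus and must be measured empirically):

```python
def just_noticeable_difference(magnitude, weber_fraction=0.01):
    """Weber-Fechner law: the smallest change a typical person can
    detect is proportional to the magnitude of the stimulus itself."""
    return weber_fraction * magnitude

# Twice the magnitude means twice the change needed to notice it:
# roughly 0.1 kg for a 10 kg dumbbell, 0.2 kg for a 20 kg dumbbell.
print(just_noticeable_difference(10), just_noticeable_difference(20))
```

The open question raised above amounts to estimating the Weber fraction for font size, which an experiment like the anecdotal 68-versus-72-point trial could begin to pin down.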
Perhaps more significantly, the random position and rotation of values in a word cloud confound the problem of visual comparison and likely increase the just noticeable difference, thereby reducing the practical resolution of the visualization, per the Weber-Fechner law. A good data visualization is not a picture; it is a sentence written in the language of data. We can therefore empirically compare two competing visualizations, asking readers to report the exact value they observe in the image. We can measure the precision and variance of their responses and quantitatively label a winner. And in my mind, the data one could render in a word cloud would be better served, by these metrics, by just about any honest presentation of the data. Now, I'm open to one counterargument for the word cloud: art. If you want to claim that the arrangement of words of various sizes, colored in various ways, has artistic merit, I can be ambivalent. I may not understand or appreciate it as art, but despite Linda and me being LACMA members, there's plenty of art I don't understand. To me, the word cloud fails to achieve the desirable qualities of good data visualization I mentioned previously. But what can be done with the data? What's a more ideal way of representing this data? Let's consider what we have. We're given the output from some algorithm, which returns a select list of words or phrases and a weight of some kind on each value. Let's presume the weights are normalized between zero and one. As an aside, how did we arrive at this data? I have no idea. How do we interpret the weights? It's ambiguous. This would be a red flag for me in general. If I don't know the precise meaning and provenance of data, I'm skeptical. But let's set that aside and talk about a better way to visualize the phrases and their weights. I propose a simple histogram. We have phrases that the reader should read, so let's make it a horizontal bar chart so that the text is more natural to read. 
Let's order them descending so the visualization includes implicit ordinal characteristics, presents the most noteworthy phrases first at the top, and conveys a sense of the underlying distribution of weight over the phrases. Will the important phrases be normally distributed? Log-normally distributed? With the histogram approach, the distribution is shown to the reader, but with the word cloud, it's unnecessarily obscured. Recognizing that our histogram presentation might grow a bit too vertical, perhaps we should chop off the long tail and report everything else as a final bar, perhaps in a different shade or color to highlight its heterogeneous nature. This simple visualization embeds the same textual data, a cleaner read of the associated weights, an ordinal dimension, a sense of the distribution, and a better description of the cardinality. All in all, I'm comfortable calling it a superior presentation in every possible way to the word cloud. So would you please consider this, or frankly anything else, the next time you're presenting data you might have otherwise made into a word cloud? I've put some code samples in the show notes describing what I did above. If you'd like to use them, please feel free. So in conclusion, what are my primary objections to the word cloud? They convey much less data than a more straightforward presentation would. I believe they unnecessarily obfuscate the data by leveraging font size and random, meaningless positional data. Most importantly, they're difficult to read and convey the illusion of insight. Okay, Data Skeptic listeners. It's 2016, there are better ways. Say it with me now, everyone, out loud, whether you're alone in your car, out hiking in the woods, or on a crowded subway car. Okay, well, maybe don't say it with me if you're in the crowded subway car or in a generally public place, but everyone else, all at once, shout it with me: It's 2016, let's kill the word cloud. 
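The original show-notes code isn't reproduced in this transcript, but the bar-chart alternative described above can be sketched roughly as follows. Python is assumed, the phrases and weights are hypothetical placeholders, and the chart is rendered as text (rather than with a plotting library) to keep the sketch self-contained:

```python
def phrase_bar_chart(weights, width=40, top_n=4):
    """Render normalized phrase weights as a horizontal bar chart:
    sorted descending, with the long tail collapsed into one '(other)' bar."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    rows = head[:]
    if tail:  # report the chopped-off tail as a single aggregate bar
        rows.append(("(other)", sum(w for _, w in tail)))
    label_width = max(len(phrase) for phrase, _ in rows)
    lines = []
    for phrase, weight in rows:
        bar = "#" * round(weight * width)
        lines.append(f"{phrase:<{label_width}} | {bar} {weight:.2f}")
    return "\n".join(lines)

# Hypothetical output of some keyword-extraction step, weights in (0, 1].
weights = {"data science": 0.95, "skepticism": 0.60, "word cloud": 0.45,
           "visualization": 0.30, "podcast": 0.10, "parrot": 0.05}
print(phrase_bar_chart(weights))
```

The descending sort supplies the ordinal dimension, bar length (a length encoding, near the top of Cleveland and McGill's ranking) carries the weights, and the final aggregate bar summarizes the tail's cardinality, all in one read.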
Thanks for joining me, listeners new and old. I'm extremely excited for what we've got coming up on the show in this next year. It's an election year in the U.S., and we're going to cover topics related to the data of elections. We're going to explore an event that happened recently, which sent my wife to the ER, and the data surrounding it. And for those of you that are counting, we'll hit episode two raised to the seventh. In episode two raised to the sixth, I attempted to calculate the probability, however small, that Bigfoot might be real. Our next Data Skeptic investigation is already underway. Beyond that, you can expect the same high-caliber guests and mini episodes in your podcast feed every week, and perhaps even a few video segments and live events. If you've got feedback on the Kill the Word Cloud initiative, I invite everyone to share comments and criticisms in the discussion forum for this episode, found at dataskeptic.com. Our regular programming resumes next week with a mini episode on gradient descent. Until then, this is Kyle Polich for dataskeptic.com, reminding everyone to keep thinking skeptically of and with data throughout all of 2016 and beyond. - For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music)