attract all the hype these days within data science, but I’d argue they’re both secondary to a more important—and often-ignored—section of the field.
When dealing with data, there are two essential steps:
- Processing and analyzing the data to extract meaningful insights.
- Conveying these insights to others.
The second point is crucial and often overlooked. The world’s most advanced algorithm or beneficial insight is useless if no one can understand it. As a data scientist, you must learn to convey your insights to others. There is more than one reason for this. The obvious one is that if the right people understand the data, the world at large will benefit. However, there is another equally important reason: It is often in describing our findings to others that we discover errors, deeper knowledge, or further areas for exploration.
In this article, we’ll examine a powerful and effective tool which can help achieve the second step above: data visualization. This is the first in a series of articles that will take absolute beginners deep into the realm of data visualization. This first article is general and light, intended as an introduction to the field as a whole. In later articles, I’ll get into the more technical aspects, eventually concluding by teaching you how to build your own data visualizations.
With that knowledge, you’ll be armed to tackle your data in new, exciting ways.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” –John Tukey
What Counts as a Data Visualization?
Many people view data visualization through a restricted lens, only classifying standard graphs, such as bar charts, line charts, and the like, as true data visualizations. Viewed from this perspective, data visualization did not materialize until the middle of the 17th century. (We’ll see some examples below.)
However, we would do well to broaden our minds. Visual transformations of data are by no means limited to our traditional ideas. They have been around for thousands of years. For example, here is the Imago Mundi [1], the oldest known map of the world, a relic of ancient Babylon:
This map places Babylon at the center and was likely an extremely useful tool for visualizing what we now formally call geospatial data. It is one of the world’s earliest data visualizations.
There is a plethora of similar figures and images from various ancient civilizations: cave paintings, calendars, stone carvings, even Egyptian hieroglyphics. These are all effectively visual representations of data that were difficult to understand in their initial form. Viewing these examples as data visualizations leads us to an important principle:
At its core, data visualization is nothing more than taking some data—be it numerical, textual, or otherwise—and applying a transformation to represent it visually.
This foundational principle leads to several related topics primarily involving the most effective methods to conduct these transformations, where effective loosely translates to “honest, easy to understand, and informative.”
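To make the principle concrete, here is a minimal sketch of such a transformation in Python. It turns each number into a row of characters whose length is proportional to the value. The function name `text_bar_chart` and the sample rainfall figures are hypothetical, invented purely for illustration, and the sketch assumes the values are non-negative.

```python
def text_bar_chart(values, width=40, mark="#"):
    """Map each (label, number) pair to a row of marks proportional to the number."""
    # Fall back to 1 so an all-zero data set does not cause a division by zero.
    largest = max(value for _, value in values) or 1
    label_width = max(len(label) for label, _ in values)
    lines = []
    for label, value in values:
        # The "transformation": scale the number to a bar length in characters.
        bar_length = round(value / largest * width)
        lines.append(f"{label.ljust(label_width)} | {mark * bar_length} {value}")
    return "\n".join(lines)


# Hypothetical sample data, invented purely for illustration.
rainfall_mm = [("Mon", 4), ("Tue", 12), ("Wed", 8), ("Thu", 0)]
print(text_bar_chart(rainfall_mm, width=12))
```

Even in this crude form, the longest bar stands out at a glance, which is harder to see in the raw list of numbers.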
Early Examples of Data Visualizations
Now that we have broadened our perspectives concerning what constitutes a data visualization, let us take a look at some more recent examples. Below is a chart from 1644 developed by Michael Florent Van Langren [2]. It is one of the earliest graphical representations of what we consider to be traditional statistical data, depicting estimates of the difference in longitude between Rome and Toledo.
Let’s consider a more involved example next—one which directly highlights Tukey’s quote above.
Below is a map of London’s Soho District in 1854 [3]. It was designed by John Snow in order to determine if there were any patterns in the cholera outbreak that was debilitating the town at the time:
Looking toward the center of the map, we can see an exceptionally large number of deaths near the water pump on Broad Street. An investigation determined that this pump was contaminated and was a major cause of the spread of the disease.
This example highlights exactly the principle from John Tukey we noted above: One of the best uses of data visualization is to quickly see insights that are difficult to find in the data’s initial form.
Precision and Flexibility
Data visualization is a broad and deep topic that can be approached in many ways. That said, there are two principles that you should keep in mind irrespective of the specific form of data visualization you engage in: precision and flexibility.
A good data visualization does not try to accomplish ill-defined tasks, such as displaying the essence of or summarizing everything important about a data set. Statements like these are subjective and essentially impossible to achieve.
Rather, a good data visualization highlights a specific and well-defined aspect of the relevant data in a way that makes it easier to understand for the user. You should always articulate exactly what you want to express about your data before you even begin designing a visualization.
To internalize this principle, it is helpful to recall what the purpose of a data visualization is to begin with: to display insights from a data set in a clear and useful way. We want to make the data easier to understand. Being precise ensures we achieve this goal. A visualization that attempts to do too much might end up confusing the viewer even more. It is much better to produce a visualization which covers less data in a clearer way. Quality is more important than quantity.
Take a look at the data table below, which contains information about salaries from different cities around the United States.
Name | City | Income | Occupation |
---|---|---|---|
Sarah Mitchell | Denver, CO | $72,500 | Marketing Manager |
Jamal Rodriguez | Houston, TX | $58,300 | Electrician |
Priya Desai | Seattle, WA | $91,200 | Software Engineer |
Thomas Nguyen | Chicago, IL | $64,800 | Nurse |
Which of the following is the better visualization choice for the above data?
- A visualization that attempts to simplify the information in the data table using a bar chart that has names on one axis and salaries on the other axis, uses color to differentiate among cities, and uses a texture on the bars (dashed lines, diagonal lines, etc.) to distinguish among careers.
- The same visualization as above, but this time excluding the occupations. In other words, a bar chart of names and incomes which colors the bars based on city.
It’s tempting to choose the first one, but the fact is, it tries to do too much. Better to display limited, targeted information than to confuse your audience.
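As a rough sketch of the second option, the Python below keeps only the name, income, and city columns from the table and assigns one color per city. The plotting step assumes matplotlib is installed. The names `build_chart_spec` and `plot_incomes`, the color palette, and the output file name are hypothetical choices made for illustration, not a prescribed method.

```python
# Rows copied from the salary table above.
ROWS = [
    ("Sarah Mitchell", "Denver, CO", 72500, "Marketing Manager"),
    ("Jamal Rodriguez", "Houston, TX", 58300, "Electrician"),
    ("Priya Desai", "Seattle, WA", 91200, "Software Engineer"),
    ("Thomas Nguyen", "Chicago, IL", 64800, "Nurse"),
]

# Hypothetical color choices; any set of clearly distinct colors would do.
PALETTE = ["tab:blue", "tab:orange", "tab:green", "tab:red"]


def build_chart_spec(rows, palette=PALETTE):
    """Keep only name, income, and city; occupation is deliberately dropped."""
    city_colors = {}
    for _, city, _, _ in rows:
        if city not in city_colors:
            # Give each new city the next color, cycling if the palette runs out.
            city_colors[city] = palette[len(city_colors) % len(palette)]
    return {
        "names": [name for name, _, _, _ in rows],
        "incomes": [income for _, _, income, _ in rows],
        "colors": [city_colors[city] for _, city, _, _ in rows],
        "city_colors": city_colors,
    }


def plot_incomes(spec, path="incomes.png"):
    """Draw the bar chart and save it to a file; requires matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, so no display is needed
    import matplotlib.pyplot as plt
    from matplotlib.patches import Patch

    fig, ax = plt.subplots()
    ax.bar(spec["names"], spec["incomes"], color=spec["colors"])
    ax.set_xlabel("Name")
    ax.set_ylabel("Income (USD)")
    ax.set_title("Income by person, colored by city")
    # Build the legend by hand so each city appears exactly once.
    handles = [Patch(color=c, label=city) for city, c in spec["city_colors"].items()]
    ax.legend(handles=handles, title="City")
    ax.tick_params(axis="x", rotation=30)  # tilt names so they do not overlap
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


spec = build_chart_spec(ROWS)
```

Calling `plot_incomes(spec)` would then write the chart to `incomes.png`. Note that occupation never reaches the chart: leaving it out is the whole point of the second option.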
In addition to being precise, maintaining flexibility is also important. There is no such thing as a perfect data visualization. There is always room for improvement, and data visualizations generally become better with each revision. Of course, at some point, a data visualization must be shared with others and serve its purpose.
This leads to a quandary—how much revision is enough revision? There is no definitive answer to this question. The process of revising a visualization must be undertaken with care. Asking too many people for advice will likely result in a bunch of half-baked, conflicting opinions. On the other hand, publishing the first draft of a visualization—i.e., not revising it at all—is likely to lead to a subpar result.
Although there is no perfect solution, there are a few guidelines you can follow:
- Identify 2-3 people to give you feedback on your visualization.
- Try to ensure your list of people encompasses the following:
  - A reviewer who is proficient in designing data visualizations
  - A reviewer who has a strong understanding of the data that is being used to develop the visualization (e.g., a political scientist for election data)
  - A reviewer who is part of the intended audience for the visualization
- Go through 2-3 rounds of feedback and revision with this same list of people. This will ensure that improvements to the visualization are continuous and logical.
Final Thoughts and Looking Forward
In many ways, data visualization is akin to writing. Even the most prolific and talented authors have editors, and their books go through extensive revision before being approved for publishing. Why? For the simple reason that good writing is largely dependent on the audience, and carefully curated revision ensures the best experience for the eventual readers of a book. The same idea applies to data visualization.
By following these guidelines, you can ensure you develop a robust data visualization which is grounded in best practices, correctly displays the data at hand, and is understandable for the intended audience.
Precision and flexibility are the key to effective data visualization, and the foundation for the advanced visualization techniques that will be discussed in future articles. Until then.
References
[1] https://commons.wikimedia.org/wiki/File:The_Babylonian_map_of_the_world,_from_Sippar,_Mesopotamia..JPG
[2] The Visual Display of Quantitative Information, Edward Tufte
[3] https://picryl.com/media/snow-cholera-map-1-cbadea