Data Visualization Explained (Part 3): The Role of Color

This is the third article in my data visualization series. See Part 1: “Data Visualization Explained: What It Is and Why It Matters” and Part 2: “Data Visualization Explained: An Introduction to Visual Variables.”

How many colors do you see in the picture below?

Most people see four: white, green, and two different shades of pinkish-red. In reality, those two shades are exactly the same; there are only three colors in the image.

This popular optical illusion illustrates an important fact to consider when designing data visualizations: Poorly chosen color combinations can trick the human eye. For a complete treatment of color, I would need to delve into the physiological details of the human eye and explain how we actually “see” color.

However, seeing as this is not an optometry article, I’ll instead focus on the fundamentals of color usage that you will need to build clear data visualizations.

The Difference Between Color Hue and Color Value

When I introduced visual encoding channels in the previous article, I presented two different channels related to color: hue and value. Let us discuss these formally.

Color hue is what you generally think of when you hear the word “color.” Red, green, blue, pink, yellow, etc. are all different hues. Color value, on the other hand, refers to the “lightness” of an individual hue. The image below illustrates different values of the rainbow colors, showing how the same hue can vary greatly in lightness:

Image by Wikimedia Commons

While both of these can be effective visual encodings (see my previous article in this series for a detailed discussion on visual encodings), color value has one notable advantage over hue: It can still be perceived if a visualization is printed in grayscale.
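To make the grayscale point concrete, here is a minimal Python sketch that uses only the standard library. The specific colors and the helper names (`to_gray`, `hue_and_lightness`) are assumptions made for illustration, and the Rec. 601 luma weights are only a rough stand-in for how a grayscale printer might render a color.

```python
import colorsys

def to_gray(rgb):
    """Approximate grayscale level (0-255) using Rec. 601 luma weights."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def hue_and_lightness(rgb):
    """Return (hue in degrees, lightness in 0-1) for an 8-bit RGB color."""
    r, g, b = (c / 255 for c in rgb)
    h, l, _s = colorsys.rgb_to_hls(r, g, b)
    return round(h * 360), round(l, 2)

# Illustrative colors (assumed for this example, not taken from the figures).
light_red = (255, 153, 153)
dark_red = (153, 0, 0)
pure_red = (255, 0, 0)
mid_green = (0, 130, 0)

# Same hue, different value: the two reds stay distinct in grayscale.
print(hue_and_lightness(light_red), to_gray(light_red))
print(hue_and_lightness(dark_red), to_gray(dark_red))

# Different hue, similar value: this red and green collapse to the same gray.
print(hue_and_lightness(pure_red), to_gray(pure_red))
print(hue_and_lightness(mid_green), to_gray(mid_green))
```

In this sketch, the light and dark reds share a hue yet map to clearly different gray levels, while the red and green pair maps to the same gray level despite having very different hues.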

Types of Color Scales

If you want to use color as a visual encoding, you need to start by choosing a color scale. In doing so, there are a few characteristics you need to consider:

If your data is nominal, then you can use a categorical color scale, which relies solely on color hue.
For quantitative data, you’ll need to make two additional decisions: 1) whether your scale will be sequential or divergent (i.e., if it will use one or two hues), and 2) whether your scale will be continuous or divided into classes.

Thus, there are five color scales at our disposal, all of which we will discuss below: 1) sequential and unclassed, 2) sequential and classed, 3) divergent and unclassed, 4) divergent and classed, and 5) categorical [1].
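The decision process above can be sketched as a small Python function. This is only an illustration of the general rules in this section, not a definitive procedure; the function name and its parameters are hypothetical.

```python
def choose_color_scale(data_type, has_midpoint=False, classed=False):
    """Suggest one of the five color scale types.

    data_type:    "nominal" or "quantitative"
    has_midpoint: True if the data has a meaningful center (such as 0)
    classed:      True to group values into discrete buckets
    """
    if data_type == "nominal":
        # Nominal data relies on hue alone.
        return "categorical"
    if data_type != "quantitative":
        raise ValueError(f"unsupported data type: {data_type!r}")
    direction = "divergent" if has_midpoint else "sequential"
    binning = "classed" if classed else "unclassed"
    return f"{direction} and {binning}"

print(choose_color_scale("nominal"))
print(choose_color_scale("quantitative", has_midpoint=True, classed=True))
```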

Sequential scales (one hue) are useful for visualizing numerical values that go from low to high. Divergent scales can prove helpful when values go from negative to positive or when the designer wishes to emphasize some difference between the colors on two ends of the scale.

Of course, these are just general rules. Different types of scales are best depending on the particular visualization, and sometimes more than one can work.

Sequential and unclassed

The following map uses a sequential, unclassed color scale to illustrate the fraction of Australians that identified as Anglican at the time of the 2011 census. We can see that a single hue, green, increases in value from light to dark. Since there is only one color, there is no divergence, and since the scale is continuous, there are no classes.

Image by Toby Hudson on Wikimedia Commons

Sequential and classed

In contrast to the visualization above, the map of the United States below uses discrete classes that vary in color value. It is still sequential, as only a pink hue is used. The color value increases as the percentage of adults in their early 20s within a county increases.

One noteworthy element of this visualization is the uneven nature of the classes. (Note the width of the largest category.) This is not always good practice, especially if no reason is given.

Image by Derek Montaño on Wikimedia Commons
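The difference between the two sequential scales can be sketched in a few lines of plain Python. This is not how either map was produced; the green ramp, the value range, and the evenly sized class breaks are all assumptions chosen for illustration.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB colors, with t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))

def sequential_unclassed(value, vmin, vmax, light, dark):
    """Map a value to a continuous shade between light and dark."""
    t = (value - vmin) / (vmax - vmin)
    t = min(max(t, 0.0), 1.0)  # clamp out-of-range values
    return lerp_color(light, dark, t)

def sequential_classed(value, breaks, colors):
    """Map a value to the color of the class (bucket) it falls into.

    breaks holds the upper bound of every class except the last,
    so len(colors) must equal len(breaks) + 1.
    """
    for upper, color in zip(breaks, colors):
        if value < upper:
            return color
    return colors[-1]

# Hypothetical green ramp and class breaks, chosen only for illustration.
LIGHT_GREEN, DARK_GREEN = (229, 245, 224), (0, 68, 27)
BREAKS = [10, 20, 30]  # evenly sized classes, in percent
CLASS_COLORS = [lerp_color(LIGHT_GREEN, DARK_GREEN, i / 3) for i in range(4)]

for pct in (12.5, 19.9):
    print(pct,
          sequential_unclassed(pct, 0, 40, LIGHT_GREEN, DARK_GREEN),
          sequential_classed(pct, BREAKS, CLASS_COLORS))
```

Here 12.5 and 19.9 receive different shades on the unclassed scale but the same color on the classed one, which is the trade-off between precision and ease of reading.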

Divergent, classed and unclassed

Divergent scales are a bit trickier to understand, so let’s consider both types together in a comparative example. In doing so, we’ll also see the different advantages of classed and unclassed scales.

The two charts below were generated in Python using mock data. They use the following visual encoding channels:

The x-axis consists of a number representing store location.
The y-axis represents the months of the year.
The color represents a “customer satisfaction score” collected by the fictional stores via monthly surveys.

The classed vs. unclassed aspect of these visualizations is much like in the sequential scales above. In the left (unclassed) scale, the full, continuous range of values is represented, whereas in the right (classed) one, colors represent grouped buckets of values. The left visualization provides more precision, but the right one is easier to interpret and apply.

The divergent aspect of these scales is more convoluted. Let’s break it down:

The divergent scale here uses two colors: red and green (not the most accessible colors in the world, as we will see later in the article).
The neutral, white color (or the two light colors in the classed scale) represents a logical “middle point” in the data, which in this case is the value 0.
This middle point is key, as it makes for a situation where a divergent scale lends itself naturally to the data. It makes little sense to use more than one color if values are just moving in one direction without a meaningful center.
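The role of the middle point can be sketched in plain Python. This is not the code behind the charts above; the score range of -5 to 5, the RGB values, and the class breaks are assumptions made for illustration. The red and green pair mirrors the charts, but any two hues could be passed in.

```python
import bisect

def lerp_color(start, end, t):
    """Linearly interpolate between two RGB colors, with t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))

def divergent_unclassed(value, vmin, vcenter, vmax, low, mid, high):
    """Map a value to a two-hue scale anchored at vcenter."""
    if value < vcenter:
        t = (vcenter - value) / (vcenter - vmin)
        return lerp_color(mid, low, min(t, 1.0))
    t = (value - vcenter) / (vmax - vcenter)
    return lerp_color(mid, high, min(t, 1.0))

def divergent_classed(value, breaks, colors):
    """Map a value to the color of its class; breaks must be sorted."""
    return colors[bisect.bisect_right(breaks, value)]

# Placeholder colors and breaks (assumed, not taken from the charts).
RED, WHITE, GREEN = (178, 24, 43), (255, 255, 255), (27, 120, 55)
LIGHT_RED = lerp_color(WHITE, RED, 0.4)
LIGHT_GREEN = lerp_color(WHITE, GREEN, 0.4)
BREAKS = [-2.5, 0, 2.5]  # the middle break sits at the logical center, 0
CLASS_COLORS = [RED, LIGHT_RED, LIGHT_GREEN, GREEN]

for score in (-4, -1, 0, 1, 4):
    print(score,
          divergent_unclassed(score, -5, 0, 5, RED, WHITE, GREEN),
          divergent_classed(score, BREAKS, CLASS_COLORS))
```

In the unclassed version, a score of 0 maps to pure white and colors grow more saturated toward either end. In the classed version, the two light colors sit on either side of 0, so a score of exactly 0 falls into the light green class under this particular choice of breaks.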

Categorical

The final, and arguably most straightforward, color scale type is a categorical one. The chart below, which shows government funding breakdowns across various countries, provides a clear example.

If you have been paying attention to the principles discussed in this article thus far, you will likely notice that this is not a particularly well-designed data visualization. It gets the general point across, but there are a few too many different colors, resulting in a confusing final design.

That said, it is an effective use of a categorical scale, correctly applying this scale type to nominal data (data that has distinct, unordered categories). A common mistake in data visualization—and one you should take care to avoid—is using a categorical scale with several different hues when your data shows a clear numerical increase or decrease. In those situations, refer to one of the color scales discussed above, depending on your specific data.
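As a rough sketch of what a categorical scale does, the Python snippet below assigns each category its own hue while holding lightness and saturation fixed. Spacing hues evenly around the color wheel is just one simple approach that I am assuming here, and it does not guarantee an accessible palette. The category names are hypothetical.

```python
import colorsys

def categorical_palette(categories, lightness=0.5, saturation=0.65):
    """Assign each category a distinct hue, evenly spaced around the color wheel."""
    n = len(categories)
    palette = {}
    for i, name in enumerate(categories):
        # Only the hue changes between categories; lightness and saturation stay fixed.
        r, g, b = colorsys.hls_to_rgb(i / n, lightness, saturation)
        palette[name] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return palette

# Hypothetical funding categories, used only for illustration.
palette = categorical_palette(["Health", "Education", "Defense", "Other"])
for name, color in palette.items():
    print(name, color)
```

Because the hues carry no inherent order, this kind of mapping only makes sense for nominal data. The more categories you add, the closer neighboring hues become, which is one reason charts with many categories get hard to read.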

That sums up the basics of color scales that you must know to engage in effective data visualization. To conclude, let’s look at a couple more tips for using color well.

(Don’t) Use Color Redundantly

It can be tempting to use color in a visualization when it is not needed. For example, it’s quite common to see bar graphs that give every bar a different color even though clear x-axis labels already distinguish the bars.

This is not wrong, but it may be needless. If there are only a few categories and they’re linked with other visualizations, by all means use color to provide an additional visual cue. However, if the visualization functions fine without it, then don’t force it.

In general, redundant encodings (representations) should be avoided unless they make the visualization easier for the viewer to interpret. A redundant encoding is either wasteful, as that encoding channel could be used for a different variable, or confusing, as the viewer is left wondering whether the additional encoding depicts something they have missed.

Make Color Palettes Accessible

This last point is short, but incredibly important. Do not assume that simply because you can distinguish among the colors in a visualization, so can everyone else. Data visualizations should be accessible to everyone, including people who have various types of colorblindness [2].

For example, consider the Python visualizations in the section on divergent color scales above. Do you think someone with red-green color blindness will be able to interpret them correctly? Unlikely.

Luckily, we don’t need to do too much extra work to ensure our visualizations are accessible. There are countless online tools [3, 4, 5] which automatically check the accessibility of your chosen color palettes. Some will even help you generate them. Take advantage of them to make your visualizations as accessible as possible.
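If you want a quick programmatic sanity check before reaching for those tools, one rough heuristic is to compare how different your colors are in lightness, since lightness differences tend to survive even when hues are confused. The sketch below uses the WCAG relative luminance and contrast ratio formulas. It is not a colorblindness simulation and does not replace the dedicated tools, and the 1.5 threshold, the palette, and the function names are assumptions of mine.

```python
from itertools import combinations

def relative_luminance(rgb):
    """WCAG relative luminance of an 8-bit sRGB color (0 = black, 1 = white)."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(a, b):
    """WCAG contrast ratio between two colors, from 1 (identical) to 21."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)

def low_contrast_pairs(palette, threshold=1.5):
    """Return name pairs whose lightness contrast falls below the threshold.

    The default threshold is an arbitrary value chosen for this example.
    """
    return [
        (name_a, name_b)
        for (name_a, color_a), (name_b, color_b) in combinations(palette.items(), 2)
        if contrast_ratio(color_a, color_b) < threshold
    ]

# Hypothetical red/white/green palette, loosely modeled on a divergent scale.
palette = {"low": (200, 30, 30), "high": (30, 130, 30), "neutral": (255, 255, 255)}
print(low_contrast_pairs(palette))
```

With this palette, only the red and green pair is flagged: the two colors differ in hue but are close in lightness, so a viewer who cannot tell the hues apart has little else to go on.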

Final Thoughts

Congratulations! With the third article in this series, you have learned the essential principles you will need to design compelling data visualizations. In the articles to come, we will finally start designing and building our own visualizations! Until then.