Color: The Cinderella of dataviz

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing over how to implement color in the Dataspora Labs PitchFX viewer, I began with a basic motivating question:

Why use color in data graphics?

If our data are simple, a single color is sufficient, even preferable. For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008. With just two dimensions of data to describe — the x and y location in the strike zone — black and white is sufficient. In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).

Fig 1. Location of Pitches (Villarreal, HOU, 2008)

Simple black and white scatter plot

But what if we’d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where? Or their speed?  Visualizations live in two dimensions, but the world they describe is rarely so confined.

The defining challenge of data visualization is projecting high dimensional data onto a low dimensional canvas. (As a rule, one should never do the reverse: visualize more dimensions than already exist in the data.)

Getting back to our pitching example, if we want to layer another dimension of data — pitch type — into our plot, we have several methods at our disposal:

  1. plotting symbols – vary the glyphs that we use (circles, triangles, etc.)
  2. small multiples – vary extra dimensions in space, creating a series of smaller plots
  3. color – color our data, encoding extra dimensions inside a color space

Which techniques you employ depends on the nature of the data and the medium of your canvas. I will describe all three by way of example.

Multivariate Method I:  Vary Your Plotting Symbols

Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)

Scatterplot with varied plotting symbols.

In this plot, I’ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.

I consider this visualization an abject failure. In fact, the prize for my most despised graphs in graduate school goes to bacterial growth curves rendered this way. These graphs make our heads hurt for two reasons: (i) distinguishing glyphs demands extra attention (versus what academics call ‘pre-attentively processed’ cues like color), and (ii) even after we visually decode the symbols, we face yet another step: mapping symbols to their semantic categories. (Admittedly, this can be improved with Chernoff faces or other iconic symbols, where the categorical mapping is self-evident.)
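For readers who want to reproduce the effect, here is a minimal sketch of the varied-symbol approach in base R. It assumes a hypothetical data frame `pitches` with columns `x`, `y`, and `type`; the values are randomly generated stand-ins, not the actual PitchFX data, and the four pitch type names are only illustrative.

```r
## Synthetic stand-in data (NOT real PitchFX data): four pitch types
set.seed(1)
types <- c("changeup", "fastball", "sinker", "slider")
pitches <- data.frame(
  x    = rnorm(200),
  y    = rnorm(200, mean = 2.5, sd = 0.8),
  type = factor(sample(types, 200, replace = TRUE), levels = types)
)

## One open plotting symbol (pch code) per pitch type:
## 1 = circle, 2 = triangle, 0 = square, 5 = diamond
symbols <- c(changeup = 1, fastball = 2, sinker = 0, slider = 5)
pch_by_pitch <- symbols[as.character(pitches$type)]

plot(pitches$x, pitches$y, pch = pch_by_pitch,
     xlab = "horizontal location", ylab = "vertical location")
legend("topright", legend = names(symbols), pch = symbols)
```

Note that the legend is doing the second decoding step described above: without it, the reader has no way to map a glyph back to its category.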

Multivariate Method II:  Small Multiples on a Canvas

Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics. It has been employed everywhere from Galileo’s sunspot illustrations to William Cleveland’s trellis plots. And as Scott McCloud’s unexpected tour de force on comics makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.

In the plot below, the four types of pitches that Oscar throws are splintered horizontally. By reducing our plot sizes, we’ve given up some resolution in positional information. But in return, patterns that were invisible in our first plot, and obscured in our second (by varied symbols), are now made clear (Oscar throws his fastballs low, but his sliders high).

Fig 3. Location and Pitch Type (Villarreal, HOU, 2008)

black and white strip plot

Multiplying plots in space works especially well on printed media, which can hold more than ten times as many dots per square inch as a screen. Both columns and rows can be used to lattice over additional dimensions, the result being a matrix of scatter plots (in R, see the ‘splom’ function in the lattice package).
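A strip of small multiples like the one above can be sketched with lattice’s `xyplot` and a conditioning formula. This is a sketch under assumptions rather than the code behind Fig 3: the `pitches` data frame and its columns are hypothetical, and the data are synthetic.

```r
library(lattice)  # ships with standard R distributions

## Synthetic stand-in data (NOT real PitchFX data)
set.seed(1)
types <- c("changeup", "fastball", "sinker", "slider")
pitches <- data.frame(
  x    = rnorm(200),
  y    = rnorm(200, mean = 2.5, sd = 0.8),
  type = factor(sample(types, 200, replace = TRUE), levels = types)
)

## "y ~ x | type" draws one panel per pitch type;
## layout = c(4, 1) places the panels in 4 columns and 1 row
p <- xyplot(y ~ x | type, data = pitches, layout = c(4, 1),
            col = "black",
            xlab = "horizontal location", ylab = "vertical location")
print(p)
```

Because every panel shares the same axes, positions remain directly comparable across pitch types, which is what makes the strip readable at small sizes.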

Multivariate Method III: Color Your Data

So why bother with color?

First, as compared to most print media, computer displays have fewer units of space, but a broader color gamut. So color is a compensatory strength.

For multi-dimensional data, color can convey additional dimensions inside a unit of space — and can do so instantly. Color differences can be detected within 200 ms, before you’re even conscious of paying attention (the ‘pre-attentive’ concept I mentioned earlier).

But the most important reason to use color in multivariate graphics is that color is itself multidimensional. Our perceptual color space — however you slice it — is three-dimensioned.

In the example below, I’ve used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I’ve chosen is a diverging palette that moves along one dimension (think of it as the ‘redness-blueness’ dimension) in the CIELUV color space, while maintaining a constant level of luminosity.

Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)

isoluminant, diverging color ramp

color strip plot

Holding luminosity constant is important, because luminosity (similar to brightness) determines a color’s visual impact. Bright colors pop, and dark colors recede. A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.

I chose only seven gradations of color, so I’m downsampling (in a lossy way) our speed data – but further segmentation of our color ramp is not likely to be perceptible.
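One way to sketch a seven-step palette of this kind is with base R’s `hcl()`, which takes hue, chroma, and luminance in polar CIELUV coordinates. The function name `isoluminant_diverging`, the hue angles, the chroma, and the luminance below are all illustrative assumptions, not the values used for Fig 4, and the speeds are synthetic.

```r
## n colors diverging from one hue to another through neutral gray,
## all at the same luminance l. All parameter defaults are illustrative.
isoluminant_diverging <- function(n = 7, h_low = 260, h_high = 10,
                                  c_max = 50, l = 60) {
  pos <- seq(-1, 1, length.out = n)          # -1 = slowest bin, +1 = fastest
  hue <- ifelse(pos < 0, h_low, h_high)      # bluish below the midpoint, reddish above
  hcl(h = hue, c = c_max * abs(pos), l = l)  # chroma fades to 0 (gray) at the midpoint
}

palette7 <- isoluminant_diverging(7)

## Downsample synthetic speeds (mph) into seven equal-width bins,
## then look up one palette color per pitch
set.seed(1)
speed     <- runif(200, min = 75, max = 95)
speed_bin <- cut(speed, breaks = 7, labels = FALSE)
point_col <- palette7[speed_bin]
```

Only hue and chroma change along the ramp, so no bin should pop or recede simply because of the color assigned to it.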

I’ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots. This is done to improve the perception of each pitch’s speed via its color: small patches of color are less perceptible. But a consequence of this choice — compounded by our choice to work with a series of smaller plots — is that more points overlap. We’ve further degraded some of our positional information. However, in our last step, we attempt to recover some of this.

Now I’ve finally brought color to bear on this visualization, but I’ve only encoded a single dimension — speed. Which leads to another question:

If color is three-dimensional, can I encode three dimensions with it?

In theory, yes. Colin Ware researched this exact question. In practice, it’s difficult. It turns out that asking observers to assess the amount of ‘redness’, ‘blueness’, and ‘greenness’ of points is possible, but not intuitive (I suspect it’s somewhat like parsing symbols).

Another complicating factor is that a nontrivial fraction of the population has some form of color blindness. This effectively reduces their color perception to two dimensions.

And finally, our sensation of color is not equal along all dimensions: it’s thought that the closely related ‘red’ and ‘green’ receptors emerged via duplication of a single long-wavelength receptor (useful for distinguishing ripe from unripe fruits, according to one just-so story).

Because of the high level of dichromacy in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.

So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator). This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.

Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)

two-dimensional color palette

multivariate color strip plot

Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis for speed, and luminosity varying in the other to denote local density.
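The two-axis mapping can be sketched in base R as follows. This is an approximation under stated assumptions, not the code behind Fig 5: `point_density` is a hypothetical helper implementing a plain Gaussian kernel density estimate with a hand-picked bandwidth, the luminance range of 30 to 80 is arbitrary, and the data are synthetic.

```r
## Gaussian kernel density estimate evaluated at each data point.
## bw is a hand-picked bandwidth (an assumption, not tuned).
point_density <- function(x, y, bw = 0.5) {
  d2 <- outer(x, x, "-")^2 + outer(y, y, "-")^2  # squared pairwise distances
  rowMeans(exp(-d2 / (2 * bw^2))) / (2 * pi * bw^2)
}

## Synthetic stand-in data (NOT real PitchFX data)
set.seed(1)
n     <- 200
x     <- rnorm(n)
y     <- rnorm(n, mean = 2.5, sd = 0.8)
speed <- runif(n, min = 75, max = 95)  # mph

## Axis 1: speed -> signed position on a blue-gray-red ramp
pos <- 2 * (speed - min(speed)) / diff(range(speed)) - 1

## Axis 2: local density -> luminance, denser neighborhoods drawn darker
dens <- point_density(x, y)
lum  <- 80 - 50 * (dens - min(dens)) / diff(range(dens))

cols <- hcl(h = ifelse(pos < 0, 260, 10), c = 40 * abs(pos), l = lum)
plot(x, y, pch = 16, col = cols,
     xlab = "horizontal location", ylab = "vertical location")
```

Unlike the seven-step ramp, this sketch leaves both axes continuous; binning either one with `cut()` would give a discrete grid of palette cells.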

One final point about using luminosity. Observing colors in a data visualization involves overloading, in the programming sense. We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).

Since we can overload color any way we want, whenever possible, we should choose mappings that are natural. Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth. Likewise, when sampling from the color space, we might as well choose colors found in nature. These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.

Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.

FutureMan Asks: What about Animation?

This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data. I’ve purposely neglected one very powerful tool: motion. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization. But packing information into a time-varying data structure has to be done by someone (you or me), and in my view this remains a significant challenge. Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like Processing and Prefuse are a promising start toward their development.

Methods

The final product of these five-dimensional pitch plots, for all available data for the 2008 season, can be explored via the PitchFX Django-driven web tool at Dataspora Labs.

All of the visualizations here were developed using R and the Lattice graphics package. (Of note, Hadley Wickham is developing ggplot2, a bold re-write of the R graphics system based on a grammar of graphics.)

Comments

9 Responses to “Color: The Cinderella of dataviz”

  1. Joshua Reich on March 13th, 2009

    Great post Michael.

    In the world of computer animation (mostly of yore, but still sometimes today), there is a common phrase of ‘coders colors.’ When left to their own devices, software people tend to choose colors that programmatically explore the RGB tuple space.

    While you don’t explicitly mention it, the RGB space, while perfectly logical for designing computer monitors or building CCDs, does not map to the structure of the retina nor to how humans perceive color, and thus is not ideal for data representation. Few humans are skilled enough to pick harmonious colors directly as RGB tuples, yet most software systems default to this method.

    This makes some historical sense in that no additional computations are required to translate RGB pixels into monitor signals – it was up to the developer to add their own colorspace transforms. But today, the cost of applying simple linear matrix math to these tuples is inconsequential, yet many packages still only provide a default RGB space.

    R is great here in that the base package provides hsv() and hcl() in addition to rgb(). And many of the programmatic techniques that would otherwise result in ‘coder colors’ in RGB turn out fine in these other color spaces.

  2. Michael E. Driscoll on March 13th, 2009

    Josh – I did not want to bring our dear readers down the rabbit hole of color spaces, but I couldn’t agree with you more w.r.t. RGB. Our actual perceptual color space is not a perfect cube, but I suspect the same engineers who brought us function keys F1 through F12 were also behind choosing these ‘system colors’. We are only now slowly shrugging off those frozen accidents, and our machines are no longer visually shrieking at us.

  3. Edward Tufte on March 15th, 2009

    Dear Mike Driscoll,

    This is an interesting exploration. Some suggestions to try:

    Report some real findings about the baseball pitching to demonstrate that the displays have produced something interesting.

    Use a much larger data matrix (100 or 500 pitches).

    Make dots smaller.

    Make tick marks much smaller. On this idea, see Visual Explanations on Smallest Effective Differences.

    Try color patches (à la Ware) instead of dots.

    See Bill Cleveland’s two excellent books on data displays and do some Cleveland versions of these data.

    Take Ware and Cleveland’s work more seriously.

    Don’t give up color’s third dimension because some viewers (4% of men, 1% of women) are color deficient. That’s way too much to give up; instead design all out and then afterwards see if it is possible to gently accommodate color deficiencies by color value or saturation (in HSV space).

    Try hue, saturation, and value for three dimensions.

    Use gray for all those black grid lines; eliminate as many lines as possible.

    Eliminate and lighten up gray boxes.

    Try colored letters instead of dots to ID changeup, fastball, sinker (S? N?), slider (S? L?) on a larger common plot.

    Make graph labels more informative.

    Don’t write in first person history of what you did; main subject and main verb of each sentence should be about the graphics and baseball (see how sparklines are presented in Beautiful Evidence; 14 pages and not a single “I”).

    Best, ET

  4. Abhishek Tiwari on March 24th, 2009

    Dear Michael,
    Excellent post as well as blog, I just posted a small article on this entry. I hope my readers will find their way to this blog.
    Thanks,
    Abhishek

  5. Maureen Stone on March 24th, 2009

    Michael,

    An interesting exploration. One weakness I see in the visualization, however, is the mapping from speed to color. A monotonically ordered set of values is most naturally mapped to a change in saturation or lightness, not to an interpolation between two colors. If you used saturation for speed, you would then vary lightness to indicate your other quantitative variable (density of pitches).

    However, I suggest you reconsider using color for speed and instead, use color to indicate the type of pitch. You can then use distinct colors (red, green, blue, etc.) to label the types, as labeling is the most effective use of color. The distinct colors would also let you use rings instead of disks, which makes it easier to estimate the density of the data. It could also be combined with a mapping by letter or symbol, to aid those with less than perfect color vision.

    Use the small multiples for quantized speed ranges (Question: are raw speed values the most interesting, or would it be better to have average, plus, minus?). Then you can combine lightness and saturation to indicate pitch density (as in the Brewer ramps).

    Or, maybe it would make more sense for your audience to quantize pitch density and map speed to the ramp. Either way, my intuition is that you would see more interesting patterns in your data than trying to use color for both quantitative dimensions.

    Keep up the good work.

    Maureen

  6. The Importance of color in data visualization on Datavisualization.ch on March 24th, 2009

    […] Michael E. Driscoll over at Data Evolution comes the original article “Color: The Cinderella of dataviz” about the lack of focus on color in visualizations. It’s an elusive read for anybody […]

  7. Mike Williamson on April 12th, 2009

    Hi Michael,

    I originally saw this presentation as you gave it at the “Use R” group last Wed. I then installed “colorspace” on R and played around with it for a little while these past couple days. I have 2 questions, if you don’t mind, since I am curious how you may have handled the same problems I am having:

    1) In order to automatically generate any decent color palette, regardless of gradations, I need to use the “mixcolor” function.
    As this function says in its manual, it mixes colors “additively”. I am not great with color recognition, but it is either this additive mixing, or the fact that it is in fact mixing in the RGB scheme, regardless of what colorspace I put in for “where”, that is messing up the mixing. If I try something similar to what you did for your baseball stuff, and I generate it using mixcolor, I will get a MUCH brighter luminosity than what you show there, so that the “grey” between blue & red is nearly white.
    Is there a way in “R” using mixcolor or whatever to preserve the luminosity when blending colors?

    2) It is clear that the colorspace package is not really “comfortable” with generating colors in anything other than the RGB scheme. I say this because if I try to use mixcolor with colors in anything other than the RGB scheme, it is possible that the mixed color will generate “NA”s for the values. (Specifically, I only tested LAB, LUV, and their polar versions.)
    While I like what everyone is saying that these other color schemes are better for the human eye, and it makes total sense, I have more “fear” of generating a color key with NAs in it (which will simply not plot) than I do of having a color scheme that is less than ideal. Am I doing something wrong, or do others have this problem with mix color? I suppose the completely reasonable solution is to just create a better mixcolor function, has anyone done this so I don’t recreate the wheel?

    Thanks!
    Mike

  8. Michael E. Driscoll on April 15th, 2009

    Hi Mike –

    I’d have to see your code to understand what’s happening, but a few thoughts:

    (1) The LUV and LAB colorspaces separate chromaticity (the u and v coordinates) from luminosity: so luminosity is held constant when you create a mixture of two different chromas. Specifically, here is the code for creating a 2D palette:

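    ## Builds a 2D palette that mixes two hues (col1, col2, given as
    ## colorspace LAB colors) across a range of luminosities (l1 to l2).
    ## Returns C, an n x m matrix of hex RGB strings:
    ## rows step through luminosity, columns through the hue mixture.
    library(colorspace)
    plot2d <- function(col1, col2, l1, l2, m, n, ...) {
      C <- matrix(data = NA, ncol = m, nrow = n)
      alpha <- seq(0, 1, length.out = m)  # mixing weights, col1 to col2
      lum <- seq(l1, l2, length.out = n)  # luminosity levels
      for (i in 1:n) {
        ## keep each endpoint's a/b chromaticity, override its luminosity
        c1 <- LAB(lum[i], coords(col1)[2], coords(col1)[3])
        c2 <- LAB(lum[i], coords(col2)[2], coords(col2)[3])
        for (j in 1:m) {
          mixed <- mixcolor(alpha[j], c1, c2)
          ## fixup = TRUE snaps out-of-gamut colors to valid RGB instead of NA
          C[i, j] <- hex(mixed, fixup = TRUE)
        }
      }
      return(C)
    }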

    (2) Once you make or mix colors in the LAB or LUV space, you need to cast them back into RGB using the ‘hex’ function, but you must include the ‘fixup=TRUE’ parameter in your call to avoid getting NAs in your result. From the documentation for ‘hex’:

    fixup: Should the color be corrected to a valid RGB value before
    correction? The default is to convert out-of-gamut colors to
    the string ‘”NA”‘.

    E.g. write,

    library(colorspace)
    ## 50% mixture of blue and red
    red <- LAB(50,64,64)
    blue <- LAB(50,-48,-48)
    gray <- mixcolor(0.50,red,blue)
    rgbgray <- hex(gray, fixup=TRUE)

    I also thought I’d point folks to the “Building Web Dashboards with R” talk that you reference:

    http://files.meetup.com/1225993/Dataspora_Building_Web_Dashboards_with_R.pdf

  9. O’Reilly Radar on May 4th, 2009

    Big Data: SSD’s, R, and Linked Data Streams…

    If you haven’t seen it, I recommend you watch Andy Bechtolsheim’s keynote at the recent Mysqlconf. We covered SSD’s in our just published report on Big Data management technologies. Since then, we’ve gotten additional signals from our network of al…

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.
