# Monthly Archives: July 2017

## Joy unconfined

There’s been much excitement online about ggjoy, an R package that makes vertically stacked area charts sharing a common x-axis. It is reminiscent of the Joy Division album cover, so I guess that ticks some cultural boxes in the dataviz community. The original version by Henrik Lindberg had the areas extending into the chart above,

which provoked some killjoys to complain, and the heights got standardised. Here’s Xan Gregg tackling it with a stacked (small multiple? see below) horizon plot.

Fair enough, though, as Robert Kosara pointed out, it’s more joyless. Bob Rudis did a cross between a heatmap and a violin plot,

and Henrik did a classic heatmap.

None of which beat the original in my estimation.

You might use this look if you’re visualising continuous x, continuous y and categorical z, where z is the stacking variable. Now, z could be ordinal, or you might want to order it in some sensible way. What way? As in the example above, it works best when there is an approximately smooth curve through the chart from top to bottom. Lindberg and Gregg have used different orders. There isn’t a right or wrong, because you emphasise different messages either way. Such is the way of dataviz. I think this is an important point, but for now consider a random z order.

Not so good, huh? To play with the order, clone Henrik’s GitHub repo and edit the line

`mutate(activity = reorder(activity, p_peak, FUN = which.max)) %>% # order by peak time`

I did:

`mutate(activity = reorder(activity, p_peak, FUN = function(x) { runif(1) })) %>% # order randomly`
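To see what `reorder` is doing there, here’s a tiny self-contained sketch on toy data (my own made-up activities, not Henrik’s dataset): order the levels of the stacking variable by where each one peaks, or randomly.

```r
# Toy demo of reordering a stacking variable (z) by its peak position.
# Assumes rows within each activity are sorted by time, as in Henrik's data.
set.seed(42)

df <- data.frame(
  activity = factor(rep(c("A", "B"), each = 4)),
  time     = rep(1:4, times = 2),
  p_peak   = c(0, 0, 1, 0,   # A peaks at time 3
               1, 0, 0, 0)   # B peaks at time 1
)

# Order levels by peak time: B (peaks first) comes before A
by_peak <- reorder(df$activity, df$p_peak, FUN = which.max)
levels(by_peak)   # B then A

# Order levels randomly, as in the scrambled version above
at_random <- reorder(df$activity, df$p_peak, FUN = function(x) runif(1))
```

The FUN gets the p_peak values for each level in turn, so `which.max` returns the row index of the peak and the levels get sorted by that score.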

I think the appeal of these charts is that they emulate 3-D objects. That’s why I’m laid back about the chart extending into the block above.

This is all straightforward stuff with the three variables. Back in the day, I made an R function called slicedens that does something similar with two continuous variables. I chop up one of them (y) into slices, get kernel density plots for x in each slice of y, and off you go. Note the semi-transparency please.
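The real slicedens lives in my old code; here’s a minimal sketch of the idea under my own assumptions (base graphics, equal-width y bands): cut y into slices, run a kernel density of x within each slice, and stack the semi-transparent curves.

```r
# Minimal sketch of the slice-density idea (not the original slicedens code):
# cut y into bands, estimate the density of x within each band, then draw
# the curves stacked with semi-transparent fills.
set.seed(1)
x <- rnorm(5000)
y <- x + rnorm(5000)                        # correlated, so slices differ

n_slices <- 6
band <- cut(y, breaks = n_slices)           # slice y into equal-width bands
dens <- lapply(split(x, band), density)     # one kernel density of x per band

# Stack the curves: offset each density upwards by its slice index
plot(NULL, xlim = range(x), ylim = c(1, n_slices + 1),
     xlab = "x", ylab = "y slice", yaxt = "n")
for (i in seq_along(dens)) {
  d <- dens[[i]]
  polygon(c(min(d$x), d$x, max(d$x)),       # close the shape along its baseline
          c(i, i + d$y / max(d$y), i),      # rescale height within the band
          col = rgb(0.2, 0.4, 0.8, alpha = 0.5), border = "white")
}
```

Each polygon sits on a baseline at its slice index, with heights rescaled so no slice overflows more than one band above — adjust the rescaling if you like the overlapping look.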

This attacks the recurring problem of a scatterplot with Big Data, where it just turns into one massive blob of ink. Hexbins, contour plots etc are alternatives. Which of these can you batch-process when the data exceed the RAM, or are streaming? Hexbins yes, because you just add to the counts (and, for real-time data, subtract the oldest set of counts). Contour plots no! Slice densities yes, for the same reason: a kernel density plot is just a sum of a load of little curves. If you want to draw it for a trillion data points, just break them up into batches and accumulate the density heights. (“Just,” he says!) If you want a real-time slice density for the last ten seconds of stock market trades, batch them up into one-second chunks (for example) and add the densities together. Every second, add the new one and subtract the one from ten seconds ago. Simples.
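The additivity claim is easy to check numerically: hold the bandwidth and the evaluation grid fixed, and the density of the pooled data is the sample-size-weighted average of the batch densities. A toy check (my own sketch, not production streaming code):

```r
# Kernel densities add across batches if bandwidth and grid are held fixed:
# KDE(all) = (n1 * KDE(batch1) + n2 * KDE(batch2)) / (n1 + n2)
set.seed(2)
batch1 <- rnorm(400)
batch2 <- rnorm(600, mean = 2)
all_x  <- c(batch1, batch2)

bw   <- 0.3                                  # same bandwidth for every batch
grid <- list(from = -4, to = 6, n = 512)     # same evaluation grid too

d  <- function(x) density(x, bw = bw, from = grid$from, to = grid$to, n = grid$n)
d1 <- d(batch1); d2 <- d(batch2); d_all <- d(all_x)

# Accumulate: weight each batch's curve by its number of points
n1 <- length(batch1); n2 <- length(batch2)
accumulated <- (n1 * d1$y + n2 * d2$y) / (n1 + n2)

max(abs(accumulated - d_all$y))   # effectively zero (floating-point noise)
```

The catch for streaming is exactly those two “fixed” choices: you have to commit to a bandwidth and a grid up front, rather than letting each batch pick its own.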

So, the 3-D object is important because it leverages a load of the cognitive power we use to see things in the real world, and makes the perception of pattern in the dataviz easier for the reader. Things visible and invisible.

There’s a related question I want to expand on another time: when does a stacked chart become small multiples? This is not just semantics because they get understood in different ways.

Filed under Visualization

## Dataviz of the week, 12/7/17

Here’s a map of the EU. Nothing more to it than that. I just like the fact that making it look like a globe makes it more engaging and eye-catching. In the forthcoming dataviz book, I decided to divide design into encoding, format and aesthetics. This map is not dataviz, but you can imagine how the same aesthetics could be used to good effect.

I found it at http://blog.nycdatascience.com/student-works/forecasting-economic-risk-eu-2020/ and they got it from http://www.mckinsey.com/global-themes/employment-and-growth/new-priorities-for-the-european-union-at-60

Filed under Visualization

## Dataviz of the week, 5/7/17

This week we look at a clinical trial of treatments for tuberculosis, the PanACEA MAMS-TB study. I’ve been involved with TB on and off since project-managing and statisticizing the original NICE guideline back in the day. I won’t go into detail on TB treatments, but the trial compares various combinations of drugs, and there’s a new candidate drug called SQ109 in the mix. The paper is here (I hope it is not paywalled). You can see the Kaplan-Meier plot on page 44. Kaplan-Meier plots are the classic format for clinical trials looking at time-to-event data. As time goes by, people either get recurrence of the disease or disappear out of the trial, and the numbers at risk go down. You want to be in a group whose curve descends less steeply.
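If you haven’t made one before, a Kaplan-Meier plot takes a couple of lines in R with the survival package. Here’s a sketch on simulated data — the group labels and rates are entirely made up, nothing to do with the MAMS-TB numbers:

```r
# Kaplan-Meier curves on simulated time-to-event data (illustrative only;
# these are not the MAMS-TB trial data).
library(survival)
set.seed(3)

n <- 100
d <- data.frame(
  group  = rep(c("control", "SQ109 combo"), each = n / 2),
  time   = c(rexp(n / 2, rate = 0.10),     # control: events come sooner
             rexp(n / 2, rate = 0.05)),    # hypothetical better arm
  status = rbinom(n, 1, 0.8)              # 1 = event observed, 0 = censored
)

fit <- survfit(Surv(time, status) ~ group, data = d)
plot(fit, col = c("red", "blue"), xlab = "Time", ylab = "Proportion event-free")
legend("bottomleft", legend = names(fit$strata),
       col = c("red", "blue"), lty = 1)
```

The `status` column is what carries the censoring — the people who disappear out of the trial — and the step-downs in each curve are the events.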

But there are different ways of measuring and counting events, so the authors made an interactive web page showing these as a sensitivity analysis. Hooray!

It’s a pity the Lancet gave it such short shrift, tucked away as a link in the margin of page 45. Boo!

I found the transitions in the table of patients at risk weird: I guess that’s the d3 transition deciding to move the numbers horizontally, and it might be clearer to fade them out, remove them, then put them back from scratch. It’s also clear that Mike Bostock never had to deal with step functions in transitions. But otherwise it’s a really nice example of how trials can provide more layers of info.