Joy unconfined

There’s been much excitement online about ggjoy, an R package that makes vertically stacked area charts with a common x-axis. It is reminiscent of the Joy Division album cover, so I guess that ticks some cultural boxes in the dataviz community. The original version by Henrik Lindberg had the areas extending into the chart above:

[Image: Lindberg’s joy plot]

which provoked some killjoys to complain, and the height got standardised. Here’s Xan Gregg tackling it with a stacked (small multiple? see below) horizon plot:

[Image: Gregg’s stacked horizon plot]

Fair enough, though, as Robert Kosara pointed out, it’s more joyless. Bob Rudis did a cross between a heatmap and a violin plot:

[Image: Rudis’s heatmap–violin hybrid]

And Henrik did a classic heatmap:

[Image: Lindberg’s heatmap version]

None of which beat the original in my estimation.
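
If you want to try the look yourself, a minimal ggjoy sketch on a built-in dataset goes something like this:

    library(ggplot2)
    library(ggjoy)
    # continuous x, categorical z on the vertical axis, one density per level
    ggplot(iris, aes(x = Sepal.Length, y = Species)) +
      geom_joy()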

You might use this look if you’re visualising continuous x, continuous y and categorical z, where z is the stacking variable. Now, z could be ordinal, or you might want to order it in some sensible way. What way? As in the example above, it works best when there is an approximately smooth curve through the chart from top to bottom. Lindberg and Gregg have used different orders. There isn’t a right or wrong, because you emphasise different messages either way. Such is the way of dataviz. I think this is an important point, but for now consider a random z order:

[Image: the same chart with randomly ordered categories]

Not so good, huh? To play with the order, clone Henrik’s GitHub repo and change this line:

    mutate(activity = reorder(activity, p_peak, FUN = which.max)) %>% # order by peak time

I did:

    mutate(activity = reorder(activity, p_peak, FUN = function(x) { runif(1) })) %>% # order randomly

I think the appeal of these charts is that they emulate 3-D objects. That’s why I’m laid back about the chart extending into the block above.

This is all straightforward stuff with the three variables. Back in the day, I made an R function called slicedens that does something similar with two continuous variables: chop one of them (y) into slices, get kernel density plots for x in each slice of y, and off you go. Note the semi-transparency, please.

[Image: slicedens output]
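
The original function is more elaborate, but a stripped-down sketch of the idea in base R (names and defaults here are mine, not slicedens’s) looks like this:

    # chop y into slices, estimate the density of x within each slice,
    # then draw each curve with a vertical offset and semi-transparent fill
    slice_density_sketch <- function(x, y, n_slices = 8, gap = 0.6) {
      breaks <- quantile(y, probs = seq(0, 1, length.out = n_slices + 1))
      slices <- split(x, cut(y, breaks, include.lowest = TRUE))
      dens <- lapply(slices, density)
      peak <- max(sapply(dens, function(d) max(d$y)))
      plot(NULL, xlim = range(x),
           ylim = c(0, (n_slices - 1) * gap * peak + peak),
           xlab = "x", ylab = "", yaxt = "n")
      for (i in seq_along(dens)) {
        d <- dens[[i]]
        base <- (i - 1) * gap * peak
        polygon(c(d$x, rev(d$x)), c(d$y + base, rep(base, length(d$x))),
                col = rgb(0.2, 0.4, 0.8, alpha = 0.4), border = "grey30")
      }
    }
    # e.g. slice_density_sketch(x = rnorm(1e4), y = rnorm(1e4))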

This attacks the recurring problem of a scatterplot with Big Data where it just turns into one massive blob of ink. Hexbins, contour plots etc. are alternatives. Which of these can you batch-process when the data exceed the RAM or are streaming? Hexbins, yes, because you just add to the count (and for real-time data, subtract the oldest set of counts). Contour plots, no! Slice density, yes, for the same reason: a kernel density plot is a sum of a load of little curves. If you want to draw it for a trillion data points, just break the data up and accumulate the density heights. (“Just,” he says!) If you want a real-time slice density for the last ten seconds of stock market trades, batch it up in one-second chunks (for example) and add them together. Every second, add the new one and subtract the one from ten seconds ago. Simples.
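
To make that concrete, here’s a sketch of the rolling-window version, assuming the data arrive as numeric vectors, one batch per second; the fixed grid and fixed bandwidth are what let the batch densities add up:

    # evaluate each batch's kernel density on a common grid, weighted by
    # batch size, so that summing curves gives the pooled-data density
    grid_n  <- 512
    grid_lo <- -5; grid_hi <- 5   # assumed known range of the data
    bw      <- 0.2                # fixed bandwidth, same for every batch
    batch_curve <- function(batch) {
      d <- density(batch, bw = bw, from = grid_lo, to = grid_hi, n = grid_n)
      list(y = d$y * length(batch), n = length(batch))
    }
    window <- list()  # the last ten batches' curves
    add_batch <- function(batch) {
      window[[length(window) + 1]] <<- batch_curve(batch)
      if (length(window) > 10) window <<- window[-1]    # subtract the oldest
      total_n <- sum(vapply(window, `[[`, numeric(1), "n"))
      Reduce(`+`, lapply(window, `[[`, "y")) / total_n  # current density
    }
    # e.g. heights <- add_batch(rnorm(1000)), called once per second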

So, the 3-D object is important because it leverages the cognitive power we all use to see things in the real world, and that makes the perception of pattern in the dataviz easier for the reader. Things visible and invisible.

There’s a related question I want to expand on another time: when does a stacked chart become small multiples? This is not just semantics, because the two get understood in different ways.

Filed under Visualization

Dataviz of the week, 12/7/17

Here’s a map of the EU. Nothing more to it than that. I just like the fact that making it look like a globe makes it more engaging and eye-catching. I have decided, in the forthcoming dataviz book, to divide design into encoding, format and aesthetics. This map is not dataviz, but you can imagine how the aesthetics could be used to good effect.

[Image: map of the EU styled as a globe]

I found it at http://blog.nycdatascience.com/student-works/forecasting-economic-risk-eu-2020/ and they got it from http://www.mckinsey.com/global-themes/employment-and-growth/new-priorities-for-the-european-union-at-60.

Filed under Visualization

Dataviz of the week, 5/7/17

This week we look at a clinical trial of treatments for tuberculosis, the PanaCEA MAMS-TB study. I’ve been involved with TB on and off since project-managing and statisticizing the original NICE guideline back in the day. I won’t go into detail on TB treatments, but the trial compares various combinations of drugs, and there’s a new candidate drug called SQ109 in the mix. The paper is here (I hope it is not paywalled). You can see the Kaplan-Meier plot on page 44. Without going into detail, this is the classic format for clinical trials looking at time-to-event data. As time goes by, people either get recurrence of the disease or disappear out of the trial, and the numbers at risk go down. You want to be in a group whose curve descends less steeply.
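
If you want to reproduce the format, this is standard territory for R’s survival package; a sketch using its built-in lung data (nothing to do with this trial):

    library(survival)
    # time-to-event curves by group: the Kaplan-Meier estimate
    fit <- survfit(Surv(time, status) ~ sex, data = lung)
    plot(fit, col = c("black", "red"), xlab = "Days",
         ylab = "Proportion event-free")
    legend("topright", legend = c("male", "female"),
           col = c("black", "red"), lty = 1)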

But there are different ways of measuring and counting events, so the authors made an interactive web page showing these as a sensitivity analysis. Hooray!

[Image: screenshot of the interactive sensitivity-analysis page]

It’s a pity the Lancet paid it such lip service, tucking it away as a link in the margin of page 45. Boo!

I found the transitions in the table of patients at risk weird; I guess that’s the d3 transition deciding to move the numbers horizontally, and it might be clearer to fade them out, remove them, then put them back from scratch. It’s also clear that Mike Bostock never had to deal with step functions in transition. But otherwise it’s a really nice example of how trials can provide more layers of info.

Filed under healthcare, JavaScript, Visualization

Dataviz of the week, 28/6/17

Jake Conway, Alexander Lex & Nils Gehlenborg have made an R package called UpSetR which, as the name suggests, puts on an iron shirt and chases the devil out of Eart’. The devil in this case being Venn diagrams. Invariably, when people want to count up combinations of stuff, they end up hand-bodging some crappy diagram that isn’t even a real Venn. They use PowerPoint or some other diabolical tool. Now you can do better in R.

[Image: an UpSetR intersection plot]
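
Basic usage is pleasingly terse; a minimal sketch with made-up sets (the names are arbitrary):

    library(UpSetR)
    # three overlapping sets; fromList() turns them into the binary
    # membership matrix that upset() wants
    sets <- list(A = letters[1:12], B = letters[8:20], C = letters[15:26])
    upset(fromList(sets), order.by = "freq")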

Filed under Visualization

Dataviz of the week, 22/6/17

All I’m going to do this week is point you to Andy Kirk’s blog. He’s considering bivariate choropleth maps. You what? Each region has two variables. Maybe one gets encoded as hue and the other saturation. No way. Yes way, and it’s not necessarily the train wreck you’d imagine. Check them out.

[Image: bivariate choropleth map]
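
If you want to roll your own, the usual recipe is to cut each variable into terciles and index a 3x3 colour matrix; a sketch, with illustrative hex values:

    # assign each region a colour from a 3x3 grid: rows follow variable 1,
    # columns follow variable 2 (the palette values are illustrative)
    pal <- matrix(c("#e8e8e8", "#ace4e4", "#5ac8c8",
                    "#dfb0d6", "#a5add3", "#5698b9",
                    "#be64ac", "#8c62aa", "#3b4994"),
                  nrow = 3, byrow = TRUE)
    bivariate_colour <- function(v1, v2) {
      i <- cut(v1, quantile(v1, 0:3 / 3), include.lowest = TRUE, labels = FALSE)
      j <- cut(v2, quantile(v2, 0:3 / 3), include.lowest = TRUE, labels = FALSE)
      pal[cbind(i, j)]
    }
    # e.g. region_cols <- bivariate_colour(income, turnout)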

Of course, by superimposing objects rather than colouring in (because colour is, you know, so beguiling as a visual parameter to mess around with, yet so poorly perceived), people have been doing this for ages. Bertin’s much-quoted and less-read book has many such examples, which mostly fall flat in my view. As he wrote: “It is the designer’s duty … to flirt with ambiguity without succumbing to it”. Meh!

Filed under Visualization

Discomfiting jumps

I have been writing a book review of Efron & Hastie’s CASI for Significance magazine. Here’s a tangential half page I wrote but didn’t include.

Students of statistics or data science will typically encounter some discomfiting jumps in attitude as their course progresses. First, they may have a lot of probability theory and some likelihood-based inference for rather contrived problems, which will remind them of their maths classes at school. Ah, they think, I know how to do this. I learn the tricks to manipulate the symbols and get to the QED. Then, they find themselves suddenly in a course that provides tables of data and asks them to analyse and interpret. Suddenly it’s become a practical course that connects to the real world and leaves the maths behind for the most part. Now there’s no QED given, and no tricks. The assessments are suddenly more like those in humanities subjects: there’s no right or wrong, and it’s the coherence of their argument that matters. Now they have to remember which options to tick in their preferred stats software. They might think: why did we do the mathematical parts of this course at all if we’re not going to use them? Next, for some, come machine learning methods. Now the inference and asymptotic assurances are not just hidden in the cogs of the computer but are actually absent. How do I know the random forest isn’t giving me the wrong answer? You don’t. It seems at first that when the problem gets really hard, like 21st-century-hard, land-a-job-at-Google-hard, we give up on stats as an interesting mental exercise from the 1930s in favour of “unreasonably effective” heuristics and greedy algorithms.

One really nice thing they do in CASI is to emphasise that all estimation procedures, from sample standard deviations to GAMs, are algorithms. The inference (I prefer to say “uncertainty”) for those algorithms follows later in the history of the subject. The 1930s methods have had enough time for their inference to be worked out by now, but other methods are still developing their inferential procedures. This unifies things rather better, but most teaching has yet to catch up. One problem is that almost all the effort of reformers following George Cobb, Joan Garfield and others has gone into the very early introduction to the subject. That’s probably the right place to fix first, but we need to broaden out and fix wider data science courses now.
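
To see estimation-as-algorithm in miniature, treat the estimator as an arbitrary function and get its uncertainty by resampling, very much in the spirit of the book; a sketch:

    # any algorithm that returns a number can play the estimator
    estimator <- function(x) median(x)
    x <- rexp(200)                     # some sample data
    boot <- replicate(2000, estimator(sample(x, replace = TRUE)))
    sd(boot)                           # bootstrap standard error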

Filed under learning

Dataviz of the week, 15/6/17

It’s Clean Air Day in the UK. Air pollution interests me, partly because I worked in medical stats for many years, partly because I don’t want to breathe in a lot of crap, and partly because I don’t want my baby to breathe in a lot of crap. London is really bad, the worst place in Europe. Not Beijing-bad, sure, but really bad, and it’s hard to imagine that Brexit will lead to anything but a relaxation of the rules.

Real World Visuals (formerly CarbonVisuals, who made the amazing mountain of CO2 balls looming over New York) have made a series of simple, elegant but powerful images about volumes of air and what they contain, and the volume of air saturated with pollution that one car leaves behind over one kilometre travelled.

[Image: Real World Visuals’ air-volume graphics]

The tweet is accidentally poetic as it can’t accommodate more than the first four images, which leaves you on a cliffhanger with the massive stack looming behind the mother and girl. You know what it is, but you can’t see its enormity yet.

The crowd visualisation of 9,416 dead Londoners as dots is not bad, though I like physical images of numbers of people, like this classic (adapted from http://www.i-sustain.com/old/CommuterToolkit.htm):

[Image: the classic commuter-space photo]

Here’s a picture of apparently 8,000-9,000 people marching in Detroit:

[Image: crowd of marchers]

All dead by Christmas. And then some.

You might like to compare and contrast with higher-profile causes of death, like terrorism.

Filed under Visualization