There is a considerable amount of human craftsmanship involved in this, but I don't think we should shy away from that. We tend (me too!) to look for coding solutions to everything, but we shouldn't let that keep us from doing nice things by hand too. Having said that, it reminded me of this very cool, totally automatic JavaScript generator of fantasy maps.

Metaphor is an important tool for reinforcing a dataviz message. I decided to close my dataviz book, now in draft version 1, with examples of that. I went with Carbon Visuals, Gun Deaths 2013, 5W Infographics' solar system exploration, and WP's flight MH graphic.

Brilliant, and a reminder that sound, well used, is incredibly powerful. They credit the Seismic Sound Lab at Lamont-Doherty/Columbia (http://seismicsoundlab.org) with making the animation. You should take a look at their other work too.

I think this is in with a great chance of being my dataviz of the year.

This provoked some killjoys to complain, and the height got standardised. Here's Xan Gregg tackling it with a stacked (small multiples? see below) horizon plot.

Fair enough, though, as Robert Kosara pointed out, it's more joyless. Bob Rudis did a heatmap–violin plot cross.

And Henrik did a classic heatmap.

None of which beat the original in my estimation.

You might use this look if you're visualising continuous x, continuous y and categorical z; z is the stacking variable. Now, z could be ordinal, or you might want to order it in some sensible way. What way? As in the example above, it works best when there is an approximately smooth curve through the chart from top to bottom. Lindberg and Gregg have used different orders. There isn't a right or wrong, because you emphasise different messages either way. Such is the way of dataviz. I think this is an important point, but for now consider a random z order.

Not so good, huh? To play with the order, clone Henrik's GitHub repo and play with the line

mutate(activity = reorder(activity, p_peak, FUN = which.max)) %>% # order by peak time

I did:

mutate(activity = reorder(activity, p_peak, FUN=function(x) { runif(1) })) %>% # order randomly
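
If you don't feel like cloning the repo, here's a minimal self-contained toy showing what the reorder trick does; the data and variable names below are mine, not Henrik's:

```r
# Toy illustration of the reorder trick (made-up data, not Henrik's)
library(dplyr)

toy <- tibble(
  activity = factor(rep(c("sleep", "commute", "tv"), each = 3)),
  p_peak   = c(0.9, 0.2, 0.1,  0.1, 0.8, 0.2,  0.1, 0.3, 0.9)  # three time points each
)

# which.max returns the position of each activity's peak within its group,
# so the factor levels end up ordered by when the peak occurs; swap in
# function(x) runif(1) to shuffle them instead
toy <- toy %>%
  mutate(activity = reorder(activity, p_peak, FUN = which.max))

levels(toy$activity)  # "sleep" peaks first, then "commute", then "tv"
```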

I think the appeal of these charts is that they emulate 3-D objects. That’s why I’m laid back about the chart extending into the block above.

This is all straightforward stuff with the three variables. Back in the day, I made an R function called **slicedens** that does something similar with two continuous variables. I chop up one of them (y) into slices, get kernel density plots for x in each slice of y, and off you go. Note the semi-transparency, please.
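
For the curious, here is a stripped-down sketch of the idea in base R. This is not the real **slicedens** code; the function name, defaults and colours are my own, just to show the mechanics:

```r
# Slice y into bands, draw a kernel density of x for each band, offset the
# curves vertically, and fill with semi-transparent colour
slice_density_sketch <- function(x, y, slices = 8, offset = 0.6) {
  breaks <- quantile(y, probs = seq(0, 1, length.out = slices + 1))
  band   <- cut(y, breaks = breaks, include.lowest = TRUE)
  dens   <- lapply(split(x, band), density)
  ymax   <- max(sapply(dens, function(d) max(d$y)))
  plot(NULL, xlim = range(x), ylim = c(0, ymax * (1 + offset * (slices - 1))),
       xlab = "x", ylab = "", yaxt = "n")
  for (i in slices:1) {  # back to front, so nearer slices overlap those behind
    d    <- dens[[i]]
    base <- offset * ymax * (i - 1)
    polygon(c(d$x, rev(d$x)), c(d$y + base, rep(base, length(d$x))),
            col = rgb(0.2, 0.4, 0.8, alpha = 0.5), border = "white")
  }
}

set.seed(1)
x <- rnorm(5000); y <- x + rnorm(5000)
slice_density_sketch(x, y)
```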

This attacks the recurring problem of a Big Data scatterplot that just turns into one massive blob of ink. Hexbins, contour plots and the like are alternatives. Which of these can you batch-process when the data exceed the RAM or are streaming? Hexbins yes, because you just add to the counts (and, for real-time data, subtract the oldest set of counts). Contour maps no! Slice density yes, for the same reason: a kernel density plot is a sum of a load of little curves. If you want to draw it for a trillion data points, just break them up and accumulate the density heights. ("Just," he says!) If you want a real-time slice density for the last ten seconds of stock market trades, batch them up by the second (for example) and add the batches together. Every second, add the new one and subtract the one from ten seconds ago. Simples.
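
To make that concrete, here's a sketch of the accumulate-and-subtract idea. It's my own illustration under assumed details (Gaussian kernel, fixed bandwidth, 512-point grid), not production streaming code:

```r
# Kernel heights on a shared grid just add up, so batches can be merged,
# and old batches retired, without revisiting the raw data
grid <- seq(-5, 5, length.out = 512)
bw   <- 0.3  # bandwidth must stay fixed so heights from different batches add up consistently

batch_heights <- function(x, grid, bw) {
  # sum of Gaussian kernels from one batch, evaluated on the shared grid
  rowSums(sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw)))
}

# rolling ten-second window: one height vector per second, oldest overwritten
ring <- vector("list", 10)
for (t in 1:30) {
  new_batch <- rnorm(100)  # stand-in for one second of trades
  ring[[(t - 1) %% 10 + 1]] <- batch_heights(new_batch, grid, bw)
  heights <- Reduce(`+`, Filter(Negate(is.null), ring))
  # plot(grid, heights, type = "l")  # the last ten seconds' slice of density
}
```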

So, the 3-D object is important because it leverages a load of cognitive power we have for seeing things in the real world, and that makes the perception of pattern in the dataviz easier for the reader. Things visible and invisible.

There's a related question I want to expand on another time: when does a stacked chart become small multiples? This is not just semantics, because the two get read in different ways.

I found it at http://blog.nycdatascience.com/student-works/forecasting-economic-risk-eu-2020/ and they got it from http://www.mckinsey.com/global-themes/employment-and-growth/new-priorities-for-the-european-union-at-60

But there are different ways of measuring and counting events, so the authors made an interactive web page showing these as a sensitivity analysis. Hooray!

It's a pity the Lancet paid it such lip service, tucking it away as a link in the margin of page 45. Boo!

I found the transitions in the table of patients at risk weird; I guess that's the d3 transition deciding to move the numbers horizontally, and it might be clearer to fade them out, remove them, then put them back from scratch. It's also clear that Mike Bostock never had to deal with step functions in transition. But otherwise it's a really nice example of how trials can provide more layers of info.

Of course, people have been doing this for ages by superimposing objects rather than colouring in (because colour is, you know, so beguiling a visual parameter to mess around with, yet so poorly perceived). Bertin's much-quoted and less-read book has many such examples, which mostly fall flat in my view. As he wrote: "It is the designer's duty … to flirt with ambiguity without succumbing to it". Meh!

Students of statistics or data science will typically encounter some discomfiting jumps in attitude as their course progresses. First, they may have a lot of probability theory and some likelihood-based inference for rather contrived problems, which will remind them of their maths classes at school. Ah, they think, *I know how to do this*. *I learn the tricks to manipulate the symbols and get to the QED*. Then they suddenly find themselves in a course that provides tables of data and asks them to analyse and interpret. Suddenly it's a practical course that connects to the real world and, for the most part, leaves the maths behind. Now there's no QED given, and no tricks. The assessments are suddenly more like those in humanities subjects: there's no right or wrong, and it's the coherence of their argument that matters. Now they have to remember which options to tick in their preferred stats software. They might think: *why did we do the mathematical parts of this course at all if we're not going to use them?* Next, for some, come machine learning methods. Now the inference and asymptotic assurances are not just hidden in the cogs of the computer but actually absent. *How do I know the random forest isn't giving me the wrong answer?* You don't. It seems at first that when the problem gets really hard, like 21st-century-hard, land-a-job-at-Google-hard, we give up on stats as an interesting mental exercise from the 1930s in favour of "unreasonably effective" heuristics and greedy algorithms.

One really nice thing they do in CASI is to emphasise that all estimation procedures, from sample standard deviations to GAMs, are algorithms. The inference (I prefer to say "uncertainty") for those algorithms follows later in the history of the subject. The 1930s methods have had enough time to work out their inference by now, but other methods are still developing their inferential procedures. This unifies things rather better, but most teaching has yet to catch up. One problem is that almost all the effort of reformers following George Cobb, Joan Garfield and others has gone into the very early introduction to the subject. That's probably the right place to fix first, but we need to broaden out and fix wider data science courses now.
