**what happened to dataviz of the week?**

It’s over; there are no more weeks. Hand of God has struck the hour, no more pie charts have the power, etc etc. It became a staple of the blog through 2017, but I don’t know if it was really useful to anyone. There are many places where you can hear about new, funky viz; I suggest Twitter + Feedly. It started on my noticeboard when I was a lecturer: some viz was good and some was bad, and I just stuck it up each week without explanation. My colleagues often mis-classified them as good or bad, which was interesting in itself. On the blog, some critique is needed, but (see below) I can’t do so much of that any more.

I think it would be more useful to people to have posts about how to achieve X dataviz goal (where X is not an element of R-bloggers), and posts thinking through dataviz choices, weighing up the options.

You can expect more building from basic components, including R base graphics, SVG and D3. There are no free lunches.

Of course, I have a book in the pipeline, which contains some overarching thoughts on making and critiquing dataviz, and those will get reflected here over the spring and summer.

**Bayes**

I did a little Twitter poll asking what people wanted to hear about from @robertstats, and Bayes came out in front. I was quite surprised by that, but I’m happy to oblige. I won’t try to compete with in-depth technical blogs like Christian Robert’s. The popularity of how-to posts makes me think I should take the same approach here: approachable posts for people who have passed the absolute beginner stage but are still fumbling through the foothills. I think that’s a gap in the proverbial market.

Expect Stan to feature heavily, plus latent variables and various forms of uncertainty arising from biases and imperfect data.

**criticism**

You might notice that, naked of the university indemnity and legal precedent of academic freedom, I don’t cuss people nearly as much as I used to, and that’s not going to change. I have a kid to feed, you know.

**philosophy of science**

I don’t have much left to say about this, but I think a Platonic-style dialogue where some ultra-frequentist gets made to look stupid could be a useful way of showcasing the various reasons why Bayes wins.

I also have notes on “Aboutness” by Stephen Yablo to share with you.

**learning**

The main thrust of my freelance activity is training and coaching, and I’ll talk a bit here about what makes for good learning and teaching in data science / statistics / machine learning. Like Colonel Sanders, I won’t tell you all my secrets, but I might lure you in by revealing that there are some. And if you are a teacher you will probably recognise what I’m getting at.

The coaching front might feature some thoughts on building a career and an identity in this data science age. If you’re a millennial and work in tech, I feel sorry for you, son. But you don’t have to work yourself to an early grave. My homeboy Henry Thoreau might turn up here, or he might not.


Fresh, impactful, memorable, fun, genuinely informative. Canadians can have 100,000 babies at a time eh?

Methodviz (visualisation that explains not one data set or model but an analytical method) of the year goes to Autodesk Research’s GIFs, in which every image has the same summary statistics.

The clever thing about these is the method used to get them all, which you can read here. I have to admit that I belong in the old-fart camp who contend that this doesn’t upstage Anscombe, but it is eye-catching, and if someone remembers the message, then it’s all been worthwhile. (Tip your xmas hat to Alberto Cairo’s datasaurus while you’re about it.)
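Anscombe’s original point is easy to check for yourself: the first two quartet datasets (standard published values, reproduced here) share their summary statistics to two decimal places despite looking nothing alike when plotted. A quick sketch in Python:

```python
# Anscombe's quartet, sets I and II: same x values, same summary
# statistics, very different shapes when you actually plot them.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I-III
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

# The summaries agree to two decimal places...
print(round(mean(y1), 2), round(mean(y2), 2))   # 7.5 7.5
print(round(var(y1), 2), round(var(y2), 2))     # 4.13 4.13
# ...but y1 is roughly linear in x, while y2 is a clean parabola.
```

The datasaurus dozen plays the same game, just with an optimisation routine hunting for eye-catching shapes under the summary-statistic constraints.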

Here’s to 2018.


Dorothea Lange’s Censored Photographs of FDR’s Japanese Concentration Camps

Doomsday prep for the super-rich: “Some of the wealthiest people in America, in Silicon Valley, New York, and beyond, are getting ready for the crackup of civilization”

The Hermit Who Inadvertently Shaped Climate-Change Science

The gig economy celebrates working yourself to death

A child soldier sees his mother after 6 years. But why doesn’t he speak?

Where Totem Poles Are a Living Art (and Relics Rest in Peace)

Burrowing under luminous ice to retrieve mussels

My family’s slave: “She lived with us for 56 years. She raised me and my siblings without pay. I was 11, a typical American kid, before I realized who she was.” Perhaps the most shared and read and talked-about article of the year. You should read it too.

Can an Archive Capture the Scents of an Entire Era? “A molecular record of smells could give future generations a sense of the past.”

Coralroot, a rare beauty among the old graves (and by extension, many, many more Country Diary entries)

Mike Olbinski, storm chasing photographer who I actually found via Outside, but his own website is the mutha lode.

https://www.theguardian.com/business/2017/may/02/where-oil-rigs-go-to-die

Lesotho is a secret mountain bike paradise

The dark evolution of British drinking culture

Just a standard NDA, New Yorker. I wish I could tell you more, but I can’t.

The peculiar melancholy of parking lots

On the water, and into the wild

John Margolies’ photographs of roadside America (whence comes the featured image above;ย http://www.loc.gov/pictures/item/2017708693/)

Inside the world’s largest walnut forest, Roads And Kingdoms. If you like this you should get Roger Deakin’s book *Wildwood.*

Backcountry drug war, BioGraphic.

Bones of the Tongass, Sierra Magazine. “I wished I could see what it truly meant to leave no trace.”

A photo trip to Antarctica, The Atlantic.

This tweet and everything that follows:

Happy Christmas everybody!


First up is a small multiple choropleth: choose a starting neighbourhood and it shows you whether taxis or bikes win, by destination neighbourhood. This is nice and simple to use but carries a lot of detail. Much more fun than the table of stats that most of us would have reached for first. I particularly like the way that Todd has provided a natural, Bayesian metric in the % chance that a taxi would beat a bike. That is going to make a lot more sense to most readers than a pair of medians or whatever.
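That kind of metric is cheap to compute once you have simulated trip times. Here is a minimal sketch of the mechanics; the trip times below are invented normal draws purely for illustration (Todd’s real data and code are on GitHub):

```python
import random

random.seed(1)

# Invented door-to-door times (minutes) for one neighbourhood pair;
# stand-ins for draws from distributions fitted to real trip data.
taxi = [random.gauss(22, 6) for _ in range(10_000)]
bike = [random.gauss(20, 3) for _ in range(10_000)]

# "% chance a taxi beats a bike": pair the draws and count taxi wins.
p_taxi_wins = sum(t < b for t, b in zip(taxi, bike)) / len(taxi)
print(f"taxi beats bike in {100 * p_taxi_wins:.0f}% of simulated trips")
```

A single probability like this is much easier for a reader to act on than a pair of medians, which is exactly why it works so well in the choropleth.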

Next, you get some breakdowns by time of day, direction of travel, etc. We see line charts, bar charts, dot plots with connecting lines — basically the same encoding but different formats to keep it fresh.

Todd finds that taxis have got slower since 2009: 20% slower! But bikes have not. And while taxis, and traffic in general, can have really bad days, bikes whizz past. That makes sense.

It’s all a really nice piece of work with all code on GitHub. Thanks to Xiaodong Cai for pointing it out to me.


Three lessons can be learnt, I think:

- Colour-blindness matters. But it’s not the only consideration, in the same way that making data visualisations at all should not be construed as an insult to people who are blind.
- People who already have a good grasp and their own mental model of the subject at hand are probably not going to like what you do to make the subject more accessible. That tells you very little about the success of your work.
- Small multiples are an excellent solution to many problems of information overload. Rather than have all the colours in one image, you could have multiple images, highlighting just one category each time. An interactive web page would be a good way of presenting that.


The paper is a representative example of this sort of technique, so I thought I would just explain it and its visualisations this week. PCA is sometimes included in a “machine learning” toolbox, and its capacity to crunch through many variables makes it appealing to fans of contemporary predictive modelling in the “data science” school.

You can think of the data as a cloud of 3,000 dots. If you know a horizontal and a vertical location for each dot, you can draw a scatterplot; with depth, height and width, you could place each dot in a 3-dimensional space. Here, each dot has 500,000 co-ordinates (genetic features measured at certain points in the genome), so it has a location in 500,000-dimensional space. That’s hard to visualise, so some compromise will be needed.

We want to show as much information as possible. But what do we mean by “information”? If we can quantify that concept, we can find a way of looking at the cloud in just two dimensions — projecting it onto a piece of paper, or photographing it from a certain angle — that maximises the “information”. Well, one obvious candidate is the variance, a statistical measure of how spread out data are along one axis. There are other such measures, but variance has a nice property and relates to PCA.

First, though, imagine a 3-dimensional cloud of data shaped like a pitta bread. The three axes have different variances. If you wanted to take a photograph of the pitta and show people what it is like, it would be odd to take the photograph end-on so that it looked like a long pencil-shaped finger of bread (right, below). You’d be losing a lot of information. However, there is of course no ideal way of doing this photograph: even taking it so that the two highest variance dimensions are visible (left, below) loses that little bit contained in the depth of the pitta.
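The best “photograph” in this variance sense is exactly what PCA finds: the top eigenvectors of the covariance matrix. A minimal sketch in Python/numpy (this blog leans towards R, but the idea carries over; the pitta-shaped cloud is made up, not the genetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy pitta-shaped cloud: one long axis, one wide, one thin.
X = rng.normal(size=(1000, 3)) * [5.0, 2.0, 0.5]
Xc = X - X.mean(axis=0)

# Principal axes are eigenvectors of the covariance matrix,
# ordered by eigenvalue (= variance along that axis).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending
pc1 = eigvecs[:, -1]

# The variance of the data projected onto PC1 is the top eigenvalue;
# no other unit direction captures more.
proj = Xc @ pc1
print(np.var(proj, ddof=1), eigvals[-1])  # both roughly 25 (= 5 squared)
```

Taking the top two eigenvectors gives you the face-on photograph of the pitta; the end-on pencil view corresponds to picking low-variance axes instead.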

So, if you have to reduce dimensionality, you want to be sure to include dimensions that have high variance, without much distortion. The standard deviation is the root-mean-square distance from the mean, so variance is related to what matrix algebra calls the L2 norm. I bet you’re thinking here about Pythagoras, and how the squared distances in the x and y directions add up to the squared distance straight across them (the hypotenuse). This is what’s neat about using the variance as a measure of information in dimension reduction: the variances of the individual dimensions add together to give the total variance, which relates to the L2 (Frobenius) norm of the whole centred data matrix. You might recall some tedious stuff like this from Analysis Of Variance if you studied Old-Fashioned Statistics 101. By showing some dimensions, you are partitioning the total variance into seen and unseen.
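That additivity is easy to verify numerically: the eigenvalues of the covariance matrix (the PC variances) sum to the total of the per-dimension variances, which is also the squared Frobenius/L2 norm of the centred data matrix divided by n − 1. A numpy sketch with toy data (sizes and scales invented):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) * [3.0, 2.0, 1.0, 0.5]
Xc = X - X.mean(axis=0)
n = len(Xc)

cov = np.cov(Xc, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)

total_var = np.trace(cov)                        # sum of per-column variances
frob = np.linalg.norm(Xc, "fro") ** 2 / (n - 1)  # squared L2 norm of the matrix

# All three are the same number: the information budget being partitioned.
print(eigvals.sum(), total_var, frob)

# Share of the budget "seen" by keeping the top two PCs:
print(eigvals[-2:].sum() / eigvals.sum())
```

That last ratio is the familiar “proportion of variance explained” that PCA write-ups quote for their 2-dimensional pictures.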

So, let’s look at what they got out of this genetic data.

The 2-dimensional reduction caught a lot of people’s eyes because it nearly re-creates a geographical map of Europe based solely on the genetic data. There are some nice features, like the elongation of Italy from France and Switzerland down to Greece and Cyprus.

Here are some more dataviz thoughts on it:

- It’s difficult to know what to do in a scatterplot with many categories like this. Including a two-letter code, if it is unambiguous, is a pretty good option.
- The letters have to be quite small and are jumbled on top of one another. I wonder if it would help to show a random sample in one version of this scatterplot. Or maybe convex hulls (bagplot, if you prefer) in another.
- The colours mean nothing to me (Vienna)
- The big opaque markers for mean/median/whatever are far too overbearing
- Why does Scotland get special treatment? I mean, that’s nice of the authors, but I don’t want the rest of Europe getting envious.
- Some countries have few data; reducing the size of the mean/median/whatever marker would help convey this.
- I always want to see some measure of discarded information per observation in dimension reduction. Perhaps the same scatterplot with L1/L2 norm of the other principal components encoded as colour of the marker, or something like that. I want to know if there are pockets of data where this summary doesn’t do them justice.
- The authors write “The rotation of axes used in Fig. 1 is 16 degrees counterclockwise and was determined by finding the angle that maximizes the summed correlation of the median PC1 and PC2 values with the latitude and longitude of each country.” Fair enough, but why has nobody ever heard of Procrustes analysis? Go Google it. Procrustes is the basis of the analysis of shapes. If you can match the countries in principal component space with the countries in Europe (under some map projection, anyway), then you can calculate the isotropic or anisotropic transformation required to get them as close to each other as possible. It’s especially useful if you have more than one dimension reduction and want to compare them quantitatively in terms of how well they match the gold standard (the map).
- Before you do any more criticism, read the paper, at least the intro section. The authors acknowledge and explain many limitations in a very clear and interesting way.
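Since I brought up Procrustes: it really is a one-liner these days. A toy sketch in Python using scipy’s implementation, with invented coordinates standing in for the map and the PC space:

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(2)
# Stand-in "map" coordinates for a handful of countries...
map_xy = rng.normal(size=(12, 2))

# ...and "PC" coordinates that are a rotated, scaled, noisy copy of them.
angle = np.deg2rad(16)
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
pc_xy = 3.0 * map_xy @ R.T + rng.normal(scale=0.01, size=(12, 2))

# Procrustes finds the best translation/scaling/rotation match and
# reports the leftover mismatch (0 = perfect, 1 = no better than chance).
_, _, disparity = procrustes(map_xy, pc_xy)
print(disparity)  # close to 0: the PC plot really is nearly a map
```

The disparity number is exactly the kind of quantitative “how map-like is this projection?” summary you would want when comparing several dimension reductions against the same gold standard.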

There are many other ways of reducing dimensionality, like correspondence analysis or t-SNE. I’ll come back to them here one day. They all get a broad-brush overview in my dataviz book which is planned to come out next year.


This is just a screenshot, so get over there and have a look.


- Very common (affects more than 1 in 10 people)
- Common (affects less than 1 in 10 people)
- Uncommon (affects less than 1 in 100 people)

I think that is useful, in that the words are pretty meaningless in themselves and we know that natural frequencies help understanding. I don’t know if this is a standard format. But, alas, I have some gripes, and they are quite serious.

- As defined above, “uncommon” is a subset of “common”. Shurely some mishtake.
- There are better words than these. Why have “common” in every one? What next: “Really not at all common in the least”?
- Why not have a waffle plot, for goodness sake? It’s not hard to do.
- There is one list like this for side effects reported in clinical trials, and another for side effects reported via the Black Triangle (that’s like the Bermuda Triangle, but less, you know, retro surf rock fun, and more amputations and brain damage). Why? Nobody gives a monkey’s where it happened.
- There’s then another list that has no frequency, but begins “Next to the above side effects, the following side effects occurred occasionally during use of Fluarix trivalent influenza vaccine”. OK, firstly, we don’t say “next to the above” in English, unless something is physically next to the above, like above and slightly to the right, and I think this publication warrants more effort than just getting the work experience guy to put it through Google Übersetzer, so that undermines my faith in the rest of the publication. Secondly, “occasionally”: what the f does that mean? “Failure of the circulatory system” is in there, so I’d kinda like to know a bit more. Thirdly, is trivalent the same as Tetra? Something from school days tells me it’s not. Is the side effect profile relevant, or not? I don’t know.
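And for the record, a waffle plot really isn’t hard. Even a bare-bones text version gets the natural-frequency message across; here is a Python sketch (the 15-in-100 frequency is invented for illustration):

```python
def waffle(per_100, filled="#", empty=".", cols=10):
    """A 10x10 text waffle: each cell is one person in 100."""
    cells = [filled] * per_100 + [empty] * (100 - per_100)
    rows = ["".join(cells[r * cols:(r + 1) * cols])
            for r in range(100 // cols)]
    return "\n".join(rows)

# A "very common" side effect: say 15 people in 100 affected.
print(waffle(15))
```

Fifteen filled squares in a grid of a hundred is immediate in a way that “common (affects less than 1 in 10 people)” never will be.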

Anyway, I survived and had no side effects at all. We won’t have any of this after Brexit, you know. It’ll be more Wild West. Get jabbed, you wimps.


Personally, I’m a KFC purist. Let it be the Wicked Zinger or none at all!


This shows you something in one compact form that you don’t get to grasp at all easily otherwise. We have to fit these models by discrete steps, and we want those steps to function as samples from a posterior distribution. Generally, when a proposed step sucks, the algorithm rejects it. But sometimes the steps it takes go *divergent*: moving away from the posterior distribution and into trouble. You want to know when that happens (they get flagged up as divergent by Stan), and also to have some idea of why it’s happening.

So, here the divergent samples are highlighted in green. Each one is a vector with a value for each parameter in the model, and also lp__, which is a log-posterior probability or perhaps that multiplied by some negative number from the looks of it. The finger of suspicion points at -2, but I leave you to look it up in the Stan manual if you like.

The caption says it all. It’s also indicative that the green lines are flat across the thetas because they are forced to be very close to each other by the small tau. Any discrepancy from the observed values then gets offset by the grand mean mu; I suspect the height of the theta region is inversely correlated to the height of the mu for each green line, but we can’t see that in this plot. I seem to recall that ShinyStan gives you some pairwise correlations for divergent samples only, but that’s not as compelling an image as this.

What do you do once you find this pattern? I would consider squeezing the tau (standard deviation) prior so that it can’t go near zero, because often we know it can’t be zero. Just nudge up the bottom end of a uniform prior, or truncate some other weird thing (half-Cauchy, mmm) with a lower bound in the declaration: real<lower=0.1> tau; or some such.

By the way, I like the lurid green against the black. It’s clear, and it’s also the colour scheme that computers are meant to have.

A semantic note: I forget which, but it is Stan policy to refer to a vector of (hyper)parameter values generated by one step of an MCMC algorithm as either a draw or a sample, and then the individual components of that vector as the other. You’ll have to just put up with me if I wrote it the wrong way round for your liking.

Declaration of interest: I’m a Stan-dev, albeit a pretty insignificant one. I like Stan and use it a lot. I don’t have any financial ties with Stan Group or the various devs though, so I don’t have to curry favour.
