Last year Nick Cox pointed out to me that the only regular (i.e. all sides the same length, all angles the same) two-dimensional shapes that tessellate (i.e. fill up the space without leaving gaps or overlapping) are those with 3, 4 and 6 sides: the equilateral triangle, the square and the regular hexagon.
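You can check that fact with a few lines of code: a regular n-gon can tile the plane edge-to-edge only if a whole number of its interior angles fits around a point, i.e. the interior angle divides 360 degrees. (A quick Python sketch; the `tiles_plane` helper is mine, not anything standard.)

```python
from fractions import Fraction

def tiles_plane(n):
    """True if a whole number of regular n-gons can meet around a point."""
    interior = Fraction((n - 2) * 180, n)   # interior angle of a regular n-gon
    return 360 % interior == 0

print([n for n in range(3, 13) if tiles_plane(n)])   # only 3, 4 and 6 pass
```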

Then I started thinking about this in the big data context. Suppose I have to reduce my data to make it vizable, you know, so I, the feeble human, can explore it and see what’s going on. That is time-consuming and hard to program, so I want to do it only once if possible. How should I bin the data to keep my options open for later dataviz?

Here’s an example of what I mean. If I have two variables, like latitude and longitude of NYC taxi pickups, and I count them on a fine square grid, I can store that matrix of counts locally. Even if it is a big grid, like 10,000 by 10,000, that will still be 100,000,000 numbers, which is quite manageable. Later, maybe I want to draw it on a 1000 by 1000 grid, so I just add together the counts in adjacent blocks of 100 small squares (10 by 10) to make one big square. That runs quickly.

```
// pseudocode: aggregate into a square grid
int[n_rows, n_cols] count_matrix

for i in 1:n_data {
    int rownum = floor(y[i] / row_height)
    int colnum = floor(x[i] / col_width)
    count_matrix[rownum, colnum] = count_matrix[rownum, colnum] + 1
}
```

```
// pseudocode: aggregate into a coarser square grid
int reduction = 10 // each new square is the sum of a 10x10 area of the old grid
int new_rows = n_rows / reduction // assuming n_rows is a multiple of reduction
int new_cols = n_cols / reduction
int[new_rows, new_cols] new_matrix

for i in 0:new_rows-1 {      // if our language is zero-indexed
    for j in 0:new_cols-1 {
        int top_corner = i * reduction
        int left_corner = j * reduction
        new_matrix[i, j] = sum(count_matrix[top_corner:(top_corner + reduction - 1),
                                            left_corner:(left_corner + reduction - 1)])
    }
}
```
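If the pseudocode feels abstract, here is a minimal runnable sketch of both steps in Python with numpy, using made-up uniform points in place of taxi pickups and grid sizes scaled down from the 10,000 by 10,000 example; `histogram2d` does the fine binning and a reshape-and-sum does the coarsening.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)   # stand-ins for longitude, rescaled to [0, 1)
y = rng.uniform(0, 1, 100_000)   # stand-ins for latitude, rescaled to [0, 1)

# Step 1: count points on a fine square grid (the floor() is inside histogram2d).
n_rows, n_cols = 1_000, 1_000
counts, _, _ = np.histogram2d(y, x, bins=[n_rows, n_cols],
                              range=[[0, 1], [0, 1]])

# Step 2: coarsen by summing each 10x10 block of small squares.
reduction = 10
coarse = (counts
          .reshape(n_rows // reduction, reduction, n_cols // reduction, reduction)
          .sum(axis=(1, 3)))
# coarse is now 100 x 100 and preserves the total count.
```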

So I was thinking: can you combine shapes together easily? This called for some geometry, which was never my thing. Here we go.

Triangles can be combined in sets of four to make bigger triangles, or in sets of six to make a hexagon. Squares combine in sets of four to make bigger squares, and so on. Hexagons don’t combine to make any of these shapes. So, what can I conclude? Bin your data in two dimensions using small squares or triangles, bearing in mind that triangles will also give you hexbins if you want them, but there is no crossing from tri-hex to square or back again: two triangles make a rhombus, but no number of them makes a square.

Now, what about higher-dimensional binning? It seems that the only regular “space-filling polyhedron” in three dimensions is the cube (cf. https://en.wikipedia.org/wiki/Honeycomb_(geometry)). There are some other shapes that fill space by themselves but have a mixture of face shapes; you probably don’t want to tangle with those, because of the difficulty of determining whether a point is in this polyhedron or that polyhedron. There are also mixtures of polyhedra which together fill space, but that’s also unsatisfying for this application, because you want flexibility in aggregating them to larger polyhedra, and consistency when taking slices through them for visualisation. So, use cubes. I suppose if you really wanted a hexbin (and it’s a good visual format!), you should do 2-D triangle or hexagon bins from the outset; in 3-D these could be stacked right prisms (think of the Giant’s Causeway) which later get aggregated for a marginal plot or filtered for a conditional slice.

More than three dimensions eluded my non-existent powers of geometrical thinking, but it seems to me that hypercubes always pay off. Not only can you aggregate them however you choose, but it’s also easy to allocate points to hypercubes: you just add more lines (those with the floor() function) to the code above, and more dimensions to the array count_matrix.
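A minimal sketch of that k-dimensional extension, again in Python with numpy and made-up data in the unit hypercube: one floor() per dimension, and a k-dimensional count_matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4                                       # number of dimensions
data = rng.uniform(0, 1, size=(50_000, k))  # made-up points in the unit hypercube

bins_per_dim = 20
# One floor() per dimension, exactly as in the 2-D pseudocode:
idx = np.floor(data * bins_per_dim).astype(int)
idx = np.clip(idx, 0, bins_per_dim - 1)     # guard against a boundary value of 1.0

count_matrix = np.zeros((bins_per_dim,) * k, dtype=int)
np.add.at(count_matrix, tuple(idx.T), 1)    # accumulate one count per point
```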

**Bottom line:** for k-dimensional data, bin in k-hypercubes. But if you know you want triangles or hexbins in 2-D projections or conditional slices, then you’ll have to do that from the outset.

The learning outcomes are:

- understand the differences between a functional language and an imperative language
- know several strengths and weaknesses of both Stata and R
- be able to include chunks of R code inside a Stata do file and have them run from Stata
- be able to include chunks of Stata code inside an R script and have them run from R
- understand the limitations of passing data back and forth between Stata and R, and how to spot problems

If you want to know more, you can email me or get in touch on Twitter. If you want to book a place or ask about practicalities, travel etc, check out the Timberlake page.

If you want to give every personal insight away to some 25-year-old dude in Sunnyvale to dick around with, fine. It’s your choice. It just surprises me that nobody (that I know of) thinks to ask for money in exchange for data. Because it’s worth money to the recipient, right?

VODAFONE SMS: your views matter to us. Please take 2 minutes to complete the quick survey we are sending in the next message.

ME: OK, that’ll be 5 pounds please.

GUY IN SHOP: You’ll get an email asking about your experience in the shop today and how I did – I’d really appreciate it if you could fill that out, and you know, they only take 9 or 10 out of 10 as good.

ME: OK, that’ll be £2.50 please.

COMPANY I SHOP WITH: Hey, it’s such fun shopping with us, right? Why not have even more fun with our app for your smart phone? All it needs is permission to read your files, photos, contacts, GPS location, wi-fi, messages, phone calls, and social media logins. Click here! We’ll have suuuuuch fun together!

ME: No.

You get the idea. I charge Vodafone more because they can afford more and because they are going to use the data against me in future (their raison d’être is to take my money and your money and give it to shareholders, and that is best done by ratcheting up my bill in a data-driven way). I think I can beat them at that game though, otherwise I wouldn’t offer the transaction at all, which is the case with the kick-me sign on your back that is known as installing an app for convenient shopping. The guy in the shop wants to get a raise or promotion one day; well, how about a little down payment on that sweet cash flow, brother? I generally get looked at like I’m a callous weirdo (which may also be true but doesn’t follow logically from the evidence, and yes, I really do ask for money; I’ve not got any so far…). But if everybody said it… Here arises the idea of **data unions**, which as far as I can tell only exists as a bon mot that Pedro Domingos chucked out the other day.

But it’s a good idea. Suppose we shop entirely in cash, but you could block-buy our shopping lists for just 10p per shopping trip, 50p if you want home address postcodes attached. We deactivate our phones’ GPS, but carry basic trackers. You want the data? That’ll be £175 per person per year. All proceeds to me and you. Whaddya say?

So, I’m writing a book on dataviz, as you may know. I just wonder if there are new pieces of work out there where someone – maybe you – fitted a model of some kind (regression, trees, neural network, whatevz) to a big dataset and then visualised it somehow. By big, I mean big, like too numerous to draw the individual data, so you had to do something like bin-summarise-smooth on it. But it’s also interesting if it was big, as in you had to do some kind of map-reduce of sufficient statistics just to get the model done. Perhaps you averaged over random samples, then drew densities of predicted values and residuals for all the data; that’s interesting too. You can email me if you prefer to remain secretive, or just comment and ask me not to approve it for public display.

This is the version sent to Significance magazine — an edited version appeared in the April 2017 issue, and lack of space meant it got a severe pruning down to one good thing, one bad thing, and one funny thing. I understand how magazines work and accept that, but you, learned reader, might prefer to hear all the points and the nuances.

Although I haven’t added hyperlinks, you’ll find all the things I refer to quite easily in your favourite search engine, or possibly your local library.

However implausible it may sound, this collection of reminiscence, musings on the state of the art and advice for young statisticians makes for compelling reading. I suspect most Significance readers will find something of interest in here. There are 52 contributions from eminent statisticians who have won a Committee of Presidents of Statistical Societies (COPSS) award. Each is a short, focussed chapter and so one could even say this is ideal bedtime (or coffee break) reading.

Anyone interested in the history of statistics will know that much has been written about the early days but little about the field since the Second World War. This book goes some way to redress this and is all the more valuable for coming from the horse’s mouth.

If there is a consensus among contributors, it is:

- statistics is exciting
- in fact, statistics is more exciting than ever with more tools and more data, although academia is more pressured
- collaborations are fun and you learn a lot from getting closer to the data source
- careers do not work out as planned
- we have come a very long way in gender equality (but remember that this is a North American book)
- useful analytical methods can be found in completely unrelated applications (“keep your eyes open to synergies between apparently disparate fields”, writes Grace Wahba)
- many of these high-achievers in statistics found initial traditional education in the subject difficult, either because theory seemed so unrelated to practice, or the subject was “a collection of strange recipes … generated by a foreign culture” (Bruce Lindsay)

But statistics was hard work in the old days. Once you read about work in the days before personal computers, you may think twice before cursing the one on your desk. Herman Chernoff recalls rooms of human “computers” inverting matrices of order 12 to ten significant figures on desktop electric calculators. Bruce Lindsay describes the difficulty of having manuscripts prepared by a typist when switching from text to algebra meant a change of typewriter (or at least ‘golf-ball’) mid-page. Dennis Cook followed a yearly cycle of collecting data in the summer and spending all winter analysing it. Juliet Popper Schaeffer recalls the difficulty of obtaining essential research papers in the days before photocopiers.

Discrimination against women was widespread in the American academic job market and quite overt until the Civil Rights Act in 1964. There are jaw-dropping recollections from Juliet Popper Schaeffer, Donna Brogan and Mary Gray, for example: “The IBM interviewer commented that he had never seen such a high [math aptitude] score from any applicant and offered me a secretarial or entry sales post … I was interested in their advertised technical positions… but he simply said that those positions were for males.”

Machine learning methods are a topic of much discussion in statistics today: either a great opportunity or terrible threat or insubstantial hype, depending on whom you ask. In “Past, Present and Future”, some knowledgeable contributors discuss them in depth. Larry Wasserman in particular is keen and suggests that the statistics profession must radically adapt to them or become outmoded. Echoing the many ‘data science Venn diagrams’ to be found online which indicate a meeting point between statistics, computer science and topic expertise, Brad Efron describes statistics as “at the triple point of mathematics, philosophy and science”.

Throughout the 52 chapters, my personal preference was for the recollections and advice. There are some contributions that set out current methodological problems in the author’s own area of interest, and they will interest a much narrower audience. Sometimes, I had the feeling of an unpublished paper sneaking out via the pages of this book, but fortunately these are easily spotted by the extensive algebra. Which brings me to the closing chapter, the shortest of all, from Brad Efron: a list of “thirteen rules for giving a really bad talk”. This made me laugh out loud and should be posted on the walls of all conferences.

I shall leave the final word to Peter Bickel: “We should glory in this time when statistical thinking pervades almost every field of endeavour. It is really a lot of fun.”

This is the version sent to Significance magazine — an edited version appeared in volume 14 issue 4.

Although I haven’t added hyperlinks, you’ll find all the things I refer to quite easily in your favourite search engine, or possibly your local library.

This book (CASI) by two titans of statistical methodology takes a refreshing look at the range of tools available for serious data analysis today. They review frequentist, Bayesian and Fisherian notions of inference and then put them in the context of the computational tools that were available at each point from 1900 to today. Those tools shaped what was possible and how it was attempted.

How to introduce statistics or data science now, in the early 21st century, is not a straightforward challenge. First, traditional statistics curricula get a lot of criticism for being difficult and discouraging students. Secondly, the toolbox that gets used is not just what we call “statistics” but also “machine learning”. On the first problem, there is a broad reform movement building from George Cobb, Joan Garfield and others who shaped the American Statistical Association’s GAISE guidelines for teaching. Some, like Andy Zieffler’s CATALST course, deliberately reverse the traditional order of topics, which we inherited from Snedecor, and they have growing evidence that students benefit from seeing a bigger picture of modelling and prediction first before they get bogged down in conditional probabilities and gradients of the log-likelihood function. On the second point, “data science” courses are hugely popular. Some are just reheated mixtures of old statistics and computing content, but the successful ones really mix the techniques and insights that come from statistics and those that come from computer science.

It’s in this context that CASI enters the fray. Trevor Hastie in particular has written successful books before that brought machine learning methods to a statistical audience. After reviewing the three schools of thought, they get down to the business of “early computer age methods” and then “21st century topics”. They emphasise the fact that the practitioner today, from whatever background, must deal with the challenge of large and heterogeneous data sets. A crucial and unusual distinction is made at the outset: estimators (mathematical procedures to obtain estimates of values we don’t know but would like to know) are also algorithms, which consume data. Inference then follows the algorithm. This unites old ideas, like estimates of the mean being straightforward but the standard deviation requiring some assumptions, with newer ones, like smooth non-parametric regression techniques giving an estimate of a predictive curve, and then the bootstrap giving the uncertainty in that curve, and beyond to methods like neural networks, where inference is generally lacking.
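To make that algorithm-then-inference idea concrete, here is a toy sketch of my own (nothing from the book itself): the median of a made-up skewed sample plays the algorithm, and the bootstrap supplies the inference.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=500)   # made-up skewed sample

# The "algorithm": a point estimate with no tidy textbook standard-error formula.
estimate = np.median(data)

# The "inference": rerun the algorithm on bootstrap resamples of the data.
boot = np.array([np.median(rng.choice(data, size=data.size, replace=True))
                 for _ in range(2_000)])
se = boot.std()   # bootstrap standard error of the median
```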

Theoretical cornerstones are dealt with quickly, not as mathematical constructs of interest for their own sake, but as handy tools for real problems. Cross-validation gets its own chapter: a flexible, practical, computer-intensive topic well known in machine learning but less so in statistics. False discovery rates, ridge regression and the lasso, bootstrapping, trees, random forests, boosting and bagging, neural networks and support vector machines all make an appearance in the 21st century section, with very clear expositions.

CASI could be used for teaching with mathematically confident students in a course that brings together a wide range of techniques. Many of its chapters would certainly provide stimulating reading material for tutorials. Qualified statisticians who want to teach themselves some new methods would benefit too.

**what happened to dataviz of the week?**

It’s over; there are no more weeks. Hand of God has struck the hour, no more pie charts have the power, etc etc. It became a staple of the blog through 2017 but I don’t know if it is really useful to anyone. There are many places where you can hear about new, funky viz; I suggest Twitter + Feedly. It started on my noticeboard when I was a lecturer. Some viz was good and some was bad. I just stuck it up each week without explanation. My colleagues often mis-classified them as good/bad, which was interesting in itself. On the blog, some critique is needed, but (see below) I can’t do so much of that any more.

I think it would be more useful to people to have posts about how to achieve X dataviz goal (where X is not an element of R-bloggers), and thinking about dataviz, weighing up options etc.

You can expect more building from basic components, including R base graphics, SVG and D3. There are no free lunches.

Of course, I have a book in the pipeline, which contains some overarching thoughts on making and critiquing dataviz, and those will get reflected here over the spring and summer.

**Bayes**

I did a little Twitter poll asking what people wanted to hear about from @robertstats, and it was Bayes that came out in front. I was quite surprised by that, but I’m happy to oblige. I won’t try to compete with in-depth technical blogs like Christian Robert’s. The popularity of how-to posts makes me think that I should apply that here. So, there will be approachable posts for people who have passed the absolute beginner stage but are still fumbling through the foothills; I think that’s a gap in the proverbial market.

Expect Stan to feature heavily, plus latent variables and various forms of uncertainty arising from biases and imperfect data.

**criticism**

You might notice that, naked of the university indemnity and legal precedent of academic freedom, I don’t cuss people nearly as much as I used to, and that’s not going to change. I have a kid to feed, you know.

**philosophy of science**

I don’t have much left to say about this, but I think a Platonic style dialogue where some ultra-frequentist gets made to look stupid could be a useful way of showcasing the various reasons why Bayes wins.

I also have notes on “Aboutness” by Stephen Yablo to share with you.

**learning**

The main thrust of my freelance activity is training and coaching, and I’ll talk a bit here about what makes for good learning and teaching in data science / statistics / machine learning. Like Colonel Sanders, I won’t tell you all my secrets, but I might lure you in by revealing that there are some. And if you are a teacher you will probably recognise what I’m getting at.

The coaching front might feature some thoughts on building a career and an identity in this data science age. If you’re a millennial and work in tech, I feel sorry for you, son. But you don’t have to work yourself to an early grave. My homeboy Henry Thoreau might turn up here, or he might not.

Fresh, impactful, memorable, fun, genuinely informative. Canadians can have 100,000 babies at a time eh?

Methodviz (visualisation that explains not one data set or model but an analytical method) of the year goes to Autodesk Research’s gifs where each image has the same summary statistics.

The clever thing about these is the method used to get them all, which you can read here. I have to admit that I belong in the old fart camp who contend that this doesn’t upstage Anscombe, but it is eye-catching and if someone remembers this message, then it’s all been worthwhile. (Tip your xmas hat to Alberto Cairo’s datasaurus while you’re about it.)

Here’s to 2018.


Dorothea Lange’s Censored Photographs of FDR’s Japanese Concentration Camps

Doomsday prep for the super-rich “Some of the wealthiest people in America—in Silicon Valley, New York, and beyond—are getting ready for the crackup of civilization”

The Hermit Who Inadvertently Shaped Climate-Change Science

The gig economy celebrates working yourself to death

A child soldier sees his mother after 6 years. But why doesn’t he speak?

Where Totem Poles Are a Living Art (and Relics Rest in Peace)

Burrowing under luminous ice to retrieve mussels

My family’s slave “She lived with us for 56 years. She raised me and my siblings without pay. I was 11, a typical American kid, before I realized who she was.” Perhaps the most shared and read and talked-about article of the year. You should read it too.

Can an Archive Capture the Scents of an Entire Era? “A molecular record of smells could give future generations a sense of the past.”

Coralroot, a rare beauty among the old graves (and by extension, many, many more Country Diary entries)

Mike Olbinski, storm chasing photographer who I actually found via Outside, but his own website is the mutha lode.

https://www.theguardian.com/business/2017/may/02/where-oil-rigs-go-to-die

Lesotho is a secret mountain bike paradise

The dark evolution of British drinking culture

Just a standard NDA, New Yorker. I wish I could tell you more, but I can’t.

The peculiar melancholy of parking lots

On the water, and into the wild

John Margolies’ photographs of roadside America (whence comes the featured image above; http://www.loc.gov/pictures/item/2017708693/)

Inside the world’s largest walnut forest, Roads And Kingdoms. If you like this you should get Roger Deakin’s book *Wildwood.*

Backcountry drug war, BioGraphic.

Bones of the Tongass, Sierra Magazine. “I wished I could see what it truly meant to leave no trace.”

A photo trip to Antarctica, The Atlantic.

This tweet and everything that follows:

Happy Christmas everybody!

First up is a small multiple choropleth: choose a starting neighbourhood and it shows you whether taxis or bikes win, by destination neighbourhood. This is nice and simple to use but carries a lot of detail. Much more fun than the table of stats that most of us would have reached for first. I particularly like the way that Todd has provided a natural, Bayesian metric in the % chance that a taxi would beat a bike. That is going to make a lot more sense to most readers than a pair of medians or whatever.
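I don’t know Todd’s actual computation, but a metric like this falls naturally out of simulation or posterior draws. A toy Python sketch, with invented lognormal journey-time distributions standing in for the real model:

```python
import numpy as np

rng = np.random.default_rng(2)
# Invented journey-time draws in minutes; the lognormal shapes are
# purely illustrative, not Todd's data or model.
taxi_minutes = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
bike_minutes = rng.lognormal(mean=2.9, sigma=0.3, size=10_000)

# The "% chance a taxi beats a bike" is just the proportion of draws it wins.
p_taxi_wins = np.mean(taxi_minutes < bike_minutes)
print(f"chance a taxi beats a bike: {p_taxi_wins:.0%}")
```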

Next, you get some breakdowns by time of day, direction of travel, etc. We see line charts, bar charts, dot plots with connecting lines — basically the same encoding but different formats to keep it fresh.

Todd finds that taxis have got slower since 2009: 20% slower! But bikes have not. And while taxis, and traffic in general, can have really bad days, bikes whizz past. That makes sense.

It’s all a really nice piece of work with all code on GitHub. Thanks to Xiaodong Cai for pointing it out to me.
