I saw a few recurring problems.

- Their employing organisation refused any transmission; I had to turn up in person with a memory stick. That is workable, but often their local IT department had blocked encrypted USB drives, so the data had to be carried across the country in my pocket on an unencrypted drive. This is very risky and you shouldn’t do it!
- They were forbidden from using email and had to use something like Google Drive instead. Well, you just gave those medical records to Google, who are under no obligation to delete them when you click delete on your remote copy. They also share data with intelligence agencies, which is done for good reasons, but you are responsible for the protection of sensitive data and you can’t bury your head in the sand about this sharing.
- They added a password to an Excel spreadsheet and sent me that. Remember that they are often using old versions of Office, so the encryption is very weak and easily cracked. Then they’d send the password in another email. I don’t know where folk wisdom like that comes from. Emails reside indefinitely on a whole chain of servers from thee to me, which might criss-cross the world along the way, getting tapped as they go. So Johnny Hacker can just pick up both emails and type in your password.
- They asked my advice, and I sent them some links for OpenSSL, but that can be confusing for people who aren’t total computer nerds, hard to install on Windows networked machines, and so on.

To add to that, there is some doubt as to whether an encryption algorithm like AES might already have been cracked.

To be clear, if you use commercial encryption software, you are probably discharging your duties and won’t get in trouble. One day, that encryption will be broken, so it’s a question of whether you feel that you’ve done what’s required of you and the future is not your problem (go commercial) or that you have to take personal responsibility for posterity too (use xormydata).

The point of xormydata is to make it easy to send and receive data files securely, without any of that silly stuff like sending passwords in a separate email. It doesn’t use a conventional cipher algorithm; it takes your data file at the binary level:

1001 0110

and combines it with a “code file”, which acts like a password:

1100 1011

using exclusive or (XOR). This is a logical operation, like OR and AND. It works like this: if a bit in the data file and the corresponding bit in the code file are the same, the result is 0; if they are different, it is 1. With the bytes above, you would get:

0101 1101
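If you want to try that calculation yourself, here it is in a couple of lines of Python (just an illustration of the operation; xormydata itself is written in C++):

```python
# The example bytes above, combined with Python's XOR operator (^)
data = 0b10010110
code = 0b11001011

result = data ^ code
print(format(result, "08b"))   # prints 01011101

# XOR is its own inverse: applying the same code again recovers the data
assert result ^ code == data
```

That last line is the whole trick: encryption and decryption are the same operation.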

You can read more musing about the process, and how you should use it, on the GitHub page. *You should also read the warning there about how it can get you into serious trouble.*

You download xormydata.cpp from Github, or clone the repository.

That is a C++ source file. You need to compile it so it can run on your computer. Typically, your computer might have a compiler such as clang or g++. If you have Linux or a Mac, you can go straight to the terminal, cd to the folder where you saved xormydata.cpp, and type:

g++ xormydata.cpp -o xormydata

This should produce a new, executable file in the same folder, called xormydata. If you are using Windows, you probably need to install a C++ compiler first, and if you are networked and controlled by a central admin, you’ll probably need their help to get permissions to do this. They will be suspicious. One compromise might be to use an old, unwanted laptop for this encryption and decryption, though obviously that’s a bit of a pain.

Now you are ready to go. You need a collection of code files (see the Github page), and your recipient needs xormydata and their own code files. Crucially, there is no need for you and your recipient to communicate about the code files you are using (like sending passwords).

Alice has a data file (patient_HIV_status.xls), which she wants to send to researcher Bob. They both install xormydata and away they go! Alice is going to use a music mp3 file (Schools_Out.mp3) as the code file, so she types

./xormydata patient_HIV_status.xls Schools_Out.mp3 data_for_Bob.xor 118309

The order of this command is

- the command itself, ./xormydata in Linux/Mac and xormydata.exe in Windows
- the input data file
- the code file
- the name of the desired output file
- optionally, a number indicating where (in bytes) to start using the code file’s 1s and 0s. I strongly recommend you include this because there can be predictable sections of metadata at the start of certain file types.
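For the curious, here is a sketch in Python of what a pass like this does with the start byte (my own illustration of the idea, not the actual xormydata source; the function name and the length check are mine):

```python
def xor_pass(data_path, code_path, out_path, start_byte=0):
    """XOR a data file against a code file, starting at the given byte
    offset into the code file. A sketch of the xormydata idea, assuming
    the code file is at least as long as the data plus the offset."""
    with open(data_path, "rb") as f:
        data = f.read()
    with open(code_path, "rb") as f:
        f.seek(start_byte)            # skip predictable header/metadata bytes
        code = f.read(len(data))
    if len(code) < len(data):
        raise ValueError("code file too short for this data file and offset")
    with open(out_path, "wb") as f:
        f.write(bytes(d ^ c for d, c in zip(data, code)))
```

Running the same function again on the output, with the same code file and the same start byte, recovers the original file, because XOR is its own inverse.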

Now, she sends Bob “data_for_Bob.xor”.

Bob is using video files as codes. He types

./xormydata data_for_Bob.xor Go_Pro_commuting.avi data_back_to_Alice.xor 7199003

The file is now double-encrypted, with both Alice’s code file and Bob’s code file applied. Alice removes her code thus:

./xormydata data_back_to_Alice.xor Schools_Out.mp3 final_data_to_Bob.xor 118309

Now, it just has Bob’s code applied, and she sends it back to him. He types:

./xormydata final_data_to_Bob.xor Go_Pro_commuting.avi patient_HIV_status.xls 7199003

and the original file is revealed. This is a *triple-pass* system, which is simple (at the cost of sending the file three times) and requires no handing over of passwords and such, but it is not perfect. Charles can intercept Alice’s emails and pretend to be Bob (a man-in-the-middle attack), or Charles can just go snooping on their email servers afterwards; by XOR’ing the three encrypted files together the right way, even without knowing the code files, he can get the original data back. So, **if you are worried about people intercepting your stuff and trying hard to break into it, then you probably shouldn’t be using xormydata**. I suggest you don’t use vanilla email to send your XOR’d files, but rather an end-to-end encrypted service like ProtonMail. That will ensure that your data-transmitting messages are indistinguishable from the ones where you discuss where to go for your colleague’s leaving party.

Also, this is intended as a *one-time pad*, which means you use that code file once only (or, at least, that code file at that start byte). You should keep track of the pairings of code files and data files so you can get them back later, and of course, don’t store that list somewhere where people can get at it. Does it need to be digital at all? Can you just write it in a notebook?

**How do I know I can trust you, Mr Grant?** You don’t; that’s life, kid. But you can read the source code; it’s only 120 lines.

**If this is so simple, and you’re, like, not even a professional programmer, how come nobody else is doing it already?** I don’t know. The crypto world generally went off triple-pass systems decades back, because of the risk of a man-in-the-middle attack. It’s not cool.

**This is still hard work … isn’t there something with just a one-click option?** Not if you want it to be secure into the future, and secure from even the big guys. There’s no free lunch.

**My organisation wants to use a trusted commercial package instead; what can I do?** Not a lot in my experience, though I suppose you could xormydata it and then put it through the commercial package.

**Isn’t this going to be used by bad guys too?** I hope not, but potentially, yes. The same way that you can use a hammer to build a hospital or whack someone over the head. This is technology; if we avoided risk of abuse we would not even have adopted the flint hand axe.

This developed out of my Masters dissertation in the Medical Statistics course at the London School of Hygiene and Tropical Medicine. I was comparing different composite measures of hospital quality, and then I went on to explore ways of assessing and visualising the uncertainty in those measures.

**What are composite performance indicators?**

In the context of New Public Management, we have a bunch of hospitals (you can substitute schools, prisons, privatised railways or privatised deportation agencies or whatever), and politicians have set some very broad-brush goals for them (perhaps, that they should have low mortality and low re-admission rates, and that they should reduce any debt year-on-year). Some agency or Death Panel (the sort of thing I used to do for a living When We Were Very Young) expands this into some measurable indicators. They might have to prioritise things so that it isn’t too burdensome, and they end up with things like:

- % of patients with fungal toenail infections seen by a fungo-podiatrist within 24 hours of being diagnosed
- number of nurses per patient on the fungal toenail infection ward
- % of patients turning up a second time for their FTI, after you said you’d fixed it

(with apologies to anyone who suffers from fungal infections in the toenail, and feels I am making light of their plight; someone had to take the fall (why not you?))

Great, now we have three numbers but someone is sure to say that it doesn’t help patients choose a hospital and doesn’t help funders direct the money to the best performers. You might be tempted to make a composite indicator by some mathematical process. It can often be as crude as averaging them.

One more thing I’ll mention here is that, following Donabedian, it is typical to classify indicators as structure (like the 2nd one above, measuring the facilities), process (like the 1st one, measuring whether you do the right things), or outcome (like the 3rd, measuring how the patient is doing after your care).

__Sources of uncertainty__

**Sampling error**

The most obvious way in which your composite indicator can give you the wrong answer is because it is assessed on the basis of a sample of patients, and not all of them. This is sampling error, and we have a lot of statistical theory to tell us how big it might be. But there are other problems too.

**Order of averaging**

Reeves and colleagues wrote a paper in 2007 which hardly anyone has heard of — but they should have. They explored what happens when you have multiple indicators assessed on multiple patients, as is often the case. Do you summarise the indicators into one number for each patient, and then summarise the patients, or do it the other way round? It turns out that you can get quite different composite scores.
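Here is a toy example of the phenomenon (my own illustration, not data from the Reeves paper), where missing data make the two orders of averaging disagree:

```python
# Three patients, two indicators (1 = standard met, 0 = not met,
# None = not assessed for this patient). Entirely made-up numbers.
patients = [
    {"ind1": 1, "ind2": 1},
    {"ind1": 0, "ind2": None},
    {"ind1": 0, "ind2": None},
]

def mean(xs):
    xs = [x for x in xs if x is not None]
    return sum(xs) / len(xs)

# Order 1: summarise each patient first, then average the patients
per_patient = [mean(p.values()) for p in patients]
score1 = mean(per_patient)                       # (1 + 0 + 0) / 3 = 0.333...

# Order 2: summarise each indicator first, then average the indicators
per_indicator = [mean(p["ind1"] for p in patients),
                 mean(p["ind2"] for p in patients)]
score2 = mean(per_indicator)                     # (1/3 + 1) / 2 = 0.666...
```

Same patients, same indicators, yet one composite score is double the other.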

**Weighting and other calculations**

To combine your indicators, you have some formula that takes multiple numbers as input and produces one number as output. That formula might give more weight to one input than another. You could choose weights on the basis of clinical importance, or you could opt for a variance-maximising summary such as the first principal component. Or you might introduce changing implicit weights through steps like dichotomising some of the inputs before averaging them.

That choice obviously affects the composite scores. The tricky thing is that you cannot avoid a judgement of relative weights. Even if you just average the inputs, you will still be giving more weight to some than others, specifically, those with higher inter-hospital variance will come to have a bigger impact on ranking. *There is no value-free composite*.
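A tiny made-up example of that implicit weighting:

```python
# Hypothetical scores for three hospitals on two indicators (0-100 scale).
# Indicator A barely varies between hospitals; indicator B varies a lot.
ind_A = {"H1": 52, "H2": 51, "H3": 50}
ind_B = {"H1": 30, "H2": 55, "H3": 80}

# A simple unweighted average of the two indicators...
composite = {h: (ind_A[h] + ind_B[h]) / 2 for h in ind_A}

# ...but the resulting ranking is exactly the ranking on B,
# even though H3 is the *worst* hospital on A.
ranking = sorted(composite, key=composite.get, reverse=True)
print(ranking)   # prints ['H3', 'H2', 'H1']
```

The "neutral" average has quietly decided that indicator B is the one that matters, purely because it varies more.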

__Poster__

So, I made a poster and it was shown at a visualisation conference at the Open University in Milton Keynes in 2011. And here it is below. I haven’t managed to do anything further on this subject since then. If you would like to take it on, feel free. Get in touch if you want to discuss it.

Bottom line: there are none (at least in 2013). I then looked at codes used by GPs (family doctors) in the UK for dementia and incontinence, which I had analysed with colleagues. I found some variation by GP but again there wasn’t enough appetite to take it further. I don’t have those data or the output of that analysis any more. That’s the price of confidentiality.

I also organised a session at the RSS conference in Sheffield on “Checking and cleaning in big data”. It was not very well attended, as everyone had probably gone to hear about something trendier. But the people who were there appreciated the problem and wanted to learn from experts, and that was pleasing. My invited speakers were Ton de Waal from Statistics Netherlands and Liz Ford from Brighton & Sussex Medical School. You should look them up if you are into this kind of thing. I had someone else from Reuters lined up to talk about automated processing of text streams in real time, but they moved jobs and were contractually gagged, alas.

Anyway, here’s the write-up on the preliminary review. You might find it stimulating. I think it’s an interesting and under-valued avenue for research. I would have liked to have developed some Bayesian model that incorporated the hierarchical structure of the data by the professional doing the coding, and then included latent variables for coding habits. These could have been developed from a preliminary study to hand-classify coding habits and maybe dimension-reduce them into a manageable number of factors.

Over to you now.

**Data linkage**

The goal of data linkage is to combine information from different databases into one. When there is no unique identifying variable for each subject, special techniques have to be employed to find likely matches and to obtain unbiased results from any analysis that follows. Established data linkage methods, whether probabilistic or not, typically lead to the creation of a single linked dataset which is then analysed as if it were perfectly matched. This effectively ignores any uncertainty arising from the matching, and can introduce bias if the incorrect matches differ from the correct ones in terms of some of the variables used in the analysis. However, Bayesian approaches by McGlincy (“A Bayesian Record Linkage Methodology for Multiple Imputation of Missing Links”, 2004) and Goldstein, Harron and Wade (“The analysis of record-linked data using multiple imputation with data value priors”, 2012) have capitalised on the ease with which computational methods such as MCMC can perform analysis and editing/imputation in a single step. Both approaches allow data to be imputed from conditional distributions if no match is sufficiently probable. Goldstein, Harron and Wade used a multiple imputation approach to create several potential matched datasets in order to capture the uncertainty that arises from the matching process. In none of these papers is there any mention of the possibility of human coding necessitating a multilevel structure to the linkage probabilities and weights.

**Automated edit/impute procedures**

Large surveys require a computerised approach to checking data for errors and correcting them where possible. A statistical approach can be traced back to Fellegi and Holt’s seminal paper (“A Systematic Approach to Automatic Edit and Imputation”, 1976). Census agencies, particularly in the USA and the Netherlands, have led the way in developing methods and software, but adoption among a broader statistics community has been rare. De Waal, Pannekoek and Scholtus (“Handbook of statistical data editing”, 2011) provide a comprehensive review of edit/impute methods. A number of common forms of human error are detailed but none of the methods incorporate the identity of the individual recording the data, perhaps because national surveys typically do not have more than one record per individual. There is however a passing reference (p. 28) to a certain type of error being made consistently throughout different variables.

The Fellegi-Holt paradigm aims to produce “internally consistent records, not the construction of a data set that possesses certain distributional properties” (de Waal, Pannekoek, Scholtus, p. 63).

de Waal, Pannekoek and Scholtus note that influential and unusual observations are still generally identified by computer and considered by experts, possibly by contacting the source.

**Coding bias**

Because of the prominence of coding systems in medical data (for example, ICD or Read codes), a search of the Medline database was conducted for the terms “coding bias” (13 retrieved, none relevant) and “interviewer bias” (40 retrieved, likewise).

These searches were augmented by searches for the same terms on Google Scholar and Google Web Search, and consideration of references in any partly relevant documents.

Jameson & Reed (“Payment by results and coding practice in the National Health Service”, 2007) and Joy, Velagala & Akhtar (“Coding: An audit of its accuracy and implications”, 2008) suggest that coding can lead to a considerable change in a healthcare provider’s income within the British NHS’s Payment By Results scheme. This has been emphasised as a system-wide problem by the Audit Commission and NHS Connecting For Health.

Systematic investigation of the bias arising from coding is much rarer. Lindenauer and colleagues (“Association of diagnostic coding with trends in hospitalizations and mortality of patients with pneumonia, 2003-2009.”, 2012) conducted a thorough analysis of coding trends over time for hospital patients with pneumonia and/or sepsis, and found that the use of pneumonia codes had declined between 2003 and 2009, while codes for sepsis secondary to pneumonia, and respiratory failure with pneumonia, had increased. While mortality rates (adjusted for age, sex and co-morbidities) in each category had dropped significantly over the same time period, taken together as a single category, the mortality rate had not significantly changed. The authors suggest that patients that would have been at high risk of dying with a pneumonia code in 2003 were increasingly given sepsis or respiratory failure codes (thus artificially improving mortality rates in the pneumonia group), where they became comparatively low-risk patients. Meanwhile, advances in treatment for sepsis had improved mortality in the other two groups’ higher-risk patients. Commenting on the medical website Medscape (http://www.medscape.com/viewarticle/765523), Shorr described the coding bias exposed by this study as “not comparing apples with apples and oranges with oranges [but]… mixing things up and making fruit salad”.

If you are a Stata beginner, you’re writing do-file code, but you want it to be more efficient, more reliable, and to take you less time, and certainly no copying and pasting of almost-identical blocks, then this is for you. It’s happening on Friday 4 May in the afternoon, UK time, and you can book here.

We ran it a couple of months back and got some very positive feedback from participants.

You will learn how to save time and avoid errors by writing bespoke commands for your own use, getting Stata to loop through your data for repetitive work, and including automated checks to keep Stata running smoothly.

By the end of the course, all participants will feel comfortable undertaking the following tasks:

- Automating repetitive and time-consuming tasks
- Providing non-technical colleagues with a Stata analysis they can easily run
- Using Stata’s loops to save time and work with big data
- Protecting your automated analysis against common data issues
- Creating your own Stata commands

I am excited to offer this new course because I think there is a real gap in the market here. Lots of people want to tap into both Stata as a commercial package with a lot of well-constructed and thoroughly tested tools for analysis and graphics, as well as R as a flexible, analysis-oriented programming language. Obviously you could run a bit of code here and a bit there and keep notes as to what you did (and heaven knows we’ve all been there), but it is safer, faster and more reproducible to integrate your workflow across both Stata and R. I’ll show you some ways of doing that and you’ll leave with some useful Stata commands and R functions that you can start using straight away.

Because of the many different ways in which you might have these two pieces of software set up on your computers, this needs to be a face-to-face course (for now at least; maybe we’ll be able to offer it online from next year).

And if you’re wondering why anyone would want to use both, here are some pros and cons. Stata has an imperative scripting language (with macro substitution) that does a lot more than you probably think it does. R is a functional, highly vectorised programming language. What is hard in one is often quite easy in the other.

Some R advantages:

- faster graphics
- sometimes more flexible graphics
- have more than one data file open, plus various arrays
- lists of diverse object types
- useR communities + a lot of Stack Overflow *vel sim*
- functional programming can be handy…
- magrittr piping
- Rcpp integration with C++
- packages for hip stuff like Spark, H2O.ai, Keras etc etc…
- rmarkdown, knitr etc for outputs
- bespoke parallelisation

Stata advantages:

- simpler to bootstrap most estimation and modelling methods
- simpler multiple imputation
- imperative programming with macro substitution can be handy
- most models are achieved with less typing
- customer support + Statalist
- Mata, and Java/C/C++ plugins
- lots of economics/econometrics functionality
- neat structural equation model building
- webdoc, dyndoc etc for outputs
- built-in parallelisation (if you have MP flavor)
- super-clean SVG output
- margins and marginsplot are really good for communicating findings

You can read more and book here.

Power is the chance that you will, in a future study that you are designing, get a significant hypothesis test result if the true value of the test statistic in question is equal to some minimally important value. Sometimes, people carry out studies and they are disappointed to get p>0.05: non-significant. You know, p=0.049 means you collect a Nobel Prize and p=0.051 means you collect a P45 (the tax document you get when you lose your job in the UK). So, when their fears are realised and they get non-significance, they go looking for excuses they can tell the boss. One is that the study maybe turned out (through no fault of their own, of course) to have lacked power, and they ask for retrospective power calculations: not when designing the study but after it has been conducted. That is meaningless, says the statistician, and they go away despondent. I have to say that, to me, it seems a reasonable question — it just can’t be answered by power.

In the last few days, Shravan Vasishth and I have passed the idea around through Twitter. He proposed a calculation on his blog based on treating the probability of H0 vs HA as a random variable, and that spurred me on to type up what notes I had. My approach is to look at a hypothetical identical future study, but the *really* interesting aspect is that you would only ask this question of your friendly statistician when you think there’s a chance of snatching victory from the jaws of defeat, and that introduces a complex bias. Forcing people to think about and justify this bias might actually make the practice of retrospective power calculation, or as I prefer to put it, false non-significance rate, quite a positive one. Here’s the text but I recommend reading it in PDF so the maths makes more sense.

Power, the probability of obtaining a significant hypothesis test result if the population test statistic is equal to a minimally important value, is a ubiquitous concern in many fields of applied statistics, including my own, biomedical research. It is usually operationalised as a frequentist concept, and so calculating it after the study has been conducted — so-called retrospective power — is meaningless.

Let theta be the true value of the test statistic, and sigma its true standard error given the sample size of the completed study. delta is the minimally important value upon which the power calculation for the study was based: the target difference in the original sample size calculation. theta^ and sigma^ are the completed study’s estimates of theta and sigma respectively. Here, for simplicity, they are regarded as transformed so that the null hypothesis is h0: theta = 0. I will also assume normality of the sampling distribution for illustrative purposes, although the formulas do not require that.

1 − P(theta^ < 1.96 sigma^ | theta = delta, sigma = sigma) is either 0 or 1 once you know theta^ and sigma^. Nevertheless, there seems to be a widespread urge to answer this question: “could my study’s non-significant result have been a mistake?” This seems a reasonable question, but answering it requires something other than power.

Ioannidis considered possible ways of assessing the probability of the truth or falsehood of study results in his widely cited 2005 paper “Why most published research findings are false”. His formulas take into account various other forms of bias, such as unacknowledged occult multiplicity and publication and reporting bias. However, he considers only a dichotomised finding (significant / not) and true value (effect / no effect), which limits the applicability of the approach to individual studies. This was one aspect criticised afterwards by Goodman & Greenland [http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040168].

In a 2008 paper, expanding on an article in American Scientist magazine [http://www.americanscientist.org/issues/page2/of-beauty-sex-and-power], and published only on the first author’s own website, Gelman and Weakliem considered underpowered studies and set out the probabilities of various types of error: the familiar I and II as well as errors of magnitude (type M) and of sign (type S). They conclude that the system of type I and II errors has not been helpful. In most cases they consider, the probabilities of type M or S errors turn out to be so high as to call any conclusion of the study into question. In the same year, Ralph O’Brien spoke at the JSM conference on “crucial type I and II error rates”. This proposal reverses the familiar formulas through Bayes’ Theorem. The contemporaneous discussion at http://andrewgelman.com/2008/12/26/what_is_the_poi/ sets the scene.

To address the question of whether a study’s non-significant result could be a type II error, we must deal with a theoretical identical future study, in the same prospective way that classical power is calculated. The completed study’s estimates of the parameter and its standard error are fixed values, and the true population values are unknown, but we can establish a distribution for the estimates of an identical future study.

The question “what is the probability that my study’s non-significant result is wrong?” then can be rephrased as “given what we can infer about the true parameter given our data, what is the probability that the true effect is as big or bigger than the target difference and yet an identical study would yield a non-significant result?” This is then a form of Bayesian posterior predictive model checking.

Let theta* and sigma* be the estimates arising from an identical study. We are interested in:

P(theta* < 1.96 sigma* | theta^, sigma^, theta ≥ delta)

which we can calculate from the sampling distribution P(theta,sigma | theta^, sigma^) and the conditional P(theta*, sigma* | theta, sigma) and by integrating out the unknown (theta,sigma).
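As a sketch, that calculation can be done by Monte Carlo. The Python below is my own illustration, with made-up numbers, a flat prior on theta, and sigma treated as known (sigma = sigma* = sigma^) to keep it short:

```python
import random

random.seed(2)

theta_hat, sigma_hat = 1.5, 1.0   # illustrative estimates, not from any real study
delta = 2.0                        # the target difference from the design stage

accepted, nonsig = 0, 0
while accepted < 50_000:
    # flat prior, so theta | theta^, sigma^ ~ N(theta^, sigma^2)
    theta = random.gauss(theta_hat, sigma_hat)
    if theta < delta:              # condition on theta >= delta by rejection
        continue
    # the identical future study's estimate, theta* ~ N(theta, sigma^2)
    theta_star = random.gauss(theta, sigma_hat)
    nonsig += theta_star < 1.96 * sigma_hat
    accepted += 1

rate = nonsig / accepted
print(f"estimated false non-significance rate: {rate:.2f}")
```

With these particular numbers the rate comes out well above zero, which is the point: a true effect at least as big as the target difference can still, quite often, produce a non-significant replication.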

To emphasise the distinction from the type II error rate, I propose the clear term “false non-significance rate” for this.

There is a further complication to consider, which mirrors Ioannidis’s various biases. Retrospective power is usually only considered when theta^ < 1.96 sigma^ (that is, a non-significant result), and either sigma^ is larger than expected (including problems of sample size) or theta^ is close to, but smaller than, delta. This introduces a bias because we will only ask the retrospective power question of a subset of the possible values of (theta^, sigma^). To counter this requires us to introduce a prior on theta^, sigma^ and so derive a posterior distribution for theta*, sigma*. This could be informed by previous research or opinion in the usual way, but it does not make sense for it to be diffuse. Sensitivity analysis with various priors is advisable.

It is almost always the case that there is some other information on anticipated findings, so that theta^,sigma^ is not the only information about theta,sigma and hence theta*,sigma*. We should attempt to incorporate this as a prior distribution because it is hard to interpret retrospective power in the context of other studies, in the way that we expect people to interpret study findings (without informative priors).

As a former medical statistician in a university and hospital setting, I was regularly called on to advise on sample size and power calculations. I was, and still am, convinced that the great majority of these calculations were uninformative acts of sophistry, performed for the comfort of the tutor, ethics committee or funding body, and based on such an accumulation of assumptions as to be meaningless. My message to all colleagues and students in such a situation (because it is not their fault to expect a simple answer) is to think very carefully and in depth about what they are trying to investigate, and what they would do having found various potential results. This critical thought helps to inoculate them against the lure of simplicity that comes from one calculation on one hypothesis, under one set of assumptions. The calculations set out in this paper are no different, and require careful justification for all the assumptions behind them. Indeed, I approach publication of this proposition with some trepidation, lest what is intended as a stimulating exercise in defining slippery concepts is reduced instead to a catch-all formula that permits retrospective power calculations to proceed under a new name, and unhindered by cerebral activity. I hope that by encoding the most difficult part of this – the retrospective power bias – as an informative prior distribution, researchers will be forced to slow down and consider what has happened with their study, and what they are seeking to achieve by such calculation, very carefully: rather like the QWERTY keyboard, intended to slow typists sufficiently to avoid jamming the keys of a manual typewriter.

A final note of caution concerns the target difference delta, which appears in all the formulas here. A review of methods to establish this value (Cook et al, HTA) is essential reading for everyone working with sample size and power calculations, because our recommendations for designing future studies and interpreting completed ones are undermined by irrelevant or unreliable target differences.


Last year Nick Cox pointed out to me that the only regular (i.e. all sides the same length, all angles the same) two-dimensional shapes that tessellate (i.e. fill the plane without leaving gaps or overlapping) are those with 3, 4 and 6 sides: the equilateral triangle, the square and the regular hexagon.

Then I started thinking about this in the big data context. Suppose I have to reduce my data to make it visualisable, you know, so I, the feeble human, can explore it and see what’s going on. That is time-consuming and hard to program, so I want to do it once only if possible. How should I bin the data to keep my options open for later dataviz?

Here’s an example of what I mean. If I have two variables, like latitude and longitude of NYC taxi pickups, and I count them on a fine square grid, I can store that matrix of counts locally. Even if it is a big grid like 10,000 by 10,000 that will still be 100,000,000 numbers, which is quite manageable. Later, maybe I want to draw it on a 1000 by 1000 grid, so I just add together the counts in adjacent groups of 100 small squares to make one big square. That runs quickly.

```
// pseudocode: aggregate points into a square grid of counts
int[n_rows, n_cols] count_matrix      // all cells start at zero
for i in 1:n_data {
    int rownum = floor(y[i] / row_height)
    int colnum = floor(x[i] / col_width)
    count_matrix[rownum, colnum] = count_matrix[rownum, colnum] + 1
}
```

```
// pseudocode: aggregate into a coarser square grid
int reduction = 10 // each new square is the sum of a 10x10 area of the old grid
int new_rows = n_rows/reduction // assuming n_rows is a multiple of reduction
int new_cols = n_cols/reduction // likewise
int[new_rows,new_cols] new_matrix
// if our language is zero-indexed
for i in 0:new_rows-1 {
    for j in 0:new_cols-1 {
        // corners of this block in the old grid
        int top_corner = i*reduction
        int left_corner = j*reduction
        new_matrix[i,j] = sum(count_matrix[top_corner:(top_corner+reduction-1),
                                           left_corner:(left_corner+reduction-1)])
    }
}
```
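In a language with array support, the whole thing collapses to a few lines. Here is a sketch in Python with numpy, using made-up uniform random data and a 1000-by-1000 fine grid (all the sizes and names here are illustrative, chosen to match the pseudocode):

```python
import numpy as np

# illustrative data: 10,000 points in a 100 x 100 region
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 10_000)
y = rng.uniform(0, 100, 10_000)

n_rows = n_cols = 1000
row_height = col_width = 100 / n_rows

# fine grid: one pass over the data, exactly the floor() trick
rownum = np.floor(y / row_height).astype(int).clip(0, n_rows - 1)
colnum = np.floor(x / col_width).astype(int).clip(0, n_cols - 1)
count_matrix = np.zeros((n_rows, n_cols), dtype=int)
np.add.at(count_matrix, (rownum, colnum), 1)

# coarser grid: sum each 10x10 block via a reshape, no explicit loops
reduction = 10
new_matrix = (count_matrix
              .reshape(n_rows // reduction, reduction,
                       n_cols // reduction, reduction)
              .sum(axis=(1, 3)))
```

The reshape trick works because each 10x10 block of the old grid becomes one cell of a four-dimensional array, and summing over the two block axes collapses it to the coarse grid; no counts are lost or double-counted.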

So I was thinking: can you combine shapes together easily? This called for some geometry, which was never my thing. Here we go.

Triangles can be combined in sets of four to make bigger triangles, or sets of six to make a hexagon. Squares combine in sets of four to make bigger squares, and so on. Hexagons don’t combine to make any of these shapes. So, what can I conclude? Bin your data in two dimensions using small squares or triangles, bearing in mind that the triangle will give you hexbins if you want them, but there is no crossing from tri-hex to square or back again. You could have a rhombus, but not a square.
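Allocating a point to a triangular bin is only slightly harder than the floor() trick for squares: shear the coordinates onto the lattice spanned by (1, 0) and (1/2, √3/2), take floors, and the fractional parts tell you whether the point sits in the up- or down-pointing triangle of that rhombus. A sketch in Python (`tri_bin` is my own hypothetical name, not any library’s function):

```python
import math

def tri_bin(x, y, side=1.0):
    """Return (i, j, orientation) of the equilateral-triangle bin
    containing the point (x, y), for triangles of the given side."""
    # shear onto the lattice basis e1 = (1, 0), e2 = (1/2, sqrt(3)/2)
    u = (x - y / math.sqrt(3)) / side
    v = (2 * y / math.sqrt(3)) / side
    i, j = math.floor(u), math.floor(v)
    # each (i, j) rhombus holds one up- and one down-pointing triangle,
    # split along the diagonal where the fractional parts sum to 1
    up = (u - i) + (v - j) < 1
    return i, j, "up" if up else "down"
```

Six triangles meet at each lattice vertex and together form a hexagon, which is why regrouping these (i, j, orientation) indices by nearest vertex gives you hexbins later.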

Now, what about higher-dimensional binning? It seems that the only regular space-filling polyhedron in three dimensions is the cube (cf https://en.wikipedia.org/wiki/Honeycomb_(geometry)). There are other shapes that fill space by themselves but have a mixture of face shapes; you probably don’t want to tangle with those, because it gets hard to determine whether a point is in this polyhedron or that one. There are also mixtures of polyhedra which together fill space, but those are unsatisfying for this application too: you want flexibility in aggregating bins into larger polyhedra, and consistency when taking slices through them for visualisation. So, use cubes. If you really want a hexbin (and it’s a good visual format!), do 2-D triangle or hexagon bins from the outset; in 3-D these become stacked right prisms (think of the Giant’s Causeway) which later get aggregated for a marginal plot or filtered for a conditional slice.

More than three dimensions eluded my non-existent powers of geometrical thinking, but it seems to me that hypercubes always pay off. Not only can you aggregate them as you choose, it’s also easy to allocate points to hypercubes: you just add more lines (those with the floor() function) to the code above, and more dimensions to the array count_matrix.
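The k-dimensional version really is just more floor() lines; in Python, numpy’s histogramdd does the whole allocation in one call. A sketch with made-up uniform data in four dimensions (the bin counts and axis ranges are arbitrary, for illustration only):

```python
import numpy as np

# illustrative data: 100,000 points in k = 4 dimensions
rng = np.random.default_rng(1)
data = rng.uniform(0, 1, size=(100_000, 4))

# bin into 4-hypercubes: 20 bins per axis -> a 20^4 array of counts
counts, edges = np.histogramdd(data, bins=20, range=[(0, 1)] * 4)

# aggregate to a coarser grid (factor 2 per axis), as in the 2-D case
coarse = counts.reshape(10, 2, 10, 2, 10, 2, 10, 2).sum(axis=(1, 3, 5, 7))

# a 2-D marginal for plotting: sum out the other axes
marginal = counts.sum(axis=(2, 3))
```

Once the counts array exists, marginal plots, conditional slices and coarser grids are all just sums over axes, which is the whole point of binning once and keeping your options open.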

**Bottom line:** for k-dimensional data, bin in k-hypercubes. But if you know you want triangles or hexbins in 2-D projections or conditional slices, then you’ll have to do that from the outset.

If you want to give every personal insight away to some 25-year-old dude in Sunnyvale to dick around with, fine. It’s your choice. It just surprises me that nobody (that I know of) thinks to ask for money in exchange for data. Because it’s worth money to the recipient, right?

VODAFONE SMS: your views matter to us. Please take 2 minutes to complete the quick survey we are sending in the next message.

ME: OK, that’ll be 5 pounds please.

GUY IN SHOP: You’ll get an email asking about your experience in the shop today and how I did – I’d really appreciate it if you could fill that out, and you know, they only take 9 or 10 out of 10 as good.

ME: OK, that’ll be £2.50 please.

COMPANY I SHOP WITH: Hey, it’s such fun shopping with us, right? Why not have even more fun with our app for your smart phone? All it needs is permission to read your files, photos, contacts, GPS location, wi-fi, messages, phone calls, and social media logins. Click here! We’ll have suuuuuch fun together!

ME: No.

You get the idea. I charge Vodafone more because they can afford more and because they are going to use the data against me in future (their raison d’être is to take my money and your money and give it to shareholders, and that is best done by ratcheting up my bill in a data-driven way). I think I can beat them at that game though, otherwise I wouldn’t offer the transaction at all, which is the case with the kick-me sign on your back that is known as installing an app for convenient shopping. The guy in the shop wants to get a raise or promotion one day; well, how about a little down payment on that sweet cash flow, brother? I generally get looked at like I’m a callous weirdo (which may also be true but doesn’t follow logically from the evidence, and yes, I really do ask for money; I’ve not got any so far…). But if everybody said it… Here arises the idea of **data unions**, which as far as I can tell only exists as a bon mot that Pedro Domingos chucked out the other day:

but it’s a good idea. Suppose we shop entirely in cash, but you could block-buy our shopping lists for just 10p per shopping trip. 50p if you want home address postcodes attached. We deactivate our phones’ GPS, but carry basic trackers. You want the data? That’ll be £175 per person per year. All proceeds to me and you. Whaddya say?

]]>So, I’m writing a book on dataviz as you may know. I just wonder if there are new pieces of work out there where someone – maybe you – fitted a model of some kind (regression, trees, neural network, whatevz) to a big dataset and then visualised it somehow. By big, I mean big, like too numerous to draw the individual data, so you had to do something like bin-summarise-smooth on it. But it’s also interesting if it was big, as in you had to do some kind of map-reduce of sufficient statistics just to get the model done. Perhaps you averaged over random samples, then drew densities of predicted values and residuals for all the data; that’s interesting too. You can email me if you prefer to remain secretive, or just comment and ask me not to approve it for public display.

]]>This is the version sent to Significance magazine — an edited version appeared in the April 2017 issue, and lack of space meant it got a severe pruning down to one good thing, one bad thing, and one funny thing. I understand how magazines work and accept that, but you, learned reader, might prefer to hear all the points and the nuances.

Although I haven’t added hyperlinks, you’ll find all the things I refer to quite easily in your favourite search engine, or possibly your local library.

However implausible it may sound, this collection of reminiscence, musings on the state of the art and advice for young statisticians makes for compelling reading. I suspect most Significance readers will find something of interest in here. There are 52 contributions from eminent statisticians who have won a Committee of Presidents of Statistical Societies (COPSS) award. Each is a short, focussed chapter and so one could even say this is ideal bedtime (or coffee break) reading.

Anyone interested in the history of statistics will know that much has been written about the early days but little about the field since the Second World War. This book goes some way towards redressing that, and is all the more valuable for coming from the horse’s mouth.

If there is a consensus among contributors, it is:

- statistics is exciting
- in fact, statistics is more exciting than ever with more tools and more data, although academia is more pressured
- collaborations are fun and you learn a lot from getting closer to the data source
- careers do not work out as planned
- we have come a very long way in gender equality (but remember that this is a North American book)
- useful analytical methods can be found in completely unrelated applications (“keep your eyes open to synergies between apparently disparate fields”, writes Grace Wahba)
- many of these high-achievers in statistics found initial traditional education in the subject difficult, either because theory seemed so unrelated to practice, or the subject was “a collection of strange recipes … generated by a foreign culture” (Bruce Lindsay)

But statistics was hard work in the old days. Once you read about work in the days before personal computers, you may think twice before cursing the one on your desk. Herman Chernoff recalls rooms of human “computers” inverting matrices of order 12 to ten significant figures on desktop electric calculators. Bruce Lindsay describes the difficulty of having manuscripts prepared by a typist when switching from text to algebra meant a change of typewriter (or at least ‘golf-ball’) mid-page. Dennis Cook followed a yearly cycle of collecting data in the summer and spending all winter analysing it. Juliet Popper Schaeffer recalls the difficulty of obtaining essential research papers in the days before photocopiers.

Discrimination against women was widespread in the American academic job market and quite overt until the Civil Rights Act in 1964. There are jaw-dropping recollections from Juliet Popper Schaeffer, Donna Brogan and Mary Gray, for example: “The IBM interviewer commented that he had never seen such a high [math aptitude] score from any applicant and offered me a secretarial or entry sales post … I was interested in their advertised technical positions… but he simply said that those positions were for males.”

Machine learning methods are a topic of much discussion in statistics today: either a great opportunity or terrible threat or insubstantial hype, depending on whom you ask. In “Past, Present and Future”, some knowledgeable contributors discuss them in depth. Larry Wasserman in particular is keen and suggests that the statistics profession must radically adapt to them or become outmoded. Echoing the many ‘data science Venn diagrams’ to be found online which indicate a meeting point between statistics, computer science and topic expertise, Brad Efron describes statistics as “at the triple point of mathematics, philosophy and science”.

Throughout the 52 chapters, my personal preference was for the recollections and advice. There are some contributions that set out current methodological problems in the author’s own area of interest, and they will interest a much narrower audience. Sometimes, I had the feeling of an unpublished paper sneaking out via the pages of this book, but fortunately these are easily spotted by the extensive algebra. Which brings me to the closing chapter, the shortest of all, from Brad Efron: a list of “thirteen rules for giving a really bad talk”. This made me laugh out loud and should be posted on the walls of all conferences.

I shall leave the final word to Peter Bickel: “We should glory in this time when statistical thinking pervades almost every field of endeavour. It is really a lot of fun.”

]]>