Tag Archives: statistics

Discomfiting jumps

I have been writing a book review of Efron & Hastie’s CASI for Significance magazine. Here’s a tangential half page I wrote but didn’t include.

Students of statistics or data science will typically encounter some discomfiting jumps in attitude as their course progresses. First, they may have a lot of probability theory and some likelihood-based inference for rather contrived problems, which will remind them of their maths classes at school. Ah, they think, I know how to do this. I learn the tricks to manipulate the symbols and get to the QED. Then, they find themselves suddenly in a course that provides tables of data and asks them to analyse and interpret. Suddenly it’s become a practical course that connects to the real world and leaves the maths behind for the most part. Now, there’s no QED given, and no tricks. The assessments are suddenly more like those in humanities subjects: there’s no right or wrong, and it’s the coherence of their argument that matters. Now they have to remember which options to tick in their preferred stats software. They might think: why did we do the mathematical parts of this course at all if we’re not going to use them? Next, for some, come machine learning methods. Now, the inference and asymptotic assurances are not just hidden in the cogs of the computer but are actually absent. How do I know the random forest isn’t giving me the wrong answer? You don’t. It seems at first that when the problem gets really hard, like 21st-century-hard, land-a-job-at-Google-hard, we give up on stats as an interesting mental exercise from the 1930s in favour of “unreasonably effective” heuristics and greedy algorithms.

One really nice thing they do in CASI is to emphasise that all estimation methods, from standard deviations of samples to GAMs, are algorithms. The inference (I prefer to say “uncertainty”) for those algorithms follows later in the history of the subject. The 1930s methods have had enough time to work out inference by now, but other methods are still developing their inferential procedures. This unifies things rather better, but most teaching has to catch up. One problem is that almost all the effort of reformers following George Cobb, Joan Garfield and others has been on the very early introduction to the subject. That’s probably the right place to fix first, but we need to broaden out and fix wider data science courses now.


Filed under learning

Jasper tree ring fire scars – a teaching dataset

Today I’m sharing a nice little dataset that I think has some good features for teaching. Hope you like it.
I spotted this in the museum in Jasper, Alberta in 2012 and took a photo.

[Image: photo of the typewritten Jasper tree ring fire scars table]

Later, I e-mailed the museum to find out who I should credit for it and we eventually found that it originated some time ago from Parks Canada, so thanks to them and I suggest you credit them as source if you use it.

No, I don’t have it in a file. I think working from the typewritten page is quite helpful as it keeps people out of stats software for this. They have to think. If you want to click buttons, there are a gazillion other datasets out there. This is a different kind of exercise.

Here we have the number of scars in tree rings that indicate fires in various years. If you look back in time through a tree’s rings, you can plot when it got damaged by fire but recovered. This could give an idea of the number of fires through the years, but only with some biases. It would be an interesting exercise for students who are getting to grips with the idea of a data-generating process. You could prompt them to think up and justify proposed biases, and hopefully they will agree on stuff like:

  • there’s a number of fires each year; we might be able to predict it with things like El Niño/La Niña years, arrival of European settlers and other data sources*
  • the most ancient years will have few surviving trees, so more and more fires will get missed as you go back in time.
  • This might not be random, if the biggest (oldest) trees were more likely to get felled for wood
  • there will be a point (perhaps when Jasper became a national park) after which fires in the backwoods are actively prevented and fought, at which point the size of the fires, if not the number, should drop
  • the bigger the fire area, the more scars will be left behind; they have to decide to work with number of fires, or size (or both…)
  • the recorded size of each fire will be quite unreliable in the old days, but otherwise it should give a good link from number of fires to number of scars
  • can we really trust the area of burn in the older years? to 2 decimal places in 1665?
  • and other things that are very clever and I haven’t dreamt of

* – once they are done with the data generating process, if they are confident enough with analysis, you could give them this dataset of Canada-wide forest fires, which I pulled together a few years ago. It’s not without its own quirks, as you’ll see, but they might enjoy using it to corroborate some of their ideas.

I would ask them to propose a joint Bayesian model for the number of fires and area burnt over the years, including (if they want) predictions for the future (bearing in mind the data ends at 1971). You could also ask for sketched dataviz in a poster presentation, for example.
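To get students started on the data-generating process, a forward simulation can help before any model fitting. The sketch below is only one possible version and every number in it is made up: it assumes a constant Poisson rate of fires per year (`fire_rate`) and assumes a fire is only recorded if a scarred tree survives to 1971, with survival getting less likely by a fixed proportion (`annual_tree_loss`) for each year further back. None of these names or values come from the Parks Canada data.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_scar_record(first_year=1665, last_year=1971, fire_rate=2.0,
                         annual_tree_loss=0.01, seed=1):
    """Return a list of (year, true_fires, recorded_fires) tuples.

    Hypothetical assumptions: fires arrive at a constant Poisson rate, and each
    fire is recorded only if a scarred tree survives until last_year.
    """
    rng = random.Random(seed)
    record = []
    for year in range(first_year, last_year + 1):
        true_fires = poisson(rng, fire_rate)
        # Probability that the evidence of a fire in this year still exists
        p_survive = (1.0 - annual_tree_loss) ** (last_year - year)
        recorded = sum(rng.random() < p_survive for _ in range(true_fires))
        record.append((year, true_fires, recorded))
    return record


def yearly_mean(rows, column):
    """Mean of one column (1 = true fires, 2 = recorded fires) over some years."""
    return sum(row[column] for row in rows) / len(rows)


record = simulate_scar_record()
earliest, latest = record[:50], record[-50:]
print("earliest 50 years: true %.2f, recorded %.2f per year"
      % (yearly_mean(earliest, 1), yearly_mean(earliest, 2)))
print("latest 50 years:   true %.2f, recorded %.2f per year"
      % (yearly_mean(latest, 1), yearly_mean(latest, 2)))
```

The true rate never changes in this simulation, yet the recorded counts thin out going back in time, which is the survivorship bias from the list above. Students could extend it with fire suppression after some year, or selective felling of the oldest trees.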

Finally, I highly recommend a trip to Jasper. What a beautiful part of the world!


Filed under learning, Visualization

Dataviz of the week, 31/5/17

A more techy one this week. Ruth Fong and Andrea Vedaldi have a paper on arXiv called “Interpretable explanations of black boxes by meaningful perturbation”. The argument that some modern machine learning (let’s not start that one again) techniques are black boxes, producing an output without anybody understanding how or why, is a serious concern. If you don’t know how it works, how do you know you can believe it, or apply it outside the bounds of your previous data (in the manner of the disastrous Challenger space shuttle launch)?

HT @poolio for tweeting this, otherwise I’d never have heard about it.

The paper is heavy on the maths but thanks to the visual nature of convolutional neural networks (CNNs), which are high-dimensional non-linear statistical models to classify images, you can absorb the message very easily. Take the image, put it through the CNN, get a classification. Here, from the paper’s Figure 1, we see this image classified as containing a flute with probability 0.9973:

[Image: from the paper’s Figure 1, the original image classified as containing a flute]

Then, they randomly perturb an area of the image and run it again, checking how it has affected the prediction probability. When they find an area that strongly adversely affects the CNN, they conclude that it is here that the CNN is “looking”. Here’s a perturbed image:

[Image: the perturbed image, with the flute blurred]

You can see it’s the flute that has been blurred. They then show the impact of different regions in this “learned mask” heatmap:

[Image: the “learned mask” heatmap]

(I’m glossing over the computational details quite a lot here because this post is about dataviz.) It rather reminds me of the old days when I was an undergrad and had to calculate a gazillion different types of residuals and influence statistics, many of which were rather heuristic. You could do this kind of thing with all kinds of black boxes (as Fong & Vedaldi suggest by proposing a general theory of “explanations”), as long as there are some dimensions that are structural (x and y position in the case of image data) and others that can get perturbed (RGB values in this case). I think it would be valuable in random forests and boosted trees.
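For anyone who wants to poke at the general idea, here is a minimal sketch of the crudest version: occlude one patch at a time and record how much the black box’s output drops. This is not Fong & Vedaldi’s method (they optimise a mask rather than trying patches one by one), and `black_box` is a hypothetical stand-in I made up so the example runs without a trained CNN. It secretly looks only at a small region of the image, and the sensitivity map should find that region.

```python
def black_box(image):
    """Hypothetical stand-in for a trained classifier: its 'probability'
    depends only on the brightness of rows 2-3, columns 4-5."""
    region = [image[r][c] for r in (2, 3) for c in (4, 5)]
    return sum(region) / len(region)


def occlude(image, top, left, size, fill):
    """Return a copy of image with a size x size patch replaced by fill."""
    copy = [row[:] for row in image]
    for r in range(top, min(top + size, len(image))):
        for c in range(left, min(left + size, len(image[0]))):
            copy[r][c] = fill
    return copy


def sensitivity_map(model, image, size=2, fill=0.0):
    """Drop in the model's output when each patch is occluded in turn."""
    baseline = model(image)
    heat = []
    for top in range(0, len(image), size):
        heat_row = []
        for left in range(0, len(image[0]), size):
            perturbed = occlude(image, top, left, size, fill)
            heat_row.append(baseline - model(perturbed))
        heat.append(heat_row)
    return heat


image = [[0.8] * 8 for _ in range(6)]  # a flat 6 x 8 grey image
heat = sensitivity_map(black_box, image)
for heat_row in heat:
    print(" ".join("%.2f" % value for value in heat_row))
```

Only the patch covering the region the model uses shows a drop; everywhere else the output is unchanged. The same loop works for any function that maps structured input to a score, which is why I think the idea should carry over to tree ensembles.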

They also have a cup of coffee where the mask makes sense when artifacts are added (the kind of artifact that is known to mess with CNNs yet not human brains) and a maypole dance that doesn’t so much (and this seems to be powered by the same CNN tendency to spot ovals). This is potentially very informative for refining the CNN structure.

[Image: the coffee cup and maypole examples from the paper]

If you are interested in communicating the robustness of CNNs effectively, you should read this too.




Filed under computing, machine learning, Visualization

Two great skills to leverage best-in-class big data science analytics

This came up on Twitter and lots of people were outraged, as you see in the replies and retweets.

Let’s unpack a couple of things.

  • appreciate – it’s not clear what he means by this. It could mean “Many software engineers will never be really good at data science using modern machine learning”, which seems like a tautology (same goes for estate agents), but see software engineers below. It could mean “Many software engineers will never truly have an intuitive attraction to the elegant mathematical underpinnings of modern machine learning”, and in that case it is true that there is a connection between maths and, er, maths, but that’s not very interesting. Appreciating in this sense is an ivory tower luxury.
  • love – lord above, are you trying to fool me in love? I think high-pressure rote learning in the Asian mould would do the trick too. It seems irrelevant.


    Victorian Dad (c) Viz

  • as a teen – this is what most people hated about it, the gatekeeping and stereotype-enforcement. It’s clearly bollocks, so let’s not waste time on Someone Said Something Wrong On The Internet. If you want to learn now, here’s my reading page.
  • software engineers – if he really is talking about software engineers (isn’t that term, like, a bit 1990s?), then it sounds fair enough despite the inaccuracies and tautologies. Why would they want to or need to have anything to do with modern ML? I’m a statistician, but do enough programming to grasp what it is like to be a day-in, day-out coder. You just grab something that someone wrote — a random forests library perhaps — and plug it in. Why would you appreciate its theory? That’s a waste of time. You don’t go round appreciating the hell out of fibre broadband cables.
  • modern machine learning – I don’t know what is meant by this, but it’s interesting to me that there are some things in ML and stats like logistic regression, which have strong, mathematical underpinnings, which is to say that their asymptotics are understood, and other things in ML and not stats, like deep learning with backprop, which are kind of greedy, heuristic and do not have guaranteed or even understood asymptotics. Depending on what he means by this phrase, there might be nothing to appreciate. If there is something to appreciate, then it might not be that modern — logistic regression was pretty much finished theoretically in the 70s, PCA in the 30s.
  • math – this is the really interesting thing. Do you need maths to do data science well? It certainly helps with reading those tortuous theory papers (but they’re not that useful compared to messing about with software). It is not as useful as programming (hi, software engineers!) skills. The reason a lot of people get caught out is that they have done some analysis that ran, produced no error messages, but led to the wrong answer, and they had no mental tools to spot it. Maths will not give you that tool; you need to think about data and have messed around getting your hands dirty. I studied maths and enjoyed it and did pretty well, if I say so myself, but that has been of very little use to me. I’ve forgotten most of it.

    A page of my A-level maths revision notes. I have never had to do partial fractions. Ever.

    If you really do intend to be a methodological stats prof, then you’d better get good with the old x’s and y’s, but otherwise, install R and play.

Perhaps the one really useful skill I acquired is imagining data as points in space, rotating, distorting, projecting. I had to do a lot of that when doing a Masters dissertation project with PCA, MCA, etc. That has genuinely helped me to develop ideas and think about where things are going wrong.

The other important thing to think about is metrics – different ways of quantifying the distance from this data point to that one, because that underpins a lot of stuff that follows, whether stats or ML (notably loss / log-likelihood functions). And I have another blog post on this very topic coming up.
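As a tiny illustration of why the choice of metric matters, here is a sketch with three standard distances and three made-up points (`origin`, `p` and `q` are purely hypothetical). Which point counts as “nearest” to the origin changes with the metric, and anything built on nearest neighbours or a loss function inherits that choice.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute difference along any one axis."""
    return max(abs(x - y) for x, y in zip(a, b))


origin = (0.0, 0.0)
p = (3.0, 3.0)
q = (5.0, 0.0)

for metric in (euclidean, manhattan, chebyshev):
    nearest = "p" if metric(origin, p) < metric(origin, q) else "q"
    print("%-10s p: %.2f  q: %.2f  nearest: %s"
          % (metric.__name__, metric(origin, p), metric(origin, q), nearest))
```

Euclidean and Chebyshev distance put `p` nearer to the origin, while Manhattan distance prefers `q`.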


Filed under learning

The peer-review log

As an academic, I started a page on this blog site that documented each peer review I did for a journal. I never quite got round to going back in time from the start, but there isn’t much of interest there that you won’t get from the stuff I did capture. Now that I am hanging up my mortarboard, it doesn’t make sense for it to be a page any more so I am moving it here. Enjoy the schadenfreude if nothing else.

Statisticians are in short supply, so scientific journals find it hard to get one of us to review the papers that have been submitted to them. And yet the huge majority of these papers rely heavily on stats for their conclusions. As a reviewer, I see the same problems appearing over and over, but I know how hard it is for most scientists to find a friendly statistician to help them make it better. So, I present this log of all the papers I have reviewed, anonymised, giving the month of review, study design and broad outline of what was good or bad from a stats point of view. I hope this helps some authors improve the presentation of their work and avoid the most common problems.

I started this in November 2013, and am working backwards as well as recording new reviews, although the retrospective information might be patchy.

  • November 2012, randomised controlled trial, recommended rejection. Sample size was based on an unrealistic Minimum Clinically Important Difference from prior research uncharacteristic of the primary outcome, and thus the study was unable to demonstrate benefit, and unethical because the primary outcome was about efficiency of the health system while benefit to patients had already been demonstrated, yet the intervention was withheld in the control group. Power to detect adverse events was even lower as a result, yet bold statements about safety were made. A flawed piece of work that put hospital patients at risk with no chance of ever demonstrating anything, this study should never have been approved in the first place. Of interest to scholars of evidence-based medicine, this study has now been printed by Elsevier in a lesser journal, unchanged from the version I reviewed. Such is life; I only hope the authors learnt something from the review to outweigh the reward they felt at finally getting it published.
  • November 2013, cross-sectional survey, recommended rejection. Estimates were adjusted for covariates (not confounders) when it was not relevant to do so, grammar was poor and confusing in places, odds ratios were used when relative risks would be clearer, t-tests and chi-squareds were carried out and reported without any hypothesis being clearly stated or justified.
  • November 2013, exploratory / correlation study, recommended major revision then rejection when authors declined to revise the analysis. Ordinal data analysed as nominal, causing an error crossing p=0.05.
  • March 2014, randomised controlled trial, recommended rejection. Estimates were adjusted for covariates when it was not relevant to do so, and bold conclusions were made without justification.
  • April 2014, mixed methods systematic review, recommended minor changes around clarity of writing and details of one calculation.
  • May 2014, meta-analysis, recommended acceptance – conducted to current best practice, clearly written and on a useful topic.
  • July 2014, ecological analysis, recommended major revision. Pretty ropy on several fronts, but perhaps most importantly that any variables the authors could find had been thrown into an “adjusted” analysis with clearly no concept of what that meant or was supposed to do. Wildly optimistic conclusions too. Came back for re-review in September 2014 with toned-down conclusions and clarity about what had been included as covariates but the same issue of throwing the kitchen sink in. More “major revisions”; and don’t even think about sending it voetstoots to a lesser journal because I’ll be watching for it! (As of September 2015, I find no sign of it online)
  • July 2014, some other study I can’t find right now…
  • September 2014, cohort study. Clear, appropriate, important. Just a couple of minor additions to the discussion requested.
  • February 2015, secondary analysis of routine data, no clear question, no clear methods, no justification of adjustment, doesn’t contribute anything that we haven’t already known for 20 years and more. Reject.
  • February 2015, revision of some previously rejected paper where the authors try to wriggle out of any work by refuting basic statistical facts. Straight to the 5th circle of hell.
  • March 2015, statistical methods paper. Helpful, practical, clearly written. Only the very merest of amendments.
  • April 2015, secondary analysis of public-domain data. Inappropriate analysis, leading to meaningless conclusions. Reject.
  • April 2015, retrospective cohort study, can’t find the comments any more… but I think I recommended some level of revisions
  • September 2015, survey of a specific health service in a hard-to-reach population. Appropriate to the question, novel and important. Some amendments to graphics and tables were suggested. Minor revisions.
  • March 2016, case series developing a prognostic score. Nice analysis, written very well, and a really important topic. My only quibbles were about assuming linear effects. Accept subject to discretionary changes.
  • October 2016, cohort study. Adjusted for stuff that probably isn’t confounding, and adjusting (Cox regression) for competing risks when they should be recognised as such. Various facts about the participants that are not declared. Major revisions.
  • October 2016, diagnostic study meta-analysis. Well done, clearly explained. A few things could be spelled out more. Minor revisions.
  • November 2016, kind of a diagnostic study…, well-done, well-written, but very limited in scope and hard to tell what the implications for practice might be. Left in the lap of the gods (or rather, the editors).
  • December 2016, observational study of risk factors, using binary outcomes but would be more powerful with time-to-event if possible. Competing risks would have to be used in that case. Otherwise, nice.


Filed under research

Performance indicators and routine data on child protection services

The parts of social services that do child protection in England get inspected by Ofsted on behalf of the Department for Education (DfE). The process is analogous to the Care Quality Commission inspections of healthcare and adult social care providers, and they both give out ratings of ‘Inadequate’, ‘Requires Improvement’, ‘Good’ or ‘Outstanding’. In the health setting, there’s many years’ experience of quantitative quality (or performance) indicators, often through a local process called clinical audit and sometimes nationally. I’ve been involved with clinical audit for many years. One general trend over that time has been away from de novo data collection and towards recycling routinely collected data. Especially in the era of big data, lots of organisations are very excited about Leveraging Big Data Analytics to discover who’s outstanding, who sucks, and how to save lives all over the place. Now, it may not be that simple, but there is definitely merit in using existing data.

This trend is just appearing on the horizon for social care though, because records are less organised and electronic, and because there just hasn’t been that culture of profession-led audit. Into this scene came my colleagues Rick Hood (complex systems thinker) and Ray Jones (now retired professor and general Colossus of UK social care). They wanted to investigate recently open-sourced data on child protection services and asked if I would be interested in joining in. I was – and I wanted to consider this question: could routine data replace Ofsted inspections? I suspected not! But I also suspected that question would soon be asked in the cash-strapped corridors of the DfE, and I wanted to head it off with some facts and some proper analysis.

We hired master data wrangler Allie Goldacre, who combed through, tested, verified and combined the various sources:

  • Children in Need census, and its predecessor the Child Protection and Referrals returns
  • Children and Family Court Advisory and Support Service records of care proceedings
  • DfE’s Children’s Social Work Workforce statistics
  • SSDA903 records of looked-after children
  • Spending statements from local authorities
  • Local authority statistics on child population, deprivation and urban/rural locations.

Just because the data were ‘open’ didn’t mean they were useable. Each set had its own quirks and each local authority had its own problems and definitions in some cases. The data wrangling was painstaking and painful! As it’s all in the public domain, I’m going to add the data and code to my website here, very soon.

Then, we wrote this paper investigating the system and this paper trying to predict ‘Inadequate’ ratings. The second of these took all the predictors in 2012 (the most complete year for data) and tried to predict Inadequates in 2012 or 2013. We used the marvellous glmnet package in R and got down to three predictors:

  • Initial assessments within the target of 10 days
  • Re-referrals to the service
  • The use of agency workers

Together they get 68% of teams right, and that could not be improved on. We concluded that 68% was not good enough to replace inspection, and called it a day.
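If you haven’t met penalised regression before, this is roughly how it “gets down to” a few predictors. The sketch below is not our analysis and not glmnet’s algorithm (glmnet fits elastic-net models by coordinate descent). It is a bare-bones lasso logistic regression fitted by proximal gradient descent on synthetic data, where I have assumed that only the first two of six made-up predictors matter. All names and numbers are hypothetical.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero; inside the threshold it becomes exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    z = max(-35.0, min(35.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, penalty, step=0.5, iterations=300):
    """Proximal-gradient fit of logistic regression with an L1 penalty on the
    slopes (the intercept is unpenalised). Returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(iterations):
        residuals = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - target
                     for row, target in zip(X, y)]
        intercept -= step * sum(residuals) / n
        for j in range(p):
            gradient = sum(r * row[j] for r, row in zip(residuals, X)) / n
            beta[j] = soft_threshold(beta[j] - step * gradient, step * penalty)
    return intercept, beta


# Synthetic "teams": six standardised indicators, of which only two drive the outcome
rng = random.Random(0)
n_teams, n_predictors = 300, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(n_predictors)] for _ in range(n_teams)]
y = [1 if rng.random() < sigmoid(1.5 * row[0] - 1.5 * row[1]) else 0 for row in X]

fits = {penalty: fit_l1_logistic(X, y, penalty) for penalty in (0.0, 0.05, 10.0)}
for penalty, (intercept, beta) in fits.items():
    print("penalty %5.2f:" % penalty, " ".join("%6.2f" % b for b in beta))
```

With no penalty every predictor gets some coefficient; as the penalty grows, the weaker ones tend to be set to exactly zero, and with a very large penalty nothing survives. In practice the penalty would be chosen by cross-validation rather than by eye.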

But lo! Soon afterwards, the DfE announced that they had devised a new Big Data approach to predict Inadequate Ofsted scores, and that (what a coincidence!) it used the same three indicators. Well I never. We were not credited for this, nor indeed had our conclusion (that it’s a stupid idea) sunk in. Could they have just followed a parallel route to ours? Highly unlikely, unless they had an Allie at work on it, and I get no impression of the nuanced understanding of the data that would result from that.

Ray noticed that the magazine Children and Young People Now were running an article on the DfE prediction, and I got in touch. They asked for a comment and we stuck it in here.

A salutary lesson that cash-strapped Gradgrinds, starry eyed with the promises of big data after reading some half-cocked article in Forbes, will clutch at any positive message that suits them and ignore the rest. This is why careful curation of predictive models matters. The consumer is generally not equipped to make the judgements about using them.

A closing aside: Thomas Dinsmore wrote a while back that a fitted model is intellectual property. I think it would be hard to argue that coefficients from an elastic-net regression are mine and mine only, although the distinction may well be in how they are used, and this will appear in courts around the world now that they are viewed as commercially advantageous.


Filed under research

The sad ossification of Cochrane reviews

Cochrane reviews made a huge difference to evidence-based medicine by forcing consistent analysis and writing on systematic reviews, but now I find them losing the plot in a rather sad way. I wanted to write a longer critique while still indemnified by being a university employee and after the publication of a review I have nearly completed with colleagues (all of whom say “never again”). But those two things will not overlap. So, I’ll just point you to some advice on writing a Summary Of Findings table (the only bit most people read) from the Musculo-skeletal Group:

  • “Fill in the NNT, Absolute risk difference and relative percent change values for each outcome as well as the summary statistics for continuous outcomes in the comments column.”

“Summary”, you say? Well, I’m all for relative + absolute measures, but the NNT is a little controversial nowadays (cf Stephen Senn everywhere) and are all those stats going to have appropriate measures of uncertainty, or will they be presented as gospel truth? With continuous outcomes, we were required to state means, SDs, % difference, and % change in either arm, which seems a bit over the top to me, and, crucially, relies on some pretty bold assumptions about distributions: assumptions that are not necessary elsewhere in the review.
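To make the uncertainty point concrete, here is a small sketch of the NNT as the reciprocal of the absolute risk difference, carrying a simple Wald interval through the calculation. The trial counts are invented for illustration and the function names are my own, not anything from RevMan or the Cochrane guidance.

```python
import math


def risk_difference_ci(events_control, n_control, events_treated, n_treated, z=1.96):
    """Absolute risk reduction (control risk minus treated risk) with a Wald interval."""
    risk_c = events_control / n_control
    risk_t = events_treated / n_treated
    diff = risk_c - risk_t
    se = math.sqrt(risk_c * (1 - risk_c) / n_control
                   + risk_t * (1 - risk_t) / n_treated)
    return diff, diff - z * se, diff + z * se


def nnt_summary(diff, lower, upper):
    """Describe the NNT implied by a risk difference and its interval."""
    if lower > 0:  # whole interval favours treatment
        return "NNT %.1f (95%% CI %.1f to %.1f)" % (1 / diff, 1 / upper, 1 / lower)
    if upper < 0:  # whole interval favours control: number needed to harm
        return "NNH %.1f (95%% CI %.1f to %.1f)" % (-1 / diff, -1 / lower, -1 / upper)
    return ("risk difference %.3f (95%% CI %.3f to %.3f): the interval includes zero, "
            "so the NNT has no finite interval" % (diff, lower, upper))


# Hypothetical trials with the same risks (30% vs 20%) but different sizes
small_trial = nnt_summary(*risk_difference_ci(30, 100, 20, 100))
large_trial = nnt_summary(*risk_difference_ci(300, 1000, 200, 1000))
print(small_trial)
print(large_trial)
```

Both trials give the same point estimate, but only the larger one yields an interval you could sensibly print next to it. A bare NNT in a comments column hides that difference.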

  • “When different scales are used, standardized values are calculated and the absolute and relative changes are put in terms of the most often used and/or recognized scale.”

I can see the point of this but that requires a big old assumption about the population mean and standard deviation of the most often used scale, as well as an assumption of normality. Usually, these scales have floor/ceiling effects.

  • “there are two options for filling in the baseline mean of the control group: of the included trials for a particular outcome, choose the study that is a combination of the most representative study population and has a large weighting in the overall result in RevMan. Enter the baseline mean in the control group of this study. […or…] Use the generic inverse variance method in RevMan to determine the pooled baseline mean. Enter the baseline mean and standard error (SE) of the control group for each trial”

This is an invitation to plug in your favourite trial and make the effect look bigger or smaller than it came out. Who says there is going to be one trial that is most representative and has a precise baseline estimate? There will be fudges and trade-offs aplenty here.

  • “Please note that a SoF table should have a maximum of seven most important outcomes.”

Clearly, eight would be completely wrong.

  • “Note that NNTs should only be calculated for those outcomes where a statistically significant difference has been demonstrated”

Jesus wept. I honestly can’t believe I have to write this in 2017. Reporting only significant findings allows both genuine effects and noise to get through, and the quantity of noise can actually be huge, certainly not just 5% of results (cf John Ioannidis on everything being false, and Andrew Gelman on types of error).
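A quick simulation shows how large the share of noise can be once you filter on significance. The inputs are assumptions I picked for illustration, not estimates from any literature: 10% of outcomes have a real effect, and a real effect shifts the z-statistic by 1.5, which corresponds to fairly low power.

```python
import random


def share_of_noise_among_significant(n_outcomes=20000, prop_real=0.10,
                                     real_effect_z=1.5, critical_z=1.96, seed=2017):
    """Simulate z-statistics and return the proportion of 'significant'
    results that come from outcomes with no real effect."""
    rng = random.Random(seed)
    significant_real = significant_null = 0
    for _ in range(n_outcomes):
        is_real = rng.random() < prop_real
        z = rng.gauss(real_effect_z if is_real else 0.0, 1.0)
        if abs(z) > critical_z:
            if is_real:
                significant_real += 1
            else:
                significant_null += 1
    return significant_null / (significant_real + significant_null)


noise_share = share_of_noise_among_significant()
print("share of significant results that are pure noise: %.0f%%" % (100 * noise_share))
```

Under these assumptions, well over half of the significant results are noise. The 5% applies to null outcomes that turn out significant, not to significant outcomes that turn out null, and an NNT computed only for the survivors inherits that confusion.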

On calculating some absolute changes in % terms (all under 10%), reviewers then came back and told us that they should all be described as “slight improvement”, the term “slight” being reserved for absolute changes under a certain size. They also recommend using Cohen’s small-medium-large classification quite strictly, in a handy spreadsheet for authors called Codfish. I thought Cohen’s d and his classification had been thrown out long ago in favour of, you know, thinking. This is rather sad, as we see the systematic approach being ossified into a rigid set of rules. I suspect that the really clever methodologists involved in Cochrane are not aware of this, nor would they approve, but it is happening little by little in the specialist groups.


Archaeopteryx lithographica (Eichstätter specimen). H. Raab CC-BY-SA-3.0

This advice for reviewers is not on their website but needs proper statistical review and revision. We shouldn’t be going backwards in this era of Crisis Of Replication.


Filed under healthcare