This is a very quick thought for teaching, in passing. People often talk about chi-squared tests being overpowered when n is large. It occurs to me that a good way to broach this concept in an intuitive way is to point out that they are no different to t-tests and the like, but do not provide a meaningful point estimate. When you see the mean difference in blood pressure from drug X is 0.3 mmHg, with p<0.001, you know it is clinically meaningless. When you see χ² = 3.89, nobody knows what to think. So perhaps the best thing to do is to mention this alongside non-parametric rank-based procedures, when you explain that they don’t give you an estimate or confidence interval.
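To make the point concrete, here is a sketch with made-up numbers (using scipy): a one-percentage-point difference between two huge groups gives a tiny p-value, while an effect-size measure like Cramér’s V shows how trivial it is.

```python
# Made-up example: 50% vs 51% "success" in two groups of 100,000 each.
# The chi-squared p-value is minuscule, but the effect is clinically trivial.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[50_000, 50_000],    # group A: successes, failures
                  [51_000, 49_000]])   # group B: successes, failures
chi2, p, dof, _ = chi2_contingency(table, correction=False)

# Cramer's V puts the effect on a 0-1 scale: here it is a measly 0.01
n = table.sum()
v = np.sqrt(chi2 / n)

print(f"chi2 = {chi2:.1f}, p = {p:.1e}, Cramer's V = {v:.3f}")
```

Nobody would act on a one-point difference, yet the p-value alone screams at you; the point estimate is what rescues the t-test reader, and the chi-squared reader never gets one.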
Procedural notes that can be skipped:
I had previously intended to write something about the shapes employed by Giorgia Lupi and the Accurat studio – and indeed I still will. But that takes some time and it got leapfrogged by Dear Data. This post came at a good time because I didn’t get around to it straight away (we’re now at week 35 of the project) and by the time I did, some other ideas had bubbled up in conversations, focussing my attention on the process of design, critique and refinement (which is getting added to my reading pile for the summer). These ideas are so alien to statisticians that I am not sure any of them will have read this far into this post, but they (we) are the ones that need to up their (our) game in communication. Nobody else will do it for us! The other building block that came along in time was finally finding really nice writing paper and resolving to draft everything by hand from now on, preferably at a time when I’m physically away from a computer. It has already proven very productive. People seem to have different approaches that work (like starting with bullet points, or cutting out phrases, or mind maps), but mine is to start writing at sentence one, like Evelyn Waugh, and just carry straight on. There is no draft; why should there be? Finding that technique and place to write is really valuable; don’t devalue it and try to squeeze it into a train journey or between phone calls. It’s the principal way in which you communicate your work, and probably the most overlooked.
Since some psychologists with no statistical qualifications decided to clean up psychology by banning statistics, other journals and assorted rags are seeing an opportunity to publish something kinda scientific, kinda popular, kinda easy. Here are three that spectacularly miss the point. I don’t mind people being wrong on the internet, but when peer-reviewed journals tell you how to fix bad stats in peer-reviewed journals by using bad stats, it probably ought to be pointed out. Maybe it’s how Tom Lehrer felt on hearing of Kissinger’s Nobel Prize.
Exhibit one: “The extent and consequences of P-hacking in science”, written by biologists in PLoS Biology. They are examining overt multiplicity – running loads and loads of tests and reporting only the super funky ones – which is quite rare. Most people are agreed on this. The real problem is cryptic multiplicity, or “the garden of forking paths”. Get rid of the overt abusers and you would still have Ioannidis’s problem of everything ever published being absolute nonsense (I paraphrase).
“peer-reviewed journals tell you how to fix bad stats in peer-reviewed journals by using bad stats”
Exhibit two: “P-values are just the tip of the iceberg”, written by biostatisticians in Nature. This is pretty good, but falls into that medical habit of envisioning all hypotheses as terribly clear-cut (mean systolic blood pressure at 6 weeks, per protocol, differs between the drug and placebo groups, &c &c), and they really are not. If we could only sort out study designs and data cleaning, they argue, then we could resume using p-values with impunity, safe in the knowledge they would lead us to the Universal Truth with alpha type 1 errors and beta type 2 errors. They name their own online course in statistics as the solution (I suppose there’s nothing wrong with that in a comment article, but we Brits feel a tad, you know, uncomfortable with the old self-promotion sort of thing (for a moment I imagined there would be a toll-free number to enrol in the footnotes, but there wasn’t; anyway, it’s $470 and you can pay up front or as-you-go with Coursera, they take all the usual credit cards)). There’s, tellingly, this diagram which shows how you collect your data, then clean it, then do some analyses and find what looks sexy, and then make a hypothesis, and then test it. Cool!
Spot the problem?
- Yes? Well done. You don’t really need to read this. Get on with your work.
- No? Scroll back up and read that garden o’ forking paths paper. Read it good.
Now, let’s be generous. I criticise Leek and Peng with some trepidation, because they know their onions. It could be that the comment piece was slashed about in the editing room by Nature’s staffers, and maybe the diagram was also the product of fevered interns. Heaven knows it wouldn’t be the first time.
Exhibit three: “What is medicine’s five sigma?”, in which the editor of the Lancet (a doctor, that most humble and self-effacing of professions) writes that p-values are awful, so the thing to do is to accept only papers with p<0.0000003. This misses the point so profoundly it’s like treating an ingrown toenail by amputation. Of the other leg.
I'm not even going to bother you with the psychology magazines.
Conclusion: plenty of experience of critiquing papers and running analyses does not seem to prevent a lack of understanding of what a hypothesis test really does and how it goes wrong familywise, sciencewise, or whatever. The recommended solution is to read some philosophy of science. The late, great Peter Lipton is my homeboy on this one. We should pre-specify everything we can, and not allow those around us to get away with vague scientific hypotheses that map to many possible statistical hypotheses (forks in the path). But we must also remember that sometimes you can't pre-specify everything, and that ultimately you are doing a sort-of-deductive process after an inductive guess at an interesting topic and before an inductive set of practical conclusions, so we should acknowledge this creativity and not brush it under the carpet.
By the way, my new book “Frog Sperm for Dummies” is published next week. Order it now on Amazon! (with apologies to Michael Healy).
I’ve been impressed with this website (constituencyexplorer.org.uk) put together by Jim Ridgway and colleagues at Durham, with input from the House of Commons Library and dataviz guru Alan Smith from the ONS. In part, it is aimed at members of parliament, so they can test their knowledge of facts about constituencies and learn more along the way. But it makes for a fun quiz for residents too. Everything is realised in D3, so it runs everywhere, even on your phone. There are a few features I really like: the clean design, the link between map, list and dotplot in the election results:
… the animation after choosing a value with the slider, highlighting the extra/shortfall icons and the numbers dropping in: nice!
… the simple but quite ambitious help pop-up:
… and the way that the dotplots are always reset to the full width of the variable, so you can’t be misled by small differences appearing bigger than they are. The user has to choose to zoom after seeing the full picture.
All in all, a very nice piece of work. I must declare that I did contribute a few design suggestions in its latter stages of development but I really take no credit for its overall style and functionality. Budding D3 scripters could learn a lot from the source code.
And while we’re on the topic, here’s some more innovative electoral dataviz:
And finally, take a moment to ask election candidates to commit to one afternoon of free statistical training, a great initiative from the RSS – and frankly, not much to ask. Unfortunately, none of my local (Croydon Central) would-be lawmakers have bothered to write back yet. But here are the parties that are most interested in accurate statistics, in descending order (by mashing up this and this):
- National Health Alliance: 4/13
- Pirate Party: 1/6
- Green Party: 47/568
- Labour Party: 51/647
- Plaid Cymru: 3/40
- Liberal Democrats: 47/647
- Ulster Unionist Party: 1/15
- Christian People’s Alliance: 1/17
- Conservative and Unionist Party: 27/647
- Scottish National Party: 2/59
- United Kingdom Independence Party: 15/624
Last summer I was fortunate to attend three stats conferences in the USA. It was a mixture of exciting travel and hard slog, with some great learning and some unexpected surprises; among them, attempting to explain the Gloucestershire cheese-rolling race to a Navajo family, and wondering whether I could bring a dried buffalo scrotum back through Heathrow (it would’ve made a great pen-holder).
However, the academic highlight for me was ICOTS, the ninth International Conference On Teaching Statistics, in Flagstaff, Arizona. All I knew about Flagstaff I learnt from the Rolling Stones, so it was reassuring to find that my hotel room did indeed look over Route 66.
So, I want to leave aside the rest of my trip and share the biggest themes from ICOTS. I’m a university lecturer, so that’s what I’m focussing on, though there’s plenty to say about schools too. But first, a little background from my point of view.
When statistics evolved as a distinct academic discipline in the mid-20th century, it was invariably in the shadow of the mathematics department. To be taken seriously as an academic, one had (and still has) to display the trappings of expertise and rigour. Yet this could be done either by lots of abstraction and mathematics, or by lots of application and real-life problems. Some of the greatest names of that era, like Frank Wilcoxon and George Box, learned their skills from application (as did Fisher fifty years earlier), but mostly the maths won; it was the path of least resistance in a larger university setting, and that informed the teaching.
However, as years went by everybody wanted to learn some stats: economists, ecologists, archaeologists, doctors, you name it. But they typically weren’t so good at maths, at least in Europe and North America. Personally, I like to glibly attribute this to hippie school teachers, but that’s a little unfair. So, to accommodate these students, many introductory statistics courses for non-statisticians dumbed down. The mathematical foundations were largely dumped and replaced with recipes. You all know the sort:
- Do a Kolmogorov-Smirnov test.
- If p<0.05, do a Mann-Whitney.
- If not, do a t-test.
- Either way, if p<0.05, you can say the difference is ‘significant’.
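For what it’s worth, the whole recipe fits in a dozen lines of scipy, which is perhaps exactly why it spreads so easily. To be clear, this sketch (with invented data) mechanises the flawed cookbook being criticised, not a recommended analysis:

```python
# The cookbook above, mechanised -- this is the *criticised* recipe,
# written out to show how little thought it demands.
import numpy as np
from scipy import stats

def cookbook(x, y, alpha=0.05):
    # Step 1: Kolmogorov-Smirnov 'normality' test on the pooled, standardised data.
    # (Using estimated parameters invalidates the KS test anyway -- one more flaw.)
    pooled = np.concatenate([x, y])
    z = (pooled - pooled.mean()) / pooled.std(ddof=1)
    ks_p = stats.kstest(z, "norm").pvalue
    # Steps 2-3: branch on the normality p-value
    if ks_p < alpha:
        p = stats.mannwhitneyu(x, y).pvalue
    else:
        p = stats.ttest_ind(x, y).pvalue
    # Step 4: dichotomise, and never look at an effect size
    return p, ("significant" if p < alpha else "not significant")

x = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 5.2, 4.9, 5.7])  # invented data
y = np.array([6.2, 6.8, 5.9, 7.1, 6.5, 6.9, 6.3, 7.0])
p, verdict = cookbook(x, y)
print(p, verdict)
```

No question asked, no estimate reported, no context considered: a perfect exam answer and a useless analysis.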
This has the effect of getting people through exams, but leaves them with no idea what to do in real life, or worse, with inflated notions of their own competence. It is surface learning, not deep understanding. The choice between mathemania and cookbooks is the reason why people feel comfortable telling you that statistics was the only course they failed at university (cf http://www.marketmenot.com/mcdonalds-on-that-grind-commercial/), or – even more worrying – that they got an A but never understood what was going on.
The movement to revive introductory statistics courses is really focussed around the American Statistical Association’s Guidelines for Assessment and Instruction in Statistics Education (GAISE). This is the only set of guidelines on how to teach statistics, yet if you are a British statistics teacher you will probably never have heard of them. They are fairly widely used in the USA, Australia and Canada, though not universally by any means, but are wholeheartedly adopted in New Zealand, where they inform the national policy on teaching statistics. The principles are:
- use real data, warts and all
- introduce inference (the hardest bit) with simulation, not asymptotic formulas
- emphasise computing skills (not a vocational course in one software package)
- emphasise flexible problem-solving in context (not abstracted recipes)
- use more active learning (call this “flipping”, if you really must)
The guidelines include a paragraph called “The Carpentry Analogy”, which I like so much I shall reproduce it here:
In week 1 of the carpentry (statistics) course, we learned to use various kinds of planes (summary statistics). In week 2, we learned to use different kinds of saws (graphs). Then, we learned about using hammers (confidence intervals). Later, we learned about the characteristics of different types of wood (tests). By the end of the course, we had covered many aspects of carpentry (statistics). But I wanted to learn how to build a table (collect and analyze data to answer a question) and I never learned how to do that.
The ICOTS crowd are preoccupied with how to achieve this in real life, and I will group the ideas into three broad topics:
- reversing the traditional syllabus
- inference by simulation
- natural frequencies
and then describe two other ideas which are interesting, but whose practical implementation is less clearly defined.
Reversing the traditional syllabus
Most introductory statistics courses follow an order unchanged since Snedecor’s 1937 textbook: the first to be aimed at people studying statistics (rather than learning how to analyse their own research data). It may begin with probability theory, though sometimes this is removed along with other mathematical content. At any rate, a problem here is that, without the mathematics that appears later for likelihood and the properties of distributions, the role of probability is unclear to the student. It is at best a fun introduction, full of flipping coins, rolling dice and goats hiding behind doors. But the contemporary, vocationally-focussed student (or customer) has less patience for goats and dice than their parents and grandparents did.
Next, we deal with distributions and their parameters, which also introduces descriptive statistics, although the distribution is an abstract and subtle concept, and there are many statistics which are not parameters. Again, the argument goes, once the properties of estimators were removed so as not to scare the students, it was no longer obvious why they should learn about parameters and distributions.
Then we move to tests and confidence intervals, though we may not talk much about the meaning or purpose of inference in case it is discouraging to the students. This is where they are in danger of acquiring the usual confusions: that the sampling distribution and data distribution are the same, that p-values are the chance of being wrong, and that inference can be done without consideration for the relationship between the sample and the population. Students can easily commit to memory magic spells such as “…in the population from which the sample was drawn…” and deploy them liberally to pass exams, without really understanding. Evidence from large classes suggests this is the point where marks and attendance drop.
Then we introduce comparison of multiple groups and perhaps some experimental design. There may be some mention of fixed and random effects (but necessarily vague) before we move to the final, advanced topic: regression. The appearance of regression at the end is Snedecor’s choice; if presented mathematically, that’s probably (!) the right order, because it depends on other concepts already introduced, but if we drop the maths, we can adopt a different order, one that follows the gradual building of students’ intuition and deeper understanding.
Andy Zieffler and colleagues at Minnesota have a programme called CATALST (http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_8B1_ZIEFFLER.pdf). This first introduces simulation from a model (marginal then conditional), then permutation tests, then bootstrapping. This equates to distributions, then regression, then hypothesis tests, then confidence intervals. This flips around Snedecor’s curriculum, and was echoed in a different talk by David Spiegelhalter. CATALST emphasises model+data throughout as an overarching framework. However, Zieffler noted that after 5 weeks the students do not yet have a deep concept of quantitative uncertainty (so don’t expect too much too quickly). Spiegelhalter’s version is focussed on dichotomous variables: start with a problem, represent it physically, do experiments, represent the results as trees or two-way tables or Venn diagrams to get conditional proportions, talk about expectation in future experiments, and finally get to probability. Probability manipulations like Bayes or P(a,b)=P(a|b)P(b) arrive naturally at the end and then lead to abstract notions of probability rather than the other way round. Visual aids are used throughout. One growth area that wasn’t represented much at ICOTS was interactive didactic graphics in the web browser (e.g. http://www2.le.ac.uk/Members/pl4/interactive-graphs). Some groups have developed Java applets and compiled software, but this suffers from translation onto different platforms and particularly onto mobile devices. The one group that have a product that is flexible and modern is the Lock family; more on them later.
Inference by simulation
The GAISE recommendation on introducing inference is a particularly hot topic. The notion is that students can get an intuitive grasp of what is going on with bootstrapping and randomisation tests far more easily than if you ask them to envisage a sampling distribution, arising from an infinite number of identical studies, drawing from a population, where the null hypothesis is true. This makes perfect sense to us teachers who have had years to think about it (and we are the survivors, not representative of the students). When you pause to reflect that I have just described something that doesn’t exist, arising from a situation that can never happen, drawn from something you can never know, under circumstances that you know are not true, you see how this might not be the simplest mental somersault to ask of your students.
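A randomisation test really is as simple as its description: shuffle the group labels, recompute the difference, and see how often the shuffled world beats the real one. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(2015)

# Invented outcomes for two small groups
drug    = np.array([5.2, 6.1, 5.8, 6.4, 5.9, 6.2])
placebo = np.array([5.0, 5.3, 4.9, 5.6, 5.1, 5.4])
observed = drug.mean() - placebo.mean()

pooled = np.concatenate([drug, placebo])
n = len(drug)
reps = 10_000
count = 0
for _ in range(reps):
    rng.shuffle(pooled)                    # break any real group effect
    shuffled_diff = pooled[:n].mean() - pooled[n:].mean()
    if abs(shuffled_diff) >= abs(observed):
        count += 1
p = count / reps
print(f"observed difference = {observed:.2f}, permutation p = {p:.4f}")
```

No sampling distribution to imagine, no asymptotic formula: the null hypothesis is physically enacted by the shuffle.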
A common counter-argument is that simulation is an advanced topic. But this is an accident of history: non-parametrics, randomisation tests and bootstrapping were harder to do before computers, so we had to rely on relatively simple asymptotic formulas. That just isn’t true any more, and it hasn’t been since the advent of the personal computer, which brings home for me the extent of inertia in statistics teaching. Another argument is that the asymptotics are programmed in the software, so all students have to do is choose the right test and they get an answer. But you could also see this as a weakness; for many years statisticians have worried about software making things “too easy”, and this is exactly what that worry is about, that novices can get all manner of results out, pick an exciting p-value, write it up with some technical-sounding words and get it published. Simulation is a little like a QWERTY keyboard in that it slows you down just enough so you don’t jam the keys (younger readers may have to look this up). As for bootstrapping, most of us recall thinking it was too good to be true when we first heard about it, and we may fear the same reaction from our students, but that reaction is largely a result of being trained in getting confidence intervals the hard way, by second derivatives of the log-likelihood function. I’ve been telling them about bootstrapping (which is now super-easy in SPSS) since this academic year started, without so much as a flicker of surprise on their faces. A few days after ICOTS, I was having a cappuccino and a macaroon with Brad Efron in Palo Alto (my colleague Gill Mein says I am a terrible name-dropper but I’m just telling it like it is) and I asked him about this reaction. He said that when his 1979 paper came out, everybody said “it shouldn’t have been published because it’s obviously wrong” for a week and then “it shouldn’t have been published because it’s obvious” after that. 
I think that’s a pretty good sign of its acceptability. I just tell the students we’re doing the next best thing to re-running the experiment many times.
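The percentile bootstrap itself is only a few lines, which is part of why students take it in their stride. A sketch with invented blood-pressure-ish numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([128, 134, 121, 140, 133, 126, 138, 129, 131, 125])  # invented

# "Re-run the study" 5,000 times by resampling with replacement
boots = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(5_000)
])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"mean = {data.mean():.1f}, 95% bootstrap CI = ({lo:.1f}, {hi:.1f})")
```

The resampling loop is the explanation: each bootstrap sample is the next best thing to another study.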
After the end of the main conference, I fought off a Conference Cold and went to a workshop on teaching inference by simulation down the road on the Northern Arizona University campus.
This was split into two sessions, one with Beth Chance and Allan Rossman from Cal Poly (https://www.causeweb.org/ – which contains some information on CATALST too), another with some of the Locks (http://lock5stat.com/). Here a classroom full of stats lecturers worked through some of the exercises these simulation evangelists have tested and refined on their own students. One in particular I took away and have used several times since, with my own MRes students, other Kingston Uni Masters students, doctors in the UAE, clinical audit people in Leicester, and probably some others I have unfairly forgotten. It seems to work quite well, and its purpose is to introduce p-values and null hypothesis significance testing.
I take a bag of ‘pedagogical pennies’ and hand them out. Of course the students have their own coins but this makes it a little more memorable and discourages reserved people from sitting it out. A univariate one-group scenario is given to them that naturally has H0: pi=50%. You might say that ten of your patients have tried both ice packs and hot packs for their knee osteoarthritis, and 8 say they find the ice better. Could that be convincing enough for you to start recommending ice to everyone? (Or, depending on the audience, 8 out of 10 cats prefer Whiskas: https://youtu.be/jC1D_a1S2xs) I point out that the coin is a patient who has no preference (or a cat) and they all toss the coin 10 times. Then on the flipchart, I ask how many got no heads, 1 head, 2 heads… and draw a dot plot. We count how many got 0, 1, 2, 8, 9 or 10 and this proportion of the whole class is the approximate p-value. They also get to see a normal-ish sampling distribution emerge on the chart, and the students with weird results (I recently got my first 10/10; she thought it was a trick) can see that this is real life; when they get odd results in research, they just can’t see the other potential outcomes. Hopefully that shows them what the null hypothesis is, and the logic behind all p-values. It’s close enough to 0.05 to provoke some discussion.
The fact that simulation gives slightly different answers each time is also quite useful, because you can emphasise that p-values should be a continuous scale of evidence, not dichotomised, and that little tweaks to the analysis can tip over into significance a result that should really have been left well alone. (Of course, this is a different issue to sampling error, but as an aside it seems to work quite well.) One problem I haven’t worked a way around yet is that, at the end, I tell the students that of course they would really do this in the computer, which would allow them to run it 1000 times, not limited by their class size, and I fear that is a signal for them to forget everything that just happened. The best I can offer right now is to keep reminding them about the coin exercise, ten minutes later, half an hour later, at the end of the day and a week later if possible. I also worry that too much fun means the message is lost. Give them a snap question to tell you what a p-value is, with some multiple choices on a flipchart, offering common misunderstandings, and then go through each answer in turn to correct it. This is a tricky subject so it won’t be instant.
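Back at the computer, the whole-class coin exercise scales up in a few lines: each simulated “student” is ten tosses of a fair coin, and the approximate p-value is the share of simulations at least as extreme as 8 out of 10 (the exact two-sided binomial value is 112/1024 ≈ 0.109):

```python
import numpy as np

rng = np.random.default_rng(10)
students = rng.binomial(n=10, p=0.5, size=10_000)   # 10,000 'students', 10 tosses each

# How often is the result at least as extreme as 8 heads out of 10?
extreme = (students >= 8) | (students <= 2)
p_approx = extreme.mean()
print(f"approximate p-value = {p_approx:.3f}")
```

Running it a few times shows the Monte Carlo wobble around 0.109, which itself is a teaching point about p-values as a continuous scale of evidence.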
The Locks have a book out, the first to give a comprehensive course in stats with inference-by-simulation at its heart. It’s a great book and I recommend you check it out on their website. They also have some interactive analyses and graphics which allow the student to take one of their datasets (or enter their own!) and run permutation tests and bootstrap confidence intervals. It all runs in the browser so will work anywhere.
David Spiegelhalter spoke on the subject of natural frequencies with some passion. He has been involved in revising the content of the GCSE (16 year old) mathematics curriculum in the UK. Not every aspect in the final version was to his taste, but he made some inroads with this one, and was clearly delighted (http://understandinguncertainty.org/using-expected-frequencies-when-teaching-probability).
Some of the classic errors of probability can be addressed this way, without having to introduce algebraic notation. The important feature is that you are always dealing with a number of hypothetical people (or other units of analysis), with various things happening to some of them. It relates directly to a tree layout for probabilities, but without the annoying little fractions. I am also a fan of waffle plots for visualising proportions of a whole with a wide range of values (https://eagereyes.org/blog/2008/engaging-readers-with-square-pie-waffle-charts) and it would be nice to do something with these – perhaps bringing the interactive element in! One downside is that you often have to contrive the numbers to work out ‘nicely’, which prevents you quickly responding to students’ “what if” questions.
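To make the idea concrete, here is the natural-frequency arithmetic for a made-up screening problem, done with whole numbers of hypothetical people rather than probabilities (the counts are invented, and contrived to divide nicely):

```python
# Invented screening numbers:
# out of 1,000 people, 10 have the condition; the test catches 9 of them,
# but also flags 99 of the 990 who don't have it.
people    = 1000
with_cond = 10
true_pos  = 9      # 9 of the 10 with the condition are flagged
false_pos = 99     # 99 of the 990 without it are flagged anyway

flagged = true_pos + false_pos   # 108 people walk out with a positive result
prob_cond_given_pos = true_pos / flagged
print(f"{true_pos} of the {flagged} flagged actually have it: "
      f"P(condition | positive) = {prob_cond_given_pos:.3f}")
```

The same question posed as “the test has 90% sensitivity and a 10% false-positive rate, the prevalence is 1%…” produces the classic base-rate blunders; counting hypothetical people does not.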
Now for the couple of extras.
Statistical consulting is being used as a learning experience, in much the same way that you can get a cheap haircut from trainee hairdressers, at Pretoria (http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_C213_FLETCHER.pdf) and Truman State University (http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_C195_KIM.pdf). It sounds scary and like hard work, but it is a very innovative and bold idea, and we know that many students who are serious about using statistics in their career will have to do this to some extent, so why not give them some experience?
I was really impressed by Esther Isabelle Wilder of CUNY’s project NICHE (http://serc.carleton.edu/NICHE/index.html and http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_7D3_WILDER.pdf), which aims to boost statistical literacy in further and higher education, cutting across specialisms and silos in an institution. It acknowledges that many educators outside stats have to teach some, that they may be rusty and feel uncomfortable about it, and provides a safe environment for them to boost their stats skills and share good ideas. This is a very big and real problem and it would be great to see a UK version! Pre- and post-test among the faculty shows improvement in their comprehension, and they have to turn people away each summer because it has become so popular.
Finally, here’s a couple of exercises I liked the sound of:
Open three packets of M&Ms, and arrange them by colour. Get students to talk about what they can conclude about the contents of the next pack. (Fundamentalist frequentists might not like this.) This came from Markus Vogel & Andreas Eichler.
Ask students to design and conduct a study with an MP3 player, to try to determine whether the shuffling is random; this was devised by Andy Zieffler and reported by Katie Makar. We know that iPods are in fact not random, because customers initially complained that they were playing two or three songs from the same album together! I can’t vouch for other brands but Android’s built-in player seems to do truly random things (as of v 4.3).
I thought this article in BMC Med Res Meth sounded interesting: “The rise of multiple imputation: a review of the reporting and implementation of the method in medical research“. I feel that we should know more about the adoption of analytical techniques, how they spread and are facilitated or blocked, but such reviews are almost unheard of. Unfortunately this one doesn’t answer a whole bunch of questions I have, largely because it just looks at two big-name medical journals in the years after software became widely available. I want to know what happens to get people using a new method (earlier in the history) and what gets printed all the way from top predators to bottom feeders.
Sadly, the crucial table 4, which lists the technical specifications of the methods used, is totally confusing. The half-page of minuscule footnotes (tl;dr) doesn’t help. All I can say is that not many of the papers used 100+ imputations, which is an interesting point because it seems (see a paper by Ian White which I can’t remember right now) that Don Rubin’s original experiments with a handful of imputed datasets might not be enough in some circumstances, and also with computer power on every desk (it’s not 1977 any more), there’s no reason not to pump it up. Yet the original advice lives on.
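Rubin’s rules make the arithmetic explicit: the pooled total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance, so the penalty for a small m is visible directly. A sketch with invented W and B:

```python
# Rubin's rules for pooling an estimate over m imputations:
#   T = W + (1 + 1/m) * B
# where W = average within-imputation variance of the estimate
# and   B = between-imputation variance. W and B here are invented.
W, B = 1.00, 0.40

def total_variance(m, W=W, B=B):
    return W + (1 + 1/m) * B

for m in (3, 5, 20, 100):
    print(f"m = {m:3d}: total variance = {total_variance(m):.3f}")
```

The 1/m term shrinks towards zero as m grows, and with modern computers there is no cost to letting it.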
It would be interesting to look at tutorial papers and reports and websites for various professions, as these are often badly written by someone who has A Little Knowledge, and see how these are referenced. I suspect a lot of people are going around referencing the original paper or the Rubin & Little book having never read either of them. Then someone should do some interviews of applied researchers who have published papers including MI. There you go, I’ve just given you a PhD topic. Off you go and do the thing.
(but not short enough for a tweet)
I’ve long admired the capacity of Stata developers to encapsulate complex statistical methods in a few plain English words.
Now that their new release (version 14) includes some MCMC methods, they explain the world of Bayesian analysis in the leaflet thus:
Bayesian analysis is a statistical analysis that answers research questions about unknown parameters using probability statements. For example, what is the probability that the average male height is between 70 and 80 inches or that the average female height is between 60 and 70 inches?
That is very impressive. Yes, there’s other stuff you can say, but not without complicating it and discouraging the curious. This captures both one of the objectives and the fundamental technical difference, and one of the different ways in which results are interpreted.
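That kind of statement falls straight out of a posterior distribution. A toy sketch, assuming (purely for illustration) that the posterior for average male height came out as normal with mean 74 inches and standard deviation 3:

```python
from scipy.stats import norm

# Hypothetical posterior for average male height (inches): Normal(74, 3)
posterior = norm(loc=74, scale=3)

# The leaflet's question: the probability that the average lies between 70 and 80
p = posterior.cdf(80) - posterior.cdf(70)
print(f"P(70 < average height < 80 | data) = {p:.3f}")
```

No confidence-interval contortions required: the probability statement is directly about the parameter, which is exactly the difference the leaflet captures.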
I just had to blog that straight away, but there are two related things to look out for if you are crazy about Bayesy:
- As you might know, I’ve written a Stata-to-Stan interface. It’s called StataStan, and it has, to my delight, made me one of the Stan developers. It’s a bit like writing a little rhyme and suddenly being invited on tour with the Wu-Tang Clan. I will be posting a long explanation of it and some examples here very soon.
- Rasmus Bååth and I have a plan afoot to promote succinct introductions like this. When we have time (ha!) we will launch that online and hopefully get contributions from a few wise people (proverbial RZAs and GZAs, but perhaps no ODBs).