A thinking exercise to teach about cryptic multiplicity

It’s Pi Day, and yesterday I saw a tweet from Mathematics Mastery, my sister-in-law’s brainchild, which pointed out that the number zero does not occur in the first 31 digits of pi. I wondered “what are the chances of that?” and then realised it was a fine example to get students of statistics to think through. Not because the probability is difficult to work out, but because the model and assumptions are not obvious. Pi is a transcendental number, meaning that it was discovered by Walt Whitman, or something like that. The digits 0-9 are believed to appear without any pattern (this has never actually been proven, but it is a reasonable working model), so the chance that a particular digit is a particular symbol is 0.1. The chance it’s not “0” is 0.9, and the 30 digits that follow are independent and identically distributed, so that comes to 0.9^31, or 3.8%.

But you’d be just as surprised to find that “3” does not appear. Or “8”. There was nothing special a priori about “0”. Students will hopefully spot this if you have shown them real-life examples like “women wear red or pink shirts when they ovulate“. (Your alarm bells might start ringing here, detecting an approaching rant about the philosophy of inference, but relax. I’m giving you a day off.) So we crunch through some nasty probability theory (if you’ve taught them that sort of stuff) and get the chance of one or more symbols being completely absent at just over 38%. Then you can subtract some unpleasant multiple absences and get back to about 34%, or just simulate it:

iter <- 1000000
pb <- txtProgressBar(min=1, max=iter, style=3)
count <- matrix(NA, iter, 10)
for (i in 1:iter) {
  setTxtProgressBar(pb, i)
  # draw 31 i.i.d. digits, each uniform on 0-9
  x <- sample(0:9, 31, replace=TRUE)
  # count how often each symbol appears
  for (j in 1:10) {
    count[i, j] <- sum(x == (j-1))
  }
}
close(pb)
# how many of the ten symbols are completely absent, per iteration
noneofone <- apply(count == 0, 1, sum)
table(noneofone)
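For comparison, the exact answer by inclusion-exclusion agrees with the simulation; the 38% and 34% quoted above are the first term and the full alternating sum:

# exact P(at least one of the ten symbols is absent from 31 digits)
k <- 1:9
10 * 0.9^31                                     # first term: just over 0.38
sum((-1)^(k+1) * choose(10,k) * ((10-k)/10)^31) # full sum: about 0.339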

But there’s another issue, and I hope that someone in a class would come up with it. Why 31? Only because the 32nd digit was the first “0”. So isn’t that also capitalising on chance? Yes, I think it is. An exploratory look-see analysis suddenly turned into a hard-nosed let’s-prove-something analysis because we got excited about a pattern. What we really need to examine is the chance of an n-free run of length 31 or greater, where n is any of the ten symbols we call digits. This is starting to sound more like a hypothesis test now, and you can get students to work it out with a negative binomial distribution. But the important message is not how to do this particular example, nor even that coincidences, being ill-defined a priori, happen a lot (though that’s important too: “million-to-one chances crop up nine times out of ten”, wrote Terry Pratchett). It is that our belief about the data-generating process determines how we analyse the data, and it is vital to stop and think about where the data came from, and why we believe that particular mental/causal model, before diving into the eureka stuff.
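If you do want to show the calculation, the number of non-“0” digits before the first “0” has a negative binomial distribution (geometric, in fact, since we are waiting for just one occurrence), so in R:

# P(31 or more non-"0" digits before the first "0") = 0.9^31
pnbinom(30, size=1, prob=0.1, lower.tail=FALSE)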


Filed under learning

Seminar at Kingston University, Wednesday 18 March 2015, 2:15 pm

Come and hear me talk about emerging work with Bayesian latent variable models and SEMs. Email Luluwah al-Fagih if you want to attend: L.Al-Fagih@kingston.ac.uk

Applying Bayesian latent variable models to imperfect healthcare data

Abstract: Analysis of routinely collected or observational healthcare data is increasingly popular, but troubled by poor data quality from a number of sources: human error, coding habits, and missing and coarse data all play a part. I will describe the application of Bayesian latent variable models to tackle issues like these in four projects: a pilot clinical trial in stroke rehabilitation; a meta-analysis including mean differences in depression scores alongside odds of reaching a threshold; an exploratory study of predictors of ocular tuberculosis; and an observational study of the timing of imaging in the initial treatment of major trauma patients. The motivation for the Bayesian approach is the ease of flexible modelling, and I will explain the choices of software and algorithms currently available. Using latent variables allows us to draw inferences based on the unknown true values that are not available to us, while explicitly taking all sources of uncertainty into account.


Filed under Uncategorized

Extract data from Dementia Care Mapping spreadsheets into Stata

I’ve recently been involved in three projects using the Dementia Care Mapping data collection tool. This is a neat way of getting really detailed longitudinal observations on the activity, mood, interactions and wellbeing of people with dementia in group settings. The makers of the DCM provide a spreadsheet which looks like this:

[Screenshot of the DCM raw data spreadsheet, copied from the Bradford DCM manual. These are not real data (I hope!)]

That is, there is a row of BCC behaviour codes and a row of ME quality-of-life codes for each person, typically captured every five minutes. Date, time and other stuff that might be important float above the main body of the data. Subsequent worksheets provide descriptive tables and graphs, but we will ignore those, as we want to extract the data into Stata for more detailed analysis. (But let me, in passing, point out this work in Bradford, which is moving towards electronic collection.)

The good news is that Stata versions 12 and 13 have improved import commands, with specific support for Excel spreadsheets. You can specify that you only want particular cells, which allows us to pull in the stuff at the top, like the date and time, and then go back and get the bulk of the data. Here’s my code:

// first, get date and time from cells at the top, store in macros
import excel "`xlfile'", sheet("Raw data") ///
    cellrange(A2:S2) clear
local visitdate=D[1] // date format
local visittime=I[1] // string
// now get DCM data
clear
import excel "`xlfile'", sheet("Raw data") ///
    cellrange(A4:EP28) clear
rename A ID
rename B measure
// rename each data column by the clock time in its first row;
// non-string columns contain no codes, so drop them
foreach v of varlist C-EP {
    capture confirm string variable `v'
    if !_rc {
        local temp=`v'[1]
        local hr=substr("`temp'",1,2)
        local mn=substr("`temp'",4,2)
        rename `v' t`hr'`mn'
    }
    else {
        drop `v'
    }
}
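To run that fragment on its own, you would first point the xlfile local at a spreadsheet; the path below is made up for illustration:

local xlfile "C:\temp\DCM_example.xlsx" // hypothetical example path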

In fact, I want to show you a bigger chunk of it, because in the most recent project there were several care homes, each was visited several times, and within each visit there was at least a before, during and after spreadsheet. Thankfully, my colleague who did the collection had filed everything away methodically, so the file and folder structure was very consistent, and I could capitalise on that.

/* Data extraction from multiple Dementia Care Mapping
   spreadsheets in multiple folders.
   NB the shell command is for DOS. If using
   another operating system, it needs to change. */

clear all
global loc "C:\Users\RLGrant\Desktop\Dementia Care Mapping"

global homes "A B C D E F"

// loop 1: over homes
local loopcount=1
foreach home of global homes {
    local folder2="${loc}\DCM_`home'"
    // save text list of this home's visit folders
    shell dir "`folder2'" > "`folder2'\folders.txt" /B

    // loop 2: over visits
    tempname fh2
    local linenum2 = 0
    file open `fh2' using "`folder2'\folders.txt", read
    file read `fh2' nextfolder
    while r(eof)==0 {
        if "`nextfolder'"!="folders.txt" {
            local visit = "`folder2'"+"\"+"`nextfolder'"
            local ++linenum2
            // save text list of files
            shell dir "`visit'" > "`visit'\files.txt" /B

            // loop 3: over files
            tempname fh
            local linenum = 0
            file open `fh' using "`visit'\files.txt", read
            file read `fh' nextfile
            while r(eof)==0 {
                if "`nextfile'"!="files.txt" {
                    local xlfile = "`visit'"+"\"+"`nextfile'"
                    dis as result "Now reading `xlfile'"
                    local ++linenum
                    // first, get date from cell, store in macro
                    import excel "`xlfile'", sheet("Raw data") ///
                        cellrange(A2:S2) clear
                    local visitdate=D[1] // date format
                    local visittime=I[1] // string
                    // now get DCM data
                    clear
                    import excel "`xlfile'", sheet("Raw data") ///
                        cellrange(A4:EP28) clear
                    rename A ID
                    rename B measure
                    // rename each data column by the clock time in its
                    // first row; drop columns containing no codes
                    foreach v of varlist C-EP {
                        capture confirm string variable `v'
                        if !_rc {
                            local temp=`v'[1]
                            local hr=substr("`temp'",1,2)
                            local mn=substr("`temp'",4,2)
                            rename `v' t`hr'`mn'
                        }
                        else {
                            drop `v'
                        }
                    }
                    qui drop in 1 // the row of clock times
                    qui count
                    local n=r(N)
                    // each person occupies two rows (BCC and ME);
                    // copy the ID down to the second row of each pair
                    forvalues i=2(2)`n' {
                        qui replace ID=ID[`i'-1] in `i'
                    }
                    qui drop if ID==""
                    // one row per person, BCC and ME side by side
                    qui reshape wide t*, i(ID) j(measure) string
                    gen home="`home'"
                    // add date and times
                    gen visitdate=`visitdate'
                    format visitdate %tdddMonCCYY
                    gen str24 visittime="`visittime'"
                    // add filename
                    gen str40 filename="`nextfile'"
                    order filename home visitdate visittime
                    // accumulate all files into one dataset
                    if `loopcount'==1 {
                        save "${loc}\alldata.dta", replace
                    }
                    else {
                        append using "${loc}\alldata.dta"
                        save "${loc}\alldata.dta", replace
                    }
                    local ++loopcount
                }
                file read `fh' nextfile
            }
            file close `fh'
            // shell del "`visit'\files.txt"
        }
        file read `fh2' nextfolder
    }
    file close `fh2'
    // shell del "`folder2'\folders.txt"
}

The trick here is to get a text file containing the contents of each folder, read that in, and loop through its contents. So it doesn’t really matter what the files are called. In this case the homes are named A-F, and at the end I do some date and time formatting. With different data these bits may well have to be tweaked, but I leave them here to inspire you if nothing else. Happy analysing!
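As an aside: if the DOS-only shell trick bothers you, Stata’s dir extended macro function can list a folder’s files cross-platform. A minimal sketch of the idea, not wired into the program above (the *.xls* pattern and the forward slashes are my assumptions):

// cross-platform alternative to shell dir ... > files.txt
local xlfiles : dir "`visit'" files "*.xls*"
foreach f of local xlfiles {
    dis as result "Now reading `visit'/`f'"
    // ... import excel and process as before ...
}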


Filed under Stata

More active learning in statistics classes – and hypothesis testing too

Most statistics teachers would agree that our face-to-face time with students needs to get more ‘active’. The concepts and the critical thinking so essential to what we do only sink in when you try them out. That applies as much to reading and critiquing others’ statistics as it does to working out your own. One area of particular interest to me is communicating statistical findings, something for which evidence of effective strategies is sorely lacking, so it remains most valuable to learn by doing.

It’s so easy to stand there and talk about what you do, but there’s no guarantee the students get it, or retain that information a week later. I always enjoy reading Andrew Gelman’s blog, and a couple of interesting discussions about active learning came up there recently, which I’ll signpost and briefly summarise.

Firstly, thinking aloud about activating a survey class (and a graphics/communication one, but most of the responses are about the familiar survey topics). The consensus seems to be to let the students discover things for themselves, painfully if necessary. That means letting them collect and grapple with messy data, not contrived examples. There are some nice pointers in there about stage-managing the student group experience (obviously we don’t really let them grapple unaided).

The statistical communication course came back next, with a refreshing theme that we don’t know how to do this (me neither, but we’re getting closer, I’d like to think). Check out O’Rourke’s suggested documents if nothing else!

Then, the problem of hypothesis testing. The dialogue between Vasishth and Gelman particularly crystallises the issue for practising analysts. It came back a couple of weeks later; I particularly like the section about a third of the way down, after Deborah Mayo appears, like an avenging superhero, to demolish the widely used, over-simplified interpretation of hypothesis testing in a single sentence. After that, Anonymous and Gelman cover a situation where two researchers look at the same data. Dr Good has a pre-specified hypothesis, tests it, finds a significant result, stops there and reports it. Dr Evil intends to keep fishing until something sexy and publishable turns up, but happens by chance to start with the same test as Dr Good. Satisfied with the magical p<0.05, Dr Evil too stops and writes it up. Is Evil’s work equivalent to Good’s? Is the issue with motivation or selection? Food for thought, but we have strayed from teaching into some kind of Socratic gunfight (doubly American!). However, I think there is no harm in exposing students (especially those already steeped in professional practice, like the healthcare professionals I teach) to these problems, because they already recognise them from the published literature, although they might not formulate them quite so clearly. Along the way, someone linked to this rather nice post by Simine Vazire.

(I don’t want you to think I’ve wimped out, so here’s my view, although that’s really not what this post is about: Rahul wrote “The reasonable course might be for [Dr Evil] to treat this analysis as exploratory in the light of what he observed. Then collect another data set with the express goal of only testing for that specific hypothesis. And if he again gets p<0.01 then publish.” – which I agree with, but for me all statistical results are exploratory. They might be hypothesis testing as well, but they are never proving or disproving stuff, always stacking evidence quantitatively in the service of a fluffier mental process called abduction or Inference to the Best Explanation. They are merely a feeble attempt to make a quantitative, systematic, less biased representation of our own thoughts.)

Now, if you like a good hypothesis testing debate, consider the journal that banned tests, and keep watching StatsLife for some forthcoming opinions on the matter.


Filed under learning

Adjust brightness in LXDE Linux

This is a little diversion from the usual stats.

I’ve been running LXDE Debian Linux on my small laptop for a while, and I’m really pleased with it. It handles all sorts of stuff, and the fact that L stands for Lightweight hardly ever holds me back. But it doesn’t have any screen brightness controls, and it seems lots of people have asked about this on forums. Usually the issue gets mixed up with assigning brightness control to a combination of keys, but that’s a bigger problem which depends on exactly what hardware you have. I just fixed it with a simple, crude hack, and as it bothers so many people, I thought I’d share it here.

Take a look in your /sys/class/backlight folder. I’ve got a /samsung folder inside that; you might have something different, but whatever you have, look around until you find a file called brightness, and another called max_brightness. Open them in your text editor of choice. In my case max_brightness simply contains the number 8, and brightness 1. To change the brightness of the screen, you change the number inside the brightness file. Make a new text file called go-dim, which contains this:

#!/bin/sh
echo 1 > /sys/class/backlight/samsung/brightness

Then, one called go-bright, which contains this:

#!/bin/sh
echo 4 > /sys/class/backlight/samsung/brightness

You don’t have to use 4 as the bright value; you can choose something else (less than or equal to the value inside max_brightness). Then save them somewhere easily accessible, like the Desktop, open the terminal and type:

chmod a+x Desktop/go-dim

and

chmod a+x Desktop/go-bright

Now, you can double-click those files on your desktop, choose “execute”, and they will do their thing for you. Obviously, if you save them somewhere else, you need to type the correct path in the chmod command.
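If you’d rather have a single script than two fixed files, here’s a minimal sketch of a version that takes the level as an argument. The samsung folder is from my machine, so adjust it to match yours; and on many systems the brightness file is only writable by root, so you may need to run it with sudo:

#!/bin/sh
# set the backlight to the level given as the first argument
f=/sys/class/backlight/samsung/brightness
max=$(cat /sys/class/backlight/samsung/max_brightness)
if [ "$1" -ge 0 ] 2>/dev/null && [ "$1" -le "$max" ]; then
    echo "$1" > "$f"
else
    echo "usage: $0 N (0 to $max)" >&2
    exit 1
fi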


Filed under computing

Kingston & St George’s Stats surgeries 2015

This year’s stats surgeries for our postgraduate students have just gone up on my website at http://www.robertgrantstats.co.uk/students.html


Filed under learning

My talk for Calculating and Communicating Uncertainty 2015 – and an experimental D3 page

Tomorrow and Wednesday I’ll be at the CCU2015 conference in Westminster. I’m talking in a workshop session on Wednesday about interactive graphics, and in particular how they can be used to communicate uncertainty, so I thought I would make an experimental page showing a couple of different ways of trying this. You can read my slides here and the experiment is here. While I’m sitting in the conference I’ll probably tidy up a couple of rough edges, and in due course I’ll make some more experiments. Please let me know what you think, and watch out for tweets with #ccu2015, although it means different things to different folks.

[Screenshot of the experimental page: ccu-ex1]


Filed under animation, JavaScript, Visualization