Trends in teaching statistics (reporting from ICOTS 2014)

Last summer I was fortunate to attend three stats conferences in the USA. It was a mixture of exciting travel and hard slog, with some great learning and some unexpected surprises; among them, attempting to explain the Gloucestershire cheese-rolling race to a Navajo family, and wondering whether I could bring a dried buffalo scrotum back through Heathrow (it would’ve made a great pen-holder).

However, the academic highlight for me was ICOTS, the ninth International Conference On Teaching Statistics, in Flagstaff, Arizona. All I knew about Flagstaff I learnt from the Rolling Stones, so it was reassuring to find that my hotel room did indeed look over Route 66.

Get your teaching tricks on Route 66

So, I want to leave aside the rest of my trip and share the biggest themes from ICOTS. I’m a university lecturer, so that’s what I’m focussing on, though there’s plenty to say about schools too. But first, a little background from my point of view.

When statistics evolved as a distinct academic discipline in the mid-20th century, it was invariably in the shadow of the mathematics department. To be taken seriously as an academic, one had (and still has) to display the trappings of expertise and rigour. Yet this could be done either by lots of abstraction and mathematics, or by lots of application and real-life problems. Some of the greatest names of that era, like Frank Wilcoxon and George Box, learned their skills from application (as did Fisher fifty years earlier), but mostly the maths won; it was the path of least resistance in a larger university setting, and that informed the teaching.

However, as the years went by everybody wanted to learn some stats: economists, ecologists, archaeologists, doctors, you name it. But they typically weren’t so good at maths, at least in Europe and North America. Personally, I like to glibly attribute this to hippie school teachers, but that’s a little unfair. So, to accommodate these students, many introductory statistics courses for non-statisticians were dumbed down. The mathematical foundations were largely dumped and replaced with recipes. You all know the sort:

  1. Do a Kolmogorov-Smirnov test.
  2. If p<0.05, do a Mann-Whitney.
  3. If not, do a t-test.
  4. Either way, if p<0.05, you can say the difference is ‘significant’.
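In code, the whole recipe boils down to a handful of lines. Here is an R sketch with made-up numbers, purely to show how mechanical it is (not an endorsement):

x <- c(5.1, 6.3, 4.8, 7.0, 5.5, 6.1)   # made-up group 1
y <- c(6.8, 7.4, 6.0, 8.1, 7.7, 6.9)   # made-up group 2
p_norm <- ks.test(x, "pnorm", mean(x), sd(x))$p.value   # step 1: K-S test for normality
if (p_norm < 0.05) {
  result <- wilcox.test(x, y)          # step 2: "not normal", so Mann-Whitney
} else {
  result <- t.test(x, y)               # step 3: otherwise, a t-test
}
result$p.value < 0.05                  # step 4: declare the difference "significant", or not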
It’s nice to name statistical tests after the pioneers (like this here feller) but not so helpful for the students.

This has the effect of getting people to pass exams, but then having no idea what to do in real life or, worse, having inflated notions of their own competence. It is surface learning, not deep understanding. The choice between mathemania and cookbooks is the reason why people feel comfortable telling you that statistics was the only course they failed at university (cf. http://www.marketmenot.com/mcdonalds-on-that-grind-commercial/), or – even more worrying – that they got an A but never understood what was going on.

The movement to revive introductory statistics courses is really centred on the American Statistical Association’s Guidelines for Assessment and Instruction in Statistics Education (GAISE). This is the only set of guidelines on how to teach statistics, yet if you are a British statistics teacher you have probably never heard of them. They are fairly widely used in the USA, Australia and Canada, though not universally by any means, and are wholeheartedly adopted in New Zealand, where they inform the national policy on teaching statistics. The principles are:

  • use real data, warts and all
  • introduce inference (the hardest bit) with simulation, not asymptotic formulas
  • emphasise computing skills (not a vocational course in one software package)
  • emphasise flexible problem-solving in context (not abstracted recipes)
  • use more active learning (call this “flipping”, if you really must)

The guidelines include a paragraph called “The Carpentry Analogy”, which I like so much I shall reproduce it here:

In week 1 of the carpentry (statistics) course, we learned to use various kinds of planes (summary statistics). In week 2, we learned to use different kinds of saws (graphs). Then, we learned about using hammers (confidence intervals). Later, we learned about the characteristics of different types of wood (tests). By the end of the course, we had covered many aspects of carpentry (statistics). But I wanted to learn how to build a table (collect and analyze data to answer a question) and I never learned how to do that.

The ICOTS crowd are preoccupied with how to achieve this in real life, and I will group the ideas into three broad topics:

  • reversing the traditional syllabus
  • inference by simulation
  • natural frequencies

I’ll then describe two other ideas which are interesting, although it is less clear how they could be implemented.

Reversing the traditional syllabus

Most introductory statistics courses follow an order unchanged since Snedecor’s 1937 textbook: the first to be aimed at people studying statistics (rather than learning how to analyse their own research data). It may begin with probability theory, though sometimes this is removed along with other mathematical content. At any rate, a problem here is that, without the mathematics that appears later for likelihood and the properties of distributions, the role of probability is unclear to the student. It is at best a fun introduction, full of flipping coins, rolling dice and goats hiding behind doors. But the contemporary, vocationally-focussed student (or customer) has less patience for goats and dice than their parents and grandparents did.

Next, we deal with distributions and their parameters, which also introduces descriptive statistics, although the distribution is an abstract and subtle concept, and there are many statistics which are not parameters. Again, the argument goes, once the properties of estimators were removed so as not to scare the students, it was no longer obvious why they should learn about parameters and distributions.

Then we move to tests and confidence intervals, though we may not talk much about the meaning or purpose of inference in case it discourages the students. This is where they are in danger of acquiring the usual confusions: that the sampling distribution and the data distribution are the same, that p-values are the chance of being wrong, and that inference can be done without any consideration of the relationship between the sample and the population. Students can easily commit to memory magic spells such as “…in the population from which the sample was drawn…” and deploy them liberally to pass exams, without really understanding. Evidence from large classes suggests this is the point where marks and attendance drop.

Then we introduce comparison of multiple groups and perhaps some experimental design. There may be some mention of fixed and random effects (but necessarily vague) before we move to the final, advanced topic: regression. The appearance of regression at the end is Snedecor’s choice; if presented mathematically, that’s probably (!) the right order, because it depends on other concepts already introduced, but if we drop the maths, we can adopt a different order, one that follows the gradual building of students’ intuition and deeper understanding.

Andy Zieffler and colleagues at Minnesota have a programme called CATALST (http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_8B1_ZIEFFLER.pdf). It first introduces simulation from a model (marginal, then conditional), then permutation tests, then bootstrapping; this equates to distributions, then regression, then hypothesis tests, then confidence intervals, which flips Snedecor’s curriculum around. The same idea was echoed in a different talk by David Spiegelhalter. CATALST emphasises model-plus-data throughout as an overarching framework. However, Zieffler noted that after five weeks the students do not yet have a deep concept of quantitative uncertainty, so don’t expect too much too quickly.

Spiegelhalter’s version is focussed on dichotomous variables: start with a problem, represent it physically, do experiments, represent the results as trees, two-way tables or Venn diagrams to get conditional proportions, talk about expectation in future experiments, and finally get to probability. Probability manipulations like Bayes’ theorem or P(A,B) = P(A|B)P(B) arrive naturally at the end and then lead on to abstract notions of probability, rather than the other way round. Visual aids are used throughout.

One growth area that wasn’t represented much at ICOTS was interactive didactic graphics in the web browser (e.g. http://www2.le.ac.uk/Members/pl4/interactive-graphs). Some groups have developed Java applets and compiled software, but these suffer in translation onto different platforms, particularly mobile devices. The one group that has a flexible, modern product is the Lock family; more on them later.

Inference by simulation

The GAISE recommendation on introducing inference is a particularly hot topic. The notion is that students can get an intuitive grasp of what is going on with bootstrapping and randomisation tests far more easily than if you ask them to envisage a sampling distribution, arising from an infinite number of identical studies, drawing from a population, where the null hypothesis is true. This makes perfect sense to us teachers who have had years to think about it (and we are the survivors, not representative of the students.) When you pause to reflect that I have just described something that doesn’t exist, arising from a situation that can never happen, drawn from something you can never know, under circumstances that you know are not true, you see how this might not be the simplest mental somersault to ask of your students.
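The simulation version, by contrast, can be said almost entirely in code. Here is a bare-bones R sketch of a randomisation test for a difference in means between two groups, with made-up numbers:

set.seed(42)
outcome <- c(23, 19, 31, 25, 28, 20, 35, 30, 27, 33)   # made-up measurements
group   <- rep(c("A", "B"), each = 5)                   # two groups of five
observed <- mean(outcome[group == "B"]) - mean(outcome[group == "A"])
# if the treatment did nothing, the labels are arbitrary: shuffle them many times
shuffled <- replicate(10000, {
  g <- sample(group)
  mean(outcome[g == "B"]) - mean(outcome[g == "A"])
})
mean(abs(shuffled) >= abs(observed))   # two-sided p-value: how often chance alone does this well

Every step is something the students can watch happening; nothing is infinite or hypothetical except the shuffling.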

A common counter-argument is that simulation is an advanced topic. But this is an accident of history: non-parametrics, randomisation tests and bootstrapping were harder to do before computers, so we had to rely on relatively simple asymptotic formulas. That just isn’t true any more, and it hasn’t been since the advent of the personal computer, which brings home for me the extent of inertia in statistics teaching.

Another argument is that the asymptotics are programmed in the software, so all students have to do is choose the right test and they get an answer. But you could also see this as a weakness; for many years statisticians have worried about software making things “too easy”, and this is exactly what that worry is about, that novices can get all manner of results out, pick an exciting p-value, write it up with some technical-sounding words and get it published. Simulation is a little like a QWERTY keyboard in that it slows you down just enough so you don’t jam the keys (younger readers may have to look this up).

As for bootstrapping, most of us recall thinking it was too good to be true when we first heard about it, and we may fear the same reaction from our students, but that reaction is largely a result of being trained in getting confidence intervals the hard way, by second derivatives of the log-likelihood function. I’ve been telling them about bootstrapping (which is now super-easy in SPSS) since this academic year started, without so much as a flicker of surprise on their faces. A few days after ICOTS, I was having a cappuccino and a macaroon with Brad Efron in Palo Alto (my colleague Gill Mein says I am a terrible name-dropper but I’m just telling it like it is) and I asked him about this reaction. He said that when his 1979 paper came out, everybody said “it shouldn’t have been published because it’s obviously wrong” for a week and then “it shouldn’t have been published because it’s obvious” after that. I think that’s a pretty good sign of its acceptability. I just tell the students we’re doing the next best thing to re-running the experiment many times.
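And the bootstrap itself needs only a few lines. Here is a bare-bones R sketch of a percentile bootstrap confidence interval for a mean, again with made-up numbers (nothing to do with the SPSS route):

set.seed(1979)
x <- c(12, 15, 9, 22, 17, 14, 20, 11, 16, 18)   # a small made-up sample
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))   # resample with replacement
quantile(boot_means, c(0.025, 0.975))           # percentile bootstrap 95% CI for the mean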

After the end of the main conference, I fought off a Conference Cold and went to a workshop on teaching inference by simulation down the road on the Northern Arizona University campus.

Yes, that’s a rack for students’ skateboards. Cool, huh?

This was split into two sessions, one with Beth Chance and Allan Rossman from Cal Poly (https://www.causeweb.org/ – which contains some information on CATALST too), another with some of the Locks (http://lock5stat.com/). Here a classroom full of stats lecturers worked through some of the exercises these simulation evangelists have tested and refined on their own students. One in particular I took away and have used several times since, with my own MRes students, other Kingston Uni Masters students, doctors in the UAE, clinical audit people in Leicester, and probably some others I have unfairly forgotten. It seems to work quite well, and its purpose is to introduce p-values and null hypothesis significance testing.

I take a bag of ‘pedagogical pennies’ and hand them out. Of course the students have their own coins, but this makes it a little more memorable and discourages reserved people from sitting it out. I give them a univariate, one-group scenario that naturally has H0: π = 50%. You might say that ten of your patients have tried both ice packs and hot packs for their knee osteoarthritis, and 8 say they find the ice better. Could that be convincing enough for you to start recommending ice to everyone? (Or, depending on the audience, 8 out of 10 cats prefer Whiskas: https://youtu.be/jC1D_a1S2xs) I point out that the coin is a patient who has no preference (or a cat) and they all toss the coin ten times. Then, on the flipchart, I ask how many got no heads, one head, two heads… and draw a dot plot. We count how many got 0, 1, 2, 8, 9 or 10 heads, and this proportion of the whole class is the approximate p-value. They also get to see a normal-ish sampling distribution emerge on the chart, and the students with weird results (I recently got my first 10/10; she thought it was a trick) can see that this is real life; when they get odd results in research, they just can’t see the other potential outcomes. Hopefully that shows them what the null hypothesis is, and the logic behind all p-values, and the answer is close enough to 0.05 to provoke some discussion.
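If you want to follow up with the computer version, a bare-bones R sketch of the same exercise might look like this, with 10,000 simulated ‘patients with no preference’ standing in for the class:

set.seed(66)
n_sims <- 10000                        # like a class of 10,000
heads <- replicate(n_sims, sum(sample(c(0, 1), 10, replace = TRUE)))   # ten coin tosses each
table(heads)                           # the dot plot, in numbers
mean(heads <= 2 | heads >= 8)          # approximate p-value: results at least as extreme as 8/10
2 * pbinom(2, size = 10, prob = 0.5)   # the exact binomial answer, for comparison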

The fact that simulation gives slightly different answers each time is also quite useful, because you can emphasise that p-values should be a continuous scale of evidence, not dichotomised, and that little tweaks to the analysis can tip a result over into significance when it should really have been left well alone. (Of course, this is a different issue to sampling error, but as an aside it seems to work quite well.) One problem I haven’t worked a way around yet is that, at the end, I tell the students that of course they would really do this on the computer, which would allow them to run it 1,000 times rather than being limited by the class size, and I fear that is a signal for them to forget everything that just happened. The best I can offer right now is to keep reminding them about the coin exercise: ten minutes later, half an hour later, at the end of the day and a week later if possible. I also worry that too much fun means the message is lost. One remedy is a snap question asking them what a p-value is, with some multiple-choice options on a flipchart offering common misunderstandings, and then going through each answer in turn to correct it. This is a tricky subject, so it won’t be instant.

The Locks have a book out, the first to give a comprehensive course in stats with inference-by-simulation at its heart. It’s a great book and I recommend you check it out on their website. They also have some interactive analyses and graphics which allow the student to take one of their datasets (or enter their own!) and run permutation tests and bootstrap confidence intervals. It all runs in the browser, so it will work anywhere.

[Screenshot: StatKey, the Locks’ browser-based tool]

Natural frequencies

David Spiegelhalter spoke on the subject of natural frequencies with some passion. He has been involved in revising the content of the GCSE (age 16) mathematics curriculum in the UK. Not every aspect of the final version was to his taste, but he made some inroads with this one, and was clearly delighted (http://understandinguncertainty.org/using-expected-frequencies-when-teaching-probability).

Some of the classic errors of probability can be addressed this way, without having to introduce algebraic notation. The important feature is that you are always dealing with a number of hypothetical people (or other units of analysis), with various things happening to some of them. It relates directly to a tree layout for probabilities, but without the annoying little fractions. I am also a fan of waffle plots for visualising proportions of a whole with a wide range of values (https://eagereyes.org/blog/2008/engaging-readers-with-square-pie-waffle-charts) and it would be nice to do something with these – perhaps bringing the interactive element in! One downside is that you often have to contrive the numbers to work out ‘nicely’, which stops you responding quickly to students’ “what if” questions.
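For example, the classic screening-test question works out neatly with a crowd of hypothetical people. Here is an R sketch with made-up numbers (not ones from the talk): prevalence 1%, sensitivity 90%, false positive rate 9%.

n <- 10000                                 # imagine 10,000 hypothetical people
with_disease    <- n * 0.01                # 100 of them have the disease
without_disease <- n - with_disease        # 9,900 do not
true_positives  <- with_disease * 0.90     # 90 have the disease and test positive
false_positives <- without_disease * 0.09  # 891 test positive anyway
# of everyone who tests positive, how many actually have the disease?
true_positives / (true_positives + false_positives)   # about 9%

No Bayes’ theorem in sight, and the same counts drop straight into a tree or a two-way table.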

Now for a couple of extras.

Statistical consulting is being used as a learning experience at Pretoria (http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_C213_FLETCHER.pdf) and Truman State University (http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_C195_KIM.pdf), in much the same way that you can get a cheap haircut from trainee hairdressers. It sounds scary and like hard work, but it is a bold and innovative idea, and we know that many students who are serious about using statistics in their careers will have to do this to some extent, so why not give them some experience?

I was really impressed by Esther Isabelle Wilder of CUNY’s project NICHE (http://serc.carleton.edu/NICHE/index.html and http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_7D3_WILDER.pdf), which aims to boost statistical literacy in further and higher education, cutting across specialisms and silos within an institution. It acknowledges that many educators outside statistics have to teach some of it, and that they may be rusty and feel uncomfortable about it, and it provides a safe environment for them to boost their stats skills and share good ideas. This is a very big and real problem and it would be great to see a UK version! Pre- and post-testing among the faculty shows improvement in comprehension, and they have to turn people away each summer because it has become so popular.

Finally, here are a couple of exercises I liked the sound of:

Open three packets of M&Ms, and arrange them by colour. Get students to talk about what they can conclude about the contents of the next pack. (Fundamentalist frequentists might not like this.) This came from Markus Vogel & Andreas Eichler.

Ask students to design and conduct a study with an MP3 player, to try to determine whether the shuffling is random; this was devised by Andy Zieffler and reported by Katie Makar. We know that iPods are in fact not random, because customers initially complained that they were playing two or three songs from the same album together! I can’t vouch for other brands, but Android’s built-in player seems to do truly random things (as of v4.3).


The rise of the imputers (maybe)

I thought this article in BMC Med Res Meth sounded interesting: “The rise of multiple imputation: a review of the reporting and implementation of the method in medical research“. I feel that we should know more about the adoption of analytical techniques, how they spread and are facilitated or blocked, but such reviews are almost unheard of. Unfortunately this one doesn’t answer a whole bunch of questions I have, largely because it just looks at two big-name medical journals in the years after software became widely available. I want to know what happens to get people using a new method (earlier in the history) and what gets printed all the way from top predators to bottom feeders.


Sadly, the crucial Table 4, which lists the technical specifications of the methods used, is totally confusing, and the half-page of minuscule footnotes (tl;dr) doesn’t help. All I can say is that not many of the papers used 100+ imputations. That is an interesting point, because it seems from recent work (a paper by Ian White which I can’t place right now) that Don Rubin’s original experiments with a handful of imputed datasets might not be enough in some circumstances, and with computing power on every desk (it’s not 1977 any more), there’s no reason not to pump it up. Yet the original advice lives on.
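Cranking up the number of imputations costs almost nothing to write, either. Here is a minimal R sketch using the mice package and its bundled nhanes example data; the package, the model and m = 100 are my choices for illustration, not anything from the paper:

library(mice)
# nhanes is a small example dataset shipped with mice, with missing values
imp <- mice(nhanes, m = 100, printFlag = FALSE, seed = 1)   # 100 imputations, not a handful
fit <- with(imp, lm(chl ~ age + bmi))    # fit the analysis model in each imputed dataset
summary(pool(fit))                       # combine the estimates with Rubin's rules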

It would be interesting to look at the tutorial papers, reports and websites aimed at various professions, as these are often badly written by someone who has A Little Knowledge, and to see how they are referenced. I suspect a lot of people are going around citing the original paper or the Little & Rubin book without having read either of them. Then someone should do some interviews with applied researchers who have published papers using MI. There you go, I’ve just given you a PhD topic. Off you go and do the thing.


Bayesian stats in very few words

(but not short enough for a tweet)

I’ve long admired the capacity of Stata developers to encapsulate complex statistical methods in a few plain English words.

Now that their new release (version 14) includes some MCMC methods, they explain the world of Bayesian analysis in the leaflet thus:

Bayesian analysis is a statistical analysis that answers research questions about unknown parameters using probability statements. For example, what is the probability that the average male height is between 70 and 80 inches or that the average female height is between 60 and 70 inches?

That is very impressive. Yes, there’s other stuff you could say, but not without complicating it and discouraging the curious. This captures one of the objectives, the fundamental technical difference, and one of the ways in which the results are interpreted differently.
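To make it concrete: once you have posterior draws from any Bayesian fit, whatever software produced them, a probability statement like that is just a proportion. A tiny R sketch, with faked draws standing in for real MCMC output:

set.seed(14)
# pretend these are MCMC draws of mean male height in inches (faked here for illustration)
posterior_mu <- rnorm(4000, mean = 69.5, sd = 1.2)
mean(posterior_mu > 70 & posterior_mu < 80)   # P(average male height is between 70 and 80)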

I just had to blog that straight away, but there are two related things to look out for if you are crazy about Bayesy:

  1. As you might know, I’ve written a Stata-to-Stan interface. It is called StataStan and it has, to my delight, made me one of the Stan developers. It’s a bit like writing a little rhyme and suddenly being invited on tour with the Wu-Tang Clan. I will be posting a long explanation of it and some examples here very soon.
  2. Rasmus Bååth and I have a plan afoot to promote succinct introductions like this. When we have time (ha!) we will launch that online and hopefully get contributions from a few wise people (proverbial RZAs and GZAs, but perhaps no ODBs).



Complex interventions: MRC guidance on researching the real world

The MRC has had advice on evaluating “complex interventions” since 2000, updated in 2008. By complex interventions, they mean things like encouraging children to walk to school: not complex in the sense of being made up of many parts, but complex in the sense that the way it happens and the effect it has are hard to predict, because of non-linearities, interactions and feedback loops. Complexity is something I have been thinking and reading about a lot recently; it really is unavoidable in most of the work I do (I never do simple RCTs; how boring would it be if your life’s work were comparing drug X to placebo using a t-test?), and although it is supertrendy and a lot of nonsense is said about it, there is some wisdom out there too. However, I always found the 2000/2008 guidance facile: engage stakeholders, close the loop, take forward best practice. You know you’re not in for a treat when you see a diagram like this:

[Flowchart from the 2000/2008 guidance]

Now, there is a new guidance document out that gets into the practical details and the philosophical underpinnings at the same time: wonderful! There’s a neat summary in the BMJ.

What I particularly like about this, and why it should be widely read, is that it urges all of us researchers to be explicit a priori about our beliefs and mental causal models. You can’t measure everything in a complex system, so you have to reduce it to the stuff you think matters, and you’d better be able to justify or at least be clear about that reduction. It acknowledges the role that context plays in affecting the results observed and also the inferences you choose to make. And it stresses that the only decent way of finding out what’s going on is to do both quantitative and qualitative data collection. That last part is interesting because it argues against the current fashion for gleeful retrospective analysis of big data. Without talking to people who were there, you know nothing.

My social worker colleague Rick Hood and I are putting together a paper on this subject of inference in complex systems. First I’ll be talking about it in Rome at IWcee (do come! Rome is lovely in May), picking up ideas from economists, and then we’ll write it up over the summer. I’ll keep you posted.


A thinking exercise to teach about cryptic multiplicity

It’s Pi Day, and yesterday I saw a tweet from Mathematics Mastery, my sister-in-law’s brain child, which pointed out that the digit zero does not occur in the first 31 digits of pi. I wondered “what are the chances of that?” and then realised it was a fine example to get students of statistics to think through. Not because the probability is difficult to work out, but because the model and assumptions are not obvious. Pi is a transcendental number, meaning that it was discovered by Walt Whitman, or something like that. All the symbols 0-9 appear without any pattern, so the chance that a particular digit is a particular symbol is 0.1. The chance that any one digit is not “0” is 0.9, and the 31 digits are independent and identically distributed, so that comes to 0.9^31, about 3.8%. But you’d be just as surprised to find that “3” does not appear. Or “8”. There was nothing special a priori about “0”. Students will hopefully spot this if you have shown them real-life examples like “women wear red or pink shirts when they ovulate“. (Your alarm bells might start going off here, detecting an approaching rant about the philosophy of inference, but relax. I’m giving you a day off.) So we crunch through some nasty probability theory (if you’ve taught them that sort of stuff) and get the chance of one or more symbols being completely absent at just over 38%. Then you can subtract some unpleasant multiple absences and get back to about 34%, or just simulate it:

iter <- 1000000                       # number of simulated strings of 31 random digits
pb <- txtProgressBar(min=1, max=iter, style=3)
count <- matrix(NA, iter, 10)         # count of each symbol 0-9 within each string
for (i in 1:iter) {
  setTxtProgressBar(pb, i)
  x <- floor(9.99999*runif(31))       # 31 (near-enough) uniform random digits 0-9
  for (j in 1:10) {
    count[i, j] <- sum(x == (j-1))
  }
}
close(pb)
noneofone <- apply(count == 0, 1, sum)   # how many of the ten symbols are absent from each string
table(noneofone)                         # the proportion with noneofone > 0 is the answer
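If you prefer the exact answer, inclusion-exclusion gives it in one line, and it should agree with the simulation to within Monte Carlo error:

# P(at least one of the ten symbols is absent from 31 random digits)
sum(sapply(1:10, function(k) (-1)^(k+1) * choose(10, k) * ((10-k)/10)^31))   # about 0.34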

But there’s another issue, and I hope that someone in a class would come up with it. Why 31? That’s just because the 32nd was the first “0”. So isn’t that also capitalising on chance? Yes, I think it is. It is an exploratory look-see analysis that suddenly turned into a hard-nosed let’s-prove-something analysis because we got excited about a pattern. What we really need to examine is the chance of coming up with an n-free run of length 31 or greater, where n is any of the ten symbols we call numbers. This is starting to sound more like a hypothesis test now, and you can get students to work with a negative binomial distribution to get it, but the important message is not how to do this particular example, nor that coincidences, being ill-defined a priori, happen a lot (though that’s important too: “million-to-one chances crop up nine times out of ten”, wrote Terry Pratchett), but rather that our belief about the data-generating process determines how we analyse, and it is vital to stop and think about where the data came from and why we believe that particular mental/causal model before diving into the eureka stuff.


Seminar at Kingston University, Wednesday 18 March 2015, 2:15 pm

Come and hear me talk about emerging work with Bayesian latent variable models and SEMs. Email Luluwah al-Fagih if you want to attend: L.Al-Fagih@kingston.ac.uk

Applying Bayesian latent variable models to imperfect healthcare data

Abstract: Analysis of routinely collected or observational healthcare data is increasingly popular but troubled by poor data quality from a number of sources: human error, coding habits, and missing and coarse data all play a part. I will describe the application of Bayesian latent variable models, in various forms, to tackle issues like these in four projects: a pilot clinical trial in stroke rehabilitation, a meta-analysis including mean differences in depression scores alongside odds of reaching a threshold, an exploratory study of predictors of ocular tuberculosis, and an observational study of the timing of imaging in the initial treatment of major trauma patients. The motivation for the Bayesian approach is the ease of flexible modelling, and I will explain the choices of software and algorithms currently available. Using latent variables allows us to draw inferences based on the unknown true values that are not available to us, while explicitly taking all sources of uncertainty into account.


Extract data from Dementia Care Mapping spreadsheets into Stata

I’ve recently been involved in three projects using the Dementia Care Mapping data collection tool. This is a neat way of getting really detailed longitudinal observations on the activity, mood, interactions and wellbeing of people with dementia in group settings. The makers of the DCM provide a spreadsheet which looks like this:

Copied from the Bradford DCM manual. These are not real data (I hope!)

that is, there is a row of BCC behaviour codes and a row of ME quality-of-life codes for each person, typically captured every five minutes. Date, time and other stuff that might be important float above the main body of the data. Subsequent worksheets provide descriptive tables and graphs, but we will ignore those, as we want to extract the data into Stata for more detailed analysis. (But let me, in passing, point out this work in Bradford, which is moving towards electronic collection.)

The good news is that Stata versions 12 and 13 have improved import commands, with specific support for Excel spreadsheets. You can specify that you want only particular cells, which allows us to pull in the stuff at the top, like date and time, and then go back and get the bulk of the data. Here’s my code:

// first, get the visit date and time from the header cells, store in macros
import excel "`xlfile'", sheet("Raw data") ///
    cellrange(A2:S2) clear
local visitdate=D[1] // date format
local visittime=I[1] // string
// now get the DCM data itself
clear
import excel "`xlfile'", sheet("Raw data") ///
    cellrange(A4:EP28) clear
rename A ID
rename B measure
// rename each time-point column to thhmm, using the time text in its first row;
// non-string columns (e.g. empty ones) get dropped
foreach v of varlist C-EP {
    capture confirm string variable `v'
    if !_rc {
        local temp=`v'[1]
        local hr=substr("`temp'",1,2)
        local mn=substr("`temp'",4,2)
        rename `v' t`hr'`mn'
    }
    else {
        drop `v'
    }
}

In fact, I want to show you a bigger chunk of it, because in the most recent project there were several care homes, each was visited several times, and within each visit there was at least a before, during, and after spreadsheet. Thankfully, my colleague who did the collection had filed everything away very consistently, so the file and folder structure was predictable and I could capitalise on that.

/* Data extraction from multiple Dementia Care Mapping
   spreadsheets in multiple folders.
   NB the shell command is for DOS. If using
   another operating system, it needs to change. */

clear all
global loc "C:\Users\RLGrant\Desktop\Dementia Care Mapping"
global homes "A B C D E F"

// loop 1: over homes
local loopcount=1
foreach home of global homes {
    local folder2="${loc}\DCM_`home'"
    shell dir "`folder2'" > "`folder2'\folders.txt" /B

    // loop 2: over visits
    tempname fh2
    local linenum2 = 0
    file open `fh2' using "`folder2'\folders.txt", read
    file read `fh2' nextfolder
    while r(eof)==0 {
        if "`nextfolder'"!="folders.txt" {
            local visit = "`folder2'"+"\"+"`nextfolder'"
            local ++linenum2
            // save text list of files
            shell dir "`visit'" > "`visit'\files.txt" /B

            // loop 3: over files
            tempname fh
            local linenum = 0
            file open `fh' using "`visit'\files.txt", read
            file read `fh' nextfile
            while r(eof)==0 {
                if "`nextfile'"!="files.txt" {
                    local xlfile = "`visit'"+"\"+"`nextfile'"
                    dis as result "Now reading `xlfile'"
                    local ++linenum
                    // first, get date from cell, store in macro
                    import excel "`xlfile'", sheet("Raw data") ///
                        cellrange(A2:S2) clear
                    local visitdate=D[1] // date format
                    local visittime=I[1] // string
                    // now get DCM data
                    clear
                    import excel "`xlfile'", sheet("Raw data") ///
                        cellrange(A4:EP28) clear
                    rename A ID
                    rename B measure
                    foreach v of varlist C-EP {
                        capture confirm string variable `v'
                        if !_rc {
                            local temp=`v'[1]
                            local hr=substr("`temp'",1,2)
                            local mn=substr("`temp'",4,2)
                            rename `v' t`hr'`mn'
                        }
                        else {
                            drop `v'
                        }
                    }
                    qui drop in 1
                    qui count
                    local n=r(N)
                    forvalues i=2(2)`n' {
                        qui replace ID=ID[`i'-1] in `i'
                    }
                    qui drop if ID==""
                    qui reshape wide t*, i(ID) j(measure) string
                    gen home="`home'"
                    // add date and times
                    gen visitdate=`visitdate'
                    format visitdate %tdddMonCCYY
                    gen str24 visittime="`visittime'"
                    // add filename
                    gen str40 filename="`nextfile'"
                    order filename home visitdate visittime
                    if `loopcount'==1 {
                        save "${loc}\alldata.dta", replace
                    }
                    else {
                        append using "${loc}\alldata.dta"
                        save "${loc}\alldata.dta", replace
                    }
                    local ++loopcount
                }
                file read `fh' nextfile
            }
            file close `fh'
            // shell del "`visit'\files.txt"
        }
        file read `fh2' nextfolder
    }
    file close `fh2'
    // shell del "`folder2'\folders.txt"
}

The trick here is to get a text file that contains the contents of each folder, read that in and loop through its contents, so it doesn’t really matter what the files are called. In this case, the homes are named A-F, and at the end I do some date and time formatting. With different data, these bits may well have to be tweaked, but I leave them here to inspire you if nothing else. Happy analysing!
