I have been writing a book review of Efron & Hastie’s Computer Age Statistical Inference (CASI) for Significance magazine. Here’s a tangential half page I wrote but didn’t include.
Students of statistics or data science will typically encounter some discomfiting jumps in attitude as their course progresses. First, they may have a lot of probability theory and some likelihood-based inference for rather contrived problems, which will remind them of their maths classes at school. Ah, they think, I know how to do this. I learn the tricks to manipulate the symbols and get to the QED. Then they find themselves in a course that provides tables of data and asks them to analyse and interpret. Suddenly it’s become a practical course that connects to the real world and leaves the maths behind for the most part. Now there’s no QED given, and no tricks. The assessments are more like those in humanities subjects: there’s no right or wrong, and it’s the coherence of their argument that matters. Now they have to remember which options to tick in their preferred stats software. They might think: why did we do the mathematical parts of this course at all if we’re not going to use them? Next, for some, come machine learning methods. Now the inference and asymptotic assurances are not just hidden in the cogs of the computer but are actually absent. How do I know the random forest isn’t giving me the wrong answer? You don’t. It seems at first that when the problem gets really hard, like 21st-century-hard, land-a-job-at-Google-hard, we give up on stats, as if it were just an interesting mental exercise from the 1930s, in favour of “unreasonably effective” heuristics and greedy algorithms.
One really nice thing they do in CASI is to emphasise that every estimation method, from the standard deviation of a sample to generalised additive models (GAMs), is an algorithm. The inference (I prefer to say “uncertainty”) for those algorithms follows later in the history of the subject. The 1930s methods have had enough time for their inference to be worked out, but other methods are still developing their inferential procedures. This unifies things rather better, but most teaching has yet to catch up. One problem is that almost all the effort of reformers following George Cobb, Joan Garfield and others has gone into the very early introduction to the subject. That’s probably the right place to fix first, but we now need to broaden out and fix wider data science courses too.