I have been reading a recently published paper by Fagerland and Hosmer on goodness-of-fit tests for proportional odds models. I work with ordinal outcomes quite often, and I use proportional odds models as a simple way of getting the adjusted effect of some risk factor or treatment on that outcome, for example to get differences in graduates’ confidence between ethnic groups, adjusted for other covariates. The model (a generalized linear model with a cumulative logit link function) provides the odds ratio for being in a higher rather than a lower category of the outcome. The assumption of proportionality makes it easy to fit this model in standard software and means the odds ratio is the same wherever you split the scale. Unlike logistic regression for a binary outcome, there is no well-established way to assess the goodness of fit of this model. By goodness of fit (GOF), we mean how closely the model follows the data, which matters because you could be fooled into thinking everything is fine once you get significant p-values in your regression and can’t think of anything else to add to it.
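To make the "same odds ratio wherever you are on the scale" point concrete, here is a minimal sketch in plain Python. It assumes one common parameterisation, logit P(Y ≤ j) = cutpoint_j − η; the cutpoints, the coefficient and the function names are all made up for illustration and do not come from the paper or from any fitted model.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(cutpoints, eta):
    """P(Y = j) for j = 1..m, assuming logit P(Y <= j) = cutpoint_j - eta."""
    cum = [expit(c - eta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def odds_of_exceeding(cutpoints, eta):
    """Odds of Y > j versus Y <= j at each cutpoint j."""
    return [(1.0 - expit(c - eta)) / expit(c - eta) for c in cutpoints]

cutpoints = [-1.0, 0.5, 2.0]  # hypothetical thresholds for a 4-level outcome
beta = 0.7                    # hypothetical log odds ratio for a binary covariate

odds_unexposed = odds_of_exceeding(cutpoints, 0.0)
odds_exposed = odds_of_exceeding(cutpoints, beta)
odds_ratios = [e / u for e, u in zip(odds_exposed, odds_unexposed)]

print(odds_ratios)                      # one odds ratio per cutpoint
print(category_probs(cutpoints, beta))  # the four category probabilities
```

Every entry of `odds_ratios` equals exp(beta), whichever cutpoint you dichotomise at. That is exactly the proportionality assumption, and it is what a GOF test is meant to put under some strain.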
Fagerland and Hosmer are big names in the GOF literature. For logistic regression, the Hosmer-Lemeshow test is a simple way of finding poor fit, although it can be argued that, because absence of evidence is not evidence of absence, a hypothesis test is logically the wrong way to go about assessing GOF. The essence of Hosmer-Lemeshow is that the model gives you a predicted probability of the event for every observation, and you split the observations into ten groups at the deciles of those probabilities. Then you draw up a 10×2 table of observed and expected counts (events and non-events in each group), and compute a chi-squared statistic for how the observed reality differs from the predictions of the model. Generally, you conclude there is a problem if p<0.05, but there is obviously scope for subtle problems to go undetected. The test is a rough rule of thumb and it is easy to understand, which together account for its continued popularity. Fagerland and Hosmer have also recently published a GOF test for multinomial logistic regression in the Stata Journal.
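The decile recipe is short enough to sketch in plain Python. This is a rough illustration rather than a reference implementation: the function names are my own, ties in the predicted probabilities are not treated specially, the data are simulated from the "model" itself, and the p-value uses the usual g − 2 degrees of freedom with a closed form for the chi-squared tail that only holds for even degrees of freedom.

```python
import math
import random

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability, closed form valid for even df only."""
    assert df > 0 and df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df, p_value) for 0/1 outcomes y and fitted probabilities p."""
    order = sorted(range(len(y)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        # Slice the sorted observations into near-equal groups ("deciles of risk").
        idx = order[g * n // groups:(g + 1) * n // groups]
        obs_events = sum(y[i] for i in idx)
        exp_events = sum(p[i] for i in idx)
        obs_non = len(idx) - obs_events
        exp_non = len(idx) - exp_events
        # Pearson contributions from the two cells of this row of the 10x2 table.
        stat += (obs_events - exp_events) ** 2 / exp_events
        stat += (obs_non - exp_non) ** 2 / exp_non
    df = groups - 2
    return stat, df, chi2_sf_even_df(stat, df)

# Simulated data in which the "fitted" probabilities are the true ones.
rng = random.Random(2013)
p = [rng.uniform(0.05, 0.95) for _ in range(500)]
y = [1 if rng.random() < pi else 0 for pi in p]

stat, df, p_value = hosmer_lemeshow(y, p)
print(stat, df, p_value)
```

With real data you would pass in the fitted probabilities from your logistic regression in place of the simulated `p`.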
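The idea behind the proportional odds version, for m categories in the ordinal outcome, is that the ith observation is given a score built from $\hat{\pi}_{ij}$, the predicted probability that observation i falls in category j:

$$ s_i = \hat{\pi}_{i1} + 2\hat{\pi}_{i2} + \dots + m\hat{\pi}_{im} = \sum_{j=1}^{m} j\,\hat{\pi}_{ij} $$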
and then these scores are used in the same way as the predicted probability in Hosmer-Lemeshow. The score comes from an earlier test by Lipsitz et al. (1996). The paper compares the new test with the Lipsitz test and another one by Pulkstenis and Robinson (2004). There doesn’t seem to be much between them, and in terms of power across six different scenarios, it is either Pulkstenis-Robinson or Lipsitz that comes out on top. However, Fagerland and Hosmer contend that this is because their own test has a lower Type I error rate, and somehow they pull off some sleight of hand in the discussion to make their test appear to be the best. This all sounds like nitpicking to me, as all these tests are heuristic and operate in close proximity to each other. In an old-fashioned style, the new paper refers to its own test as Cg, which is scientific code for “please call it Fagerland-Hosmer”. Anyway, it is very useful to have a review of the available tests and to see some figures on their performance under misspecification scenarios. I will use the test in future, but not in isolation, always alongside some simple diagnostic graphs and cross-tabulations.
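For anyone who wants to see the mechanics, here is a rough sketch of the grouping idea in plain Python. It assumes the score is the probability-weighted category number (category 1 gets weight 1, category 2 gets weight 2, and so on), sorts observations by that score, splits them into g groups and compares observed with expected counts in a g×m table. The degrees of freedom, (g − 2)(m − 1) + (m − 2), are what I recall from the paper and should be checked against it. The function names, cutpoints and simulated data are all hypothetical, and this is not the authors' own code.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(cutpoints, eta):
    """P(Y = j) for j = 1..m, assuming logit P(Y <= j) = cutpoint_j - eta."""
    cum = [expit(c - eta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def ordinal_scores(probs):
    """probs[i][j] is the fitted probability that observation i is in category j + 1."""
    return [sum((j + 1) * pij for j, pij in enumerate(row)) for row in probs]

def grouped_ordinal_statistic(y, probs, groups=10):
    """Pearson statistic on a groups x m table; y[i] takes values 1..m."""
    m = len(probs[0])
    scores = ordinal_scores(probs)
    order = sorted(range(len(y)), key=lambda i: scores[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        for j in range(m):
            observed = sum(1 for i in idx if y[i] == j + 1)
            expected = sum(probs[i][j] for i in idx)
            stat += (observed - expected) ** 2 / expected
    # Assumed degrees of freedom; verify against the paper before relying on it.
    df = (groups - 2) * (m - 1) + (m - 2)
    return stat, df

def draw_category(row, u):
    """Turn a uniform draw u into a category 1..m using the probabilities in row."""
    cum = 0.0
    for j, pij in enumerate(row):
        cum += pij
        if u < cum:
            return j + 1
    return len(row)

# Simulated 3-category outcome in which the "fitted" probabilities are the true ones.
rng = random.Random(1996)
cutpoints = [-0.5, 1.0]  # hypothetical thresholds
probs = [category_probs(cutpoints, rng.gauss(0.0, 1.0)) for _ in range(600)]
y = [draw_category(row, rng.random()) for row in probs]

stat, df = grouped_ordinal_statistic(y, probs)
print(stat, df)
```

I have left the p-value out because the plain-Python chi-squared tail is awkward for odd degrees of freedom; in practice you would hand `stat` and `df` to your statistics package, and still look at the table itself rather than only the test.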