Looking or testing for selection bias

When analysing any data collection process that was voluntary, it is good practice to look at the responders and anything you know about the non-responders, and check that they are not too different. If they are, it could be a sign of selection bias: the people who took part (or who were invited to take part) were different to those who did not.

This is like the classic Table 1 in a clinical trial paper, which compares the group that got drug A to the group that got drug B: you are hoping to see no differences between them as you look through the stats. Another similarity is that you sometimes see significance tests in there and sometimes don’t. Now, it is well known that it is inappropriate (or, to be more precise, uninformative or irrelevant) to test two randomly allocated groups to see if they are significantly different. The issue is that a hypothesis test sets out to reject, or fail to reject, the null hypothesis that any differences between the groups are purely random. But you know they are random, because that’s how you made them! So the issue is not whether they are random, but rather how big any chance differences are and whether they are so big as to worry you and require an adjusted analysis.
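
To see why the test is uninformative, here is a minimal simulation sketch in Python. It assumes a made-up baseline variable (age, normally distributed) and splits each batch of people into two arms at random; the function names, sample sizes and seed are my own choices for illustration. Because the arms really do differ only by chance, roughly one comparison in twenty should still come out “significant” at the 5% level.

```python
import random


def z_statistic(a, b):
    """Two-sample z statistic for a difference in means (large-sample approximation)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5


def share_flagged(n_trials=2000, n_per_arm=100, seed=1):
    """Fraction of simulated trials in which the two arms 'differ significantly'."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_trials):
        # Hypothetical baseline variable (say, age) drawn from ONE population,
        # then split into two arms purely by chance.
        people = [rng.gauss(50, 10) for _ in range(2 * n_per_arm)]
        rng.shuffle(people)
        arm_a, arm_b = people[:n_per_arm], people[n_per_arm:]
        if abs(z_statistic(arm_a, arm_b)) > 1.96:  # two-sided 5% level
            flagged += 1
    return flagged / n_trials


print(share_flagged())
```

The printed proportion should land close to 0.05. That is the test’s false-alarm rate, and it says nothing about whether any particular imbalance is big enough to matter.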

The corresponding argument when checking for bias is not so clear-cut. After all, you didn’t randomise people to not responding; they chose it for themselves. I think that it’s illogical to do hypothesis tests for bias, for the opposite reason to the clinical trial: if bias exists then the process producing it is by definition not random, and so it is best just to describe any differences and say if they are big enough to worry you. Hypothesis testing would essentially seek to determine whether the bias arose by bad luck or because of some human factor, which is not really the issue here (though it may be of methodological or sociological interest). What matters is that there is some bias, not where it came from.

However, I have to admit that there is one philosophical issue still bugging me here. Did you notice that I sneakily described “bad luck” as something that is “not random” above? Well, this is part of a much bigger issue: what is randomness? If you contend it is the same thing as ignorance (which could loosely be called a Bayesian approach), then people choosing mysteriously not to participate in your survey is in fact random, but then the question arises of whether you should construct a monolithic yes/no hypothesis test for something that is essentially unknowable.

The practical advice is: describe your responders and non-responders, and think about what that tells you about these people. Think hard; statistics help but do not replace your brain. Then justify whether you think they might give you biased answers, and be prepared for someone to disagree with you, because this is a rather subjective question.
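
As a sketch of what “describe” might look like in practice, the snippet below computes a standardised difference (difference in means divided by the pooled standard deviation) for one variable. This is just one common descriptive measure, not the only option, and the ages are invented purely for illustration.

```python
from statistics import mean, variance


def standardised_difference(responders, non_responders):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((variance(responders) + variance(non_responders)) / 2) ** 0.5
    return (mean(responders) - mean(non_responders)) / pooled_sd


# Made-up ages, purely for illustration.
responder_ages = [34, 45, 52, 61, 48, 57, 66, 59]
non_responder_ages = [23, 31, 44, 29, 38, 52, 27, 35]

d = standardised_difference(responder_ages, non_responder_ages)
print(f"Standardised difference in age: {d:.2f}")
```

The number gives you a size to think about rather than a verdict; deciding whether it is big enough to worry about is still up to you.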

