Thanks to R-bloggers, I discovered a thoughtful post by Joel Caldwell on a blog called Engaging Market Research. Not obviously my thang, but then they do a lot of cluster analysis, and every now and then somebody asks me to do that or show them how to do it.
Unlike regressions and other analyses that treat all your data as one lump, clustering requires some careful thought. And careful thought is generally lacking in the world of data analysis. “Just show me which button to push”, I imagine students and clients thinking as they tune out my droning voice talking about assumptions and context.
Firstly, we get a false picture from the classic examples such as Old Faithful:
and Fisher’s irises:
Life isn’t like that. It’s really hard to pull data apart into clusters, and people out there who hire statisticians have been given a false idea that the computer will make sense of everything. The thing is, even though there may not be any evidence to favour, say, four clusters over five when you are looking at a long drawn-out blob, there is often a contextual reason, and that’s good enough! Statistics is supposed to help you make decisions under uncertainty. As long as you don’t start to believe that you have discovered some immutable law of the universe, you’ll be fine.
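If you want to see the long drawn-out blob problem for yourself, here is a toy sketch in Python. It is my own illustration, not anything from Joel's post: the simulated data, the `kmeans` helper and its even-spread starting centres are all assumptions made up for the example, and a real analysis would use a proper library routine. It draws a single stretched Gaussian blob with no true clusters in it, then runs a bare-bones k-means for one to six clusters and prints the within-cluster sum of squares each time.

```python
import random


def kmeans(points, k, iters=50):
    """Bare-bones Lloyd's algorithm on 2-D points.

    Starting centres are picked at evenly spaced positions in the
    x-sorted data (an arbitrary choice for this illustration).
    Returns the final centres and the within-cluster sum of squares.
    """
    def sq_dist(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    ordered = sorted(points)
    n = len(ordered)
    centres = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]

    for _ in range(iters):
        # Assign every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centres[j]))
            groups[nearest].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centres)
        ]

    wss = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, wss


# One elongated blob: wide in x, narrow in y, and no real clusters at all.
rng = random.Random(1)
points = [(rng.gauss(0, 5), rng.gauss(0, 1)) for _ in range(300)]

wss_by_k = {k: kmeans(points, k)[1] for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(f"k = {k}: within-cluster sum of squares = {wss:.1f}")
```

On data like this you would expect the sum of squares to keep shrinking as k goes up, with nothing in the numbers that shouts "stop at four" rather than five. The algorithm will happily slice the blob into as many pieces as you ask for, which is why the choice has to come from the context.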
Here I really like Joel’s explanation:
The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager’s perspective. […] These segments are meaningful only within the context of the […] problem
Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter its structure. Our models ought to be as diverse as the objects we are studying.
This guy has thought long and hard about what he is trying to do, which is the number one skill the data analyst needs.