Graphs that find phrases that don’t fit

This is pretty innovative: breaking text (in this case, scripts of Mad Men or Downton Abbey) into phrases and looking for over-represented ones by comparing how often each phrase appears against a population standard (in this case, Google’s Ngram database). All blogged at which I found via

Ben Schmidt has used R to do the comparison and the graphs. With a huge number of phrases, it is hard to get any clarity where they all jumble together, but the ones that poke out above the cloud are suspects for over-use.
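To make the comparison concrete, here is a minimal sketch in Python (not Schmidt's R code, which I have not reproduced). It assumes you already have phrase counts from a script and reference frequencies per million phrases; the function name, the phrases and all the numbers are made up for illustration. It scores each phrase by the log2 ratio of its frequency in the script to its reference frequency, so positive values mean over-representation.

```python
import math


def overrepresentation(script_counts, reference_per_million):
    """Return log2(script frequency / reference frequency) for each phrase.

    script_counts: phrase -> number of times it appears in the script
    reference_per_million: phrase -> occurrences per million phrases in the
        reference corpus (hypothetical stand-in for an n-gram database)
    Phrases missing from the reference are skipped, because the ratio
    is undefined for them.
    """
    total = sum(script_counts.values())
    scores = {}
    for phrase, count in script_counts.items():
        ref = reference_per_million.get(phrase)
        if ref is None or ref <= 0:
            continue
        script_per_million = count * 1_000_000 / total
        scores[phrase] = math.log2(script_per_million / ref)
    return scores


# Illustrative, invented numbers only.
script_counts = {"i need to": 4, "good morning": 16, "everything else": 980}
reference_per_million = {"i need to": 1000, "good morning": 16000}

scores = overrepresentation(script_counts, reference_per_million)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{phrase}: log2 ratio = {score:.2f}")
```

In this toy data "i need to" is four times as common in the script as in the reference, so it would poke out above the cloud, while "good morning" sits exactly at the reference rate.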

I wonder what my qualitatively-minded colleagues would make of this approach…

One quibble: see the diagonal line of words at top-left in the Pride and Prejudice example?

Well, that screams out to me, based on years of looking at graphs that went a bit wrong, that those phrases appeared only once each, thus forming a boundary to the graph. And if a phrase can reach the top-middle zone, which marks it as a really important over-represented phrase, by being mentioned only once, then there is a lot of room for uncertainty arising from chance. A funnel plot (one of my favourites) would be useful here.
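Here is a rough sketch of what the funnel limits could look like, in Python. The big assumption is mine and not part of the original analysis: if a script used a phrase at exactly the reference rate, the number of times it appears would follow a Poisson distribution whose mean is the expected count (script length times reference frequency). The function names and the expected counts are made up for illustration.

```python
import math


def poisson_upper_limit(expected, alpha=0.025):
    """Smallest count k with P(X <= k) >= 1 - alpha for X ~ Poisson(expected)."""
    if expected <= 0 or not 0 < alpha < 1:
        raise ValueError("expected must be > 0 and alpha must be in (0, 1)")
    k = 0
    pmf = math.exp(-expected)  # P(X = 0)
    cdf = pmf
    while cdf < 1 - alpha:
        k += 1
        pmf *= expected / k  # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


def funnel_upper_ratio(expected, alpha=0.025):
    """Upper funnel limit expressed as observed / expected."""
    return poisson_upper_limit(expected, alpha) / expected


# Illustrative expected counts, from very rare to common phrases.
for expected in (0.1, 1, 10, 100):
    limit = poisson_upper_limit(expected)
    print(f"expected {expected}: upper limit {limit} "
          f"(ratio {funnel_upper_ratio(expected):.2f})")
```

Under that assumption, a phrase expected 0.1 times that appears once has a ratio of 10, yet it only just reaches the upper limit rather than exceeding it. The limits narrow towards a ratio of 1 as the expected count grows, which is the funnel shape, and which is why a single mention is weak evidence of over-use.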

