This is pretty innovative: breaking text (in this case, scripts of Mad Men or Downton Abbey) into phrases and looking for over-representation by comparing against a population standard (in this case, Google’s Ngram database). It is all blogged at www.prochronism.com, which I found via www.flowingdata.com.
Ben Schmidt has used R to do the comparison and the graphs. With a huge number of phrases, it is hard to get any clarity where they all jumble together, but the ones that poke out above the cloud are suspects for over-use.
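To make the comparison concrete, here is a minimal sketch in Python (not the R code used on the blog). It assumes a hypothetical `baseline_rate` dictionary standing in for the Ngram data, uses made-up toy numbers, and scores each two-word phrase by the log of its rate in the script divided by its rate in the baseline. The function names are my own.

```python
from collections import Counter
import math


def ngrams(text, n=2):
    """Split text into lowercase word n-grams (naive whitespace tokenizer)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def over_representation(text, baseline_rate, n=2):
    """Return {phrase: (count, log10 of script rate / baseline rate)}.

    baseline_rate is a hypothetical mapping from phrase to its relative
    frequency in the reference corpus. Phrases missing from the baseline
    are skipped rather than guessed at.
    """
    grams = ngrams(text, n)
    total = len(grams)
    counts = Counter(grams)
    scores = {}
    for phrase, count in counts.items():
        rate = baseline_rate.get(phrase)
        if rate:
            scores[phrase] = (count, math.log10((count / total) / rate))
    return scores


# Toy data only: neither the script line nor the rates are real.
script = "i need to call him and i need to go"
baseline = {"need to": 0.001, "i need": 0.01, "to call": 0.02}
scores = over_representation(script, baseline)

# Phrases with the largest log ratio are the ones poking out above the cloud.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
```

Plotting the log ratio against the baseline frequency for every phrase would give the sort of cloud described above, with `ranked[0]` being the phrase furthest above it.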
I wonder what my qualitatively-minded colleagues would make of this approach…
One quibble: see the diagonal line of words at top-left in the Pride And Prejudice example?
Well, that screams out to me, based on years of looking at graphs that went a bit wrong, that those phrases appeared only once, thus forming a boundary to the graph. And if a phrase can get into the top-middle zone, which is supposed to indicate a really important over-represented phrase, by being mentioned only once, then there is a lot of room for uncertainty arising from chance. A funnel plot (one of my favourites) would be useful here.
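A rough sketch of why once-only phrases line up, and of what funnel limits could look like, follows. A phrase seen once in a script with N phrase slots has script rate 1/N, so its over-representation ratio is (1/N) divided by the baseline rate, which on log-log axes is a straight line of slope -1. The Poisson model for chance variation, the 0.001 threshold, and the value of `total_slots` are all my own assumptions for illustration, not anything taken from the original analysis.

```python
import math


def poisson_upper_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu): chance of k or more occurrences."""
    below = sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def funnel_upper_limit(mu, alpha=0.001):
    """Smallest count k with P(X >= k) below alpha when X ~ Poisson(mu)."""
    k = 0
    while poisson_upper_tail(k, mu) >= alpha:
        k += 1
    return k


total_slots = 100_000  # hypothetical number of phrase slots in a script

rows = []
for baseline_rate in (1e-8, 1e-6, 1e-4):
    expected = total_slots * baseline_rate           # count expected by chance
    ratio_if_once = (1 / total_slots) / baseline_rate  # the diagonal boundary
    rows.append({
        "baseline_rate": baseline_rate,
        "ratio_if_seen_once": ratio_if_once,
        "p_at_least_once": poisson_upper_tail(1, expected),
        "count_needed_to_stand_out": funnel_upper_limit(expected),
    })
```

Each entry in `rows` pairs the ratio a single mention would produce with the count needed to cross the funnel limit at that baseline rate. Bear in mind that the chance of any one rare phrase turning up is small, but a script has a great many rare phrases to choose from, so some will appear once by chance alone.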