Sunday, January 9, 2011

Pinch me with one of your huge lobster claws, I want to know if I'm dreaming.

Pitfalls of Google Ngram Viewer for etymological research:

  • Optical character recognition errors become more frequent the older the text, as widely noted.
  • The "smoothing" function, which is on by default, can make anomalous spikes look like trends.
  • Google Books contains a number of magazines, and every issue of a magazine is often given the date of that magazine's first-ever issue.
  • Some dates are erroneous in other ways: the periodical Microprocessors and Microsystems is dated 1906. In another case, the assigned date was actually a year mentioned in the title of the work, which suggests that dates are entered manually.
In short, the worst thing to use Ngram for (as currently implemented) is finding dates of first use.  The word "robot," as everyone knows, entered English in 1921 with the play R.U.R., but Ngram would have us believe it enjoyed a bit of use in the 1900s.  This is a combination of bad dates, mis-OCRing of "Robert," "robbery," etc., and the use of the word in other contexts, such as sociological discussions of the use of forced labor in Eastern Europe.
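
To make the smoothing pitfall concrete, here is a minimal sketch in Python. It assumes that smoothing works as a centered moving average, where a smoothing of n averages each year with up to n years on either side; that is my understanding of the Ngram Viewer's setting, not something confirmed here. The frequencies are invented for illustration: a single misdated book in 1905 and nothing else.

```python
def smooth(values, smoothing=3):
    """Centered moving average: each point is averaged with up to
    `smoothing` neighbors on each side (fewer at the edges)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Hypothetical raw data for 1895-1915: one anomalous spike in 1905.
years = list(range(1895, 1916))
raw = [0.0] * len(years)
raw[years.index(1905)] = 7.0

smoothed = smooth(raw, smoothing=3)

for year, r, s in zip(years, raw, smoothed):
    print(year, r, round(s, 2))
```

With these made-up numbers, the one-year spike turns into a flat plateau running from 1902 to 1908, which on a chart looks like seven years of modest, steady use rather than one bad data point. Setting the smoothing to 0 shows the raw spike again.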
