Google Books isn’t just an e-book store. It’s a pile of data, waiting to be mined. And while the metadata on many of the books in Google’s database may not be in the best of shape, enough books have good metadata that they can be used for some fairly interesting projects.
Ars Technica has the story on one of these. A group of Harvard researchers created a tool that could be used to trace the usage of words or phrases in books over the last few centuries. And what’s more, Google has made the tool publicly available via a web interface.
You can go to the site, type in words or phrases (several at a time, if you like) and trace their popularity over time and in comparison to each other. It’s a fascinating way to spend an hour or so.
The tool isn’t perfect. For one thing, it’s case-sensitive, and there’s no way to combine queries: I can chart “Urban Fantasy” and “urban fantasy” as separate lines on the same chart, but I can’t merge the two terms into a single line. It also seems incapable of differentiating between whole words and parts of words: when I query on “ebook” or “e-book”, I get such a large number of results across the last couple of centuries that I suspect it’s also counting uses of the word “notebook”.
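To make those two complaints concrete, here is a minimal Python sketch of what I would like the tool to do. It is not how Google's tool works internally (I have no way of knowing that); the function names and the per-year counts are made up purely for illustration. The first function folds case variants of a phrase into one series, and the other two show how whole-word matching differs from the substring matching I suspect is happening.

```python
import re
from collections import Counter

# Hypothetical per-year counts for two case variants (invented numbers).
counts_by_variant = {
    "Urban Fantasy": {2000: 3, 2005: 10},
    "urban fantasy": {2000: 5, 2005: 12, 2008: 20},
}

def merge_case_variants(counts_by_variant):
    """Sum the yearly counts of phrases that differ only by case."""
    merged = {}
    for phrase, yearly in counts_by_variant.items():
        key = phrase.lower()
        # Counter.update adds counts rather than overwriting them.
        merged.setdefault(key, Counter()).update(yearly)
    return {key: dict(total) for key, total in merged.items()}

def count_whole_word(word, text):
    """Count occurrences of word bounded by non-word characters."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def count_substring(word, text):
    """Count occurrences of word anywhere, even inside longer words."""
    return text.lower().count(word.lower())

sample = "She kept a notebook and an ebook."
print(merge_case_variants(counts_by_variant))
print(count_whole_word("ebook", sample), count_substring("ebook", sample))
```

On the sample sentence, the whole-word count finds only the standalone "ebook", while the substring count also picks up the one hiding inside "notebook".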
Another oddity: when I search on “cyberpunk” or “Geek Squad”, the chart shows a small number of uses for 1900-1910, which makes me wonder whether some of Google’s book metadata is not as good as Google thinks it is.
Following up a search on the “f word” brings to light an interesting shortcoming in Google’s optical character recognition. Investigating a peculiar set of peaks in its usage between about 1630 and 1810 turns up search results revealing that Google has been reading the “long s” used in those days as a lower-case “f”, which leads to all sorts of amusing example sentences.
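For the curious, the long s is its own character (ſ, Unicode U+017F), and the mix-up is easy to mimic. The short Python sketch below is only an illustration with hypothetical function names, not a description of Google's OCR pipeline: one function imitates the misreading by swapping ſ for "f", and the other shows the reading a human would expect, using Unicode NFKC normalization, which maps ſ to an ordinary "s".

```python
import unicodedata

LONG_S = "\u017f"  # the archaic long s character

def simulate_ocr_misread(text):
    """Imitate the error described above: treat every long s as an 'f'."""
    return text.replace(LONG_S, "f")

def normalize_long_s(text):
    """Map the long s to a regular 's' via Unicode compatibility normalization."""
    return unicodedata.normalize("NFKC", text)

word = "Congre" + LONG_S + "s"
print(simulate_ocr_misread(word))
print(normalize_long_s(word))
```

Run on "Congreſs", the first function gives "Congrefs" and the second gives "Congress", which is the whole problem in miniature.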
Still, it’s fairly interesting to look at the usage of words, including dirty ones, to see how often they have appeared in print over time. And further, it’s a great example of the kinds of uses that can come from having so much data together for the first time. With a little more refinement, this class of tool could be extremely valuable to scholarly research—as well as providing amusing ways for laypeople to pass the time.
TeleRead: News and views on e-books, libraries, publishing and related topics