Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” in books printed before 1800, which the OCR routinely reads as an “f” (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike text-mining resources such as COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians; see the sketch after this list for one way to pull such co-occurrences out of the raw n-gram files.)

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear that moving from the Ngram Viewer to the main Google Books interface will actually land you on the book scans the data represents. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.
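Here is the sketch promised under point 1: a minimal bit of Python for tallying, year by year, how often one word co-occurs with others inside the raw 5-gram files. It assumes each line of a downloaded file is tab-separated into ngram, year, match count, page count, and volume count (check that layout against the documentation that comes with the files you grab); the file name, the function, and the “marriage” example are my own illustrations, not anything Google ships.

    import sys
    from collections import defaultdict

    def cooccurrence_by_year(path, target, companions):
        """Sum yearly match counts of 5-grams that contain the target
        word together with at least one of the companion words."""
        target = target.lower()
        companions = {w.lower() for w in companions}
        totals = defaultdict(int)
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 3:
                    continue  # skip malformed lines
                ngram, year, match_count = fields[0], fields[1], fields[2]
                # fields[3] and fields[4], where present, would be the
                # page and volume counts mentioned above
                words = ngram.lower().split()
                if target in words and companions.intersection(words):
                    totals[int(year)] += int(match_count)
        return totals

    if __name__ == "__main__":
        # point the script at whichever raw 5-gram file you have downloaded
        counts = cooccurrence_by_year(sys.argv[1], "marriage", ["arranged", "loving"])
        for year, n in sorted(counts.items()):
            print(year, n)

Swapping in the page or volume column instead of the raw match count is a one-line change, which is exactly why the downloadable datasets are more interesting for researchers than the public viewer.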

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

36 thoughts on “Initial Thoughts on the Google Books Ngram Viewer and Datasets”

  1. Moacir

    (too long for a tweet)

    Ngrams’s multilingualness is a bit suspect–my very first Russian search yielded texts in Serbian (ok) and Gaelic (?).

    At the same time, though, Davies curates (smaller) Spanish and Portuguese corpora in addition to the COHA. Considering that, as far as I can tell, one can’t *search* multilingually on ngrams, there’s no real difference between a drop down menu and clicking on a different part of corpus.byu.edu.

    So while you’re right to say that the data will improve (regarding my top quibble), saying that COHA is not multilingual but ngrams is seems like an unfair comparison.

  2. mike o'malley

    Have to admit I’m having a very hard time figuring out how to make it useful. I need to go right to the texts, and I need something more like proximity search. It’s somewhat useful to chart the frequency of the word “slave,” but I already knew it was used a lot and I’m not sure I gain much more by knowing it peaked in 1860. If you enter word pairs it actually gives you the wrong impression–the most interesting stuff happens when words are paired. But it gives you separate lines which only intersect in frequency. In that sense it does more harm than good–it reinforces some kind of odd idea that words bear no relation to other words. (While I don’t know for a fact that there are linguists who believe that, I’d risk a significant sum betting that there are.)

    It looks to me like a lot of time and effort spent to do something fairly useless. I’m having a very hard time seeing what I could do with it–maybe someone can show me.

  3. Pingback: chris forster · Google nGrams: Quick Response to Mike O’Malley

  4. Shai Ophir

    This is very good news, that Google has started to provide tools and datasets to researchers. I am not sure about the value of this first delivery, but hopefully more is to come – such as references between keywords and authors.
    I have been using Google Books for cross-reference analysis for a while. You can see an example in a short paper I published in The Information Society, Vol. 26:2 (“A New Type of Historical Knowledge”). At least I didn’t create a new discipline name for this :-)
    Shai Ophir

  5. Pingback: My 11 Favorite Things from the Interwebz This Year | an/archivista

  6. Elena Razlogova

    I’m skeptical. I tried “informer” and got a huge spike in 1820:

    http://ngrams.googlelabs.com/graph?content=informer&year_start=1800&year_end=2000&corpus=0&smoothing=3

    Is that because people suddenly decided to inform, or because Google happened to scan a ton of statutes for that particular year that mentioned informers? I think the latter:

    http://www.google.com/search?q=%22informer%22&tbs=bks:1,cdr:1,cd_min:1819,cd_max:1823&lr=lang_en

    So far it is more fun to find errors like these than do actual research.

  7. Mike Gushard

    I am an architectural historian and it would be really useful for me to isolate certain genres like builder’s guides. This would facilitate comparison of observed field data (like the instances of mansard roofs) to written material. There are plenty of research opportunities in examining the gaps between the two, but as of yet it is quite difficult to quickly examine the corpus of building and home making guides. Even if there weren’t many gaps it’d be useful as a verification of decades of fieldwork.

    One day. A boy can dream. In any event, the inclusion of non-architecturally focused material gives you an idea of how large certain ideas loomed in popular conversation, which is also useful.

    Random note: I searched “Regan.” Predictably, lots of people were writing about him during his presidency, afterward not so much. Then in 2000 there is a Regan explosion.

    I thought it was interesting. Myth making perhaps or maybe the dialogue around all presidents follows this model.

    http://bit.ly/g4N8B1
    http://bit.ly/ftyTLj

  8. Mark N.

    The fact that the raw data are available is a significant plus imo. I agree the COHA is quite excellent, and better in many ways, but as far as I can tell it’s a pure web-query interface: the underlying data are not available for download, at least not to the general public.

  9. Pingback: L’interprétation des graphiques produits par Ngram Viewer » Article » OWNI, Digital Journalism

  10. Pingback: New toy from Google Labs « MLibrary Chatter: The Shhh! Stops Here

  11. Pingback: Inaugural edition of the Digital Humanities Blog Carnival | nicomachus.net

  12. Pingback: N-Grams « the long nineteenth century

  13. Pingback: Jonathan Stray » A computational journalism reading list

  14. Pingback: jumping on the ngram bandwagon at RMCLAS | parezco y digo

  15. Pingback: Kulturomia i Google Ngram Viewer - historiaimedia.org

  16. Pingback: Week One Lecture, Part Two of Two: Close Readings, Distant Readings, and Everything In-Between « Literature in a Wired World

  17. Pingback: Brian Sarnacki | <!-- History Grad Student -->

  18. Pingback: Data visualization; or the outside corner as fool’s errand | Dylan Mulvin

  19. Pingback: My Sandbox · Digging into Data

  20. Pingback: Data Mining: Instead of Finding the Needle in the Haystack, Realizing that the Hay may be More Interesting! « The Journey to Enlightenment: Making the Leap to the Digital Age

  21. Pingback: Gateway Drugs, Statistical Analysis, and Text Mining | sackerman51

  22. Pingback: Information and Data in the Digital Age | History in the Digital

  23. Pingback: Google and Digital Humanities | THATCamp Ohio State University

  24. Pingback: Critique of Google’s Ngram Viewer: « jhwhistory

  25. Pingback: Hist 696: Digging into Data « Wandering but not Lost

  26. Pingback: Reminder | Adventures in Digital History 3.0

  27. Pingback: shinenkan

  28. Pingback: Sapping Attention: Keeping the words in Topic Models

  29. Pingback: Out of Vogue, Out of Mind? – Old Colonies in the New Empire | Christopher M. Church

  30. Pingback: What is datamining, and does it encourage the creation of a specific kind of history? | chicirafoster

  31. Pingback: Participations » Prendre les procédures au sérieux

  32. Pingback: Google Ngrams Viewer: How good is it really? | UH Digital History Blog

  33. Pingback: How wary do historians have to be when using Ngram Viewer? | sarahburginhistory
