Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” before 1800 (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike text-mining tools like COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians.)

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear if you move from Ngram Viewer to the main Google Books interface that you’ll get the book scans the data represents. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.
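
To make the first point concrete, here is a minimal sketch in Python of the marriage example: it scans 5-gram rows and tallies, per year, those in which a target word co-occurs with a context word. It assumes a tab-separated row layout of ngram, year, match count, page count, and volume count, which is how I understand the downloadable files to be organized; the sample rows, the numbers in them, and the function name are invented purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed row layout (tab-separated):
#   ngram, year, match_count, page_count, volume_count
# The rows below are made-up illustration data, not real counts.
SAMPLE = (
    "an arranged marriage between the\t1900\t12\t11\t9\n"
    "a loving marriage is the\t1900\t5\t5\t4\n"
    "an arranged marriage between the\t1950\t30\t28\t20\n"
    "the marriage of the two\t1950\t40\t39\t31\n"
)

# Column positions for the three kinds of counts in the assumed layout.
COLUMNS = {"match_count": 2, "page_count": 3, "volume_count": 4}


def cooccurrence_by_year(rows, target, context, column="volume_count"):
    """Sum one count column per year over n-grams containing both words."""
    idx = COLUMNS[column]
    totals = defaultdict(int)
    for row in rows:
        words = row[0].lower().split()
        if target in words and context in words:
            totals[int(row[1])] += int(row[idx])
    return dict(totals)


# QUOTE_NONE keeps quotation marks inside n-grams from confusing the parser.
rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE))

print(cooccurrence_by_year(rows, "marriage", "arranged"))
# → {1900: 9, 1950: 20}
```

Passing `column="match_count"` or `column="page_count"` switches between the three kinds of counts. One caution: adding volume counts across different n-grams can count the same book more than once, so these totals are a rough signal rather than an exact number of works.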

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

Comments

Moacir says:

(too long for a tweet)

Ngrams’s multilingualness is a bit suspect–my very first Russian search yielded texts in Serbian (ok) and Gaelic (?).

At the same time, though, Davies curates (smaller) Spanish and Portuguese corpora in addition to the COHA. Considering that, as far as I can tell, one can’t *search* multilingually on ngrams, there’s no real difference between a drop down menu and clicking on a different part of corpus.byu.edu

So while you’re right to say that the data will improve (regarding my top quibble), saying that COHA is not multilingual but ngrams is seems like an unfair comparison.

mike o'malley says:

Have to admit I’m having a very hard time figuring out how to make it useful. I need to go right to the texts, and I need something more like proximity search. It’s somewhat useful to chart the frequency of the word “slave,” but I already knew it was used a lot and I’m not sure I gain much more by knowing it peaked in 1860. If you enter word pairs it actually gives you the wrong impression–the most interesting stuff happens when words are paired. But it gives you separate lines which only intersect in frequency. In that sense it does more harm than good–it reinforces some kind of odd idea that words bear no relation to other words. (While I don’t know for a fact that there are linguists who believe that, I’d risk a significant sum betting that there are.)

It looks to me like a lot of time and effort spent to do something fairly useless. I’m having a very hard time seeing what I could do with it–maybe someone can show me.

Martin Foys says:

Yes. When this gets linked up to topic modeling, things are going to get interesting.

[…] search which someone much cleverer than I came up with of “beft/best.” (Dan Cohen mentions this example with reference to Danny Sullivan’s post.) That one image confirms what we already know about […]

Shai Ophir says:

It is very good news that Google has started to provide tools and datasets to researchers. I am not sure about the value of this first delivery, but hopefully there is more to come, such as references between keywords and authors.
I have been using Google Books for cross-reference analysis for a while. You can see an example in a short paper I published in The Information Society, Vol 26:2 (“A New Type of Historical Knowledge”). At least I didn’t create a new discipline name for this 🙂
Shai Ophir

[…] occurrence over time. In addition to the tool, Google’s also making the raw data available. Dan Cohen calls the viewer a “gateway drug to the digital humanities,” and I hope that gateway […]

Elena Razlogova says:

I’m skeptical. I tried “informer” and got a huge spike in 1820:

http://ngrams.googlelabs.com/graph?content=informer&year_start=1800&year_end=2000&corpus=0&smoothing=3

Is that because people suddenly decided to inform or because google happened to scan a ton of statutes for that particular year that mentioned informers? I think the latter:

http://www.google.com/search?q=%22informer%22&tbs=bks:1,cdr:1,cd_min:1819,cd_max:1823&lr=lang_en

So far it is more fun to find errors like these than do actual research.

Mike Gushard says:

I am an architectural historian and it would be really useful for me to isolate certain genres like builder’s guides. This would facilitate comparison of observed field data (like the instances of mansard roofs) to written material. There are plenty of research opportunities examining the gaps between the two but as of yet it is quite difficult to quickly examine the corpus of building and home making guides. Even if there weren’t many gaps it’d be useful as a verification of decades of fieldwork.

One day. A boy can dream. In any event, the inclusion of non-architecturally focused material gives you an idea of how large certain ideas loomed in popular conversation, which is also useful.

Random note: I searched “Regan.” Predictably, lots of people were writing about him during his presidency, afterward not so much. Then in 2000 there is a Regan explosion.

I thought it was interesting. Myth making perhaps or maybe the dialogue around all presidents follows this model.

http://bit.ly/g4N8B1
http://bit.ly/ftyTLj

Mike Gushard says:

*Reagan not Regan of course.

This is what I get for trying to tap a reply out on my phone. Damn you auto-complete!

Mark N. says:

The fact that the raw data are available is a significant plus imo. I agree the COHA is quite excellent, and better in many ways, but as far as I can tell it’s a pure web-query interface: the underlying data are not available for download, at least not to the general public.

Dave says:

We created a Facebook page for people to share interesting ngrams. You can check it out here:

http://www.facebook.com/nteresting.ngrams

[…] methodological and epistemological questions to the Socioargu article as well as to those by Dan Cohen [en] and Olivier Ertzscheid, and to the discussion on Language Log […]

[…] the Google Books Ngram Viewer is not just a toy, as Dan Cohen, historian and digital humanities guru, explains. It is kind of fun, though, and also offers a glimpse of the potential for exploration and research […]

[…] Initial Thoughts on the Google Books Ngram Viewer and Datasets, Dan Cohen shares his reflections on how Google’s Ngram Viewer might be useful to humanities […]

[…] fun to play with, but, as I’m sure all my hermeneutically suspicious readers know, there are plenty of objections to taking the findings seriously. The team of non-digital-humanist scientists behind […]

[…] On a related note, Google N-gram viewer lets you look at the frequency of short phrases within 4% of all books published, ever. The excellent paper gives examples of how to use this for cultural research. Dan Cohen has important criticisms. […]

[…] excitement and wariness over the prospects for a humanist mining of the corpus. (See, for example, Dan Cohen and Mike O’Malley for historians who are cautiously optimistic and crankily skeptical […]

[…] data mining, even if the quality of this research leaves much to be desired. Dan Cohen argues that this project may be of great importance for promoting the idea of humanities research […]

[…] (collection) of work (note that the viewer has some major issues, some of which Dan Cohen discusses in this blog post; read more on how it works here). Again, it’s a simple tool: you enter one or more words and […]

[…] One of McNeely’s final thoughts, that academics in the laboratories are beginning to confront “humanities scholars on their own turf,” (273) provides the best example of why humanists must assert their position in the laboratory. McNeely’s suggestion also appears particularly accurate in the light of the “discovery” of “Culturomics.” A good example of the university-industrial complex that dominates many research universities, Harvard and MIT researchers teamed up with the Google Books project to examine an unprecedented amount of written works throughout history. While a useful tool, the, as McNeely puts it, “hubris that comes from transgressing disciplinary boundaries” (273) is painfully clear. Leading humanists from any number of the universities in the Boston area could have been included, but were not. If there had been more humanists, perhaps the academics who “founded” “Culturomics” may have realized this type of research had already been happening for years. […]

[…] be a lot of ambivalence about what data visualization can accomplish. For every response from the Digital Humanities crew on n-grams there is a management seminar on data visualization for “business […]

[…] second article by Dan Cohen, “Initial Thoughts on the GoogleBooks Ngram viewer and datasets” provides more current reflexions on where the potential of data mining currently stands. Google […]

[…] on Ngrams and surveying the Voyeur tools and pondering tag clouds, I think I can safely echo Mike O’Malley’s comments (comment #2) that finding the utility seems rather mystifying. Do I need to know the frequency of […]

[…] the readings on data mining for this week, I got a little sidetracked thinking about Professor Cohen’s analysis that “Digital Humanities needs gateway drugs. Kudos to the pushers on the Google books […]

[…] by Google. Although their assertion that they are breaking new ground is annoying (since it ignores previous work), their approach to things like the appearance and disappearance of fame is interesting and raises […]

[…] The Ngram Viewer uses the raw data (OCR’d text) from the Google Books project and lets the user search for the incidence of particular words in published books over the last couple of centuries. (Try it, it’s fun!) Compared to the controversy surrounding the Google Books project, the debates around the Ngram viewer have been small potatoes, but they sure make interesting reading for DHers. Dan Cohen wrote a great blog post about it. […]

[…] Oct 2006). “Applying Quantitative Analysis to Classic Lit,” Wired, Dec. 2009; Cohen, Google Books, Ngrams and Culturomics; Rob Nelson, Mining the Dispatch. This entry was posted in Announcements, dh2012 by jmcclurken. […]

[…] first time I myself read up on topic modeling was after seeing it referenced in the comments to Dan Cohen's first post about Google Ngrams.) Bookworm is obviously similar to Ngrams: it's designed to keep the Ngrams strategy of […]

[…] more on the advantages and limitations of Google nGrams, please see here and […]

[…] Overall Google Ngram Viewer has a lot to offer historians. It allows them to see patterns or trends in data over a longer period than would be possible if they were researching through traditional methods. It stores a vast amount of data in a small space which can be accessed immediately. Finally, it offers historians a simple and manageable tool in the emerging and sometimes complicated discipline of Digital History, as Dan Cohen discusses here. […]

[…] The site has an Application Programming Interface allowing the user to manipulate the data by entering predictive words and to export information for their own use/research. Under advanced usage there is a guide on how to perform a more in depth search using features like part-of-speech tags and wild card, inflection and case insensitive searches. There are ngram compositions which are operators that can be used to combine ngrams for use in topic modelling and better comparisons and analysis leading to more interesting interpretations. […]

[…] brand-new Google Books Ngram Viewer as a ‘gateway drug’ into the digital humanities (http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/). I’ve been playing around with it recently and I’m […]

S.C. Healy says:

As an MA Student in DH in Ireland, I can see the petty scenario of why anyone would bother to build a tool. As far as culturomics is concerned, I think it was way too presumptuous to label something that only included a fraction of the published world available to the world who had publishing…. I love Ngram Viewer, but I would not support it as a Culturomic evaluation. It certainly provides an indicator, but the corpus is a sampling at the very least. On the fact that someone built a tool to help with this – I would have to say this is pretty brilliant on the corpus they have in front of them…

[…] Cohen, “Initial Thoughts on the Google Books Ngram Viewer and Datasets,” DanCohen.org, 19 September […]
