Dan Cohen

Archive for the ‘Google’ Category

Digital Campus #24 - Running from the Law

Tuesday, April 8th, 2008

In the first episode of the second year of the Digital Campus podcast, we discuss some of the legal constraints and threats that academic content providers and digital tool builders face: namely, an increasingly confusing and nightmarish patchwork of regulations from copyright to patents. We talk about the ways in which we have tried to pursue fair use and new technology without getting sued. In the news roundup we cover the launch of offline Google Docs and Internet safety classes for kids.

Still Waiting for a Real Google Book Search API

Monday, March 31st, 2008

For years on this blog, at conferences, and even in direct conversations with Google employees I have been agitating for an API (application programming interface) for Google Book Search. (For a summary of my thoughts on the matter, see my imaginatively titled post, “Why Google Books Should Have an API.”) With the world’s largest collection of scanned books, I thought such an API would have major implications for doing research in the humanities. And I looked forward to building applications on top of the API, as I had done with my Syllabus Finder.

So why was I disappointed when Google finally released an API for their book scanning project a couple of weeks ago?

My suspicion began with the name of the API itself. Even though the URL for the API is http://code.google.com/apis/books/, suggesting that this is the long-awaited API for the kind of access to Google Books that I’ve been waiting for, the rather prosaic and awkward title of the API suggests otherwise: The Google Book Search Book Viewability API. From the API’s home page:

The Google Book Search Book Viewability API enables developers to:

  • Link to Books in Google Book Search using ISBNs, LCCNs, and OCLC numbers
  • Know whether Google Book Search has a specific title and what the viewability of that title is
  • Generate links to a thumbnail of the cover of a book
  • Generate links to an informational page about a book
  • Generate links to a preview of a book

These are remarkably modest goals. Certainly the API will be helpful for online library catalogs and other book services (such as LibraryThing) that wish to embed links to Google’s landing pages for books and (when copyright law allows) links to the full texts. The thumbnails of book covers will make OPACs look prettier.
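To make the modest scope concrete, here is a minimal sketch of what a catalog might do with such an API: build a lookup URL for a book identifier, then read the viewability status out of the response. The endpoint, the parameter names (bibkeys, jscmd, callback), and the response fields are assumptions recalled from the API's documentation and should be checked against it; the sample response is hand-written, not real data.

```python
import json
import re
from urllib.parse import urlencode

# Assumed endpoint; verify against the API documentation before use.
BASE_URL = "http://books.google.com/books"


def viewability_url(bibkeys, callback="handle_books"):
    """Build a lookup URL for keys such as 'ISBN:0451526538'."""
    query = urlencode({
        "bibkeys": ",".join(bibkeys),
        "jscmd": "viewapi",      # assumed parameter name and value
        "callback": callback,    # response is assumed to be JSONP
    })
    return f"{BASE_URL}?{query}"


def parse_jsonp(body, callback="handle_books"):
    """Strip the JavaScript callback wrapper and decode the JSON inside."""
    pattern = rf"\s*{re.escape(callback)}\((.*)\);?\s*"
    match = re.fullmatch(pattern, body, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(match.group(1))


# Hand-written sample in the assumed response shape (not real data).
SAMPLE = '''handle_books({"ISBN:0451526538": {
  "bib_key": "ISBN:0451526538",
  "info_url": "http://example.org/info",
  "preview_url": "http://example.org/preview",
  "thumbnail_url": "http://example.org/thumb.jpg",
  "preview": "noview"}});'''

books = parse_jsonp(SAMPLE)
for key, record in books.items():
    print(key, record["preview"], record["info_url"])
```

Note what comes back in this sketch: links and a viewability flag, and nothing resembling the text of the book itself.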

But this API does nothing to advance the kind of digital scholarship I have advocated for in this space. To do that the API would have to provide direct access to the full OCRed text of the books, to provide the ability to mine these texts for patterns and to combine them with other digital tools and corpora. Undoubtedly copyright concerns are part of the story here, hobbling what Google can do. But why not give full access to pre-1923 books through the API?

I’m not hopeful that there are additional Google Book Search APIs coming. If that were the case the URL for the viewability API would be http://code.google.com/apis/books/viewability/. The result is that this API simply seems like a way to drive traffic to Google Books, rather than to help academia or to foster an external community of developers, as other Google APIs have done.

Google Book Search Begins Adding Quality Control Measures

Wednesday, February 20th, 2008

As predicted in this space six months ago, Google has added the ability for users to report missing or poorly scanned pages in their Book Search. (From my post “Google Books: Champagne or Sour Grapes?”: “Just as they have recently added commentary to Google News, they could have users flag problematic pages.”)

I’ll say it again: criticism of Google Book Search that focuses on quality chases a red herring—something that Google can easily fix. Let’s focus instead on more substantive issues, such as the fact that Google’s book archive is not truly open.

More Perspectives on Google Books

Tuesday, November 6th, 2007

An abundance of writing on Google Books this week. First, Paul Courant, the University Librarian and Dean of Libraries at the University of Michigan, has a new blog that begins with a candid assessment of what it’s like “being in bed with Google.” Google antagonist Siva Vaidhyanathan provides an immediate response and some good, as-yet-unanswered questions on his new Googlization of Everything blog. (Picky criticism to go along with the praise for Siva: if one of your main arguments is that Google is “flagrantly violating copyright,” it’s probably not a good idea to do the same thing on your blog by frequently reproducing copyrighted articles.)

Meanwhile, I think the best assessment of Google and Google Books comes this week from Danny Sullivan at Search Engine Land: “Google: As Open As It Wants To Be (i.e., When It’s Convenient).” Sullivan writes, “There’s probably no deeper example of Google being closed than when it comes to book search…if Google’s on an ‘open’ kick [with OpenSocial and the Open Handset Alliance], why not join the Open Content Alliance?” As I’ve noted in this space, openness is the preeminent question about Google Books, rather than questions of scan or search quality (which can be improved).

A reCAPTCHA Dilemma?

Monday, October 8th, 2007

Here’s a possible conundrum worthy of the New York Times’s ethicist, Randy Cohen (no relation to yours truly). I have been a major proponent of reCAPTCHA, the red and yellow box at the bottom of my blog posts that uses words from books scanned by the Internet Archive/Open Content Alliance as a system to prevent comment spam. At the same time visitors decipher the words in that box to add a comment, they help to turn old texts into accurate, useful transcriptions. My glee about killing two birds with one stone has soured a bit after discovering something unsettling: I still get comment spam on my blog, and a lot of it–thousands and thousands of bogus comments.

My investigation of these comments–checking IP addresses, looking at patterns of posting and the links therein, and other discussions of how solid reCAPTCHA’s technology is (e.g., it doesn’t seem susceptible to a “relay attack,” where a puzzle is redirected by the spammer to an unsuspecting person logging onto another site)–leads me to the depressing conclusion that these comments are not done by bots or unwitting third parties. Rather, they are added by hand, one at a time, intentionally. Real human beings are figuring out the blurry words from those old books to insert vaguely plausible comments (“Nice post! Check out my site for more on the same topic.”).

I suppose it’s good news that the spammers are being used as human OCR. By my calculations they’ve decoded, word by word, about 50 pages of text on my blog alone. (Real commenters have transcribed about half a page.) But I suspect–and would be happy to be proven wrong in real comments, below–that many of the actual people solving the reCAPTCHA are being paid pennies an hour by spam overlords to boost the Google rankings of their clients by adding keyword-rich linked comments to sites with high PageRank.

So in a sense, reCAPTCHA leads to a kind of indirect outsourcing similar to sending a book to be “rekeyed” by low-paid, third-world typists.

Google Books: Is It Good for History?

Wednesday, October 3rd, 2007

The September 2007 issue of the American Historical Association’s Perspectives is now available online, and it is worth reading Rob Townsend’s article “Google Books: Is It Good for History?” The article is an update of Rob’s much-debated post on the AHA blog in May, and I believe this revised version now reads as the best succinct critique of Google Books available (at least from the perspective of scholars). Rob finds fault with Google’s poor scans, frequently incorrect metadata, and too-narrow interpretation of the public domain.

Regular readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he says.

Why Google Books Should Have an API

Tuesday, September 4th, 2007

[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited–I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
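The analysis step described above can be sketched in a few lines, assuming the full texts were already in hand. Everything here is hypothetical: the corpus snippets are invented, the term lists are tiny stand-ins for the many variants of religious and mathematical language, and the eight-word window is an arbitrary choice. It illustrates a proximity search plus a count over time, not a finished method.

```python
import re
from collections import Counter

# Hypothetical ad hoc corpus: (publication year, OCRed text) pairs that a
# full-text API might return. The snippets are invented for illustration.
CORPUS = [
    (1812, "Geometry reveals the divine order that the Creator set in nature."),
    (1834, "The truths of mathematics lead the mind toward God and eternity."),
    (1871, "Algebra is a system of symbols governed by arbitrary rules."),
    (1893, "Geometry rests on axioms chosen for convenience, not necessity."),
]

MATH = r"(?:mathematic\w*|geometr\w*|algebra\w*)"
RELIGION = r"(?:god|divine|creator|providence|eternal\w*)"
GAP = r"(?:\W+\w+){0,8}?\W+"  # up to eight intervening words

# Match a math term near a religious term, in either order.
NEAR = re.compile(
    rf"\b{MATH}\b{GAP}{RELIGION}\b|\b{RELIGION}\b{GAP}{MATH}\b",
    re.IGNORECASE,
)


def decade(year):
    """Round a year down to the start of its decade."""
    return year - year % 10


def religious_math_by_decade(corpus):
    """Per decade, count texts that pair religious and mathematical terms."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        totals[decade(year)] += 1
        if NEAR.search(text):
            hits[decade(year)] += 1
    return {d: (hits[d], totals[d]) for d in sorted(totals)}


for d, (hit, total) in religious_math_by_decade(CORPUS).items():
    print(f"{d}s: {hit} of {total} texts")
```

The point of the sketch is the dependency: none of it can run until something hands over the OCRed text, which is exactly what a corpus chosen in advance or a links-only API cannot do for an arbitrary set of books.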

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).
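As an illustration of that last step, here is a minimal sketch of pulling book assignments out of a syllabus with a regular expression. This is not the Syllabus Finder's actual code: the syllabus excerpt is invented, and the pattern assumes one tidy "Author, Title (Year)" citation per line, which real syllabi rarely offer.

```python
import re

# Invented syllabus excerpt; real syllabi are far messier than this.
SYLLABUS = """\
HIST 101: Victorian Science
Required texts:
Charles Darwin, On the Origin of Species (1859)
Mary Shelley, Frankenstein (1818)
Office hours: Tuesdays (2-4 pm)
"""

# Assumed citation shape, one per line: "Author, Title (Year)".
BOOK = re.compile(
    r"^(?P<author>[A-Z][\w.\- ]+?),\s+(?P<title>.+?)\s+\((?P<year>\d{4})\)\s*$",
    re.MULTILINE,
)


def extract_books(text):
    """Return (author, title, year) tuples for lines that look like citations."""
    return [
        (m["author"], m["title"], int(m["year"]))
        for m in BOOK.finditer(text)
    ]


for author, title, year in extract_books(SYLLABUS):
    print(f"{author}: {title} ({year})")
```

The course title and office-hours lines are skipped because they lack the comma and four-digit year the pattern requires; a production version would need many more patterns to cope with the variety of citation styles.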

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

Debating Paul Duguid’s Google Books Lament

Thursday, August 23rd, 2007

Over at the O’Reilly Radar, Peter Brantley reprints an interesting debate between Paul Duguid, author of the much-discussed recent article about the quality of Google Books, and Patrick Leary, author of “Googling the Victorians.” I’m sticking with my original negative opinion of the article, with which Leary completely agrees.

Google Books: Champagne or Sour Grapes?

Thursday, August 16th, 2007

Is it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books, sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.

This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Laurence Sterne’s absurdist and raunchy comedy Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.

I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.

Of course Google has poor scans—as the saying goes, haste makes waste—but this is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google might have possible remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans that are different than the lo-res ones they display (a point made at the Million Books workshop; they use the originals for OCR), which they can revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ versions.

Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.

Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: How can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you do the calculations of that assessment, you’ll probably see that the perfect library scanning project would take 50 years rather than 5. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and as a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision to make.
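A rough way to see why imperfect OCR can still be good enough for search: a book turns up in results if at least one occurrence of the search term was recognized correctly. The sketch below reads the 90% figure as per-word accuracy and treats OCR errors as independent from one occurrence to the next. Both are simplifying assumptions of this illustration, not claims from the OCR specialists.

```python
def chance_term_is_findable(word_accuracy, occurrences):
    """Probability that at least one occurrence of a term survives OCR.

    Assumes every occurrence is recognized independently with the same
    probability, which is a simplification.
    """
    return 1 - (1 - word_accuracy) ** occurrences


# How quickly the odds improve as a term repeats within a book.
for k in (1, 2, 3, 5):
    print(k, round(chance_term_is_findable(0.90, k), 5))
```

Under those assumptions, a term that appears three times in a book is findable about 99.9% of the time even at 90% accuracy, which is why the last few points of accuracy matter far more to readers of the page than to a search index.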

Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem of Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently they have added the ability to view some books in “plain text” (i.e., the OCRed text, but it’s hard to copy text from multiple pages at once), and even in some cases to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.

An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?

[Image courtesy of Ubisoft.]

Google Books: What’s Not to Like?

Wednesday, May 9th, 2007

The American Historical Association’s Rob Townsend takes some sharp jabs at Google’s ambitious library scanning project. Some of the comments are equally sharp.