On this week’s podcast, we were lucky to have a live link to Liam Wyatt in Alexandria, Egypt. Liam is a co-host of Wikipedia Weekly and was attending Wikimania 2008. Tom, Mills, and I covered Wikipedia in the very first episode of Digital Campus, and if anything it has become an even hotter topic on campuses since then. Liam gives us a valuable insider’s view of some of the issues Wikipedia and its community are facing, including questions over authority and internationalization. [Subscribe to this podcast.]
This leaves Microsoft’s partner (and our partner on the Zotero project), the Internet Archive, somewhat in the lurch, although Microsoft has done the right thing and removed the contractual restrictions on the books they digitized so they may become part of IA’s fully open collection (as part of the broader Open Content Alliance), which now has about 400,000 volumes. Also still on the playing field is the Universal Digital Library (a/k/a the Million Books Project), which has 1.5 million volumes.
And then there’s Google and its Book Search program. For those keeping score at home, my sources tell me that Google, which coyly likes to say it has digitized “over a million books” so far, has actually finished scanning five million. It will be hard for non-profits like IA to catch up with Google without some game-changing funding or major new partnerships.
Foundations like the Alfred P. Sloan Foundation have generously made substantial (million-dollar) grants to add to the digital public domain. But with the cost of digitizing 10 million pre-1923 books at around $300 million, where might this scale of funds and new partners come from? To whom can the Open Content Alliance turn to replace Microsoft?
Frankly, I’ve never understood why institutions such as Harvard, Yale, and Princeton haven’t made a substantial commitment to a project like OCA. Each of these universities has seen its endowment grow into the tens of billions in the last decade, and each has the means and (upon reflection) the motive to do a mass book digitization project of Google’s scale. $300 million sounds like a lot, but it’s less than 1% of Harvard’s endowment and my guess is that the amount is considerably less than all three universities are spending to build and fund laboratories for cutting-edge sciences like genomics. And a 10 million public-domain book digitization project is just the kind of outrageously grand project HYP should be doing, especially if they value the humanities as much as the sciences.
Moreover, Harvard, Yale, and Princeton find themselves under enormous pressure to spend more of their endowments for a variety of purposes, including tuition remission and the public good. (Full and rather vain disclosure: I have some relationship to all three institutions; I complain because I love.) Congress might even get into the act, mandating that universities like HYP spend a more generous minimum percentage of their endowments every year, just as it already does for private foundations, which benefit (as does HYP, though in an indirect way) from the federal tax code.
In one stroke HYP could create enormous good will with a moon-shot program to rival Google’s: free books for the world. (HYP: note the generous reaction to, and the great press for, MIT’s OpenCourseWare program.) And beyond access, the project could enable new forms of scholarship through computational access to a massive corpus of full texts.
Alas, Harvard and Princeton partnered with Google long ago. Princeton has committed to digitizing about one million volumes with Google; Harvard’s number is unclear, but probably smaller. The terms of the agreement with Google are non-exclusive; Harvard and Princeton could initiate their own digitization projects or form other partnerships. But I suspect that would be politically difficult since the two universities are getting free digitization services from Google and would have to explain to their overseers why they want to replace free with very expensive. (The answer sounds like Abbott and Costello: the free program produces something that’s not free, while the expensive one is free.)
If Google didn’t exist, Harvard would probably be the most obvious candidate to pull off the Great Digitization of Widener. Not only does it have the largest endowment; historian Robert Darnton, a leader in thinking about the future (and the past) of the book, is now the director of the Harvard library system. Harvard also recently passed an open access mandate for the publications of its faculty.
Princeton has the highest per-student endowment of any university, and could easily undertake a mass digitization project of this scale. Perhaps some of the many Princeton alumni who went on to vast riches on the Web, such as eBay’s Meg Whitman (who has already given $100 million to Princeton) or Amazon’s Jeff Bezos, could pitch in.
But Harvard’s and Princeton’s Google “non-exclusive” partnership makes these outcomes unlikely, as does the general resistance in these universities to spending science-scale funds outside of the sciences (unless it’s for a building).
That leaves Yale. Yale chose Microsoft last year to do its digitization, and has now been abandoned right in the middle of its project. Since Microsoft is apparently leaving its equipment and workflow in place at partner institutions, Yale could probably pick up the pieces with an injection of funding from its endowment or from targeted alumni gifts. Yale just spent an enormous amount of money on a new campus for the sciences, and this project could be seen as a counterbalance for the humanities.
Or, HYP could band together and put in a mere $100 million each to get the job done.
Is this likely to happen? Of course not. HYP and other wealthy institutions are being asked to spend their prodigious endowments on many other things, and are reluctant to up their spending rate at all. But I believe a HYP or HYP-like solution is much more likely than the kind of public funding a project like this would otherwise require, of the sort the Human Genome Project received.
On the first podcast of our second year of the Digital Campus podcast, we discuss some of the legal constraints and threats that academic content providers and digital tool builders face—namely, an increasingly confusing and nightmarish patchwork of regulations from copyright to patents. We talk about the ways in which we have tried to pursue fair use and new technology without getting sued. In the news roundup we cover the launch of offline Google Docs and Internet safety classes for kids. [Subscribe to this podcast.]
For years on this blog, at conferences, and even in direct conversations with Google employees I have been agitating for an API (application programming interface) for Google Book Search. (For a summary of my thoughts on the matter, see my imaginatively titled post, “Why Google Books Should Have an API.”) With the world’s largest collection of scanned books, I thought such an API would have major implications for doing research in the humanities. And I looked forward to building applications on top of the API, as I had done with my Syllabus Finder.
My suspicion began with the name of the API itself. Even though the URL for the API is http://code.google.com/apis/books/, suggesting that this is the long-awaited API for full access to Google Books, the rather prosaic and awkward title of the API suggests otherwise: The Google Book Search Book Viewability API. From the API’s home page:
The Google Book Search Book Viewability API enables developers to:
- Link to Books in Google Book Search using ISBNs, LCCNs, and OCLC numbers
- Know whether Google Book Search has a specific title and what the viewability of that title is
- Generate links to a thumbnail of the cover of a book
- Generate links to an informational page about a book
- Generate links to a preview of a book
These are remarkably modest goals. Certainly the API will be helpful for online library catalogs and other book services (such as LibraryThing) that wish to embed links to Google’s landing pages for books and (when copyright law allows) links to the full texts. The thumbnails of book covers will make OPACs look prettier.
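To see how modest the API is, it helps to look at what a call actually returns. The sketch below parses a JSON-P response of the general shape the Viewability API produces; the sample payload and book ID are invented for illustration, not a live result from Google.

```python
import json
import re

# An illustrative JSON-P response of the shape the Viewability API returns.
# (The endpoint takes an ISBN, LCCN, or OCLC number plus a callback name;
# this sample payload is invented for the sketch, not fetched from Google.)
sample = (
    'handle({"ISBN:0451526538": {'
    '"bib_key": "ISBN:0451526538",'
    '"info_url": "http://books.google.com/books?id=XV8XAAAAYAAJ",'
    '"preview_url": "http://books.google.com/books?id=XV8XAAAAYAAJ&printsec=frontcover",'
    '"thumbnail_url": "http://books.google.com/books?id=XV8XAAAAYAAJ&printsec=frontcover&img=1",'
    '"preview": "full"}});'
)

def parse_viewability(jsonp):
    """Strip the JSON-P callback wrapper and return the inner dict."""
    body = re.match(r'^[\w$.]+\((.*)\);?\s*$', jsonp, re.S).group(1)
    return json.loads(body)

info = parse_viewability(sample)["ISBN:0451526538"]
print(info["preview"])  # viewability status, e.g. "full", "partial", or "noview"
```

Note what the response contains: links and a viewability flag, nothing more. There is no field anywhere for the text of the book itself.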
But this API does nothing to advance the kind of digital scholarship I have advocated for in this space. To do that, the API would have to provide direct access to the full OCRed text of the books, along with the ability to mine those texts for patterns and to combine them with other digital tools and corpora. Undoubtedly copyright concerns are part of the story here, hobbling what Google can do. But why not give full access to pre-1923 books through the API?
I’m not hopeful that there are additional Google Book Search APIs coming. If that were the case, the URL for the viewability API would be http://code.google.com/apis/books/viewability/. The result is that this API simply seems like a way to drive traffic to Google Books, rather than to help academia or to foster an external community of developers, as other Google APIs have done.
As predicted in this space six months ago, Google has added the ability for users to report missing or poorly scanned pages in its Book Search. (From my post “Google Books: Champagne or Sour Grapes?”: “Just as they have recently added commentary to Google News, they could have users flag problematic pages.”)
I’ll say it again: criticism of Google Book Search that focuses on quality chases a red herring—something that Google can easily fix. Let’s focus instead on more substantive issues, such as the fact that Google’s book archive is not truly open.
An abundance of writing on Google Books this week. First, Paul Courant, the University Librarian and Dean of Libraries at the University of Michigan, has a new blog that begins with a candid assessment of what it’s like “being in bed with Google.” Google antagonist Siva Vaidhyanathan provides an immediate response and some good, as-yet-unanswered questions on his new Googlization of Everything blog. (Picky criticism to go along with the praise for Siva: if one of your main arguments is that Google is “flagrantly violating copyright,” it’s probably not a good idea to do the same thing on your blog by frequently reproducing copyrighted articles.)
Meanwhile, I think the best assessment of Google and Google Books comes this week from Danny Sullivan at Search Engine Land: “Google: As Open As It Wants To Be (i.e., When It’s Convenient).” Sullivan writes, “There’s probably no deeper example of Google being closed than when it comes to book search…if Google’s on an ‘open’ kick [with OpenSocial and the Open Handset Alliance], why not join the Open Content Alliance?” As I’ve noted in this space, openness is the preeminent question about Google Books, rather than questions of scan or search quality (which can be improved).
Here’s a possible conundrum worthy of the New York Times’s ethicist, Randy Cohen (no relation to yours truly). I have been a major proponent of reCAPTCHA, the red and yellow box at the bottom of my blog posts that uses words from books scanned by the Internet Archive/Open Content Alliance as a system to prevent comment spam. At the same time visitors decipher the words in that box to add a comment, they help to turn old texts into accurate, useful transcriptions. My glee about killing two birds with one stone has soured a bit after discovering something unsettling: I still get comment spam on my blog, and a lot of it–thousands and thousands of bogus comments.
My investigation of these comments–checking IP addresses, looking at patterns of posting and the links therein, and reading discussions of how solid reCAPTCHA’s technology is (e.g., it doesn’t seem susceptible to a “relay attack,” where a puzzle is redirected by the spammer to an unsuspecting person logging onto another site)–leads me to the depressing conclusion that these comments are not produced by bots or unwitting third parties. Rather, they are added by hand, one at a time, intentionally. Real human beings are figuring out the blurry words from those old books to insert vaguely plausible comments (“Nice post! Check out my site for more on the same topic.”).
I suppose it’s good news that the spammers are being used as human OCR. By my calculations they’ve decoded, word by word, about 50 pages of text on my blog alone. (Real commenters have transcribed about half a page.) But I suspect–and would be happy to be proven wrong in real comments, below–that many of the actual people solving the reCAPTCHA are being paid pennies an hour by spam overlords to boost the Google rankings of their clients by adding keyword-rich linked comments to sites with high PageRank.
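The 50-page figure is a back-of-envelope estimate, and it is easy to reproduce. The per-comment and per-page numbers below are assumptions chosen for illustration (reCAPTCHA shows two words per challenge, and a typical book page runs a few hundred words), not figures from my logs.

```python
# Back-of-envelope check of the "50 pages decoded" estimate.
# All three inputs are illustrative assumptions, not measured values.
words_per_captcha = 2    # reCAPTCHA presents two words per challenge
spam_comments = 7500     # assumed count of hand-solved spam comments
words_per_page = 300     # assumed words on a typical book page

words_transcribed = spam_comments * words_per_captcha
pages_transcribed = words_transcribed / words_per_page
print(pages_transcribed)  # → 50.0
```

A few thousand hand-solved comments at two words each is all it takes to reach dozens of pages, which matches the scale of spam I am seeing.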
So in a sense, reCAPTCHA leads to a kind of indirect outsourcing similar to sending a book to be “rekeyed” by low-paid, third-world typists.
The September 2007 issue of the American Historical Association’s Perspectives is now available online, and it is worth reading Rob Townsend’s article “Google Books: Is It Good for History?” The article is an update of Rob’s much-debated post on the AHA blog in May, and I believe this revised version now reads as the best succinct critique of Google Books available (at least from the perspective of scholars). Rob finds fault with Google’s poor scans, frequently incorrect metadata, and too-narrow interpretation of the public domain.
Regular readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he says.
[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]
My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?
Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).
But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited–I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.
How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.
Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
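The analytical step in that workflow can be sketched concretely. The snippet below runs the kind of proximity search I describe over a stand-in corpus: in practice each text would come from a full-text call such as the hypothetical get_cache mentioned above, but here two invented snippets, keyed by publication year, serve as the corpus, and the word lists are deliberately tiny.

```python
import re

# Stand-in for an ad hoc corpus of full OCRed texts keyed by year.
# Both snippets are invented for this sketch.
corpus = {
    1810: "The truths of geometry reveal the divine mind of the Creator.",
    1890: "The theorems of geometry follow from the axioms alone.",
}

# Religious language within five words of a mathematical term, in either order.
religious = r"(?:divine|God|Creator|providence|sacred)"
mathematical = r"(?:geometry|mathematics|theorem|axiom)"
near = re.compile(
    rf"{religious}\W+(?:\w+\W+){{0,5}}{mathematical}"
    rf"|{mathematical}\W+(?:\w+\W+){{0,5}}{religious}",
    re.IGNORECASE,
)

# Count religious-mathematical co-occurrences per year of publication.
hits = {year: len(near.findall(text)) for year, text in corpus.items()}
print(hits)
```

Aggregated over thousands of books rather than two sentences, counts like these, plotted by decade, are exactly the evidence for (or against) the secularization thesis that an API with full-text access would make possible.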
This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).
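That last extraction step is easy to illustrate. The pattern and sample syllabus below are invented for the sketch, not the Syllabus Finder’s own code, but they show how a regular expression can pull "Author, Title (Publisher, Year)" assignments out of fetched page text.

```python
import re

# Invented sample of the kind of text a fetched syllabus page contains.
syllabus = """
Week 3: Industrialization
Required reading: Eric Hobsbawm, The Age of Capital (Vintage, 1996)
Week 4: Empire
Required reading: Edward Said, Orientalism (Pantheon, 1978)
"""

# Match "Author Name, Title (Publisher, Year)" citations.
assignment = re.compile(
    r"(?P<author>[A-Z][\w.]+(?:\s[A-Z][\w.]+)+),\s+"  # capitalized author name
    r"(?P<title>[A-Z][^(\n]+?)\s+"                    # book title
    r"\((?P<publisher>[^,]+),\s*(?P<year>\d{4})\)"    # (Publisher, Year)
)

books = [(m["author"], m["title"], m["year"]) for m in assignment.finditer(syllabus)]
for author, title, year in books:
    print(f"{author}: {title} ({year})")
```

A real extractor needs many more patterns than this one (citation formats on syllabi are wildly inconsistent), but the principle is the same: once you have the full text, ordinary regular expressions get you surprisingly far.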
It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.
[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]
Over at the O’Reilly Radar, Peter Brantley reprints an interesting debate between Paul Duguid, author of the much-discussed recent article about the quality of Google Books, and Patrick Leary, author of “Googling the Victorians.” I’m sticking with my original negative opinion of the article, an opinion Leary completely shares.