Dan Cohen

Archive for the ‘Digitization’ Category

Mass Digitization of Books: Exit Microsoft, What Next?

Thursday, May 29th, 2008

So Microsoft has left the business of digitizing millions of books—apparently because they saw it as no business at all.

This leaves Microsoft’s partner (and our partner on the Zotero project), the Internet Archive, somewhat in the lurch, although Microsoft has done the right thing and removed the contractual restrictions on the books they digitized so they may become part of IA’s fully open collection (as part of the broader Open Content Alliance), which now has about 400,000 volumes. Also still on the playing field is the Universal Digital Library (a/k/a the Million Books Project), which has 1.5 million volumes.

And then there’s Google and its Book Search program. For those keeping score at home, my sources tell me that Google, which coyly likes to say it has digitized “over a million books” so far, has actually finished scanning five million. It will be hard for non-profits like IA to catch up with Google without some game-changing funding or major new partnerships.

Foundations like the Alfred P. Sloan Foundation have generously made substantial (million-dollar) grants to add to the digital public domain. But with the cost of digitizing 10 million pre-1923 books at around $300 million, where might this scale of funds and new partners come from? To whom can the Open Content Alliance turn to replace Microsoft?

Frankly, I’ve never understood why institutions such as Harvard, Yale, and Princeton haven’t made a substantial commitment to a project like OCA. Each of these universities has seen its endowment grow into the tens of billions in the last decade, and each has the means and (upon reflection) the motive to do a mass book digitization project of Google’s scale. $300 million sounds like a lot, but it’s less than 1% of Harvard’s endowment and my guess is that the amount is considerably less than all three universities are spending to build and fund laboratories for cutting-edge sciences like genomics. And a 10 million public-domain book digitization project is just the kind of outrageously grand project HYP should be doing, especially if they value the humanities as much as the sciences.

Moreover, Harvard, Yale, and Princeton find themselves under enormous pressure to spend more of their endowment for a variety of purposes, including tuition remission and the public good. (Full and rather vain disclosure: I have some relationship to all three institutions; I complain because I love.) Congress might even get into the act, mandating that universities like HYP spend a more generous minimum percentage of their endowment every year, just like private foundations who benefit (as does HYP, though in an indirect way) from the federal tax code.

In one stroke HYP could create enormous good will with a moon-shot program to rival Google’s: free books for the world. (HYP: note the generous reaction to, and the great press for, MIT’s OpenCourseWare program.) And beyond access, the project could enable new forms of scholarship through computational access to a massive corpora of full texts.

Alas, Harvard and Princeton partnered with Google long ago. Princeton has committed to digitizing about one million volumes with Google; Harvard’s number is unclear, but probably smaller. The terms of the agreement with Google are non-exclusive; Harvard and Princeton could initiate their own digitization projects or form other partnerships. But I suspect that would be politically difficult since the two universities are getting free digitization services from Google and would have to explain to their overseers why they want to replace free with very expensive. (The answer sounds like Abbott and Costello: the free program produces something that’s not free, while the expensive one is free.)

If Google didn’t exist, Harvard would probably be the most obvious candidate to pull off the Great Digitization of Widener. Not only does it have the largest endowment; historian Robert Darnton, a leader in thinking about the future (and the past) of the book, is now the director of the Harvard library system. Harvard also recently passed an open access mandate for the publications of its faculty.

Princeton has the highest per-student endowment of any university, and could easily undertake a mass digitization project of this scale. Perhaps some of the many Princeton alumni who went on to vast riches on the Web, such as EBay’s Meg Whitman (who has already given $100 million to Princeton) or Amazon’s Jeff Bezos, could pitch in.

But Harvard’s and Princeton’s Google “non-exclusive” partnership makes these outcomes unlikely, as does the general resistance in these universities to spending science-scale funds outside of the sciences (unless it’s for a building).

That leaves Yale. Yale chose Microsoft last year to do its digitization, and has now been abandoned right in the middle of its project. Since Microsoft is apparently leaving its equipment and workflow in place at partner institutions, Yale could probably pick up the pieces with an injection of funding from its endowment or from targeted alumni gifts. Yale just spent an enormous amount of money on a new campus for the sciences, and this project could be seen as a counterbalance for the humanities.

Or, HYP could band together and put in a mere $100 million each to get the job done.

Is this likely to happen? Of course not. HYP and other wealthy institutions are being asked to spend their prodigious endowments on many other things, and are reluctant to up their spending rate at all. But I believe a HYP or HYP-like solution is much more likely than public funding for this kind of project, as the Human Genome Project received.

Google Book Search Begins Adding Quality Control Measures

Wednesday, February 20th, 2008

As predicted in this space six months ago, Google has added the ability for users to report missing or poorly scanned pages in their Book Search. (From my post “Google Books: Champagne or Sour Grapes?“: “Just as they have recently added commentary to Google News, they could have users flag problematic pages.”)

I’ll say it again: criticism of Google Book Search that focuses on quality chases a red herring—something that Google can easily fix. Let’s focus instead on more substantive issues, such as the fact that Google’s book archive is not truly open.

Digitization and Repatriation

Friday, December 14th, 2007

Elgin MarblesIt’s always worth listening to Cliff Lynch’s opening talks at the CNI task force meetings, and this week’s meeting in Washington was no exception. (My apologies for not blogging the meeting; busy week.) Like no one else, Cliff has his finger on the pulse of all that is new and important in the world of the digital humanities. Although Cliff discussed some issues that have received a lot of press, such as net neutrality, I found one issue he raised totally unexpected and fascinating.

Cliff noted that digital surrogates for museum objects—that is, digital photographs or 2- or 3-D scans—are becoming so good that for most scholarly and classroom purposes they can replace the originals. For many years, one of the main arguments museums have used to avoid the repatriation of foreign materials—e.g., sculpture or pottery taken during colonization or war—is that they worried about the accessibility and condition of an object if they returned it. Scholars might lose important evidence, museums argued, and researchers often needed to look at the original object for small details like texture or paint color. With advances in digitization, however, this objection no longer holds water, and museums should feel more pressure (or more freedom) to repatriate controversial items in their collections.

[Creative Commons licensed photo of the Elgin Marbles courtesy of zakgallop on Flickr.]

Tony Grafton on Digital Texts and Reading

Monday, November 5th, 2007

Anthony Grafton was the first person to turn me onto intellectual history. His seminar on ideas in the Renaissance was one of the most fascinating courses I took at Princeton, and I still remember well Tony rocking in his seat, looking a bit like a young Karl Marx, making brilliant connections among a broad array of sources.

So it’s not unexpected given his wide-ranging interests but still terrific to see a scholar who has spent so much time with early books thinking deeply about “digitization and its discontents” in his article “Future Reading” in the latest issue of The New Yorker. And it’s even more gratifying to see Tony note in his online companion piece to “Future Reading,” “Adventures in Wonderland,” that “One of the best ways to get a handle on the sprawling world of digital sources is through George Mason University’s Center for History and New Media.”

Google Books: Is It Good for History?

Wednesday, October 3rd, 2007

The September 2007 issue of the American Historical Association’s Perspectives is now available online, and it is worth reading Rob Townsend’s article “Google Books: Is It Good for History?” The article is an update of Rob’s much-debated post on the AHA blog in May, and I believe this revised version now reads as the best succinct critique of Google Books available (at least from the perspective of scholars). Rob finds fault with Google’s poor scans, frequently incorrect metadata, and too-narrow interpretation of the public domain.

Regular readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he says.

Debating Paul Duguid’s Google Books Lament

Thursday, August 23rd, 2007

Over at the O’Reilly Radar, Peter Brantley reprints an interesting debate between Paul Duguid, author of the much-discussed recent article about the quality of Google Books, and Patrick Leary, author of “Googling the Victorians.” I’m sticking with my original negative opinion of the article, which Leary agrees completely with.

Understanding reCAPTCHA

Friday, August 17th, 2007

reCAPTCHAOne of the things I added to this blog when I moved from my own software to WordPress was the red and yellow box in the comments section, which defends this blog against comment spam by asking commenters to decipher a couple of words. Such challenge-response systems are called CAPTCHAs (a tortured and unmellifluous acroynm of “completely automated public Turing test to tell computers and humans apart”). What really caught my imagination about the CAPTCHA I’m using, called reCAPTCHA, is that it uses words from books scanned by the Internet Archive/Open Content Alliance. Thus at the same time commenters solve the word problems they are effectively serving as human OCR machines.

To date, about two million words have been deciphered using reCAPTCHA (see the article in Technology Review lauding reCAPTCHA’s mastermind, Luis von Ahn), which is a great start but by my calculation (100,000 words per average book) only the equivalent of about 20 books. Of course, it’s really much more than that because the words in reCAPTCHA are the hardest ones to decipher by machine and are sprinkled among thousands of books.

Indeed, that is the true genius of reCAPTCHA—it “tells computers and humans apart” by first using OCR software to find words computers can’t decipher, then feeds those words to humans, who can decipher the words (proving themselves human). Therefore a spammer running OCR software (as many of them do to decipher lesser CAPTCHAs), will have great difficulty cracking it. If you would like an in-depth lesson about how reCAPTCHA (and CAPTCHAs in general) works, take a listen to Steve Gibson’s podcast on the subject.

The brilliance of reCAPTCHA and its simultaneous assistance to the digital commons leads one to ponder: What other aspects of digitization, cataloging, and research could be aided by giving a large, distributed group of humans the bits that computers have great difficulty with?

And imagine the power of this system if all 60 million CAPTCHAs answered daily were reCAPTCHAs instead. Why not convert your blog or login system to reCAPTCHA today?

Google Books: Champagne or Sour Grapes?

Thursday, August 16th, 2007

Beyond Good and EvilIs it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books—sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.

This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Lawrence Sterne’s absurdist and raunchy comedy Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.

I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.

Of course Google has poor scans—as the saying goes, haste makes waste—but this is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google might have possible remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans that are different than the lo-res ones they display (a point made at the Million Books workshop; they use the originals for OCR), which they can revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ versions.

Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.

Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: How can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you do the calculations of that assessment, you’ll probably see that the perfect library scanning project would take 50 years rather than 5. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and as a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision to make.

Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem of Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently they have added the ability to view some books in “plain text” (i.e., the OCRed text, but it’s hard to copy text from multiple pages at once), and even in some cases to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.

An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?

[Image courtesy of Ubisoft.]

The “Google Five” Describe Progress, Challenges

Thursday, June 28th, 2007

Among other things learned by the original five libraries that signed up with Google to have their collections digitized is this gem: “About one percent of the Bodleian Library’s books have uncut pages, meaning they’ve never been opened.” I used to find books like this at Yale and felt quite bad for their authors. Imagine all of the effort that goes into writing a book–and then no one, for hundreds of years at Oxford or Yale (bookish places with esoteric interests, wouldn’t you say?), takes even a peek inside. Evidently the long tail doesn’t quite extend all the way.

Google Books: What’s Not to Like?

Wednesday, May 9th, 2007

The American Historical Association’s Rob Townsend takes some sharp jabs at Google’s ambitious library scanning project. Some of the comments are equally sharp.