Category Archives: Digitization

The Digital Public Library of America: Coming Together

I’m just back from the Digital Public Library of America meeting in Chicago, and like many others I found the experience inspirational. Just two years ago a small group convened at the Radcliffe Institute and came up with a one-sentence sketch for this new library:

An open, distributed network of comprehensive online resources that would draw on the nation’s living heritage from libraries, universities, archives and museums in order to educate, inform and empower everyone in the current and future generations.

In a word: ambitious. Just two short years later, out of the efforts of that steering committee, the workstream members (I’m a convening member of the Audience and Participation workstream), over a thousand people who participated in online discussions and at three national meetings, the tireless efforts of the secretariat, and the critical leadership of Maura Marx and John Palfrey, the DPLA has gone from the drawing board to an impending beta launch in April 2013.

As I was tweeting from the Chicago meeting, distant respondents asked what the DPLA is actually going to be. What follows is what I see as some of its key initial elements, though it will undoubtedly grow substantially. (One worry expressed by many in Chicago was that the website launch in April will be seen as the totality of the DPLA, rather than a promising starting point.)

The primary theme in Chicago is the double-entendre subtitle of this post: coming together. It was clear to everyone at the meeting that the project was reaching fruition, garnering essential support from public funders such as the National Endowment for the Humanities and the Institute of Museum and Library Services, and private foundations such as Sloan, Arcadia, and (most recently) Knight. Just as clear was the idea that what distinguishes the DPLA from—and means it will be complementary to—other libraries (online and off) is its potent combination of local and national efforts, and digital and physical footprints.

Ponds->Lakes->Ocean

The foundation of the DPLA will be a huge store of metadata (and potentially thumbnails), culled from hundreds of sources across America. A large part of the initial collection will come from recently freed metadata about books, videos, audio recordings, images, manuscripts, and maps from large institutions like Harvard, provided under the couldn’t-be-more-permissive CC0 license. Wisely, in my estimation (perhaps colored by the fact that I’m a historian), the DPLA has sought out local archival content that has been digitized but is languishing in places that cannot solicit a large audience, and that do not have the know-how to enable modern web services such as APIs.

As I put it on Twitter, one can think of this initial set of materials (beyond the millions of metadata records from universities) as content from local ponds—small libraries, archives, museums, and historic sites—sent through streams to lakes—state digital libraries, which already exist in 40 states (a surprise to many, I suspect)—and then through rivers to the ocean—the DPLA. The DPLA will run a sophisticated technical infrastructure that will support manifold uses of this aggregation of aggregations.

Plan Nationally, Scan Locally

Since the Roy Rosenzweig Center for History and New Media has worked with many local archives, museums, and historic sites, especially through our Omeka project (which has been selected as the software to run online exhibits for the DPLA), I was aware of the great cultural heritage materials that are out there in this country. The DPLA is right: much of this incredible content is effectively invisible, failing to reach national and international audiences. The DPLA will bring huge new traffic to local scanning efforts. Funding agencies such as the Institute of Museum and Library Services have already provided the resources to scan numerous items at the local level; as IMLS Director Susan Hildreth pointed out, their grant to the DPLA meant that they could bring that already-scanned content to the world—a multiplier effect.

In Chicago we discussed ways of gathering additional local content. My thought was that local libraries can brand a designated computer workstation with the blue DPLA banner, with a scanner and a nice screen showing the cultural riches of the community in slideshow mode. Directions and help will be available to scan in new documents from personal or community collections.

[My very quick mockup of a public library DPLA workstation; underlying Creative Commons photo by Flickr user JennieB]

Others envisioned “Antiques Roadshow”-type events, and Emily Gore, Director of Content at the DPLA, who coined the great term Scannebagos, spoke of mobile scanning units that could digitize content across the country.

The DPLA is not alone in sensing this great unmet need for public libraries and similar institutions to assist communities in the digital preservation of personal and local history. For instance, Bill LeFurgy, who works at the Library of Congress with the National Digital Information Infrastructure and Preservation Program (NDIIPP), recently wrote:

Cultural heritage organizations have a great opportunity to fulfill their mission through what I loosely refer to as personal digital archiving…Cultural heritage institutions, as preserving entities with a public service orientation, are well-positioned to help people deal with their growing–and fragile–personal digital archives. This is a way for institutions to connect with their communities in a new way, and to thrive.

I couldn’t agree more, and although Bill focused mostly on the born-digital materials that we all have in abundance today, this mission of digital preservation can easily extend back to analog artifacts from our past. As the University of Wisconsin’s Dorothea Salo has put it, let’s turn collection development inside out, from centralized organizations to a distributed model.

When Roy and I wrote Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web, we debated the merits of “preservation through digitization.” While it may be problematic for certain kinds of rare materials, there is no doubt that local and personal collections could use this pathway. Given recent (and likely forthcoming) cuts to local archives, this seems even more meritorious.

The Best of the Digital and the Physical

The core strength, and unique feature, of the DPLA is thus that it will bring together the power and reach of the digital realm with the local community and trust in the thousands of American public libraries, museums, and historical sites—an extremely compelling combination. We are going through a difficult transition from print to digital reading, in which people are buying ebooks they cannot share or pass down to their children. The ephemerality of the digital is likely to become increasingly worrisome in this transition. At the same time people are demanding of their local libraries a greater digital engagement.

Ideally the DPLA can help public libraries and vice versa. With a stable, open DPLA combined with on-the-ground libraries, we can begin to articulate a model that protects and makes accessible our cultural heritage through and beyond the digital transition. For the foreseeable future public libraries will continue to house physical materials—the continued wonders of the codex—as well as provide access to the internet for the still significant minority without such access. And the DPLA can serve as a digital attic and distribution center for those libraries.

The key point, made by DPLA board member Laura DeBonis, is that with this physical footprint in communities the DPLA can do things that Google and other dotcoms cannot. She did not mean this as a criticism of Google Books (a project she was involved with when she worked at Google), which has done impressive work in scanning over 20 million books. But the DPLA has an incredible potential local network it can take advantage of to reach out to millions of people and have them share their history—in general, to democratize the access to knowledge.

It is critical to underline this point: the DPLA will be much more than its technical infrastructure. It will succeed or fail not on its web services but on its ability to connect with localities across the United States and have them use—and contribute—to the DPLA.

A Community-Oriented Platform

Having said that, the technical infrastructure is looking solid. But here, too, the Technical Aspects workstream is keeping foremost in their mind community uses. As workstream member David Weinberger has written, we can imagine a future library as a platform, one that serves communities:

In many instances, those communities will be defined geographically, whether it’s a town’s local library or a university community; in some instances, the community will be defined by interest, not by geography. In either case, serving a defined community has two advantages. First, it enables libraries to accomplish the mission they’ve been funded to accomplish. Second, user networks depend upon and assume local knowledge, interests, and norms. While a local library platform should interoperate with the rest of the world’s library platforms, it may do best if it is distinctively local…

Just as each project created by a developer makes it easier for the next developer to create the next app, each interaction by users ought to make the library platform a little smarter, a little wiser, a little more tuned to its users interests. Further, the visible presence of neighbors and the availability of their work will not only make the library an ever more essential piece of the locality’s infrastructure, it can make the local community itself more coherent and humane.

Conceiving of the library as a platform not only opens a range of new services and provides for a continuous increase in the library’s value, it also does something libraries urgently need to do: it changes the criteria of success. A library platform should be measured less on the circulation of its works than in the circulation of the ideas and passions these works spark — from how many works are checked out to the community’s engagement with its own grappling with those works. This is not only a metric that libraries-as-platforms can excel at, it is in fact a measure of what has always been the truest value of libraries.

In that sense, by becoming a platform the library can better fulfill the abiding mission it set itself: to be a civic institution essential to democracy.

Nicely put.

New Uses for Local History

It’s not hard to imagine many apps and sites incorporating the DPLA’s aggregation of local historical content. It struck me that an easy first step is incorporation of the DPLA into existing public library apps. Here in Fairfax, Virginia, our county has an app that is fairly rudimentary but quickly becoming popular because it replaces that library card you can never find. (The app also can alert you to available holds and new titles, and search the catalog.)

I fired up the Fairfax Library app on my phone at the Chicago meeting, and although the county doesn’t know it yet, there’s already a slot for the DPLA in the app. That “local” tab at the bottom can sense where you are and direct you to nearby physical collections; through the DPLA API it will be trivial to also show people digitized items from their community or current locale.

Granted, Fairfax County is affluent and has a well-capitalized public library system that can afford a smartphone app. But my guess is the app is fairly simple and was probably built from a framework other libraries use (indeed, it may be part of Fairfax County’s ILS vendor package), so DPLA integration could happen with many public libraries in this way. For libraries without such resources, I can imagine local hackfests lending a hand, perhaps working from a base app that can be customized for different public libraries easily.

Long-time readers of this blog can identify dozens of other apps that will be hungry for DPLA content. The idea of marrying geolocation with historical materials has flourished in the last two years, with apps like HistoryPin showing how people can find out about the history around them.

Even Google has gotten into the act of location + history with its recently launched Field Trip app. I suspect countless similar projects will be enhanced by, or based on, the DPLA API.

Moreover, geolocating historical documents is but one way to use the technical infrastructure of the DPLA. As the technical working group has wisely noted, the platform exists for unintended uses as well as obvious ones. To explore the many possibilities, there will next be an “Appfest” at the Chattanooga Public Library on November 8-9, 2012. And I’m planning a DPLA hacking session here at the Roy Rosenzweig Center for History and New Media for December 6, 2012, concurrent with an Audience and Participation workstream meeting. Stay tuned for details.

The Speculative

Only hinted at in Chicago, but worthy of greater thought, is what else we might do with the combination of thousands of public libraries and the DPLA. This area is more speculative, for reasons ranging from legal considerations to the changing nature of reading. The strong fair use arguments that won the day in the Authors Guild v. HathiTrust case (the ruling was handed down the day before DPLA Midwest) may—may— enable new kinds of  sharing of digital materials within geofenced areas such as public libraries. (Chicago did not have a report from DPLA’s legal workstream, so we await their understanding of the shifting copyright and fair use landscape in the wake of landmark positive rulings in the HathiTrust and Georgia State cases.)

Perhaps the public library can achieve, in the medium term, some kind of hybrid physical-digital browsability as imagined in this video of a French bookstore from the near future, in which a simple scan of a book using a tablet transfers an e-text to the tablet. The video gets at the ongoing need for in-person reading advice and the superior browsability of physical bookshelves.

I’ve been tracking a number of these speculative exercises, such as the student projects in Harvard Graduate School of Design’s Library Test Kitchen, which experiments with media transformations of libraries. I suspect that bookfuturists will think of other potential physical/digital hybrids.

But we need not get fancy. More obvious benefits abound. The DPLA will be widely used by teachers and students, with scans being placed into syllabi and contextualized by scholars. Judging by the traffic RRCHNM’s educational sites and digital archives get, I expect a huge waiting audience for this. I can also anticipate local groups of readers and historical enthusiasts gathering in person to discuss works from the DPLA.

Momentum, but Much Left to Do

To be sure, many tough challenges still await the DPLA. Largely absent from the discussion in Chicago, with its focus on local history, is the need to see what the digital library can do with books. After all, the majority of circulations from public libraries are popular, in-copyright works, and despite great unique local content the public may expect that P in DPLA to provide a bit more of what they are used to from their local library. Finding ways to have big publishers share at least some books through the system—or perhaps start with smaller publishers willing to experiment with new models of distribution—will be an important piece of the puzzle.

As I noted at the start, the DPLA now has funding from public and private sources, but it will have to raise much, much more, not easy in these austere times. It needs a staff with the energy to match the ambition of the project, and the chops to execute a large digital project that also has in-person connections in 50 states.

A big challenge, indeed. But who wouldn’t like a public, open, digital library that draws from across the United States “to educate, inform and empower everyone”?

 

What Scholars Want from the Digital Public Library of America

[A rough transcript of my talk at the Digital Public Library of America meeting at Harvard on March 1, 2011. To permit unguarded, open discussion, we operated under the Chatham House Rule, which prevents attribution of comments, but I believe I'm allowed to violate my own anonymity.]

I was once at a meeting similar to this one, where technologists and scholars were discussing what a large digital library should look like. During a breakout session, the technologists huddled and talked about databases, indices, search mechanisms; the scholars, on the other side of the room, painted a vision of what the archive would look like online, in their view a graphical representation as close to the library as possible, where one could pull down boxes from the shelves, and then open those boxes and leaf through the folios one by one.

While the technologists debated digital infrastructure, the scholars were trying to replicate or maintain what they liked about the analog world they knew: a trusted order, the assurance of the physical, all of the cues they pick up from the shelf and the book. If we want to think about the Digital Public Library of America from the scholar’s point of view, we must think about how to replicate those signals while taking advantage of the technology. In short: the best of the single search box with the trust and feel of the bookshelf.

So how can this group translate those scholarly concerns into elements of the DPLA? I did what any rigorous, traditionally trained scholar would do: I asked my Twitter followers. Here are their thoughts, with my thanks for their help:

First, scholars want reliable metadata about scholarly objects like books. Close enough doesn’t count. Although Google has relatively few metadata errors (given that they handle literally a trillion pieces of metadata), these errors drive scholars mad, and make them skeptical of online collections.

Second, serendipity. Many works of scholarship come from the chance encounter of the scholar with primary sources. How can that be enhanced? Some in my feed suggested a user interface with links to “more like this,” “recent additions in your field,” or “sample collections.” Others advocated social cues, such as user-contributed notes on works in the library.

Third, there are different modes of scholarly research, and the interface has to reflect that: a simple discovery layer with a sophisticated advanced search underneath, faceted search, social search methods for collaborative practice, the ability to search within a collection or subcollection.

Fourth, connection with the physical. We need better representations of books online than the sameness of Google books, where everything looks like a PDF of the same size. Scholars also need the ability to go from the digital to the analog by finding a local copy of a work.

Finally, as I have often said, scholars have uses for libraries that libraries can’t anticipate. So we need the DPLA to enable other parties to build upon, reframe, and reuse the collection. In technical terms, this means open APIs.

Mass Digitization of Books: Exit Microsoft, What Next?

So Microsoft has left the business of digitizing millions of books—apparently because they saw it as no business at all.

This leaves Microsoft’s partner (and our partner on the Zotero project), the Internet Archive, somewhat in the lurch, although Microsoft has done the right thing and removed the contractual restrictions on the books they digitized so they may become part of IA’s fully open collection (as part of the broader Open Content Alliance), which now has about 400,000 volumes. Also still on the playing field is the Universal Digital Library (a/k/a the Million Books Project), which has 1.5 million volumes.

And then there’s Google and its Book Search program. For those keeping score at home, my sources tell me that Google, which coyly likes to say it has digitized “over a million books” so far, has actually finished scanning five million. It will be hard for non-profits like IA to catch up with Google without some game-changing funding or major new partnerships.

Foundations like the Alfred P. Sloan Foundation have generously made substantial (million-dollar) grants to add to the digital public domain. But with the cost of digitizing 10 million pre-1923 books at around $300 million, where might this scale of funds and new partners come from? To whom can the Open Content Alliance turn to replace Microsoft?

Frankly, I’ve never understood why institutions such as Harvard, Yale, and Princeton haven’t made a substantial commitment to a project like OCA. Each of these universities has seen its endowment grow into the tens of billions in the last decade, and each has the means and (upon reflection) the motive to do a mass book digitization project of Google’s scale. $300 million sounds like a lot, but it’s less than 1% of Harvard’s endowment and my guess is that the amount is considerably less than all three universities are spending to build and fund laboratories for cutting-edge sciences like genomics. And a 10 million public-domain book digitization project is just the kind of outrageously grand project HYP should be doing, especially if they value the humanities as much as the sciences.

Moreover, Harvard, Yale, and Princeton find themselves under enormous pressure to spend more of their endowment for a variety of purposes, including tuition remission and the public good. (Full and rather vain disclosure: I have some relationship to all three institutions; I complain because I love.) Congress might even get into the act, mandating that universities like HYP spend a more generous minimum percentage of their endowment every year, just like private foundations who benefit (as does HYP, though in an indirect way) from the federal tax code.

In one stroke HYP could create enormous good will with a moon-shot program to rival Google’s: free books for the world. (HYP: note the generous reaction to, and the great press for, MIT’s OpenCourseWare program.) And beyond access, the project could enable new forms of scholarship through computational access to a massive corpora of full texts.

Alas, Harvard and Princeton partnered with Google long ago. Princeton has committed to digitizing about one million volumes with Google; Harvard’s number is unclear, but probably smaller. The terms of the agreement with Google are non-exclusive; Harvard and Princeton could initiate their own digitization projects or form other partnerships. But I suspect that would be politically difficult since the two universities are getting free digitization services from Google and would have to explain to their overseers why they want to replace free with very expensive. (The answer sounds like Abbott and Costello: the free program produces something that’s not free, while the expensive one is free.)

If Google didn’t exist, Harvard would probably be the most obvious candidate to pull off the Great Digitization of Widener. Not only does it have the largest endowment; historian Robert Darnton, a leader in thinking about the future (and the past) of the book, is now the director of the Harvard library system. Harvard also recently passed an open access mandate for the publications of its faculty.

Princeton has the highest per-student endowment of any university, and could easily undertake a mass digitization project of this scale. Perhaps some of the many Princeton alumni who went on to vast riches on the Web, such as EBay‘s Meg Whitman (who has already given $100 million to Princeton) or Amazon‘s Jeff Bezos, could pitch in.

But Harvard’s and Princeton’s Google “non-exclusive” partnership makes these outcomes unlikely, as does the general resistance in these universities to spending science-scale funds outside of the sciences (unless it’s for a building).

That leaves Yale. Yale chose Microsoft last year to do its digitization, and has now been abandoned right in the middle of its project. Since Microsoft is apparently leaving its equipment and workflow in place at partner institutions, Yale could probably pick up the pieces with an injection of funding from its endowment or from targeted alumni gifts. Yale just spent an enormous amount of money on a new campus for the sciences, and this project could be seen as a counterbalance for the humanities.

Or, HYP could band together and put in a mere $100 million each to get the job done.

Is this likely to happen? Of course not. HYP and other wealthy institutions are being asked to spend their prodigious endowments on many other things, and are reluctant to up their spending rate at all. But I believe a HYP or HYP-like solution is much more likely than public funding for this kind of project, as the Human Genome Project received.

Google Book Search Begins Adding Quality Control Measures

As predicted in this space six months ago, Google has added the ability for users to report missing or poorly scanned pages in their Book Search. (From my post “Google Books: Champagne or Sour Grapes?“: “Just as they have recently added commentary to Google News, they could have users flag problematic pages.”)

I’ll say it again: criticism of Google Book Search that focuses on quality chases a red herring—something that Google can easily fix. Let’s focus instead on more substantive issues, such as the fact that Google’s book archive is not truly open.

Digitization and Repatriation

Elgin MarblesIt’s always worth listening to Cliff Lynch‘s opening talks at the CNI task force meetings, and this week’s meeting in Washington was no exception. (My apologies for not blogging the meeting; busy week.) Like no one else, Cliff has his finger on the pulse of all that is new and important in the world of the digital humanities. Although Cliff discussed some issues that have received a lot of press, such as net neutrality, I found one issue he raised totally unexpected and fascinating.

Cliff noted that digital surrogates for museum objects—that is, digital photographs or 2- or 3-D scans—are becoming so good that for most scholarly and classroom purposes they can replace the originals. For many years, one of the main arguments museums have used to avoid the repatriation of foreign materials—e.g., sculpture or pottery taken during colonization or war—is that they worried about the accessibility and condition of an object if they returned it. Scholars might lose important evidence, museums argued, and researchers often needed to look at the original object for small details like texture or paint color. With advances in digitization, however, this objection no longer holds water, and museums should feel more pressure (or more freedom) to repatriate controversial items in their collections.

[Creative Commons licensed photo of the Elgin Marbles courtesy of zakgallop on Flickr.]

Tony Grafton on Digital Texts and Reading

Anthony Grafton was the first person to turn me onto intellectual history. His seminar on ideas in the Renaissance was one of the most fascinating courses I took at Princeton, and I still remember well Tony rocking in his seat, looking a bit like a young Karl Marx, making brilliant connections among a broad array of sources.

So it’s not unexpected given his wide-ranging interests but still terrific to see a scholar who has spent so much time with early books thinking deeply about “digitization and its discontents” in his article “Future Reading” in the latest issue of The New Yorker. And it’s even more gratifying to see Tony note in his online companion piece to “Future Reading,” “Adventures in Wonderland,” that “One of the best ways to get a handle on the sprawling world of digital sources is through George Mason University’s Center for History and New Media.”

Google Books: Is It Good for History?

The September 2007 issue of the American Historical Association’s Perspectives is now available online, and it is worth reading Rob Townsend’s article “Google Books: Is It Good for History?” The article is an update of Rob’s much-debated post on the AHA blog in May, and I believe this revised version now reads as the best succinct critique of Google Books available (at least from the perspective of scholars). Rob finds fault with Google’s poor scans, frequently incorrect metadata, and too-narrow interpretation of the public domain.

Regular readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he says.

Understanding reCAPTCHA

reCAPTCHAOne of the things I added to this blog when I moved from my own software to WordPress was the red and yellow box in the comments section, which defends this blog against comment spam by asking commenters to decipher a couple of words. Such challenge-response systems are called CAPTCHAs (a tortured and unmellifluous acroynm of “completely automated public Turing test to tell computers and humans apart”). What really caught my imagination about the CAPTCHA I’m using, called reCAPTCHA, is that it uses words from books scanned by the Internet Archive/Open Content Alliance. Thus at the same time commenters solve the word problems they are effectively serving as human OCR machines.

To date, about two million words have been deciphered using reCAPTCHA (see the article in Technology Review lauding reCAPTCHA’s mastermind, Luis von Ahn), which is a great start but by my calculation (100,000 words per average book) only the equivalent of about 20 books. Of course, it’s really much more than that because the words in reCAPTCHA are the hardest ones to decipher by machine and are sprinkled among thousands of books.

Indeed, that is the true genius of reCAPTCHA—it “tells computers and humans apart” by first using OCR software to find words computers can’t decipher, then feeds those words to humans, who can decipher the words (proving themselves human). Therefore a spammer running OCR software (as many of them do to decipher lesser CAPTCHAs), will have great difficulty cracking it. If you would like an in-depth lesson about how reCAPTCHA (and CAPTCHAs in general) works, take a listen to Steve Gibson’s podcast on the subject.

The brilliance of reCAPTCHA and its simultaneous assistance to the digital commons leads one to ponder: What other aspects of digitization, cataloging, and research could be aided by giving a large, distributed group of humans the bits that computers have great difficulty with?

And imagine the power of this system if all 60 million CAPTCHAs answered daily were reCAPTCHAs instead. Why not convert your blog or login system to reCAPTCHA today?

Google Books: Champagne or Sour Grapes?

Beyond Good and EvilIs it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books—sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.

This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Lawrence Sterne’s absurdist and raunchy comedy Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.

I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.

Of course Google has poor scans—as the saying goes, haste makes waste—but this is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google might have possible remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans that are different than the lo-res ones they display (a point made at the Million Books workshop; they use the originals for OCR), which they can revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ versions.

Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.

Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: How can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you do the calculations of that assessment, you’ll probably see that the perfect library scanning project would take 50 years rather than 5. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and as a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision to make.

Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem of Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently they have added the ability to view some books in “plain text” (i.e., the OCRed text, but it’s hard to copy text from multiple pages at once), and even in some cases to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.

An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?

[Image courtesy of Ubisoft.]