Category Archives: Digitization

The “Google Five” Describe Progress, Challenges

Among other things learned by the original five libraries that signed up with Google to have their collections digitized is this gem: “About one percent of the Bodleian Library’s books have uncut pages, meaning they’ve never been opened.” I used to find books like this at Yale and felt quite bad for their authors. Imagine all of the effort that goes into writing a book–and then no one, for hundreds of years at Oxford or Yale (bookish places with esoteric interests, wouldn’t you say?), takes even a peek inside. Evidently the long tail doesn’t quite extend all the way.

“The Object of History” Site Launches

Thanks to the hard work of my colleagues at the Center for History and New Media, led by Sharon Leon, you can now go behind the scenes with the curators of the National Museum of American History. This month the discussion begins with the famous Greensboro Woolworth’s lunch counter and the origins of the Civil Rights movement. Each month will highlight a new object and its corresponding context, delivered in rich multimedia and with the opportunity to chat with the curators themselves.

A Closer Look at the National Archives-Footnote Agreement

I’ve spent the past two weeks trying to get a better understanding of the agreement signed by the National Archives and Footnote, about which I raised several concerns in my last post. Before making further (possibly unfounded) criticisms I thought it would be a good idea to talk to both NARA and Footnote. So I picked up the phone and found several people eager to clarify things. At NARA, Jim Hastings, director of access programs, was particularly helpful in explaining their perspective. (Alas, NARA’s public affairs staff seemed to have only the sketchiest sense of key details.) Most helpful, and most eager to rebut my earlier post, were Justin Schroepfer and Peter Drinkwater, the marketing director and product lead at Footnote. Much to their credit, Justin and Peter patiently answered most of my questions about the agreement and the operation of the Footnote website.

Surprisingly, everyone I spoke to at both NARA and Footnote emphasized that despite the seemingly set-in-stone language of the legal agreement, there is a great deal of latitude in how it is executed, and they asked me to spread the word about how historians and the general public can weigh in. It has received virtually no publicity, but NARA is currently in a public comment phase for the Footnote (a/k/a iArchives) agreement. Scroll down to the bottom of the “Comment on Draft Policy” page at NARA’s website and you’ll find a request for public comment (you should email your thoughts to Vision@nara.gov). It’s a little odd to have a request for comment after the ink is dry on an agreement or policy, and this URL probably should have been included in the press release of the Footnote agreement, but I do think after speaking with them that both NARA and Footnote are receptive to hearing responses to the agreement. Indeed, in response to this post and my prior post on the agreement, Footnote has set up a web page, “Finding the Right Balance,” to receive feedback from the general public on the issues I’ve raised. They also asked me to round up professional opinion on the deal.

I assume Footnote will explain their policies in greater depth on their blog, but we agreed that it would be helpful to record some important details of our conversations in this space. Here are the answers Justin and Peter gave to a few pointed questions.

When I first went to the Footnote site, I was unpleasantly surprised that it required registration even to look at “milestone” documents like Lincoln’s draft of the Gettysburg Address. (Unfortunately, Footnote doesn’t have a list of all of its free content yet, so it’s hard to find such documents.) Justin and Peter responded that when they launched the site there was an error in the document viewer, so they had to add authentication to all document views. A fix was rolled out on January 23, and it’s now possible to view these important documents without registering.

You do need to register, however, to print or download any document, whether it’s considered “free” or “premium.” Why? Justin and Peter candidly noted that although they have done digitization projects before, the National Archives project, which contains millions of critical—and public domain—documents, is a first for them. They are understandably worried about the “leakage” of documents from their site, and want to take it one step at a time. So to start they will track all downloads to see how much escapes, especially in large batches. I noted that downloading and even reusing these documents (even en masse) very well might be legal, despite Footnote’s terms of service, because the scans are “slavish” copies of the originals, which are not protected by copyright. Footnote lawyers are looking at copyright law and what other primary-source sites are doing, and they say that they view these initial months as a learning experience to see if the terms of service can or should change. Footnote’s stance on copyright law and terms of usage will clearly be worth watching.

Speaking of terms of usage, I voiced a similar concern about Footnote’s policies toward minors. As you’ll recall, Footnote’s terms of service say the site is intended for those 18 and older, thus seeming to turn away the many K-12 classes that could take advantage of it. Justin and Peter were most passionate on this point. They told me that Footnote would like to give free access to the site for the K-12 market, but pointed to the restrictiveness of U.S. child protection laws. Because the Footnote site allows users to upload documents as well as view them, they worry about what youngsters might find there in addition to the NARA docs. These laws also mandate the “over 18” clause because the site captures personal information. It seems to me that there’s probably a technical solution that could be found here, similar to the one PBS.org uses to provide K-12 teaching materials without capturing information from the students.

Footnote seems willing to explore such a possibility, but again, Justin and Peter chalked up problems to the newness of the agreement and their inexperience running an interactive site with primary documents such as these. Footnote’s lawyers consulted (and borrowed, in some cases) the boilerplate language from terms of service at other sites, like Ancestry.com. But again, the Footnote team emphasized that they are going to review the policies and look into flexibility under the laws. They expect to tweak their policies in the coming months.

So, now is your chance to weigh in on those potential changes. If you do send a comment to either Footnote or NARA, try to be specific in what you would like to see. For instance, at the Center for History and New Media we are exploring the possibility of mining historical texts, which will only be possible to do on these millions of NARA documents if the Archives receives not only the page images from Footnote but also the OCRed text. (The handwritten documents cannot be automatically transcribed using optical character recognition, of course, but there are many typescript documents that have been converted to machine-readable text.) NARA has not asked to receive the text for each document back from Footnote—only the metadata and a combined index of all documents. There was some discussion that NARA is not equipped to handle the flood of data that a full-text database would entail. Regardless, I believe it would be in the best interest of historical researchers to have NARA receive this database, even if they are unable to post it to the web right away.

The Flawed Agreement between the National Archives and Footnote, Inc.

I suppose it’s not breaking news that libraries and archives aren’t flush with cash. So it must be hard for a director of such an institution when a large corporation, or even a relatively small one, comes knocking with an offer to digitize one’s holdings in exchange for some kind of commercial rights to the contents. But as a historian worried about open access to our cultural heritage, I’m a little concerned about the new agreement between Footnote, Inc. and the United States National Archives. And I’m surprised that somehow this agreement has thus far flown under the radar of all of those who attacked the troublesome Smithsonian/Showtime agreement. Guess what? From now until 2012 it will cost you $100 a year, or even more offensively, $1.99 a page, for online access to critical historical documents such as the Papers of the Continental Congress.

This was the agreement signed by Archivist of the United States Allen Weinstein and Footnote, Inc., a Utah-based digital archives company, on January 10, 2007. For the next five years, unless you have the time and money to travel to Washington, you’ll have to fork over money to Footnote to take a peek at Civil War pension documents or the case files of the early FBI. The National Archives says this agreement is “non-exclusive” (I suppose crossing their fingers that Google will also come along and make a deal), but researchers shouldn’t hold their breath for other options.

Footnote.com, the website that provides access to these millions of documents, charges for anything more than viewing a small thumbnail of a page or photograph. Supposedly the value-added of the site (aside from being able to see detailed views of the documents) is that it allows you to save and annotate documents in your own library, and share the results of your research (though not the original documents). Hmm, I seem to remember that there’s a tool being developed that will allow you to do all of that, and for free, no less.

Moreover, you’ll also be subject to some fairly onerous terms of usage on Footnote.com, especially considering that this is our collective history and that all of these documents are out of copyright. (For a detailed description of the legal issues involved here, please see Chapter 7 of Digital History, “Owning the Past?”, especially the section covering the often bogus claims of copyright on scanned archival materials.) I’ll let the terms speak for themselves (plus one snide aside): “Professional historians and others conducting scholarly research may use the Website [gee, thanks], provided that they do so within the scope of their professional work, that they obtain written permission from us before using an image obtained from the Website for publication, and that they credit the source. You further agree that…you will not copy or distribute any part of the Website or the Service in any medium without Footnote.com’s prior written authorization.”

Couldn’t the National Archives have at least added a provision to the agreement with Footnote to allow students free access to these documents? I guess not; from the terms of usage: “The Footnote.com Website is intended for adults over the age of 18.” What next? Burly bouncers carding people who want to see the Declaration of Independence?

Five Catalonian Libraries Join the Google Library Project

The Google Library Project has, for the most part, focused on American libraries, thus pushing the EU to mount a competing project; will this announcement (which includes the National Library of Barcelona), coming on the heels of an agreement with the Complutense University of Madrid, signal the beginning of Google making inroads in Europe?

Google Fingers

No, it’s not another amazing new piece of software from Google, which will type for you (though that would be nice). Just something that I’ve noticed while looking at many nineteenth-century books in Google’s massive digitization project. The following screenshot nicely reminds us that at the root of the word “digitization” is “digit,” which is from the Latin word “digitus,” meaning finger. It also reminds us that despite our perception of Google as a collection of computer geniuses, and despite their use of advanced scanning technology, their library project involves an almost unfathomable amount of physical labor. I’m glad that here and there, the people doing this difficult work (or at least their fingers) are being immortalized.

[The first page of a Victorian edition of Plato’s Euthyphron, a dialogue about the origin and nature of piety. Insert your own joke here about Google’s “Don’t be evil” motto.]

Google Book Search Blog

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now “Inside Google Book Search” seems more like “Outside Google Book Search,” with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting “success stories,” and soliciting participation from librarians, authors, and publishers. Hopefully we’ll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google’s project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I’m a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries.

Impact of Field v. Google on the Google Library Project

I’ve finally had a chance to read the federal district court ruling in a case, Field v. Google, that has not been covered much (except in the technology press), but which has obvious and important implications for the upcoming battle over the legality of Google’s library digitization project. The case involved a lawyer who dabbles in some online poetry, and who was annoyed that Google’s spider cached a version of his copyrighted ode to delicious tea (“Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein…”). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google’s victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google’s wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion.

Courts have traditionally used four factors to determine fair use—the purpose of the copying, the nature of the work, the extent of the copying, and the effect on the market of the work.

On purpose, the court ruled that Google’s cache was not simply a copy of that work, but added substantial value that was important to users of Google’s search engine. Users could still read Field’s poetry even if his site was down; they could compare Google’s cache with the original site to see if any changes had been made; they could see their search terms highlighted in the page. Furthermore, with a clear banner across the top Google tells its users that this is a copy and provides a link to the original. It also provides methods for website owners to remove their pages from the cache. This emphasis on opt out seems critical, since Google has argued that book publishers can simply tell them if they don’t want their books digitized. Also, the court ruled that Google’s status as a commercial enterprise doesn’t matter here. Advantage for Google et al.

On the nature of the work, the court looked less at the quality of Field’s writing (“Simple flavors, simple aromas, simple preparation…”) than at Field’s intentions. Since he “sought to make his works available to the widest possible audience for free” by posting his poems on the Internet, and since Field was aware that he could (through the robots.txt file) exclude search engines from indexing his site, the court thought Field’s case with respect to this fair use factor was weakened. But book publishers and authors fighting Google will argue that they do not intend this free and wide distribution. Disadvantage for Google et al.
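For readers who have never looked inside a robots.txt file, here is a minimal sketch of the opt-out mechanism the court refers to, written with Python’s standard urllib.robotparser module. The rules, the example.com address, and the "OtherBot" crawler name are hypothetical illustrations, not Field’s actual site or configuration, and a real crawler’s handling of these rules may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative poetry site (not Field's real file).
# It asks Googlebot to stay out of /poems/ and places no limits on other crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /poems/

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/poems/tea.html"
    print("Googlebot allowed:", allowed("Googlebot", page))
    print("OtherBot allowed:", allowed("OtherBot", page))
```

The point of the sketch is only that the exclusion is a few lines of plain text that a site owner controls, which is why the court treated Field’s failure to use it as significant.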

One would think that the third factor, the extent of the copying, would be a clear loser for Google, since they copy entire web pages as a matter of course. But the Nevada court ruled that because Google’s cache serves “multiple transformative and socially valuable purposes…that could not be effectively accomplished by using only portions” of web pages, and because Google points users to the original texts, this wholesale copying was OK. You can see why Google’s lawyers are overjoyed by this part of the ruling with respect to the book digitization project. Big advantage for Google et al.

Perhaps the cruelest part of the ruling had to do with the fourth factor of fair use, the effect on the market of the work. The court determined from its reading of Field’s ode to tea that “there is no evidence of any market for Field’s works.” Ouch. But there is clearly a market for many books that remain in copyright. And since the Google library project has just begun we don’t have any economic data about Google Book Search’s impact on the market for hard copies. No clear winner here.

In addition, the Nevada court added a critical fifth factor for determining fair use in this case: “Google’s Good Faith.” By providing ways to include and exclude materials from its cache, by providing a way to complain to the company, and by clearly spelling out its intentions in the display of the cache, the court determined that Google was acting in good faith: it was simply trying to provide a useful service and had no intention to profit from Field’s obsession with tea. Google has a number of features that replicate this sense of good faith in its book program, like providing links to libraries and booksellers, methods for publishers and authors to complain, and techniques for preventing user copies of copyrighted works. Advantage for Google et al.

A couple of final points that may work against Google. First, the court made a big deal out of the fact that the cache copying was completely automated, which the Google book project is clearly not. Second, the ruling constantly emphasizes the ability of Field to opt out of the program, but upset book publishers and authors believe this should be opt in, and it’s quite possible another court could agree with that position, which would weaken many of the points made above.

Google, the Khmer Rouge, and the Public Good

Like Daniel into the lion’s den, Mary Sue Coleman, the President of the University of Michigan, yesterday went in front of the Association of American Publishers to defend her institution’s participation in Google’s massive book digitization project. Her speech, “Google, the Khmer Rouge and the Public Good,” is an impassioned defense of the project, if a bit pithy at certain points. It’s worth reading in its entirety, but here are some highlights with commentary.

In two prior posts, I wondered what will happen to those digital copies of the in-copyright books the university receives as part of its deal with Google. Coleman obviously knew that this was a major concern of her audience, and she went overboard to satisfy them: “Believe me, students will not be reading digital copies of ‘Harry Potter’ in their dorm rooms…We will safeguard the entirety of this archive with the same diligence we accord our most sensitive materials at the University: medical records, Defense Department data, and highly infectious disease agents used in research.” I’m not sure if books should be compared to infectious disease agents, but it seems clear that the digital copies Michigan receives are not likely to make it into “the wild” very easily.

Coleman reminded her audience that for a long time the books in the Michigan library did not circulate and were only accessible to the Board of Regents and the faculty (no students allowed, of course). Finally Michigan President James Angell declared that books were “not to be locked up and kept away from readers, but to be placed at their disposal with the utmost freedom.” Coleman feels that the Google project is a natural extension of that declaration, and more broadly, of the university’s mission to disseminate knowledge.

Ultimately, Coleman turns from more abstract notions of sharing and freedom to the more practical considerations of how students learn today: “When students do research, they use the Internet for digitized library resources more than they use the library proper. It’s that simple. So we are obligated to take the resources of the library to the Internet. When people turn to the Internet for information, I want Michigan’s great library to be there for them to discover.” Sounds about right to me.