Dan Cohen

Archive for the ‘Libraries’ Category

Mass Digitization of Books: Exit Microsoft, What Next?

Thursday, May 29th, 2008

So Microsoft has left the business of digitizing millions of books—apparently because they saw it as no business at all.

This leaves Microsoft’s partner (and our partner on the Zotero project), the Internet Archive, somewhat in the lurch, although Microsoft has done the right thing and removed the contractual restrictions on the books they digitized so they may become part of IA’s fully open collection (as part of the broader Open Content Alliance), which now has about 400,000 volumes. Also still on the playing field is the Universal Digital Library (a/k/a the Million Books Project), which has 1.5 million volumes.

And then there’s Google and its Book Search program. For those keeping score at home, my sources tell me that Google, which coyly likes to say it has digitized “over a million books” so far, has actually finished scanning five million. It will be hard for non-profits like IA to catch up with Google without some game-changing funding or major new partnerships.

Foundations like the Alfred P. Sloan Foundation have generously made substantial (million-dollar) grants to add to the digital public domain. But with the cost of digitizing 10 million pre-1923 books at around $300 million, where might this scale of funds and new partners come from? To whom can the Open Content Alliance turn to replace Microsoft?

Frankly, I’ve never understood why institutions such as Harvard, Yale, and Princeton haven’t made a substantial commitment to a project like OCA. Each of these universities has seen its endowment grow into the tens of billions in the last decade, and each has the means and (upon reflection) the motive to do a mass book digitization project of Google’s scale. $300 million sounds like a lot, but it’s less than 1% of Harvard’s endowment and my guess is that the amount is considerably less than all three universities are spending to build and fund laboratories for cutting-edge sciences like genomics. And a project to digitize 10 million public-domain books is just the kind of outrageously grand project HYP should be doing, especially if they value the humanities as much as the sciences.

Moreover, Harvard, Yale, and Princeton find themselves under enormous pressure to spend more of their endowment for a variety of purposes, including tuition remission and the public good. (Full and rather vain disclosure: I have some relationship to all three institutions; I complain because I love.) Congress might even get into the act, mandating that universities like HYP spend a more generous minimum percentage of their endowment every year, just like private foundations who benefit (as does HYP, though in an indirect way) from the federal tax code.

In one stroke HYP could create enormous good will with a moon-shot program to rival Google’s: free books for the world. (HYP: note the generous reaction to, and the great press for, MIT’s OpenCourseWare program.) And beyond access, the project could enable new forms of scholarship through computational access to a massive corpus of full texts.

Alas, Harvard and Princeton partnered with Google long ago. Princeton has committed to digitizing about one million volumes with Google; Harvard’s number is unclear, but probably smaller. The terms of the agreement with Google are non-exclusive; Harvard and Princeton could initiate their own digitization projects or form other partnerships. But I suspect that would be politically difficult since the two universities are getting free digitization services from Google and would have to explain to their overseers why they want to replace free with very expensive. (The answer sounds like Abbott and Costello: the free program produces something that’s not free, while the expensive one is free.)

If Google didn’t exist, Harvard would probably be the most obvious candidate to pull off the Great Digitization of Widener. Not only does it have the largest endowment; historian Robert Darnton, a leader in thinking about the future (and the past) of the book, is now the director of the Harvard library system. Harvard also recently passed an open access mandate for the publications of its faculty.

Princeton has the highest per-student endowment of any university, and could easily undertake a mass digitization project of this scale. Perhaps some of the many Princeton alumni who went on to vast riches on the Web, such as EBay’s Meg Whitman (who has already given $100 million to Princeton) or Amazon’s Jeff Bezos, could pitch in.

But Harvard’s and Princeton’s “non-exclusive” partnerships with Google make these outcomes unlikely, as does the general resistance in these universities to spending science-scale funds outside of the sciences (unless it’s for a building).

That leaves Yale. Yale chose Microsoft last year to do its digitization, and has now been abandoned right in the middle of its project. Since Microsoft is apparently leaving its equipment and workflow in place at partner institutions, Yale could probably pick up the pieces with an injection of funding from its endowment or from targeted alumni gifts. Yale just spent an enormous amount of money on a new campus for the sciences, and this project could be seen as a counterbalance for the humanities.

Or, HYP could band together and put in a mere $100 million each to get the job done.

Is this likely to happen? Of course not. HYP and other wealthy institutions are being asked to spend their prodigious endowments on many other things, and are reluctant to up their spending rate at all. But I believe a HYP or HYP-like solution is much more likely than public funding of the kind the Human Genome Project received.

NYPL’s New Blog

Friday, February 22nd, 2008

A few months ago I mentioned a blog from the New York Public Library’s digital labs. Now the NYPL has launched a superb new overall blog with some terrific images from their collection and some rather humorous and engaging text.

Two Misconceptions about the Zotero-IA Alliance

Friday, December 14th, 2007

Thanks to everyone for their helpful (and thankfully, mostly positive) feedback on the new Zotero-IA alliance. I wanted to try to clear up a couple of things that the press coverage and my own writing failed to communicate. (Note to self: finally get around to going to one of those media training courses so I can learn how to communicate all of the elements of a complex project well in three minutes, rather than lapsing into my natural academic long-windedness.)

1. Zotero + IA is not simply the Zotero Commons

Again, this is probably my fault for not communicating the breadth of the project better. The press has focused on items #1 and 2 in my original post—they are the easiest to explain—but while the project does indeed try to aggregate scholarly resources, it is also trying to solve another major problem with contemporary scholarship: scholars are increasingly using and citing web resources but have no easy way to point to stable URLs and cached web pages. In particular, I encourage everyone to read item #3 in my original post again, since I consider it extremely important to the project.

Items #4 and 5 also note that we are going to leverage IA for better collaboration, discovery, and recommendation systems. So yes, the Commons, but much more too.

2. Zotero + IA is not intended to put institutional repositories out of business, nor are they excluded from participation

There has been some hand-wringing in the library blogosphere this week (see, e.g., Library 2.0) that this project makes an end-run around institutional repositories. These worries were probably exacerbated by the initial press coverage that spoke of “bypassing” the libraries. However, I want to emphasize that this project does not make IA the exclusive back end for contributions. Indeed, I am aware of several libraries that are already experimenting with using Zotero as an input device for institutional repositories. There is already an API for the Zotero client that libraries can extract data and files from, and the server will have an even more powerful API so that libraries can (with their users’ permission, of course) save materials into an archive of their own.

The Strange Dynamics of Technology Adoption and Promotion in Academia

Monday, November 5th, 2007

Kudos to Bruce D’Arcus for writing the blog post I’ve been meaning to write for a while. Bruce notes with some amazement the resistance that free and open source projects like Zotero meet when they encounter the institutional buying patterns and tech evangelism that are all too common in academia. The problem here seems to be that the people purchasing the software (often the libraries at colleges and universities for reference managers like EndNote or RefWorks, and the IT departments for course management systems) are not the end users, nor do they have the proper incentives to choose free alternatives.

As Roy Rosenzweig and I noted in Digital History, the exorbitant yearly licensing fee for Blackboard or WebCT (loathed by every professor I know) could be exchanged for an additional assistant professor–or another librarian. But for some reason a certain portion of academic technology purchasers feel they need to buy something for each of these categories (reference managers, CMS), and then, because they have invested the time and money and long-term contracts on those somethings, they feel they need to exclusively promote those tools without listening to the evolving needs and desires of the people they serve. Nor do they have the incentive to try new technologies or tools.

Any suggestions on how to properly align these needs and incentives? Break out the technology spending in students’ bills (“What, my university is spending that much on Blackboard?”)?

NYPL Labs Blog

Monday, November 5th, 2007


Center for History and New Media alum and incredibly innovative digital thinker Josh Greenberg is now the Director of Digital Strategy and Scholarship at the New York Public Library. One of his first actions was to set up the NYPL Labs to produce and test new tools, technologies, and interfaces. It’s great to see they now have a blog that will expose these experiments in action.

What Do Electronic Resources Mean for the Future of University Libraries?

Thursday, July 26th, 2007

On our Digital Campus podcast, Tom Scheinfeldt, Mills Kelly, and I have been talking a lot about the growing disconnect between students and faculty who are increasingly using software and services, such as web email and Google Docs, that are not the university’s “officially supported” (and often quite expensive to buy, maintain, and support) software and services. In Roger C. Schonfeld and Kevin M. Guthrie, “The Changing Information Services Needs of Faculty” (EDUCAUSE Review, vol. 42, no. 4 (July/August 2007): 8–9), the authors note another possible disconnect on campus:

In the future, faculty expect to be less dependent on the library and increasingly dependent on electronic materials. By contrast, librarians generally think their role will remain unchanged and their responsibilities will only grow in the future. Indeed, over four-fifths of librarians believe that the role of the library as the starting point or gateway for locating scholarly information will be very or extremely important in five years, a decided mismatch with faculty views.

Perceptions of a decline in dependence are probably unavoidable as services are increasingly being provided remotely, and in some ways, these shifting faculty attitudes can be viewed as a sign of the library’s success. The mismatch in views on the gateway function is somewhat more troubling: if librarians view this function as critical but faculty in certain disciplines see it as declining in importance, how can libraries, individually or collectively, strategically realign the services that support the gateway function?

Good question.

Personal WorldCat Lists Now Zotero-Compatible

Thursday, July 5th, 2007

A great example of what I’ve been calling the “fluidity of bibliography.” WorldCat adds a feature that allows registered users to save and share lists of items they find in the WorldCat catalog. We tweak Zotero to work with it. Et voila–easy to find, save, share, grab, and re-share scholarly records.

The “Google Five” Describe Progress, Challenges

Thursday, June 28th, 2007

Among other things learned by the original five libraries that signed up with Google to have their collections digitized is this gem: “About one percent of the Bodleian Library’s books have uncut pages, meaning they’ve never been opened.” I used to find books like this at Yale and felt quite bad for their authors. Imagine all of the effort that goes into writing a book–and then no one, for hundreds of years at Oxford or Yale (bookish places with esoteric interests, wouldn’t you say?), takes even a peek inside. Evidently the long tail doesn’t quite extend all the way.

Five Catalonian Libraries Join the Google Library Project

Friday, January 12th, 2007

The Google Library Project has, for the most part, focused on American libraries, thus pushing the EU to mount a competing project; will this announcement (which includes the National Library of Barcelona), coming on the heels of an agreement with the Complutense University of Madrid, signal the beginning of Google making inroads in Europe?

Impact of Field v. Google on the Google Library Project

Thursday, February 9th, 2006

I’ve finally had a chance to read the federal district court ruling in Field v. Google, a case that has not been covered much (except in the technology press) but which has obvious and important implications for the upcoming battle over the legality of Google’s library digitization project. The case involved a lawyer who dabbles in online poetry and who was annoyed that Google’s spider cached a version of his copyrighted ode to delicious tea (“Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein…”). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google’s victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google’s wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion.

Courts have traditionally used four factors to determine fair use—the purpose of the copying, the nature of the work, the extent of the copying, and the effect on the market of the work.

On purpose, the court ruled that Google’s cache was not simply a copy of that work, but added substantial value that was important to users of Google’s search engine. Users could still read Field’s poetry even if his site was down; they could compare Google’s cache with the original site to see if any changes had been made; they could see their search terms highlighted in the page. Furthermore, with a clear banner across the top Google tells its users that this is a copy and provides a link to the original. It also provides methods for website owners to remove their pages from the cache. This emphasis on opt out seems critical, since Google has argued that book publishers can simply tell them if they don’t want their books digitized. Also, the court ruled that Google’s status as a commercial enterprise doesn’t matter here. Advantage for Google et al.

On the nature of the work, the court looked less at the quality of Field’s writing (“Simple flavors, simple aromas, simple preparation…”) than at Field’s intentions. Since he “sought to make his works available to the widest possible audience for free” by posting his poems on the Internet, and since Field was aware that he could (through the robots.txt file) exclude search engines from indexing his site, the court thought Field’s case with respect to this fair use factor was weakened. But book publishers and authors fighting Google will argue that they do not intend this free and wide distribution. Disadvantage for Google et al.
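For readers who have never looked at a robots.txt file, here is a rough sketch of the opt-out mechanism the court had in mind. It uses Python's standard urllib.robotparser module to check whether a crawler is allowed to fetch a page. The robots.txt contents, the example.com address, and the poem path are made up for illustration; nothing here is taken from Field's actual site or from the ruling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner could publish to turn away all crawlers.
OPT_OUT_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page address, used only for illustration.
POEM_URL = "http://example.com/poems/tea.html"

# With the opt-out file in place, a well-behaved crawler is told to stay away.
blocked = crawler_may_fetch(OPT_OUT_ROBOTS_TXT, "Googlebot", POEM_URL)

# With an empty robots.txt (the default for most sites), crawling is permitted.
allowed = crawler_may_fetch("", "Googlebot", POEM_URL)
```

The point of the sketch is how little effort opting out takes on the web: two lines in a text file. Nothing quite so simple exists for a book sitting on a library shelf, which is part of why the opt-in versus opt-out question matters so much for the library project.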

One would think that the third factor, the extent of the copying, would be a clear loser for Google, since they copy entire web pages as a matter of course. But the Nevada court ruled that because Google’s cache serves “multiple transformative and socially valuable purposes…that could not be effectively accomplished by using only portions” of web pages, and because Google points users to the original texts, this wholesale copying was OK. You can see why Google’s lawyers are overjoyed by this part of the ruling with respect to the book digitization project. Big advantage for Google et al.

Perhaps the cruelest part of the ruling had to do with the fourth factor of fair use, the effect on the market of the work. The court determined from its reading of Field’s ode to tea that “there is no evidence of any market for Field’s works.” Ouch. But there is clearly a market for many books that remain in copyright. And since the Google library project has just begun we don’t have any economic data about Google Book Search’s impact on the market for hard copies. No clear winner here.

In addition, the Nevada court added a critical fifth factor for determining fair use in this case: “Google’s Good Faith.” By providing ways to include and exclude materials from its cache, by providing a way to complain to the company, and by clearly spelling out its intentions in the display of the cache, the court determined that Google was acting in good faith: it was simply trying to provide a useful service and had no intention to profit from Field’s obsession with tea. Google has a number of features that replicate this sense of good faith in its book program, like providing links to libraries and booksellers, methods for publishers and authors to complain, and techniques for preventing user copies of copyrighted works. Advantage for Google et al.

A couple of final points that may work against Google. First, the court made a big deal out of the fact that the cache copying was completely automated, which the Google book project is clearly not. Second, the ruling constantly emphasizes the ability of Field to opt out of the program, but upset book publishers and authors believe this should be opt in, and it’s quite possible another court could agree with that position, which would weaken many of the points made above.