Category Archives: Libraries

Two Misconceptions about the Zotero-IA Alliance

Thanks to everyone for their helpful (and thankfully, mostly positive) feedback on the new Zotero-IA alliance. I wanted to try to clear up a couple of things that the press coverage and my own writing failed to communicate. (Note to self: finally get around to going to one of those media training courses so I can learn how to communicate all of the elements of a complex project well in three minutes, rather than lapsing into my natural academic long-windedness.)

1. Zotero + IA is not simply the Zotero Commons

Again, this is probably my fault for not communicating the breadth of the project better. The press has focused on items #1 and 2 in my original post—they are the easiest to explain—but while the project does indeed try to aggregate scholarly resources, it is also trying to solve another major problem with contemporary scholarship: scholars are increasingly using and citing web resources but have no easy way to point to stable URLs and cached web pages. In particular, I encourage everyone to read item #3 in my original post again, since I consider it extremely important to the project.

Items #4 and 5 also note that we are going to leverage IA for better collaboration, discovery, and recommendation systems. So yes, the Commons, but much more too.

2. Zotero + IA is not intended to put institutional repositories out of business, nor are they excluded from participation

There has been some hand-wringing in the library blogosphere this week (see, e.g., Library 2.0) that this project makes an end-run around institutional repositories. These worries were probably exacerbated by the initial press coverage that spoke of “bypassing” the libraries. However, I want to emphasize that this project does not make IA the exclusive back end for contributions. Indeed, I am aware of several libraries that are already experimenting with using Zotero as an input device for institutional repositories. There is already an API for the Zotero client from which libraries can extract data and files, and the server will have an even more powerful API so that libraries can (with their users’ permission, of course) save materials into an archive of their own.

The Strange Dynamics of Technology Adoption and Promotion in Academia

Kudos to Bruce D’Arcus for writing the blog post I’ve been meaning to write for a while. Bruce notes with some amazement the resistance that free and open source projects like Zotero meet when they encounter the institutional buying patterns and tech evangelism that are all too common in academia. The problem seems to be that the people purchasing the software are not the end users (for reference managers like EndNote or RefWorks, the buyers are often the libraries at colleges and universities; for course management systems, the IT departments), nor do they have the proper incentives to choose free alternatives.

As Roy Rosenzweig and I noted in Digital History, the exorbitant yearly licensing fee for Blackboard or WebCT (loathed by every professor I know) could be exchanged for an additional assistant professor, or another librarian. But for some reason a certain portion of academic technology purchasers feel they need to buy something for each of these categories (reference managers, course management systems), and then, because they have invested time, money, and long-term contracts in those somethings, they feel they need to promote those tools exclusively, without listening to the evolving needs and desires of the people they serve. Nor do they have any incentive to try new technologies or tools.

Any suggestions on how to properly align these needs and incentives? Break out the technology spending in students’ bills (“What, my university is spending that much on Blackboard?”)?

What Do Electronic Resources Mean for the Future of University Libraries?

On our Digital Campus podcast, Tom Scheinfeldt, Mills Kelly, and I have been talking a lot about the growing disconnect between students and faculty who are increasingly using software and services, such as web email and Google Docs, that are not the university’s “officially supported” (and often quite expensive to buy, maintain, and support) software and services. In Roger C. Schonfeld and Kevin M. Guthrie, “The Changing Information Services Needs of Faculty” (EDUCAUSE Review, vol. 42, no. 4 (July/August 2007): 8–9), the authors note another possible disconnect on campus:

In the future, faculty expect to be less dependent on the library and increasingly dependent on electronic materials. By contrast, librarians generally think their role will remain unchanged and their responsibilities will only grow in the future. Indeed, over four-fifths of librarians believe that the role of the library as the starting point or gateway for locating scholarly information will be very or extremely important in five years, a decided mismatch with faculty views.

Perceptions of a decline in dependence are probably unavoidable as services are increasingly being provided remotely, and in some ways, these shifting faculty attitudes can be viewed as a sign of the library’s success. The mismatch in views on the gateway function is somewhat more troubling: if librarians view this function as critical but faculty in certain disciplines see it as declining in importance, how can libraries, individually or collectively, strategically realign the services that support the gateway function?

Good question.

The “Google Five” Describe Progress, Challenges

Among other things learned by the original five libraries that signed up with Google to have their collections digitized is this gem: “About one percent of the Bodleian Library’s books have uncut pages, meaning they’ve never been opened.” I used to find books like this at Yale and felt quite bad for their authors. Imagine all of the effort that goes into writing a book, and then no one, for hundreds of years at Oxford or Yale (bookish places with esoteric interests, wouldn’t you say?), takes even a peek inside. Evidently the long tail doesn’t quite extend all the way.

Five Catalonian Libraries Join the Google Library Project

The Google Library Project has, for the most part, focused on American libraries, which pushed the EU to mount a competing project. Will this announcement (which includes the National Library of Barcelona), coming on the heels of an agreement with the Complutense University of Madrid, signal the beginning of Google making inroads in Europe?

Impact of Field v. Google on the Google Library Project

I’ve finally had a chance to read the federal district court ruling in Field v. Google, a case that has not been covered much (except in the technology press) but which has obvious and important implications for the upcoming battle over the legality of Google’s library digitization project. The case involved a lawyer who dabbles in some online poetry, and who was annoyed that Google’s spider cached a version of his copyrighted ode to delicious tea (“Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein…”). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google’s victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google’s wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion.

Courts have traditionally used four factors to determine fair use—the purpose of the copying, the nature of the work, the extent of the copying, and the effect on the market of the work.

On purpose, the court ruled that Google’s cache was not simply a copy of that work, but added substantial value that was important to users of Google’s search engine. Users could still read Field’s poetry even if his site was down; they could compare Google’s cache with the original site to see if any changes had been made; they could see their search terms highlighted in the page. Furthermore, with a clear banner across the top, Google tells its users that this is a copy and provides a link to the original. It also provides methods for website owners to remove their pages from the cache. This emphasis on opting out seems critical, since Google has argued that book publishers can simply tell them if they don’t want their books digitized. Also, the court ruled that Google’s status as a commercial enterprise doesn’t matter here. Advantage for Google et al.

On the nature of the work, the court looked less at the quality of Field’s writing (“Simple flavors, simple aromas, simple preparation…”) than at Field’s intentions. Since he “sought to make his works available to the widest possible audience for free” by posting his poems on the Internet, and since Field was aware that he could (through the robots.txt file) exclude search engines from indexing his site, the court thought Field’s case with respect to this fair use factor was weakened. But book publishers and authors fighting Google will argue that they do not intend this free and wide distribution. Disadvantage for Google et al.

One would think that the third factor, the extent of the copying, would be a clear loser for Google, since they copy entire web pages as a matter of course. But the Nevada court ruled that because Google’s cache serves “multiple transformative and socially valuable purposes…that could not be effectively accomplished by using only portions” of web pages, and because Google points users to the original texts, this wholesale copying was OK. You can see why Google’s lawyers are overjoyed by this part of the ruling with respect to the book digitization project. Big advantage for Google et al.

Perhaps the cruelest part of the ruling had to do with the fourth factor of fair use, the effect on the market of the work. The court determined from its reading of Field’s ode to tea that “there is no evidence of any market for Field’s works.” Ouch. But there is clearly a market for many books that remain in copyright. And since the Google library project has just begun, we don’t have any economic data about Google Book Search’s impact on the market for hard copies. No clear winner here.

In addition, the Nevada court added a critical fifth factor for determining fair use in this case: “Google’s Good Faith.” By providing ways to include and exclude materials from its cache, by providing a way to complain to the company, and by clearly spelling out its intentions in the display of the cache, the court determined that Google was acting in good faith—it was simply trying to provide a useful service and had no intention to profit from Field’s obsession with tea. Google has a number of features that replicate this sense of good faith in its book program, like providing links to libraries and booksellers, methods for publishers and authors to complain, and techniques for preventing users from copying copyrighted works. Advantage for Google et al.

A couple of final points that may work against Google. First, the court made a big deal out of the fact that the cache copying was completely automated, which the Google book project is clearly not. Second, the ruling constantly emphasizes the ability of Field to opt out of the program, but upset book publishers and authors believe this should be opt in, and it’s quite possible another court could agree with that position, which would weaken many of the points made above.

Google, the Khmer Rouge, and the Public Good

Like Daniel into the lion’s den, Mary Sue Coleman, the President of the University of Michigan, yesterday went in front of the Association of American Publishers to defend her institution’s participation in Google’s massive book digitization project. Her speech, “Google, the Khmer Rouge and the Public Good,” is an impassioned defense of the project, if a bit pithy at certain points. It’s worth reading in its entirety, but here are some highlights with commentary.

In two prior posts, I wondered what will happen to those digital copies of the in-copyright books the university receives as part of its deal with Google. Coleman obviously knew that this was a major concern of her audience, and she went overboard to satisfy them: “Believe me, students will not be reading digital copies of ‘Harry Potter’ in their dorm rooms…We will safeguard the entirety of this archive with the same diligence we accord our most sensitive materials at the University: medical records, Defense Department data, and highly infectious disease agents used in research.” I’m not sure if books should be compared to infectious disease agents, but it seems clear that the digital copies Michigan receives are not likely to make it into “the wild” very easily.

Coleman reminded her audience that for a long time the books in the Michigan library did not circulate and were only accessible to the Board of Regents and the faculty (no students allowed, of course). Finally Michigan President James Angell declared that books were “not to be locked up and kept away from readers, but to be placed at their disposal with the utmost freedom.” Coleman feels that the Google project is a natural extension of that declaration, and more broadly, of the university’s mission to disseminate knowledge.

Ultimately, Coleman turns from more abstract notions of sharing and freedom to the more practical considerations of how students learn today: “When students do research, they use the Internet for digitized library resources more than they use the library proper. It’s that simple. So we are obligated to take the resources of the library to the Internet. When people turn to the Internet for information, I want Michigan’s great library to be there for them to discover.” Sounds about right to me.

Rough Start for Digital Preservation

How hard will it be to preserve today’s digital record for tomorrow’s historians, researchers, and students? Judging by the preliminary results of some attempts to save for the distant future the September 11 Digital Archive (a project I co-directed), it won’t be easy. While there are some bright spots to the reports in D-Lib Magazine last month on the efforts of four groups to “ingest” (or digitally accession) the thousands of files from the 9/11 collection, the overall picture is a little bit sobering. And this is a fairly well-curated (though by no means perfect) collection. Just imagine what ingesting a messy digital collection, e.g., the hard drive of your average professor, would entail. Here are some of the important lessons from these early digital preservation attempts, as I see them.

But first, a quick briefing on the collection. The September 11 Digital Archive is a joint project of the Center for History and New Media at George Mason University and the American Social History Project/Center for Media and Learning at the Graduate Center of the City University of New York. From January 2002 to the present (though mostly in the first two years) it has collected via the Internet (and some analog means, later run through digitization processes) about 150,000 objects, ranging from emails and BlackBerry communications to voicemail and digital audio, to typed recollections, photographs, and art. I think it’s a remarkable collection that will be extremely valuable to researchers in the future who wish to understand the attacks of 9/11 and their aftermath. In September 2003, the Library of Congress agreed to accession the collection, one of its first major digital accessions.

We started the project as swiftly as possible after 9/11, with the sense that we should do our best on the preservation front, but also with the understanding that we would probably have to cut some corners if we wanted to collect as much as we could. We couldn’t deliberate for months about the perfect archival structure or information architecture or wait for the next release of DSpace. Indeed, I wrote most of the code for the project in a week or so over the holiday break at the end of 2001. Not my best PHP programming effort ever, but it worked fine for the project. And as Clay Shirky points out in the D-Lib opening piece, this is likely to be the case for many projects—after all, projects that spend a lot of time and effort on correct metadata schemes and advanced hardware and software are probably going to be in a position to preserve their own materials anyway. The question is what will happen when more normal collections are passed from their holders to preservation outfits, such as the Library of Congress.

All four of the groups that did a test ingest of our 9/11 collection ran into some problems, though not necessarily at the points they expected. Harvard, Johns Hopkins, Old Dominion, and Stanford encountered some hurdles, beginning with my first point:

You can’t trust anything, even simple things like file types. The D-Lib reports note that a very small but still significant percentage of files in the 9/11 collection were not actually in the formats they claimed to be. What amazes me on reading this is that I wrote code to validate file types as contributors uploaded them to our server, using some powerful file-type assessment tools built into PHP and Apache (our web server software). Obviously these validations failed to work perfectly. When you consider handling billion-object collections, even a 1% (or 0.1%) error rate is a lot. Which leads me to point #2…
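The kind of validation described above can be sketched as magic-byte sniffing: checking a file’s leading bytes against known signatures rather than trusting its extension or the uploader’s declared type. This is a hypothetical Python re-implementation (the 9/11 Archive’s actual checks used PHP and Apache), showing why even this approach is imperfect: any type without a listed signature simply can’t be verified.

```python
# Known "magic number" signatures for a few common upload formats.
# (These signature bytes are standard; the function names are hypothetical.)
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]

def sniff(data: bytes):
    """Return the detected MIME type, or None if no signature matches."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None

def matches_claim(data: bytes, claimed_mime: str) -> bool:
    """Flag uploads whose content disagrees with the declared type.
    Files with no recognizable signature pass unverified -- one reason
    a small error rate slips through even with validation in place."""
    detected = sniff(data)
    return detected is None or detected == claimed_mime
```

Note the gap: a file of an unrecognized type is accepted as-is, so validation can only catch mismatches among the formats it knows about.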

We may have to modify to preserve. Although for generations archival science has emphasized keeping objects in their original format, I wonder if it might have been better if (as we had thought about at first on the 9/11 project) we had converted files contributed by the general public into just a few standardized formats. For instance, we could have converted (using the powerful ImageMagick server software) all of the photographs into one of the JPEG formats (yes, there are more than one, which turned out to be a pain). We would have “destroyed” the original photograph in the upload process—indeed, worse than that from a preservation perspective, we would have compressed it again, losing some information—but we could have presented the Library of Congress with a simplified set of files. That simplification process leads me to point #3…
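The “convert on upload” idea above amounts to a normalize-on-ingest policy: map each detected type to one of a small set of canonical archival formats before storage. Here is a minimal Python sketch of such a policy table; the specific format choices and names are hypothetical illustrations, not the 9/11 Archive’s actual pipeline.

```python
# Hypothetical policy: each detected MIME type maps to the single
# canonical format it would be converted to on ingest. Lossy formats
# like JPEG are kept as-is to avoid a second round of compression.
CANONICAL_FORMAT = {
    "image/jpeg": "image/jpeg",   # already canonical; don't recompress
    "image/gif":  "image/png",    # lossless re-encode
    "image/bmp":  "image/png",    # lossless re-encode
    "image/tiff": "image/tiff",   # keep as a lossless master format
}

def ingest_target(detected_mime: str) -> str:
    """Return the format a contribution should be stored in.
    Unknown types are kept in their original format (and would be
    queued for manual review in a real pipeline)."""
    return CANONICAL_FORMAT.get(detected_mime, detected_mime)
```

The trade-off the post describes shows up directly in the table: collapsing many input formats into a few simplifies what gets handed to the Library of Congress, at the cost of discarding the contributed original.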

Simple almost always beats complex when it comes to computer technology. I have incredible admiration for preservation software such as DSpace and Fedora, and I tend toward highly geeky solutions, but I’m much more pessimistic than those who believe that we are on the verge of preservation solutions that will keep digital files for centuries. Maybe it’s the historian of the Victorian age in me, reminding myself of the fate of so many nineteenth-century books that were not acid-free and so are deteriorating slowly in libraries around the world. Anyway, it was nice to see Shirky conclude in a similar vein that it looks like digital preservation efforts will have to be “data-centric” rather than “tool-centric” or “process-centric.” Specific tools will fade away over time, and so will ways of processing digital materials. Focusing on the data itself and keeping those files intact (and in use—that which is frequently used will be preserved) is critical. We’ll hopefully be able to access those saved files in the future with a variety of tools and using a variety of processes that haven’t even been invented yet.