Category Archives: Books

What Would You Do With a Million Books?

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others reach their fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations. While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provides a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely gives us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.
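
To make the quantity-over-quality point concrete, here is a minimal sketch of a majority vote over many imperfect snippets. The snippets, the regular expression, and the function name `most_common_year` are all invented for illustration; this is not a description of how H-Bot or any other tool mentioned here actually works. It only shows the general idea that a correct answer can emerge from a pile of individually unreliable documents.

```python
import re
from collections import Counter

# Hypothetical snippets standing in for a large, uneven corpus.
# Most state the correct birth year; some are wrong or irrelevant.
SNIPPETS = [
    "Charles Darwin was born in 1809 in Shrewsbury.",
    "Darwin (born 1809) published his major work decades later.",
    "Born in 1809, Darwin sailed on the Beagle as a young man.",
    "Charles Darwin, born 1808, was an English naturalist.",  # off by one
    "Darwin was born in 1809.",
    "Some say Darwin was born in 1890.",  # transposed digits
    "Darwin loved beetles.",  # no year at all
]

# Four-digit years from 1000 to 2099 (an assumption made for this sketch).
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def most_common_year(snippets):
    """Return (year, votes, total_votes) for the most frequently cited year."""
    counts = Counter()
    for text in snippets:
        # Each snippet votes at most once for any given year.
        for year in set(YEAR.findall(text)):
            counts[year] += 1
    if not counts:
        return None
    year, votes = counts.most_common(1)[0]
    return int(year), votes, sum(counts.values())


if __name__ == "__main__":
    print(most_common_year(SNIPPETS))
```

With only seven snippets the vote is already lopsided. The argument in the paragraph above is that the effect grows stronger as the corpus grows, even if no single document can be trusted.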

Impact of Field v. Google on the Google Library Project

I’ve finally had a chance to read the federal district court ruling in a case, Field v. Google, that has not been covered much (except in the technology press), but which has obvious and important implications for the upcoming battle over the legality of Google’s library digitization project. The case involved a lawyer who dabbles in some online poetry, and who was annoyed that Google’s spider cached a version of his copyrighted ode to delicious tea (“Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein…”). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google’s victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google’s wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion.

Courts have traditionally used four factors to determine fair use—the purpose of the copying, the nature of the work, the extent of the copying, and the effect on the market of the work.

On purpose, the court ruled that Google’s cache was not simply a copy of that work, but added substantial value that was important to users of Google’s search engine. Users could still read Field’s poetry even if his site was down; they could compare Google’s cache with the original site to see if any changes had been made; they could see their search terms highlighted in the page. Furthermore, with a clear banner across the top Google tells its users that this is a copy and provides a link to the original. It also provides methods for website owners to remove their pages from the cache. This emphasis on opt out seems critical, since Google has argued that book publishers can simply tell them if they don’t want their books digitized. Also, the court ruled that Google’s status as a commercial enterprise doesn’t matter here. Advantage for Google et al.

On the nature of the work, the court looked less at the quality of Field’s writing (“Simple flavors, simple aromas, simple preparation…”) than at Field’s intentions. Since he “sought to make his works available to the widest possible audience for free” by posting his poems on the Internet, and since Field was aware that he could (through the robots.txt file) exclude search engines from indexing his site, the court thought Field’s case with respect to this fair use factor was weakened. But book publishers and authors fighting Google will argue that they do not intend this free and wide distribution. Disadvantage for Google et al.
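
For readers unfamiliar with the robots.txt mechanism the court referred to, here is a minimal sketch of how a well-behaved crawler might honor it, using Python’s standard `urllib.robotparser` module. The robots.txt contents, the crawler name `ExampleBot`, and the URLs are hypothetical, and the helper `may_fetch` is my own; nothing here reflects Field’s actual site or Google’s actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish to opt out.
# It bars one named crawler from /poems/ and allows everyone else everywhere.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /poems/

User-agent: *
Disallow:
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/poems/tea.html"),
        ("ExampleBot", "https://example.com/about.html"),
        ("OtherBot", "https://example.com/poems/tea.html"),
    ]:
        print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

The point to notice is that the default is permission: a crawler is allowed in unless the owner takes the step of writing a rule that keeps it out, which is exactly the opt-out posture at issue in the case.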

One would think that the third factor, the extent of the copying, would be a clear loser for Google, since they copy entire web pages as a matter of course. But the Nevada court ruled that because Google’s cache serves “multiple transformative and socially valuable purposes…that could not be effectively accomplished by using only portions” of web pages, and because Google points users to the original texts, this wholesale copying was OK. You can see why Google’s lawyers are overjoyed by this part of the ruling with respect to the book digitization project. Big advantage for Google et al.

Perhaps the cruelest part of the ruling had to do with the fourth factor of fair use, the effect on the market of the work. The court determined from its reading of Field’s ode to tea that “there is no evidence of any market for Field’s works.” Ouch. But there is clearly a market for many books that remain in copyright. And since the Google library project has just begun we don’t have any economic data about Google Book Search’s impact on the market for hard copies. No clear winner here.

In addition, the Nevada court added a critical fifth factor for determining fair use in this case: “Google’s Good Faith.” By providing ways to include and exclude materials from its cache, by providing a way to complain to the company, and by clearly spelling out its intentions in the display of the cache, the court determined that Google was acting in good faith—it was simply trying to provide a useful service and had no intention to profit from Field’s obsession with tea. Google has a number of features that replicate this sense of good faith in its book program, like providing links to libraries and booksellers, methods for publishers and authors to complain, and techniques for preventing user copies of copyrighted works. Advantage for Google et al.

A couple of final points that may work against Google. First, the court made a big deal out of the fact that the cache copying was completely automated, which the Google book project is clearly not. Second, the ruling constantly emphasizes the ability of Field to opt out of the program, but upset book publishers and authors believe this should be opt in, and it’s quite possible another court could agree with that position, which would weaken many of the points made above.

Google, the Khmer Rouge, and the Public Good

Like Daniel into the lion’s den, Mary Sue Coleman, the President of the University of Michigan, yesterday went in front of the Association of American Publishers to defend her institution’s participation in Google’s massive book digitization project. Her speech, “Google, the Khmer Rouge and the Public Good,” is an impassioned defense of the project, if a bit pithy at certain points. It’s worth reading in its entirety, but here are some highlights with commentary.

In two prior posts, I wondered what will happen to those digital copies of the in-copyright books the university receives as part of its deal with Google. Coleman obviously knew that this was a major concern of her audience, and she went overboard to satisfy them: “Believe me, students will not be reading digital copies of ‘Harry Potter’ in their dorm rooms…We will safeguard the entirety of this archive with the same diligence we accord our most sensitive materials at the University: medical records, Defense Department data, and highly infectious disease agents used in research.” I’m not sure if books should be compared to infectious disease agents, but it seems clear that the digital copies Michigan receives are not likely to make it into “the wild” very easily.

Coleman reminded her audience that for a long time the books in the Michigan library did not circulate and were only accessible to the Board of Regents and the faculty (no students allowed, of course). Finally Michigan President James Angell declared that books were “not to be locked up and kept away from readers, but to be placed at their disposal with the utmost freedom.” Coleman feels that the Google project is a natural extension of that declaration, and more broadly, of the university’s mission to disseminate knowledge.

Ultimately, Coleman turns from more abstract notions of sharing and freedom to the more practical considerations of how students learn today: “When students do research, they use the Internet for digitized library resources more than they use the library proper. It’s that simple. So we are obligated to take the resources of the library to the Internet. When people turn to the Internet for information, I want Michigan’s great library to be there for them to discover.” Sounds about right to me.

First Impressions of Amazon Connect

Having already succumbed to the siren’s song that prodded me narcissistically to create a blog, I had very little resistance left when Amazon.com emailed me to ask if I might like to join the beta of a program that allows authors to reach potential buyers and existing owners of their books by writing blog-like posts. Called “Amazon Connect,” this service will soon be made available to the authors of all of the books available for purchase on Amazon. Here are some notes about my experience joining the program (and how you can join if you’re an author), some thoughts about what Amazon Connect might be able to do, and some insider information about their upcoming launch.

First, the inside scoop. As far as I can tell, Amazon Connect began around Thanksgiving 2005 with a pilot that enlisted about a dozen authors. It has been slowly expanding since then but is still in beta, and a quiet beta at that. It’s unlikely you’ve seen an Amazon Connect section on one of their web pages. However, I recently learned from the Amazon Connect team that in early February the service will have its official launch, with a big publicity push.

After that point, each post an author makes will appear on the Amazon.com page for his or her book(s). I found out by writing a post of my own that this feature is actually already enabled, as you can see by looking at the page for Digital History (scroll down the page a bit to see my post).

But the launch will also entail a much more significant change—to the home page of Amazon.com itself, which is of course individualized for each user. Starting in February, on the home page of every Amazon user who has purchased your book(s), your posts will show up immediately. Since it’s unlikely that a purchaser of a book will return to that book’s buy page, this appearance on the Amazon home page is important: Authors will effectively gain the ability to send messages to a sizable number of their readers.

Since generally it has been impossible to compile a decent contact list for those who buy a specific book (unless you’re in the NSA or CIA), Amazon’s idea is intriguing. While Amazon Connect is clearly intended to sell more books, and the writing style they advocate less than academic (“a conversational, first-person tone”), it’s remarkable to think that the author of a scholarly monograph might be able to reach a good portion of their audience this way. Indeed, I suspect that for authors of academic press books that might not sell hundreds of thousands of copies, the proportion of buyers of their book that use Amazon is much higher than for popular books (since those books are sold in a higher percentage at physical Barnes & Noble and Borders stores, and increasingly at Costco and Wal-Mart). Could Amazon Connect foster smaller communities of authors and readers, for more esoteric topics?

If you are an author and would like to join the Amazon Connect beta in time for the February launch, here’s what you need to do:

1) First, you must have an Amazon account. If you already have one, go to the special Amazon Connect website, log in, and claim your book(s) using the “Register Your Bibliography” link. This involves listing the contact info for your publisher, editor, publicist, or other third party that can verify that you are actually the author of the book(s) you list. About a week later you’ll get an email confirming that you have been verified.

2) Create a profile. You are required to upload a photo, write a short biography, and provide some other information about yourself (such as your email address) that you can choose to share with your audience (I didn’t fill a lot of this out, such as my favorite movies).

3) Once you’ve been added to the system, you can start writing posts. Good luck saying hello to your readers, and remember Amazon Connect rule #5: “No boring content”!

2006: Crossroads for Copyright

The coming year is shaping up as one in which a number of copyright and intellectual property issues will be highly contested or resolved, likely having a significant impact on academia and researchers who wish to use digital materials in the humanities. In short, at stake in 2006 are the ground rules for how professors, teachers, and students may carry out their work using computer technology and the Internet. Here are three major items to follow closely.

Item #1: What Will Happen to Google’s Massive Digitization Project?

The conflict between authors, publishers, and Google will probably reach a showdown in 2006, with either the beginning of court proceedings or some kind of compromise. Google believes it has a good case for continuing to digitize library books, even those still under copyright; some authors and most publishers believe otherwise. So far, not much in the way of compromise. Indeed, if you have been following the situation carefully, it’s clear that each side is making clever pre-trial maneuvers to bolster their case. Google cleverly changed the name of its project to Google Book Search from Google Print, which emphasizes not the (possibly illegal) wholesale digitization of printed works but the fact that the program is (as Google’s legal briefs assert) merely a parallel project to their indexing of the web. The implication is that if what they’re doing with their web search is OK (for which they also need to make copies, albeit of born-digital pages), then Google Book Search is also OK. As Larry Lessig, Siva Vaidhyanathan, and others have highlighted, if the ruling goes against Google given this parallelism (“it’s all in the service of search”), many important web services might soon be illegal as well.

Meanwhile, the publishers have made some shrewd moves of their own. They have announced a plan to work with Amazon to accept micropayments for a few page views from a book (e.g., a recipe). And HarperCollins recently decided to embark on its own digitization program, ostensibly to provide book searches through its website. If you look at the legal basis of fair use (which Google is professing for its project), you’ll understand why these moves are important to the publishers: they can now say that Google’s project hurts the market for their works, even if Google shows only a small amount of a copyrighted book. In addition, a judge can no longer rule that Google is merely providing a service of great use to the public that the publishers themselves are unable or unwilling to provide. And I thought the only smart people in this debate were on Google’s side.

If you haven’t already read it, I recommend looking at my notes on what a very smart lawyer and a digital visionary have to say about the impending lawsuits.

Item #2: Chipping Away at the DMCA

In the first few months of 2006, the Copyright Office of the United States will be reviewing the dreadful Digital Millennium Copyright Act—one of the biggest threats to scholars who wish to use digital materials. The DMCA has effectively made many researchers, such as film studies professors, criminals, because they often need to circumvent rights management protection schemes on devices like DVDs to use them in a classroom or for in-depth study (or just to play them on certain kinds of computers). This circumvention is illegal under the law, even if you own the DVD. Currently there are only four minor exemptions to the DMCA, so it is critical that other exemptions for teachers, students, and scholars be granted. If you would like to help out, you can go to the Copyright Office’s website in January and sign your name to various efforts to carve out exemptions. One effort you can join, for instance, is spearheaded by Peter Decherney and others at the University of Pennsylvania. They want to clear the way for fully legal uses of audiovisual works in educational settings. Please contact me if you would like to add your name to that important effort.

Item #3: Libraries Reach a Crossroads

In an upcoming post I plan to discuss at length a fascinating article (to be published in 2006) by Rebecca Tushnet, a Georgetown law professor, that highlights the strange place at which libraries have arrived in the digital age. Libraries are the center of colleges and universities (often quite literally), but their role has been increasingly challenged by the Internet and the protectionist copyright laws this new medium has engendered. Libraries have traditionally been in the long-term purchasing and preservation business, but they increasingly spend their budgets on yearly subscriptions to digital materials that could disappear if their budgets shrink. They have also been in the business of sharing their contents as widely as possible, to increase knowledge and understanding broadly in society; in this way, they are unique institutions with “special concerns not necessarily captured by the end-consumer-oriented analysis with which much copyright scholarship is concerned,” as Prof. Tushnet convincingly argues. New intellectual property laws (such as the DMCA) threaten this special role of libraries (aloof from the market), and if they are going to maintain this role, 2006 will have to be the year they step forward and reassert themselves.

Clifford Lynch and Jonathan Band on Google Book Search

The November 2005 Washington DC Area Forum on Technology and the Humanities focused on “Massive Digitization Programs and Their Long-Term Implications: Google Print, the Open Content Alliance, and Related Developments.” The two speakers at the forum, Clifford Lynch and Jonathan Band, are among the most intelligent and thought-provoking commentators on the significance of Google’s Book Search project (formerly known as Google Print, with the Google Print Library Project being the company’s attempt to digitize millions of books at the University of Michigan, Stanford, Harvard, Oxford, and the New York Public Library). These are my notes from the forum, highlighting not the basics of the project, which have been covered well in the mainstream media, but angles and points that may interest the readers of this blog.

Clifford Lynch has been the Director of the Coalition for Networked Information (CNI) since July 1997. CNI, jointly sponsored by the Association of Research Libraries and Educause, includes about 200 member organizations concerned with the use of information technology and networked information to enhance scholarship and intellectual productivity. Prior to joining CNI, Lynch spent 18 years at the University of California Office of the President, the last 10 as Director of Library Automation. Lynch, who holds a Ph.D. in Computer Science from the University of California, Berkeley, is an adjunct professor at Berkeley’s School of Information Management and Systems.

Jonathan Band is a Washington-based attorney who helps shape the laws governing intellectual property and the Internet through a combination of legislative and appellate advocacy. He has represented library and technology clients with respect to the drafting of the Digital Millennium Copyright Act (DMCA), database protection legislation, and other statutes relating to copyrights, spam, cybersecurity, and indecency. He received his BA from Harvard College and his JD from Yale Law School. He worked in the Washington, D.C. office of Morrison & Foerster for nearly 20 years before opening his own law firm earlier this year.

Clifford Lynch

  • one of the things that have made conversion of back runs of journals easy is the concentration of copyright in the journal owners, rather than the writers of articles
  • contrast this with books, where copyrights are much more elusive
  • strange that the university presses of these same univs. in the google print library project were among the first complainers about the project
  • there’s a lot more to the availability of out of copyright material than copyright law—for instance, look at the policies of museums, which don’t let you take photographs of their out of copyright paintings
  • same thing will likely happen with google print
  • while there has been a lot of press about the dynamic action plan for european digitization, it is probably a plan w/o a budget
  • important to remember that there has been a string of visionary literature—e.g., H.G. Wells’s “worldbrain”—promoting making the world’s knowledge accessible to everyone—knowledge’s power to make people’s lives better—not a commercial view—this feeling was also there at the beginning of the Internet
  • legal justifications have been made for policy decisions that are really bad
  • large scale open access corpora are now showing great value, using data mining applications: see the work of the intelligence community, pharmaceutical industry—will the humanities follow with these large digitization projects
  • we are entering an era that will give new value to ontologies, gazetteers, etc., to aid in searching large corpora
  • if google loses this case, search engines might be outlawed [Lawrence Lessig makes this point on his blog too —DC]
  • because of insane copyright law like sonny bono act there might be a bifurcation of the world into the digitized world of pre-1923 and the copyrighted, gated post-1923 world

Jonathan Band

  • fair use is at base about economics and morality—thus the cases (authors, publishers) against google are interesting cases in a broad social sense, not just pure law
  • only 20% of the books being digitized are out of copyright (approx.)
  • for certain works, like a dictionary, where even a snippet would have an economic impact on the copyright holder, google will probably not make even a snippet available
  • copyright owners say copyright is opt-in, not opt-out (as Google is making it in their program)—it seems dumb, but this is a big legal issue for these cases
  • owners are correct that copyright is normally an opt-in experience—the owner must be contacted first before you make a use of their work, except when it’s fair use—then you don’t need to ask
  • thus the case will really be about fair use
  • key precedent: kelly vs. arribasoft: image search, found in favor of the search engine; kelly was a cantankerous photographer of the West who posted his photos on his website but didn’t want them copied by arribasoft (2 years ago; ended in 9th circuit); court found that search engine was a transformative use and useful for the public, even though it’s commercial use; court couldn’t find any negative economic impact on the market for kelly’s work [this case is covered in chapter 7 of Digital History —DC]
  • google’s case compares very favorably with arribasoft
  • publishers have weaker case because they are now saying that putting something on the web means that you’re giving an implied license to copy (no implied license for books)—but they’ve argued before that copyright applies just as strongly on the web
  • bot exclusion headers (robots.txt) are respected by search engines, but that sounds like opt-out, not opt-in—so publishers also probably shouldn’t be pointing to that in their case
  • publishers are also pointing to the google program for publishers, in which publishers allow google to scan their books and then they share in revenues—publishers are saying that the google library program is undermining this market, where publishers license their material; transaction costs of setting up a similar program for library books would be enormous–indeed it can’t be done: google is probably spending $750 million to scan 30 mil. books (at $25/bk); it would probably cost $1000/bk if you had to clear rights for scanning; no one would ever be able to pay for clearing rights like this, so what google is doing is broad and shallow vs. deep but narrow, which is what you could do if you cleared rights—many of these other digitization projects (e.g., Microsoft) are only doing 100K books at most
  • if google doesn’t succeed at this project, no one else will be able to do it—so if we agree that this book search project is a useful thing, then as a social matter Google should be allowed to do it under fair use
  • what’s the cost to the authors other than a little loss of control?
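
As a quick check on Band’s cost figures above, here is a back-of-the-envelope calculation. The three inputs (30 million books, $25 per book to scan, $1000 per book if rights had to be cleared) come straight from the notes; the variable and function names are mine, and the totals are only as good as those rough estimates.

```python
# Figures as reported in the forum notes (rough estimates, not audited numbers).
BOOKS = 30_000_000              # books Google is said to be scanning
SCAN_COST_PER_BOOK = 25         # dollars per book, scanning only
CLEARED_COST_PER_BOOK = 1_000   # dollars per book if rights had to be cleared


def total_cost(books, cost_per_book):
    """Total project cost in dollars for a flat per-book cost."""
    return books * cost_per_book


scan_total = total_cost(BOOKS, SCAN_COST_PER_BOOK)
cleared_total = total_cost(BOOKS, CLEARED_COST_PER_BOOK)

if __name__ == "__main__":
    print(f"Scanning only:      ${scan_total:,}")
    print(f"With rights cleared: ${cleared_total:,}")
    print(f"Multiplier:          {cleared_total // scan_total}x")
```

On these numbers the scanning-only figure matches the $750 million Band cites, while clearing rights for the same 30 million books would come to $30 billion, forty times as much. That gap is the basis for his “broad and shallow vs. deep but narrow” contrast.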