First Impressions of the Google Books Settlement

Just announced is the settlement of the class action lawsuit that the Authors Guild, the Association of American Publishers and individual authors and publishers filed against Google for its Book Search program, which has been digitizing millions of books from libraries. (Hard to believe, but the lawsuit was first covered on this blog all the way back in November 2005.) Undoubtedly this agreement is a critical one not only for Google and the authors and publishers, but for all of us in academia and others who care about the present and future of learning and scholarship.

It will obviously take some time to digest this agreement; indeed, the Google post on it is fairly sketchy and we still need to hear details, such as the cost structure for the full access the agreement now provides. But my first impressions of some key points:

The agreement really focuses on in-copyright but out-of-print books. That is, books that can’t normally be copied but also can’t be purchased anywhere. Highlighting these books (which are numerous; most academic books, e.g., are out-of-print and have virtually no market) was smart for Google since it seems to provide value without stepping on publishers’ toes.

A second (also smart, but probably more controversial) focus is on access to the Google Books collection via libraries:

We’ll also be offering libraries, universities and other organizations the ability to purchase institutional subscriptions, which will give users access to the complete text of millions of titles while compensating authors and publishers for the service. Students and researchers will have access to an electronic library that combines the collections from many of the top universities across the country. Public and university libraries in the U.S. will also be able to offer terminals where readers can access the full text of millions of out-of-print books for free.

Again, we need to hear more details about this part of the agreement. We also need to begin thinking about how this will impact libraries, e.g., in terms of their own book acquisition plans and their subscriptions to other online databases.

Finally, and perhaps most interesting and surprising to those of us in the digital humanities, is an all-too-brief mention of computational access to these millions of books:

In addition to the institutional subscriptions and the free public access terminals, the agreement also creates opportunities for researchers to study the millions of volumes in the Book Search index. Academics will be able to apply through an institution to run computational queries through the index without actually reading individual books.

For years in this space I have been arguing for the necessity of such access (first envisioned, to give due credit, by Cliff Lynch of CNI). Inside Google they have methods for querying and analyzing these books that we academics could greatly benefit from, and that could enable new kinds of digital scholarship.

Update: The Association of American Publishers now has a page answering frequently asked questions about the agreement (have we had time to ask?).

Digital Campus #29 – Making It Count

Tom, Mills, and I take up the much-debated issue of whether and how digital work should count toward promotion and tenure on this episode of the podcast. We also examine the significance of university presses putting their books on Amazon’s Kindle device, and the release of better copyright records. [Subscribe to this podcast.]

Happy 4th of July!

Mass Digitization of Books: Exit Microsoft, What Next?

So Microsoft has left the business of digitizing millions of books—apparently because they saw it as no business at all.

This leaves Microsoft’s partner (and our partner on the Zotero project), the Internet Archive, somewhat in the lurch, although Microsoft has done the right thing and removed the contractual restrictions on the books they digitized so they may become part of IA’s fully open collection (as part of the broader Open Content Alliance), which now has about 400,000 volumes. Also still on the playing field is the Universal Digital Library (a/k/a the Million Books Project), which has 1.5 million volumes.

And then there’s Google and its Book Search program. For those keeping score at home, my sources tell me that Google, which coyly likes to say it has digitized “over a million books” so far, has actually finished scanning five million. It will be hard for non-profits like IA to catch up with Google without some game-changing funding or major new partnerships.

Foundations like the Alfred P. Sloan Foundation have generously made substantial (million-dollar) grants to add to the digital public domain. But with the cost of digitizing 10 million pre-1923 books at around $300 million, where might this scale of funds and new partners come from? To whom can the Open Content Alliance turn to replace Microsoft?

Frankly, I’ve never understood why institutions such as Harvard, Yale, and Princeton haven’t made a substantial commitment to a project like OCA. Each of these universities has seen its endowment grow into the tens of billions in the last decade, and each has the means and (upon reflection) the motive to do a mass book digitization project of Google’s scale. $300 million sounds like a lot, but it’s less than 1% of Harvard’s endowment and my guess is that the amount is considerably less than all three universities are spending to build and fund laboratories for cutting-edge sciences like genomics. And a 10 million public-domain book digitization project is just the kind of outrageously grand project HYP should be doing, especially if they value the humanities as much as the sciences.

Moreover, Harvard, Yale, and Princeton find themselves under enormous pressure to spend more of their endowment for a variety of purposes, including tuition remission and the public good. (Full and rather vain disclosure: I have some relationship to all three institutions; I complain because I love.) Congress might even get into the act, mandating that universities like HYP spend a more generous minimum percentage of their endowment every year, just like private foundations who benefit (as does HYP, though in an indirect way) from the federal tax code.

In one stroke HYP could create enormous good will with a moon-shot program to rival Google’s: free books for the world. (HYP: note the generous reaction to, and the great press for, MIT’s OpenCourseWare program.) And beyond access, the project could enable new forms of scholarship through computational access to a massive corpus of full texts.

Alas, Harvard and Princeton partnered with Google long ago. Princeton has committed to digitizing about one million volumes with Google; Harvard’s number is unclear, but probably smaller. The terms of the agreement with Google are non-exclusive; Harvard and Princeton could initiate their own digitization projects or form other partnerships. But I suspect that would be politically difficult since the two universities are getting free digitization services from Google and would have to explain to their overseers why they want to replace free with very expensive. (The answer sounds like Abbott and Costello: the free program produces something that’s not free, while the expensive one is free.)

If Google didn’t exist, Harvard would probably be the most obvious candidate to pull off the Great Digitization of Widener. Not only does it have the largest endowment; historian Robert Darnton, a leader in thinking about the future (and the past) of the book, is now the director of the Harvard library system. Harvard also recently passed an open access mandate for the publications of its faculty.

Princeton has the highest per-student endowment of any university, and could easily undertake a mass digitization project of this scale. Perhaps some of the many Princeton alumni who went on to vast riches on the Web, such as eBay’s Meg Whitman (who has already given $100 million to Princeton) or Amazon’s Jeff Bezos, could pitch in.

But Harvard’s and Princeton’s “non-exclusive” partnership with Google makes these outcomes unlikely, as does the general resistance in these universities to spending science-scale funds outside of the sciences (unless it’s for a building).

That leaves Yale. Yale chose Microsoft last year to do its digitization, and has now been abandoned right in the middle of its project. Since Microsoft is apparently leaving its equipment and workflow in place at partner institutions, Yale could probably pick up the pieces with an injection of funding from its endowment or from targeted alumni gifts. Yale just spent an enormous amount of money on a new campus for the sciences, and this project could be seen as a counterbalance for the humanities.

Or, HYP could band together and put in a mere $100 million each to get the job done.
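To make the back-of-the-envelope math explicit, here is a minimal sketch that uses only the round numbers quoted in this post ($300 million, 10 million books, three universities). The variable names are mine, and the figures are rough estimates rather than audited costs.

```python
# Round figures taken from the post; all are rough estimates.
TOTAL_COST_USD = 300_000_000   # estimated cost of digitizing the collection
BOOKS = 10_000_000             # pre-1923 books to be scanned
PARTNERS = 3                   # Harvard, Yale, Princeton

# Cost of scanning a single volume.
cost_per_book = TOTAL_COST_USD // BOOKS

# What each university would contribute if the bill were split evenly.
share_per_partner = TOTAL_COST_USD // PARTNERS

# For the whole project to be "less than 1%" of one endowment,
# that endowment must exceed 100 times the total cost.
endowment_threshold = TOTAL_COST_USD * 100

print(f"Per book: ${cost_per_book}")
print(f"Per university: ${share_per_partner:,}")
print(f"Endowment needed for the 1% claim: more than ${endowment_threshold:,}")
```

In other words, about $30 a book, $100 million per university, and a "less than 1%" claim that holds for any endowment above $30 billion.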

Is this likely to happen? Of course not. HYP and other wealthy institutions are being asked to spend their prodigious endowments on many other things, and are reluctant to up their spending rate at all. But I believe a HYP or HYP-like solution is much more likely than public funding of the kind the Human Genome Project received.

Still Waiting for a Real Google Book Search API

For years on this blog, at conferences, and even in direct conversations with Google employees I have been agitating for an API (application programming interface) for Google Book Search. (For a summary of my thoughts on the matter, see my imaginatively titled post, “Why Google Books Should Have an API.”) With the world’s largest collection of scanned books, I thought such an API would have major implications for doing research in the humanities. And I looked forward to building applications on top of the API, as I had done with my Syllabus Finder.

So why was I disappointed when Google finally released an API for their book scanning project a couple of weeks ago?

My suspicion began with the name of the API itself. Even though the URL for the API is http://code.google.com/apis/books/, suggesting that this is the long-awaited API for the kind of access to Google Books that I’ve been waiting for, the rather prosaic and awkward title of the API suggests otherwise: The Google Book Search Book Viewability API. From the API’s home page:

The Google Book Search Book Viewability API enables developers to:

  • Link to Books in Google Book Search using ISBNs, LCCNs, and OCLC numbers
  • Know whether Google Book Search has a specific title and what the viewability of that title is
  • Generate links to a thumbnail of the cover of a book
  • Generate links to an informational page about a book
  • Generate links to a preview of a book

These are remarkably modest goals. Certainly the API will be helpful for online library catalogs and other book services (such as LibraryThing) that wish to embed links to Google’s landing pages for books and (when copyright law allows) links to the full texts. The thumbnails of book covers will make OPACs look prettier.
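For the curious, here is roughly what a catalog developer would do with a lookup service of this kind. This is a minimal sketch, not Google's documented client: the endpoint, the `jscmd`, `bibkeys`, and `callback` parameter names, and the response fields (`preview`, `preview_url`) are my assumptions about the request format, the function names are hypothetical, and the sample response is invented for illustration. Nothing here touches the network.

```python
import json
import re
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names are illustrative, not verified
# against Google's documentation.
VIEWABILITY_ENDPOINT = "http://books.google.com/books"


def build_viewability_url(identifiers, callback="handleBooks"):
    """Build a lookup URL for identifiers such as 'ISBN:0451526538'."""
    query = urlencode(
        {
            "jscmd": "viewapi",
            "bibkeys": ",".join(identifiers),
            "callback": callback,
        },
        safe=":,",  # keep the identifier list readable in the URL
    )
    return f"{VIEWABILITY_ENDPOINT}?{query}"


def parse_jsonp(body):
    """Strip the JavaScript callback wrapper and decode the JSON payload."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", body, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


def full_view_links(records):
    """Return {identifier: preview_url} for titles marked as fully viewable."""
    return {
        key: record["preview_url"]
        for key, record in records.items()
        if record.get("preview") == "full"
    }


# Made-up response: one fully viewable title, one with no preview.
SAMPLE_RESPONSE = """handleBooks({
  "ISBN:0451526538": {"preview": "full",
                      "preview_url": "http://example.org/preview/1"},
  "ISBN:0000000000": {"preview": "noview",
                      "preview_url": "http://example.org/preview/2"}
});"""

print(build_viewability_url(["ISBN:0451526538", "OCLC:1234"]))
print(full_view_links(parse_jsonp(SAMPLE_RESPONSE)))
```

Note what the sketch can and cannot do: it can tell a catalog whether to show a link to a book, but it never sees a word of the book's text.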

But this API does nothing to advance the kind of digital scholarship I have advocated for in this space. To do that the API would have to provide direct access to the full OCRed text of the books, to provide the ability to mine these texts for patterns and to combine them with other digital tools and corpora. Undoubtedly copyright concerns are part of the story here, hobbling what Google can do. But why not give full access to pre-1923 books through the API?

I’m not hopeful that there are additional Google Book Search APIs coming. If that were the case the URL for the viewability API would be http://code.google.com/apis/books/viewability/. The result is that this API simply seems like a way to drive traffic to Google Books, rather than to help academia or to foster an external community of developers, as other Google APIs have done.

Google Book Search Begins Adding Quality Control Measures

As predicted in this space six months ago, Google has added the ability for users to report missing or poorly scanned pages in their Book Search. (From my post “Google Books: Champagne or Sour Grapes?”: “Just as they have recently added commentary to Google News, they could have users flag problematic pages.”)

I’ll say it again: criticism of Google Book Search that focuses on quality chases a red herring—something that Google can easily fix. Let’s focus instead on more substantive issues, such as the fact that Google’s book archive is not truly open.

The Case for Open Access Books

This month’s First Monday has one of the most pragmatic, sensible articles I’ve read about the promise and perils of open access books. In “Open access book publishing in writing studies: A case study,” by Charles Bazerman, David Blakesley, Mike Palmquist, and David Russell, the authors describe their experience deciding to eschew a traditional publication arrangement with an academic press (what supposedly gives our monographs the sheen of value and gets us tenure). Instead they publish an edited volume straight to the web.

Along the way the authors discover that many of the concerns that humanities scholars have about publishing in a free and open way are either overblown or simply myths. Only one junior scholar (out of the 20 scholars asked to contribute) worries about promotion and tenure. And indeed all of the scholars who contribute to the edited volume receive credit for their chapters. More important, the editors and contributors are surprised to discover that the book makes its way rapidly and powerfully into the consciousness of their field:

[The] initial reaction [to the book] did not prepare us for the acceptance the book ultimately received from the academic communities to which it was addressed.

Since its publication, the Writing Selves/Writing Societies Web page has been visited more than 85,000 times by more than 36,000 unique visitors. The trend, interestingly, has been a steady increase in visits over the past four years, with more than 30,000 occurring in the past 12 months. Since its publication, the book has been downloaded in its entirety more than 36,000 times. Individual essays have been downloaded more than 108,000 times. In terms of perceived quality of the scholarly work in the collection, the book has been well received by the field. Within six months of publication, the book was positively reviewed by four journals: two print and two electronic. One year after its publication, in the keynote address to the Conference on College Composition and Communication, the major annual conference in writing studies, Kathleen Blake Yancey quoted extensively from chapters in the book. And the book has continued to figure prominently in scholarly work subsequently published in the field of composition and rhetoric.

According to a search of Google Scholar, which indexes scholarly publications available on the Web (29 September 2006), the book or individual chapters in it has been cited 68 times, according to a search of Google Scholar. Although we do not have comprehensive comparison data for print publications, we suspect that this is a higher rate. A print–only collection with about the same number of chapters (15) published in the same year as Writing Selves/Writing Societies (and winner of a best book award given by a leading journal in the field), had far fewer citations: 10. Our experience suggests that open access scholarly books follow a pattern of citation similar to journals, which indicate that open access journal articles in a wide range of fields are both more likely to be cited and likely to be cited more quickly. Our experience with Writing Selves/Writing Societies supports this…

Overall, Writing Selves/Writing Societies appears to have entered into the system of book publishing neatly, in spite of the fact that it was not published by a traditional academic publisher and was being offered at no charge.

Beyond the questions of business models, scholarly influence, and promotion and tenure, there is also the nagging question Roy Rosenzweig posed in “Should Historical Scholarship Be Free?” At the time Roy was the Vice President for Research at the American Historical Association, and was pushing for open access to the American Historical Review. (Ultimately he got the powers that be to agree to put AHR articles online for free, although the book reviews remain behind gates.)

Besides the ethical good of publishing in an open access model—sharing educational and scholarly materials—Roy noted that the work of most scholars is funded, directly or indirectly, by the public. Noting the National Institutes of Health’s recent mandate that grantees share their work openly with the public, Roy wrote:

The new policy affects few historians, but its implications ought to give us serious pause. After all, historical research also benefits directly (albeit considerably less generously) through grants from federal agencies like the National Endowment for the Humanities; even more of us are on the payroll of state universities, where research support makes it possible for us to write our books and articles. If we extend the notion of “public funding” to private universities and foundations (who are, of course, major beneficiaries of the federal tax codes), it can be argued that public support underwrites almost all historical scholarship.

Do the fruits of this publicly supported scholarship belong to the public? Should the public have free access to it?

Roy, of course, thought this meant that like NIH grantees we should provide open access to our articles, such as those in the AHR. But doesn’t the same argument hold true for books?

[Postscript: Some scientists have been wondering the same thing.]

MacEachern and Turkel, The Programming Historian

Bill Turkel, the always creative mind behind Digital History Hacks (logrolling disclosure: Bill is a friend of CHNM, a collaborator on various fronts, and was the thought-provoking guest on Digital Campus #9; still, he deserves the compliments), and his colleague at the University of Western Ontario, Alan MacEachern, are planning to write a book entitled The Programming Historian. Better yet, the book will be open access and hosted on the Network in Canadian History & Environment (NiCHE) site. Bill’s summary of the book on his blog sounds terrific. Can’t wait to read it and use it in my classes.

The Digital Critique of “To Read or Not To Read”

More healthy debate about the NEA’s jeremiad To Read or Not To Read is happening on the Institute for the Future of the Book’s blog. Let me try to summarize my critique of the NEA report, and you should be sure to read the whole report so as not to be swiftly criticized by the evidently touchy authors and their supporters.

I have no doubt that book reading is declining. My offense at the report has to do with the second-class status of the digital realm throughout. Sunil Iyengar, the Director of Research & Analysis for the NEA, states on p. 23 of the report:

Unless “book-reading” is specifically mentioned, study results on voluntary reading should be taken as referencing all varieties of leisure reading (e.g., magazines, newspapers, online reading), and not books alone. [my emphasis]

But the rest of the report makes it almost impossible to see how “online reading” was actually included as “voluntary reading” and lauded as such. While there are indeed charts about “book reading,” most charts are at best ambiguous about what “reading” means and at worst seem to make the online world devoid of words. For example, Table 3E, on p. 40, lists the “Weekly Average Hours and/or Minutes Spent on Various Activities by American Children, 2002-3.” But bizarrely “computer activities” (2:45) are distinct from “reading” (1:17), as if no reading occurred during those online hours.

More generally—and this is what I think many of us in the digital humanities are reacting to—the report is suffused with the nostalgic view of armchair leisure book reading (a nostalgia I share, by the way, and indeed deeply yearn for as an overstretched father of young children with a very busy day job). The report thus belittles the work of all of us trying to move serious reading and scholarship where it will surely go in the coming decades—online. As a historian, it reminds me of the early modern disparagements of writing and reading in the vernacular, back when only Latin would do for “serious” study and scholarship.

The double standard for digital reading versus paper reading can be seen in a letter to the Chronicle of Higher Ed by Mark Bauerlein. Bauerlein’s retort to Matt Kirschenbaum is to look at “what eye-tracking technology reveals about how users scan Web pages.” I assume his point is that these studies reveal the ADHD that we “votaries of online and screen reading” have, skimming and grazing rather than “really” reading. But can you imagine what would be revealed in eye tracking studies of readers of newspapers and magazines? Ad agencies have long known—indeed, it is the first principle of graphic design in advertising—that most pages are glanced at for mere seconds or even a fraction of a second, not “read.” (The report, sensing this potential criticism and keeping to its theme, emphasizes on p. 51 that teenagers are more likely to “skim” the “news sections.”)

But what of books? I’m sure I’m not the only academic who would like to strap eye trackers onto the heads of the book prize committees for professional academic organizations, who are supposed to read dozens or hundreds of books in short order—but surely skim (or worse). Matt Kirschenbaum and many others are simply making what, upon reflection, is a rather commonsensical point: that “reading” has always included multiple styles, including deep linear styles and more flighty ones. As Roy Rosenzweig and I point out in our book Digital History, we academics should be finding ways to encourage long-form reading on the screen (where all reading will ultimately head anyway) rather than, in our bookish nostalgia, ceding the medium to web usability specialists who encourage blurb writing for short attention spans.

Ultimately, To Read or Not To Read seems strangely dated in 2008. On its pages it remains obsessed with TV just at the point when kids’ leisure time pursuits are moving swiftly online. In an age when an “academic blog” is no longer an oxymoron, the report inexplicably mentions “blogs”—the source of so much online reading and writing and now even part of so many classrooms—on a single page out of 98, and only to dismiss them as pseudo-reading and writing in a worn critique that resorts to quoting from Sven Birkerts’ early-Web Gutenberg Elegies (1994). The report also oddly dismisses the exponential rise in online newspaper readership while lamenting the 2 or 3 percent yearly decline in paper “subscribers.”

After reading the civics portion of the report (pp. 86-92), which particularly emphasizes the importance of book reading (see pp. 88-89), a question came to mind: might email, IM, texting, social networking and other online pursuits enhance “civic engagement” and understanding more than reading a good thick policy treatise? The smartphone-bearing, Facebook-using teenagers currently working (often virtually) on the presidential primaries in the United States have little time for leisure reading, and a good number of them are probably not “voluntary readers” of the Platonic sort envisioned in To Read or Not To Read. But they are learning—and doing, and reading—much more in the digital realm than this myopic report can conceive.

The Idealization of the Book

Let me take the liberty of being the last academic with a blog to comment on the launch of Amazon’s new e-book reader, the Kindle. And let me also not waste any time on its design, screen, wireless technology, business model, or its uncanny resemblance to the Sinclair ZX80 I used in seventh grade. What little I have to say about the Kindle has less to do with the “e” than with the “book” part of it.

Although I’m generally an early adopter of technology, the Kindle—and indeed all e-book readers—strike me as similar to “photoplays,” or the filming of stage performances, that followed the introduction of film in the early twentieth century. In Janet Murray’s Hamlet on the Holodeck, she distinguishes between “additive” and “expressive” features of new media, and notes that photoplays were “a merely additive art form (photography plus theatre).” Only when filmmakers learned to use montage, close-ups, zooms, and the like as part of storytelling did photoplays give way to the new expressive form of movies.

Just as many of those who were used to plays assumed the highest form of film would be the fixed-camera photographing of Shakespeare, those used to books assume the highest form of digital reading will be the book transported to a dedicated electronic device. This idealization of reading paper as the highest form of intellectual consumption has led so many to believe that we need an electronic book reader like Amazon’s just-released Kindle: to do real reading we have to take a text from our computer and put it onto a book-ish device that’s as close to paper as possible.

Wrong. Just as people kept going to plays, people will continue to read books (albeit perhaps fewer) while they adjust to online reading for many other purposes. Only a rigid elitist would insist that book reading is optimal in all cases. What book or journal allows me to keep up with the work of over 250 scholars in the digital humanities? My RSS reader does, and quite well. And while some of us older folks may idealize the daily reading of newspapers (in addition to loving books I subscribe to two newspapers because I love them so) we might as well admit that online reading is a far better way to stay informed about many topics—just ask sports junkies. Or compare the breadth and depth of the coverage of Web trends between the New York Times’s business section and the TechCrunch blog.

Matt Kirschenbaum of the Maryland Institute for Technology in the Humanities eloquently covers this issue in an analysis of the state of reading that is far more subtle than the one in the overanxious National Endowment for the Arts report To Read or Not to Read: A Question of National Consequence. (Unfortunately Matt’s article is behind the Chronicle of Higher Ed’s electronic gates; when will they join the New York Times and the Wall Street Journal in opening these gates and becoming part of the online discussion?) Matt highlights the many new forms of reading uncatalogued—or worse, dismissed—by the NEA report. These types include the exponentially growing forms of online reading that young people take for granted. While not idealizing these new forms, Matt notes that they can (contrary to the NEA’s belief) involve serious thought, and that they can engender writing as well.

As Matt points out, people are already voraciously reading on their computers, and when they read in an electronic format they want to take full advantage of the medium—link to texts from their blog or syllabus, email them, connect them to the universe of other writing and other people online.

To be sure, the reading of books has declined and there are elements of that decline to worry about. But let’s also remember that very little of what kids read offline is Proust, and not all of what kids read online is their Facebook news feed.

Update: The Chronicle of Higher Ed has made a rare exception to their gating and provided an open access copy of Matt’s article.

Symposium on the Future of Scholarly Communication

For those who missed it, between October 12 and 27, 2007, there was a very thoughtful and insightful online discussion of how the publication of scholarship is changing—or trying to change—in the digital age. Participating in the discussion were Ed Felten, David Robinson, Paul DiMaggio, and Andrew Appel from Princeton University (the symposium was hosted by the Center for Information Technology Policy at Princeton), Ira Fuchs of the Mellon Foundation, Peter Suber of the indispensable Open Access News blog (and philosophy professor at Earlham College), Stan Katz, the President Emeritus of the American Council of Learned Societies, and Laura Brown of Ithaka (and formerly the President of Oxford University Press USA).

The symposium is really worth reading from start to finish. (Alas, one of the drawbacks of hosting a symposium on a blog is that it keeps everything in reverse chronological order; it would be great if CITP could flip the posts now that the discussion has ended.) But for those of us in the humanities the most relevant point is that we are going to have a much harder transition to an online model of scholarship than in the sciences. The main reason for this is that for us the highest form of scholarship is the book, whereas in the sciences it is the article, which is far more easily put online, posted in various forms (including as pre- and e-prints), and networked to other articles (through, e.g., citation analysis). In addition, we’re simply not as technologically savvy. As Paul DiMaggio points out, “every computer scientist who received his or her Ph.D. in computer science after 1980 or so has a website” (on which they can post their scholarly production), whereas the number is about 40% for political scientists and I’m sure far less for historians and literature professors.

I’m planning a long post in this space on the possible ways for humanities professors to move from print to open online scholarship; this discussion is great food for thought.