Category Archives: Books

Tony Grafton on Digital Texts and Reading

Anthony Grafton was the first person to turn me on to intellectual history. His seminar on ideas in the Renaissance was one of the most fascinating courses I took at Princeton, and I still remember Tony rocking in his seat, looking a bit like a young Karl Marx, making brilliant connections among a broad array of sources.

So it’s not unexpected, given his wide-ranging interests, but still terrific to see a scholar who has spent so much time with early books thinking deeply about “digitization and its discontents” in his article “Future Reading” in the latest issue of The New Yorker. And it’s even more gratifying to see Tony note in his online companion piece to “Future Reading,” “Adventures in Wonderland,” that “One of the best ways to get a handle on the sprawling world of digital sources is through George Mason University’s Center for History and New Media.”

Steven Johnson at the Italian Embassy

Well, they didn’t have my favorite wine (Villa Cafaggio Chianti Classico Riserva, if you must know), but I had a nice evening at the Italian Embassy in Washington. The occasion was the start of a conference, “Using New Technologies to Explore Cultural Heritage,” jointly sponsored by the National Endowment for the Humanities and the Consiglio Nazionale delle Ricerche (National Research Council) of Italy. The setting was the embassy’s postmodern take on the Florentine palazzo (see below); the speaker was Steven Johnson, bestselling author and member of the digerati (Everything Bad Is Good for You: How Today’s Popular Culture Is Actually Making Us Smarter; Outside.in).

Italian Embassy

Steven Johnson

Johnson’s talk was entitled “The Open Book: The Future of Text in the Digital Age.” (I present his thoughts here without criticism; it’s late.) Johnson argued that despite all of the hand-wringing and dire predictions, the book was not in decline. Indeed, he thought that because of new media, books have new channels to expand into. While some believed ten years ago that we were entering an age of image and video, the rise of the web instead led to the continued dominance of text, online and off. He noted that more hardcover books were sold in 2006 than in 2005, and more in 2005 than in 2004. Newspapers have huge online audiences that dwarf their print readership, thus strengthening their importance to culture.

Johnson pointed to four important innovations in online writing:

1) Collaborative writing is in a golden age because of the Internet. One need only look at Wikipedia, especially the social process of its underlying discussion pages (in addition to the surface article pages).

2) Fan fiction is also in its heyday. There are almost 300,000 (!) fan-written, unauthorized sequels to Harry Potter on fanfiction.net. There are even countless reviews of this fan fiction.

3) Blogging has become an important force, and great for authors. Blogs often provide unpolished comments about books by readers that are just as helpful as professional reviews.

4) Discovery of relevant materials and passages has been made much easier by new media–just think about the difference between research for a book now and roaming through the stacks in a library. Software like DEVONthink has made scholarship easier by connecting hidden dots and sorting through masses of text.

Finally, Johnson argued that despite the allure of the web, physical books are still the best way for an author to get inside someone’s head and convince them about something important. The book still has much greater weight and impact than even the most important blog post.

Google Books: Is It Good for History?

The September 2007 issue of the American Historical Association’s Perspectives is now available online, and it is worth reading Rob Townsend’s article “Google Books: Is It Good for History?” The article is an update of Rob’s much-debated post on the AHA blog in May, and I believe this revised version now reads as the best succinct critique of Google Books available (at least from the perspective of scholars). Rob finds fault with Google’s poor scans, frequently incorrect metadata, and too-narrow interpretation of the public domain.

Regular readers of this blog know of my aversion to jeremiads about Google, but Rob’s piece is well-reasoned and I agree with much of what he says.

Why Google Books Should Have an API

No Way Out

[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple question: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but what about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do, as opposed to, say, a computational linguist, who might use a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights, as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).
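To make that kind of diachronic analysis concrete, here is a minimal Python sketch. It assumes you have already assembled, by whatever means, a small corpus of plain-text documents with publication dates; the corpus, the sample texts, and the theme pattern are all hypothetical stand-ins for something like the BYU corpus, not excerpts from it.

    # A minimal sketch of tracing a theme over time in a dated, plain-text corpus.
    # The corpus, texts, and pattern below are hypothetical stand-ins.
    import re
    from collections import Counter, defaultdict

    corpus = [
        (1850, "The divine geometry of the heavens declares the Creator's design ..."),
        (1895, "A purely formal treatment of non-Euclidean geometry, without metaphysics ..."),
    ]

    theme = re.compile(r"\b(divine|God|Creator|providence)\b", re.IGNORECASE)

    theme_hits = Counter()
    total_words = defaultdict(int)
    for year, text in corpus:
        theme_hits[year] += len(theme.findall(text))
        total_words[year] += len(text.split())

    # Normalize by corpus size so years with more (or longer) documents don't dominate.
    for year in sorted(total_words):
        rate = theme_hits[year] / max(total_words[year], 1)
        print(f"{year}: {rate:.4f} theme words per word of text")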

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them first to search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, had lost such language by the end of the century. Because I was researching in the pre-Google Books era, my textual evidence was by necessity limited–I could only read a certain number of treatises, and I chose to focus on the writings of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
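Here is a rough Python sketch of what that workflow might look like. To be clear, no such Google Books API exists as I write: the client object, its search method, and get_cache below are imagined stand-ins for the capabilities I am asking for, and the regular expressions are deliberately crude.

    # A sketch of the hypothetical search-then-analyze workflow described above.
    # The API client, search(), and get_cache() are imagined, not real Google services.
    import re

    religious = re.compile(r"\b(God|divine|Creator|providence|eternal)\b", re.IGNORECASE)
    mathematical = re.compile(r"\b(geometry|algebra|analysis|equation|theorem)\b", re.IGNORECASE)

    def near(pattern_a, pattern_b, text, window=200):
        """True if the two patterns ever occur within `window` characters of each other."""
        starts_a = [m.start() for m in pattern_a.finditer(text)]
        starts_b = [m.start() for m in pattern_b.finditer(text)]
        return any(abs(a - b) <= window for a in starts_a for b in starts_b)

    # Imagined calls, left as comments because the API does not exist:
    # client = GoogleBooksResearchAPI(key="...")
    # candidates = client.search("mathematics treatise", date_range=(1800, 1900))
    # ad_hoc_corpus = {b.id: client.get_cache(b.id) for b in candidates}
    ad_hoc_corpus = {}  # in practice, filled by calls like those sketched above

    matches = {book_id: text for book_id, text in ad_hoc_corpus.items()
               if near(religious, mathematical, text)}
    print(f"{len(matches)} books pair religious and mathematical language")

From that filtered, ad hoc corpus the aggregate and over-time analyses described above could then run on my own server.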

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).
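For what it’s worth, that last step can be sketched in a few lines of Python. This is not the actual Syllabus Finder code, and the sample text and pattern are made up; it is only a simplified illustration of pulling book assignments out of a syllabus’s full text with a regular expression.

    # A simplified, hypothetical illustration of extracting book assignments from
    # the full text of a syllabus; not the real Syllabus Finder code.
    import re

    syllabus_text = """
    Required reading:
    Eric Foner, Reconstruction: America's Unfinished Revolution (Harper, 1988)
    David Blight, Race and Reunion: The Civil War in American Memory (Harvard, 2001)
    """

    # Very rough pattern: "Firstname Lastname, Title" at the start of a line,
    # stopping the title at the first comma or opening parenthesis.
    assignment = re.compile(r"^\s*([A-Z][a-z]+ [A-Z][A-Za-z.'-]+),\s+([^,(]+)", re.MULTILINE)

    for author, title in assignment.findall(syllabus_text):
        print(f"{author}: {title.strip()}")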

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

Google Books: Champagne or Sour Grapes?

Beyond Good and Evil

Is it possible to have a balanced discussion of Google’s outrageously ambitious and undoubtedly flawed project to scan tens of millions of books in dozens of research libraries? I have noted in this space the advantages and disadvantages of Google Books—sometimes both at one time. Heck, the only time this blog has ever been seriously “dugg” is when I noted the appearance of fingers in some Google scans. Google Books is an easy target.

This week Paul Duguid has received a lot of positive press (e.g., Peter Brantley, if:book) for his dressing-down of Google Books, “Inheritance and loss? A brief survey of Google Books.” It’s a very clever article, using poorly scanned Google copies of Laurence Sterne’s absurdist and raunchy comedy Tristram Shandy to reveal the extent of Google’s folly and their “disrespect” for physical books.

I thought I would enjoy reading Duguid’s article, but I found myself oddly unenthusiastic by the end.

Of course Google has poor scans—as the saying goes, haste makes waste—but Duguid’s piece is not a scientific survey of the percentage of pages that are unreadable or missing (surely less than 0.1% in my viewing of scores of Victorian books). Nor does the article note that Google has possible remedies for some of these inadequacies. For example, they almost certainly have higher-resolution, higher-contrast scans than the lo-res ones they display (a point made at the Million Books workshop; the originals are used for OCR), which they could revisit to produce better copies for the web. Just as they have recently added commentary to Google News, they could have users flag problematic pages. Truly bad books could be rescanned or replaced by other libraries’ versions.

Most egregiously, none of the commentaries I have seen on Duguid’s jeremiad have noted the telling coda to the article: “This paper is based on a talk given to the Society of Scholarly Publishers, San Francisco, 6 June 2007. I am grateful to the Society for the invitation.” The question of playing to the audience obviously arises.

Google Books will never be perfect, or even close. Duguid is right that it disrespects age-old, critical elements of books. (Although his point that Google disrespects metadata strangely fails to note that Google is one of the driving forces behind the Future of Bibliographic Control meetings, which are all about metadata.) Google Books is the outcome, like so many things at Google, of a mathematical challenge: how can you scan tens of millions of books in five years? It’s easy to say they should do a better job and get all the details right, but if you run the numbers, you’ll probably find that the perfect library scanning project would take fifty years rather than five. As in OCR, getting from 98% to 100% accuracy would probably take an order of magnitude longer and cost an order of magnitude more. That is the trade-off Google has decided to make, and for a company interested in search, where near-100% accuracy is unnecessary (I have seen OCR specialists estimate that even 90% accuracy is perfectly fine for search), it must have been an easy decision.

Complaining about the quality, thoroughness, and fidelity of Google’s (public) scans distracts us from the larger problem with Google Books. As I have argued repeatedly in this space, the real problem—especially for those in the digital humanities but also for many others—is that Google Books is not open. Recently Google has added the ability to view some books in “plain text” (i.e., the OCRed text, though it’s hard to copy text from multiple pages at once), and even, in some cases, to download PDFs of public domain works. But those moves don’t go far enough for scholarly needs. We need what Cliff Lynch of CNI has called “computational access,” a higher level of access that is less about reading a page image on your computer than about applying digital tools and analyses to many pages or books at one time to create new knowledge and understanding.

An API would be ideal for this purpose if Google doesn’t want to expose their entire collection. Google has APIs for most of their other projects—why not Google Books?

[Image courtesy of Ubisoft.]

Google Fingers

No, it’s not another amazing new piece of software from Google, which will type for you (though that would be nice). Just something that I’ve noticed while looking at many nineteenth-century books in Google’s massive digitization project. The following screenshot nicely reminds us that at the root of the word “digitization” is “digit,” which is from the Latin word “digitus,” meaning finger. It also reminds us that despite our perception of Google as a collection of computer geniuses, and despite their use of advanced scanning technology, their library project involves an almost unfathomable amount of physical labor. I’m glad that here and there, the people doing this difficult work (or at least their fingers) are being immortalized.

[The first page of a Victorian edition of Plato’s Euthyphron, a dialogue about the origin and nature of piety. Insert your own joke here about Google’s “Don’t be evil” motto.]

The Perfect and the Good Enough: Books and Wikis

As you may have noticed, I haven’t posted to my blog for an entire month. I have a good excuse: I just finished the final edits on my forthcoming book, Equations from God: Pure Mathematics and Victorian Faith, due out early next year. (I realized too late that I could have capitalized on Da Vinci Code fever and called the book The God Code, thus putting an intellectual and cultural history of Victorian mathematics in the hands of numerous unsuspecting Barnes & Noble shoppers.) The process of writing a book has occasionally been compared to pregnancy and childbirth; as the awe-struck husband of a wife who bore twins, I suspect this comparison is deeply flawed. But on a more superficial level, I guess one can say that it’s a long process that produces something of which one can be very proud, but which can involve some painful moments. These labor pains are especially pronounced (at least for me) in the final phase of book production, in which all of the final adjustments are made and tiny little errors (formatting, spelling, grammar) are corrected. From the “final” draft of a manuscript until its appearance in print, this process can take an entire year. Reading Roy Rosenzweig’s thought-provoking article on the production of the Wikipedia, just published in the Journal of American History, was apropos: it got me thinking about the value of this extra year of production work on printed materials and its relationship to what’s going on online now.

Is the time spent getting books as close to perfection as possible worth it? Of course it is. The value of books comes from an implicit contract between the reader and those who produce the book, the author and publisher. The producers ensure, through many cycles of revision, editing, and double-checking, that the book contains as few errors as possible and is as cogent and forceful as possible. And the reader comes to a book understanding that the pages they are reading represent a tremendous amount of effort to reach near-perfection—which makes the book worthy of careful attention and consideration.

On the other hand, I’ve become increasingly fond of Voltaire’s dictum that “the perfect is the enemy of the good”; that is, in human affairs the (often nearly endless) search for perfection can mean you never produce a good-enough solution. Roy Rosenzweig and I use the aphorism in Digital History, because there’s so much to learn and tinker with in trying to put history online that if you obsess about it all you will never even get started with a basic website. As it turns out, the history of computing includes many examples of this dynamic. For instance, Ethernet was not as “perfect” a technology as IBM’s Token Ring, which, as its name implies, passed a “token” around so that the devices on a network wouldn’t all talk at once and get in one another’s way. But Ethernet was good enough, had decent (but not perfect) solutions to the problems that IBM’s top-notch engineers had elegantly solved, and was cheaper to implement. I suspect you know which technology triumphed.

Roy’s article, “Can History Be Open Source? Wikipedia and the Future of the Past,” suggests that we professional historians (and academics who produce books in general) may be underestimating good-enough online publishing like Wikipedia. Yes, Wikipedia has errors—though not as many as the ivory tower believes. Moreover, through fairly sophisticated social and technological methods, it is slowly figuring out how to deal with its imperfections, such as the ability of anyone to come along and edit a topic about which they know nothing. Will it ever be as good as a professionally produced book? Probably not. But maybe that’s not the point. (And of course many books are far from perfect too.) Professors need to think carefully about the nature of what they produce given new forms of online production like wikis, rather than simply disparaging them as the province of cranks and amateurs. Finishing a book is as good a time to do that as any.

Google Book Search Blog

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now “Inside Google Book Search” seems more like “Outside Google Book Search,” with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting “success stories,” and soliciting participation from librarians, authors, and publishers. Hopefully we’ll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google’s project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I’m a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries.