
Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” before 1800 (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike text-mining tools like COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians.)

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear if you move from Ngram Viewer to the main Google Books interface that you’ll get the book scans the data represents. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.
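To make point 1 concrete, here is a minimal sketch of how one might tally co-occurrence from the downloadable 5-gram files. It assumes each record is a tab-separated line of n-gram, year, match count, page count, and volume count, which is my reading of the first dataset release and worth checking against the files you actually download. The function name and the sample rows are invented for illustration.

```python
from collections import defaultdict

def cooccurrence_by_year(lines, target, companions):
    """Sum volume counts per year for n-grams containing `target` and a companion word.

    Assumed record layout (tab-separated): ngram, year, match_count,
    page_count, volume_count. Verify against the real files before use.
    """
    totals = {word: defaultdict(int) for word in companions}
    for line in lines:
        ngram, year, _matches, _pages, volumes = line.rstrip("\n").split("\t")
        tokens = ngram.lower().split()
        if target not in tokens:
            continue
        for word in companions:
            if word in tokens:
                # Caveat: one book can contain several matching n-grams, so
                # summed volume counts are an upper bound, not a count of
                # distinct books.
                totals[word][int(year)] += int(volumes)
    return {word: dict(years) for word, years in totals.items()}

# Invented sample rows, for illustration only.
sample = [
    "a loving marriage is the\t1850\t12\t12\t10",
    "an arranged marriage between the\t1850\t4\t4\t3",
    "the loving marriage of two\t1850\t5\t5\t5",
    "marriage is a contract of\t1850\t40\t38\t30",
]
result = cooccurrence_by_year(sample, "marriage", ["loving", "arranged"])
print(result)
```

Counting by volumes rather than raw matches is a deliberate choice here: it keeps a single book that repeats a phrase many times from dominating the trend.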

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

Searching for the Victorians

[A rough transcript of my keynote at the Victorians Institute Conference, held at the University of Virginia on October 1-3, 2010. The conference had the theme "By the Numbers." It was attended by "analog" Victorianists as well as some budding digital humanists, and I was delighted by the incredibly energetic reaction to this talk: many terrific questions and ideas for doing scholarly text mining from those who may have never considered it before. The talk incorporates work on historical text mining under an NEH grant, as well as the first results of a grant that Fred Gibbs and I were awarded from Google to mine their vast collection of books.]

Why did the Victorians look to mathematics to achieve certainty, and how might we understand the Victorians better with the mathematical methods they bequeathed to us? I want to relate the Victorian debate about the foundations of our knowledge to a debate that we are likely to have in the coming decade, a debate about how we know the past and how we look at the written record that I suspect will be of interest to literary scholars and historians alike. It is a philosophical debate about idealism, empiricism, induction, and deduction, but also a practical discussion about the methodologies we have used for generations in the academy.

Victorians and the Search for Truth

Let me start, however, with the Heavens. This is Neptune. It was seen for the first time through a telescope in 1846.

At the time, the discovery was hailed as a feat of pure mathematics, since two mathematicians, one from France, Urbain Le Verrier, and one from England, John Couch Adams, had independently calculated Neptune’s position using mathematical formulas. There were dozens of poems written about the discovery, hailing the way these mathematicians had, like “magicians” or “prophets,” divined the Truth (often written with a capital T) about Neptune.

But in the less-triumphal aftermath of the discovery, it could also be seen as a case of the impact of cold calculation and the power of a good data set. Although pure mathematics, to be sure, were involved—the equations of geometry and gravity—the necessary inputs were countless observations of other heavenly bodies, especially precise observations of perturbations in the orbit of Uranus caused by Neptune. It was intellectual work, but intellectual work informed by a significant amount of data.

The Victorian era saw tremendous advances in both pure and applied mathematics. Both were involved in the discovery of Neptune: the pure mathematics of the ellipse and of gravitational pull; the computational modes of plugging observed coordinates into algebraic and geometrical formulas.

Although often grouped together under the banner of “mathematics,” the techniques and attitudes of pure and applied forms diverged significantly in the nineteenth century. By the end of the century, pure mathematics and its associated realm of symbolic logic had become so abstract and removed from what the general public saw as math—that is, numbers and geometric shapes—that Bertrand Russell could famously conclude in 1901 (in a Seinfeldian moment) that mathematics was a science about nothing. It was a set of signs and operations completely divorced from the real world.

Meanwhile, the early calculating machines that would lead to modern computers were proliferating, prodded by the rise of modern bureaucracy and capitalism. Modern statistics arrived, with its very unpure notions of good-enough averages and confidence levels.

The Victorians thus experienced the very modern tension between pure and applied knowledge, art and craft. They were incredibly self-reflective about the foundations of their knowledge. Victorian mathematicians were often philosophers of mathematics as much as practitioners of it. They repeatedly asked themselves: How could they know truth through mathematics? Similarly, as Meegan Kennedy has shown, in putting patient data into tabular form for the first time—thus enabling the discernment of patterns in treatment—Victorian doctors began wrestling with whether their discipline should be data-driven or should remain subject to the “genius” of the individual doctor.

Two mathematicians I studied for Equations from God used their work in mathematical logic to assail the human propensity to come to conclusions using faulty reasoning or a small number of examples, or by an appeal to interpretive genius. George Boole (1815-1864), the humble father of the logic that is at the heart of our computers, was the first professor of mathematics at Queen’s College, Cork. He had the misfortune of arriving in Cork (from Lincoln, England) on the eve of the famine and increasing sectarian conflict and nationalism.

Boole spent the rest of his life trying to find a way to rise above the conflict he saw all around him. He saw his revolutionary mathematical logic as a way to dispassionately analyze arguments and evidence. His seminal work, The Laws of Thought, is as much a work of literary criticism as it is of mathematics. In it, Boole deconstructs texts to find the truth using symbolical modes.

The stained-glass window in Lincoln Cathedral honoring Boole includes the biblical story of Samuel, which the mathematician enjoyed. It’s a telling expression of Boole’s worry about how we come to know Truth. Samuel hears the voice of God three times, but each time cannot definitively understand what he is hearing. In his humility, he wishes not to jump to divine conclusions.

Not jumping to conclusions based on limited experience was also a strong theme in the work of Augustus De Morgan (1806-1871). De Morgan, co-discoverer of symbolic logic and the first professor of mathematics at University College London, had a similar outlook to Boole’s, but a much more abrasive personality. He rather enjoyed proving people wrong, and also loved to talk about how quickly human beings leap to opinions.

De Morgan would give this hypothetical: “Put it to the first comer, what he thinks on the question whether there be volcanoes on the unseen side of the moon larger than those on our side. The odds are, that though he has never thought of the question, he has a pretty stiff opinion in three seconds.” Human nature, De Morgan thought, was too inclined to make mountains out of molehills, conclusions from scant or no evidence. He put everyone on notice that their deeply held opinions or interpretations were subject to verification by the power of logic and mathematics.

As Walter Houghton highlighted in his reading of the Victorian canon, The Victorian Frame of Mind, 1830-1870, the Victorians were truth-seekers and skeptics. They asked how they could know better, and challenged their own assumptions.

Foundations of Our Own Knowledge

This attitude seems healthy to me as we present-day scholars add digital methods of research to our purely analog ones. Many humanities scholars have been satisfied, perhaps unconsciously, with the use of a limited number of cases or examples to prove a thesis. Shouldn’t we ask, like the Victorians, what can we do to be most certain about a theory or interpretation? If we use intuition based on close reading, for instance, is that enough?

Should we be worrying that our scholarship might be anecdotally correct but comprehensively wrong? Is 1 or 10 or 100 or 1000 books an adequate sample to know the Victorians? What might we do with all of Victorian literature: not a sample, or a few canonical texts, as in Houghton’s work, but all of it?

These questions were foremost in my mind as Fred Gibbs and I began work on our Google digital humanities grant that is attempting to apply text mining to our understanding of the Victorian age. If Boole and De Morgan were here today, how acceptable would our normal modes of literary and historical interpretation be to them?

As Victorianists, we are rapidly approaching the time when we have access—including, perhaps, computational access—to the full texts not of thousands of Victorian books, or hundreds of thousands, but virtually all books published in the Victorian age. Projects like Google Books, the Internet Archive’s OpenLibrary, and HathiTrust will become increasingly important to our work.

If we were to look at all of these books using the computational methods that originated in the Victorian age, what would they tell us? And would that analysis be somehow more “true” than looking at a small subset of literature, the books we all have read that have often been used as representative of the Victorian whole, or, if not entirely representative, at least indicative of some deeper Truth?

Fred and I have received back from Google a first batch of data. This first run is limited just to words in the titles of books, but even so is rather suggestive of the work that can now be done. This data covers the 1,681,161 books that were published in English in the UK in the long nineteenth century, 1789-1914. We have normalized the data in many ways, and for the most part the charts I’m about to show you graph the data from zero to one percent of all books published in a year so that they are on the same scale and can be visually compared.

Multiple printings of a book in a single year have been collapsed into one “expression.” (For the library nerds in the audience, the data has been partially FRBRized. One could argue that we should have accepted the accentuation of popular titles that went through many printings in a single year, but editions and printings in subsequent years do count as separate expressions. We did not go up to the level of “work” in the FRBR scale, which would have collapsed all expressions of a book into one data point.)
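For those who want to see the arithmetic, the normalization just described can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the pipeline Fred and I actually ran: the record layout, the function name, and the sample titles are all invented, and real titles would need punctuation and spelling handled far more carefully.

```python
from collections import Counter

def title_share_by_year(records, word):
    """Return {year: percent of distinct titles in that year containing `word`}.

    `records` is an iterable of (title, year) pairs (a hypothetical layout).
    """
    # Collapse repeat printings within a year into one "expression";
    # the same title in a later year still counts separately.
    expressions = {(title.strip().lower(), year) for title, year in records}
    totals = Counter(year for _, year in expressions)
    # Naive whitespace tokenization; punctuation is not stripped here.
    hits = Counter(year for title, year in expressions if word in title.split())
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}

# Invented sample records, for illustration only.
records = [
    ("A History of the French Revolution", 1848),
    ("A History of the French Revolution", 1848),  # second printing, same year
    ("Sermons on Faith", 1848),
    ("A History of the French Revolution", 1849),  # later printing counts again
    ("Poems", 1849),
    ("Travels in Italy", 1849),
    ("Letters", 1849),
]
shares = title_share_by_year(records, "revolution")
print(shares)
```

Dividing by the number of books published in each year is what lets charts for different words and different decades sit on the same scale.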

We plan to do much more; in the pipeline are analyses of the use of words in the full texts (not just titles) of those 1.7 million books, a comprehensive exploration of the use of the Bible throughout the nineteenth century, and more. And more could be done to further normalize the data, such as accounting for the changing meaning of words over time.

Validation

So what does the data look like even at this early stage? And does it seem valid? That is where we began our analysis, with graphs of the percent of all books published with certain words in the titles (y-axis) on a year by year basis (x-axis). Victorian intellectual life as it is portrayed in this data set is in many respects consistent with what we already know.

The frequency chart of books with the word “revolution” in the title, for example, shows spikes where it should, around the French Revolution and the revolutions of 1848. (Keen-eyed observers will also note spikes for a minor, failed revolt in England in 1817 and the successful 1830 revolution in France.)

Books about science increase as they should, though with some interesting leveling off in the late Victorian period. (We are aware that the word “science” changes over this period, becoming more associated with natural science rather than generalized knowledge.)

The rise of factories…

and the concurrent Victorian nostalgia for the more sedate and communal Middle Ages…

…and the sense of modernity, a new phase beyond the medieval organization of society and knowledge that many Britons still felt in the eighteenth century.

The Victorian Crisis of Faith, and Secularization

Even more validation comes from some basic checks of key Victorian themes such as the crisis of faith. These charts are as striking as any portrayal of the secularization that took place in Great Britain in the nineteenth century.

Correlation Is (not) Truth

So it looks fairly good for this methodology. Except, of course, for some obvious pitfalls. Looking at the charts of a hundred words, Fred noticed a striking correlation between the publication of books on “belief,” “atheism,” and… “Aristotle”?

Obviously, we cannot simply take the data at face value. As I have called this on my blog, we have to be on guard for oversimplifications that are the equivalent of saying that War and Peace is about Russia. We have to marry these attempts at what Franco Moretti has called “distant reading” with more traditional close reading to find rigorous interpretations behind the overall trends.
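The kind of check that surfaces a pairing like this can be sketched with a plain Pearson correlation between two yearly series. The numbers below are toy values made up for illustration, not figures from our data, and the function is a hand-rolled textbook formula rather than anything specific to our project.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Toy yearly percentages (hypothetical, not real results).
belief = [0.10, 0.12, 0.15, 0.11, 0.09]
aristotle = [0.020, 0.024, 0.030, 0.022, 0.018]  # same shape, smaller scale
r = pearson(belief, aristotle)
print(round(r, 3))
```

A coefficient near 1 says only that the two curves rise and fall together. It says nothing about why, which is exactly where close reading has to take over.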

In Search of New Interpretations

Nevertheless, even at this early stage of the Google grant, there are numerous charts that are suggestive of new research that can be done, or that expand on existing research. Correlation can, if we go from the macro level to the micro level, help us to illustrate some key features of the Victorian age better. For instance, the themes of Jeffrey von Arx’s Progress and Pessimism: Religion, Politics and History in Late Nineteenth Century Britain, in which he notes the undercurrent of depression in the second half of the century, are strongly supported and enhanced by the data.

And given the following charts, we can imagine writing much more about the decline of certainty in the Victorian age. “Universal” is probably the most striking graph of our first data set, but they all show telling slides toward relativism that begin before most interpretations in the secondary literature.

Rather than looking for what we expect to find, perhaps we can have the computer show us tens, hundreds, or even thousands of these graphs. Many will confirm what we already know, but some will be strikingly new and unexpected. Many of those may show false correlations or have other problems (such as the changing or multiple meaning of words), but some significant minority of them will reveal to us new patterns, and perhaps be the basis of new interpretations of the Victorian age.

What if I were to give you Victorianists hundreds of these charts?

I believe it is important to keep our eyes open about the power of this technique. At the very least, it can tell us, as Augustus De Morgan would, when we have made mountains out of molehills. If we do explore this new methodology, we might be able to find some charts that pique our curiosity as knowledgeable readers of the Victorians. We’re the ones who can accurately interpret the computational results.

We can see the rise of the modern work lifestyle…

…or explore the interaction between love and marriage, an important theme in the recent literature.

We can look back at the classics of secondary literature, such as Houghton’s Victorian Frame of Mind, and ask whether those works hold up to the larger scrutiny of virtually all Victorian books, rather than just the limited set of books those authors used. For instance, while in general our initial study supports Houghton’s interpretations, it also shows relatively few books on heroism, a theme Houghton adopts from Thomas Carlyle.

And where is the supposed Victorian obsession with theodicy in this chart on books about “evil”?

Even more suggestive are the contrasts and anomalies. For instance, publications on “Jesus” are relatively static compared to those on “Christ,” which drop from nearly 1 in 60 books in 1843 to less than 1 in 300 books 70 years later.

The impact of the ancient world on the Victorians can be contrasted (albeit with a problematic dual modern/ancient meaning for Rome)…

…as can the Victorians’ varying interest in the afterlife.

I hope that these charts have prodded you to consider the anecdotal versus the comprehensive, and the strengths and weaknesses of each. It is time we had a more serious debate (not just in the digital humanities but in the humanities more generally) about measurement and interpretation, of the kind the Victorians had. Can we be so confident in our methods of extrapolating from some literary examples to the universal whole?

This is a debate that we should have in the present, aided by our knowledge of what the Victorians struggled with in the past.

[Image credits (other than graphs): Wikimedia Commons]

Digital Campus #52 – What’s the Buzz?

The flawed launch of Google Buzz, with its privacy nightmare of exposing the social graph of one’s email account, makes me, Tom, Mills, and Amanda French consider the major issue of online privacy on this week’s Digital Campus podcast. Covering several stories, including Facebook attacks on teachers and teachers spying on students, we think about the ways in which technology enables new kinds of violations on campus, and what we should do about it.

Is Google Good for History?

[These are my prepared remarks for a talk I gave at the American Historical Association Annual Meeting, on January 7, 2010, in San Diego. The panel was entitled "Is Google Good for History?" and also featured talks by Paul Duguid of the University of California, Berkeley and Brandon Badger of Google Books. Given my propensity to go rogue, what I actually said likely differed from this text, but it represents my fullest, and, I hope, most evenhanded analysis of Google.]

Is Google good for history? Of course it is. We historians are searchers and sifters of evidence. Google is probably the most powerful tool in human history for doing just that. It has constructed a deceptively simple way to scan billions of documents instantaneously, and it has spent hundreds of millions of dollars of its own money to allow us to read millions of books in our pajamas. Good? How about Great?

But then we historians, like other humanities scholars, are natural-born critics. We can find fault with virtually anything. And this disposition is unsurprisingly exacerbated when a large company, consisting mostly of better-paid graduates from the other side of campus, muscles into our turf. Had Google spent hundreds of millions of dollars to build the Widener Library at Harvard, surely we would have complained about all those steps up to the front entrance.

Partly out of fear and partly out of envy, it’s easy to take shots at Google. While it seems that an obsessive book about Google comes out every other week, where are the volumes of criticism of ProQuest or Elsevier or other large information companies that serve the academic market in troubling ways? These companies, which also provide search services and digital scans, charge universities exorbitant amounts for the privilege of access. They leech money out of library budgets every year that could be going to other, more productive uses.

Google, on the other hand, has given us Google Scholar, Google Books, newspaper archives, and more, often besting commercial offerings while being freely accessible. In this bigger picture, away from the myopic obsession with the Biggest Tech Company of the Moment (remember similar diatribes against IBM, Microsoft?), Google has been very good for history and historians, and one can only hope that they continue to exert pressure on those who provide costly alternatives.

Of course, like many others who feel a special bond with books and our cultural heritage, I wish that the Google Books project was not under the control of a private entity. For years I have called for a public project, or at least a university consortium, to scan books on the scale Google is attempting. I’m envious of France’s recent announcement to spend a billion dollars on public scanning. In addition, the Center for History and New Media has a strong relationship with the Internet Archive to put content in a non-profit environment that will maximize its utility and distribution and make that content truly free, in all senses of the word. I would much rather see Google’s books at the Internet Archive or the Library of Congress. There is some hope that HathiTrust will be this non-Google champion, but they are still relying mostly on Google’s scans. The likelihood of a publicly funded scanning project in the age of Tea Party reactionaries is slim.

* * *

Long-time readers of my blog know that I have not pulled punches when it comes to Google. To this day the biggest spike in readership on my blog was when, very early in Google’s book scanning project, I casually posted a scan of a human hand I found while looking at an edition of Plato. The post ended up on Digg, and since then it has been one of the many examples used by Google’s detractors to show a lack of quality in their library project.

Let’s discuss the quality issues for a moment, since it is one point of obsession within the academy, an obsession I feel is slightly misplaced. Of course Google has some poor scans (as the saying goes, haste makes waste), but I’ve yet to see a scientific survey of the overall percentage of pages that are unreadable or missing (surely a minuscule fraction in my viewing of scores of Victorian books). Regarding metadata errors, as Jon Orwant of Google Books has noted, when you are dealing with a trillion pieces of metadata, you are likely to have millions of errors in need of correction. Let us also not pretend the bibliographical world beyond Google is perfect. Many of the metadata problems with Google Books come from library partners and others outside of Google.

Moreover, Google likely has remedies for many of these inadequacies. Google is constantly improving its OCR and metadata correction capabilities, often in clever ways. For instance, it recently acquired the reCAPTCHA system from Carnegie Mellon, which uses unwitting humans who are logging into online services to transcribe particularly hard or smudged words from old books. They have added a feedback mechanism for users to report poor scans. Truly bad books can be rescanned or replaced by other libraries’ versions. I find myself nonplussed by quality complaints about Google Books that have engineering solutions. That’s what Google does; it solves engineering problems very well.

Indeed, we should recognize (and not without criticism, as I will note momentarily) that at its heart, Google Books is the outcome, like so many things at Google, of an engineering challenge and a series of mathematical problems: How can you scan tens of millions of books in a decade? It’s easy to say they should do a better job and get all the details right, but if you do the calculations with those key variables, as I assume Brandon and his team have done, you’ll probably see that getting a nearly perfect library scanning project would take a hundred years rather than ten. (That might be a perfectly fine trade-off, but that’s a different argument or a different project.) As in OCR, getting from 99% to 99.9% accuracy would probably take an order of magnitude longer and be an order of magnitude more expensive. That’s the trade-off they have decided to make, and as a company interested in search, where near-100% accuracy is unnecessary, and considering the possibilities for iterating toward perfection from an imperfect first version, it must have been an easy decision to make.
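To put rough numbers on that OCR trade-off, here is a back-of-the-envelope sketch. It assumes the accuracy figures are per character and that a page holds about 2,000 characters. Both are my own hypothetical simplifications, not figures from Google.

```python
def expected_errors(units_per_page, accuracy):
    """Expected number of misread units on a page at a given accuracy rate."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return units_per_page * (1.0 - accuracy)

CHARS_PER_PAGE = 2000  # hypothetical page size, chosen only for illustration

at_99 = expected_errors(CHARS_PER_PAGE, 0.99)
at_999 = expected_errors(CHARS_PER_PAGE, 0.999)
print(round(at_99), round(at_999))
```

Under those assumptions the jump from 99% to 99.9% cuts errors tenfold, from roughly twenty misread characters per page to roughly two. For search, where a word usually appears more than once in a book, the first figure is often good enough.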

* * *

Google Books is incredibly useful, even with the flaws. Although I was trained at places with large research libraries of Google Books scale, I’m now at an institution that is far more typical of higher ed, with a mere million volumes and few rare works. At places like Mason, Google Books is a savior, enabling research that could once only be done if you got into the right places. I regularly have students discover new topics to study and write about through searches on Google Books. You can only imagine how historical researchers and all students and scholars feel in even less privileged places. Despite its flaws, it will be the source of much historical scholarship, from around the globe, over the coming decades. It is a tremendous leveler of access to historical resources.

Google is also good for history in that it challenges age-old assumptions about the way we have done history. Before the dawn of massive digitization projects and their equally important indices, we necessarily had to pick and choose from a sea of analog documents. All of that searching and sifting we did, and the particular documents and evidence we chose to write on, were—let’s admit it—prone to many errors. Read it all, we were told in graduate school. But who ever does? We sift through large archives based on intuition; occasionally we even find important evidence by sheer luck. We have sometimes made mountains out of molehills because, well, we only have time to sift through molehills, not mountains. Regardless of our technique, we always leave something out; in an analog world we have rarely been comprehensive.

This widespread problem of anecdotal history, as I have called it, will only get worse. As more documents are scanned and go online, many works of historical scholarship will be exposed as flimsy and haphazard. The existence of modern search technology should push us to improve historical research. It should tell us that our analog, necessarily partial methods have hidden from us the potential of taking a more comprehensive view, aided by less capricious retrieval mechanisms which, despite what detractors might say, are often more objective than leafing rapidly through paper folios on a time-delimited jaunt to an archive.

In addition, listening to Google may open up new avenues of exploring the past. In my book Equations from God I argued that mathematics was generally considered a divine language in 1800 but was “secularized” in the nineteenth century. Part of my evidence was that mathematical treatises, which often contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited—I could only read a certain number of treatises and chose to focus (I’m sure this will sound familiar) on the writings of high-profile mathematicians. The vastness of Google Books for the first time presents the opportunity to do a more comprehensive scan of Victorian mathematical writing for evidence of religious language. This holds true for many historical research projects.

So Google has provided us not only with free research riches but also with a helpful direct challenge to our research methods, for which we should be grateful. Is Google good for history? Of course it is.

* * *

But does that mean that we cannot provide constructive criticism of Google, to make it the best it can be, especially for historians? Of course not. I would like to focus on one serious issue that ripples through many parts of Google Books.

For a company that is a champion of openness, Google remains strangely closed when it comes to Google Books. Google Books seems to operate in ways that are very different from other Google properties, where Google aims to give it all away. For instance, I cannot understand why Google doesn’t make it easier for historians such as myself, who want to do technical analyses of historical books, to download them en masse. If it wanted to, Google could make a portal to download all public domain books tomorrow. I’ve heard the excuses from Googlers: But we’ve spent millions to digitize these books! We’re not going to just give them away! Well, Google has also spent millions on software projects such as Android, Wave, Chrome OS, and the Chrome browser, and they are giving those away. Google’s hesitance with regard to its books project shows that openness goes only so far at Google. I suppose we should understand that; Google is a company, not a public library. But that’s not the philanthropic aura they cast around Google Books at its inception or even today, in dramatic op-eds touting the social benefit of Google Books.

In short, complaining about the quality of Google’s scans distracts us from a much larger problem with Google Books. The real problem—especially for those in the digital humanities but increasingly for many others—is that Google Books is only open in the read-a-book-in-my-pajamas way. To be sure, you can download PDFs of many public domain books. But they make it difficult to download the OCRed text from multiple public domain books–what you would need for more sophisticated historical research. And when we move beyond the public domain, Google has pushed for a troubling, restrictive regime for millions of so-called “orphan” books.

I would like to see a settlement that offers greater, not lesser access to those works, in addition to greater availability of what Cliff Lynch has called “computational access” to Google Books, a higher level of access that is less about reading a page image on your computer than applying digital tools to many pages or books at one time to create new knowledge and understanding. This is partially promised in the Google Books settlement, in the form of text-mining research centers, but those centers will be behind a velvet rope and I suspect the casual historian will be unlikely to ever use them. Google has elaborate APIs, or application programming interfaces, for most of its services, yet only the most superficial access to Google Books.

For a company that thrives on openness and the empowerment of users and software developers, Google Books is a puzzlement. With much fanfare, Google has recently launched—evidently out of internal agitation—what it calls a “Data Liberation Front,” to ensure portability of data and openness throughout Google. On dataliberation.org, the website for the front, these Googlers list 25 Google projects and how to maximize their portability and openness—virtually all of the main services at Google. Sadly, Google Books is nowhere to be seen, even though it also includes user-created data, such as the My Library feature, not to mention all of the data—that is, books—that we have all paid for with our tax dollars and tuition. So while the Che Guevaras put up their revolutionary fist on one side of the Googleplex, their colleagues on the other side are working with a circumscribed group of authors and publishers to place messy restrictions onto large swaths of our cultural heritage through a settlement that few in the academy support.

Jon Orwant and Dan Clancy and Brandon Badger have done an admirable job explaining much of the internal process of Google Books. But it still feels removed and alien in a way that other Google efforts are not. That is partly because they are lawyered up, and thus hamstrung from responding to some questions academics have, or from instituting more liberal policies and features. The same chutzpah that would lead a company to digitize entire libraries also led it to go too far with in-copyright books, leading to a breakdown with authors and publishers and the flawed settlement we have in front of us today.

We should remember that the reason we are in a settlement now is that Google didn’t have enough chutzpah to take the higher, tougher road—a direct challenge in the courts, the court of public opinion, or the Congress to the intellectual property regime that governs many books and makes them difficult to bring online, even though their authors and publishers are long gone. While Google regularly uses its power to alter markets radically, it has been uncharacteristically meek in attacking head-on this intellectual property tower and its powerful corporate defenders. Had Google taken a stronger stance, historians would have likely been fully behind their efforts, since we too face the annoyances that unbalanced copyright law places on our pedagogical and scholarly use of textual, visual, audio, and video evidence.

I would much rather have historians and Google work together. While Google as a research tool challenges our traditional historical methods, historians may very well have the ability to challenge and make better what Google does. Historical and humanistic questions are often at the high end of complexity among the engineering challenges Google faces, similar to and even beyond, for instance, machine translation, and Google engineers might learn a great deal from our scholarly practice. Google’s algorithms have been optimized over the last decade to search through the hyperlinked documents of the Web. But those same algorithms falter when faced with the odd challenges of change over centuries and the alienness of the past and old books and documents that historians examine daily.

Because Google Books is the product of engineers, with tremendous talent in computer science but less sense of the history of the book or the book as an object rather than bits, it founders in many respects. Google still has no decent sense of how to rank search results in humanities corpora. Bibliometrics and text mining work poorly on these sources (as opposed to, say, the highly structured scientific papers Google Scholar specializes in). Studying how professional historians rank and sort primary and secondary sources might tell Google a lot, which it could use in turn to help scholars.

Ultimately, the interesting question might not be, Is Google good for history? It might be: Is history good for Google? To both questions, my answer is: Yes.

Digital Campus #45 – Wave Hello

If you’ve wondered what an academic trying to podcast while on Google Wave might sound like, you need listen no farther than the latest Digital Campus podcast. In addition to an appraisal of Wave, we cover the FTC ruling on bloggers accepting gifts (such as free books from academic presses), the great Kindle-on-campus experiment, and (of course) another update on the Google Books (un)settlement. Joining Tom, Mills, and me is another new irregular, Lisa Spiro. She’s the intelligent one who’s paying attention rather than muttering while watching Google waves go by. [Subscribe to this podcast.]

Digital Campus #44 – Unsettled

The latest edition of the Digital Campus podcast marks a break from the past. After three years of our small roundtable of Tom, Mills, and yours truly, we pull up a couple of extra seats for our first set of “irregulars,” Amanda French and Jeff McClurken. I think you’ll agree they greatly enliven the podcast and we’re looking forward to having them back on an irregular basis. On the discussion docket was the falling apart of the Google Books settlement, reCAPTCHA, Windows 7, and the future of libraries. [Subscribe to this podcast.]

Digital Campus #34 – Extra, Extra!

For the Thanksgiving Day Digital Campus podcast, Mills, Tom, and I covered a cornucopia of news, including more on the Google Book Search settlement, some academic challenges to Google’s main search engine, some trouble in the virtual worlds (in a new segment, “We Told You So”), and the end of email service for students at Boston College. We also point the audience to a new site on place-based computing, a couple of easy (or bizarre) ways to write a book, and Processing, a programming language that’s useful in higher ed. An easily digested podcast for those still snacking on turkey leftovers. [Subscribe to this podcast.]

Digital Campus #33 – Classroom Action Settlement

After an unplanned month off (our apologies, things have been more than a little busy around here), the Digital Campus podcast triumphantly returns to the airwaves with a discussion of the recent Google Book Search settlement. Also up for analysis are Microsoft’s move to the cloud, the new Google phone, and, as always, recommendations from Tom, Mills, and me about helpful sites, tools, and publications. [Subscribe to this podcast.]

First Impressions of the Google Books Settlement

Just announced is the settlement of the class action lawsuit that the Authors Guild, the Association of American Publishers and individual authors and publishers filed against Google for its Book Search program, which has been digitizing millions of books from libraries. (Hard to believe, but the lawsuit was first covered on this blog all the way back in November 2005.) Undoubtedly this agreement is a critical one not only for Google and the authors and publishers, but for all of us in academia and others who care about the present and future of learning and scholarship.

It will obviously take some time to digest this agreement; indeed, the Google post on it is fairly sketchy and we still need to hear details, such as the cost structure for the full access the agreement now provides. But my first impressions of some key points:

The agreement really focuses on in-copyright but out-of-print books. That is, books that can’t normally be copied but also can’t be purchased anywhere. Highlighting these books (which are numerous; most academic books, e.g., are out-of-print and have virtually no market) was smart for Google since it seems to provide value without stepping on publishers’ toes.

A second (also smart, but probably more controversial) focus is on access to the Google Books collection via libraries:

We’ll also be offering libraries, universities and other organizations the ability to purchase institutional subscriptions, which will give users access to the complete text of millions of titles while compensating authors and publishers for the service. Students and researchers will have access to an electronic library that combines the collections from many of the top universities across the country. Public and university libraries in the U.S. will also be able to offer terminals where readers can access the full text of millions of out-of-print books for free.

Again, we need to hear more details about this part of the agreement. We also need to begin thinking about how this will impact libraries, e.g., in terms of their own book acquisition plans and their subscriptions to other online databases.

Finally, and perhaps most interesting and surprising to those of us in the digital humanities, is an all-too-brief mention of computational access to these millions of books:

In addition to the institutional subscriptions and the free public access terminals, the agreement also creates opportunities for researchers to study the millions of volumes in the Book Search index. Academics will be able to apply through an institution to run computational queries through the index without actually reading individual books.

For years in this space I have been arguing for the necessity of such access (first envisioned, to give due credit, by Cliff Lynch of CNI). Inside Google they have methods for querying and analyzing these books that we academics could greatly benefit from, and that could enable new kinds of digital scholarship.

Update: The Association of American Publishers now has a page answering frequently asked questions about the agreement (have we had time to ask?).

Digital Campus #31 – Back To School

On our first podcast of the school year, Bryan Alexander, the Director of Research of NITLE, joins us. Bryan closely follows emerging trends in academic technology on his terrific blog, and he lets us know what he thinks the critical trends are for the coming year. Google’s new web browser, Chrome, is the main topic in the news roundup as we try to figure out what impact it will have on academic web design and application development. A wide-ranging podcast covering the web, mobile technology, ebooks, virtual reality and much more. Join us for another year of Digital Campus! [Subscribe to this podcast.]