Category Archives: Text Mining

Million Books Workshop Wrap-up

May has been a month of travel for me (thus the light posting in this space). I gave a talk about Zotero and related developments in the humanities and technology at the Stanford Humanities Center, and spoke at the annual meeting of the American Council of Learned Societies about how digital research is a major emerging theme in scholarship. Finally, I participated in the Tufts “Million Books” Workshop, which explored the technical feasibility and theoretical validity of extracting evidence and meaning from the large new corpora of online texts. The three main topics were how to get from scanned documents (especially the complicated ones that scholars sometimes encounter, like Sanskrit manuscripts or early modern broadsides, rather than simply formatted texts like modern English books) to machine-readable text that can be searched and analyzed; machine translation of texts; and moving from text to actionable data (e.g., extraction all of the place names from a document or summarizing large masses of text). Some developments worth noting from the workshop:

I had vaguely heard about the open-source optical character recognition (OCR) project OCRopus, but Thomas Breuel’s detailed description of the project made it seem extremely promising, especially for scholarly applications. Even after two decades of research and development, the error rate of OCR is still too high for many historical texts, and atrocious for compound texts like Victorian mathematical monographs (with all of those equations that end up, improperly and disastrously, as regular text after OCR) or works with vertical text (e.g., Japanese poetry) or images. OCRopus ambitiously plans to support any language written in any direction with any layout. It also breaks down the conversion of scans to text into separate processes that produce probabilities rather than certainties. This is critical. Most OCR packages give you a text result without noting where the software was unsure of a word or letter. Thus you might get “Cohem” rather than “Cohen” without knowing that the software thought long and hard about the correct interpretation of that last letter. OCRopus instead produces a statistical output that says to any end-user application (like search), “I’m sure about ‘Cohe’ but the last letter has a 60% probability of being an ‘m’ and a 40% probability of being an ‘n’.” A search for “Cohen” could thus return the document as a result even if the “final” transcription defaults to “Cohem.”

OCRopus also uses far more sophisticated methods than current OCR software to find titles, ordered blocks of text (like columns), and marginalia. Brilliantly, rather than outputting XML at the end of its processes, OCRopus outputs to XHTML and CSS3 so that it can much more accurately represent the fonts and layout of the original. Very impressive. The project is just in pre-alpha right now with a 1.0 release to come in the fall of 2008. Unsurprisingly, OCRopus is supported by Google, which plans to use it for Google Book Search. (Right now they have OCR that’s good enough for search, which doesn’t need anywhere near 100% accuracy, but they plan to re-OCR their book scans with OCRopus when it’s ready.)

David Smith spoke about the cutting edge of machine translation (i.e., the use of computational methods to translate text from one language to another). The field seems extremely active right now, and new methods promise better translations in the near future. David spoke of several developments. First, many projects are seeding their software with parallel texts, such as documents from the United Nations or the European Parliament, which are translated very precisely by humans into many languages. Parallel text corpora (with English as one of the parallels) on the order of 20-200 million words (roughly 1-10 million sentences) are available for a number of languages. Unfortunately, the parallel texts often come from genres like laws, parliamentary proceedings, and religious texts (not only the Bible but also, quite interestingly, Dianetics is one English text that has been translated into virtually every language, including Uzbek). These genres are, of course, less than optimal for widespread translation uses. We might, however, be able to use parallel translated works from Google’s scans or the Open Content Alliance to help improve the seed corpus.

Second, David noted the resilience of n-gram analysis—breaking down a document into word pairs or triads. Usually you can predict the next word in a document by looking at the previous two words and then assessing the probability of the word following each pair. Most of the best machine translation services (like Google’s) now split a text into bi-grams and tri-grams (two- and three-word pieces) and then translate those n-grams into very exact parallels in the target text using an n-gram library. This is better at keeping the style of the text and avoiding the off-sounding literal translations that have dogged the field. David feels that machine translation has reached the point where it can very usefully tell a user when a primary source document has been mistranslated by a human, which can be very useful for scholarship.

Finally, David Mimno discussed how to move from the text that results from the work of OCR and machine translation (if necessary) into forms that will help with research and analysis in the humanities. David has been doing impressive work in document classification, i.e., computationally assessing a set of digitized texts and figuring out which ones are letters or poems or lab notes, or if the documents are all articles, separating them out into topic clusters. Like machine translation and OCR, when you begin to look under the hood this is an extraordinarily complicated field. The three main techniques—support vector machines (SVM), naive Bayes (probably the best-known method, often used in spam filters), and logistic regression—are best viewed mathematically, and so lie beyond the scope of this blog. David is working on the Mallet project at the University of Massachusetts, Amherst, which seems promising for document classification (a topic we are increasingly interested in at the Center for History and New Media for historical research). The software is still in alpha but I plan to keep an eye on it.

Obviously a lot to think about from the month of May. How do we get these complicated tools to scholars who don’t have technical skills? How can we use these tools to reveal new, meaningful information about the past, without reproducing the obvious using computational means? As I felt at the National Endowment for the Humanities meeting in April, the application of digital methods to the humanities is experiencing a burst of energy and attention in 2007. It will be interesting to see what happens next.

Second Chicago Colloquium on Digital Humanities and Computer Science

I went to the first of these last November and it’s well worth attending. This year’s theme is “exploring the scholarly query potential of high quality text and image archives in a collaborative environment.” The colloquium will take place on October 21-22, 2007, with proposals due July 31, 2007.

It’s About Russia

One of my favorite Woody Allen quips from his tragically short period as a stand-up comic is the punch line to his hyperbolic story about taking a speed-reading course and then digesting all of War and Peace in twenty minutes. The audience begins to giggle at the silliness of reading Tolstoy’s massive tome in a brief sitting. Allen then kills them with his summary of the book: “It’s about Russia.” The joke came to mind recently as I read the self-congratulatory blog post by IBM’s Many Eyes visualization project, applauding their first month on the web. (And I’m feeling a little embarrassed by my post on the one-year anniversary of this blog.) The Many Eyes researchers point to successes such as this groundbreaking visualization of the New Testament:

News flash: Jesus is a big deal in the New Testament. Even exploring the “network” of figures who are “mentioned together” (ostensibly the point of this visualization) doesn’t provide the kind of insight that even a first-year student in theology could provide over coffee. I have been slow to appreciate the power of textual visualization—in large part because I’ve seen far too many visualizations like this one, that merely use computational methods to reveal the obvious in fancy ways.

I’ve been doing some research on visualizations of texts recently for my next book (on digital scholarship), and trying to get over this aversion to visualizations. But when I see visualizations like this one, the lesson is clear: Make sure your visualizations expose something new, hidden, non-obvious.

Because War and Peace isn’t about Russia.

Google Book Search Now Maps Locations in the Text

Look at the bottom of this page for Illustrated New York: The Metropolis of To-day (1888), digitized by Google at the University of Michigan Library. Using the natural language processing of Google Maps to scan the text for addresses, the locations and surrounding text are placed onto a map of lower Manhattan. A great example of the power of historical data mining and the combination of digital resources via APIs (made easier for Google, of course, because this is all in-house). Kudos to the Google Book Search team.

10 Most Popular Philosophy Syllabi

It’s time once again to find the most influential syllabi in a discipline—this time, philosophy—as determined by data gleaned from the Syllabus Finder. As with my earlier analysis of the most popular history syllabi the following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), the number of times Syllabus Finder users inspected a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and the overall “attractiveness” of a syllabus (defined as the ratio of full reads to mere glances). It goes without saying (but I’ll say it) that this methodology is unscientific and gives an advantage to older syllabi, but it still probably provides a good sense of the most visible and viewed syllabi on the web. Anyway, here are the ten most popular philosophy syllabi.

#1 – Philosophy of Art and Beauty, Julie Van Camp, California State University, Long Beach, Spring 1998 (total of 3992 points)

#2 – Introduction to Philosophy, Andreas Teuber, Brandeis University, Fall 2004 (3699 points)

#3 – Law, Philosophy, and the Humanities, Julie Van Camp, California State University, Long Beach, Fall 2003 (3174 points)

#4 – Introduction to Philosophy, Jonathan Cohen, University of California, San Diego, Fall 1999 (2448 points)

#5 – Comparative Methodology, Bryan W. Van Norden, Vassar College, multiple semesters (1944 points)

#6 – Aesthetics, Steven Crowell, Rice University, Fall 2003 (1913 points)

#7 – Philosophical Aspects of Feminism, Lisa Schwartzman, Michigan State University, Spring 2001 (1782 points)

#8 – Morality and Society, Christian Perring, University of Kentucky, Spring 1996 (1912 points)

#9 – Gay and Lesbian Philosophy, David Barber, University of Maryland, Spring 2002 (1442 points)

#10 – Social and Political Philosophy, Eric Barnes, Mount Holyoke College, Fall 1999 (1395 points)

I will leave it to readers of this blog to assess and compare these syllabi, but two brief comments. First of all, the diversity of topics within this list is notable compared to the overwhelming emphasis on American history among the most popular history syllabi. Asthetics, politics, law, morality, gender, sexuality, and methodology are all represented. Second, congratulations to Julie Van Camp of California State University, Long Beach, who becomes the first professor with two top syllabi in a discipline. Professor Van Camp was a very early adopter of the web, having established a personal home page almost ten years ago with links to all of her syllabi. Van Camp should watch her back, however; Andreas Teuber of Brandeis is coming up quickly with what seems to be the Platonic ideal of an introductory course on philosophy. In less than two years since its inception his syllabus has been very widely consulted.

[The fine print of how the rankings were determined: 1 point was awarded for each time a syllabus showed up in a Syllabus Finder search result; 10 points were awarded for each time a Syllabus Finder user clicked through to view the entire syllabus; 100 points were awarded for each percent of “attractiveness,” where 100% attractive means that every time a syllabus made an appearance in a search result it was clicked on for further information. For instance, the top syllabus appeared in 2164 searches and was clicked on 125 times (5.78% of the searches), for a point total of 2164 + (125 X 10) + (5.78 X 100) = 3992.]

Google Book Search Blog

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now “Inside Google Book Search” seems more like “Outside Google Book Search,” with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting “success stories,” and soliciting participation from librarians, authors, and publishers. Hopefully we’ll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google’s project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I’m a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries.

Google Adds Topic Clusters to Search Results

Google has been very conservative about changing their search results page. Indeed, the design of the page and the information presented has changed little since the search engine’s public introduction in 1998. Innovations have literally been marginal: Google has added helpful spelling corrections (“Did you mean…?”), related search terms, and news items near the top of the page, and of course the ubiquitous text ads to the right. But the primary search results block has remained fairly untouched. Competitors have come and gone (mostly the latter), promoting new—and they say better—ways of browsing masses of information. But Google’s clean, relevant list has brushed off these upstarts. So it surprised me when I was doing some fact checking on a book I’m finishing to see the following search results page:

As you can see, Google has evidently introduced a search results page that clusters relevant web pages by subject matter. Google has often disparaged other search engines that do this sort of clustering, like the gratingly named Clusty and Vivisimo, perhaps because Google’s engineers must be some of the few geeks who understand that regular human beings don’t particularly care for fancier ways of structuring or visualizing search results. Just the text, ma’am.

But while this addition of clustering (based on the information theory of document classification, as I recently discussed in D-Lib and in a popular prior blog post) to Google’s search results page is surprising, the way they’ve done it is typically simple and useful. No little topic folders in a sidebar; no floating circles connected by relationship lines. The page registers the same visually, but it’s more helpful. I was looking for the year in which the Victorian artist C.R. Ashbee died, and the first three results are about him. Then, above the fold, there’s a block of another three results that are mildly set apart (note the light grey lines), asking if I meant to look up information about the Ashbee Lacrosse League (with a link to the full results for that topic), then back to the artist. The page reads like a conversation, without any annoying, overly fancy technical flourishes: “Here’s some info about C.R. Ashbee…oh, did you mean the lacrosse league?…if you didn’t here’s some more about the artist.”

Now I just hope they add this clustering to their Web Search API, which would really help out with H-Bot, my automated historical fact finder.

What Would You Do With a Million Books?

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others reach their fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations. While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provides a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely gives us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.

When Machines Are the Audience

I recently received an email from someone at the Woodrow Wilson Center that began in the following way: “Dear Sir/Madam: I was wondering if you might share the following fellowship opportunity with the members of your list…The Africa Program is pleased to announce that it is now accepting applications…” The email was, of course, tagged as spam by my email software, since it looked suspiciously like what the U.S. Secret Service calls a 419 fraud scheme, or a scam where someone (generally from Africa) asks you to send them your bank account information so they can smuggle cash out of their country (the transfer then occurs in the opposite direction, in case you were wondering). Checking the email against a statistical list of high-likelihood spam triggers identified the repeated use of words such as “application,” “generous,” “Africa,” and “award,” as well as the phrases “submitted electronically” and the opening “Dear Sir/Madam.” The email piqued my curiosity because over the past year I’ve started altering some of my email writing to avoid precisely this problem of a “false positive” spam label, e.g., never sending just an attachment with no text (a classic spam trigger) and avoiding the use of phrases such as “Hey, you’ve got to look at this.” In other words, I’ve semi-consciously started writing for a new audience: machines. One of the central theories of humanities disciplines such as literature and history is that our subjects write for an audience (or audiences). What happens when machines are part of this audience?

As the Woodrow Wilson Center email shows, the fact that digital text is machine readable suddenly makes the use of specific words problematic, because keyword searches can much more easily uncover these words (and perhaps act on them) than in a world of paper. It would be easy to find, for instance, all of the emails about Monica Lewinsky in the 40 million Clinton White House emails saved by the National Archives because “Lewinsky” is such an unusual word. Flipping that logic around, if I were currently involved in a White House scandal, I would studiously avoid the use of any identifying keywords (e.g., “Abramoff”) in my email correspondence.

In other cases, this keyword visibility is desirable. For instance, if I were a writer today thinking about my Word files, I would consider including or excluding certain words from each file for future research (either by myself or by others). Indeed, the “smart folder” technology in Apple’s Spotlight search or the upcoming Windows Vista search can automatically group documents based on the presence of a keyword or set of keywords. When people ask me how they can create a virtual network of websites on a historical topic, I often respond by saying that they could include at the bottom of each web page in the network a unique invented string of characters (e.g., “medievalhistorynetwork”). After Google indexes all of the web pages with this string, you could easily create a specialized search engine that scans only these particular sites.

“Machine audience consciousness” has probably already infected many other realms of our writing. Have some other examples? Let me know and I’ll post them here.

No Computer Left Behind

In this week’s issue of the Chronicle of Higher Education Roy Rosenzweig and I elaborate on the implications of my H-Bot software, and of similar data-mining services and the web in general. “No Computer Left Behind” (cover story in the Chronicle Review; alas, subscription required, though here’s a copy at CHNM) is somewhat more polemical than our recent article in First Monday (“Web of Lies? Historical Knowledge on the Internet”). In short, we argue that just as the calculator—an unavoidable modern technology—muscled its way into the mathematics exam room, devices to access and quickly scan the vast store of historical knowledge on the Internet (such as PDAs and smart phones) will inevitably disrupt the testing—and thus instruction—of humanities subjects. As the editors of the Chronicle put it in their headline: “The multiple-choice test is on its deathbed.” This development is to be praised; just as the teaching of mathematics should be about higher principles rather than the rote memorization of multiplication tables, the teaching of subjects like history should be freed by new technologies to focus once again (as it was before a century of multiple-choice exams) on more important principles such as the analysis and synthesis of primary sources. Here are some excerpts from the article.

“What if students will have in their pockets a device that can rapidly and accurately answer, say, multiple-choice questions about history? Would teachers start to face a revolt from (already restive) students, who would wonder why they were being tested on their ability to answer something that they could quickly find out about on that magical device?

“It turns out that most students already have such a device in their pockets, and to them it’s less magical than mundane. It’s called a cellphone. That pocket communicator is rapidly becoming a portal to other simultaneously remarkable and commonplace modern technologies that, at least in our field of history, will enable the devices to answer, with a surprisingly high degree of accuracy, the kinds of multiple-choice questions used in thousands of high-school and college history classes, as well as a good portion of the standardized tests that are used to assess whether the schools are properly “educating” our students. Those technological developments are likely to bring the multiple-choice test to the brink of obsolescence, mounting a substantial challenge to the presentation of history—and other disciplines—as a set of facts or one-sentence interpretations and to the rote learning that inevitably goes along with such an approach…

“At the same time that the Web’s openness allows anyone access, it also allows any machine connected to it to scan those billions of documents, which leads to the second development that puts multiple-choice tests in peril: the means to process and manipulate the Web to produce meaningful information or answer questions. Computer scientists have long dreamed of an adequately large corpus of text to subject to a variety of algorithms that could reveal underlying meaning and linkages. They now have that corpus, more than large enough to perform remarkable new feats through information theory.

“For instance, Google researchers have demonstrated (but not yet released to the general public) a powerful method for creating ‘good enough’ translations—not by understanding the grammar of each passage, but by rapidly scanning and comparing similar phrases on countless electronic documents in the original and second languages. Given large enough volumes of words in a variety of languages, machine processing can find parallel phrases and reduce any document into a series of word swaps. Where once it seemed necessary to have a human being aid in a computer’s translating skills, or to teach that machine the basics of language, swift algorithms functioning on unimaginably large amounts of text suffice. Are such new computer translations as good as a skilled, bilingual human being? Of course not. Are they good enough to get the gist of a text? Absolutely. So good the National Security Agency and the Central Intelligence Agency increasingly rely on that kind of technology to scan, sort, and mine gargantuan amounts of text and communications (whether or not the rest of us like it).

“As it turns out, ‘good enough’ is precisely what multiple-choice exams are all about. Easy, mechanical grading is made possible by restricting possible answers, akin to a translator’s receiving four possible translations for a sentence. Not only would those four possibilities make the work of the translator much easier, but a smart translator—even one with a novice understanding of the translated language—could home in on the correct answer by recognizing awkward (or proper) sounding pieces in each possible answer. By restricting the answers to certain possibilities, multiple-choice questions provide a circumscribed realm of information, where subtle clues in both the question and the few answers allow shrewd test takers to make helpful associations and rule out certain answers (for decades, test-preparation companies like Kaplan Inc. have made a good living teaching students that trick). The ‘gaming’ of a question can occur even when the test taker doesn’t know the correct answer and is not entirely familiar with the subject matter…

“By the time today’s elementary-school students enter college, it will probably seem as odd to them to be forbidden to use digital devices like cellphones, connected to an Internet service like H-Bot, to find out when Nelson Mandela was born as it would be to tell students now that they can’t use a calculator to do the routine arithmetic in an algebra equation. By providing much more than just an open-ended question, multiple-choice tests give students—and, perhaps more important in the future, their digital assistants—more than enough information to retrieve even a fairly sophisticated answer from the Web. The genie will be out of the bottle, and we will have to start thinking of more meaningful ways to assess historical knowledge or ‘ignorance.'”