Dan Cohen

Archive for the ‘Text Mining’ Category

Postdoc in Text Mining at CHNM

Thursday, April 3rd, 2008

[Yes, we're hiring again. Come join us if this sounds like you!]

The Center for History and New Media (CHNM) at George Mason University is seeking a postdoctoral fellow to work on a new text-mining initiative supported by the National Endowment for the Humanities. ABD candidates are also strongly encouraged to apply. This is a grant-funded, two-year position that is particularly appropriate for someone with interests in computational linguistics, machine learning, or technology and the humanities and social sciences. Specific background and experience is less important than the ability to learn new technical skills quickly. Knowledge of some combination of the following would be particularly helpful: Java, JavaScript, MySQL, PHP, or object-oriented programming. Ability to work in a team is very important. CHNM (http://chnm.gmu.edu), known for innovative work in digital media, is located in Fairfax, Virginia, 15 miles from Washington, DC, and is accessible by public transportation. Please send a cover letter and resume, including relevant programming projects and experience, to chnm@gmu.edu with subject line “Text Mining.” We will begin considering applications on 5/1/2008 and continue until the position is filled. Applications without a cover letter will not be considered.

Enhancing Historical Research With Text-Mining and Analysis Tools

Monday, February 4th, 2008

I’m delighted to announce that beginning this summer the Center for History and New Media will undertake a major two-year study of the potential of text-mining tools for historical (and by extension, humanities) scholarship. The project, entitled “Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools,” has just received generous funding from the National Endowment for the Humanities.

In the last decade the library community and other providers of digital collections have created an incredibly rich digital archive of historical and cultural materials. Yet most scholars have not yet figured out ways to take full advantage of the digitized riches suddenly available on their computers. Indeed, the abundance of digital documents has actually exacerbated the problems of some researchers who now find themselves overwhelmed by the sheer quantity of available material. Meanwhile, some of the most profound insights lurking in these digital corpora remain locked up.

For some time computer scientists have been pursuing text mining as a solution to the problem of abundance, and there have even been a few attempts at bringing text-mining tools to the humanities (such as the MONK project). Yet there is not as much research as one might hope on what non-technically savvy scholars (especially historians) might actually want and use in their research, and how we might integrate sophisticated text analysis into the workflow of these scholars.

We will first conduct a survey of historians to examine closely their use of digital resources and prospect for particularly helpful uses of digital technology. We will then explore three main areas where text mining might help in the research process: locating documents of interest in the sea of texts online; extracting and synthesizing information from these texts; and analyzing large-scale patterns across these texts. A focus group of historians will be used to assess the efficacy of different methods of text mining and analysis in real-world research situations in order to offer recommendations, and even some tools, for the most promising approaches.

In addition to other forms of dissemination, I will of course provide project updates in this space.

[Image credit: Matt Wright]

Why Google Books Should Have an API

Tuesday, September 4th, 2007

[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited–I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.
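To make the "religious language relatively close to mentions of mathematics" idea concrete, here is a minimal sketch of a proximity search using regular expressions. The two word lists, the function names, and the ten-word window are all assumptions made up for illustration; a real study would need much longer, period-appropriate vocabularies, and nothing here reflects an actual Google Books API.

```python
import re

# Hypothetical vocabularies, far too short for real research.
RELIGIOUS = r"(?:divin\w*|god|creator|providence|heaven\w*|sacred)"
MATHEMATICAL = r"(?:mathemat\w*|geometr\w*|algebra\w*|arithmetic\w*)"


def proximity_pattern(max_gap=10):
    """Build a pattern matching a religious term within max_gap
    intervening words of a mathematical term, in either order."""
    # Up to max_gap intervening words, matched lazily so that the
    # shortest qualifying passage is returned.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    return re.compile(
        r"\b(?:%s%s%s|%s%s%s)\b"
        % (RELIGIOUS, gap, MATHEMATICAL, MATHEMATICAL, gap, RELIGIOUS),
        re.IGNORECASE,
    )


def find_passages(text, max_gap=10):
    """Return every non-overlapping passage where the two vocabularies meet."""
    return [m.group(0) for m in proximity_pattern(max_gap).finditer(text)]


print(find_passages("Geometry reveals the mind of the Creator in every figure."))
```

Run over the OCRed text of each book in an ad hoc corpus, something like this could give a rough count of such passages per book, which could then be plotted by publication date.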

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

Nora Project Screencast

Tuesday, June 19th, 2007

The Nora text analysis and visualization project has a screencast out explaining how to use a new web interface to their server-based software.

American Studies Tagline

Monday, June 18th, 2007

Dave Lester provides an interesting visualization of the history of American Studies over the last fifty years by running Lucy Maddox’s Locating American Studies: The Evolution of a Discipline through a tag cloud creator and then putting it on a slider timeline. Note the rise and fall of Leo Marx’s influence on the field, among other things.

Million Books Workshop Wrap-up

Thursday, May 24th, 2007

May has been a month of travel for me (thus the light posting in this space). I gave a talk about Zotero and related developments in the humanities and technology at the Stanford Humanities Center, and spoke at the annual meeting of the American Council of Learned Societies about how digital research is a major emerging theme in scholarship. Finally, I participated in the Tufts “Million Books” Workshop, which explored the technical feasibility and theoretical validity of extracting evidence and meaning from the large new corpora of online texts. The three main topics were how to get from scanned documents (especially the complicated ones that scholars sometimes encounter, like Sanskrit manuscripts or early modern broadsides, rather than simply formatted texts like modern English books) to machine-readable text that can be searched and analyzed; machine translation of texts; and moving from text to actionable data (e.g., extracting all of the place names from a document or summarizing large masses of text). Some developments worth noting from the workshop:

I had vaguely heard about the open-source optical character recognition (OCR) project OCRopus, but Thomas Breuel’s detailed description of the project made it seem extremely promising, especially for scholarly applications. Even after two decades of research and development, the error rate of OCR is still too high for many historical texts, and atrocious for compound texts like Victorian mathematical monographs (with all of those equations that end up, improperly and disastrously, as regular text after OCR) or works with vertical text (e.g., Japanese poetry) or images. OCRopus ambitiously plans to support any language written in any direction with any layout. It also breaks down the conversion of scans to text into separate processes that produce probabilities rather than certainties. This is critical. Most OCR packages give you a text result without noting where the software was unsure of a word or letter. Thus you might get “Cohem” rather than “Cohen” without knowing that the software thought long and hard about the correct interpretation of that last letter. OCRopus instead produces a statistical output that says to any end-user application (like search), “I’m sure about ‘Cohe’ but the last letter has a 60% probability of being an ‘m’ and a 40% probability of being an ‘n’.” A search for “Cohen” could thus return the document as a result even if the “final” transcription defaults to “Cohem.”
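The "Cohem" example can be sketched in a few lines. The data structure below (one dictionary of candidate characters per position) is purely hypothetical, as are the function names and the 0.25 threshold; I am not describing OCRopus's actual output format, only the general idea of searching over probabilities rather than a single fixed transcription.

```python
from math import prod

# Hypothetical representation of one OCRed word: for each character
# position, the candidate characters and their probabilities.
ocr_word = [
    {"C": 1.0}, {"o": 1.0}, {"h": 1.0}, {"e": 1.0},
    {"m": 0.6, "n": 0.4},
]


def best_guess(lattice):
    """The 'final' transcription: the most probable character at each position."""
    return "".join(max(cands, key=cands.get) for cands in lattice)


def match_probability(lattice, query):
    """Probability that the scanned word really is `query`, treating
    positions as independent (an assumption of this sketch)."""
    if len(query) != len(lattice):
        return 0.0
    return prod(cands.get(ch, 0.0) for cands, ch in zip(lattice, query))


def matches(lattice, query, threshold=0.25):
    """A search hit is any word whose match probability clears the threshold."""
    return match_probability(lattice, query) >= threshold


print(best_guess(ocr_word))        # the default transcription
print(matches(ocr_word, "Cohen"))  # whether a search for "Cohen" still hits
```

The design choice worth noticing is that the default transcription and the search index are decoupled: the reader sees one best guess, while search can consult every plausible reading.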

OCRopus also uses far more sophisticated methods than current OCR software to find titles, ordered blocks of text (like columns), and marginalia. Brilliantly, rather than outputting XML at the end of its processes, OCRopus outputs to XHTML and CSS3 so that it can much more accurately represent the fonts and layout of the original. Very impressive. The project is just in pre-alpha right now with a 1.0 release to come in the fall of 2008. Unsurprisingly, OCRopus is supported by Google, which plans to use it for Google Book Search. (Right now they have OCR that’s good enough for search, which doesn’t need anywhere near 100% accuracy, but they plan to re-OCR their book scans with OCRopus when it’s ready.)

David Smith spoke about the cutting edge of machine translation (i.e., the use of computational methods to translate text from one language to another). The field seems extremely active right now, and new methods promise better translations in the near future. David spoke of several developments. First, many projects are seeding their software with parallel texts, such as documents from the United Nations or the European Parliament, which are translated very precisely by humans into many languages. Parallel text corpora (with English as one of the parallels) on the order of 20-200 million words (roughly 1-10 million sentences) are available for a number of languages. Unfortunately, the parallel texts often come from genres like laws, parliamentary proceedings, and religious texts (not only the Bible but also, quite interestingly, Dianetics is one English text that has been translated into virtually every language, including Uzbek). These genres are, of course, less than optimal for widespread translation uses. We might, however, be able to use parallel translated works from Google’s scans or the Open Content Alliance to help improve the seed corpus.

Second, David noted the resilience of n-gram analysis, that is, breaking down a document into word pairs or triads. Usually you can predict the next word in a document by looking at the previous two words and then assessing the probability of the word following each pair. Most of the best machine translation services (like Google’s) now split a text into bi-grams and tri-grams (two- and three-word pieces) and then translate those n-grams into very exact parallels in the target text using an n-gram library. This is better at keeping the style of the text and avoiding the off-sounding literal translations that have dogged the field. David feels that machine translation has reached the point where it can tell a user when a primary source document has been mistranslated by a human, which can be very useful for scholarship.
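For the curious, the "previous two words" prediction can be illustrated with a toy trigram model. This is only a sketch of the counting idea, with function names and a sample sentence I have invented; production translation systems add smoothing, enormous corpora, and the parallel-text alignment described above.

```python
from collections import Counter, defaultdict


def train_trigrams(text):
    """Count, for every pair of adjacent words, which words follow that pair."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        model[(first, second)][third] += 1
    return model


def predict_next(model, first, second):
    """Return (most likely next word, its estimated probability),
    or None if the pair was never seen in training."""
    followers = model.get((first.lower(), second.lower()))
    if not followers:
        return None
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())


sample = ("the house of commons met and the house of lords met "
          "and the house of commons rose")
model = train_trigrams(sample)
print(predict_next(model, "house", "of"))
```

In the sample sentence "house of" is followed by "commons" twice and "lords" once, so the model prefers "commons" with an estimated probability of two in three.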

Finally, David Mimno discussed how to move from the text that results from the work of OCR and machine translation (if necessary) into forms that will help with research and analysis in the humanities. David has been doing impressive work in document classification, i.e., computationally assessing a set of digitized texts and figuring out which ones are letters or poems or lab notes, or if the documents are all articles, separating them out into topic clusters. Like machine translation and OCR, when you begin to look under the hood this is an extraordinarily complicated field. The three main techniques—support vector machines (SVM), naive Bayes (probably the best-known method, often used in spam filters), and logistic regression—are best viewed mathematically, and so lie beyond the scope of this blog. David is working on the Mallet project at the University of Massachusetts, Amherst, which seems promising for document classification (a topic we are increasingly interested in at the Center for History and New Media for historical research). The software is still in alpha but I plan to keep an eye on it.

Obviously a lot to think about from the month of May. How do we get these complicated tools to scholars who don’t have technical skills? How can we use these tools to reveal new, meaningful information about the past, without reproducing the obvious using computational means? As I felt at the National Endowment for the Humanities meeting in April, the application of digital methods to the humanities is experiencing a burst of energy and attention in 2007. It will be interesting to see what happens next.

Second Chicago Colloquium on Digital Humanities and Computer Science

Thursday, April 26th, 2007

I went to the first of these last November and it’s well worth attending. This year’s theme is “exploring the scholarly query potential of high quality text and image archives in a collaborative environment.” The colloquium will take place on October 21-22, 2007, with proposals due July 31, 2007.

It’s About Russia

Tuesday, March 6th, 2007

One of my favorite Woody Allen quips from his tragically short period as a stand-up comic is the punch line to his hyperbolic story about taking a speed-reading course and then digesting all of War and Peace in twenty minutes. The audience begins to giggle at the silliness of reading Tolstoy’s massive tome in a brief sitting. Allen then kills them with his summary of the book: “It’s about Russia.” The joke came to mind recently as I read the self-congratulatory blog post by IBM’s Many Eyes visualization project, applauding their first month on the web. (And I’m feeling a little embarrassed by my post on the one-year anniversary of this blog.) The Many Eyes researchers point to successes such as this groundbreaking visualization of the New Testament:

News flash: Jesus is a big deal in the New Testament. Even exploring the “network” of figures who are “mentioned together” (ostensibly the point of this visualization) doesn’t provide the kind of insight that even a first-year student in theology could provide over coffee. I have been slow to appreciate the power of textual visualization—in large part because I’ve seen far too many visualizations like this one, that merely use computational methods to reveal the obvious in fancy ways.

I’ve been doing some research on visualizations of texts recently for my next book (on digital scholarship), and trying to get over this aversion to visualizations. But when I see visualizations like this one, the lesson is clear: Make sure your visualizations expose something new, hidden, non-obvious.

Because War and Peace isn’t about Russia.

Google Book Search Now Maps Locations in the Text

Friday, January 26th, 2007

Look at the bottom of this page for Illustrated New York: The Metropolis of To-day (1888), digitized by Google at the University of Michigan Library. Using the natural language processing of Google Maps to scan the text for addresses, the locations and surrounding text are placed onto a map of lower Manhattan. A great example of the power of historical data mining and the combination of digital resources via APIs (made easier for Google, of course, because this is all in-house). Kudos to the Google Book Search team.

10 Most Popular Philosophy Syllabi

Sunday, May 21st, 2006

It’s time once again to find the most influential syllabi in a discipline (this time, philosophy) as determined by data gleaned from the Syllabus Finder. As with my earlier analysis of the most popular history syllabi, the following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), the number of times Syllabus Finder users inspected a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and the overall “attractiveness” of a syllabus (defined as the ratio of full reads to mere glances). It goes without saying (but I’ll say it) that this methodology is unscientific and gives an advantage to older syllabi, but it still probably provides a good sense of the most visible and viewed syllabi on the web. Anyway, here are the ten most popular philosophy syllabi.

#1 – Philosophy of Art and Beauty, Julie Van Camp, California State University, Long Beach, Spring 1998 (total of 3992 points)

#2 – Introduction to Philosophy, Andreas Teuber, Brandeis University, Fall 2004 (3699 points)

#3 – Law, Philosophy, and the Humanities, Julie Van Camp, California State University, Long Beach, Fall 2003 (3174 points)

#4 – Introduction to Philosophy, Jonathan Cohen, University of California, San Diego, Fall 1999 (2448 points)

#5 – Comparative Methodology, Bryan W. Van Norden, Vassar College, multiple semesters (1944 points)

#6 – Aesthetics, Steven Crowell, Rice University, Fall 2003 (1913 points)

#7 – Philosophical Aspects of Feminism, Lisa Schwartzman, Michigan State University, Spring 2001 (1782 points)

#8 – Morality and Society, Christian Perring, University of Kentucky, Spring 1996 (1912 points)

#9 – Gay and Lesbian Philosophy, David Barber, University of Maryland, Spring 2002 (1442 points)

#10 – Social and Political Philosophy, Eric Barnes, Mount Holyoke College, Fall 1999 (1395 points)

I will leave it to readers of this blog to assess and compare these syllabi, but two brief comments. First of all, the diversity of topics within this list is notable compared to the overwhelming emphasis on American history among the most popular history syllabi. Aesthetics, politics, law, morality, gender, sexuality, and methodology are all represented. Second, congratulations to Julie Van Camp of California State University, Long Beach, who becomes the first professor with two top syllabi in a discipline. Professor Van Camp was a very early adopter of the web, having established a personal home page almost ten years ago with links to all of her syllabi. Van Camp should watch her back, however; Andreas Teuber of Brandeis is coming up quickly with what seems to be the Platonic ideal of an introductory course on philosophy. In less than two years since its inception his syllabus has been very widely consulted.

[The fine print of how the rankings were determined: 1 point was awarded for each time a syllabus showed up in a Syllabus Finder search result; 10 points were awarded for each time a Syllabus Finder user clicked through to view the entire syllabus; 100 points were awarded for each percent of "attractiveness," where 100% attractive means that every time a syllabus made an appearance in a search result it was clicked on for further information. For instance, the top syllabus appeared in 2164 searches and was clicked on 125 times (5.78% of the searches), for a point total of 2164 + (125 X 10) + (5.78 X 100) = 3992.]
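The point formula in the fine print can be written out as a short function. The function name is invented for illustration, and rounding the attractiveness percentage to two decimal places is an assumption inferred from the worked example (5.78%), not something stated as a rule.

```python
def syllabus_points(appearances, clicks):
    """Score a syllabus: 1 point per search appearance, 10 per click-through,
    and 100 per percentage point of 'attractiveness' (clicks / appearances)."""
    if appearances == 0:
        return 0
    # Assumption: the percentage is rounded to two decimals before weighting.
    attractiveness = round(100 * clicks / appearances, 2)
    return round(appearances + 10 * clicks + 100 * attractiveness)


# The top syllabus: 2164 search appearances and 125 click-throughs.
print(syllabus_points(2164, 125))
```

With the figures for the top syllabus this reproduces the total of 3992 given in the list.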