Dan Cohen

Archive for the ‘Text Mining’ Category

Google Book Search Blog

Tuesday, May 9th, 2006

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now “Inside Google Book Search” seems more like “Outside Google Book Search,” with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting “success stories,” and soliciting participation from librarians, authors, and publishers. Hopefully we’ll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google’s project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I’m a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries.

Google Adds Topic Clusters to Search Results

Tuesday, March 21st, 2006

Google has been very conservative about changing their search results page. Indeed, the design of the page and the information presented has changed little since the search engine’s public introduction in 1998. Innovations have literally been marginal: Google has added helpful spelling corrections (”Did you mean…?”), related search terms, and news items near the top of the page, and of course the ubiquitous text ads to the right. But the primary search results block has remained fairly untouched. Competitors have come and gone (mostly the latter), promoting new—and they say better—ways of browsing masses of information. But Google’s clean, relevant list has brushed off these upstarts. So it surprised me when I was doing some fact checking on a book I’m finishing to see the following search results page:

As you can see, Google has evidently introduced a search results page that clusters relevant web pages by subject matter. Google has often disparaged other search engines that do this sort of clustering, like the gratingly named Clusty and Vivisimo, perhaps because Google’s engineers must be some of the few geeks who understand that regular human beings don’t particularly care for fancier ways of structuring or visualizing search results. Just the text, ma’am.

But while this addition of clustering (based on the information theory of document classification, as I recently discussed in D-Lib and in a popular prior blog post) to Google’s search results page is surprising, the way they’ve done it is typically simple and useful. No little topic folders in a sidebar; no floating circles connected by relationship lines. The page registers the same visually, but it’s more helpful. I was looking for the year in which the Victorian artist C.R. Ashbee died, and the first three results are about him. Then, above the fold, there’s a block of another three results that are mildly set apart (note the light grey lines), asking if I meant to look up information about the Ashbee Lacrosse League (with a link to the full results for that topic), then back to the artist. The page reads like a conversation, without any annoying, overly fancy technical flourishes: “Here’s some info about C.R. Ashbee…oh, did you mean the lacrosse league?…if you didn’t here’s some more about the artist.”

Now I just hope they add this clustering to their Web Search API, which would really help out with H-Bot, my automated historical fact finder.

What Would You Do With a Million Books?

Friday, March 17th, 2006

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others reach their fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations. While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provides a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely gives us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.

When Machines Are the Audience

Thursday, March 2nd, 2006

I recently received an email from someone at the Woodrow Wilson Center that began in the following way: “Dear Sir/Madam: I was wondering if you might share the following fellowship opportunity with the members of your list…The Africa Program is pleased to announce that it is now accepting applications…” The email was, of course, tagged as spam by my email software, since it looked suspiciously like what the U.S. Secret Service calls a 419 fraud scheme, or a scam where someone (generally from Africa) asks you to send them your bank account information so they can smuggle cash out of their country (the transfer then occurs in the opposite direction, in case you were wondering). Checking the email against a statistical list of high-likelihood spam triggers identified the repeated use of words such as “application,” “generous,” “Africa,” and “award,” as well as the phrases “submitted electronically” and the opening “Dear Sir/Madam.” The email piqued my curiosity because over the past year I’ve started altering some of my email writing to avoid precisely this problem of a “false positive” spam label, e.g., never sending just an attachment with no text (a class spam trigger) and avoiding the use of phrases such as “Hey, you’ve got to look at this.” In other words, I’ve semi-consciously started writing for a new audience: machines. One of the central theories of humanities disciplines such as literature and history is that our subjects write for an audience (or audiences). What happens when machines are part of this audience?

As the Woodrow Wilson Center email shows, the fact that digital text is machine readable suddenly makes the use of specific words problematic, because keyword searches can much more easily uncover these words (and perhaps act on them) than in a world of paper. It would be easy to find, for instance, all of the emails about Monica Lewinsky in the 40 million Clinton White House emails saved by the National Archives because “Lewinsky” is such an unusual word. Flipping that logic around, if I were currently involved in a White House scandal, I would studiously avoid the use of any identifying keywords (e.g., “Abramoff”) in my email correspondence.

In other cases, this keyword visibility is desirable. For instance, if I were a writer today thinking about my Word files, I would consider including or excluding certain words from each file for future research (either by myself or by others). Indeed, the “smart folder” technology in Apple’s Spotlight search or the upcoming Windows Vista search can automatically group documents based on the presence of a keyword or set of keywords. When people ask me how they can create a virtual network of websites on a historical topic, I often respond by saying that they could include at the bottom of each web page in the network a unique invented string of characters (e.g., “medievalhistorynetwork”). After Google indexes all of the web pages with this string, you could easily create a specialized search engine that scans only these particular sites.

“Machine audience consciousness” has probably already infected many other realms of our writing. Have some other examples? Let me know and I’ll post them here.

No Computer Left Behind

Monday, February 20th, 2006

In this week’s issue of the Chronicle of Higher Education Roy Rosenzweig and I elaborate on the implications of my H-Bot software, and of similar data-mining services and the web in general. “No Computer Left Behind” (cover story in the Chronicle Review; alas, subscription required, though here’s a copy at CHNM) is somewhat more polemical than our recent article in First Monday (“Web of Lies? Historical Knowledge on the Internet”). In short, we argue that just as the calculator—an unavoidable modern technology—muscled its way into the mathematics exam room, devices to access and quickly scan the vast store of historical knowledge on the Internet (such as PDAs and smart phones) will inevitably disrupt the testing—and thus instruction—of humanities subjects. As the editors of the Chronicle put it in their headline: “The multiple-choice test is on its deathbed.” This development is to be praised; just as the teaching of mathematics should be about higher principles rather than the rote memorization of multiplication tables, the teaching of subjects like history should be freed by new technologies to focus once again (as it was before a century of multiple-choice exams) on more important principles such as the analysis and synthesis of primary sources. Here are some excerpts from the article.

“What if students will have in their pockets a device that can rapidly and accurately answer, say, multiple-choice questions about history? Would teachers start to face a revolt from (already restive) students, who would wonder why they were being tested on their ability to answer something that they could quickly find out about on that magical device?

“It turns out that most students already have such a device in their pockets, and to them it’s less magical than mundane. It’s called a cellphone. That pocket communicator is rapidly becoming a portal to other simultaneously remarkable and commonplace modern technologies that, at least in our field of history, will enable the devices to answer, with a surprisingly high degree of accuracy, the kinds of multiple-choice questions used in thousands of high-school and college history classes, as well as a good portion of the standardized tests that are used to assess whether the schools are properly “educating” our students. Those technological developments are likely to bring the multiple-choice test to the brink of obsolescence, mounting a substantial challenge to the presentation of history—and other disciplines—as a set of facts or one-sentence interpretations and to the rote learning that inevitably goes along with such an approach…

“At the same time that the Web’s openness allows anyone access, it also allows any machine connected to it to scan those billions of documents, which leads to the second development that puts multiple-choice tests in peril: the means to process and manipulate the Web to produce meaningful information or answer questions. Computer scientists have long dreamed of an adequately large corpus of text to subject to a variety of algorithms that could reveal underlying meaning and linkages. They now have that corpus, more than large enough to perform remarkable new feats through information theory.

“For instance, Google researchers have demonstrated (but not yet released to the general public) a powerful method for creating ‘good enough’ translations—not by understanding the grammar of each passage, but by rapidly scanning and comparing similar phrases on countless electronic documents in the original and second languages. Given large enough volumes of words in a variety of languages, machine processing can find parallel phrases and reduce any document into a series of word swaps. Where once it seemed necessary to have a human being aid in a computer’s translating skills, or to teach that machine the basics of language, swift algorithms functioning on unimaginably large amounts of text suffice. Are such new computer translations as good as a skilled, bilingual human being? Of course not. Are they good enough to get the gist of a text? Absolutely. So good the National Security Agency and the Central Intelligence Agency increasingly rely on that kind of technology to scan, sort, and mine gargantuan amounts of text and communications (whether or not the rest of us like it).

“As it turns out, ‘good enough’ is precisely what multiple-choice exams are all about. Easy, mechanical grading is made possible by restricting possible answers, akin to a translator’s receiving four possible translations for a sentence. Not only would those four possibilities make the work of the translator much easier, but a smart translator—even one with a novice understanding of the translated language—could home in on the correct answer by recognizing awkward (or proper) sounding pieces in each possible answer. By restricting the answers to certain possibilities, multiple-choice questions provide a circumscribed realm of information, where subtle clues in both the question and the few answers allow shrewd test takers to make helpful associations and rule out certain answers (for decades, test-preparation companies like Kaplan Inc. have made a good living teaching students that trick). The ‘gaming’ of a question can occur even when the test taker doesn’t know the correct answer and is not entirely familiar with the subject matter…

“By the time today’s elementary-school students enter college, it will probably seem as odd to them to be forbidden to use digital devices like cellphones, connected to an Internet service like H-Bot, to find out when Nelson Mandela was born as it would be to tell students now that they can’t use a calculator to do the routine arithmetic in an algebra equation. By providing much more than just an open-ended question, multiple-choice tests give students—and, perhaps more important in the future, their digital assistants—more than enough information to retrieve even a fairly sophisticated answer from the Web. The genie will be out of the bottle, and we will have to start thinking of more meaningful ways to assess historical knowledge or ‘ignorance.’”

Wikipedia vs. Encyclopaedia Britannica Keyword Shootout Results

Monday, February 6th, 2006

In my post “Wikipedia vs. Encyclopaedia Britannica for Digital Research”, I asked you to compare two lists of significant keywords and phrases, derived from matching articles on George H. W. Bush in Wikipedia and the Encyclopaedia Britannica. Which one is a better keyword profile—a data mining list that could be used to find other documents on the first President Bush in a sea of documents—and which list do you think was derived from Wikipedia? The people have spoken and it’s time to open the envelope.

Incredibly, as of this writing everyone who has voted has chosen list #2 as being the better of the two, with 79% of the voters believing that this list was extracted from Wikipedia. Well, the majority is half right.

First, a couple of caveats. For some reason Yahoo’s Term Extraction service returned more terms for the second article than the first (I’m not sure why, but my experience has been that the service is fickle in this way). In addition, the second article is much shorter than the first, and Yahoo has a maximum character length for documents it will process. I suspect that the first article was truncated on its way to Yahoo’s server. Regardless, I agree that the second list is better (though it may have been helped by these factors).

But it may surprise some that list #2 comes from the Encyclopaedia Britannica rather than Wikipedia. There are clearly a lot of Wikipedia true believers out there (including, at times, myself). Despite its flaws, however, I still think Wikipedia will probably do just as well for keyword profiling of documents as the Encyclopaedia Britannica. And qualitative considerations are essentially moot since the Encyclopaedia Britannica has rendered itself useless anyway for data-mining purposes by gating its content.

Wikipedia vs. Encyclopaedia Britannica for Digital Research

Monday, January 30th, 2006

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.

Article #1
president bush
saddam hussein
fall of the berlin wall
tiananmen square
thanksgiving day
american troops
manuel noriega
halabja
invasion of panama
gulf war
help
saudi arabia
united nations
berlin wall

Article #2
president george bush
george bush
mikhail gorbachev
soviet union
collapse
reunification of germany
thurgood marshall
union
clarence thomas
joint chiefs of staff
cold war
manuel antonio noriega
iraq
george
nonaggression pact
david h souter
antonio noriega
president george

How Much Google Knows About You

Thursday, January 26th, 2006

As the U.S. Justice Department put pressure on Google this week to hand over their search records in a questionable pursuit of evidence for an overturned pornography law, I wondered: How much information does Google really know about us? Strangely, at nearly the same time an email arrived from Google (one of the Google Friends Newsletters) telling me that they had just launched Google Personal Search Trends. Someone in the legal department must not have vetted that email: Google Personal Search Trends reveals exactly how much they know about you. So, how much?

A lot. If you have a Google account (you have one if you have a software developer’s username, a Gmail account, or other Google service account), you can login to your Personal Search Trends page and find out. I logged in and even though I’ve never checked a box or filled out a consent form saying that I don’t mind if Google collects information about my search habits, there appeared a remarkable and slightly unsettling series of charts and tables about me and what I’m interested in.

You can discover not only your top 10 search phrases but also the top 10 sites you visit and the top 10 links you click on. Like Santa, Google knows when you are awake and when you are sleeping—amazingly, no searches for me between midnight and 6 AM ET over the past 12 months. And comparing my search habits with its vast database of users, Google Personal Search Trends tells me that I might also like go to websites on RSS, Charles Dickens, Frankenstein, search engine optimization, and Virginia Tech football. (It’s very wrong about that last one, which I hope it only derives from my search terms and websites visited and not also from the IP address of my laptop in an office on the campus of a Virginia state university.)

Of course, you begin to wonder: wouldn’t someone else like to see this same set of charts and tables? Couldn’t they glean a tremendous amount of information about me? This disturbing feeling grows when you do some more investigation of what Google’s storing on your hard drive in addition to theirs. For instance, if you use Google’s Book Search, they know through a cookie stored on your computer which books you’ve looked at—as well as how many pages of each book (so they can block you from reading too much of a copyrighted book).

Seems like the time is ripe for Google to offer its users a similar deal to the one TiVo has had for years: If you want us to provide the “best” search experience—extras in addition to the basic web search such as personalized search results and recommendations based on what you seem to like—you must provide us with some identifying information; if you want to search the web without these extras, then so be it—we’ll only save your searches on a fully anonymous basis for our internal research. Surely when government entities and private investigators hear about Google Personal Search Trends, they’ll want to have a look. One suspects that in China and perhaps the United States too, someone’s already doing just that.

10 Most Popular History Syllabi

Wednesday, January 11th, 2006

My Syllabus Finder search engine has been in use for three years now, and I thought it would be interesting to look back at the nearly half-million searches and 640,000 syllabi it has handled to see which syllabi have been the most popular. The following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), read a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and “attractiveness” of a syllabus (defined as the ratio of full reads to mere glances). Here are the most popular history syllabi on the web.

#1 - U.S. History to 1870 (Eric Mayer, Victor Valley College, total of 6104 points)

#2 - America in the Progressive Era (Robert Bannister, Swarthmore College, 6000 points)

#3 - The American Colonies (Bruce Dorsey, Swarthmore College, 5589 points)

#4 - The American Civil War (Sheila Culbert, Dartmouth College, 5521 points)

#5 - Early Modern Europe (Andrew Plaa, Columbia University, 5485 points)

#6 - The United States since 1945 (Robert Griffith, American University, 5109 points)

#7 - American Political and Social History II (Robert Dykstra, University at Albany, State University of New York, 5048 points)

#8 - The World Since 1500 (Sarah Watts, Wake Forest University, 4760 points)

#9 - The Military and War in America (Nicholas Pappas, Sam Houston State University, 4740 points)

#10 - World Civilization I (Jim Jones, West Chester University of Pennsylvania, 4636 points)

This is, of course, a completely unscientific study. It obviously gives an advantage to older syllabi, since those courses have been online longer and thus could show up in search results for several years. On the other hand, the ten syllabi listed here range almost uniformly from 1998 to 2005.

Whatever its faults, the study does provide a good sense of the most visible and viewed syllabi on the web (high Google rankings help these syllabi get into a lot of Syllabus Finder search results), and I hope it provides a sense of the kinds of syllabi people frequently want to consult (or crib)—mostly introductory courses in American history. The variety of institutions represented is also notable (and holds true beyond the top ten; no domination by, e.g., Ivy League schools). I’ll probably do some more sophisticated analyses when I have the time; if there’s interest from this blog’s audience I’ll calculate the most popular history syllabi from 2005 courses, or the top ten for other topics. If you would like to read a far more elaborate (and scientific) data-mining study I did using the Syllabus Finder, please take a look at “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.”

[How the rankings were determined: 1 point was awarded for each time a syllabus showed up in a Syllabus Finder search result; 10 points were awarded for each time a Syllabus Finder user clicked through to view the entire syllabus; 100 points were awarded for each percent of "attractiveness," where 100% attractive meant that every time a syllabus made an appearance in a search result it was clicked on for further information. For instance, the top syllabus appeared in 1211 searches and was clicked on 268 times (22.13% of the searches), for a point total of 1211 + (268 X 10) + (22.13 X 100) = 6104.]

The Wikipedia Story That’s Being Missed

Tuesday, December 20th, 2005

With all of the hoopla over Wikipedia in the recent weeks (covered in two prior posts), most of the mainstream as well as tech media coverage has focused on the openness of the democratic online encyclopedia. Depending on where you stand, this openness creates either a Wild West of publishing, where anything goes and facts are always changeable, or an innovative mode of mostly anonymous collaboration that has managed to construct in just a few years an information resource that is enormous, often surprisingly good, and frequently referenced. But I believe there is another story about Wikipedia that is being missed, a story unrelated to its (perhaps dubious) openness. This story is about Wikipedia being free, in the sense of the open source movement—the fact that anyone can download the entirety of Wikipedia and use it and manipulate it as they wish. And this more hidden story begins when you ask, Why would Google and Yahoo be so interested in supporting Wikipedia?

This year Google and Yahoo pledged to give various freebies to Wikipedia, such as server space and bandwidth (the latter can be the most crippling expense for large, highly trafficked sites with few sources of income). To be sure, both of these behemoth tech companies are filled with geeks who appreciate the anti-authoritarian nature of the Wikipedia project, and probably a significant portion of the urge to support Wikipedia comes from these common sentiments. Of course, it doesn’t hurt that Google and Yahoo buy their bandwidth in bulk and probably have some extra lying around, so to speak.

But Google and Yahoo, as companies at the forefront of search and data-mining technologies and business models, undoubtedly get an enormous benefit from an information resource that is not only open and editable but also free—not just free as in beer but free as in speech. First of all, affiliate companies that Yahoo and Google use to respond to queries, such as Answers.com, primarily use Wikipedia as their main source, benefiting greatly from being able to repackage Wikipedia content (free speech) and from using it without paying (free beer). And Google has recently introduced an automated question-answering service that I suspect will use Wikipedia as one of its resources (if it doesn’t already).

But in the longer term, I think that Google and Yahoo have additional reasons for supporting Wikipedia that have more to do with the methodologies behind complex search and data-mining algorithms, algorithms that need full, free access to fairly reliable (though not necessarily perfect) encyclopedia entries.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11″ and “Laura”). You would still run into some disambiguation problems (”Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization.

I’m sure Google and Yahoo are doing much more complex processes with the tens of gigabtyes of text on Wikipedia than this, but it’s clear from my own work on H-Bot (which uses its own cache of Wikipedia) that having a constantly updated, easily manipulated encyclopedia-like resource is of tremendous value, not just to the millions of people who access Wikipedia every day, but to the search companies that often send traffic in Wikipedia’s direction.

Update [31 Jan 2006]: I’ve run some tests on the data mining example given here in a new post. See Wikipedia vs. Encyclopaedia Britannica for Digital Research.