Category Archives: Text Mining

Wikipedia vs. Encyclopaedia Britannica Keyword Shootout Results

In my post “Wikipedia vs. Encyclopaedia Britannica for Digital Research”, I asked you to compare two lists of significant keywords and phrases, derived from matching articles on George H. W. Bush in Wikipedia and the Encyclopaedia Britannica. Which one is a better keyword profile—a data mining list that could be used to find other documents on the first President Bush in a sea of documents—and which list do you think was derived from Wikipedia? The people have spoken and it’s time to open the envelope.

Incredibly, as of this writing everyone who has voted has chosen list #2 as being the better of the two, with 79% of the voters believing that this list was extracted from Wikipedia. Well, the majority is half right.

First, a couple of caveats. For some reason Yahoo’s Term Extraction service returned more terms for the second article than the first (I’m not sure why, but my experience has been that the service is fickle in this way). In addition, the second article is much shorter than the first, and Yahoo has a maximum character length for documents it will process. I suspect that the first article was truncated on its way to Yahoo’s server. Regardless, I agree that the second list is better (though it may have been helped by these factors).

But it may surprise some that list #2 comes from the Encyclopaedia Britannica rather than Wikipedia. There are clearly a lot of Wikipedia true believers out there (including, at times, myself). Despite its flaws, however, I still think Wikipedia will probably do just as well for keyword profiling of documents as the Encyclopaedia Britannica. And qualitative considerations are essentially moot since the Encyclopaedia Britannica has rendered itself useless anyway for data-mining purposes by gating its content.

Wikipedia vs. Encyclopaedia Britannica for Digital Research

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.

Article #1
president bush
saddam hussein
fall of the berlin wall
tiananmen square
thanksgiving day
american troops
manuel noriega
invasion of panama
gulf war
saudi arabia
united nations
berlin wall

Article #2
president george bush
george bush
mikhail gorbachev
soviet union
reunification of germany
thurgood marshall
clarence thomas
joint chiefs of staff
cold war
manuel antonio noriega
nonaggression pact
david h souter
antonio noriega
president george

How Much Google Knows About You

As the U.S. Justice Department put pressure on Google this week to hand over their search records in a questionable pursuit of evidence for an overturned pornography law, I wondered: How much information does Google really know about us? Strangely, at nearly the same time an email arrived from Google (one of the Google Friends Newsletters) telling me that they had just launched Google Personal Search Trends. Someone in the legal department must not have vetted that email: Google Personal Search Trends reveals exactly how much they know about you. So, how much?

A lot. If you have a Google account (you have one if you have a software developer’s username, a Gmail account, or other Google service account), you can login to your Personal Search Trends page and find out. I logged in and even though I’ve never checked a box or filled out a consent form saying that I don’t mind if Google collects information about my search habits, there appeared a remarkable and slightly unsettling series of charts and tables about me and what I’m interested in.

You can discover not only your top 10 search phrases but also the top 10 sites you visit and the top 10 links you click on. Like Santa, Google knows when you are awake and when you are sleeping—amazingly, no searches for me between midnight and 6 AM ET over the past 12 months. And comparing my search habits with its vast database of users, Google Personal Search Trends tells me that I might also like go to websites on RSS, Charles Dickens, Frankenstein, search engine optimization, and Virginia Tech football. (It’s very wrong about that last one, which I hope it only derives from my search terms and websites visited and not also from the IP address of my laptop in an office on the campus of a Virginia state university.)

Of course, you begin to wonder: wouldn’t someone else like to see this same set of charts and tables? Couldn’t they glean a tremendous amount of information about me? This disturbing feeling grows when you do some more investigation of what Google’s storing on your hard drive in addition to theirs. For instance, if you use Google’s Book Search, they know through a cookie stored on your computer which books you’ve looked at—as well as how many pages of each book (so they can block you from reading too much of a copyrighted book).

Seems like the time is ripe for Google to offer its users a similar deal to the one TiVo has had for years: If you want us to provide the “best” search experience—extras in addition to the basic web search such as personalized search results and recommendations based on what you seem to like—you must provide us with some identifying information; if you want to search the web without these extras, then so be it—we’ll only save your searches on a fully anonymous basis for our internal research. Surely when government entities and private investigators hear about Google Personal Search Trends, they’ll want to have a look. One suspects that in China and perhaps the United States too, someone’s already doing just that.

10 Most Popular History Syllabi

My Syllabus Finder search engine has been in use for three years now, and I thought it would be interesting to look back at the nearly half-million searches and 640,000 syllabi it has handled to see which syllabi have been the most popular. The following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), read a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and “attractiveness” of a syllabus (defined as the ratio of full reads to mere glances). Here are the most popular history syllabi on the web.

#1 – U.S. History to 1870 (Eric Mayer, Victor Valley College, total of 6104 points)

#2 – America in the Progressive Era (Robert Bannister, Swarthmore College, 6000 points)

#3 – The American Colonies (Bruce Dorsey, Swarthmore College, 5589 points)

#4 – The American Civil War (Sheila Culbert, Dartmouth College, 5521 points)

#5 – Early Modern Europe (Andrew Plaa, Columbia University, 5485 points)

#6 – The United States since 1945 (Robert Griffith, American University, 5109 points)

#7 – American Political and Social History II (Robert Dykstra, University at Albany, State University of New York, 5048 points)

#8 – The World Since 1500 (Sarah Watts, Wake Forest University, 4760 points)

#9 – The Military and War in America (Nicholas Pappas, Sam Houston State University, 4740 points)

#10 – World Civilization I (Jim Jones, West Chester University of Pennsylvania, 4636 points)

This is, of course, a completely unscientific study. It obviously gives an advantage to older syllabi, since those courses have been online longer and thus could show up in search results for several years. On the other hand, the ten syllabi listed here range almost uniformly from 1998 to 2005.

Whatever its faults, the study does provide a good sense of the most visible and viewed syllabi on the web (high Google rankings help these syllabi get into a lot of Syllabus Finder search results), and I hope it provides a sense of the kinds of syllabi people frequently want to consult (or crib)—mostly introductory courses in American history. The variety of institutions represented is also notable (and holds true beyond the top ten; no domination by, e.g., Ivy League schools). I’ll probably do some more sophisticated analyses when I have the time; if there’s interest from this blog’s audience I’ll calculate the most popular history syllabi from 2005 courses, or the top ten for other topics. If you would like to read a far more elaborate (and scientific) data-mining study I did using the Syllabus Finder, please take a look at “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.”

[How the rankings were determined: 1 point was awarded for each time a syllabus showed up in a Syllabus Finder search result; 10 points were awarded for each time a Syllabus Finder user clicked through to view the entire syllabus; 100 points were awarded for each percent of “attractiveness,” where 100% attractive meant that every time a syllabus made an appearance in a search result it was clicked on for further information. For instance, the top syllabus appeared in 1211 searches and was clicked on 268 times (22.13% of the searches), for a point total of 1211 + (268 X 10) + (22.13 X 100) = 6104.]

The Wikipedia Story That’s Being Missed

With all of the hoopla over Wikipedia in the recent weeks (covered in two prior posts), most of the mainstream as well as tech media coverage has focused on the openness of the democratic online encyclopedia. Depending on where you stand, this openness creates either a Wild West of publishing, where anything goes and facts are always changeable, or an innovative mode of mostly anonymous collaboration that has managed to construct in just a few years an information resource that is enormous, often surprisingly good, and frequently referenced. But I believe there is another story about Wikipedia that is being missed, a story unrelated to its (perhaps dubious) openness. This story is about Wikipedia being free, in the sense of the open source movement—the fact that anyone can download the entirety of Wikipedia and use it and manipulate it as they wish. And this more hidden story begins when you ask, Why would Google and Yahoo be so interested in supporting Wikipedia?

This year Google and Yahoo pledged to give various freebies to Wikipedia, such as server space and bandwidth (the latter can be the most crippling expense for large, highly trafficked sites with few sources of income). To be sure, both of these behemoth tech companies are filled with geeks who appreciate the anti-authoritarian nature of the Wikipedia project, and probably a significant portion of the urge to support Wikipedia comes from these common sentiments. Of course, it doesn’t hurt that Google and Yahoo buy their bandwidth in bulk and probably have some extra lying around, so to speak.

But Google and Yahoo, as companies at the forefront of search and data-mining technologies and business models, undoubtedly get an enormous benefit from an information resource that is not only open and editable but also free—not just free as in beer but free as in speech. First of all, affiliate companies that Yahoo and Google use to respond to queries, such as, primarily use Wikipedia as their main source, benefiting greatly from being able to repackage Wikipedia content (free speech) and from using it without paying (free beer). And Google has recently introduced an automated question-answering service that I suspect will use Wikipedia as one of its resources (if it doesn’t already).

But in the longer term, I think that Google and Yahoo have additional reasons for supporting Wikipedia that have more to do with the methodologies behind complex search and data-mining algorithms, algorithms that need full, free access to fairly reliable (though not necessarily perfect) encyclopedia entries.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization.

I’m sure Google and Yahoo are doing much more complex processes with the tens of gigabtyes of text on Wikipedia than this, but it’s clear from my own work on H-Bot (which uses its own cache of Wikipedia) that having a constantly updated, easily manipulated encyclopedia-like resource is of tremendous value, not just to the millions of people who access Wikipedia every day, but to the search companies that often send traffic in Wikipedia’s direction.

Update [31 Jan 2006]: I’ve run some tests on the data mining example given here in a new post. See Wikipedia vs. Encyclopaedia Britannica for Digital Research.