Wikipedia vs. Encyclopaedia Britannica for Digital Research

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.

Article #1
president bush
saddam hussein
fall of the berlin wall
tiananmen square
thanksgiving day
american troops
manuel noriega
halabja
invasion of panama
gulf war
help
saudi arabia
united nations
berlin wall

Article #2
president george bush
george bush
mikhail gorbachev
soviet union
collapse
reunification of germany
thurgood marshall
union
clarence thomas
joint chiefs of staff
cold war
manuel antonio noriega
iraq
george
nonaggression pact
david h souter
antonio noriega
president george

One thought on “Wikipedia vs. Encyclopaedia Britannica for Digital Research

  1. Pingback: Dan Cohen’s Digital Humanities Blog » Blog Archive » Wikipedia and Artificial Intelligence

Leave a Reply

Your email address will not be published. Required fields are marked *