Category Archives: Wikis

Sarah Palin, Crowdsourced

Views of Wikipedia are decidedly mixed in academia, though perhaps trending slowly from mostly negative to grudgingly positive. But regardless of your view of Wikipedia—or your political persuasion—you can’t help but be impressed with the activity that occurs on the site for current events. (The same holds only slightly less true for non-current events, as Roy Rosenzweig pointed out.)

It’s instructive, for instance, to follow at this moment the collaborative production on the open encyclopedia for the entry on Sarah Palin, John McCain’s pick for Vice President. My best guess is that there are currently around 1,000 edits being made each day, by several hundred people. I actually started tracking this before Palin revealed the pregnancy of her teenage daughter, so the frenzy has probably increased, but here’s the schematic I came up with for the progress of the “Sarah Palin” Wikipedia article.

The graphic below shows every edit from 8am EDT on Sunday, August 31, 2008, to 8am EDT on Monday, September 1, 2008. These 24 hours (on a holiday weekend in the U.S.) produced over 500 edits, many of them quite large. The blocks show individual edits, ranging from a single word to three paragraphs. At the same time these edits were being made, scores of Wikipedians were also debating 80 distinct points for inclusion (or exclusion) from the article. They also added over a hundred footnotes pointing to print, Web, and other non-Wikipedia sources (seen at the end of the graphic, right after the “finished” article).

Digital Campus #30 – Live From Egypt!

On this week’s podcast, we were lucky to have a live link to Liam Wyatt in Alexandria, Egypt. Liam is a co-host of Wikipedia Weekly and was attending Wikimania 2008. Tom, Mills, and I covered Wikipedia in the very first episode of Digital Campus, and if anything it has become an even hotter topic on campuses since then. Liam gives us a valuable insider’s view of some of the issues Wikipedia and its community are facing, including questions over authority and internationalization.

Wikipedia and Artificial Intelligence

Two years ago in this space, I mused about the potential of Wikipedia as the foundation of high-quality data-mining and search tools, including for historical research. (I also ran a test, the results of which were mixed but promising.) Now comes a workshop entitled “Wikipedia and Artificial Intelligence: An Evolving Synergy,” at the annual meeting of the Association for the Advancement of Artificial Intelligence. From the call for papers:

As a large-scale repository of structured knowledge, Wikipedia has become a valuable resource for a diverse set of Artificial Intelligence (AI) applications. Major conferences in natural language processing and machine learning have recently witnessed a significant number of approaches that use Wikipedia for tasks ranging from text categorization and clustering to word sense disambiguation, information retrieval, information extraction and question answering…

The goal of the workshop is to foster the research and dissemination of ideas on the mutually beneficial interaction between Wikipedia and AI. The workshop is intended to be highly interdisciplinary…We also encourage participation of researchers from other areas who might benefit from the use of a large body of machine-readable knowledge.

The American Historical Association’s Archives Wiki

The American Historical Association has come up with a great idea for a wiki: a website that details the contents of historical archives around the world and includes information about visiting and using those archives. As with any wiki, historians and other researchers can improve the contents of the site by collaboratively editing pages. The site should prove to be an important resource for scholars to consult before making expensive and time-consuming trips. It launches with information about nearly 100 archives.

The Perfect and the Good Enough: Books and Wikis

As you may have noticed, I haven’t posted to my blog for an entire month. I have a good excuse: I just finished the final edits on my forthcoming book, Equations from God: Pure Mathematics and Victorian Faith, due out early next year. (I realized too late that I could have capitalized on Da Vinci Code fever and called the book The God Code, thus putting an intellectual and cultural history of Victorian mathematics in the hands of numerous unsuspecting Barnes & Noble shoppers.) The process of writing a book has occasionally been compared to pregnancy and childbirth; as the awe-struck husband of a wife who bore twins, I suspect this comparison is deeply flawed. But on a more superficial level, I guess one can say that it’s a long process that produces something of which one can be very proud, but which can involve some painful moments. These labor pains are especially pronounced (at least for me) in the final phase of book production, in which all of the final adjustments are made and tiny little errors (formatting, spelling, grammar) are corrected. From the “final” draft of a manuscript until its appearance in print, this process can take an entire year. Reading Roy Rosenzweig’s thought-provoking article on the production of the Wikipedia, just published in the Journal of American History, was apropos: it got me thinking about the value of this extra year of production work on printed materials and its relationship to what’s going on online now.

Is the time spent getting books as close to perfection as possible worth it? Of course it is. The value of books comes from an implicit contract between the reader and those who produce the book, the author and publisher. The producers ensure, through many cycles of revision, editing, and double checking, that the book contains as few errors as possible and is as cogent and forceful as possible. And the reader comes to a book with an understanding that the pages they are reading entail a tremendous amount of effort to reach near-perfection—thus making the book worthy of careful attention and consideration.

On the other hand, I’ve become increasingly fond of Voltaire’s dictum that “the perfect is the enemy of the good”; that is, in human affairs the (often nearly endless) search for perfection often means you fail to produce a good-enough solution. Roy Rosenzweig and I use the aphorism in Digital History, because there’s so much to learn and tinker with in trying to put history online that if you obsess about it all you will never even get started with a basic website. As it turns out, the history of computing includes many examples of this dynamic. For instance, Ethernet was not as “perfect” a technology as IBM’s Token-Ring, which, as its name implies, passed a “token” around so that every item on a network wouldn’t talk at once and get in each other’s way. But Ethernet was good enough, had decent (but not perfect) solutions to the problems that IBM’s top-notch engineers had elegantly solved, and was cheaper to implement. I suspect you know which technology triumphed.

Roy’s article, “Can History Be Open Source? Wikipedia and the Future of the Past,” suggests that we professional historians (and academics who produce books in general) may be underestimating good-enough online publishing like Wikipedia. Yes, Wikipedia has errors—though not as many as the ivory tower believes. Moreover, it is slowly figuring out how to deal with its imperfections, such as the ability of anyone to come along and edit a topic about which they know nothing, by using fairly sophisticated social and technological methods. Will it ever be as good as a professionally produced book? Probably not. But maybe that’s not the point. (And of course many books are far from perfect too.) Professors need to think carefully about the nature of what they produce given new forms of online production like wikis, rather than simply disparaging them as the province of cranks and amateurs. Finishing a book is as good a time to do that as any.

What Would You Do With a Million Books?

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others reach their fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations. While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provide a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely give us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.
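To make the third point concrete, here is a minimal Python sketch of how aggregation can outvote individual errors. It is only an illustration under stated assumptions: the function name `most_common_year` is hypothetical, the snippets are made-up toy text (one of them deliberately wrong), and simple majority voting stands in for whatever a real tool would do.

```python
import re
from collections import Counter

def most_common_year(snippets):
    """Return the most frequently mentioned four-digit year and its share of all mentions."""
    years = Counter()
    for text in snippets:
        # Collect every plausible year (1000-2099) mentioned in the snippet.
        years.update(re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text))
    if not years:
        return None
    year, count = years.most_common(1)[0]
    return year, count / sum(years.values())

# Hypothetical toy corpus: short, low-quality snippets, one with a wrong date.
snippets = [
    "Charles Darwin was born in 1809 in Shrewsbury.",
    "Darwin (born 1809) sailed on the Beagle.",
    "Born in 1808, Darwin later studied at Edinburgh.",  # deliberately wrong
    "Darwin, b. 1809, naturalist.",
    "Darwin was born on 12 February 1809.",
]
print(most_common_year(snippets))
```

No single snippet here is authoritative, but the wrong date is simply outnumbered. With only a handful of documents a single error could tip the vote; the larger the collection, the less any one poor document matters.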

Wikipedia vs. Encyclopaedia Britannica Keyword Shootout Results

In my post “Wikipedia vs. Encyclopaedia Britannica for Digital Research”, I asked you to compare two lists of significant keywords and phrases, derived from matching articles on George H. W. Bush in Wikipedia and the Encyclopaedia Britannica. Which one is a better keyword profile—a data mining list that could be used to find other documents on the first President Bush in a sea of documents—and which list do you think was derived from Wikipedia? The people have spoken and it’s time to open the envelope.

Incredibly, as of this writing everyone who has voted has chosen list #2 as being the better of the two, with 79% of the voters believing that this list was extracted from Wikipedia. Well, the majority is half right.

First, a couple of caveats. For some reason Yahoo’s Term Extraction service returned more terms for the second article than the first (I’m not sure why, but my experience has been that the service is fickle in this way). In addition, the second article is much shorter than the first, and Yahoo has a maximum character length for documents it will process. I suspect that the first article was truncated on its way to Yahoo’s server. Regardless, I agree that the second list is better (though it may have been helped by these factors).

But it may surprise some that list #2 comes from the Encyclopaedia Britannica rather than Wikipedia. There are clearly a lot of Wikipedia true believers out there (including, at times, myself). Despite its flaws, however, I still think Wikipedia will probably do just as well for keyword profiling of documents as the Encyclopaedia Britannica. And qualitative considerations are essentially moot since the Encyclopaedia Britannica has rendered itself useless anyway for data-mining purposes by gating its content.

Wikipedia vs. Encyclopaedia Britannica for Digital Research

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.

Article #1
president bush
saddam hussein
fall of the berlin wall
tiananmen square
thanksgiving day
american troops
manuel noriega
halabja
invasion of panama
gulf war
help
saudi arabia
united nations
berlin wall

Article #2
president george bush
george bush
mikhail gorbachev
soviet union
collapse
reunification of germany
thurgood marshall
union
clarence thomas
joint chiefs of staff
cold war
manuel antonio noriega
iraq
george
nonaggression pact
david h souter
antonio noriega
president george
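
For readers curious how lists like these might be produced, here is a rough Python sketch of a keyword extractor. It is emphatically not Yahoo's algorithm, whose internals are not described here; it is a crude frequency-based stand-in, and the function name `extract_terms`, the tiny stopword list, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real extractor would use a much larger one.
STOPWORDS = {
    "the", "of", "and", "a", "an", "in", "to", "was", "he", "his",
    "on", "for", "with", "as", "by", "at", "that", "is",
}

def extract_terms(text, max_terms=10):
    """Return the most frequent one- and two-word phrases that contain no stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word in STOPWORDS:
            continue
        counts[word] += 1
        # Count the two-word phrase starting here if the next word is also meaningful.
        if i + 1 < len(words) and words[i + 1] not in STOPWORDS:
            counts[word + " " + words[i + 1]] += 1
    return [term for term, _ in counts.most_common(max_terms)]

sample = "The Gulf War began in 1991. The Gulf War ended quickly. Gulf War veterans returned."
print(extract_terms(sample, max_terms=3))
```

Even this naive approach surfaces phrases alongside single words, which hints at why fragments such as "george" and "president george" can appear next to fuller names in the lists above.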

The Wikipedia Story That’s Being Missed

With all of the hoopla over Wikipedia in recent weeks (covered in two prior posts), most of the mainstream as well as tech media coverage has focused on the openness of the democratic online encyclopedia. Depending on where you stand, this openness creates either a Wild West of publishing, where anything goes and facts are always changeable, or an innovative mode of mostly anonymous collaboration that has managed to construct in just a few years an information resource that is enormous, often surprisingly good, and frequently referenced. But I believe there is another story about Wikipedia that is being missed, a story unrelated to its (perhaps dubious) openness. This story is about Wikipedia being free, in the sense of the open source movement—the fact that anyone can download the entirety of Wikipedia and use it and manipulate it as they wish. And this more hidden story begins when you ask, Why would Google and Yahoo be so interested in supporting Wikipedia?

This year Google and Yahoo pledged to give various freebies to Wikipedia, such as server space and bandwidth (the latter can be the most crippling expense for large, highly trafficked sites with few sources of income). To be sure, both of these behemoth tech companies are filled with geeks who appreciate the anti-authoritarian nature of the Wikipedia project, and probably a significant portion of the urge to support Wikipedia comes from these common sentiments. Of course, it doesn’t hurt that Google and Yahoo buy their bandwidth in bulk and probably have some extra lying around, so to speak.

But Google and Yahoo, as companies at the forefront of search and data-mining technologies and business models, undoubtedly get an enormous benefit from an information resource that is not only open and editable but also free—not just free as in beer but free as in speech. First of all, affiliate companies that Yahoo and Google use to respond to queries, such as Answers.com, use Wikipedia as their main source, benefiting greatly from being able to repackage Wikipedia content (free speech) and from using it without paying (free beer). And Google has recently introduced an automated question-answering service that I suspect will use Wikipedia as one of its resources (if it doesn’t already).

But in the longer term, I think that Google and Yahoo have additional reasons for supporting Wikipedia that have more to do with the methodologies behind complex search and data-mining algorithms, algorithms that need full, free access to fairly reliable (though not necessarily perfect) encyclopedia entries.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization.
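
The idea can be sketched in a few lines of Python. This is a minimal illustration, not H-Bot's code or any search company's actual method: the two profiles are typed in by hand from the example terms in the paragraph above rather than extracted from Wikipedia, and the names `PROFILES`, `score`, and `classify` are hypothetical.

```python
# Hand-typed keyword profiles; in practice these would be extracted from
# the encyclopedia entries for each president (an assumption of this sketch).
PROFILES = {
    "George H. W. Bush": ["berlin wall", "barbara", "saddam hussein", "iraq", "dick cheney"],
    "George W. Bush": ["september 11", "laura", "saddam hussein", "iraq", "dick cheney"],
}

def score(text, terms):
    """Count how many profile terms appear in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in terms if term in lowered)

def classify(text, profiles=PROFILES):
    """Return the best-matching profile name, or None on a tie or no match."""
    scores = {name: score(text, terms) for name, terms in profiles.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return None  # the shared terms alone cannot disambiguate
    return max(scores, key=scores.get)

print(classify("Bush spoke after the fall of the Berlin Wall, with Barbara at his side."))
print(classify("Bush, Dick Cheney, and the question of Saddam Hussein in Iraq."))
```

The second call returns `None` because every term it matches belongs to both profiles, which is exactly the disambiguation problem noted above. A fuller version would weight distinctive terms more heavily than shared ones.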

I’m sure Google and Yahoo are running much more complex processes on the tens of gigabytes of text on Wikipedia than this, but it’s clear from my own work on H-Bot (which uses its own cache of Wikipedia) that having a constantly updated, easily manipulated encyclopedia-like resource is of tremendous value, not just to the millions of people who access Wikipedia every day, but to the search companies that often send traffic in Wikipedia’s direction.

Update [31 Jan 2006]: I’ve run some tests on the data mining example given here in a new post. See Wikipedia vs. Encyclopaedia Britannica for Digital Research.