On this week’s podcast, we were lucky to have a live link to Liam Wyatt in Alexandria, Egypt. Liam is a co-host of Wikipedia Weekly and was attending Wikimania 2008. Tom, Mills, and I covered Wikipedia in the very first episode of Digital Campus, and if anything it has become an even hotter topic on campuses since then. Liam gives us a valuable insider’s view of some of the issues Wikipedia and its community are facing, including questions over authority and internationalization. [Subscribe to this podcast.]
We’re excited to have two terrific guests on the podcast this week, Sunil Iyengar of the National Endowment for the Arts and Matt Kirschenbaum of the University of Maryland. Sunil and Matt debate the NEA’s recent report, To Read or Not To Read, which generated a lot of headlines and hand-wringing when it was released last month. (Blog subscribers may remember my critique of the report.) We also cover Microsoft’s courtship of Yahoo and what it means (if anything) for campuses, provide an update on a problematic U.S. House of Representatives bill, and dissect the new Horizon Report on digital technologies that will affect universities in the coming five years.
What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others reach their fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.
1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.
2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations. While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.
3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provides a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely gives us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.
In my post “Wikipedia vs. Encyclopaedia Britannica for Digital Research”, I asked you to compare two lists of significant keywords and phrases, derived from matching articles on George H. W. Bush in Wikipedia and the Encyclopaedia Britannica. Which one is a better keyword profile—a data mining list that could be used to find other documents on the first President Bush in a sea of documents—and which list do you think was derived from Wikipedia? The people have spoken and it’s time to open the envelope.
Incredibly, as of this writing everyone who has voted has chosen list #2 as being the better of the two, with 79% of the voters believing that this list was extracted from Wikipedia. Well, the majority is half right.
First, a couple of caveats. For some reason Yahoo’s Term Extraction service returned more terms for the second article than the first (I’m not sure why, but my experience has been that the service is fickle in this way). In addition, the second article is much shorter than the first, and Yahoo has a maximum character length for documents it will process. I suspect that the first article was truncated on its way to Yahoo’s server. Regardless, I agree that the second list is better (though it may have been helped by these factors).
But it may surprise some that list #2 comes from the Encyclopaedia Britannica rather than Wikipedia. There are clearly a lot of Wikipedia true believers out there (including, at times, myself). Despite its flaws, however, I still think Wikipedia will probably do just as well for keyword profiling of documents as the Encyclopaedia Britannica. And qualitative considerations are essentially moot since the Encyclopaedia Britannica has rendered itself useless anyway for data-mining purposes by gating its content.
In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source’s openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia’s openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I’m writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo’s Term Extraction web service. Vote on which one you think is a better profile, and I’ll reveal which list belongs to which reference work later this week.
fall of the berlin wall
invasion of panama
president george bush
reunification of germany
joint chiefs of staff
manuel antonio noriega
david h souter
With all of the hoopla over Wikipedia in the recent weeks (covered in two prior posts), most of the mainstream as well as tech media coverage has focused on the openness of the democratic online encyclopedia. Depending on where you stand, this openness creates either a Wild West of publishing, where anything goes and facts are always changeable, or an innovative mode of mostly anonymous collaboration that has managed to construct in just a few years an information resource that is enormous, often surprisingly good, and frequently referenced. But I believe there is another story about Wikipedia that is being missed, a story unrelated to its (perhaps dubious) openness. This story is about Wikipedia being free, in the sense of the open source movement—the fact that anyone can download the entirety of Wikipedia and use it and manipulate it as they wish. And this more hidden story begins when you ask, Why would Google and Yahoo be so interested in supporting Wikipedia?
This year Google and Yahoo pledged to give various freebies to Wikipedia, such as server space and bandwidth (the latter can be the most crippling expense for large, highly trafficked sites with few sources of income). To be sure, both of these behemoth tech companies are filled with geeks who appreciate the anti-authoritarian nature of the Wikipedia project, and probably a significant portion of the urge to support Wikipedia comes from these common sentiments. Of course, it doesn’t hurt that Google and Yahoo buy their bandwidth in bulk and probably have some extra lying around, so to speak.
But Google and Yahoo, as companies at the forefront of search and data-mining technologies and business models, undoubtedly get an enormous benefit from an information resource that is not only open and editable but also free—not just free as in beer but free as in speech. First of all, affiliate companies that Yahoo and Google use to respond to queries, such as Answers.com, primarily use Wikipedia as their main source, benefiting greatly from being able to repackage Wikipedia content (free speech) and from using it without paying (free beer). And Google has recently introduced an automated question-answering service that I suspect will use Wikipedia as one of its resources (if it doesn’t already).
But in the longer term, I think that Google and Yahoo have additional reasons for supporting Wikipedia that have more to do with the methodologies behind complex search and data-mining algorithms, algorithms that need full, free access to fairly reliable (though not necessarily perfect) encyclopedia entries.
Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization.
I’m sure Google and Yahoo are doing much more complex processes with the tens of gigabtyes of text on Wikipedia than this, but it’s clear from my own work on H-Bot (which uses its own cache of Wikipedia) that having a constantly updated, easily manipulated encyclopedia-like resource is of tremendous value, not just to the millions of people who access Wikipedia every day, but to the search companies that often send traffic in Wikipedia’s direction.
Update [31 Jan 2006]: I’ve run some tests on the data mining example given here in a new post. See Wikipedia vs. Encyclopaedia Britannica for Digital Research.
I’m currently working on an article for D-Lib Magazine explaining in greater depth how some of my tools that use search engine APIs work (such as the Syllabus Finder and H-Bot). These APIs, such as the services from Google and Yahoo, allow somewhat more direct access to mammoth web databases than you can get through these companies’ more public web interfaces. I thought it would be helpful for the article to discuss some of the advantages and drawbacks of these services, and was just outlining one of my major disappointments with their programming interfaces—namely, that you can’t run sophisticated text analysis on their servers, but have to do post-processing on your own server once you get a set of results back—when it was announced that Alexa released its Web Search Platform. The AWSP allows you to do just what I’ve been wanting to do on an extremely large (4 billion web page) corpus: scan through it in the same way that employees at Yahoo and Google can do, using advanced algorithms and manipulating as large a set of results as you can handle, rather than mere dozens of relevant pages. Here’s what’s notable about AWSP for researchers and digital humanists.
- Yahoo and Google hobble their APIs by only including a subset of their total web crawl. They seem leery of giving the entire 8 billion pages (in the case of the Google index) to developers. My calculation is that only about 1 in 5 pages in the main Google index makes it into their API index. AWSP provides access to the full crawl on their servers, plus the prior crawl and any crawl in progress. This means that AWSP probably provides the largest dataset researchers can presently access, about 3 times larger than Google or Yahoo (my rough guess from using their APIs is that those datasets are only about 1.5 billion pages, versus about 4 billion for AWSP). It seems ridiculous that this could make a difference (do I really need 250 terabytes of text rather than 75?), but when you’re searching for low-ranking documents like syllabi it could make a big difference. Moreover, with at least two versions of every webpage, it’s conceivable you could write a vertical search engine to compare differences across time on the web.
- They seem to be using a similar setup to the Ning web application environment to allow nonprogrammers to quickly create a specialized search by cloning a similar search that someone else has already developed. No deep knowledge of a programming language needed (possibly…stay tuned).
- You can download entire datasets, no matter how large, something that’s impossible on Yahoo and Google. So rather than doing my own crawl for 600,000 syllabi—which broke our relatively high-powered server—you can have AWSP do it for you and then grab the dataset.
- You can also have AWSP host any search engine you create, which removes a lot of the hassle of setting up a search engine (database software, spider, scripting languages, etc.).
- OK, now the big drawback. As economists say, there’s no such thing as a free lunch. In the case of AWSP, their business model differs from the Google and Yahoo APIs. Google and Yahoo are trying to give developers just enough so that they create new and interesting applications that rely on but don’t compete directly with Google and Yahoo. AWSP charges (unlike Google and Yahoo) for use, though the charges seem modest for a digital humanities application. While a serious new search engine that would data-mine the entire web might cost in the thousands of dollars, my back of the envelope calculation is that it would cost less than $100 (that is, paid to Alexa, aside from the programming time) to reproduce the Syllabus Finder, plus about $100 per year to provide it to users on their server.
I’ll report more details and thoughts as I test the service further.
Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way.
Unfortunately, even after four decades APIs remain much more common in the sciences and the commercial realm—for example, the APIs provided by search behemoths Google and Yahoo—than in the humanities. There are some obvious reasons for this disparity. By supplying an API, the owners of a resource or tool generally bear most of the cost (on their taxed servers, in technical support and staff time) while receiving little or no (immediate) benefit. Moreover, by essentially making an end-run around the common or “official” ways of accessing a tool or project (such as a web search form for a digital archive), an API may devalue the hard work and thoughtfulness put into the more public front end for a digital project. It is perhaps unsurprising that given these costs even Google and Yahoo, which have the financial strength and personnel to provide APIs for their search engines, continue to keep these programs hobbled—after all, programmers can use their APIs to create derivative search engines that compete directly with Google’s or Yahoo’s results pages, with none of the diverting (and profitable) text advertising.
So why should projects in the digital humanities provide APIs, especially given their often limited (or nonexistent) funding compared to a Google or Yahoo? The reason IBM conceived APIs in the first place, and still today the reason many computer scientists find APIs highly beneficial, is that unlike other forms of access they encourage the kind of energetic and creative grass-roots and third-party development that in the long run—after the initial costs borne by the API’s owner—maximize the value and utility of a digital resource or tool. Motivated by many different goals and employing many different methodologies, users of APIs often take digital resources or tools in directions completely unforeseen by their owners. APIs have provided fertile ground for thousands of developers to experiment with the tremendous indices and document caches maintained by Google and Yahoo. New resources based on these APIs appear weekly, some of them hinting at new methods for digital research, data visualization techniques, and novel ways to data-mine texts and synthesize knowledge.
Is it possible—and worthwhile—for digital humanities projects to provide such APIs for their resources and tools? Which resources or tools would be best suited for an API, and how will the creators of these projects sustain such an additional burden? And are there other forms of access or interoperability that have equal or greater benefits with fewer associated costs?