Category: Information Theory

The Vision of ORE

One form of serious intellectual work that could use much more respect and appreciation within the humanities is the often unglamorous—but occasionally revolutionary—work of creating technical standards. At their best, such standards transcend the code itself to envision new forms of human interaction or knowledge creation that would not be possible without a lingua franca. We need only think of the web; look at what the modest HTML 1.0 spec has wrought.

The Object Reuse and Exchange (ORE) specification that was unveiled today at Johns Hopkins University has, beyond all of the minute technical details, a very clear and powerful vision of scholarly research and communication in a digital age. It is thus worth following the specification as it moves toward a final version in the fall of 2008, and beginning to think about how we might use it in the humanities (even though it will undoubtedly be adopted faster in the sciences).

The vision put forth by Carl Lagoze, Herbert Van de Sompel, and others in the ORE working group tries, for the first time, to map the true nature of contemporary scholarship onto the web. The ORE community realized in 2006 that neither basic web pages nor advanced digital repositories truly capture today’s scholarship.

This scholarship cannot be contained by web pages or PDFs put into an institutional repository, but rather consists of what the ORE team has termed “aggregates,” or constellations of digital objects that often span many different web servers and repositories. For instance, a contemporary astronomy article might consist of a final published PDF, its metadata (author, title, publication info, etc.), some internal images, and then—here’s the important part—datasets, telescope imagery, charts, several publicly available drafts, and other matter (often held by third parties) that does not end up in the PDF. Similarly, an article in art history might consist of the historian’s text, paintings that were consulted in a museum, low-resolution copies of those paintings that are available online (perhaps a set of photos on Flickr of the referenced paintings), citations to other works, and perhaps an associated slide show.

How can one reliably reference and take full advantage of such scholarly constellations given the current state of the web? As Herbert Van de Sompel put it, ORE tries to specify, in a commonsensical way, “identified, bounded aggregations of related objects that form a logical whole.” In other words, ORE attempts to shift the focus from repositories for scholarship to the complex products of scholarship themselves.

By forging semantic links between the pieces that make up a work of scholarship, ORE keeps those links active and dynamic and allows humans, as well as machines that wish to make connections, to find these related objects easily. It also provides a much better preservation path for digital scholarship, because repositories can use ORE to capture the entirety of a work and its associated constellation rather than grabbing just a single published instantiation of the work.

The implementation of ORE is perhaps less commonsensical for those who do not wish to dive into lots of semantic web terms and markup languages, but put simply, the approach the ORE group has taken is to provide a permanent locator (i.e., a URI, like a web address) that links to what they call a “resource map,” which in turn describes an aggregation. Think of a constellation in the night sky. We have Orion, which consists of certain stars; a star map specifies which stars comprise Orion and where to find each of them. The creators of ORE have chosen to use widely adopted formats like RDF and Atom to “serialize” (or make available in a machine-readable and easily exchangeable text format) their resource maps. [Geeks can read the full specification in their user guide.]
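To make that less abstract, here is a minimal sketch, in Python with the rdflib library, of what a resource map boils down to. The URIs and aggregated objects below are invented for illustration; only the vocabulary terms (ore:ResourceMap, ore:Aggregation, ore:describes, ore:aggregates) come from the ORE namespace, and the specification’s own examples remain the authoritative reference.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# The resource map (the "star map") and the aggregation (the "constellation") it describes.
# Both URIs are hypothetical.
rem = URIRef("http://example.org/article-123/rem")
aggregation = URIRef("http://example.org/article-123/aggregation")

g = Graph()
g.bind("ore", ORE)
g.bind("dcterms", DCTERMS)

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, aggregation))
g.add((aggregation, RDF.type, ORE.Aggregation))
g.add((aggregation, DCTERMS.title, Literal("An astronomy article and its related objects")))

# The aggregated objects can live on entirely different servers (all invented here).
for member in [
    "http://publisher.example.com/article-123.pdf",   # the published PDF
    "http://observatory.example.org/dataset-456",     # a telescope dataset
    "http://repository.example.edu/draft-v2.pdf",     # a publicly available draft
]:
    g.add((aggregation, ORE.aggregates, URIRef(member)))

# Serialize the resource map; RDF/XML here, though ORE also defines an Atom serialization.
print(g.serialize(format="xml"))
```

A harvester or repository that dereferences the resource map’s URI gets exactly this sort of document and can then follow the ore:aggregates links to gather the whole constellation.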

This afternoon several compelling examples of ORE in action were presented. Ray Plante of the NCSA and National Virtual Observatory showed how astronomers could use ORE and a wiki to create aggregates and updates about unusual events like supernovas, as different observatories add links to images and findings about each event (again, think of Van de Sompel’s “logical whole”). Several presenters mentioned our Zotero project as an ideal use case for ORE, since it already downloads associated objects as part of a single parent item (e.g., it stores metadata, a link to the page from which it captured an item, and perhaps a PDF or web snapshot). Zotero is already ORE Lite, in a way, and it will be good to try out a full Zotero translator for ORE resource maps that would permit Zotero users to grab aggregates for their research and subsequently publish aggregates back onto the web—object reuse and exchange in action.
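On the consuming side, the core of such a translator is easy to sketch. A real Zotero translator would be written in JavaScript inside Zotero itself; the short Python fragment below (with an invented resource map URL) simply fetches a resource map, finds the aggregation it describes, and lists the aggregated objects so that a client could download and file them alongside the parent item.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
# Fetch and parse a (hypothetical) resource map serialized as RDF/XML.
g.parse("http://example.org/article-123/rem", format="xml")

for aggregation in g.subjects(RDF.type, ORE.Aggregation):
    print("Aggregation:", aggregation)
    for member in g.objects(aggregation, ORE.aggregates):
        # Each member is a URI the client could now retrieve and attach to the item.
        print("  aggregates:", member)
```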

Obviously it’s still very early and the true impact of ORE remains to be seen. But it would be a shame if humanities scholars failed to participate in the creation of scholarly standards like ORE, or to help envision their uses in research, communication, and collaboration.

There has been much talk recently of the social graph, the network of human connections that sites like Facebook bring to light and take advantage of. If widely adopted, ORE could help create the scholarly graph, the networked relations of scholars, publications, and resources.

What Would You Do With a Million Books?

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others come to fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it is central to the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those restricted resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), but one can also instruct a program to download the entire Wikipedia, store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations (a rough sketch of that kind of bulk processing follows this list). While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provide a host of helpful apparatus, such as notes for checking sources and an index, while the latter merely give us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor (the second sketch below illustrates the point).
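As a concrete (if simplistic) illustration of point 2, here is a minimal Python sketch of the kind of bulk processing that an open resource permits and a gated one forbids: stream a locally downloaded copy of the Wikipedia XML dump and count how many articles mention a term. The filename is a placeholder for whichever pages-articles dump you have fetched, and the handling of the dump format is an assumption worth checking against the current dumps.

```python
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-pages-articles.xml.bz2"  # placeholder for a locally downloaded dump
TERM = "supernova"

matches = 0
with bz2.open(DUMP, "rb") as f:
    # Stream the XML so the whole corpus never has to fit in memory.
    for _, elem in ET.iterparse(f, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the MediaWiki XML namespace, if any
        if tag == "text" and elem.text and TERM in elem.text:
            matches += 1
        elem.clear()  # discard each element once we have looked at it

print(matches, "articles mention", TERM)
```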
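And a toy illustration of point 3, not drawn from the article itself: ask many noisy documents the same factual question and take the most common answer. Any single page may be garbled or simply wrong, but the aggregate tends to converge on the right date. The snippets below are invented stand-ins for pages in a huge, uneven corpus.

```python
import re
from collections import Counter

# Invented stand-ins for thousands of scanned pages of varying quality.
documents = [
    "The Battle of Hastings was fought in 1066.",
    "In 1066 William the Conqueror defeated Harold at Hastings.",
    "A badly scanned page gives the date of Hastings as 1866.",
    "Hastings, 1066: the Norman conquest of England begins.",
]

# Collect every plausible year that appears near the topic and vote.
years = Counter()
for doc in documents:
    if "Hastings" in doc:
        years.update(re.findall(r"\b1\d{3}\b", doc))

print(years.most_common(1))  # [('1066', 3)] -- the stray error washes out
```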

No Computer Left Behind

In this week’s issue of the Chronicle of Higher Education, Roy Rosenzweig and I elaborate on the implications of my H-Bot software, and of similar data-mining services and the web in general. “No Computer Left Behind” (cover story in the Chronicle Review; alas, subscription required, though here’s a copy at CHNM) is somewhat more polemical than our recent article in First Monday (“Web of Lies? Historical Knowledge on the Internet”). In short, we argue that just as the calculator—an unavoidable modern technology—muscled its way into the mathematics exam room, devices to access and quickly scan the vast store of historical knowledge on the Internet (such as PDAs and smart phones) will inevitably disrupt the testing—and thus instruction—of humanities subjects. As the editors of the Chronicle put it in their headline: “The multiple-choice test is on its deathbed.” This development is to be praised; just as the teaching of mathematics should be about higher principles rather than the rote memorization of multiplication tables, the teaching of subjects like history should be freed by new technologies to focus once again (as it was before a century of multiple-choice exams) on more important principles such as the analysis and synthesis of primary sources. Here are some excerpts from the article.

“What if students will have in their pockets a device that can rapidly and accurately answer, say, multiple-choice questions about history? Would teachers start to face a revolt from (already restive) students, who would wonder why they were being tested on their ability to answer something that they could quickly find out about on that magical device?

“It turns out that most students already have such a device in their pockets, and to them it’s less magical than mundane. It’s called a cellphone. That pocket communicator is rapidly becoming a portal to other simultaneously remarkable and commonplace modern technologies that, at least in our field of history, will enable the devices to answer, with a surprisingly high degree of accuracy, the kinds of multiple-choice questions used in thousands of high-school and college history classes, as well as a good portion of the standardized tests that are used to assess whether the schools are properly “educating” our students. Those technological developments are likely to bring the multiple-choice test to the brink of obsolescence, mounting a substantial challenge to the presentation of history—and other disciplines—as a set of facts or one-sentence interpretations and to the rote learning that inevitably goes along with such an approach…

“At the same time that the Web’s openness allows anyone access, it also allows any machine connected to it to scan those billions of documents, which leads to the second development that puts multiple-choice tests in peril: the means to process and manipulate the Web to produce meaningful information or answer questions. Computer scientists have long dreamed of an adequately large corpus of text to subject to a variety of algorithms that could reveal underlying meaning and linkages. They now have that corpus, more than large enough to perform remarkable new feats through information theory.

“For instance, Google researchers have demonstrated (but not yet released to the general public) a powerful method for creating ‘good enough’ translations—not by understanding the grammar of each passage, but by rapidly scanning and comparing similar phrases on countless electronic documents in the original and second languages. Given large enough volumes of words in a variety of languages, machine processing can find parallel phrases and reduce any document into a series of word swaps. Where once it seemed necessary to have a human being aid in a computer’s translating skills, or to teach that machine the basics of language, swift algorithms functioning on unimaginably large amounts of text suffice. Are such new computer translations as good as a skilled, bilingual human being? Of course not. Are they good enough to get the gist of a text? Absolutely. So good the National Security Agency and the Central Intelligence Agency increasingly rely on that kind of technology to scan, sort, and mine gargantuan amounts of text and communications (whether or not the rest of us like it).

“As it turns out, ‘good enough’ is precisely what multiple-choice exams are all about. Easy, mechanical grading is made possible by restricting possible answers, akin to a translator’s receiving four possible translations for a sentence. Not only would those four possibilities make the work of the translator much easier, but a smart translator—even one with a novice understanding of the translated language—could home in on the correct answer by recognizing awkward (or proper) sounding pieces in each possible answer. By restricting the answers to certain possibilities, multiple-choice questions provide a circumscribed realm of information, where subtle clues in both the question and the few answers allow shrewd test takers to make helpful associations and rule out certain answers (for decades, test-preparation companies like Kaplan Inc. have made a good living teaching students that trick). The ‘gaming’ of a question can occur even when the test taker doesn’t know the correct answer and is not entirely familiar with the subject matter…

“By the time today’s elementary-school students enter college, it will probably seem as odd to them to be forbidden to use digital devices like cellphones, connected to an Internet service like H-Bot, to find out when Nelson Mandela was born as it would be to tell students now that they can’t use a calculator to do the routine arithmetic in an algebra equation. By providing much more than just an open-ended question, multiple-choice tests give students—and, perhaps more important in the future, their digital assistants—more than enough information to retrieve even a fairly sophisticated answer from the Web. The genie will be out of the bottle, and we will have to start thinking of more meaningful ways to assess historical knowledge or ‘ignorance.'”
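Because the excerpt compresses a great deal of technique into a few sentences, here is a deliberately crude sketch of the statistical idea behind such “good enough” translation: learn word correspondences from nothing but aligned phrase pairs, then translate by swapping words, with no grammar in sight. The three-pair parallel corpus is invented, and real systems work from billions of words, which is precisely the point about scale.

```python
from collections import Counter, defaultdict

# A tiny invented parallel corpus of aligned phrases (French, English).
parallel = [
    ("le chat noir", "the black cat"),
    ("le chien noir", "the black dog"),
    ("le chat blanc", "the white cat"),
]

cooc = defaultdict(Counter)  # source word -> counts of target words seen alongside it
target_freq = Counter()      # how many phrase pairs each target word appears in

for src, tgt in parallel:
    tgt_words = set(tgt.split())
    target_freq.update(tgt_words)
    for s in set(src.split()):
        cooc[s].update(tgt_words)

def best_match(source_word):
    # Prefer target words that appear (almost) only when the source word does.
    candidates = cooc[source_word]
    return max(candidates, key=lambda t: (candidates[t] / target_freq[t], candidates[t]))

def translate(sentence):
    return " ".join(best_match(w) for w in sentence.split())

print(translate("le chien blanc"))  # -> "the dog white": wrong order, but the gist survives
```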
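In the same spirit, here is a toy version of the multiple-choice “gaming” described above. H-Bot draws its counts from web searches rather than a local corpus; a tiny invented corpus stands in for the web here. The idea is simply to score each option by how often it appears in documents that also contain the question’s key terms, and to pick the most frequent.

```python
from collections import Counter

# Invented stand-ins for documents found on the open web.
corpus = [
    "Nelson Mandela was born on 18 July 1918.",
    "Mandela, born in 1918, led the anti-apartheid movement.",
    "In 1994 Mandela became president of South Africa.",
]

question_terms = ["Mandela", "born"]        # key terms from the question stem
options = ["1916", "1918", "1920", "1994"]  # the multiple-choice answers

scores = Counter()
for doc in corpus:
    if all(term in doc for term in question_terms):
        for option in options:
            if option in doc:
                scores[option] += 1

print(scores.most_common(1))  # [('1918', 2)] -- 'good enough' without understanding a thing
```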

Nature Compares Science Entries in Wikipedia with Encyclopaedia Britannica

In an article published tomorrow, but online now, the journal Nature reveals the results of a (relatively small) study it conducted to compare the accuracy of Wikipedia with Encyclopaedia Britannica—at least in the natural sciences. The results may strike some as surprising.

As Jim Giles summarizes in the special report: “Among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three…Only eight serious errors, such as misinterpretations of important concepts, were detected in the pairs of articles reviewed, four from each encyclopaedia. But reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica, respectively.”

These results, obtained by sending experts such as the Princeton historian of science Michael Gordin matching entries from the democratic/anarchical online source and the highbrow, edited reference work and having them go over the articles with a fine-toothed comb, should feed into the current debate over the quality of online information. My colleague Roy Rosenzweig has written a much more in-depth (and illuminating) comparison of Wikipedia with print sources in history, due out next year in the Journal of American History, which should spark an important debate in the humanities. I suspect that the Wikipedia articles in history are somewhat different from those in the sciences—it seems from Nature’s survey that there may be more professional scientists contributing to Wikipedia than professional historians—but a couple of the basic conclusions are the same: the prose on Wikipedia is not so terrific, but most of its facts are indeed correct, to a far greater extent than Wikipedia’s critics would like to admit.

First Monday is Second Tuesday This Month

For those who have been asking about the article I wrote with Roy Rosenzweig on the reliability of historical information on the web (summarized in a previous post), it has just appeared on the First Monday website, perhaps a little belatedly given the name of the journal.