Category: Mashups

Zotero Needs Your Help, Part II

In my prior post on this topic, I mentioned the (paid) positions now available at the Center for History and New Media to work on and promote Zotero. (By the way, there’s still time to contact us if you’re interested; we just started reviewing applications, but hurry.) But Zotero is moving ahead on so many fronts that its success depends not only on those working on it full time, but also on those who appreciate the software and want to help out in other ways. Here are some (unpaid, but feel-good) ways you can get involved.

If you are a librarian, instructional technologist, or anyone else on a campus or at an institution that uses citation software like EndNote or RefWorks, please consider becoming an informal campus representative for Zotero. As part of our effort to provide a free competitor to these other software packages, we need people to spread the word, give short introductions to Zotero, and generally serve as local “evangelists.” Already, two dozen librarians who have tried Zotero and think it could be a great solution for students, staff, and faculty on their campuses have volunteered to help out in this role. If you’re interested in joining them, please contact campus-reps@zotero.org.

We are currently in the process of writing up instructions (and possibly creating some additional software) to make creating Zotero translators and citation style formatters easier. Translators are small bits of code that enable Zotero to recognize citation information on a web page; we have translators for specific sites (like Amazon.com) as well as broader ones that recognize certain common standards (like MARC records or embedded microformats). Style formatters take items in your Zotero library and reformat them into specific disciplinary or journal standards (e.g., APA, MLA, etc.). Right now creating translators takes a fair amount of technical knowledge (using things like XPath and JavaScript), so if you’re feeling plucky and have some software skills, email translators@zotero.org to get started on a translator for a specific collection or resource (or you can wait until we have better tools for creating translators). If you have some familiarity with XML and citation formatting, please contact styles@zotero.org if you’re interested in contributing a style formatter. We figure that if EndNote can get their users to contribute hundreds of style formatters for free, we should be able to do the same for translators and styles in the coming year.
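
To give a rough sense of what a translator does under the hood, here is a minimal sketch of XPath-based extraction of citation data from a web page. It is illustrative only: the page structure, class names, and function names are all hypothetical, and this is not Zotero’s actual translator interface (email translators@zotero.org for the real details).

```javascript
// Illustrative sketch of the kind of XPath scraping a Zotero translator performs.
// The class names and field names below are hypothetical, not Zotero's actual API.
function extractCitation(doc) {
  // Evaluate an XPath expression against the page and return the first match's text.
  function xpathText(expr) {
    var result = doc.evaluate(expr, doc, null,
        XPathResult.FIRST_ORDERED_NODE_TYPE, null);
    var node = result.singleNodeValue;
    return node ? node.textContent.replace(/\s+/g, " ").replace(/^ | $/g, "") : null;
  }

  // Pull a few standard fields from a (hypothetical) library catalog record page.
  return {
    itemType: "book",
    title: xpathText('//span[@class="title"]'),
    author: xpathText('//span[@class="author"]'),
    date: xpathText('//span[@class="pubdate"]')
  };
}
```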

One of our slogans for Zotero is “Citation management is only the beginning.” That will become increasingly obvious over the coming months as third-party developers (and the Zotero team) begin writing what we’re calling utilities, or little widgets that use Zotero’s location in the web browser to send and receive information across the web. Want to pull out all of the place names in a document and map them on Google Maps? Want to send del.icio.us a notice every time you tag something in Zotero? Want to send text from a Zotero item to an online translation service? All of this functionality will be relatively trivial in the near future. If you’re familiar with some of the browser technologies we use and that are common with Web 2.0 mashups and APIs and would like to write a Zotero utility, please contact utilities@zotero.org.
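
To suggest how little code such a widget might need, here is a minimal sketch of a utility that pushes a newly saved item to an external web service. Everything in it is invented for illustration: the endpoint URL, the shape of the item object, and the function name are placeholders, not part of Zotero’s actual interface.

```javascript
// Hypothetical sketch of a Zotero "utility": push a newly tagged item to an
// external web service. The endpoint and item fields are invented for illustration.
// (Privileged extension code in Firefox can make cross-site requests like this;
// a plain web page could not.)
function notifyService(item) {
  var params = "title=" + encodeURIComponent(item.title) +
               "&url="  + encodeURIComponent(item.url) +
               "&tags=" + encodeURIComponent(item.tags.join(","));

  var request = new XMLHttpRequest();
  request.open("POST", "https://example.org/api/notify", true); // hypothetical endpoint
  request.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
  request.onload = function () {
    if (request.status !== 200) {
      console.error("Notification failed: " + request.status);
    }
  };
  request.send(params);
}

// Example call with a minimal, made-up item record:
notifyService({
  title: "A sample item",
  url: "http://example.org/catalog/12345",
  tags: ["history", "telegraphy"]
});
```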

More generally, if you are a software developer and either would like to help with development or would like to receive news about the technical side of the Zotero project, please contact dev@zotero.org.

With Firefox 2.0 apparently going out of beta into full release next Thursday (October 26, 2006), it’s a great time to start talking up the powerful combination of Firefox 2.0 and Zotero (thanks, Lifehacker and the Examiner!).

Mapping Recent History

As the saying goes, imitation is the sincerest form of flattery. So at the Center for History and New Media, we’re currently feeling extremely flattered that our initiatives in collecting and presenting recent history—the Echo Project (covering the history of science, technology, and industry), the September 11 Digital Archive, and the Hurricane Digital Memory Bank—are being imitated by people using a wave of new websites that help them locate recollections, images, and other digital objects on a map. Here’s an example from the mapping site Platial:

And here’s a similar map from our 9/11 project:

Of course, we’re delighted to have imitators (and indeed, in turn we have imitated others), since we are trying to disseminate as widely as possible methods for saving the digital record of the present for future generations. It’s great to see new sites like Platial, CommunityWalk, and Wayfaring providing easy-to-use, collaborative maps that scattered groups of people can use to store photos, memories, and other artifacts.

What Would You Do With a Million Books?

What would you do with a million digital books? That’s the intriguing question this month’s D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others come to fruition. I was lucky enough to be asked to write one of the responses, “From Babel to Knowledge: Data Mining Large Digital Collections,” in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just “free beer” and “free speech” arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory.

1. More emphasis needs to be placed on creating APIs for digital collections. Readers of this blog have seen this theme in several prior posts, so I won’t elaborate on it again here, though it’s a central theme of the article.

2. Resources that are free to use in any way, even if they are imperfect, are more valuable than those that are gated or use-restricted, even if those resources are qualitatively better. The techniques discussed in my article require the combination of dispersed collections and programming tools, which can only happen if each of these services or sources is openly available on the Internet. Why use Wikipedia (as I do in my H-Bot tool), which can be edited—or vandalized—by anyone? Not only can one send out a software agent to scan entire articles on the Wikipedia site (whereas the same spider is turned away by the gated Encyclopaedia Britannica), one can instruct a program to download the entire Wikipedia and store it on one’s server (as we have done at the Center for History and New Media), and then subject that corpus to more advanced manipulations. While flawed, Wikipedia is thus extremely valuable for data-mining purposes. For the same reason, the Open Content Alliance digitization project (involving Yahoo, Microsoft, and the Internet Archive, among others) will likely prove more useful for advanced digital research than Google’s far more ambitious library scanning project, which only promises a limited kind of search and retrieval.

3. Quantity may make up for a lack of quality. We humanists care about quality; we greatly respect the scholarly editions of texts that grace the well-tended shelves of university research libraries and disdain the simple, threadbare paperback editions that populate the shelves of airport bookstores. The former provides a host of helpful apparatuses, such as a way to check on sources and an index, while the latter merely gives us plain, unembellished text. But the Web has shown what can happen when you aggregate a very large set of merely decent (or even worse) documents. As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.

Where Are the Noncommercial APIs?

Readers of this blog know that one of my pet peeves as someone trying to develop software tools for scholars, teachers, and students is the lack of application programming interfaces (APIs) for educational resources. APIs greatly facilitate the use of these resources and allow third parties to create new services on top of them, such as the Google Maps “mashups” that have become a phenomenon in the last year. (Please see my post “Do APIs Have a Place in the Digital Humanities?” as well as the Hurricane Digital Memory Bank for more on APIs and to see what a historical mashup looks like.) Now a clearinghouse for APIs shows the extent to which noncommercial resources—and especially those in the humanities—have been left out in the cold in this promising new phase of the web. Count with me the total number of noncommercial, educationally oriented APIs out of the nearly 200 listed on Programmable Web.

That’s right, for the humanities the answer is one: the Library of Congress’s somewhat clunky SRU (Search/Retrieve via URL). Under a broader definition you might count the API from the BBC archive, though it seems to be more about current events. The Internet Archive’s API is currently focused on facilitating uploads into its system rather than, say, historical data mining of the web. A potentially rich API for finding book information, ISBNdb.com, seems promising, but shouldn’t there be a noncommercial entity offering this service (I assume ISBNdb.com will eventually charge or limit this important service)?
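
For anyone curious what an SRU query actually looks like, here is a minimal sketch of a version 1.1 searchRetrieve request. The base URL is a placeholder (consult the Library of Congress documentation for its live endpoint and supported indexes); the parameter names, however, are standard SRU 1.1.

```javascript
// Sketch of a basic SRU 1.1 searchRetrieve request. The base URL is a placeholder.
// (A plain web page would face cross-site request restrictions; run this from a
// context that allows them, such as an extension or a server-side proxy.)
var base = "http://example.loc.gov/sru"; // placeholder endpoint
var params = {
  version: "1.1",
  operation: "searchRetrieve",
  query: 'dc.title = "digital history"', // CQL query
  maximumRecords: "10",
  recordSchema: "dc"                     // ask for Dublin Core records
};

var query = Object.keys(params)
  .map(function (k) { return k + "=" + encodeURIComponent(params[k]); })
  .join("&");

var request = new XMLHttpRequest();
request.open("GET", base + "?" + query, true);
request.onload = function () {
  // SRU responses are XML; responseXML gives a DOM you can walk or query with XPath.
  var doc = request.responseXML;
  console.log(doc ? doc.documentElement.nodeName : "no XML returned");
};
request.send();
```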

By my count the only other noncommercial APIs are from large U.S. government scientific institutions such as NASA, NIH, and NOAA. Surely Programmable Web’s long list is missing some other noncommercial APIs out there, such as one for OAI-PMH. If so, let Programmable Web know—most “Web 2.0” developers are looking there first to get ideas for services, and we don’t need more mashups focusing on the real estate market.

Hurricane Digital Memory Bank Featured on CNN

I was interviewed yesterday by CNN about a new project at the Center for History and New Media, the Hurricane Digital Memory Bank, which uses digital technology to record memories, photographs, and other media related to Hurricanes Katrina, Rita, and Wilma. (CNN is going to feature the project sometime this week on its program The Situation Room.) The HDMB is a democratic historical project similar to our September 11 Digital Archive, which saved the recollections and digital files of tens of thousands of contributors from around the world; this time we’re trying to save thousands of perspectives on what occurred on the Gulf Coast in the fall of 2005. What amazes me is how the interest in online historical projects and collections has exploded recently. Several of the web projects I’ve co-directed over the last five years have engaged in collecting history online. But even a project with as prominent a topic as September 11 took a long time to be picked up by the mass media. This time CNN called us just a few weeks after we launched the website, and before we’ve done any real publicity. Here are three developments from the last two years that I think account for this sharply increased interest.

Technologies enabling popular writing (blogs) and image sharing (e.g., Flickr) have moved into the mainstream, creating an unprecedented wave of self-documentation and historicizing. Blogs, of course, have given millions of people a taste for daily or weekly self-documentation unseen since the height of diary use in the late nineteenth century. And it used to be fairly complicated to set up an online gallery of one’s photos. Now you can do it with no technical know-how whatsoever, and it’s become much easier for others to find these photos (partly due to tagging/folksonomies). The result is that millions of photographs are being shared daily and the general public is getting used to the instantaneous documentation of events. Look at what happened in the hours after the London subway bombings: the photographic documentation that appeared on photo-sharing sites within two days would formerly have taken archivists months or even years to compile.

New web services are making combinations of these democratic efforts at documentation feasible and compelling. Our big innovation for the HDMB is to locate each contribution on an interactive map (using the Google Maps API), which allows one to compare the experiences and images from one place (e.g., an impoverished parish in New Orleans) with another (e.g., a wealthier suburb of Baton Rouge). (Can someone please come up with a better word for these combinations than the current “mashups”?) Through the savvy use of unique Technorati or Flickr tags, a scattered group of friends or colleagues can now automatically associate a group of documents or photographs to create an instant collection on an event or issue.
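
For developers wondering how much work such a map takes, here is a minimal sketch of plotting a single contribution with the Google Maps JavaScript API (version 2). It is not the HDMB’s actual code; the coordinates, element id, and data fields are placeholders, and it assumes the Maps API script has already been loaded with a valid key.

```javascript
// Minimal Google Maps API (v2) sketch, not the HDMB's actual code.
// Assumes the Maps API <script> tag is loaded and the page contains <div id="map">.
function plotContribution(contribution) {
  var map = new GMap2(document.getElementById("map"));
  map.setCenter(new GLatLng(29.95, -90.07), 9); // roughly centered on New Orleans

  var point = new GLatLng(contribution.lat, contribution.lng);
  var marker = new GMarker(point);

  // Clicking the marker opens the contributor's story in an info window.
  GEvent.addListener(marker, "click", function () {
    marker.openInfoWindowHtml("<strong>" + contribution.title + "</strong><p>" +
                              contribution.story + "</p>");
  });

  map.addOverlay(marker);
}

// Example call with placeholder data:
plotContribution({
  lat: 29.96, lng: -90.05,
  title: "A contributor's neighborhood",
  story: "We left two days before the storm..."
});
```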

The mass media has almost completely reversed its formerly antagonistic posture toward new media. CNN now has at least two dedicated “Internet reporters” who look for new websites and scan blogs (once disparaged as the last refuge of unpublishable amateurs) for news and commentary. In the last year the blogosphere has actually broken several stories (e.g., the Dan Rather document scandal), and many journalists have started their own blogs. The Washington Post has just hired its first full-time blogger. Technorati now tracks over 24 million blogs; even if 99% of those are discussing the latest on TomKat (the celebrity marriage) or Tomcat (the Java servlet container), there are still a lot of new, interesting perspectives out there to be recorded for posterity.