Dan Cohen

Archive for the ‘Programming’ Category

Creating a Blog from Scratch, Part 1: What is a Blog, Anyway?

Friday, December 16th, 2005

If you look at the bottom of this page, you won’t see any of the telltale signs that it is generated by a blog software package like Blogger, Moveable Type, or WordPress. When I was redesigning this site and wanted to add a blog to it, I made the perhaps foolhardy decision to write my own blogging software. Why, you might ask, would I recreate the proverbial wheel? As I’ll explain in several other columns in this space, writing your own software is one of the best ways to learn—not only about how to write software, but also about genres and to think about (and rethink) some of the assumptions that go into the construction of software written for specific genres. The first question I therefore asked myself was, What is a blog, anyway?

Seems like an easy question. But it’s really not. Blogs began literally as “Web logs,” as logs of links to websites that people thought were interesting and wanted to share with others. Hip readers of this blog will recognize, however, that this task has recently shifted to new “Web 2.0″ services like del.icio.us, Furl, and Digg. Many blogs continue to include links to other websites, of course, perhaps with some commentary added, but the blog is quickly becoming the wrong place to merely list a bunch of links.

Following this initial purpose, the blog became a place for early adopters to write about events in their lives. In other words, the closest cognate to an offline genre was the diary. As I’ll argue in my next post in this series, this phase has made a permanent (and not entirely positive) mark on blogging software. Let me say for now that it led to what I would call “the tryanny of the calendar” (note all of those calendars on blogs).

Many bloggers (including this one), however, aren’t writing diary entries. We’re passing along information to readers, some of it topical and time-based and some of it not. For instance, I found a great post on Wally Grotophorst’s blog about getting rid of the need to type in passwords (which I do a lot). Is this topic essentially about Friday, October 28th, 2005 at 2:43 pm as WordPress emphasizes in Wally’s archive? No. It’s about “Faster SSH Logins,” as his effectively terse title suggests. This led me to my first idea for my own blogging software: emphasize, above all, the subject matter and the content of each post.

It also made me realize something even more basic. At heart a blog is very simple: it’s merely a way to dynamically create a site out of a series of “posts,” in the same way that a website consists of a series of web pages. Blogs now do all kinds of things: rant and rave, sell products, provide useful tips, record profound thoughts. With this in mind I started writing some PHP code and setting up a database that would serve this blog, but in a much simpler way than Blogger, Moveable Type, and WordPress.

In Part II of this series, I’ll discuss the advantages and disadvantages of existing blogging software, and how I tried to retain the positives, remove the negatives, and create a much slimmer but highly useful and flexible piece of software that anyone could write with just a little bit of programming experience.

Part 2: Advantages and Disadvantages of Popular Blog Software

Introduction to Firefox Scholar

Wednesday, December 7th, 2005

This week in the electronic version, and next week in the print version, the Chronicle of Higher Education is running an article (subscription required) on a new software project I’m co-directing, Firefox Scholar, which will be a set of extensions to the popular open source web browser that will help researchers, teachers, and students. My thanks to the many people who have emailed who are interested in the project. For them and for others who would like to know more, here’s a brief summary of Firefox Scholar from our grant proposal to the Institute for Museum and Library Services, which has generously provided $250,000 to initiate the project. Please contact me if you would like occasional updates on the project or would like a beta release of the browser when it is available in the late summer of 2006.

The web browser has become the primary means for accessing information, documents, and artifacts from libraries and museums around the country and the world, thanks in large part to the tremendous commitment these institutions have made to bringing their collections online (as either simple citations or complete text and images). Unfortunately for scholars, while tens of millions of dollars have been spent to create digital resources, far less funding and effort has been allocated for the development of tools to facilitate the use of these resources. The browser remains merely a passive window allowing one to view, but not easily collect, annotate, or manipulate these objects. Moreover, from the user’s perspective individual library and museum collections remain just that—separate websites with distinct designs and different ways of displaying their information, making traditional scholarly practices of bringing together and studying objects of interest from across these collections unnecessarily difficult.

Firefox Scholar, a set of tools incorporated into popular, open, and free web software, will address these major problems by creating a web browser that is “smarter” in two key ways. First, one tool will enable the browser to intelligently sense when its user is viewing a digital library or museum object; this will allow the browser to capture information from the page automatically, such as the creator, title, date of creation, and copyright information. Second, another tool will store and organize this information, as well as full copies of items and web pages (not just their citation information) if so desired by the user and permitted by the institution’s site, allowing the user to sort, annotate, search, and manipulate these individualized collections created for scholarly purposes. Critically, all of this will occur within the web browser itself, not in a separate, standalone application; the web browser will be used not just to discover information, but also to collect, organize, and analyze scholarly materials.

Reliability of Information on the Web

Monday, December 5th, 2005

Given the current obsession with the reliability (or more often in media coverage, the unreliability) of information on the web—the New York Times weighed in on the matter yesterday, and USA Today carried a scathing op-ed last week—I feel lucky that an article Roy Rosenzweig and I wrote entitled “Web of Lies? Historical Information on the Internet” happens to appear today in First Monday. If you’re interested in the subject, it’s probably best to read the full article, but I’ll provide a quick summary of our argument here.

Using my H-Bot software tool, Roy and I scanned the Internet to assess the quality of online information about history. In short, we found that while critics are correct that there are many error-riddled web pages, on the whole the web presents a relatively sound portrayal of historical facts through a process of consensus. With the right tools, these facts can be extracted from the web, leaving the more problematic web pages aside.

Moreover, this process of historical data mining on the web should prompt further discussion about the significance of all of this historical information online. To do some of our own prompting, we had a special multiple-choice test-taking version of H-Bot take the National Assessment of Educational Progress U.S. History exam using nothing but the web and some fancy algorithms borrowed from computer science. [Spoiler alert: it passed.] This raises new questions that move far beyond simple debates over the reliability of information on the web and into the very nature of teaching, learning, and research in our digital age.

Do APIs Have a Place in the Digital Humanities?

Monday, November 21st, 2005

Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way.

Unfortunately, even after four decades APIs remain much more common in the sciences and the commercial realm—for example, the APIs provided by search behemoths Google and Yahoo—than in the humanities. There are some obvious reasons for this disparity. By supplying an API, the owners of a resource or tool generally bear most of the cost (on their taxed servers, in technical support and staff time) while receiving little or no (immediate) benefit. Moreover, by essentially making an end-run around the common or “official” ways of accessing a tool or project (such as a web search form for a digital archive), an API may devalue the hard work and thoughtfulness put into the more public front end for a digital project. It is perhaps unsurprising that given these costs even Google and Yahoo, which have the financial strength and personnel to provide APIs for their search engines, continue to keep these programs hobbled—after all, programmers can use their APIs to create derivative search engines that compete directly with Google’s or Yahoo’s results pages, with none of the diverting (and profitable) text advertising.

So why should projects in the digital humanities provide APIs, especially given their often limited (or nonexistent) funding compared to a Google or Yahoo? The reason IBM conceived APIs in the first place, and still today the reason many computer scientists find APIs highly beneficial, is that unlike other forms of access they encourage the kind of energetic and creative grass-roots and third-party development that in the long run—after the initial costs borne by the API’s owner—maximize the value and utility of a digital resource or tool. Motivated by many different goals and employing many different methodologies, users of APIs often take digital resources or tools in directions completely unforeseen by their owners. APIs have provided fertile ground for thousands of developers to experiment with the tremendous indices and document caches maintained by Google and Yahoo. New resources based on these APIs appear weekly, some of them hinting at new methods for digital research, data visualization techniques, and novel ways to data-mine texts and synthesize knowledge.

Is it possible—and worthwhile—for digital humanities projects to provide such APIs for their resources and tools? Which resources or tools would be best suited for an API, and how will the creators of these projects sustain such an additional burden? And are there other forms of access or interoperability that have equal or greater benefits with fewer associated costs?