Category Archives: Programming

Creating a Blog from Scratch, Part 2: Advantages and Disadvantages of Popular Blog Software

In the first post in this series I briefly recounted the early history of blogs (all of five years ago) and noted how many of their current uses have diverged from two early incarnations (as a place to store interesting web links and as the online equivalent of a diary). Unfortunately, these early, dominant forms gave rise to existing blog software that, at least in my mind, is problematic. This “encoding” of original purposes into the basic structure of software is common in software development, and it often leads to features and configurations in later releases that are undesirable to a large number of users. In this post, I discuss the advantages and disadvantages of common blog packages—often deeply encoded into the software.

There are many good reasons to use popular blog software like Moveable Type, Blogger, or WordPress, almost too many to mention in this space. Here are some of the most important reasons, many of them obvious and others perhaps less so:

  • From a single download or by signing up with a service, you get a high level of functionality immediately and can focus on the content of your blog rather than its programming.
  • Their web designs make them instantly recognizable as the genre “blog,” thus making new visitors feel comfortable. For instance, most of them list posts in reverse chronological order, with “archives” that contain posts segmented by months and calendars marking days on which you have recently posted.
  • They allow people with little time or technical expertise to generate a site with well-formed, standards-compliant web code (most recently XHTML).
  • They automatically generate an RSS feed.
  • They have large user bases and active developers, which makes for relatively quick responses to annoyances such as blog spam.
  • They have lots of neat “social” features, such as feedback mechanisms (e.g., comments), and tracking (to see who has linked to one of your posts).
  • Some blog software automatically creates relatively good URLs (more on that in a later post in this series, and why good URLs are important to a blog).
  • Some blog services allow you to post via email and phone in addition to using a web browser.

Some of the disadvantages of popular blog software are merely the flip side of some of these advantages:

  • Even with the many templates blog software comes with, their web designs make most blogs look alike. Yes, you can easily figure out that a site is a blog, but on the other hand they begin to blend together in the mind’s eye. The web is made for variety, not sameness. In addition, you really have to work hard to fit a blog seamlessly into a broader site.
  • The tyranny of the calendar. There’s too much attention to chronology rather than content and the associations between that content. You can almost hear your blog software saying,
    “Boy, Dan had a pretty thin November, posting-wise,” taunting you with that empty calendar, or calling attention to the fact that your last post was “56 days ago.” Quality should triumph over quantity or frequency. Taking the emphasis off of time—perhaps not entirely, but a great deal—seemed to me to be a good first step for my own blog software (you’ll note that I only have a greyed-out date below the big red headline and a tiny “date string” in the buttons for each post). Obviously it makes sense to have recent posts highest on the page, but there may also be older posts that are still relevant or popular with visitors that you would like to highlight or reshuffle back into the mix. “Categories” have helped somewhat in this regard, and now post tagging (folksonomy) presents more hope. But I want to have full control over the position of my posts, recategorize them at will, have breakouts (like this series), different visual presentations, etc. And no thank you to the calendars or monthly archives.
  • Large installed user bases, as those who use Microsoft products will tell you, leads to unsavory attacks. Note the enormous proliferation of blog spam in the last year, mostly done by automated programs that know exactly how to find WordPress comment fields, Moveable Type comment fields, etc. Sure, there are now mechanisms for defending against these attacks, but when you really think about it…
  • The comment feature of blogs is vastly overrated anyway. My back-of-the-envelope calculation is that 1% of blog comments are useful to other readers. A truly important comment will be emailed to the writer of the blog, as I encourage readers to do at the end of every post. Moreover, increasingly a better place for you to comment on someone’s blog is on your own blog, with a link to their post. Indeed, that’s what Technorati and other blog search engines have figured out, and now you can acquire of a feed of comments about your blog from these third parties without opening up your blog to comment spam. (This also eliminates the need for trackback technology in your blog software.) So: no comments on my blog. Sorry. Don’t need the hassle of deleting even the occasional blog spam, and as readers of this blog have already done in droves (thanks!) you can email me if you need to. I’ll be happy to post your comments in this space if they help clarify a topic or make important corrections.
  • The search function is often not very good on blogs, even though search is how many people navigate sites. And trying to have a search function that simultaneously searches a blog and a wider site can be very complicated.
  • Like most software, there is a factor of “lock-in” when you choose an existing blog software package or service. It’s not entirely simple to export your material to a different piece of software. And many blog software packages have made this worse by encouraging posts written with non-standard (i.e., non-XHTML) characters that are used for formatting or style (as with Textile) and are converted to XHTML equivalents on the fly. This makes writing blog posts slightly faster. But if you export those posts, you will lose the important character translations.

Following this assessment of the advantages and disadvantages of popular blog software, I set about creating my own basic software that would easily fit into the web design you see here. Of course, I was throwing the baby out with the bath water by writing my own blog code. Couldn’t I just turn off the comments feature? Didn’t I want that easy XHTML compliance? Come on, are the designs so bad (they’re actually not, especially WordPress’s, but they are fairly similar across blogs)? Don’t I want to be able to phone in a post, or email one from a BlackBerry? (OK, the answer is no on both of those counts.)

But as I mentioned at the beginning of this series, I wanted to learn by doing and making. I didn’t know much about RSS. Which kind of RSS feed was best? How do you make an RSS feed, anyhow? I’ve thought a great deal about searching and data-mining, but what was the best way to search a blog? Were there ways to make a blog more searchable?

With these questions and concerns in mind, I started writing a simple PHP/MySQL application, and began to think about how I would make up for the lack of some of the advantages I’ve outlined above (hint: outsourcing). In the next post in this series, I’ll walk you through the basic setup and puzzle at the variety of RSS feeds.

Part 3: The Double Life of Blogs

Creating a Blog from Scratch, Part 1: What is a Blog, Anyway?

If you look at the bottom of this page, you won’t see any of the telltale signs that it is generated by a blog software package like Blogger, Moveable Type, or WordPress. When I was redesigning this site and wanted to add a blog to it, I made the perhaps foolhardy decision to write my own blogging software. Why, you might ask, would I recreate the proverbial wheel? As I’ll explain in several other columns in this space, writing your own software is one of the best ways to learn—not only about how to write software, but also about genres and to think about (and rethink) some of the assumptions that go into the construction of software written for specific genres. The first question I therefore asked myself was, What is a blog, anyway?

Seems like an easy question. But it’s really not. Blogs began literally as “Web logs,” as logs of links to websites that people thought were interesting and wanted to share with others. Hip readers of this blog will recognize, however, that this task has recently shifted to new “Web 2.0” services like, Furl, and Digg. Many blogs continue to include links to other websites, of course, perhaps with some commentary added, but the blog is quickly becoming the wrong place to merely list a bunch of links.

Following this initial purpose, the blog became a place for early adopters to write about events in their lives. In other words, the closest cognate to an offline genre was the diary. As I’ll argue in my next post in this series, this phase has made a permanent (and not entirely positive) mark on blogging software. Let me say for now that it led to what I would call “the tryanny of the calendar” (note all of those calendars on blogs).

Many bloggers (including this one), however, aren’t writing diary entries. We’re passing along information to readers, some of it topical and time-based and some of it not. For instance, I found a great post on Wally Grotophorst’s blog about getting rid of the need to type in passwords (which I do a lot). Is this topic essentially about Friday, October 28th, 2005 at 2:43 pm as WordPress emphasizes in Wally’s archive? No. It’s about “Faster SSH Logins,” as his effectively terse title suggests. This led me to my first idea for my own blogging software: emphasize, above all, the subject matter and the content of each post.

It also made me realize something even more basic. At heart a blog is very simple: it’s merely a way to dynamically create a site out of a series of “posts,” in the same way that a website consists of a series of web pages. Blogs now do all kinds of things: rant and rave, sell products, provide useful tips, record profound thoughts. With this in mind I started writing some PHP code and setting up a database that would serve this blog, but in a much simpler way than Blogger, Moveable Type, and WordPress.

In Part II of this series, I’ll discuss the advantages and disadvantages of existing blogging software, and how I tried to retain the positives, remove the negatives, and create a much slimmer but highly useful and flexible piece of software that anyone could write with just a little bit of programming experience.

Part 2: Advantages and Disadvantages of Popular Blog Software

Introduction to Firefox Scholar

This week in the electronic version, and next week in the print version, the Chronicle of Higher Education is running an article (subscription required) on a new software project I’m co-directing, Firefox Scholar, which will be a set of extensions to the popular open source web browser that will help researchers, teachers, and students. My thanks to the many people who have emailed who are interested in the project. For them and for others who would like to know more, here’s a brief summary of Firefox Scholar from our grant proposal to the Institute for Museum and Library Services, which has generously provided $250,000 to initiate the project. Please contact me if you would like occasional updates on the project or would like a beta release of the browser when it is available in the late summer of 2006.

The web browser has become the primary means for accessing information, documents, and artifacts from libraries and museums around the country and the world, thanks in large part to the tremendous commitment these institutions have made to bringing their collections online (as either simple citations or complete text and images). Unfortunately for scholars, while tens of millions of dollars have been spent to create digital resources, far less funding and effort has been allocated for the development of tools to facilitate the use of these resources. The browser remains merely a passive window allowing one to view, but not easily collect, annotate, or manipulate these objects. Moreover, from the user’s perspective individual library and museum collections remain just that—separate websites with distinct designs and different ways of displaying their information, making traditional scholarly practices of bringing together and studying objects of interest from across these collections unnecessarily difficult.

Firefox Scholar, a set of tools incorporated into popular, open, and free web software, will address these major problems by creating a web browser that is “smarter” in two key ways. First, one tool will enable the browser to intelligently sense when its user is viewing a digital library or museum object; this will allow the browser to capture information from the page automatically, such as the creator, title, date of creation, and copyright information. Second, another tool will store and organize this information, as well as full copies of items and web pages (not just their citation information) if so desired by the user and permitted by the institution’s site, allowing the user to sort, annotate, search, and manipulate these individualized collections created for scholarly purposes. Critically, all of this will occur within the web browser itself, not in a separate, standalone application; the web browser will be used not just to discover information, but also to collect, organize, and analyze scholarly materials.

Reliability of Information on the Web

Given the current obsession with the reliability (or more often in media coverage, the unreliability) of information on the web—the New York Times weighed in on the matter yesterday, and USA Today carried a scathing op-ed last week—I feel lucky that an article Roy Rosenzweig and I wrote entitled “Web of Lies? Historical Information on the Internet” happens to appear today in First Monday. If you’re interested in the subject, it’s probably best to read the full article, but I’ll provide a quick summary of our argument here.

Using my H-Bot software tool, Roy and I scanned the Internet to assess the quality of online information about history. In short, we found that while critics are correct that there are many error-riddled web pages, on the whole the web presents a relatively sound portrayal of historical facts through a process of consensus. With the right tools, these facts can be extracted from the web, leaving the more problematic web pages aside.

Moreover, this process of historical data mining on the web should prompt further discussion about the significance of all of this historical information online. To do some of our own prompting, we had a special multiple-choice test-taking version of H-Bot take the National Assessment of Educational Progress U.S. History exam using nothing but the web and some fancy algorithms borrowed from computer science. [Spoiler alert: it passed.] This raises new questions that move far beyond simple debates over the reliability of information on the web and into the very nature of teaching, learning, and research in our digital age.

Do APIs Have a Place in the Digital Humanities?

Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way.

Unfortunately, even after four decades APIs remain much more common in the sciences and the commercial realm—for example, the APIs provided by search behemoths Google and Yahoo—than in the humanities. There are some obvious reasons for this disparity. By supplying an API, the owners of a resource or tool generally bear most of the cost (on their taxed servers, in technical support and staff time) while receiving little or no (immediate) benefit. Moreover, by essentially making an end-run around the common or “official” ways of accessing a tool or project (such as a web search form for a digital archive), an API may devalue the hard work and thoughtfulness put into the more public front end for a digital project. It is perhaps unsurprising that given these costs even Google and Yahoo, which have the financial strength and personnel to provide APIs for their search engines, continue to keep these programs hobbled—after all, programmers can use their APIs to create derivative search engines that compete directly with Google’s or Yahoo’s results pages, with none of the diverting (and profitable) text advertising.

So why should projects in the digital humanities provide APIs, especially given their often limited (or nonexistent) funding compared to a Google or Yahoo? The reason IBM conceived APIs in the first place, and still today the reason many computer scientists find APIs highly beneficial, is that unlike other forms of access they encourage the kind of energetic and creative grass-roots and third-party development that in the long run—after the initial costs borne by the API’s owner—maximize the value and utility of a digital resource or tool. Motivated by many different goals and employing many different methodologies, users of APIs often take digital resources or tools in directions completely unforeseen by their owners. APIs have provided fertile ground for thousands of developers to experiment with the tremendous indices and document caches maintained by Google and Yahoo. New resources based on these APIs appear weekly, some of them hinting at new methods for digital research, data visualization techniques, and novel ways to data-mine texts and synthesize knowledge.

Is it possible—and worthwhile—for digital humanities projects to provide such APIs for their resources and tools? Which resources or tools would be best suited for an API, and how will the creators of these projects sustain such an additional burden? And are there other forms of access or interoperability that have equal or greater benefits with fewer associated costs?