Dan Cohen

Archive for the ‘Software’ Category

Shakespeare’s Hard Drive

Monday, August 13th, 2007

Congrats to Matt Kirschenbaum on his thought-provoking article in the Chronicle of Higher Education, “Hamlet.doc? Literature in a Digital Age.” Matt makes two excellent points. First, “born digital” literature presents incredible new opportunities for research, because manuscripts written on computers retain significant metadata and draft tracking that allows for major insights into an author’s thought and writing process. Second, scholars who wish to study such literature in the future need to be proactive in pushing for writing environments, digital standards, and archival storage that will provide accessibility and persistence for these advantages.

Creating a Blog from Scratch, Part 10: The Conclusion

Wednesday, July 25th, 2007

Since its inception until today, this blog was powered by code I had written myself. Some people thought this took a lot of work; to be honest, it was just a few days of simple coding. As I noted at the beginning of this series on “Creating a Blog from Scratch,” rather than using existing software or services, such as WordPress or Blogger, I wanted to write my own blog code so that I could experiment with the form of the blog. In general, I found it to be a great exercise that I would highly recommend. It helped me understand the genre of the blog, challenge long-standing assumptions of form and function (like the tyranny of the calendar, now gone on most blogs), and think about ways one might customize a blog to fit academic needs.

But starting today, this blog will be powered by WordPress, not my own code. Am I a hypocrite? Well, yes and no. Yes, in that by switching to WordPress I have had to abandon some quirks of my original blog that had made it unique and that represented the accumulated wisdom of writing my own code. No, in that I feel I’ve learned enough in the process of the last two years that I can bend WordPress to my will enough to satisfy my need to customize and adapt.

More important, I had other needs that I just didn’t have enough time to implement by writing more of my own code, and there were other features of WordPress–a terrific open-source project–that I really wanted:

  • It took two years, but I’ve decided after initially disparaging comments (sentiments echoed recently by some well-known bloggers), I actually do think they are important to a blog and that my critics were right that the blog suffered without them. So starting today I have comments at the end of each post. (My old posts will remain free of comments since I have left them in their original format.)
  • I had also worried that the blog comments would be a haven for spam, but after the release of the wonderful reCAPTCHA system–which helps the Open Content Alliance transcribe digitized books while preventing spam–I felt that relatively spam-free commenting was possible.
  • As successful open-source software, WordPress has engendered a universe of helpful plugins, modifications, and documentation. For instance, this blog is now Zotero-compatible, thanks to the WordPress COinS plugin by my colleague Sean Takats. And of course reCAPTCHA came with a plugin for WordPress too.
  • WordPress’s system for drafting and editing posts is far more advanced than the basic screens I created. Writing this post is taking me about half the time it would have taken in my old system.
  • For the past six months I have been using ma.gnolia to add small posts to my feed (and to the sidebar of my old blog under “Briefly Noted”). I now can do this just as quickly using WordPress, and plan to post much more frequently starting in September.
  • Despite my best efforts, my old blog code failed to output valid XHTML, which I believe is increasingly important in a world where non-computer devices (such as the iPhone) are browsing the web and RSS feeds. WordPress automatically writes pages in XHTML.

I suppose I should rip off of my sleeve the badge of honor from my home-grown blogging software. But I like to see the switch to WordPress as just another step in the continual improvement of this blog, and look forward to many more years of writing in this space.

Personal WorldCat Lists Now Zotero-Compatible

Thursday, July 5th, 2007

A great example of what I’ve been calling the “fluidity of bibliography.” WorldCat adds a feature that allows registered users to save and share lists of items they find in the WorldCat catalog. We tweak Zotero to work with it. Et voila–easy to find, save, share, grab, and re-share scholarly records.

Nora Project Screencast

Tuesday, June 19th, 2007

The Nora text analysis and visualization project has a screencast out explaining how to use a new web interface to their server-based software.

Social and Semantic Computing for Historical Scholarship

Monday, May 14th, 2007

Under the assumption that many readers of this blog don’t receive the American Historical Association’s magazine Perspectives, you might be interested in this article I wrote for the May 2007 issue. In the piece I discuss the Zotero project’s connection to several recent trends in computing, and think ahead to what the Zotero server might mean for academic fields like history.

2007 Mellon Awards for Technology Collaboration

Wednesday, March 14th, 2007

The Andrew W. Mellon Foundation has launched the nominating process for the second annual Mellon Awards for Technology Collaboration (MATC). The awards, given by tech luminaries such as Tim Berners-Lee and Vint Cerf, honor not-for-profit organizations for leadership in the collaborative development of open source software tools with particular application to higher education and not-for-profit activities.

NINES Officially Launches

Tuesday, February 20th, 2007

As someone keenly interested in the possibilities of digital scholarship as well as nineteenth-century British and American intellectual history, I’m delighted to hear of the official launch of NINES (Networked Infrastructure for Nineteenth-century Electronic Scholarship), which allows researchers to search, organize, and annotate over 60,000 texts and images. A screencast of how to use Collex, their powerful web application, would be helpful for new users.

Intelligence Analysts and Humanities Scholars

Monday, November 13th, 2006

About halfway through the Chicago Colloquium on Digital Humanities and Computer Science last week, the always witty and insightful Martin Mueller humorously interjected: “I will go away from this conference with the knowledge that intelligence analysts and literary scholars are exactly the same.” As the chuckles from the audience died down, the core truth of the joke settled in—for those interested in advancing the still-nascent field of the digital humanities, are academic researchers indeed becoming clones of intelligence analysts by picking up the latter’s digital tools? What exactly is the difference between an intelligence analyst and a scholar who is scanning, sorting, and aggregating information from massive electronic corpora?

Mueller’s remark prods those of us exploring the frontiers of the digital humanities to do a better job describing how our pursuit differs from other fields making use of similar computational means. A good start would be to highlight that while the intelligence analyst sifts through mountains of data looking for patterns, anomalies, and connections that might be (in the euphemistic argot of the military) “actionable” (when policy makers piece together bits of intelligence and decide to take action), the digital humanities scholar should be looking for patterns, anomalies, and connections that strengthen or weaken existing theories in their field, or produce new theories. In other words, we not only uncover evidence, but come to overarching conclusions and make value judgments; we are at once the FBI, the district attorney, the judge, and the jury. (Perhaps the “National Intelligence Estimates” that are the highest form of synthesis in the intelligence community come closest to what academics do.)

The gentle criticism I gave to the Chicago audience at the end of the colloquium was that too many presentations seemed one (important) piece away from completing this interpretive whole. Through extraordinary guile, a series of panelists showed how digital methods can determine the gender of Shakespeare’s interlocutors, show more clearly the repetition of key phrases in Gertrude Stein’s prose, or more clearly map the ideology and interactions of FDR’s advisors during and after Pearl Harbor. But of course the real questions that need to be answered—answers that will make other humanities scholars stand up and take notice of digital methods—are, of course, how the identification of gender reshapes (or reinforces) our views of Shakespeare’s plays, how the use of repetition changes our perspectives on Gertrude Stein’s writings, or how a better understanding of presidential advisors alters our historical narrative of America’s entry into the second World War.

In Chicago, I tried to give this critical, final moment of insight reached through digital means a name—the “John Snow moment”—in honor of the Victorian pharmacist who discovered the cause of cholera by using a novel research tool unfamiliar to traditional medical science. Rather than looking at symptoms or other patient information on a case-by-case basis as a cholera outbreak killed and sickened hundreds of people in London in 1854, Snow instead mapped all incidences of the disease by the street addresses of the patients, thus quickly discovering that the cases clustered around a Soho water pump. The city council removed the water pump’s handle, quickly curtailing the disease and inaugurating a new era of epidemiology. Snow proved that cholera was a waterborne disease. Now that’s actionable intelligence.

What can digital scholars do to reach this level of insight? A key first step, reinforced by my experience in Chicago, is that academics interested in the power of computational methods must work to forge tools that satisfy their interpretive needs rather than simply accepting the tools that are currently available from other domains of knowledge, like intelligence. Ostensibly the Chicago Colloquium was about bringing together computer scientists and humanities scholars to see how we might learn from each other and enable new forms of research in an age of millions of digitized books. But as I noted in my remarks on the closing panel, too often this interaction seemed like a one-way street, with humanities scholars applying existing computer science tools rather than engaging the computer scientists (or programming themselves) to create new tools that would be better suited to their own needs. Hopefully such new tools will lead to more John Snow moments in the humanities in the near future.

Creating a Blog from Scratch, Part 5: What is XHTML, and Why Should I Care?

Thursday, January 5th, 2006

In prior posts in this series (1, 2, 3, and 4), I described with some glee my rash abandonment of common blogging software in favor of writing my own. For my purposes there seemed to be some key disadvantages to these popular packages, including an overemphasis on the calendar (I just saw the definition of a blog at the South by Southwest Interactive Festival—”a page with dated entries”—which, to paraphrase Woody Allen, is like calling War and Peace “a book about Russia”), a sameness to their designs, and comments that are rarely helpful and often filled with spam. But one of the greatest advantages of recent blog software packages is that they generally write standards-compliant code. More specifically, blog software like WordPress automatically produces XHTML. Some of you might be asking, what is XHTML, and who cares? And why would I want to spend a great deal of effort ensuring that this blog complied strictly with this language?

The large digital library contingent that reads this blog could probably enumerate many reasons why XHTML compliance is important, but I had two reasons in mind when I started this blog. (Actually, I had a third, more secretive reason that I’ll mention first: Roy Rosenzweig and I argue in our book Digital History that XHTML will likely be critical for digital humanists to adhere to in the future—don’t want to be accused of being a hypocrite.) For those for whom web acronyms are Greek, XHTML is a sibling of XML, a more rigorously structured and flexible language than the HTML that underlies most of the web. XHTML is better prepared than HTML to be platform-independent; because it separates formatting from content, XHTML (like XML) can be reconfigured easily for very different environments (using, e.g., different style sheets). HTML, with formatting and content inextricably combined, for the most part assumes that you are using a computer screen and a web browser. Theoretically XHTML can be dynamically and instantaneously recast to work on many different devices (including a personal computer). This flexibility is becoming an increasingly important feature as people view websites on a variety of platforms (not just a normal computer screen, e.g., but cell phones or audio browsers for the blind). Indeed, according to the server logs for this blog, 1.6% of visitors are using a smart phone, PDA, or other means to read this blog, a number that will surely grow. In short, XHTML seems better prepared than regular HTML to withstand the technological changes of the coming years, and theoretically should be more easily preserved than older methods of displaying information on the web. For these and other reasons a 2001 report the Smithsonian commissioned recommended the institution move to XHTML from HTML.

Of course, with standards compliance comes extra work. (And extra cost. Just ask webmasters at government agencies trying to make their websites comply with Section 508, the mandatory accessibility rules for federal information resources.) Aside from a brief flirtation with the what-you-see-is-what-you-get, write-the-HTML-for-you program Dreamweaver in the late 1990s, I’ve been composing web pages using a text editor (the superb BBEdit) for over ten years, so my hands are used to typing certain codes in HTML, in the same way you get used to a QWERTY keyboard. XHTML is not that dissimilar from HTML, but it still has enough differences to make life difficult for those used to HTML. You have to remember to close every tag; some attributes related to formating are in strange new locations. One small example of the minor infractions I frequently trip up on writing XHTML: the oft-used break tag to add a line to a web page must “close itself” by adding a slash before the end bracket (not <br>, but <br />). But I figured doing this blog would give me a good incentive to start writing everything in strict XHTML.

Yeah, right. I clearly haven’t been paying enough attention to detail. The page you’re reading likely still has dozens of little coding errors that make it fail strict compliance with the World Wide Web Consortium’s XHTML standard. (If you would like a humbling experience that brings to mind receiving a pop quiz back from your third-grade teacher with lots of red ink on it, try the W3C’s XHTML Validator.) I haven’t had enough time to go back and correct all of those little missing slashes and quotation marks. WordPress users out there can now begin their snickering; their blog software does such mundane things for them, and many proudly (and annoyingly) display little “XHTML 1.0 compliant” badges on their sites. Go ahead, rub it in.

After I realized that it would take serious effort to bring my code up to code, so to speak, I sat back and did the only thing I could do: rationalize. I didn’t really need strict XHTML compliance because through some design slight-of-hand I had already been able to make this blog load well on a wide range of devices. I learned from other blog software that if you put the navigation on the right rather than the more common left you see on most websites, the body of each post shows up first on a PDA or smart phone. It also means that blind visitors don’t have to suffer through a long list of your other posts before getting to the article they want to read.

As far as XHTML is concerned, I’ll be brushing up on that this summer. Unless I move this blog to WordPress by then.

Creating a Blog from Scratch, Part 4: Searching for a Good Search

Monday, December 26th, 2005

It often surprises those who have never looked at server logs (the detailed statistics about a website) that a tremendous percentage of site visitors come from searches. In the case of the Center for History and New Media, this is a staggering 400,000 unique visitors a month out of about one million. Furthermore, many of these visitors ignore a website’s navigation and go right to the site search box to complete their quest for information. While I’m not a big fan of consultants that tell webmasters to sacrifice virtually everything for usability, I do feel that searching has been undervalued by digital humanities projects, in part because so much effort goes into digitization, markup, interpretation, and other time-consuming tasks. But there’s another, technical reason too: it’s actually very hard to create an effective search—one, for instance, that finds phrases as well as single words, that is able to rank matches well, and that is easy to maintain through software and server upgrades. In this installment of “Creating a Blog from Scratch” (for those who missed them, here are parts 1, 2, and 3) I’ll take you behind the scenes to explain the pluses and minuses of the various options for adding a search feature to a blog, or any database-driven website for that matter.

There are basically four options for searching a website that is generated out of a database: 1) have the database do it for you, since it already has indexing and searching built in; 2) install another software package on your server that spiders your site, indices it, and powers your search; 3) use an application programming interface (API) from Google, Yahoo, or MSN to power the search, taking search results from this external source and shoehorning them into your website’s design; 4) outsourcing the search entirely by passing search queries to Google, Yahoo, or MSN’s website, with a modifier that says “only search my site for these words.”

Option #1 seems like the simplest. Just create an SQL statement (a line of code in database lingo) that sends the visitor’s query to the database software—in the case of this blog, the popular MySQL—and have it return a list of entries that match the query. Unfortunately, I’ve been using MySQL extensively for five years now and have found its ability to match such queries less than adequate. First of all, until the most recent version of the MySQL it would not handle phrase searching at all, so you would have to strip quotation marks out of queries and fool the user into believing your site could do something that it couldn’t (that is, do a search like Google could). Secondly, I have found its indexing and ranking schemes to be far behind what you expect from a major search engine. Maybe this has changed in version 5, but for many years it seemed as if MySQL was using search principles from the early 1990s, where the number of times a word appeared on the page signified how well the page matched the query (rather than the importance of the place of each instance of the word on the page, or even better, how important the document was in the constellation of pages that contained that word). MySQL will return a fraction from 0 to 1 for the relevance of a match, but it’s a crude measure. I’m still not convinced, even with the major upgrades in version 5, that MySQL’s searching is acceptable for demanding users.

Option #2 is to install specialized search packages such as the open source ht://Dig on your server, point it to your blog (or website) and let it spider the whole thing, just as Google or Yahoo does from the outside. These software packages can do a decent job indexing and swiftly finding documents that seem more relevant than the rankings in MySQL. But using them obviously requires installing and maintaining another complicated piece of software, and I’ve found that spiders have a way of wandering beyond the parameters you’ve set for them, or flaking out during server upgrades. (Over the last few days, for instance, I’ve had two spiders request hundreds of posts from this blog that don’t exist. Maybe they can see into the future.) Anecdotally, I also think that the search results are better from commercial services such as Google or Yahoo.

I’ve become increasingly enamored of Option #3, which is to use APIs, or direct server-to-server communications, with the indices maintained by Google, Yahoo, or Microsoft. The advantage of these APIs is that they provide you with very high quality search results and query handling (at least for Google and Yahoo; MSN is far behind). Ranking is done properly, with the most important documents (e.g., blog posts that many other bloggers link to or that you have referenced many times on your own site) coming up first if there are multiple hits in the search results. And these search giants have far more sophisticated ways of handling phrase searches (even long ones) and boolean searches than MySQL. The disadvantage of APIs is that for some reason the indices made available to software developers are only a fraction the size of the main indices for these search engines, and are only updated about once a month. So visitors may not find recent material, or some material that is ranked fairly low, through API searches. Another possibility for Option #3 is to use the API for a blog search engine, rather than a broad search engine. For instance, Technorati has a blog-specific search API. Since Technorati automatically receives a ping from my Atom feed every time I post (via FeedBurner), it’s possible that this (or another blog search engine) will ultimately provide a solid API-based search.

I’ve been experimenting with ways of getting new material into the main Google index swiftly (i.e., within a day or two rather than a month or two), and have come up with a good enough solution that I have chosen Option #4: outsourcing the search entirely to Google, by using their free (though unfortunately ad-supported) site-specific search. With little fanfare, this year Google released Google Sitemaps, which provides an easy way for those who maintain websites, especially database-driven ones, to specify where all of their web pages are using an XML schema. (Spiders often miss web pages generated out of a database because there are often so many of them, and some of these pages may not be linked to.) While not guaranteeing that everything in your sitemap will be crawled and indexed, Google does say that it makes it easier for them to crawl your site more effectively. (By the way, Google’s recent acquisition of 5 percent of AOL seems to have been, at least ostensibly, very much about providing AOL with better crawls, thus upping the visibility of their millions of pages without messing with Google’s ranking schemes.) And—here’s the big news if you’ve made it this far—I’ve found that having a sitemap gets new blog posts into the main Google index extremely fast. Indeed, usually within 24 hours of submitting a new post Google downloads my updated sitemap (created automatically by a PHP script I’ve written), sees the new URL for the post, and adds it to its index. This means that I can very effectively use the Google’s main search engine for this blog, although because I’m not using the API I can’t format the results page to match the design of my site exactly.

One final note, and I think an important one for those looking to increase the visibility of their blog posts (or any web page created from a database) in Google’s search results: have good URLs, i.e., ones with important keywords rather than meaningless numbers or letters. Database-driven sites often have such poor URLs featuring an ugly string of variables, which is a shame, since server technology (such as Apache’s mod_rewrite) allows webmasters to replace these variables with more memorable words. Moreover, Google, Yahoo, and other search engines clearly favor keywords in URLs (very apparent when you begin to work with Google’s Web API), assigning them a high value when determining the relevance of a web page to a query. Some blog software automatically creates good URLs (like Blogger, owned by Google), while many other software packages do not—typically emphasizing the date of a post in the URL or the page number in the blog. For my own blogging software, I designed a special field in the database just for URLs, so I can craft a particularly relevant and keyword-laden string. Mod_rewrite takes care of the rest, translating this string into an ID number that’s retrieved by the database to generate the page you’re reading.

For many reasons, including making it accessible to alternative platforms such as audio browsers and cell phones, I wanted to generate this page in strict XHTML, unlike my old website, which had poor coding practices left over from the 1990s. Unfortunately, as the next post in this series details, I failed terribly in the pursuit of this goal, and this floundering made me think twice about writing my own blogging software when existing packages like WordPress will generate XHTML for you, with no fuss.

Part 5: What is XHTML, and Why Should I Care?