Programming – Dan Cohen

Thoughts on One Week | One Tool

Well that just happened. It’s hard to believe that last Sunday twelve scholars and software developers were arriving at the brand-new Mason Inn on our campus and now have created and launched a tool, Anthologize, that created a frenzy on social and mass media.

If you haven’t already done so, you should first read the many excellent reports from those who participated in One Week | One Tool (and watched it from afar). One Week | One Tool was an intense institute sponsored by the National Endowment for the Humanities that strove to convey the Center for History and New Media’s knowledge about building useful scholarly software. As the name suggests, the participants had to conceive, build, and disseminate their own tool in just one week. To the participants’ tired voices I add a few thoughts from the aftermath.

Less Talk, More Grok

One Week director (and Center for History and New Media managing director) Tom Scheinfeldt and I grew up listening to WAAF in Boston, which had the motto (generally yelled, with reverb) “Less Talk, More Rock!” (This being Boston, it was actually more like “Rahwk!”) For THATCamp I spun that call-to-action into “Less Talk, More Grok!” since it seemed to me that the core of THATCamp is its antagonism toward the deadening lectures and panels of normal academic conferences and its attempt to maximize knowledge transfer with nonhierarchical, highly participatory, hands-on work. THATCamp is exhausting and exhilarating because everyone is engaged and has something to bring to the table.

Not to over-philosophize or over-idealize THATCamp, but for academic doubters I do think the unconference is making an argument about understanding that should be familiar to many humanists: the importance of “tacit knowledge.” For instance, in my field, the history of science, scholars have come to realize in the last few decades that not all of science consists of cerebral equations and concepts that can be taught in a textbook; often science involves techniques and experiential lessons that must be acquired in a hands-on way from someone already capable in that realm.

This is also true for the digital humanities. I joked with emissaries from the National Endowment for the Humanities, which took a huge risk in funding One Week, that our proposal to them was like Jerry Seinfeld’s and George Costanza’s pitch to NBC for a “show about nothing.” I’m sure it was hard for reviewers of our proposal to see its slightly sketchy syllabus. (“You don’t know what will be built ahead of time?!”) But this is the way in which the digital humanities is close to the lab sciences. There can of course be theory and discussion, but there will also have to be a lot of doing if you want to impart full knowledge of the subject. Many times during the week I saw participants and CHNMers convey things to each other—everything from little shortcuts to substantive lessons—that wouldn’t have occurred to us ahead of time, without the team being engaged in actually building something.

MTV Cops

The low point of One Week was undoubtedly my ham-fisted attempt at something of a keynote while the power was out on campus, killing the lights, the internet, and (most seriously) the air conditioning. Following “Less Talk, More Grok,” I never should have done it. But one story I told at the beginning did seem to have modest continuing impact over the week (if frequently as the source of jokes).

Hollywood is famous for great (and laughable) idea pitches—which is why that Seinfeld episode was amusing—but none is perhaps better than Brandon Tartikoff’s brilliantly concise pitch for Miami Vice: “MTV cops.” I’m a firm believer that it’s important to be able to explain a digital tool with something close to the precision of “MTV cops” if you want a significant number of people to use it. Some might object that we academics are smart folks, capable of understanding sophisticated, multivalent tools, but people are busy, and with digital tools there are so many clamoring for attention and each entails a huge commitment (often putting your scholarship into an entirely new system). Scholars, like everyone else, are thus enormously resistant to tools that are hard to grasp. (Case in point: Google Wave.)

I loved the 24 hours of One Week from Monday afternoon to Tuesday afternoon where the group brainstormed potential tools to build and then narrowed them down to “MTV Cops” soundbites. Of course the tools were going to be more complex than these reductionistic soundbites, but those soundbites gave the process some focus and clarity. It also allowed us to ask Twitter followers to vote on general areas of interest (e.g., “Better timelines”) to gauge the market. We tweeted “Blog->Book” for idea #1, which is what became Anthologize.

And what were most of the headlines on launch day? Some variant on the crystal-clear ReadWriteWeb headline: “Scholars Build Blog-to-eBook Tool in One Week.”

Speed Doesn’t Kill

We’ve gotten occasional flak at the Center for History and New Media for some recent efforts that seem more carnival than Ivory Tower, because they seem to throw out the academic emphasis on considered deliberation. (However, it should be noted that we also do many multi-year, sweat-and-tears, time-consuming projects like the National History Education Clearinghouse, putting online the first fifteen years of American history, and creating software used by millions of people.)

But the experience of events like One Week makes me question whether the academic default to deliberation is truly wise. One Weekers could have sat around for a week, a month, a year, and still I suspect that the tool they decided to build was the best choice, with the greatest potential impact. As programmers in the real world know, it’s much better to have partial, working code than to plan everything out in advance. Just by launching Anthologize in alpha and generating all that excitement, the team opened up tremendous reserves of good will, creativity, and problem-solving from users and outside developers. I saw at least ten great new use cases for Anthologize on Twitter in the first day. How are you supposed to come up with those ideas from internal deliberation or extensive planning?

There was also something special about the 24/7 focus the group achieved. The notion that they had to have a tool in one week (crazy on the face of it) demanded that the participants think about that tool all of the time (even in their sleep, evidently). I’ll bet there was the equivalent of several months’ worth of thought that went on during One Week, and the time limit meant that participants didn’t have the luxury of overthinking certain choices that were, at the end of the day, either not that important or equally good options. Eric Johnson, observing One Week on Twitter, called this the power of intense “singular worlds” to get things done. Paul Graham has similarly noted the importance of environments that keep one idea foremost in your mind.

There are probably many other areas where focus, limits, and, yes, speed might help us in academia. Dissertations, for instance, often unhealthily drag on as doctoral students unwisely aim for perfection, or feel they have to write 300 pages even though their breakthrough thesis is contained in a single chapter. I wonder if a targeted writing blitz like the successful National Novel Writing Month might be ported to the academy.

Start Small, Dream Big

As dissertations become books through a process of polish and further thought, so should digital tools iterate toward perfection from humble beginnings. I’ve written in this space about the Center for History and New Media’s love of Voltaire’s dictum that “the perfect is the enemy of the good [enough],” and we communicated to One Week attendees that it was fine to start with a tool that was doable in a week. The only caveat was that the tool should be conceived with such modularity and flexibility that it could grow into something very powerful. The Anthologize launch reminds me of what I said in this space about Zotero on its launch: it was modest, but it had ambition. It was conceived not just as a reference manager but as an extensible platform for research. The few early negative comments about Anthologize similarly misinterpreted it myopically as a PDF-formatter for blogs. Sure, it will do that, as can other services. But like Zotero (and Omeka) Anthologize is a platform that can be broadly extended and repurposed. Most people thankfully got that—it sparked the imagination of many, even though it’s currently just a rough-around-the-edges alpha.

Congrats again to the whole One Week team. Go get some rest.

August 5, 2010 10 Comments

Digital Campus #25 – Get With the Program

We were incredibly lucky to get two of the most sophisticated programming gurus in the humanities, Bill Turkel and Steve Ramsey, on the podcast this week. Bill and Steve are both committed to teaching other humanities scholars how to get started with programming, and they provide a number of terrific points and insights into the process in our feature story. If you’ve ever wanted to pick up programming or know someone who does, it’s definitely worth a listen (or worth passing on the link). We also take a look at the launch of Google App Engine, which raises questions about outsourcing, and myLOC.gov, which raises questions about whether digital collections should have their own personalization tools. [Subscribe to this podcast.]

April 21, 2008

Boggs on the Digital Humanities Design and Development Process

It’s time to subscribe to the blog of CHNM’s Creative Lead, Jeremy Boggs, if you haven’t done so already. Jeremy is ramping up for what promises to be a very important blog series on how to create and execute a digital humanities project, from conception to design to coding to maintenance.

April 8, 2008

The First Principle of Writing Academic Facebook Applications

If you really must line the pockets of Mark Zuckerberg by writing a Facebook application, be sure the application takes advantage of the nature of Facebook. First and foremost, it’s a social networking site, so your application should have some social aspect to it. Many academic Facebook applications are merely search boxes or other non-social search and information services transposed to Facebook (e.g., JSTOR Search or the countless library search widgets). Study Groups, on the other hand, gets it right by emphasizing the networking and collaboration possible within Facebook.

January 23, 2008 1 Comment

MacEachern and Turkel, The Programming Historian

Bill Turkel, the always creative mind behind Digital History Hacks (logrolling disclosure: Bill is a friend of CHNM, a collaborator on various fronts, and was the thought-provoking guest on Digital Campus #9; still, he deserves the compliments), and his colleague at the University of Western Ontario, Alan MacEachern, are planning to write a book entitled The Programming Historian. Better yet, the book will be open access and hosted on the Network in Canadian History & Environment (NiCHE) site. Bill’s summary of the book on his blog sounds terrific. Can’t wait to read it and use it in my classes.

January 14, 2008

Creating a Blog from Scratch, Part 6: One Year Later

Well, it’s been over a year since I started this blog with a mix of trepidation, ambivalence, and faint praise for the genre—not exactly promising stuff—and so it’s with a mixture of relief and a smidgen of smug self-satisfaction that I’m writing this post. I’m extremely glad that I started this blog last fall and have kept it going. (Evidently the half-life of blogs is about three months, so an active year-old blog is, I suppose, some kind of accomplishment in our attention-deficit age.) I thought it would be a good idea (and several correspondents have prodded me in this direction) to return to my series of posts about starting this blog, “Creating a Blog from Scratch.” (For latecomers, this blog is not powered by Blogger, TypePad, or WordPress, but rather by my own feeble concoction of programming and design.) Over the next few posts I’ll be revisiting some of the decisions I made, highlighting some good things that have happened and some regrets. And at the end of the series I’ll be introducing some adjustments to my blog that I hope will make it better. But first, in something of a sequel to my call to my colleagues to join me in this endeavor, “Professors, Start Your Blogs,” some of the triumphs and tribulations I’ve encountered over the last year.

As the five-part series on creating this blog detailed, I took the masochistic step of writing my own blog software (that’s probably a little too generous; it’s really just a set of simple PHP scripts with a MySQL database) because I wanted to learn about how blogs were put together and see if I agreed with all of the assumptions that went into the genre. That learning experience was helpful (and judging by the email I still get about the series, others have found it helpful), but I think I have paid a price in some ways. I will readily admit I’m jealous of other bloggers with their fancy professional blogging software with all of the bells and whistles. Worse, much of the blogosphere is driven by the big mainstream software packages like Blogger, TypePad, and WordPress; having your own blog software means you can’t take advantage of cutting-edge features, like new forms of searching or linking between blogs. But I’m also able to tweak the blog format more readily because I wrote every line of the code that powers this blog.

As I wrote in “Welcome to My Blog,” and as regular readers of this blog know well, I’m not a frequent poster. Sometimes I lament this fact when I see blogs I respect maintain a frantic pace. I’ve written a little over 60 posts (barely better than one per week, although with the Zotero crunch this fall the delays between posts have grown). Many times I’ve felt I had something to post to the blog but just didn’t get around to writing it up. I’m sure other bloggers know that feeling of missed opportunity, which is of course a little silly considering that we’re doing this for free, in our spare time, in most cases without a gun to our heads. But you do begin to feel a responsibility to your audience, and there’s no one to pawn that responsibility off on—you’re simultaneously the head writer, editor, and publisher.

On the other hand, I just did a quick database query and was astonished to discover I’ve written almost 40,000 words in this space (about 160 pages, double-spaced) in the last twelve months. Most posts were around 500-1000 words, with the longest post (Professors, Start Your Blogs) at close to 2000 words. Had you told me that I would write the equivalent of half a book in this space last fall, a) I wouldn’t have believed it, and b) I probably wouldn’t have started this blog.

One of the reasons bloggers feel pressure to post, as I’ve discovered over the last year, is that it’s fairly simple to quantify your audience, often in excruciating detail. As of this writing this blog is ranked 34,181 out of 55 million blogs tracked by Technorati. (This sounds pretty good—the top tenth of a percent of all blogs!—until you realize that there are millions of abandoned and spam blogs, and that like most Internet statistics, the rankings are effectively logarithmic rather than linear. That is, the blog that is ranked 34th is probably a thousand times more prominent than mine; on the other hand, this blog is approximately a thousand times more prominent than the poor blogger at 34,000,000.) Because of that kind of quantification, temptations abound for courting popularity in a way that goes against your (or at least my) blog’s mission. I’ve undoubtedly done some posts that were a little unnecessary and gratuitously attention-seeking. For instance, the most-read post over the last year covered the fingers that have crept into Google’s book scanning project, which of course in its silliness got a lot of play on the popular social news site Digg.com and led to thousands of visitors on the day I posted it and an instant tripling of subscribers to this blog’s feed. But I’m proud to say that my subsequent more serious posts immediately alienated the segment of Digg who are overly fond of exclamation points and my numbers quickly returned to a more modest—but I hope better targeted—audience.
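For anyone who wants to check the ranking arithmetic, here is a minimal sketch in Python. The two figures are the ones quoted above; the variable names are only for illustration.

```python
# Figures quoted in the post; the names are illustrative only.
rank = 34_181
total_blogs = 55_000_000

# Share of tracked blogs ranked at or above this one, as a percentage.
top_share_percent = rank / total_blogs * 100

print(f"Rank {rank:,} of {total_blogs:,} is the top {top_share_percent:.3f}% of blogs")
```

The share works out to roughly six hundredths of a percent, which is impressive only until the caveats about abandoned blogs and skewed rankings are taken into account.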

Surely the happiest and most unexpected outcome of creating this blog has been the way that it has gotten me in touch with dozens of people whom I probably would not have met otherwise. I meet other professional historians all the time, but the blog has introduced me to brilliant and energetic people in libraries, museums, and archives, literary studies, computer science, people within and outside of academia. Given the balkanization of the academy and its distance from “the real world” I have no idea how I would have met these fascinating people otherwise, or profited from their comments and suggestions. I have never been to a conference where someone has come up to me out of the blue and said, “Hi Dan, I’m so-and-so and I wanted to introduce myself because I loved the article you wrote for such-and-such journal.” Yet I regularly have readers of this blog approach me out of the blue, and in turn I seek out others at meetings merely because of their blogs. These experiences have made me feel that blogging has the potential to revitalize academia by creating more frequent interactions between those in a field and, perhaps more important, between those in different fields. So: thanks for reading the blog and for getting in touch!

Next up in the anniversary edition of “Creating a Blog from Scratch”: it’s taken me a year, but I finally weigh in on tagging.

Part 7: Tags, What Are They Good For?

December 11, 2006

Using AJAX Wisely

Since its name was coined on February 18, 2005, AJAX (for Asynchronous JavaScript and XML) has been a much-discussed new web technology. For those not involved in web production, essentially AJAX is a method for dynamically changing parts of a web page without reloading the entire thing; like other dynamic technologies such as Flash, it makes the web browser seem more like a desktop application than a passive window for reading documents. Unlike Flash, however, AJAX applications have generally focused less on interactive graphics (and the often cartoony elements that are now associated with Flash) and more on advanced presentation of text and data, making it attractive to those in academia, libraries, and museums. It’s easy to imagine, for instance, an AJAX-based online library catalog that would allow for an easy refinement of a book search (reordering or adding new possibilities) without a new query submission for each iteration. Despite such promise, or perhaps because of the natural lag between commercial and noncommercial implementations of web technologies, AJAX has not been widely used in academia. That’s fine. Unlike the dot-coms, we should first be asking: What are appropriate uses for AJAX?

As with all technologies, it’s important that AJAX be used in a way that advances the pedagogical, archival, or analytical goals of a project, and with a recognition of its advantages and disadvantages. Such sober assessment is often difficult, however, in the face of hype. Let me put one prick in the AJAX bubble, though, which can help us orient the technology properly: AJAX often scrubs away useful URLs—the critical web addresses students, teachers, and scholars rely on to find and cite web pages and digital objects. For some, the ability to reference documents accurately over time is less of a concern compared to functionality and fancy design—but the lack of URLs for specific “documents” (in the broad sense of the word) on some AJAX sites makes those sites troubling for academic use. Brewster Kahle, the founder of the Internet Archive, surmised that his archive may hold the blog of a future president; if she’s using some of the latest AJAX-based websites, we historians will have a very hard time finding her early thoughts because they won’t have a fixed (and indexable) address.

If not implemented carefully, AJAX (like Flash) could end up like the lamentable 1990s web technology “frames,” which could, for instance, hide the exact address of a scanned medieval folio in a window distinct from the site’s navigation, as in the Koninklijke Bibliotheek’s Medieval Illuminated Manuscripts site—watch how the URL at the top of your browser never changes as you click on different folios, frustrating anyone who wants to reference a specific page. Accurate citations are a core requirement for academic work. We need to be able to reference URLs that aren’t simply a constantly changing, fluid environment.

At the Center for History and New Media, our fantastic web developers Jim Safley and Nate Agrin have implemented AJAX in the right way, I believe, for our Hurricane Digital Memory Bank. In prior projects that gathered recollections and digital objects like photographs for future researchers, such as the September 11 Digital Archive, we worried about making the contribution form too long. We wanted as many people as possible to contribute, but we also knew that itchy web surfers are often put off by multi-page forms to fill out.

Jim and Nate solved this tension brilliantly by making the contribution form for the Hurricane Digital Memory Bank dynamic using AJAX. The form is relatively short but certain sections can change or expand to accept different kinds of objects, text, or geographical information depending on the interactions of the user with the form and accompanying map. It is simultaneously rich and unimposing. When you click on a link that says “Provide More Information” a new section of the form extends beyond the original.

Once a contribution has been accepted, however, it’s assigned a useful, permanent web address that can be referenced easily. Each digital object in the archive, from video to audio to text, has its own unique identifier, which is made explicit at the bottom of the window for that object (e.g., “Cite as: Object #139, Hurricane Digital Memory Bank: Preserving the Stories of Katrina, Rita, and Wilma, 17 November 2005, <http://www.hurricanearchive.org/details.php?id=139>”).
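The idea of one stable address per object can be sketched in a few lines of Python. This is only an illustration built from the example citation above: the function names are hypothetical, and nothing here is the archive's actual code.

```python
from urllib.parse import urlencode

# Base address taken from the example citation in the post.
BASE_URL = "http://www.hurricanearchive.org/details.php"


def permanent_url(object_id: int) -> str:
    """Return the fixed address for one archived object."""
    return f"{BASE_URL}?{urlencode({'id': object_id})}"


def cite(object_id: int, archive_title: str, date: str) -> str:
    """Format a citation in the style shown with each object."""
    return (f"Cite as: Object #{object_id}, {archive_title}, {date}, "
            f"<{permanent_url(object_id)}>")


print(cite(139,
           "Hurricane Digital Memory Bank: Preserving the Stories of "
           "Katrina, Rita, and Wilma",
           "17 November 2005"))
```

Because the address depends only on the object's identifier, it stays the same no matter how dynamic the form that collected the contribution was.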

AJAX will likely have a place in academic digital projects—just a more narrow place than out on the wild web.

May 2, 2006

Creating a Blog from Scratch, Part 5: What is XHTML, and Why Should I Care?

In prior posts in this series (1, 2, 3, and 4), I described with some glee my rash abandonment of common blogging software in favor of writing my own. For my purposes there seemed to be some key disadvantages to these popular packages, including an overemphasis on the calendar (I just saw the definition of a blog at the South by Southwest Interactive Festival—“a page with dated entries”—which, to paraphrase Woody Allen, is like calling War and Peace “a book about Russia”), a sameness to their designs, and comments that are rarely helpful and often filled with spam. But one of the greatest advantages of recent blog software packages is that they generally write standards-compliant code. More specifically, blog software like WordPress automatically produces XHTML. Some of you might be asking, what is XHTML, and who cares? And why would I want to spend a great deal of effort ensuring that this blog complied strictly with this language?

The large digital library contingent that reads this blog could probably enumerate many reasons why XHTML compliance is important, but I had two reasons in mind when I started this blog. (Actually, I had a third, more secretive reason that I’ll mention first: Roy Rosenzweig and I argue in our book Digital History that XHTML will likely be critical for digital humanists to adhere to in the future—don’t want to be accused of being a hypocrite.) For those for whom web acronyms are Greek, XHTML is a sibling of XML, a more rigorously structured and flexible language than the HTML that underlies most of the web. XHTML is better prepared than HTML to be platform-independent; because it separates formatting from content, XHTML (like XML) can be reconfigured easily for very different environments (using, e.g., different style sheets). HTML, with formatting and content inextricably combined, for the most part assumes that you are using a computer screen and a web browser. Theoretically XHTML can be dynamically and instantaneously recast to work on many different devices (including a personal computer). This flexibility is becoming an increasingly important feature as people view websites on a variety of platforms (not just a normal computer screen, e.g., but cell phones or audio browsers for the blind). Indeed, according to the server logs for this blog, 1.6% of visitors are using a smart phone, PDA, or other means to read this blog, a number that will surely grow. In short, XHTML seems better prepared than regular HTML to withstand the technological changes of the coming years, and theoretically should be more easily preserved than older methods of displaying information on the web. For these and other reasons a 2001 report the Smithsonian commissioned recommended the institution move to XHTML from HTML.

Of course, with standards compliance comes extra work. (And extra cost. Just ask webmasters at government agencies trying to make their websites comply with Section 508, the mandatory accessibility rules for federal information resources.) Aside from a brief flirtation with the what-you-see-is-what-you-get, write-the-HTML-for-you program Dreamweaver in the late 1990s, I’ve been composing web pages using a text editor (the superb BBEdit) for over ten years, so my hands are used to typing certain codes in HTML, in the same way you get used to a QWERTY keyboard. XHTML is not that dissimilar from HTML, but it still has enough differences to make life difficult for those used to HTML. You have to remember to close every tag; some attributes related to formatting are in strange new locations. One small example of the minor infractions I frequently trip up on writing XHTML: the oft-used break tag to add a line to a web page must “close itself” by adding a slash before the end bracket (not <br>, but <br />). But I figured doing this blog would give me a good incentive to start writing everything in strict XHTML.
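The break-tag rule is easy to see with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module to test whether a fragment is well-formed XML; the helper name and the sample fragments are made up for illustration. Well-formedness is only one part of XHTML compliance, so passing this check is necessary but not sufficient.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, i.e., every tag is closed."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Old-habit HTML: the break tag is never closed.
html_style = "<p>First line<br>Second line</p>"
# XHTML: the break tag closes itself with a slash.
xhtml_style = "<p>First line<br />Second line</p>"

for fragment in (html_style, xhtml_style):
    print(fragment, "->", is_well_formed(fragment))
```

An XML parser has no notion of HTML's forgiving rules, which is why a single missing slash is enough to make it reject the whole fragment.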

Yeah, right. I clearly haven’t been paying enough attention to detail. The page you’re reading likely still has dozens of little coding errors that make it fail strict compliance with the World Wide Web Consortium’s XHTML standard. (If you would like a humbling experience that brings to mind receiving a pop quiz back from your third-grade teacher with lots of red ink on it, try the W3C’s XHTML Validator.) I haven’t had enough time to go back and correct all of those little missing slashes and quotation marks. WordPress users out there can now begin their snickering; their blog software does such mundane things for them, and many proudly (and annoyingly) display little “XHTML 1.0 compliant” badges on their sites. Go ahead, rub it in.

After I realized that it would take serious effort to bring my code up to code, so to speak, I sat back and did the only thing I could do: rationalize. I didn’t really need strict XHTML compliance because through some design sleight-of-hand I had already been able to make this blog load well on a wide range of devices. I learned from other blog software that if you put the navigation on the right rather than the more common left you see on most websites, the body of each post shows up first on a PDA or smart phone. It also means that blind visitors don’t have to suffer through a long list of your other posts before getting to the article they want to read.

As far as XHTML is concerned, I’ll be brushing up on that this summer. Unless I move this blog to WordPress by then.

Part 6: One Year Later

January 5, 2006

Creating a Blog from Scratch, Part 4: Searching for a Good Search

It often surprises those who have never looked at server logs (the detailed statistics about a website) that a tremendous percentage of site visitors come from searches. In the case of the Center for History and New Media, this is a staggering 400,000 unique visitors a month out of about one million. Furthermore, many of these visitors ignore a website’s navigation and go right to the site search box to complete their quest for information. While I’m not a big fan of consultants that tell webmasters to sacrifice virtually everything for usability, I do feel that searching has been undervalued by digital humanities projects, in part because so much effort goes into digitization, markup, interpretation, and other time-consuming tasks. But there’s another, technical reason too: it’s actually very hard to create an effective search—one, for instance, that finds phrases as well as single words, that is able to rank matches well, and that is easy to maintain through software and server upgrades. In this installment of “Creating a Blog from Scratch” (for those who missed them, here are parts 1, 2, and 3) I’ll take you behind the scenes to explain the pluses and minuses of the various options for adding a search feature to a blog, or any database-driven website for that matter.

There are basically four options for searching a website that is generated out of a database: 1) have the database do it for you, since it already has indexing and searching built in; 2) install another software package on your server that spiders your site, indexes it, and powers your search; 3) use an application programming interface (API) from Google, Yahoo, or MSN to power the search, taking search results from this external source and shoehorning them into your website’s design; 4) outsource the search entirely by passing search queries to Google, Yahoo, or MSN’s website, with a modifier that says “only search my site for these words.”
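To make the fourth option concrete, here is a minimal sketch in Python of passing a query to an outside search engine with a "only search my site" modifier. It assumes Google's site: operator and the www.google.com/search address; the function name and the example.org domain are placeholders, not anything this blog actually uses.

```python
from urllib.parse import urlencode


def site_search_url(query: str, domain: str) -> str:
    """Build a Google search URL restricted to a single domain."""
    # The site: operator limits results to pages on the given domain.
    params = {"q": f"{query} site:{domain}"}
    return "https://www.google.com/search?" + urlencode(params)


print(site_search_url("digital history", "example.org"))
```

A site's search box only has to send the visitor to an address like this one, which is why the option needs almost no code but also leaves the results page in the search engine's hands.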

Option #1 seems like the simplest. Just create an SQL statement (a line of code in database lingo) that sends the visitor’s query to the database software—in the case of this blog, the popular MySQL—and have it return a list of entries that match the query. Unfortunately, I’ve been using MySQL extensively for five years now and have found its ability to match such queries less than adequate. First of all, until the most recent version of MySQL it would not handle phrase searching at all, so you would have to strip quotation marks out of queries and fool the user into believing your site could do something that it couldn’t (that is, do a search like Google could). Secondly, I have found its indexing and ranking schemes to be far behind what you expect from a major search engine. Maybe this has changed in version 5, but for many years it seemed as if MySQL was using search principles from the early 1990s, where the number of times a word appeared on the page signified how well the page matched the query (rather than the importance of the place of each instance of the word on the page, or even better, how important the document was in the constellation of pages that contained that word). MySQL will return a fraction from 0 to 1 for the relevance of a match, but it’s a crude measure. I’m still not convinced, even with the major upgrades in version 5, that MySQL’s searching is acceptable for demanding users.
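The weakness of counting words is easy to demonstrate. The following Python sketch ranks a few made-up posts purely by how often the query's words appear; it illustrates the early-1990s principle described above and is not MySQL's actual ranking algorithm. All names and sample texts are invented for the example.

```python
import re


def term_frequency_score(query: str, text: str) -> int:
    """Count how many times the query's words occur in the text."""
    words = re.findall(r"\w+", text.lower())
    terms = re.findall(r"\w+", query.lower())
    return sum(words.count(term) for term in terms)


def rank_posts(query: str, posts: dict[str, str]) -> list[tuple[str, int]]:
    """Return (title, score) pairs, highest count first, dropping non-matches."""
    scored = [(title, term_frequency_score(query, body))
              for title, body in posts.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Made-up posts for illustration.
posts = {
    "Spam page": "blog blog blog blog blog",
    "Thoughtful essay": "Why every professor should start a blog",
    "Unrelated": "Notes on medieval manuscripts",
}

for title, score in rank_posts("blog", posts):
    print(title, score)
```

The page that merely repeats the word comes out on top, which is exactly the kind of result that ranking by position or by a document's importance is meant to avoid.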

Option #2 is to install specialized search packages such as the open source ht://Dig on your server, point it to your blog (or website) and let it spider the whole thing, just as Google or Yahoo does from the outside. These software packages can do a decent job indexing and swiftly finding documents that seem more relevant than the rankings in MySQL. But using them obviously requires installing and maintaining another complicated piece of software, and I’ve found that spiders have a way of wandering beyond the parameters you’ve set for them, or flaking out during server upgrades. (Over the last few days, for instance, I’ve had two spiders request hundreds of posts from this blog that don’t exist. Maybe they can see into the future.) Anecdotally, I also think that the search results are better from commercial services such as Google or Yahoo.

I’ve become increasingly enamored of Option #3, which is to use APIs, or direct server-to-server communications, with the indices maintained by Google, Yahoo, or Microsoft. The advantage of these APIs is that they provide you with very high quality search results and query handling (at least for Google and Yahoo; MSN is far behind). Ranking is done properly, with the most important documents (e.g., blog posts that many other bloggers link to or that you have referenced many times on your own site) coming up first if there are multiple hits in the search results. And these search giants have far more sophisticated ways of handling phrase searches (even long ones) and boolean searches than MySQL. The disadvantage of APIs is that for some reason the indices made available to software developers are only a fraction of the size of the main indices for these search engines, and are only updated about once a month. So visitors may not find recent material, or some material that is ranked fairly low, through API searches. Another possibility for Option #3 is to use the API for a blog search engine, rather than a broad search engine. For instance, Technorati has a blog-specific search API. Since Technorati automatically receives a ping from my Atom feed every time I post (via FeedBurner), it’s possible that this (or another blog search engine) will ultimately provide a solid API-based search.

I’ve been experimenting with ways of getting new material into the main Google index swiftly (i.e., within a day or two rather than a month or two), and have come up with a good enough solution that I have chosen Option #4: outsourcing the search entirely to Google, by using their free (though unfortunately ad-supported) site-specific search. With little fanfare, this year Google released Google Sitemaps, which provides an easy way for those who maintain websites, especially database-driven ones, to list all of their web pages in an XML file that follows a published schema. (Spiders often miss web pages generated out of a database because there are often so many of them, and some of these pages may not be linked to.) While not guaranteeing that everything in your sitemap will be crawled and indexed, Google does say that a sitemap helps them crawl your site more effectively. (By the way, Google’s recent acquisition of 5 percent of AOL seems to have been, at least ostensibly, very much about providing AOL with better crawls, thus upping the visibility of their millions of pages without messing with Google’s ranking schemes.) And here’s the big news, if you’ve made it this far: I’ve found that having a sitemap gets new blog posts into the main Google index extremely fast. Indeed, usually within 24 hours of submitting a new post Google downloads my updated sitemap (created automatically by a PHP script I’ve written), sees the new URL for the post, and adds it to its index. This means that I can very effectively use Google’s main search engine for this blog, although because I’m not using the API I can’t format the results page to match the design of my site exactly.
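For readers who want to see what such a sitemap generator might look like, here is a minimal sketch in Python rather than the PHP script mentioned above, which is not reproduced here. It assumes the later sitemaps.org 0.9 namespace (the original Google Sitemaps release used Google's own schema), and the function name and example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org protocol namespace, not the 2005 Google schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(posts) -> str:
    """Build sitemap XML from an iterable of (url, last_modified) pairs.

    `last_modified` is a date string such as "2005-12-26".
    """
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in posts:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")
```

In a database-driven blog, the list of pairs would come from a query over the posts table each time a new post is published, so the sitemap always includes the newest URL.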

One final note, and I think an important one for those looking to increase the visibility of their blog posts (or any web page created from a database) in Google’s search results: have good URLs, i.e., ones with important keywords rather than meaningless numbers or letters. Database-driven sites often have poor URLs featuring an ugly string of variables, which is a shame, since server technology (such as Apache’s mod_rewrite) allows webmasters to replace these variables with more memorable words. Moreover, Google, Yahoo, and other search engines clearly favor keywords in URLs (very apparent when you begin to work with Google’s Web API), assigning them a high value when determining the relevance of a web page to a query. Some blog software automatically creates good URLs (like Blogger, owned by Google), while many other software packages do not, typically emphasizing the date of a post in the URL or the page number in the blog. For my own blogging software, I designed a special field in the database just for URLs, so I can craft a particularly relevant and keyword-laden string. Mod_rewrite takes care of the rest, translating this string into an ID number that’s retrieved by the database to generate the page you’re reading.
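The keyword-URL idea can be sketched in a few lines of Python. This is only an illustration: the real setup described above relies on a hand-edited database field and Apache's mod_rewrite, while here a dictionary stands in for the database lookup. The names `make_slug`, `resolve_post_id`, and `SLUG_TO_ID`, and the ID value, are all hypothetical.

```python
import re
import unicodedata


def make_slug(title: str) -> str:
    """Suggest a keyword-based URL string from a post title."""
    # Drop accents and any other non-ASCII characters, then keep word-like runs.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    return "-".join(re.findall(r"[a-z0-9]+", ascii_title.lower()))


# Hypothetical stand-in for the database column that stores each post's URL string.
SLUG_TO_ID = {"searching-for-a-good-search": 4}


def resolve_post_id(path: str):
    """Map a request path to a post ID, or None if no post has that slug."""
    slug = path.strip("/").split("/")[-1]
    return SLUG_TO_ID.get(slug)
```

A generated slug is only a starting point. Storing it in its own field, as described above, leaves room to rewrite it by hand with better keywords before the post goes live.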

For many reasons, including making it accessible to alternative platforms such as audio browsers and cell phones, I wanted to generate this page in strict XHTML, unlike my old website, which had poor coding practices left over from the 1990s. Unfortunately, as the next post in this series details, I failed terribly in the pursuit of this goal, and this floundering made me think twice about writing my own blogging software when existing packages like WordPress will generate XHTML for you, with no fuss.

Part 5: What is XHTML, and Why Should I Care?

December 26, 2005

Creating a Blog from Scratch, Part 3: The Double Life of Blogs

In the first two posts in this series, I discussed the origins of blogs and how they led to certain elements in popular blog software that were in some cases good and in others bad for my own purposes—to start a blog that consisted of short articles on the intersection of digital technology, the humanities, and related topics (rather than my personal life or links with commentary). What I didn’t realize as I set about writing my own blog software from scratch for this project was that in truth a blog leads two lives: one as a website and another as a feed, or syndicated digest of the more complete website. Understanding this double life and the role and details of RSS feeds led to further thoughts about how to design a blog, and how certain choices are encoded into blogging software. Those choices, as I’ll explain in this post, determine to a large extent what kind of blog you are writing.

Creating a blog from scratch is a great first project for someone new to web programming and databases. It involves putting things into a database (writing a post), taking them out to insert into a web page, and perhaps coding some secondary scripts such as a search engine or a feed generator. It stretches all of the scripting muscles without pulling them. And in my case, I had already decided (as discussed in the prior post) I didn’t need (or want) a lot of bells and whistles, like a system for allowing comments or trackback/ping features. So I began to write my blogging application with the assumption that it would be a very straightforward affair.

Indeed, at first designing the database couldn’t have been simpler. The columns (as database fields are called in MySQL) were naturally an ID number, a title, the body of the post (the text you’re reading right now), and the time posted (while I disparaged time in my last post, I thought it was important enough to keep this field for reference). But then I began to consider two things that were critical and that I thought popular blogging software did well: automatically generating an RSS feed (or feeds) and, for some blog software, creating URLs for each post that nicely contain keywords from the title. This latter feature is very important for visibility in search engines (the topic of my next post in this series).
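The original four-column design can be sketched as follows. This is an assumption-laden illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names, and helper functions are hypothetical.

```python
import sqlite3


def create_posts_table(conn: sqlite3.Connection) -> None:
    """Create the simple four-column table: ID, title, body, time posted."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS posts (
            id        INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            body      TEXT NOT NULL,
            posted_at TEXT NOT NULL
        )
        """
    )


def add_post(conn: sqlite3.Connection, title: str, body: str, posted_at: str) -> int:
    """Put a post into the database and return its new ID."""
    cursor = conn.execute(
        "INSERT INTO posts (title, body, posted_at) VALUES (?, ?, ?)",
        (title, body, posted_at),
    )
    conn.commit()
    return cursor.lastrowid


def get_post(conn: sqlite3.Connection, post_id: int):
    """Take a post back out, ready to be inserted into a web page."""
    return conn.execute(
        "SELECT title, body, posted_at FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
```

The feed summary and the keyword URL would each need an extra column beyond these four, which is why the schema ends up slightly more elaborate.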

I know a lot about search engines, but knew very little about RSS feeds, and when I started to think about what my blogging application would need to autogenerate its own feed, I realized the database had to be slightly more elaborate than my original schema. I had of course heard of RSS before, and the idea of syndication. But before I began to think about this blog and write the software that drives it I hadn’t really understood its significance or its complexity. It’s supposed to be “Really Simple Syndication” as one of the definitions for the acronym RSS asserts, and indeed the XML schemas for various RSS feeds are relatively simple.

But they also contain certain assumptions. For instance, all RSS feeds have a description of each blog post (in the Atom feed, it’s called the “summary”), but these descriptions vary from feed to feed and among different RSS feed types. For some it is the first paragraph of the post, for others the first 100 characters, and for still others it’s a specially crafted “teaser” (to use the TV news lingo for lines like “Coming up, what every parent needs to know to prevent a dingo from eating their baby”). Along with the title to the post this snippet is generally the only thing many people see—the people who don’t visit your blog’s website but only scan it in syndication.

So what kind of description did I want, and how to create it? I liked the idea of an autogenerated snippet—press “post” and you’re done. On the other hand, choosing a random number of characters out of a hat that would work well for every post seemed silly. I’m used to writing abstracts for publications, so that was another possibility. But then I would have to do even more work after I finished writing a post. And I also needed to factor in that the description had to prod people to look at the whole post on my website, something the “teaser” people understood. So I decided to compromise and leave part to the code and part to me: I would have the software take the first paragraph of the entire post and use that as the default summary for the feed, but I had the software put a copy of this paragraph in the database so I could edit it if I wanted to. Automated but alterable. And I also decided that I needed to change my normal academic writing style to have more of an enticing end to every first paragraph. The first paragraph would be somewhere between an abstract and a teaser.
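The "automated but alterable" compromise might look something like this minimal Python sketch. It assumes paragraphs in the post body are separated by blank lines, and the function names are hypothetical rather than taken from the actual blog software.

```python
import re


def default_summary(body: str) -> str:
    """Return the first non-empty paragraph of a post, collapsed onto one line."""
    for paragraph in re.split(r"\n\s*\n", body):
        if paragraph.strip():
            return " ".join(paragraph.split())
    return ""


def summary_for_feed(body: str, stored_summary=None) -> str:
    """Prefer a hand-edited summary from the database; else use the first paragraph."""
    if stored_summary and stored_summary.strip():
        return stored_summary.strip()
    return default_summary(body)
```

When a post is saved, the result of `default_summary` would be copied into its own database field. Editing that field later changes what feed readers see without touching the post itself.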

Having made this decision and others like it for the other fields in the feed (should I have “channels”? why does there have to be an “author” field in a feed for a solo blog?), I had to choose which feed to create: RSS 1.0, RSS 2.0, Atom? Why there are so many feed types somewhat eludes me; I find it weird to go to some blogs that have little “chicklets” for every feed under the sun. I’ve now read a lot about the history of RSS but it seems like something that should have become a single international standard early on. Anyway, I wanted the maximum “subscribability” for my blog but didn’t want to suffer writing a PHP script to generate every single kind of feed.

So, time for outsourcing. I created a single script to generate an Atom 1.0 feed (I think the best choice if you’re just starting out), which is picked up by FeedBurner and recast into all of those slightly incompatible other feed types depending on what the subscriber needs.
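As a rough idea of what a single feed-generating script involves, here is a minimal Python sketch of an Atom 1.0 feed builder (the script described above was written in PHP and is not reproduced here). The function names and the dictionary keys for each entry are hypothetical. The sketch also hints at why an "author" field exists even for a solo blog: the Atom specification expects an author at the feed level unless every entry supplies its own.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def _add(parent, tag, text=None, **attrs):
    """Append a namespaced Atom element, optionally with text and attributes."""
    element = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}", attrs)
    if text is not None:
        element.text = text
    return element


def build_atom_feed(feed_id, title, author, updated, entries) -> str:
    """Build an Atom 1.0 feed.

    `entries` is a list of dicts with the hypothetical keys
    "url", "title", "updated" (an RFC 3339 timestamp), and "summary".
    """
    ET.register_namespace("", ATOM_NS)  # serialize with a default xmlns
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    _add(feed, "id", feed_id)
    _add(feed, "title", title)
    _add(feed, "updated", updated)
    _add(_add(feed, "author"), "name", author)
    for post in entries:
        entry = _add(feed, "entry")
        _add(entry, "id", post["url"])  # the permalink doubles as the entry ID
        _add(entry, "title", post["title"])
        _add(entry, "updated", post["updated"])
        _add(entry, "link", href=post["url"])
        _add(entry, "summary", post["summary"])
    return ET.tostring(feed, encoding="unicode")
```

Producing only this one format and letting a service such as FeedBurner convert it keeps the blog software small, at the cost of depending on an outside service for the other feed types.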

This would not be the first time I would get lazy and outsource. In the next part of this series, I discuss the different ways one can provide search for a blog, and why I’m currently letting Google do the heavy lifting.

Part 4: Searching for a Good Search

December 22, 2005