Dan Cohen

Digital Humanities: Theory & Practice
Covering issues of importance in the methodology and technology of humanities computing.



The Perils of Anonymity
Posted to Digital Humanities: Theory & Practice on 10 July 2007, 10:47 AM EDT

PhDinHistory, we hardly knew ye. A blogger who came out of nowhere to write interesting, thorough analyses of the state of academia and trends in history captured my attention from the first post and eventually garnered a much wider audience. Then suddenly, this weekend, PhDinHistory deleted his or her WordPress account. No goodbye post and static archive of the blog, but rather a full deletion that made it permanently impossible to read or link to the blog. I didn't always agree with PhDinHistory, but as a blogger who also wanted to write more in-depth pieces rather than quick blogish ones (although more recently I have cheated by adding into my feed smaller posts from ma.gnolia), I truly respected the effort that went into this new blog. From the beginning, however, I thought there was one major problem with PhDinHistory's blog: its anonymity... [Read on...]


Million Books Workshop Wrap-up
Posted to Digital Humanities: Theory & Practice on 24 May 2007, 10:11 AM EDT

May has been a month of travel for me (thus the light posting in this space). I gave a talk about Zotero and related developments in the humanities and technology at the Stanford Humanities Center, and spoke at the annual meeting of the American Council of Learned Societies about how digital research is a major emerging theme in scholarship. Finally, I participated in the Tufts "Million Books" Workshop, which explored the technical feasibility and theoretical validity of extracting evidence and meaning from the large new corpora of online texts. The three main topics were how to get from scanned documents (especially the complicated ones that scholars sometimes encounter, like Sanskrit manuscripts or early modern broadsides, rather than simply formatted texts like modern English books) to machine-readable text that can be searched and analyzed; machine translation of texts; and moving from text to actionable data (e.g., extracting all of the place names from a document or summarizing large masses of text). Some developments worth noting from the workshop:.. [Read on...]


Digital Humanities Summit Wrap-up
Posted to Digital Humanities: Theory & Practice on 19 April 2007, 10:31 PM EDT

I've been going to digital humanities conferences of one kind or another for many years now, but last week's summit of digital humanities centers at the National Endowment for the Humanities showed that finally there is extraordinary interest in the field. People in positions of power and influence showed up for the first time. Vint Cerf (one of the founders of the internet and now evangelist for Google) was there. Many funders came as well, which is important since the work I and my colleagues do at the Center for History and New Media is not inexpensive. In general, an excitement permeated the room: the feeling that with the exponential growth in digitization and the rise of digital tools, we are on the cusp of a new age of scholarship. Here are my rough notes from the meeting... [Read on...]


It's About Russia
Posted to Digital Humanities: Theory & Practice on 6 March 2007, 10:38 PM EST

One of my favorite Woody Allen quips from his tragically short period as a stand-up comic is the punch line to his hyperbolic story about taking a speed-reading course and then digesting all of War and Peace in twenty minutes. The audience begins to giggle at the silliness of reading Tolstoy's massive tome in a brief sitting. Allen then kills them with his summary of the book: "It's about Russia." The joke came to mind recently as I read the self-congratulatory blog post by IBM's Many Eyes visualization project, applauding their first month on the web. (And I'm feeling a little embarrassed by my post on the one-year anniversary of this blog.) The Many Eyes researchers point to successes such as this groundbreaking visualization of the New Testament:.. [Read on...]


A Closer Look at the National Archives-Footnote Agreement
Posted to Digital Humanities: Theory & Practice on 5 February 2007, 2:09 PM EST

I've spent the past two weeks trying to get a better understanding of the agreement signed by the National Archives and Footnote, about which I raised several concerns in my last post. Before making further (possibly unfounded) criticisms I thought it would be a good idea to talk to both NARA and Footnote. So I picked up the phone and found several people eager to clarify things. At NARA, Jim Hastings, director of access programs, was particularly helpful in explaining their perspective. (Alas, NARA's public affairs staff seemed to have only the sketchiest sense of key details.) Most helpful, and most eager to rebut my earlier post, were Justin Schroepfer and Peter Drinkwater, the marketing director and product lead at Footnote. Much to their credit, Justin and Peter patiently answered most of my questions about the agreement and the operation of the Footnote website... [Read on...]


The Flawed Agreement between the National Archives and Footnote, Inc.
Posted to Digital Humanities: Theory & Practice on 15 January 2007, 9:18 PM EST

I suppose it's not breaking news that libraries and archives aren't flush with cash. So it must be hard for a director of such an institution when a large corporation, or even a relatively small one, comes knocking with an offer to digitize one's holdings in exchange for some kind of commercial rights to the contents. But as a historian worried about open access to our cultural heritage, I'm a little concerned about the new agreement between Footnote, Inc. and the United States National Archives. And I'm surprised that somehow this agreement has thus far flown under the radar of all of those who attacked the troublesome Smithsonian/Showtime agreement. Guess what? From now until 2012 it will cost you $100 a year, or even more offensively, $1.99 a page, for online access to critical historical documents such as the Papers of the Continental Congress... [Read on...]


Creating a Blog from Scratch, Part 8: Full Feeds vs. Partial Feeds
Posted to Digital Humanities: Theory & Practice on 11 January 2007, 11:18 AM EST

One seemingly minor aspect of blogs I failed to consider carefully when I programmed this site was the composition of its feed. (Frankly, I was more concerned with the merely technical question of how to write code that spits out a valid RSS or Atom feed.) Looking at a lot of blogs and their feeds, I just assumed that the standard way of doing it was to put a small part of the full post in the feed—e.g., the first 50 words or the first paragraph—and then let the reader click through to the full post on your site. I noticed that some bloggers put their entire blog in their feed, but as a new blogger—one who had just spent a lot of time redesigning his old website to accommodate a blog—I couldn't figure out why one would want to do that since it rendered your site irrelevant. It may seem minor, but a year later I've realized that there is, in part, a philosophical difference between a full and partial feed. Choosing which type of feed you are going to use means making a choice about the nature of your blog—and, surprisingly, the nature of your ego too. Subscribers to this blog's feed have probably noticed that as of my last post I've switched from a partial feed to a full feed, so you already know the outcome of the debate I've had in my head about this distinction, but let me explain my reasoning and the advantages and disadvantages of full and partial feeds... [Read on...]
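To make the full-versus-partial distinction concrete, here is a minimal sketch in Python of how a feed generator might pick the text for each item. It is not the code behind this blog: the function name `feed_description` and its parameters are hypothetical, and the 50-word cutoff simply follows the "first 50 words" example mentioned above.

```python
def feed_description(post_text: str, full: bool = False, max_words: int = 50) -> str:
    """Return the text to place in a feed item: the whole post or a short teaser.

    Hypothetical helper for illustration only; a real feed would also need
    escaping and valid RSS or Atom markup around this text.
    """
    if full:
        # Full feed: the reader never has to click through to the site.
        return post_text

    words = post_text.split()
    if len(words) <= max_words:
        # Short posts fit entirely within the teaser limit.
        return post_text

    # Partial feed: keep the first max_words words and mark the truncation.
    return " ".join(words[:max_words]) + "..."
```

Switching a blog from a partial to a full feed then amounts to changing the `full` flag, which is why the choice is less a technical matter than a decision about what the site is for.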


Intelligence Analysts and Humanities Scholars
Posted to Digital Humanities: Theory & Practice on 13 November 2006, 8:58 PM EST

About halfway through the Chicago Colloquium on Digital Humanities and Computer Science last week, the always witty and insightful Martin Mueller humorously interjected: "I will go away from this conference with the knowledge that intelligence analysts and literary scholars are exactly the same." As the chuckles from the audience died down, the core truth of the joke settled in—for those interested in advancing the still-nascent field of the digital humanities, are academic researchers indeed becoming clones of intelligence analysts by picking up the latter's digital tools? What exactly is the difference between an intelligence analyst and a scholar who is scanning, sorting, and aggregating information from massive electronic corpora?.. [Read on...]


NEH Digital Humanities Start-Up Grants
Posted to Digital Humanities: Theory & Practice on 21 September 2006, 2:43 PM EDT

Brett Bobley, the CIO at the National Endowment for the Humanities and the chair of the new (and very exciting) Digital Humanities Initiative, wrote to me to ask for some publicity for their programs, especially for the Digital Humanities Start-Up Grants. Happy to do so. (Undoubtedly I'll apply for this at some point in the future and could use less competition, so I probably should keep quiet...but duty and dedication to this blog's audience call.) The Start-Up Grants seem like a great way to initiate a project like Zotero. From Brett:.. [Read on...]


Raw Archives and Hurricane Katrina
Posted to Digital Humanities: Theory & Practice on 28 August 2006, 12:24 PM EDT

Several weeks ago during my talk on the "Possibilities and Problems of Digital History and Digital Collections" at the joint meeting of the Council of State Archivists, the National Association of Government Archives and Records Administrators, and the Society of American Archivists (CoSA, NAGARA, and SAA), I received a pointed criticism from an audience member during the question-and-answer period. I had just shown the September 11 Digital Archive, and the questioner wanted to know how this qualified as an "archive," since archives are generally based upon rigorous principles of value, selection, and provenance. It's a valid critique, though a distinction that might be lost on a layperson who is unaware of archival science and might consider their shoebox of photos an "archive." Maybe it's time for a new term: the raw archive. On the Internet, these raw archives are all around us... [Read on...]


Professors, Start Your Blogs
Posted to Digital Humanities: Theory & Practice on 21 August 2006, 3:20 PM EDT

With a new school year about to begin, I want to reach out to other professors (and professors-to-be, i.e., graduate students) to try to convince more of them to start their own blogs. It's the perfect time to start a blog, and many of the reasons academics state for not having a blog are, I believe, either red herrings or just plain false. So first, let me counter some biases and concerns I hear from a lot of my peers (and others in the ivory tower) when the word "blog" is mentioned... [Read on...]


The Perfect and the Good Enough: Books and Wikis
Posted to Digital Humanities: Theory & Practice on 21 June 2006, 9:10 AM EDT

As you may have noticed, I haven't posted to my blog for an entire month. I have a good excuse: I just finished the final edits on my forthcoming book, Equations from God: Pure Mathematics and Victorian Faith, due out early next year. (I realized too late that I could have capitalized on Da Vinci Code fever and called the book The God Code, thus putting an intellectual and cultural history of Victorian mathematics in the hands of numerous unsuspecting Barnes & Noble shoppers.) The process of writing a book has occasionally been compared to pregnancy and childbirth; as the awe-struck husband of a wife who bore twins, I suspect this comparison is deeply flawed. But on a more superficial level, I guess one can say that it's a long process that produces something of which one can be very proud, but which can involve some painful moments. These labor pains are especially pronounced (at least for me) in the final phase of book production, in which all of the final adjustments are made and tiny little errors (formatting, spelling, grammar) are corrected. From the "final" draft of a manuscript until its appearance in print, this process can take an entire year. Reading Roy Rosenzweig's thought-provoking article on the production of the Wikipedia, just published in the Journal of American History, was apropos: it got me thinking about the value of this extra year of production work on printed materials and its relationship to what's going on online now... [Read on...]


Mapping Recent History
Posted to Digital Humanities: Theory & Practice on 11 April 2006, 2:15 PM EDT

As the saying goes, imitation is the sincerest form of flattery. So at the Center for History and New Media, we're currently feeling extremely flattered that our initiatives in collecting and presenting recent history—the Echo Project (covering the history of science, technology, and industry), the September 11 Digital Archive, and the Hurricane Digital Memory Bank—are being imitated by people using a wave of new websites that help them locate recollections, images, and other digital objects on a map. Here's an example from the mapping site Platial:.. [Read on...]


Measuring the Audience of a Digital Humanities Project
Posted to Digital Humanities: Theory & Practice on 4 April 2006, 2:11 PM EDT

Karen Motylewski of the Institute of Museum and Library Services recently pressed an audience of recent IMLS grantees to think about how they might measure the success of their digital projects. As she was well aware, academics often bristle at the quantitative measurement of the audience for their websites because it smacks of commercialism. Also, we professors and librarians and curators generally avoid taking classes in such base topics as marketing. But Karen has a point. Indeed, Roy Rosenzweig and I devote an entire chapter in Digital History to how to build an audience—not for commercial or narcissistic reasons, but because an academic digital project should be, as we say, "useful and used." I started this blog to explain in greater depth some of the projects and research I'm working on in the digital humanities, but I also did it (as readers of my five-part series on "Creating a Blog from Scratch" will know; 1, 2, 3, 4, 5) to learn first-hand about the composition of blogs and the technologies behind them. Writing my own code for this blog forced me to examine in detail—and occasionally rethink—some blogging conventions (technical, design, and content). And one of the benefits of doing so has been a realization that I have significantly underestimated the power of RSS. I now think it may be the best measurement of utility for an academic website, far better than server logs or other quantitative measurements. Let me explain why... [Read on...]


Search Engine Optimization for Smarties
Posted to Digital Humanities: Theory & Practice on 26 March 2006, 10:00 PM EST

A Google search for "Sputnik" gives you an authoritative site from NASA in the top ten search results, but also a web page from the skydiver and ballroom-dancing enthusiast Michael Wright. This wildly democratic mix of sources perennially leads some educators to wring their hands about the state of knowledge, as yet another op-ed piece in the New York Times does today ("Searching for Dummies" by Edward Tenner). It's a strange moment for the Times to publish this kind of lament; it seems like an op-ed left over from 1997, and as I've previously written in this space (and elsewhere with Roy Rosenzweig), contrary to Tenner's one example of searching in vain for "World History," online historical information is actually getting better, not worse (especially if you assess the web as a whole rather than complain about a few top search results). Anyway, Tenner does make one very good point: "More owners of free high-quality content should learn the tradecraft of tweaking their sites to improve search engine rankings." This "tradecraft" is generally called "search engine optimization," and I've long thought I should let those in academia (and other creators of reliable, noncommercial digital resources) in on the not-so-secret ways you can move your website higher up in the Google rankings (as well as in the rankings of other search engines)... [Read on...]


What Would You Do With a Million Books?
Posted to Digital Humanities: Theory & Practice on 17 March 2006, 11:56 AM EST

What would you do with a million digital books? That's the intriguing question this month's D-Lib Magazine asked its contributors, as an exercise in understanding what might happen when massive digitization projects from Google, the Open Content Alliance, and others reach their fruition. I was lucky enough to be asked to write one of the responses, "From Babel to Knowledge: Data Mining Large Digital Collections," in which I discuss in much greater depth the techniques behind some of my web-based research tools. (A bonus for readers of the article: learn about the secret connection between cocktail recipes and search engines.) Most important, many of the contributors make recommendations for owners of any substantial online resource. My three suggestions, summarized here, focus on why openness is important (beyond just "free beer" and "free speech" arguments), the relatively unexplored potential of application programming interfaces (APIs), and the curious implications of information theory... [Read on...]


When Machines Are the Audience
Posted to Digital Humanities: Theory & Practice on 2 March 2006, 11:24 AM EST

I recently received an email from someone at the Woodrow Wilson Center that began in the following way: "Dear Sir/Madam: I was wondering if you might share the following fellowship opportunity with the members of your list...The Africa Program is pleased to announce that it is now accepting applications..." The email was, of course, tagged as spam by my email software, since it looked suspiciously like what the U.S. Secret Service calls a 419 fraud scheme, or a scam where someone (generally from Africa) asks you to send them your bank account information so they can smuggle cash out of their country (the transfer then occurs in the opposite direction, in case you were wondering). Checking the email against a statistical list of high-likelihood spam triggers identified the repeated use of words such as "application," "generous," "Africa," and "award," as well as the phrases "submitted electronically" and the opening "Dear Sir/Madam." The email piqued my curiosity because over the past year I've started altering some of my email writing to avoid precisely this problem of a "false positive" spam label, e.g., never sending just an attachment with no text (a classic spam trigger) and avoiding the use of phrases such as "Hey, you've got to look at this." In other words, I've semi-consciously started writing for a new audience: machines. One of the central theories of humanities disciplines such as literature and history is that our subjects write for an audience (or audiences). What happens when machines are part of this audience?.. [Read on...]


No Computer Left Behind
Posted to Digital Humanities: Theory & Practice on 20 February 2006, 9:37 AM EST

In this week's issue of the Chronicle of Higher Education Roy Rosenzweig and I elaborate on the implications of my H-Bot software, and of similar data-mining services and the web in general. "No Computer Left Behind" (cover story in the Chronicle Review; alas, subscription required, though here's a copy at CHNM) is somewhat more polemical than our recent article in First Monday ("Web of Lies? Historical Knowledge on the Internet"). In short, we argue that just as the calculator—an unavoidable modern technology—muscled its way into the mathematics exam room, devices to access and quickly scan the vast store of historical knowledge on the Internet (such as PDAs and smart phones) will inevitably disrupt the testing—and thus instruction—of humanities subjects. As the editors of the Chronicle put it in their headline: "The multiple-choice test is on its deathbed." This development is to be praised; just as the teaching of mathematics should be about higher principles rather than the rote memorization of multiplication tables, the teaching of subjects like history should be freed by new technologies to focus once again (as it was before a century of multiple-choice exams) on more important principles such as the analysis and synthesis of primary sources. Here are some excerpts from the article... [Read on...]


"Legal Cheating" in the Wall Street Journal
Posted to Digital Humanities: Theory & Practice on 22 January 2006, 10:03 PM EST

In a forthcoming article in the Chronicle of Higher Education, Roy Rosenzweig and I argue that the ubiquity of the Internet in students' lives and advances in digital information retrieval threaten to erode multiple-choice testing, and much of standardized testing in general. A revealing article in this weekend's Wall Street Journal shows that some schools are already ahead of the curve: "In a wireless age where kids can access the Internet's vast store of information from their cellphones and PDAs, schools have been wrestling with how to stem the tide of high-tech cheating. Now some educators say they have the answer: Change the rules and make it legal. In doing so, they're permitting all kinds of behavior that had been considered off-limits just a few years ago." So which anything-goes schools are permitting this behavior, and what exactly are they doing?.. [Read on...]


Data on How Professors Use Technology
Posted to Digital Humanities: Theory & Practice on 15 January 2006, 2:56 PM EST

Rob Townsend, the Assistant Director of Research and Publications at the American Historical Association and the author of many insightful (and often indispensable) reports about the state of higher education, writes with some telling new data from the latest National Study of Postsecondary Faculty (conducted by the U.S. Department of Education roughly every five years since 1987). Rob focused on several questions about the use of technology in colleges and universities. The results are somewhat surprising and thought-provoking... [Read on...]


Hurricane Digital Memory Bank Featured on CNN
Posted to Digital Humanities: Theory & Practice on 4 January 2006, 9:30 PM EST

I was interviewed yesterday by CNN about a new project at the Center for History and New Media, the Hurricane Digital Memory Bank, which uses digital technology to record memories, photographs, and other media related to Hurricanes Katrina, Rita, and Wilma. (CNN is going to feature the project sometime this week on its program The Situation Room.) The HDMB is a democratic historical project similar to our September 11 Digital Archive, which saved the recollections and digital files of tens of thousands of contributors from around the world; this time we're trying to save thousands of perspectives on what occurred on the Gulf Coast in the fall of 2005. What amazes me is how the interest in online historical projects and collections has exploded recently. Several of the web projects I've co-directed over the last five years have engaged in collecting history online. But even a project with as prominent a topic as September 11 took a long time to be picked up by the mass media. This time CNN called us just a few weeks after we launched the website, and before we've done any real publicity. Here are three developments from the last two years that I think account for this sharply increased interest... [Read on...]


Rough Start for Digital Preservation
Posted to Digital Humanities: Theory & Practice on 2 January 2006, 3:23 PM EST

How hard will it be to preserve today's digital record for tomorrow's historians, researchers, and students? Judging by the preliminary results of some attempts to save for the distant future the September 11 Digital Archive (a project I co-directed), it won't be easy. While there are some bright spots to the reports in D-Lib Magazine last month on the efforts of four groups to "ingest" (or digitally accession) the thousands of files from the 9/11 collection, the overall picture is a little bit sobering. And this is a fairly well-curated (though by no means perfect) collection. Just imagine what ingesting a messy digital collection, e.g., the hard drive of your average professor, would entail. Here are some of the important lessons from these early digital preservation attempts, as I see it... [Read on...]


2006: Crossroads for Copyright
Posted to Digital Humanities: Theory & Practice on 29 December 2005, 9:11 PM EST

The coming year is shaping up as one in which a number of copyright and intellectual property issues will be highly contested or resolved, likely having a significant impact on academia and researchers who wish to use digital materials in the humanities. In short, at stake in 2006 are the ground rules for how professors, teachers, and students may carry out their work using computer technology and the Internet. Here are three major items to follow closely... [Read on...]


Nature Compares Science Entries in Wikipedia with Encyclopaedia Britannica
Posted to Digital Humanities: Theory & Practice on 14 December 2005, 9:02 PM EST

In an article published tomorrow, but online now, the journal Nature reveals the results of a (relatively small) study it conducted to compare the accuracy of Wikipedia with Encyclopaedia Britannica—at least in the natural sciences. The results may strike some as surprising... [Read on...]


Do APIs Have a Place in the Digital Humanities?
Posted to Digital Humanities: Theory & Practice on 21 November 2005, 7:31 PM EST

Since the 1960s, computer scientists have used application programming interfaces (APIs) to provide colleagues with robust, direct access to their databases and digital tools. Access via APIs is generally far more powerful than simple web-based access. APIs often include complex methods drawn from programming languages—precise ways of choosing materials to extract, methods to generate statistics, ways of searching, culling, and pulling together disparate data—that enable outside users to develop their own tools or information resources based on the work of others. In short, APIs hold great promise as a method for combining and manipulating various digital resources and tools in a free-form and potent way... [Read on...]