Category Archives: Research

Second Chicago Colloquium on Digital Humanities and Computer Science

I went to the first of these last November and it’s well worth attending. This year’s theme is “exploring the scholarly query potential of high quality text and image archives in a collaborative environment.” The colloquium will take place on October 21-22, 2007, with proposals due July 31, 2007.

2007 Vectors Summer Fellowships

Vectors: Journal of Culture and Technology in a Dynamic Vernacular has announced its fourth annual summer fellowship program to take place in June 2007 at USC. They are seeking proposals for projects related to “reading” and “noise.” About Vectors: “Vectors publishes work which need necessarily exist online, ranging from archival to experimental projects.”

It’s About Russia

One of my favorite Woody Allen quips from his tragically short period as a stand-up comic is the punch line to his hyperbolic story about taking a speed-reading course and then digesting all of War and Peace in twenty minutes. The audience begins to giggle at the silliness of reading Tolstoy’s massive tome in a brief sitting. Allen then kills them with his summary of the book: “It’s about Russia.” The joke came to mind recently as I read the self-congratulatory blog post by IBM’s Many Eyes visualization project, applauding their first month on the web. (And I’m feeling a little embarrassed by my post on the one-year anniversary of this blog.) The Many Eyes researchers point to successes such as this groundbreaking visualization of the New Testament:

News flash: Jesus is a big deal in the New Testament. Even exploring the “network” of figures who are “mentioned together” (ostensibly the point of this visualization) doesn’t provide the kind of insight that even a first-year student in theology could provide over coffee. I have been slow to appreciate the power of textual visualization—in large part because I’ve seen far too many visualizations like this one, that merely use computational methods to reveal the obvious in fancy ways.

I’ve been doing some research on visualizations of texts recently for my next book (on digital scholarship), and trying to get over this aversion to visualizations. But when I see visualizations like this one, the lesson is clear: Make sure your visualizations expose something new, hidden, non-obvious.

Because War and Peace isn’t about Russia.

Intelligence Analysts and Humanities Scholars

About halfway through the Chicago Colloquium on Digital Humanities and Computer Science last week, the always witty and insightful Martin Mueller humorously interjected: “I will go away from this conference with the knowledge that intelligence analysts and literary scholars are exactly the same.” As the chuckles from the audience died down, the core truth of the joke settled in—for those interested in advancing the still-nascent field of the digital humanities, are academic researchers indeed becoming clones of intelligence analysts by picking up the latter’s digital tools? What exactly is the difference between an intelligence analyst and a scholar who is scanning, sorting, and aggregating information from massive electronic corpora?

Mueller’s remark prods those of us exploring the frontiers of the digital humanities to do a better job describing how our pursuit differs from other fields making use of similar computational means. A good start would be to highlight that while the intelligence analyst sifts through mountains of data looking for patterns, anomalies, and connections that might be (in the euphemistic argot of the military) “actionable” (when policy makers piece together bits of intelligence and decide to take action), the digital humanities scholar should be looking for patterns, anomalies, and connections that strengthen or weaken existing theories in their field, or produce new theories. In other words, we not only uncover evidence, but come to overarching conclusions and make value judgments; we are at once the FBI, the district attorney, the judge, and the jury. (Perhaps the “National Intelligence Estimates” that are the highest form of synthesis in the intelligence community come closest to what academics do.)

The gentle criticism I gave to the Chicago audience at the end of the colloquium was that too many presentations seemed one (important) piece away from completing this interpretive whole. Through extraordinary guile, a series of panelists showed how digital methods can determine the gender of Shakespeare’s interlocutors, show more clearly the repetition of key phrases in Gertrude Stein’s prose, or more clearly map the ideology and interactions of FDR’s advisors during and after Pearl Harbor. But the real questions that need to be answered (questions whose answers will make other humanities scholars stand up and take notice of digital methods) are, of course, how the identification of gender reshapes (or reinforces) our views of Shakespeare’s plays, how the use of repetition changes our perspectives on Gertrude Stein’s writings, or how a better understanding of presidential advisors alters our historical narrative of America’s entry into the Second World War.

In Chicago, I tried to give this critical, final moment of insight reached through digital means a name, the “John Snow moment,” in honor of the Victorian physician who discovered the cause of cholera by using a novel research tool unfamiliar to traditional medical science. Rather than looking at symptoms or other patient information on a case-by-case basis as a cholera outbreak killed and sickened hundreds of people in London in 1854, Snow instead mapped all incidences of the disease by the street addresses of the patients, thus quickly discovering that the cases clustered around a Soho water pump. The local parish authorities removed the water pump’s handle, quickly curtailing the disease and inaugurating a new era of epidemiology. Snow proved that cholera was a waterborne disease. Now that’s actionable intelligence.

What can digital scholars do to reach this level of insight? A key first step, reinforced by my experience in Chicago, is that academics interested in the power of computational methods must work to forge tools that satisfy their interpretive needs rather than simply accepting the tools that are currently available from other domains of knowledge, like intelligence. Ostensibly the Chicago Colloquium was about bringing together computer scientists and humanities scholars to see how we might learn from each other and enable new forms of research in an age of millions of digitized books. But as I noted in my remarks on the closing panel, too often this interaction seemed like a one-way street, with humanities scholars applying existing computer science tools rather than engaging the computer scientists (or programming themselves) to create new tools that would be better suited to their own needs. Hopefully such new tools will lead to more John Snow moments in the humanities in the near future.

Zotero Needs Your Help, Part II

In my prior post on this topic, I mentioned the (paid) positions now available at the Center for History and New Media to work on and promote Zotero. (By the way, there’s still time to contact us if you’re interested; we just started reviewing applications, but hurry.) But Zotero is moving ahead on so many fronts that its success depends not only on those working on it full time, but also those who appreciate the software and want to help out in other ways. Here are some (unpaid, but feel-good) ways you can get involved.

If you are a librarian, instructional technologist, or anyone else on a campus or at an institution that uses citation software like EndNote or RefWorks, please consider becoming an informal campus representative for Zotero. As part of our effort to provide a free competitor to these other software packages, we need to spread the word, have people give short introductions to Zotero, and generally serve as local “evangelists.” Already, two dozen librarians who have tried Zotero and think it could be a great solution for students, staff, and faculty on their campuses have volunteered to help out in this role. If you’re interested in joining them, please contact campus-reps@zotero.org.

We are currently in the process of writing up instructions (and possibly creating some additional software) to make creating Zotero translators and citation style formatters easier. Translators are small bits of code that enable Zotero to recognize citation information on a web page; we have translators for specific sites (like Amazon.com) as well as broader ones that recognize certain common standards (like MARC records or embedded microformats). Style formatters take items in your Zotero library and reformat them into specific disciplinary or journal standards (e.g., APA, MLA, etc.). Right now creating translators takes a fair amount of technical knowledge (using things like XPath and JavaScript), so if you’re feeling plucky and have some software skills, email translators@zotero.org to get started on a translator for a specific collection or resource (or you can wait until we have better tools for creating translators). If you have some familiarity with XML and citation formatting, please contact styles@zotero.org if you’re interested in contributing a style formatter. We figure that if EndNote can get their users to contribute hundreds of style formatters for free, we should be able to do the same for translators and styles in the coming year.

One of our slogans for Zotero is “Citation management is only the beginning.” That will become increasingly obvious over the coming months as third-party developers (and the Zotero team) begin writing what we’re calling utilities, or little widgets that use Zotero’s location in the web browser to send and receive information across the web. Want to pull out all of the place names in a document and map them on Google Maps? Want to send del.icio.us a notice every time you tag something in Zotero? Want to send text from a Zotero item to an online translation service? All of this functionality will be relatively trivial in the near future. If you’re familiar with some of the browser technologies we use and that are common with Web 2.0 mashups and APIs and would like to write a Zotero utility, please contact utilities@zotero.org.

More generally, if you are a software developer and either would like to help with development or would like to receive news about the technical side of the Zotero project, please contact dev@zotero.org.

With Firefox 2.0 apparently going out of beta into full release next Thursday (October 26, 2006), it’s a great time to start talking up the powerful combination of Firefox 2.0 and Zotero (thanks, Lifehacker and the Examiner!).

Mapping What Americans Did on September 11

I gave a talk a couple of days ago at the annual meeting of the Society of American Archivists (to a great audience; many thanks to those who were there and asked such terrific questions) in which I showed how researchers in the future will be able to intelligently search, data mine, and map digital collections. As an example, I presented some preliminary work I’ve done on our September 11 Digital Archive combining text analysis with geocoding to produce overlays on Google Earth that show what people were thinking or doing on 9/11 in different parts of the United States. I promised a follow-up article in this space for those who wanted to learn how I was able to do this. The method provides an overarching view of patterns in a large collection (in the case of the September 11 Digital Archive, tens of thousands of stories), which can then be prospected further to answer research questions. Let’s start with the end product: two maps (a wide view and a detail) of those who were watching CNN on 9/11 (based on a text analysis of our stories database, and colored blue) and those who prayed on 9/11 (colored red).

Google Earth map of the United States showing September 11 Digital Archive stories with CNN viewing (blue) and stories with prayer (red)

Detail of the Eastern United States

By panning and zooming, you can see some interesting patterns. Some of these patterns may be obvious to us, but a future researcher with little knowledge of our present could find out easily (without reading thousands of stories) that prayer was more common in rural areas of the U.S. in our time, and that there was especially a dichotomy between the very religious suburbs (or really, the exurbs) of cities like Dallas and the mostly urban CNN-watchers. (I’ll present more surprising data in this space as we approach the fifth anniversary of 9/11.)

OK, here’s how to replicate this. First, a caveat. Since I have direct access to the September 11 Digital Archive database, as well as the ability to run server-to-server data exchanges with Google and Yahoo (through their API programs), I was able to put together a method that may not be possible for some of you without some programming skills and direct access to similar databases. For those in this blog’s audience who do have that capacity, here’s the quick, geeky version: using regular expressions, form an SQL query into the database you are researching to find matching documents; select geographical information (either from the metadata, or, if you are dealing with raw documents, pull identifying data from the main text by matching, say, 5-digit numbers for zip codes); put these matches into an array, and then iterate through the array to send each location to either Yahoo’s or Google’s geocoding service via their maps API; take the latitude and longitude from the result set from Yahoo or Google and add these to your array; iterate again through the array to create a KML (Keyhole Markup Language) file by wrapping each field with the appropriate KML tag.
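For readers who want to see the shape of that pipeline, here is a minimal Python sketch. It is not the code I ran against the Archive: the table name `stories`, the column `body`, and the function names are hypothetical, the query uses a plain SQL `LIKE` match rather than a regular expression, and the call to Yahoo’s or Google’s geocoder is replaced by a dictionary lookup so the example runs without any network access.

```python
import re
import sqlite3
from xml.sax.saxutils import escape

# Naive pattern for 5-digit U.S. zip codes; it will also match other
# 5-digit numbers, so real data would need some cleanup.
ZIP_RE = re.compile(r"\b\d{5}\b")


def find_zip_codes(conn, keyword):
    """Return the first zip code found in each story that mentions keyword.

    Assumes a hypothetical table `stories` with a text column `body`.
    """
    rows = conn.execute(
        "SELECT body FROM stories WHERE body LIKE ?", (f"%{keyword}%",)
    )
    zips = []
    for (body,) in rows:
        match = ZIP_RE.search(body)
        if match:
            zips.append(match.group())
    return zips


def geocode(zip_code, lookup):
    """Stand-in for a geocoding service: return (lat, lon) or None."""
    return lookup.get(zip_code)


def build_kml(points):
    """Wrap (name, lat, lon) tuples in minimal KML Placemark elements."""
    placemarks = "".join(
        f"<Placemark><name>{escape(name)}</name>"
        # KML lists longitude first, then latitude.
        f"<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        for name, lat, lon in points
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2">'
        f"<Document>{placemarks}</Document></kml>"
    )
```

To use it, you would collect the zip codes with `find_zip_codes`, pass each one to a real geocoder in place of `geocode`, skip any that come back empty, and write the string returned by `build_kml` to a file ending in “.kml”.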

For everyone else, here’s the simplest method I could find for reproducing the maps I created. We’re going to use a web-based front end for Yahoo’s geocoding API, Phillip Holmstrand’s very good free service, and then modify the results a bit to make them a little more appropriate for scholarly research.

First of all, you need to put together a spreadsheet in Excel (or Access or any other spreadsheet program; you can also just create a basic text document with columns and tabs between fields so it looks like a spreadsheet). Hopefully you will not be doing this manually; if you can get a tab-delimited text export from the collection you wish to research, that would be ideal. One or more columns should identify the location of the matching document. Make separate columns for street address, city, state/province, and zip codes (if you only have one or a few of these, that’s totally fine). If you have a distinct URL for each document (e.g., a letter or photograph), put that in another column; same for other information such as a caption or description and the title of the document (again, if any). You don’t need these non-location columns; the only reason to include them is if you wish to click on a dot on Google Earth and bring up the corresponding document in your web browser (for closer reading or viewing).

Be sure to title each column, i.e., use text in the topmost cell with specific titles for the columns, with no spaces. I recommend “street_address,” “city,” “state,” “zip_code,” “title,” “description,” and “url” (again, you may only have one or more of these; for the CNN example I used only the zip codes). Once you’re done with the spreadsheet, save it as a tab-delimited text file by using that option in Excel (or Access or whatever) under the menu item “Save as…”
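If your collection lives in a database rather than a spreadsheet, a few lines of Python can produce the same tab-delimited text. This is only a sketch: the function name `to_tab_delimited` is my own invention, and the column titles are simply the ones recommended above. Any column you don’t have is left blank.

```python
import csv
import io

# Column titles recommended in the text; no spaces in any of them.
COLUMNS = [
    "street_address", "city", "state", "zip_code",
    "title", "description", "url",
]


def to_tab_delimited(records):
    """Return tab-delimited text with a header row.

    `records` is a list of dictionaries keyed by column title;
    missing fields are written as empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        restval="",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The returned text can be saved to a file or pasted directly into the geocoding site’s input box.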

Now open that new file in a text editor like Notepad on the PC or TextEdit on the Mac (or BBEdit or anything else other than a word processor, since Word, e.g., will reformat the text). Make sure that it still looks roughly like a spreadsheet, with the title of the columns at the top and each column separated by some space. Use “Select all” from the “Edit” menu and then “Copy.”

Now open your web browser and go to Phillip Holmstrand’s geocoding website and go through the steps. “Step #1” should have “tab delimited” selected. Paste your columned text into the big box in “Step #2” (you will need to highlight the example text that’s already there and delete it before pasting so that you don’t mingle your data with the example). Click “Validate Source” in “Step #3.” If you’ve done everything right thus far, you will get a green message saying “validated.”

In “Step #4” you will need to match up the titles of your columns with the fields that Yahoo accepts, such as address, zip code, and URL. Phillip’s site is very smart and so will try to do this automatically for you, but you may need to be sure that it has done the matching correctly (if you use the column titles I suggest, it should work perfectly). Remember, you don’t need to select each one of these parameters if you don’t have a column for every one. Just leave them blank.

Click “Run Geocoder” in “Step #5” and watch as the latitudes and longitudes appear in the box in “Step #6.” Wait until the process is totally done. Phillip’s site will then map the first 100 points on a built-in Yahoo map, but we are going to take our data with us and modify it a bit. Select “Download to Google Earth (KML) File” at the bottom of “Step #6.” Remember where you save the file. The default name for that file will be “BatchGeocode.kml”. Feel free to change the name, but be sure to keep “.kml” at the end.

While Phillip’s site takes care of a lot of steps for you, if you try right away to open the KML file in Google Earth you will notice that all of the points are blazing white. This is fine for some uses (show me where the closest Starbucks is right now!), but scholarly research requires the ability to compare different KML files (e.g., between CNN viewers and those who prayed). So we need to implement different colors for distinct datasets.

Open your KML file in a text editor like Notepad or TextEdit. Don’t worry if you don’t know XML or HTML (if you do know these languages, you will feel a bit more comfortable). Right near the top of the document, there will be a section that looks like this:

<Style id="A"><IconStyle><scale>0.8</scale><Icon><href>root://icons/
palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon></IconStyle>
<LabelStyle><scale>0</scale></LabelStyle></Style>

To color the dots that this file produces on Google Earth, we need to add a set of “color tags” between <IconStyle> and <scale>. Using your text editor, insert “<color></color>” at that point. Now you should have a section that looks like this:

<Style id="A"><IconStyle><color></color><scale>0.8</scale><Icon><href>root:
//icons/palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon></IconStyle>
<LabelStyle><scale>0</scale></LabelStyle></Style>

We’re almost done, but unfortunately things get a little more technical. Google uses what’s called an ABGR value for defining colors in Google Earth files. ABGR stands for “alpha, blue, green, red.” In other words, you will have to tell the program how much blue, green, and red you want in the color, plus the alpha value, which determines how opaque or transparent the dot is. Alas, each of these four parts must be expressed in a two-digit hexadecimal format ranging from “00” (no amount) to “ff” (full amount). Combining each of these two-digit values gives you the necessary full string of eight characters. (I know, I know: why not just <color>red</color>? Don’t ask.) Anyhow, a fully opaque red dot would be <color>ff0000ff</color>, since that value has full (“ff”) opacity and full (“ff”) red value (opacity being the first and second places of the eight characters and red being the seventh and eighth places of the eight characters). Welcome to the joyous world of ABGR.
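If you would rather not assemble these strings by hand, a small Python helper can do the arithmetic. This is a sketch under my own assumptions: the function name `kml_color` and its parameters are made up for illustration. The only thing it takes from KML itself is the order of the eight hexadecimal digits: alpha first, then blue, green, and red.

```python
def kml_color(red, green, blue, opacity=1.0):
    """Return the eight-digit hex string (aabbggrr) used in a KML <color> tag.

    red, green, and blue are integers from 0 to 255;
    opacity runs from 0.0 (invisible) to 1.0 (solid).
    """
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("color channels must be between 0 and 255")
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must be between 0.0 and 1.0")
    alpha = int(opacity * 255)  # 50% opacity becomes 127, i.e. "7f"
    # Note the reversed order: alpha, blue, green, red.
    return f"{alpha:02x}{blue:02x}{green:02x}{red:02x}"
```

For example, `kml_color(255, 0, 0)` returns “ff0000ff” (solid red), and `kml_color(0, 0, 255, 0.5)` returns “7fff0000” (see-through blue).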

Let me save you some time. I like to use 50% opacity so I can see through dots. That helps give a sense of mass when dots are close to or on top of each other, as is often the case in cities. (You can also vary the size of the dots, but let’s wait for another day on that one.) So: semi-transparent red is “7f0000ff”; semi-transparent blue is “7fff0000”; semi-transparent green is “7f00ff00”; semi-transparent yellow is “7f00ffff”. (Green and red make yellow here because colors on a screen mix like light, not like paint.) So for blue dots that you can see through, as in the CNN example, the final code should have “7fff0000” inserted between <color> and </color>, resulting in:

<Style id="A"><IconStyle><color>7fff0000</color><scale>0.8</scale><Icon><href>
root://icons/palette-4.png</href><x>30</x><w>32</w><h>32</h></Icon>
</IconStyle><LabelStyle><scale>0</scale></LabelStyle></Style>
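If you have many KML files to color, the same edit can be scripted. The sketch below is one way to do it in Python, assuming the file contains a plain `<IconStyle>` opening tag like the one shown above; the function name `set_icon_color` is hypothetical. It treats the file as text rather than parsing the XML, which is crude but mirrors the hand edit.

```python
import re


def set_icon_color(kml_text, color):
    """Insert (or replace) a <color> tag right after every <IconStyle> tag.

    `color` must be eight hexadecimal digits in KML's aabbggrr order.
    """
    if not re.fullmatch(r"[0-9a-fA-F]{8}", color):
        raise ValueError("color must be eight hexadecimal digits (aabbggrr)")
    # Drop any color tag that already follows <IconStyle>, so running
    # the function twice does not leave two color tags behind.
    without_color = re.sub(
        r"(<IconStyle>)\s*<color>[^<]*</color>", r"\1", kml_text
    )
    return without_color.replace(
        "<IconStyle>", f"<IconStyle><color>{color}</color>"
    )
```

You would read the downloaded file (for instance “BatchGeocode.kml”) into a string, pass it through `set_icon_color` with a value such as “7fff0000”, and save the result under a new name so each dataset keeps its own color.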

When you’ve inserted your color choice, save the KML document in your text editor and run the Google Earth application. From within that application, choose “Open…” from the “File” menu and select the KML file you just edited. Google Earth will load the data and you will see colored dots on your map. To compare two datasets, as I did with prayer and CNN viewership, simply open more than one KML file. You can toggle each set of dots on and off by clicking the checkboxes next to their filenames in the middle section of the panel on the left. Zoom and pan, add other datasets (such as population statistics), add a third or fourth KML file. Forget about all the tech stuff and begin your research.

[For those who just want to try out using a KML file for research in Google Earth, here are a few from the September 11 Digital Archive. Right-click (or control-click on a Mac) to save the files to your computer, then open them within Google Earth, which you can download from here. These are files mapping the locations of: those who watched CNN; those who watched Fox News (far fewer than CNN since Fox News was just getting off the ground, but already showing a much more rural audience compared to CNN); and those who prayed on 9/11.]