What Scholars Want from the Digital Public Library of America

[A rough transcript of my talk at the Digital Public Library of America meeting at Harvard on March 1, 2011. To permit unguarded, open discussion, we operated under the Chatham House Rule, which prevents attribution of comments, but I believe I’m allowed to violate my own anonymity.]

I was once at a meeting similar to this one, where technologists and scholars were discussing what a large digital library should look like. During a breakout session, the technologists huddled and talked about databases, indices, search mechanisms; the scholars, on the other side of the room, painted a vision of what the archive would look like online, in their view a graphical representation as close to the library as possible, where one could pull down boxes from the shelves, and then open those boxes and leaf through the folios one by one.

While the technologists debated digital infrastructure, the scholars were trying to replicate or maintain what they liked about the analog world they knew: a trusted order, the assurance of the physical, all of the cues they pick up from the shelf and the book. If we want to think about the Digital Public Library of America from the scholar’s point of view, we must think about how to replicate those signals while taking advantage of the technology. In short: the best of the single search box with the trust and feel of the bookshelf.

So how can this group translate those scholarly concerns into elements of the DPLA? I did what any rigorous, traditionally trained scholar would do: I asked my Twitter followers. Here are their thoughts, with my thanks for their help:

First, scholars want reliable metadata about scholarly objects like books. Close enough doesn’t count. Although Google has relatively few metadata errors (given that they handle literally a trillion pieces of metadata), these errors drive scholars mad, and make them skeptical of online collections.

Second, serendipity. Many works of scholarship come from the chance encounter of the scholar with primary sources. How can that be enhanced? Some in my feed suggested a user interface with links to “more like this,” “recent additions in your field,” or “sample collections.” Others advocated social cues, such as user-contributed notes on works in the library.

Third, there are different modes of scholarly research, and the interface has to reflect that: a simple discovery layer with a sophisticated advanced search underneath, faceted search, social search methods for collaborative practice, the ability to search within a collection or subcollection.

Fourth, connection with the physical. We need better representations of books online than the sameness of Google books, where everything looks like a PDF of the same size. Scholars also need the ability to go from the digital to the analog by finding a local copy of a work.

Finally, as I have often said, scholars have uses for libraries that libraries can’t anticipate. So we need the DPLA to enable other parties to build upon, reframe, and reuse the collection. In technical terms, this means open APIs.

Video: The Ivory Tower and the Open Web

Here’s the video of my plenary talk “The Ivory Tower and the Open Web,” given at the Coalition for Networked Information meeting in Washington in December, 2010. A general description of the talk:

The web is now over twenty years old, and there is no doubt that the academy has taken advantage of its tremendous potential for disseminating resources and scholarship. But a full accounting of the academic approach to the web shows that compared to the innovative vernacular forms that have flourished over the past two decades, we have been relatively meek in our use of the medium, often preferring to impose traditional ivory tower genres on the web rather than import the open web’s most successful models. For instance, we would rather digitize the journal we know than explore how blogs and social media might supplement or change our scholarly research and communication. What might happen if we reversed that flow and more wholeheartedly embraced the genres of the open web?

I hope the audience for this blog finds it worthy viewing. I enjoyed talking about burrito websites, Layer Tennis, aggregation and curation services, blog networks, Aaron Sorkin’s touchiness, scholarly uses of Twitter, and many other high- and low-brow topics all in one hour. (For some details in the images I put up on the screen, you might want to follow along with this PDF of the slides.) I’ll be expanding on the ideas in this talk in an upcoming book with the same title.

Web Design Job at CHNM

A great opportunity to join us at the Center for History and New Media:

Do you get as excited about clean mark-up as you do about the latest Photoshop effect? Do you want to be on the cutting edge of web design and digital humanities, and design websites that inform and engage end users?

If so, the Center for History and New Media wants to hear from you.

CHNM, known for innovative work in digital media, is seeking an energetic, well-organized, and creative web designer with front-end development skills or experience to work on a variety of innovative, web-based history projects.

This position is particularly appropriate for someone with a combined interest in technology and history or humanities. The successful applicant will be able to create mockups and wireframes for historical, cultural, and educational websites and bring those ideas to fruition using the latest and highest web development standards.

We are looking for a combination of the following skills:

  • fluency with current web design technologies (including ability to hand code HTML, CSS, and JavaScript);
  • fluency in Photoshop and experience with Illustrator;
  • experience with web accessibility and web usability standards;
  • experience with or interest in designing for social media or online communities;
  • experience with common open source content management systems (WordPress, BuddyPress, Drupal, etc.);
  • familiarity with web-database technologies (MySQL, PHP);
  • familiarity with contemporary trends in web development (e.g., AJAX, jQuery, Rails, CSS3/HTML5);
  • prior work in history or the digital humanities is a plus.

CHNM offers a casual, collaborative work environment, with excellent opportunities for professional growth and development.

This is a grant-funded, two-year position at the Center for History and New Media (http://chnm.gmu.edu), located in Fairfax, Virginia. CHNM is 15 miles from Washington, DC, and accessible by public transportation. Apply online (including resume, three references, links to prior web work, and a cover letter describing technology background and any interest in history) at http://jobs.gmu.edu for position #10376z. We will review applications as they arrive, and the job closes on January 31, 2011.

If you have questions, contact us at chnm@gmu.edu with subject line “Web Designer.”

Digital Humanities on the Kojo Nnamdi Show

I really enjoyed being on the Kojo Nnamdi Show today talking about digital humanities for an hour with Kojo, the NEH’s Brett Bobley, and UVA’s Bill Ferster. Kojo’s show is produced at Washington’s NPR station, WAMU, and syndicated nationally. It’s also available as an audio stream and a podcast.

Having done podcasts for four years now, I’ve come to understand how difficult it is to do a radio show—to ask the right questions, to not um and er a lot, and to stimulate informative conversation. Kojo really makes it look easy, which is even more impressive given the wide variety of topics he covers. As I left the studio today he immediately prepped to do a show on Eisenhower and the military-industrial complex.

Brett, Bill, and I talked about how to define digital humanities, the use of text mining, visualization, and digital mapping, problems associated with the abundant digital record, collaboration in the digital humanities, and questions of publishing, open access, and tenure. We also took numerous questions from callers. I thought the show had a good vibe.

So, worth a listen: The Kojo Nnamdi Show: “History Meets High-Tech: Digital Humanities”

Today was also a moment to reflect on the fact that the last time I was on the Kojo Nnamdi Show was exactly five years ago, with Roy Rosenzweig. Our book Digital History had just come out. It was just before Roy got sick. Probably said a lot on the broadcast today that Roy would have said.

Initial Thoughts on the Google Books Ngram Viewer and Datasets

First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the potential of digital collections on the web, Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to explore text-mining patterns and anomalies; perhaps only Wordle has the same dead-simple, addictive quality as Google Ngrams. Digital humanities needs gateway drugs. Kudos to the pushers on the Google Books team.

Second, on the concurrent launch of “Culturomics”: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and the article that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to prior work and even (ahem) some vaguely familiar, if simpler, graphs and intellectual justifications. Yes, “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities. No, there were no humanities scholars in sight in the Culturomics article. But I’m also sure that longtime “humanities computing” scholars consider advocates of “digital humanities” like me Johnnies-come-lately. Luckily, digital humanities is nice, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c’mon, Cantabs, at least return the favor by following some people on Twitter.)

Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies makes some excellent points about why his Corpus of Historical American English (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks some tough questions about Google’s OCR. On Twitter many of us were finding serious problems with the long “s” before 1800 (Danny Sullivan got straight to the naughty point with his discourse on the history of the f-bomb). But the Freakumanities, er, Culturomics guys themselves talk about this problem in their caveats, as does Google.

Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike text-mining tools like COHA, Google Ngrams is multilingual. For the first time, historians working on Chinese, French, German, and Spanish sources can do what many of us have been doing for some time. Professors love to look a gift horse in the mouth. But let’s also ride the horse and see where it takes us.

So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of pages those instances appear on and number of works they appear in—hint at much work to be done:

1) The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams. While everyone obsesses about individual words (guilty here too of unigramism) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like “loving” or “arranged.” (This is something we learned in working on our NEH-funded grant on text mining for historians.)

2) We should remember that some of the best uses of Google’s n-grams will come from using this data along with other data. My gripe with the “Culturomics” name was that it implied (from “genomics”) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.

3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. We historians and humanists should be looking carefully at the more complex formulations of Michel et al., when they move beyond linguistics and unigram patterns to investigate in shrewd ways topics like how fleeting fame is and whether the suppression of authors by totalitarian regimes works. Good stuff.

4) For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading, from the bird’s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it’s not entirely clear if you move from Ngram Viewer to the main Google Books interface that you’ll get the book scans the data represents. That’s why I have my students use Mark Davies’ Time Magazine Corpus when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.
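To make the first point above concrete, here is a minimal Python sketch of the “marriage” example: it scans 5-gram rows for a target word alongside companion words such as “loving” or “arranged,” and keeps the three counts separate (instances, pages, works). The row layout (n-gram, year, match count, page count, volume count, separated by tabs) is my assumption about the downloadable files, the sample rows are invented for illustration, and the function name is hypothetical, so treat this as a starting point rather than a finished tool.

```python
from collections import defaultdict

# Invented sample rows in the ASSUMED layout:
# ngram <TAB> year <TAB> match_count <TAB> page_count <TAB> volume_count
SAMPLE = (
    "a loving marriage and a\t1850\t4\t4\t3\n"
    "an arranged marriage for her\t1850\t2\t2\t2\n"
    "a loving marriage and a\t1900\t9\t8\t7\n"
    "the marriage of his daughter\t1900\t30\t28\t25\n"
)


def cooccurrence_by_year(lines, target, companions):
    """Sum (matches, pages, volumes) per (companion, year).

    Only n-grams that contain both the target word and a companion
    word are counted. Hypothetical helper, not part of any Google tool.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip rows that do not fit the assumed layout
        ngram, year, matches, pages, volumes = fields
        words = ngram.lower().split()
        if target not in words:
            continue
        for companion in companions:
            if companion in words:
                entry = totals[(companion, int(year))]
                entry[0] += int(matches)
                entry[1] += int(pages)
                entry[2] += int(volumes)
    return {key: tuple(counts) for key, counts in totals.items()}


result = cooccurrence_by_year(
    SAMPLE.splitlines(), "marriage", ["loving", "arranged"]
)
for (companion, year), counts in sorted(result.items()):
    print(companion, year, counts)
```

Keeping the page and volume counts next to the raw instance count matters because one chatty book can inflate the instances; a phrase that shows up in many separate works is stronger evidence of a real trend. Raw counts would still need to be normalized against the total number of words published in each year before comparing decades.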

How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.

New York Times Covers Victorian Books Project

Patricia Cohen of the New York Times has been working on an excellent series on digital humanities, and her second article focuses on our text mining work on Victorian books, which was directly enabled by a grant from Google and more broadly enabled by a previous grant from the National Endowment for the Humanities to explore text mining in history. I’m glad Cohen (no relation) captured the nuances and caveats as well as the potential of digital methods. I also liked how the graphics department did a great job converting and explaining some of our graphs.

I previously posted a rough transcript of my talk on Victorian history and literature that Cohen mentions in the piece. She also covered my work earlier this year in an article on peer review that was much debated in academia.

A Conversation with Richard Stallman about Open Access

[An email exchange with Richard Stallman, father of free software, copyleft, GNU, and the GPL, reprinted here in redacted form with Stallman’s permission. Stallman tutors me in the important details of open access and I tutor him in the peculiarities of humanities publishing.]

RS: [Your] posting [“Open Access Publishing and Scholarly Values”] doesn’t specify which definition of “open access” you’re arguing for — but that is a fundamental question.

When the Budapest Declaration defined open access, the crucial condition was that users be free to redistribute copies of the articles.  That is an ethical imperative in its own right, and a requisite for proper and safe archiving of the work.

People paid more attention to the other condition specified in the Budapest Declaration: that the publication site allow access by anyone. This is a good thing, but need not be explicitly required, because the other condition (freedom to redistribute) will have this as a consequence. Many universities and labs will set up mirror sites, and everyone will thus have access.

More recently, some have started using a modified definition of “open access” which omits the freedom to redistribute.  As a result, “open access” is no longer a clear rallying point.  I think we should now campaign for “redistributable publication.”

What are your thoughts on this?

DC: I probably should have been clearer in my post that I’m for the maximal access—and distribution—of which you speak. Alas, the situation is actually worse than you imagine, especially in the humanities, where I work, and which is about a decade behind the sciences in open access. Beyond the muddying of the waters through terms like “Green OA” and “Gold OA” is the fact that academic publishing is horribly wrapped up (again, more so in the humanities) with structural problems related to reputation, promotion, and tenure. So my colleagues worry more about truly open publications “counting” vs. publications that are simply open to reading on a commercial publisher’s website. That is why I think the big question is not the licensing or the technology of decentralized publishing, posting and free distribution of papers, etc., but the social realm in which academic publishing sits. I’m working now on pragmatic ways to change that very conservative realm.

Put another way: when software developers write good (open) code, other developers recognize that quality, independent of where the code resides; in humanities publishing, packaging (including the imprimatur of a press, the sense that a work has jumped some (often mythical) peer-review hurdle) counts for too much right now.

RS: [“Green OA” and “Gold OA”] are new to me — can you tell me what they mean?

So my colleagues worry more about truly open publications “counting” vs. publications that are simply open to reading on a commercial publisher’s website.

I don’t understand that sentence.

That is why I think the big question is not the licensing or the technology of decentralized publishing, posting and free distribution of papers, etc., but the social realm in which academic publishing sits.

Ethically speaking, what matters is the license used. That’s what determines whether the publishing is ethical or not. Are you saying that the social realm contains the obstacle to the adoption of ethical publication methods?

Put another way: when software developers write good (open) code, other developers recognize that quality, independent of where the code resides.

Programmers can tell if code is well-written, assuming they are allowed to read it, but how does that relate? Are you saying that in the humanities people often judge work based on where it is published, and have no other way to determine what is good or bad?

DC: Green O[pen] A[ccess] = when a professor deposits her finished article in a university repository after it is published. Theoretically that article will then be available (if people can find the website for the institution’s repository), even if the journal keeps it gated.

Gold OA = when the journal itself (rather than the repository) is open access; may involve the author paying a fee (often around $1-3K). Still probably doesn’t have a redistribution license, but it’s not behind a publisher’s digital gates.

Counting = counting in the academic promotion and tenure process. Much of the problem here is (I believe misplaced) concern about the effect of open access on one’s career.

Are you saying that the social realm contains the obstacle to the adoption of ethical publication methods?

Correct. And much of it has to do with the meekness of academics (especially in the humanities, bastion of liberalism in most other ways) to challenge the system to create a more ethical publication system, one controlled by the community of scholars rather than commercial publishers who profit from our work.

Are you saying that in the humanities people often judge work based on where it is published, and have no other way to determine what is good or bad?

Amazing as it may sound, many academics do indeed judge a work that way, especially in tenure and promotion processes. There are some departments that actually base promotion and tenure on the number of pages published in the top (mostly gated) journals.

RS: [Terms like “Green OA” and “Gold OA” provide] even more reason to reject the term “open access” and demand redistributable publication.

Maybe some leading scholars could be recruited to start a redistributable journal.  Their names would make it prestigious.

DC: That’s what PLoS did (http://plos.org) in the sciences. Unclear if the model is replicable in the humanities, but I’m trying.

UPDATE: This was an off-hand conversation with Stallman, and my apologies for the quick (and poor) descriptions of a couple of open access options. But I think the many commenters below who are focusing on the fine differences between kinds of OA are missing the central themes of this conversation.

Frank Turner on the Future of Peer Review

As I mentioned in my memorial post for my mentor Frank Turner, we were having a deep discussion of the future of peer review when he suddenly passed away. I wish we could have finished this discussion; as with so many other things, he brought tremendous insight to the topic. Much of the discussion was about personal experiences with peer review that I can’t recount in this public space, but we also got into “strategic planning” for changing the peer review system.

Here are the powerful last few email messages I received from Frank, redacted of personal matters and some touchy subjects. I think all of us trying to reform the academy through digital means should heed his words.

On the practice of peer review (I limit my thoughts to the Humanities) I have several different and conflicting opinions. Numerous journals are really quite well edited in my experience…In theory and often in practice peer review is a good thing.

But the problems of peer review are also longstanding…Other journals have in the past and still do remain the proprietary reserves of academic cliques or worse. One of the problems of which peer review in the humanities is only one facet is the absence of any agreed-upon and widely accepted understanding of the professional procedures and expectations. (Some would say a lack of “professional ethics.”) This has been exacerbated by the vast expansion of the academy in the second half of the last century; the often undue research expectations put on faculty in institutions that cannot financially support significant research; the necessity of editors sending out all manuscripts no matter how clearly mediocre and/or undeveloped and hence expanding peer review expectations…

As you and others think through the peer review process, I would hope that you would keep several things in mind. First, you will need to avoid the appearance of playing tennis with the net down. Groups of friends or overly like-minded folks producing journals or collections of essays may disperse various views but do not necessarily make for tough-minded scholarship. Second, the kind of new reviewing processes you and others are suggesting could provide the opportunity really to establish widely accepted understandings of procedures and expectations. Such would be a major new departure, and it could benefit from the input of the editors of genuinely respected journals. Third, and I will return to this point below on another topic, as journals come to be published online (and I think within five years or less entirely on-line), they should make available to readers the possibility of commenting on articles. Again, there would need to be some kind of template so the comments are not like those on Amazon. But what would emerge would be a kind of scholarly community of commentary, revision, and correction. Fourth, at the end of the day, however, a new, open, collective peer review process will still need to indicate that some work is stronger, more deeply researched, and more profoundly analyzed than other work. I happen to think one of the benefits of studying the various areas of the humanities is achieving the capacity to make judgments. The peer review of humanities scholarship should avoid at all costs the appearance and the reality of not being able to make judgments regarding quality…

Let me expand the purview of what you and others are seeking to accomplish. The realm of peer reviewing of articles is really quite strong when compared to that of book reviewing in journals. Book reviews are published with essentially no peer review, little or no concern or indication of actual or potential conflict of interest, and little or no concern for factual correctness. Such reviews are then used across the country in promotion processes. Scholarly book reviewing stands in a near scandalous situation. Most people review books in order not to purchase them. Reviews tend to be quite brief and as I have indicated are generally unedited except for style. Many reviewers simply rehash the dust jacket. Your group could again add to your agenda the establishment of professional procedures and expectations regarding book reviewing. These would include all reviewers indicating any conflicts of interest, e.g., having taught the author, friendship with the author, residing in the same academic department or institution with the author, having written or edited a similar or competing book, having published with the same press, or having some political, religious, or ideological point of view that informs their thinking. Furthermore, again with the establishment of almost entirely on-line journal publication, all authors could be permitted to comment on and correct reviews and other scholars could similarly comment on the review or the book reviewed…

Most of you who are looking toward new ways of peer reviewing are young or at the entry level of the profession. All of you have a clear interest in reforming the existing reviewing process. I hope you will add that to your agenda as well as the peer reviewing of journals.

Frank Turner: A Great Mentor, Scholar, and Friend

“History isn’t rocket science.” I distinctly remember Frank Turner, my mentor at Yale, saying that to me in 1995 over a beer on Charlotte Street in London after a day looking at documents in the Royal Society archive. “What did you see?” What I had seen was a number of documents showing a famous mathematician trying to solve religious problems using equations. “Well, then that’s what you have to write about.”

Frank suddenly passed away today from a stroke at 66—devastating, incredibly sad news. I’ll miss him for so many reasons—most of all, he was just such a nice, caring individual, and so whip-smart about many things. I’m still deeply influenced by his pragmatic view of history, not as a complex theoretical realm but quite frequently as a process of simply recognizing what’s in front of you.

Frank’s body of work showed the power of simply recognizing what was in front of you. His first book vaulted past stale discussions about the war between science and religion in the Victorian era by showing that there were many intellectuals caught between the two supposed poles—something that should have been obvious to any close reader of Victorian thought but which had been denied by decades of “war between science and religion” talk.

A word that Frank used a lot was “unnoticed”—that is, the past is often lying in plain sight, but our preconceptions prevent us from seeing it. In his groundbreaking essay “The Victorian Crisis of Faith and the Faith That Was Lost,” he noticed that this crisis began during an intensification of religion through evangelicalism, the language of which (a return to purity, an emphasis on reform) was soon turned against existing faith. In other works he noticed the strong effect that debt and bankruptcy had on the supposedly detached thought of Victorian thinkers—they were human, after all.

Frank had seriously good taste in the important things in life: ideas (Hume), art (J.M.W. Turner), architecture (Louis Kahn), the landscape (rural New England, the Cotswolds), wine (Burgundy, Bordeaux), dogs (English setters). But he approached these pleasures not as landed gentry or a stereotypical Yale professor but as someone who had come from modest means in Ohio and had stumbled, almost giddily, on the joys of history and the senses. He had a historic 300-year-old saltbox in Guilford, Connecticut, that he had tastefully modified (just before the preservationists disallowed such changes) to hold a substantial personal library and writing room, and a large kitchen for dinner parties. Frank had a great laugh, and was not stingy with it. He could tell you rather specifically why a painting was great and then just stand back and smile at its greatness.

Frank was so shrewd about so many things—he had a deft understanding of Victorian history, of modern intellectual history, the workings of the academy, human nature. He also had several major phases to his career, including a stint as provost and more recently the director of the Beinecke rare books library at Yale. Just two months ago he was appointed the overall University Librarian.

I had been talking to him for the last few years about the future of libraries and the humanities in a digital age. A decade ago I think he was slightly depressed that I had veered into the digital humanities. But he called me after seeing Google Books for the first time—what was in front of him—and immediately got that this was something he had to understand better. I have no doubt that he would have been an incredible librarian who would have honored tradition while also moving one of the world’s great library systems forward; yet another facet to this tremendous loss.

This fall Frank and I were having a long back-and-forth about the future of peer review, some of which I may redact and publish in this space. I wish we could have finished that discussion, and that I could have gotten more of his advice, on this and so many other topics.

For now, I just want to honor Frank Turner, may he rest in peace. He was a great mentor, scholar, and friend. I will deeply miss him.

Digital History at the 2011 AHA Meeting

It’s time for my annual report/rant on the lack of digital sessions at the American Historical Association annual meeting, a good gauge of what professional historians are interested in. Evidently we historians will just keep on doing what we’re doing how we’re doing it until it seems truly anachronistic. Just one of the main AHA panels, out of nearly three hundred, covers digital matters; perhaps another will touch on digital methods. By my count there are another six digital sessions overall, but these other sessions are put on by affiliate societies or were added by the program committee during lunches or other break times (that is, there were almost no digital panels proposed by historians attending the meeting). Incredibly, there are actually fewer digital sessions at the 2011 annual meeting than in prior years. Because clearly this digital thing is a flash in the pan.

OK, I’ll stop with the sarcasm. I love my colleagues in history, but it’s time for a change, and as a new member of the AHA program committee I suspect the state of affairs will be different at the 2012 meeting. For now, here is this year’s list of digital sessions at the AHA annual meeting:

When Universities Put Dissertations on the Internet: New Practice; New Problem?
[Special session added by the program committee during lunch on Friday]

Critical Issues in Bibliography and Libraries in the Digital Age
[Sponsored by the Association for the Bibliography of History and the American Association for History and Computing]

Digital Tools for Teaching and Learning American History
[CHNM’s own Rwany Sibaja hosts a 45-minute intro/demo]

Public Media and the Case for Digital History: New Directions and Opportunities for Students, Teachers, and Historians
[Special session added by the program committee during lunch on Saturday]

What’s Next? Patterns and Practices in History in Print and Online
[AHA Session 191, co-sponsored by the American Association for History and Computing]

History and Technology In and Out of the Classroom
[Sponsored by the Coordinating Council for Women in History]

Religious History’s Digital Future
[Sponsored by the American Society of Church History]

Enhancing Historical Thinking Skills Through Teaching American History Grants
[AHA Session 269]