Dan Cohen

Archive for the ‘Open Access’ Category

A Quartet of Open Access Arguments

Tuesday, February 12th, 2008

On the day that Harvard’s faculty votes on a strong open access proposal (I’m still looking for the actual text of the proposal; please add a link in the comments if you are aware of it), here are a few of the better arguments this week about the open access movement:

Digital Campus #20 – Open to Change

Wednesday, January 30th, 2008

Are open educational resources such as iTunes U and thought-provoking dot-coms such as BigThink.com a distraction from the mission of professors and universities, or the wave of the future? We debate the merits of “open access” intellectual content in the feature story on our twentieth Digital Campus podcast. Also, I report on the mostly good (if a little odd) experience of buying a book from PublicDomainReprints.org, and we discuss the MacBook Air, Flickr Commons, and a variety of tools for manipulating RSS feeds.

The Case for Open Access Books

Thursday, January 24th, 2008

Open BookThis month’s First Monday has one of the most pragmatic, sensible articles I’ve read about the promise and perils of open access books. In “Open access book publishing in writing studies: A case study,” by Charles Bazerman, David Blakesley, Mike Palmquist, and David Russell, the authors describe their experience deciding to eschew a traditional publication arrangement with an academic press (what supposedly gives our monographs the sheen of value and gets us tenure). Instead they publish an edited volume straight to the web.

Along the way the authors discover that many of the concerns that humanities scholars have about publishing in a free and open way are either overblown or simply myths. Only one junior scholar (out of the 20 scholars asked to contribute) worries about promotion and tenure. And indeed all of the scholars who contribute to the edited volume receive credit for their chapters. More important, the editors and contributors are surprised to discover that the book makes its way rapidly and powerfully into the consciousness of their field:

[The] initial reaction [to the book] did not prepare us for the acceptance the book ultimately received from the academic communities to which it was addressed.

Since its publication, the Writing Selves/Writing Societies Web page has been visited more than 85,000 times by more than 36,000 unique visitors. The trend, interestingly, has been a steady increase in visits over the past four years, with more than 30,000 occurring in the past 12 months. Since its publication, the book has been downloaded in its entirety more than 36,000 times. Individual essays have been downloaded more than 108,000 times. In terms of perceived quality of the scholarly work in the collection, the book has been well received by the field. Within six months of publication, the book was positively reviewed by four journals: two print and two electronic. One year after its publication, in the keynote address to the Conference on College Composition and Communication, the major annual conference in writing studies, Kathleen Blake Yancey quoted extensively from chapters in the book. And the book has continued to figure prominently in scholarly work subsequently published in the field of composition and rhetoric.

According to a search of Google Scholar, which indexes scholarly publications available on the Web (29 September 2006), the book or individual chapters in it has been cited 68 times, according to a search of Google Scholar. Although we do not have comprehensive comparison data for print publications, we suspect that this is a higher rate. A print–only collection with about the same number of chapters (15) published in the same year as Writing Selves/Writing Societies (and winner of a best book award given by a leading journal in the field), had far fewer citations: 10. Our experience suggests that open access scholarly books follow a pattern of citation similar to journals, which indicate that open access journal articles in a wide range of fields are both more likely to be cited and likely to be cited more quickly. Our experience with Writing Selves/Writing Societies supports this…

Overall, Writing Selves/Writing Societies appears to have entered into the system of book publishing neatly, in spite of the fact that it was not published by a traditional academic publisher and was being offered at no charge.

Beyond the questions of business models, scholarly influence, and promotion and tenure, there is also the nagging question Roy Rosenzweig posed in “Should Historical Scholarship Be Free?” At the time Roy was the Vice President for Research at the American Historical Association, and was pushing for open access to the American Historical Review. (Ultimately he got the powers that be to agree to put AHR articles online for free, although the book reviews remain behind gates.)

Besides the ethical good of publishing in an open access model—sharing educational and scholarly materials—Roy noted that the work of most scholars is funded, directly or indirectly, by the public. Noting the National Institutes of Health‘s recent mandate that grantees share their work openly with the public, Roy wrote:

The new policy affects few historians, but its implications ought to give us serious pause. After all, historical research also benefits directly (albeit considerably less generously) through grants from federal agencies like the National Endowment for the Humanities; even more of us are on the payroll of state universities, where research support makes it possible for us to write our books and articles. If we extend the notion of “public funding” to private universities and foundations (who are, of course, major beneficiaries of the federal tax codes), it can be argued that public support underwrites almost all historical scholarship.

Do the fruits of this publicly supported scholarship belong to the public? Should the public have free access to it?

Roy, of course, thought this meant that like NIH grantees we should provide open access to our articles, such as those in the AHR. But doesn’t the same argument hold true for books?

[Postscript: Some scientists have been wondering the same thing.]

[Image credit]

MacEachern and Turkel, The Programming Historian

Monday, January 14th, 2008

Bill Turkel, the always creative mind behind Digital History Hacks (logrolling disclosure: Bill is a friend of CHNM, a collaborator on various fronts, and was the thought-provoking guest on Digital Campus #9; still, he deserves the compliments), and his colleague at the University of Western Ontario, Alan MacEachern, are planning to write a book entitled The Programming Historian. Better yet, the book will be open access and hosted on the Network in Canadian History & Environment (NiCHE) site. Bill’s summary of the book on his blog sounds terrific. Can’t wait to read it and use it in my classes.

Washington Post on Zotero, Open Academia

Tuesday, January 1st, 2008

It was nice to see the Zotero project covered on the front page of the Washington Post yesterday in the article “Internet Access Is Only Prerequisite For More and More College Classes.” Also nice to see a quotation at the end of the article from yours truly about the movement in higher ed toward open tools and resources.

Two Misconceptions about the Zotero-IA Alliance

Friday, December 14th, 2007

Thanks to everyone for their helpful (and thankfully, mostly positive) feedback on the new Zotero-IA alliance. I wanted to try to clear up a couple of things that the press coverage and my own writing failed to communicate. (Note to self: finally get around to going to one of those media training courses so I can learn how to communicate all of the elements of a complex project well in three minutes, rather than lapsing into my natural academic long-windedness.)

1. Zotero + IA is not simply the Zotero Commons

Again, this is probably my fault for not communicating the breadth of the project better. The press has focused on items #1 and 2 in my original post—they are the easiest to explain—but while the project does indeed try to aggregate scholarly resources, it is also trying to solve another major problem with contemporary scholarship: scholars are increasingly using and citing web resources but have no easy way to point to stable URLs and cached web pages. In particular, I encourage everyone to read item #3 in my original post again, since I consider it extremely important to the project.

Items #4 and 5 also note that we are going to leverage IA for better collaboration, discovery, and recommendation systems. So yes, the Commons, but much more too.

2. Zotero + IA is not intended to put institutional repositories out of business, nor are they excluded from participation

There has been some hand-wringing in the library blogosphere this week (see, e.g., Library 2.0) that this project makes an end-run around institutional repositories. These worries were probably exacerbated by the initial press coverage that spoke of “bypassing” the libraries. However, I want to emphasize that this project does not make IA the exclusive back end for contributions. Indeed, I am aware of several libraries that are already experimenting with using Zotero as an input device for institutional repositories. There is already an API for the Zotero client that libraries can extract data and files from, and the server will have an even more powerful API so that libraries can (with their users’ permission, of course) save materials into an archive of their own.

Zotero and the Internet Archive Join Forces

Wednesday, December 12th, 2007

IA LogoZotero LogoI’m pleased to announce a major alliance between the Zotero project at the Center for History and New Media and the Internet Archive. It’s really a match made in heaven—a project to provide free and open source software and services for scholars joining together with the leading open library. The vision and support of the Andrew W. Mellon Foundation has made this possible, as they have made possible the major expansion of the Zotero project over the last year.

You will hear much more about this alliance in the coming months on this blog, but I wanted to outline five key elements of the project.

1. Exposing and Sharing the “Hidden Archive”

The Zotero-IA alliance will create a “Zotero Commons” into which scholarly materials can be added simply via the Zotero client. Almost every scholar and researcher has documents that they have scanned (some of which are in the public domain), finding aids they have created, or bibliographies on topics of interest. Currently there is no easy way to share these; giving them a central home at the Internet Archive will archive them permanently (before they are lost on personal hard drives) and make them broadly available to others.

We understand that not everyone will be willing to share everything (some may not be willing to share anything, even though almost every university commencement reminds graduates that they are joining a “community of scholars”), but we believe that the Commons will provide a good place for shareable materials to reside. The architectural historian with hundreds of photographs of buildings, the researcher who has scanned in old newspapers, and scholars who wish to publish materials in an open access environment will find this a helpful addition to Zotero and the Internet Archive. Some researchers may of course deposit materials only after finishing, say, a book project; what I have called “secondary scholarly materials” (e.g., bibliographies) will perhaps be more readily shared.

But we hope the second part of the project will further entice scholars to contribute important research materials to the Commons.

2. Searching the Personal Library

Most scholars have not yet figured out how to take full advantage of the digitized riches suddenly available on their computers. Indeed, the abundance of digital documents has actually exacerbated the problems of some researchers, who now find themselves overwhelmed by the sheer quantity of available material. Moreover, the major advantage of digital research—the ability to scan large masses of text quickly—is often unavailable to scholars who have done their own scanning or copying of texts.

A critical second part to this alliance of IA and Zotero is to bring robust and seamless Optical Character Recognition (OCR) to the vast majority of scholars who lack the means or do not know how to convert their scans into searchable text. In addition, this process will let others search through such newly digitized texts. After a submission to the Commons, the Internet Archive will subsequently return an OCRed version of each donated document to enable searchability. This text will be incorporated into the donor’s local index (on the Zotero client) and thus made searchable in Zotero’s powerful quick search and advanced search panes. In short, this process will provide a tremendous incentive for scholars to donate to the Commons, since it will help them with their own research.

3. Enabling Networked References and Annotations

One of the pillars of scholarship is the ability for distributed scholars to be sure they are referencing the same text or evidence. As noted in #1, one of the great advantages of the Zotero Commons at IA will be the transport of scholarly materials currently residing on personal hard drives to a public space with stable, rather than local, addresses. These addresses will become critical as scholars begin to use, refer to, and cite items in the Commons.

Yet the IA/Zotero partnership has another benefit: as scholars begin to use not only traditional primary sources that have been digitized but also “born digital” materials on the web (blogs, online essays, documents transcribed into HTML), the possibility arises for Zotero users to leverage the resources of IA to ensure a more reliable form of scholarly communication. One of the Internet Archive’s great strengths is that it has not only archived the web but also given each page a permanent URI that includes a time and date stamp in addition to the URL.

Currently when a scholar using Zotero wishes to save a web page for their research they simply store a local copy. For some, perhaps many, purposes this is fine. But for web documents that a scholar believes will be important to share, cite, or collaboratively annotate (e.g., among a group of coauthors of an article or book) we will provide a second option in the Zotero web save function to grab a permanent copy and URI from IA’s web archive. A scholar who shares this item in their library can then be sure that all others who choose to use it will be referring to the exact same document.

Moreover, unlike most research software the sophisticated annotation tools built into Zotero—the ability to highlight passages, add virtual Post-It notes, as well as regular notes on the overall document—maintain these annotations separately from the underlying document. This presents the exciting possibility for collaborative scholarly annotation of web pages.

4. Simplifying Collaborative Sharing

Groups of scholars also have the need to create more private “commons,” e.g., for documents that they would like to share in a limited way. In addition to the fully open Zotero Commons we will establish a mechanism for such restricted sharing. Via the Zotero Server, a user will be able to create a special collection with a distinct icon that shows up in the client interface (left column) for every member of the group.

Files added to these collections will be stored on the Internet Archive but will have restricted access. We believe that having these files reside on the IA server will encourage the donation of documents at the end of a collaborative project. The administrator of a shared collection will be able to move its contents into the fully open Zotero Commons via a single click in the administrative interface on the Zotero Server.

5. Facilitating Scholarly Discovery

The multiple libraries of content created by Zotero users and the multi-petabyte digital collections of the Internet Archive are resources that can potentially be of great use to the scholarly community. We believe that neither has experienced the level of exploration and usage we believe is possible through further development and collaboration.

The combined digital collections present opportunities for scholars to find primary research materials, to discover one another’s work, to identify materials that are already available in digital form and therefore do not need to be located and scanned, to find other scholars with similar interests and to share their own insights broadly. We plan to leverage the combined strengths of the Zotero project and the Internet Archive to work on better discovery tools.

Symposium on the Future of Scholarly Communication

Thursday, November 15th, 2007

For those who missed it, between October 12 and 27, 2007, there was a very thoughtful and insightful online discussion of how the publication of scholarship is changing—or trying to change—in the digital age. Participating in the discussion were Ed Felton, David Robinson, Paul DiMaggio, and Andrew Appel from Princeton University (the symposium was hosted by the Center for Information Technology Policy at Princeton), Ira Fuchs of the Mellon Foundation, Peter Suber of the indispensable Open Access News blog (and philosophy professor at Earlham College), Stan Katz, the President Emeritus of the American Council of Learned Societies, and Laura Brown of Ithaka (and formerly the President of Oxford University Press USA).

The symposium is really worth reading from start to finish. (Alas, one of the drawbacks of hosting a symposium on a blog is that it keeps everything in reverse chronological order; it would be great if CITP could flip the posts now that the discussion has ended.) But for those of us in the humanities the most relevant point is that we are going to have a much harder transition to an online model of scholarship than in the sciences. The main reason for this is that for us the highest form of scholarship is the book, whereas in the sciences it is the article, which is far more easily put online, posted in various forms (including as pre- and e-prints), and networked to other articles (through, e.g., citation analysis). In addition, we’re simply not as technologically savvy. As Paul DiMaggio points out, “every computer scientist who received his or her Ph.D. in computer science after 1980 or so has a website” (on which they can post their scholarly production), whereas the number is about 40% for political scientists and I’m sure far less for historians and literature professors.

I’m planning a long post in this space on the possible ways for humanities professors to move from print to open online scholarship; this discussion is great food for thought.

Why Google Books Should Have an API

Tuesday, September 4th, 2007

No Way Out[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]

My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I’m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?

Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using BYU’s Time Magazine corpus).

But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in my last book I argued that mathematics was “secularized” in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited–I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.

How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it’s hard to know exactly what to assemble in advance–just treatises would leave out much of the story and evidence.

Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato’s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.

This is precisely the model I use for my Syllabus Finder. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated SOAP Search API) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API’s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).

It seems to me that a model is already in place at Google for such an API for Google Books: their special university researcher’s version of the Search API. That kind of restricted but powerful API program might be ideal because 1) I don’t think an API would be useful without the get_OCRed_text function, which (let’s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.

[Image credit: the best double-entendre cover I could find on Google Books: No Way Out by Beverly Hastings.]

A Closer Look at the National Archives-Footnote Agreement

Monday, February 5th, 2007

I’ve spent the past two weeks trying to get a better understanding of the agreement signed by the National Archives and Footnote, about which I raised several concerns in my last post. Before making further (possibly unfounded) criticisms I thought it would a good idea to talk to both NARA and Footnote. So I picked up the phone and found several people eager to clarify things. At NARA, Jim Hastings, director of access programs, was particularly helpful in explaining their perspective. (Alas, NARA’s public affairs staff seemed to have only the sketchiest sense of key details.) Most helpful—and most eager to rebut my earlier post—were Justin Schroepfer and Peter Drinkwater, the marketing director and product lead at Footnote. Much to their credit, Justin and Peter patiently answered most of my questions about the agreement and the operation of the Footnote website.

Surprisingly, everyone I spoke to at both NARA and Footnote emphasized that despite the seemingly set-in-stone language of the legal agreement, there is a great deal of latitude in how it is executed, and they asked me to spread the word about how historians and the general public can weigh in. It has received virtually no publicity, but NARA is currently in a public comment phase for the Footnote (a/k/a iArchives) agreement. Scroll down to the bottom of the “Comment on Draft Policy” page at NARA’s website and you’ll find a request for public comment (you should email your thoughts to Vision@nara.gov). It’s a little odd to have a request for comment after the ink is dry on an agreement or policy, and this URL probably should have been included in the press release of the Footnote agreement, but I do think after speaking with them that both NARA and Footnote are receptive to hearing responses to the agreement. Indeed, in response to this post and my prior post on the agreement, Footnote has set up a web page, “Finding the Right Balance,” to receive feedback from the general public on the issues I’ve raised. They also asked me to round up professional opinion on the deal.

I assume Footnote will explain their policies in greater depth on their blog, but we agreed that it would be helpful to record some important details of our conversations in this space. Here are the answers Justin and Peter gave to a few pointed questions.

When I first went to the Footnote site, I was unpleasantly surprised that it required registration even to look at “milestone” documents like Lincoln’s draft of the Gettysburg Address. (Unfortunately, Footnote doesn’t have a list of all of its free content yet, so it’s hard to find such documents.) Justin and Peter responded that when they launched the site there was an error in the document viewer, so they had to add authentication to all document views. A fix was rolled out on January 23, and it’s now possible to view these important documents without registering.

You do need to register, however, to print or download any document, whether it’s considered “free” or “premium.” Why? Justin and Peter candidly noted that although they have done digitization projects before, the National Archives project, which contains millions of critical—and public domain—documents, is a first for them. They are understandably worried about the “leakage” of documents from their site, and want to take it one step at a time. So to start they will track all downloads to see how much escapes, especially in large batches. I noted that downloading and even reusing these documents (even en masse) very well might be legal, despite Footnote’s terms of service, because the scans are “slavish” copies of the originals, which are not protected by copyright. Footnote lawyers are looking at copyright law and what other primary-source sites are doing, and they say that they view these initial months as a learning experience to see if the terms of service can or should change. Footnote’s stance on copyright law and terms of usage will clearly be worth watching.

Speaking of terms of usage, I voiced a similar concern about Footnote’s policies toward minors. As you’ll recall, Footnote’s terms of service say the site is intended for those 18 and older, thus seeming to turn away the many K-12 classes that could take advantage of it. Justin and Peter were most passionate on this point. They told me that Footnote would like to give free access to the site for the K-12 market, but pointed to the restrictiveness of U.S. child protection laws. Because the Footnote site allows users to upload documents as well as view them, they worry about what youngsters might find there in addition to the NARA docs. These laws also mandate the “over 18″ clause because the site captures personal information. It seems to me that there’s probably a technical solution that could be found here, similar to the one PBS.org uses to provide K-12 teaching materials without capturing information from the students.

Footnote seems willing to explore such a possibility, but again, Justin and Peter chalked up problems to the newness of the agreement and their inexperience running an interactive site with primary documents such as these. Footnote’s lawyers consulted (and borrowed, in some cases) the boilerplate language from terms of service at other sites, like Ancestry.com. But again, the Footnote team emphasized that they are going to review the policies and look into flexibility under the laws. They expect to tweak their policies in the coming months.

So, now is your chance to weigh in on those potential changes. If you do send a comment to either Footnote or NARA, try to be specific in what you would like to see. For instance, at the Center for History and New Media we are exploring the possibility of mining historical texts, which will only be possible to do on these millions of NARA documents if the Archives receives not only the page images from Footnote but also the OCRed text. (The handwritten documents cannot be automatically transcribed using optical character recognition, of course, but there are many typescript documents that have been converted to machine-readable text.) NARA has not asked to receive the text for each document back from Footnote—only the metadata and a combined index of all documents. There was some discussion that NARA is not equipped to handle the flood of data that a full-text database would entail. Regardless, I believe it would be in the best interest of historical researchers to have NARA receive this database, even if they are unable to post it to the web right away.