A Million Syllabi

Today I’m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I’m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.

I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant Simon Kornblith wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone google.com, which I’m sure was not popular at the Googleplex.)
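The keyword-scanning idea described above can be sketched roughly as follows. This is not the original PHP code: the keyword list, the weights, and the threshold are all assumptions invented for illustration, and the function names are hypothetical.

```python
import re

# Hypothetical keywords and weights; the Syllabus Finder's real list
# is not given in this post.
KEYWORDS = {
    "syllabus": 3.0,
    "office hours": 2.0,
    "required texts": 2.0,
    "readings": 1.0,
    "assignments": 1.0,
    "grading": 1.0,
}

TAG_RE = re.compile(r"<[^>]+>")


def syllabus_score(html: str) -> float:
    """Sum the weights of the keywords that appear in the page text."""
    # Crude tag stripping, then lowercase and collapse whitespace.
    text = TAG_RE.sub(" ", html).lower()
    text = re.sub(r"\s+", " ", text)
    return sum(weight for kw, weight in KEYWORDS.items() if kw in text)


def looks_like_syllabus(html: str, threshold: float = 4.0) -> bool:
    """Flag a page as a likely syllabus if its score meets the threshold."""
    return syllabus_score(html) >= threshold
```

A scorer this simple will also flag assignments and other course documents, which is consistent with the false matches noted in the caveats below.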

If you are interested in the kind of research that can be done on these syllabi, please read my Journal of American History article “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.” For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I’m excited to see what creative ideas other scholars and researchers come up with for this large database.
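For readers who want to try something similar, here is a minimal sketch of pulling candidate book titles out of cached HTML with regular expressions. It is not the pattern set used for the JAH article. It assumes, purely for illustration, that titles tend to be wrapped in italic-style tags, and the function names and sample titles are made up.

```python
import re
from collections import Counter

# Assumption: book titles are often marked up with <i>, <em>, or <cite>.
TITLE_RE = re.compile(r"<(i|em|cite)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def candidate_titles(html: str) -> list[str]:
    """Return italicized phrases of two or more words as candidate titles."""
    titles = []
    for _tag, inner in TITLE_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", inner)           # drop nested tags
        text = re.sub(r"\s+", " ", text).strip(" .,;:")
        if len(text.split()) >= 2:                      # skip "Ibid." and the like
            titles.append(text)
    return titles


def count_titles(pages: list[str]) -> Counter:
    """Count how many pages mention each candidate title (case-insensitive)."""
    counts = Counter()
    for html in pages:
        counts.update({t.lower() for t in candidate_titles(html)})
    return counts
```

Each page counts a title at most once, so the totals approximate how many syllabi list a work rather than how often it is mentioned.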

Some important clarifications and caveats:

1) I’m providing this archive in the same spirit (and under the same regulations) that the Internet Archive provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive’s Wayback Machine, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as robots.txt. It does not contain any syllabi that were posted in private places, such as gated Blackboard installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at Mason), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can’t sleep unless it is purged from this research database, contact me.

2) This database is provided as is and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a tax-free donation to the Center for History and New Media, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my JAH article.

3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder’s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., .ac.uk URLs).

4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.
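As a starting point for working with the columns listed in item 3, here is a small sketch using Python’s built-in sqlite3 module. It rests on several assumptions: that the dump has been loaded into a table named syllabi (a hypothetical name), that date_added begins with a four-digit year, and that an empty cache may be stored either as NULL or as an empty string. The demo rows are invented so the example runs on its own.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table with the columns described in the post."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE syllabi (
               syllabiID INTEGER PRIMARY KEY,
               url TEXT, title TEXT, date_added TEXT, chnm_cache TEXT)"""
    )
    rows = [  # invented sample rows, not real corpus data
        (1, "http://example.edu/hist101", "HIST 101", "2002-10-01", "<html>Syllabus</html>"),
        (2, "http://example.edu/hist102", "HIST 102", "2005-03-15", ""),
        (3, "http://example.ac.uk/h1", "History 1", "2008-01-20", None),
    ]
    conn.executemany("INSERT INTO syllabi VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def cache_coverage(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (total rows, rows whose chnm_cache is neither NULL nor empty)."""
    return conn.execute(
        """SELECT COUNT(*),
                  SUM(CASE WHEN chnm_cache IS NOT NULL AND chnm_cache <> ''
                           THEN 1 ELSE 0 END)
           FROM syllabi"""
    ).fetchone()


def rows_per_year(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count rows by the year prefix of date_added."""
    return conn.execute(
        """SELECT substr(date_added, 1, 4) AS year, COUNT(*)
           FROM syllabi GROUP BY year ORDER BY year"""
    ).fetchall()
```

Checking cache coverage first is a quick way to see how much page HTML a given copy of the database actually contains before planning any text analysis.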

If you understand all of this clearly, then here’s a million syllabi for you: CHNM Syllabus Finder Corpus, Version 1.0 (30 March 2011) (265 MB download, zipped SQL file)

UPDATE 1 (11pm 3/30/11): Matt Burton has helpfully provided a torrent for this file. If you can, please use it instead of the direct download.

UPDATE 2 (9pm 3/31/11): Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.

March 30, 2011 33 Comments

In Archives, Pedagogy, Text Mining

Comments

A Snarkmarket mini-collaboration: Snarksyllabi « Snarkmarket says:

March 30, 2011 at 10:27 pm

[…] for History and New Media at George Mason University released a really interesting dataset today: a million syllabi culled from the web, from […]

mcburton says:

March 30, 2011 at 11:38 pm

Hey all,
I just got a torrent tracker up and running and am seeding from home. Spread the word & help seed if you can!
http://tweedpiratebay.appspot.com/static/chnm_syllabus_finder_corpus.torrent

Let me know if it breaks
@mcburton

Syllabi Data Mining « Jonathan Tregear says:

March 31, 2011 at 1:32 am

[…] A Million Syllabi http://www.dancohen.org/2011/03/30/a-million-syllabi/ […]

Brian Croxall says:

March 31, 2011 at 9:37 am

What a great project, Dan. Thanks for making this open for everyone to play with!

Lev Manovich says:

March 31, 2011 at 1:00 pm

Great!

I was thinking for a while how nice it will be to get lots of syllabi and then look at stats for books used, terms etc

thank you,

Lev

Jason Priem says:

March 31, 2011 at 4:22 pm

You might want to check out:

Kousha, K., & Thelwall, M. (2008). Assessing the impact of disciplinary research on teaching: An automatic analysis of online syllabuses. Journal of the American Society for Information Science and Technology, 59(13), 2060-2069. doi:10.1002/asi.20920

They suggest inclusion in syllabi constitutes a neglected (and measurable) dimension of scholarly impact, at least in some fields. But they start from articles, and then do web searches to see if they’re in syllabi. With this dataset, you could approach it from the opposite direction, starting with syllabi.

New Million-Syllabi Repository Could Reveal Trends in Teaching - Wired Campus - The Chronicle of Higher Education says:

March 31, 2011 at 5:25 pm

[…] at George Mason University, hopes the repository of one million syllabi he posted today on his Web site will help fuel academic […]

Douglas Knox says:

March 31, 2011 at 8:21 pm

Dan, this is wonderful, creative, innovative almost ten years ago and no less so today. The interest in your release of this as a data set is heartening. I read and accepted the warranty in downloading, and have no regrets. Reluctantly, though, I have to question a detail relating to orders of magnitude.

In the file available for download it looks more like just under 17,000 syllabi. There are indeed more than 1.4 million rows in the database, but for most of them the chnm_cache field is empty — anything with an ID number over 20,823, or anything harvested after 2002. I double-checked this with spot inspection of the unzipped SQL file. Is there more data that didn’t make it through export? If it were a million files, wouldn’t it be more like maybe 20-40 gigabytes? A syllabus is more likely to be 10K or more than it is to be just 1 kilobyte. Have I miscalculated somewhere?

Even on an “as is” basis, more than 16,000 syllabi are plenty already to be interesting. At least 370 “blink” tags in the service of higher education in 2002 are evidence of a near-forgotten world now.

New Million-Syllabi Repository Could Reveal Trends in Teaching « The EdTech News Blog says:

April 1, 2011 at 8:19 am

[…] at George Mason University, hopes the repository of one million syllabi he posted today on his Web site will help fuel academic […]

Friday Quick Hits and Varia « The New Archaeology of the Mediterranean World says:

April 1, 2011 at 8:46 am

[…] fun stuff on teaching this week. First, Dan Cohen released his Million Syllabi into the world. He released it as a .sql file (for obvious and good reasons), but it would be more useful to me […]

Weekly News Roundup | MindShift says:

April 1, 2011 at 12:11 pm

[…] Dan Cohen has just released a database of over one million course syllabi, gathered from the Internet between 2002 and 2009. The data is […]

Weekend Reading: Carnival Edition - ProfHacker - The Chronicle of Higher Education says:

April 1, 2011 at 3:32 pm

[…] probably saw that Dan Cohen has released a million syllabi for text analysis and data-mining; at Snarkmarket, they’re having a […]

Ed-Tech Weekly News Roundup | Hack Education says:

April 2, 2011 at 9:19 am

[…] Dan Cohen has just released a database of over one million course syllabi, gathered from the Internet between 2002 and 2009. The data is […]

Alex Garcia says:

April 3, 2011 at 3:13 pm

Dan,

Awesome data, but as Douglas said above – having the full 1,000,000 records would be even better :). Are there any hopes that you will publish the full database?

Btw, I see that there are few PDFs in the mix, and I could not open a single one of them… Did they get damaged during export?

Alex

Brett Boessen says:

April 3, 2011 at 4:46 pm

I can’t get the torrent file to open — Vuze gives me an error.

This is Something… says:

April 4, 2011 at 6:50 pm

[…] for History and New Media at George Mason University released a really interesting dataset today: a million syllabi culled from the web, from […]

Euromachs Blog » Blog Archive » Web Readings Weekly Roundup says:

April 5, 2011 at 3:42 pm

[…] A Million Syllabi […]

Recent Linkage 12 « Signifying Media says:

April 8, 2011 at 4:07 am

[…] Cohen releases a database of over a million academic syllabi automatically collected […]

Paul Dixon says:

May 9, 2011 at 6:23 pm

Is there any update on this, or is the data too hard to recover cleanly?

Dan Cohen says:

May 10, 2011 at 9:59 am

@Paul: still working on it. Hoping to make some progress soon.

Martha Saavedra says:

May 26, 2011 at 5:40 pm

For a curriculum project, we worked on something similar specifically for African Studies in 2000. We didn’t set up a query, but found syllabi and entered URLs into a searchable database. Many of the links are dead, and of course, there were no resources to update this. Here is the link:
http://africa.berkeley.edu/academics/SyllabiSelector.php
I look forward to browsing your database.

A million syllabi « My History 511 Blog Site says:

February 10, 2012 at 12:00 pm

[…] came across this link for over 1.4 million syllabi, as compiled by Dan Cohen, over at CHNM. Granted, he admits that as […]

Harpreet Singh says:

February 21, 2012 at 7:45 pm

Dan, is the link to the syllabus finder tool broken? Where can I download the full 1 million syllabi? Thank you.

Dan Cohen says:

February 22, 2012 at 10:12 am

@Harpreet: For now, you can get the data set here. We are still working on getting the full text of the majority of the syllabi. Email me if you think you can help on that front.

Learning from other people – Academic Summer Camp (except in winter???) « Nick Falkner says:

June 2, 2012 at 2:43 pm

[…] on the “Million Syllabi Project Hack-a-thon“, where “we explore new ways of using the million syllabi dataset gathered by Dan Cohen’s Syllabus Finder Tool” (from the web site). 10 years worth of […]

Craft and Joseph-Nicholas named first DIL/IAH Faculty Fellows – Digital Innovation Lab says:

November 20, 2012 at 12:17 pm

[…] enables student and instructor inputs and a data mining and visualization tool that draws on the Syllabus Finder database, the Internet Archive, and the Common Crawl tool and corpus to produce within-system and broad […]

Craft and Joseph-Nicholas named first DIL/IAH Faculty Fellows Carolina Digital Humanities Initiative says:

November 20, 2012 at 1:24 pm

Burnable Books | Medieval Studies in the Age of Big Data: A serial forum says:

December 13, 2012 at 2:44 pm

[…] exponential increase in information and data it has enabled; Dan Cohen’s recent release of a million syllabi as a single searchable database is a case in point. Nowhere are the quantitative dimensions of this […]

new semester, new project | the ivi project: inquire, visualize, innovate says:

August 23, 2013 at 11:20 am

[…] from various institutions, scraping the Web (with inspiration from Dan Cohen’s earlier Syllabus Finder project), and begging UNC’s Sakai people for data dumps. Then, while presenting on a Digital […]

Free is better. Why I’m giving away my course. | A better train wreck. says:

June 2, 2014 at 11:54 pm

[…] course and give it away under some type of create commons licensing. There have been a variety of efforts to collect and publish syllabi, which might help researchers and intrepid faculty willing to mine […]

Embracing ephemerality in the digital humanities | history, CLASS says:

February 2, 2016 at 5:02 pm

[…] not. Sometimes some digital tool or platform that seems like a wonderful thing fizzles, like Dan Cohen’s marvelous Syllabus Finder, R.I.P., but at least eventually something more robust comes along. Even commercial tools get […]

More Than a Million Syllabuses at Your Fingertips - Artificial Intelligence Online says:

August 4, 2016 at 2:51 pm

[…] project to attempt to gather syllabuses together. The syllabus data came primarily from a project in the early 2000s by Dan Cohen while at George Mason University. He scraped the web for links to […]

Sharing Syllabi: What’s Gained, What Challenges Remain | After Class says:

October 12, 2018 at 11:29 am

[…] the University of North Carolina-Chapel Hill, and Swarthmore College, built off the 2002-2009 “Million Syllabi” database created by Dan Cohen, the Executive Director of the Digital Public Library of […]
