A Million Syllabi

Today I’m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I’m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.

I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant Simon Kornblith wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone google.com, which I’m sure was not popular at the Googleplex.)
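The keyword-scanning idea can be illustrated with a minimal sketch. This is not the Syllabus Finder's actual PHP code: the keywords, weights, threshold, and function names below are all hypothetical, chosen only to show how cached HTML might be scored for syllabus-like vocabulary.

```python
import re
from html import unescape

# Hypothetical keyword weights; the Syllabus Finder's real list is not given in this post.
KEYWORD_WEIGHTS = {
    "syllabus": 3.0,
    "office hours": 2.0,
    "required texts": 2.0,
    "course description": 1.5,
    "grading": 1.0,
    "readings": 1.0,
}

TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(html: str) -> str:
    """Crudely strip tags, decode entities, collapse whitespace, and lowercase."""
    text = unescape(TAG_RE.sub(" ", html))
    return re.sub(r"\s+", " ", text).lower()


def syllabus_score(html: str) -> float:
    """Sum the weights of every keyword that appears at least once in the page."""
    text = html_to_text(html)
    return sum(weight for keyword, weight in KEYWORD_WEIGHTS.items() if keyword in text)


def looks_like_syllabus(html: str, threshold: float = 4.0) -> bool:
    """Flag a page as a probable syllabus when its score reaches the (assumed) threshold."""
    return syllabus_score(html) >= threshold
```

A scorer this simple will produce false positives, such as assignment pages that happen to mention grading and readings, so anything built this way needs a tolerance for noise.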

If you are interested in the kind of research that can be done on these syllabi, please read my Journal of American History article “By the Book: Assessing the Place of Textbooks in U.S. Survey Courses.” For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I’m excited to see what creative ideas other scholars and researchers come up with for this large database.
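For a rough sense of what regex-based title extraction can look like, here is a small sketch. It is not the method used in the JAH article: it simply assumes that book titles are marked up with `<i>`, `<em>`, or `<cite>` tags, and the function names and sample titles are placeholders.

```python
import re
from collections import Counter

# Assumption: titles are wrapped in <i>, <em>, or <cite>; real syllabi vary widely.
ITALIC_RE = re.compile(r"<(i|em|cite)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def extract_titles(html):
    """Return italicized phrases of two or more words as candidate book titles."""
    titles = []
    for _tag, inner in ITALIC_RE.findall(html):
        title = re.sub(r"<[^>]+>", "", inner)           # drop any nested tags
        title = re.sub(r"\s+", " ", title).strip(" .,;:")
        if len(title.split()) >= 2:                      # skip single emphasized words
            titles.append(title)
    return titles


def count_titles(pages):
    """Count in how many pages each candidate title appears (once per page)."""
    counts = Counter()
    for html in pages:
        counts.update(set(extract_titles(html)))
    return counts


# Placeholder pages with made-up titles, for illustration only.
pages = [
    "<p>A. Author, <i>Example History Survey</i>, vol. 1</p><p><em>Required</em></p>",
    "<li><cite>Example History Survey</cite></li><li><i>Another Sample Reader</i>, 4th ed.</li>",
]
counts = count_titles(pages)
```

Counting each title once per page keeps a syllabus that repeats a title every week from dominating the totals.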

Some important clarifications and caveats:

1) I’m providing this archive in the same spirit (and under the same regulations) that the Internet Archive provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive’s Wayback Machine, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as robots.txt. It does not contain any syllabi that were posted in private places, such as gated Blackboard installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at Mason), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can’t sleep unless it is purged from this research database, contact me.

2) This database is provided as is and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a tax-free donation to the Center for History and New Media, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my JAH article.

3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder’s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., .ac.uk URLs).

4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.
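Given the columns listed in item 3, a first sanity check on the data might look like the sketch below. Several things here are assumptions: the table name `syllabi`, the ISO-style `date_added` strings, and the use of an in-memory SQLite database with made-up rows so the example runs on its own. The dump's actual table name and SQL dialect are not stated here, so loading the real file may require a different database engine.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with the columns described above (demo rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE syllabi (
               syllabiID  INTEGER PRIMARY KEY,
               url        TEXT,
               title      TEXT,
               date_added TEXT,
               chnm_cache TEXT
           )"""
    )
    rows = [
        (1, "http://example.edu/hist101", "HIST 101", "2002-10-01", "<html>Syllabus</html>"),
        (2, "http://example.edu/eng200", "ENG 200", "2003-02-15", ""),
        (3, "http://example.ac.uk/phil1", "PHIL 1", "2003-09-01", None),
    ]
    conn.executemany("INSERT INTO syllabi VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def cache_summary(conn):
    """Return (total rows, rows whose chnm_cache is neither NULL nor empty)."""
    total, cached = conn.execute(
        "SELECT COUNT(*), SUM(chnm_cache IS NOT NULL AND chnm_cache <> '') FROM syllabi"
    ).fetchone()
    return total, cached or 0


def rows_per_year(conn):
    """Count rows by the first four characters of date_added (assumes YYYY-MM-DD strings)."""
    return conn.execute(
        "SELECT substr(date_added, 1, 4) AS year, COUNT(*) "
        "FROM syllabi GROUP BY year ORDER BY year"
    ).fetchall()


conn = make_demo_db()
total, cached = cache_summary(conn)
per_year = rows_per_year(conn)
```

Comparing the total row count with the number of rows that actually carry cached HTML is a quick way to confirm that an export contains the page text and not just the metadata.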

If you understand all of this clearly, then here’s a million syllabi for you: CHNM Syllabus Finder Corpus, Version 1.0 (30 March 2011) (265 MB download, zipped SQL file)

UPDATE 1 (11pm 3/30/11): Matt Burton has helpfully provided a torrent for this file. If you can, please use it instead of the direct download.

UPDATE 2 (9pm 3/31/11): Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.

30 thoughts on “A Million Syllabi”

  1. Pingback: A Snarkmarket mini-collaboration: Snarksyllabi « Snarkmarket

  2. Pingback: Syllabi Data Mining « Jonathan Tregear

  3. Lev Manovich

    Great!

    I was thinking for a while how nice it will be to get lots of syllabi and then look at stats for books used, terms etc

    thank you,

    Lev

  4. Jason Priem

    You might want to check out:

    Kousha, K., & Thelwall, M. (2008). Assessing the impact of disciplinary research on teaching: An automatic analysis of online syllabuses. Journal of the American Society for Information Science and Technology, 59(13), 2060-2069. doi:10.1002/asi.20920

    They suggest inclusion in syllabi constitutes a neglected (and measurable) dimension of scholarly impact, at least in some fields. But they start from articles, and then do web searches to see if they’re in syllabi. With this dataset, you could approach it from the opposite direction, starting with syllabi.

  5. Pingback: New Million-Syllabi Repository Could Reveal Trends in Teaching - Wired Campus - The Chronicle of Higher Education

  6. Douglas Knox

    Dan, this is wonderful, creative, innovative almost ten years ago and no less so today. The interest in your release of this as a data set is heartening. I read and accepted the warranty in downloading, and have no regrets. Reluctantly, though, I have to question a detail relating to orders of magnitude.

    In the file available for download it looks more like just under 17,000 syllabi. There are indeed more than 1.4 million rows in the database, but for most of them the chnm_cache field is empty — anything with an ID number over 20,823, or anything harvested after 2002. I double-checked this with spot inspection of the unzipped SQL file. Is there more data that didn’t make it through export? If it were a million files, wouldn’t it be more like maybe 20-40 gigabytes? A syllabus is more likely to be 10K or more than it is to be just 1 kilobyte. Have I miscalculated somewhere?

    Even on an “as is” basis, more than 16,000 syllabi are plenty already to be interesting. At least 370 “blink” tags in the service of higher education in 2002 are evidence of a near-forgotten world now.

  7. Pingback: New Million-Syllabi Repository Could Reveal Trends in Teaching « The EdTech News Blog

  8. Pingback: Friday Quick Hits and Varia « The New Archaeology of the Mediterranean World

  9. Pingback: Weekly News Roundup | MindShift

  10. Pingback: Weekend Reading: Carnival Edition - ProfHacker - The Chronicle of Higher Education

  11. Pingback: Ed-Tech Weekly News Roundup | Hack Education

  12. Alex Garcia

    Dan,

    Awesome data, but as Douglas said above – having the full 1,000,000 records would be even better :). Are there any hopes that you will publish the full database?

    Btw, I see that there are few PDFs in the mix, and I could not open a single one of them… Did they get damaged during export?

    Alex

  13. Pingback: This is Something…

  14. Pingback: Euromachs Blog » Blog Archive » Web Readings Weekly Roundup

  15. Pingback: Recent Linkage 12 « Signifying Media

  16. Pingback: A million syllabi « My History 511 Blog Site

  17. Harpreet Singh

    Dan, is the link to the syllabus finder tool broken? Where can I download the full 1 million syllabi? Thank you.

  18. Dan Cohen Post author

    @Harpreet: For now, you can get the data set here. We are still working on getting the full text of the majority of the syllabi. Email me if you think you can help on that front.

  19. Pingback: Learning from other people – Academic Summer Camp (except in winter???) « Nick Falkner

  20. Pingback: Craft and Joseph-Nicholas named first DIL/IAH Faculty Fellows – Digital Innovation Lab

  21. Pingback: Craft and Joseph-Nicholas named first DIL/IAH Faculty Fellows Carolina Digital Humanities Initiative

  22. Pingback: Burnable Books | Medieval Studies in the Age of Big Data: A serial forum

  23. Pingback: new semester, new project | the ivi project: inquire, visualize, innovate

  24. Pingback: Free is better. Why I’m giving away my course. | A better train wreck.
