<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dan Cohen&#039;s Digital Humanities Blog &#187; Text Mining</title>
	<atom:link href="http://www.dancohen.org/category/text-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dancohen.org</link>
	<description>Covering the intersection of digital technology and research, teaching, and learning in the humanities, including search, data mining, website development and design, and programming.</description>
	<lastBuildDate>Fri, 10 Feb 2012 03:20:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>A Million Syllabi</title>
		<link>http://www.dancohen.org/2011/03/30/a-million-syllabi/</link>
		<comments>http://www.dancohen.org/2011/03/30/a-million-syllabi/#comments</comments>
		<pubDate>Thu, 31 Mar 2011 01:37:03 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Archives]]></category>
		<category><![CDATA[Pedagogy]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/?p=1345</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=A+Million+Syllabi&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Archives&amp;rft.subject=Pedagogy&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2011-03-30&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2011/03/30/a-million-syllabi/&amp;rft.language=English"></span>
Today I&#8217;m releasing a database of over a million syllabi gathered by my Syllabus Finder tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I&#8217;m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude. I [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=A+Million+Syllabi&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Archives&amp;rft.subject=Pedagogy&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2011-03-30&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2011/03/30/a-million-syllabi/&amp;rft.language=English"></span>
<p>Today I&#8217;m releasing a database of over a million syllabi gathered by my <a href="http://chnm.gmu.edu/syllabus-finder/syllabi/">Syllabus Finder</a> tool from 2002 to 2009. My hope is that this unique corpus will be helpful for a broad range of researchers. I&#8217;m fairly sure this is the largest collection of syllabi ever gathered, probably by several orders of magnitude.</p>
<p>I created the Syllabus Finder in 2002 when Google released their first API to access their search engine. The initial API included the ability to grab cached HTML from millions of web pages, which I realized could then be scanned using high-relevancy keywords to identify pages that were most likely syllabi. In addition to my lousy PHP code that got it up and running, the brilliant <a href="http://simonster.com/">Simon Kornblith</a> wrote some additional code to make it work well. The result was a tool that was quite popular (1.3 million queries) until Google deprecated their original API in 2009 in favor of (what I consider to be) a less useful API. (With the original API you could basically clone google.com, which I&#8217;m sure was not popular at the Googleplex.)</p>
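<p>To make the keyword-scanning idea concrete, here is a minimal sketch of how a cached page might be scored. It is not the Syllabus Finder&#8217;s actual code: the keyword list, the weights, the threshold, and the function names are all assumptions made up for illustration.</p>

```python
import re

# Hypothetical keywords and weights; the tool's real list is not given in this post.
SYLLABUS_KEYWORDS = {
    "syllabus": 3.0,
    "office hours": 2.0,
    "required texts": 2.0,
    "course description": 2.0,
    "grading": 1.0,
    "readings": 1.0,
}

TAG_RE = re.compile(r"<[^>]+>")


def syllabus_score(html: str) -> float:
    """Sum the weights of the keywords that appear in the page text."""
    text = TAG_RE.sub(" ", html).lower()   # crude tag stripping
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return sum(weight for kw, weight in SYLLABUS_KEYWORDS.items() if kw in text)


def looks_like_syllabus(html: str, threshold: float = 5.0) -> bool:
    """Flag a page as a likely syllabus when its score reaches the (assumed) threshold."""
    return syllabus_score(html) >= threshold
```

<p>A scheme like this will always let through some look-alike pages, such as assignments that share the same vocabulary.</p>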
<p>If you are interested in the kind of research that can be done on these syllabi, please read my <em>Journal of American History</em> article &#8220;<a href="http://www.dancohen.org/publications/#by_the_book">By the Book: Assessing the Place of Textbooks in U.S. Survey Courses</a>.&#8221; For that article I used regular expressions to pull book titles out of a thousand American history surveys to see how textbooks and other works are used by instructors. Some hidden elements emerged. I&#8217;m excited to see what creative ideas other scholars and researchers come up with for this large database.</p>
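<p>As a rough illustration of the regular-expression approach, the sketch below pulls italicized phrases out of cached HTML and counts how many pages mention each one. The pattern and the filtering rule (at least two words, starting with a capital letter) are assumptions for this example, not the expressions used for the article.</p>

```python
import re
from collections import Counter

# Assumption: book titles are usually wrapped in <em>, <i>, or <cite> tags.
ITALIC_RE = re.compile(r"<(em|i|cite)>(.+?)</\1>", re.IGNORECASE | re.DOTALL)


def candidate_titles(html):
    """Return italicized phrases that look like titles."""
    titles = []
    for _tag, inner in ITALIC_RE.findall(html):
        title = re.sub(r"\s+", " ", inner).strip(" .,;:")
        # Keep multi-word phrases that start with a capital letter.
        if len(title.split()) >= 2 and title[0].isupper():
            titles.append(title)
    return titles


def title_counts(pages):
    """Count the number of pages on which each candidate title appears."""
    counts = Counter()
    for html in pages:
        counts.update(set(candidate_titles(html)))  # one vote per page
    return counts
```

<p>Real syllabi format titles in many other ways (underlining, quotation marks, plain text), so a pattern this simple would miss a great deal.</p>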
<p>Some important clarifications and caveats:</p>
<p>1) I&#8217;m providing this archive in the same spirit (and under the same regulations) that the <a href="http://archive.org">Internet Archive</a> provides web corpora (indeed, this corpus could probably be recreated from the Internet Archive&#8217;s <a href="http://www.archive.org/web/web.php">Wayback Machine</a>, albeit after a lot of work). To the best of my knowledge, and because of the way they were obtained, all of the documents this database contains were posted on the open web, and were cached (or not) respecting open-web standards such as <a href="http://en.wikipedia.org/wiki/Robots_exclusion_standard">robots.txt</a>. It does not contain any syllabi that were posted in private places, such as gated <a href="http://blackboard.com/">Blackboard</a> installations. Indeed, I suspect that most of these syllabi come from universities where it is expected that professors post syllabi in an open fashion (as is the case here at <a href="http://www.gmu.edu">Mason</a>), or from professors like me who believe that openness is good for scholarship and teaching. But as with the Internet Archive, if you are the creator of a syllabus and really can&#8217;t sleep unless it is purged from this research database, <a href="http://www.dancohen.org/bio/">contact me</a>.</p>
<p>2) This database is provided <strong>as is</strong> and without support. I get enough email and unfortunately cannot answer questions. If you are appreciative, you can make a <a href="http://chnm.gmu.edu/donate/">tax-free donation</a> to the <a href="http://chnm.gmu.edu">Center for History and New Media</a>, for which you will receive a hug from me. The database is intended for non-commercial use of the type seen in my <em>JAH</em> article.</p>
<p>3) The database is an SQL dump consisting of 1.4 million rows. The columns are syllabiID (the Syllabus Finder&#8217;s unique identifier), url (web address of the syllabus at the time it was found), title (of the web page the syllabus was on), date_added (when it was added to the Syllabus Finder database), and chnm_cache (the HTML of the page on the date it was added). The database is 804 MB uncompressed. The corpus is heavily U.S.-centric because web pages were matched to English-language words, and for a time the Syllabus Finder only took pages from .edu domains (thus leaving out, e.g., .ac.uk URLs).</p>
<p>4) Because the Syllabus Finder was completely automated, some percentage of the 1.4 million documents are not syllabi (my best guess is about 20%). Most often these incorrect matches are associated course documents such as assignments, which are interesting in their own right. But some are oddball documents that just looked like syllabi to the algorithms. I have made no attempt to weed them out.</p>
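<p>For anyone planning to load the dump, the sketch below shows one way to explore the five columns described in point 3. It uses an in-memory SQLite table as a stand-in; the table name <code>syllabi</code>, the column types, and the sample rows are assumptions, and the real dump may need adjusting before it loads into a particular database.</p>

```python
import sqlite3
from collections import Counter
from urllib.parse import urlparse


def top_level_domains(conn):
    """Count rows by the last label of the URL's host (edu, uk, ...)."""
    counts = Counter()
    for (url,) in conn.execute("SELECT url FROM syllabi"):
        host = urlparse(url).hostname or ""
        counts[host.rsplit(".", 1)[-1]] += 1
    return counts


def rows_missing_cache(conn):
    """Count rows whose cached HTML is absent or empty."""
    row = conn.execute(
        "SELECT COUNT(*) FROM syllabi WHERE chnm_cache IS NULL OR chnm_cache = ''"
    ).fetchone()
    return row[0]


# Stand-in table using the column names from the post; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE syllabi (syllabiID INTEGER PRIMARY KEY, url TEXT, "
    "title TEXT, date_added TEXT, chnm_cache TEXT)"
)
conn.executemany(
    "INSERT INTO syllabi VALUES (?, ?, ?, ?, ?)",
    [
        (1, "http://history.example.edu/hist101.html", "HIST 101", "2004-09-01", "<html>...</html>"),
        (2, "http://www.example.edu/eng200/", "ENG 200", "2005-01-15", ""),
        (3, "http://www.example.ac.uk/syllabus", "Module Guide", "2008-03-02", "<html>...</html>"),
    ],
)
```

<p>Calling <code>top_level_domains(conn)</code> gives a quick read on how U.S.-centric a sample is, and <code>rows_missing_cache(conn)</code> shows how many rows lack their cached HTML.</p>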
<p><em>If you understand all of this clearly</em>, then here&#8217;s a million syllabi for you: <a href="http://www.dancohen.org/files/chnm_syllabus_finder_corpus_2011-03-30.sql.zip">CHNM Syllabus Finder Corpus, Version 1.0</a> (30 March 2011) (265 MB download, zipped SQL file)</p>
<p><strong>UPDATE 1 (11pm 3/30/11):</strong> Matt Burton <a href="http://www.dancohen.org/2011/03/30/a-million-syllabi/#comment-6561">has helpfully provided</a> a torrent for this file. If you can, please <a href="http://tweedpiratebay.appspot.com/static/chnm_syllabus_finder_corpus.torrent">use it</a> instead of the direct download.</p>
<p><strong>UPDATE 2 (9pm 3/31/11):</strong> Unfortunately I should have checked the exported database before posting. Version 1.0 does indeed have the URLs, titles, and dates of about 1.45 million syllabi but it is missing a majority of the HTML caches of those syllabi. I am working to recreate the full database, which will be much larger and more useful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2011/03/30/a-million-syllabi/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Initial Thoughts on the Google Books Ngram Viewer and Datasets</title>
		<link>http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/</link>
		<comments>http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/#comments</comments>
		<pubDate>Mon, 20 Dec 2010 02:11:16 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/?p=1283</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Initial+Thoughts+on+the+Google+Books+Ngram+Viewer+and+Datasets&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Books&amp;rft.subject=Google&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2010-12-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/&amp;rft.language=English"></span>
First and foremost, you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer and (perhaps even more exciting for the geeks among us) the associated datasets. In the same way that the main Google Books site has introduced many scholars to the [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Initial+Thoughts+on+the+Google+Books+Ngram+Viewer+and+Datasets&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Books&amp;rft.subject=Google&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2010-12-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/&amp;rft.language=English"></span>
<p>First and foremost, you have to be the most jaded or cynical scholar  not to be excited by the release of the <a href="http://ngrams.googlelabs.com/">Google Books Ngram Viewer</a> and  (perhaps even more exciting for the geeks among us) the <a href="http://ngrams.googlelabs.com/datasets">associated  datasets</a>. In the same way that the main Google Books site has introduced  many scholars to the potential of digital collections on the web,  Google Ngrams will introduce many scholars to the possibilities of digital research. There are precious few easy-to-use tools that allow one to  explore text-mining patterns and anomalies; perhaps only <a href="http://wordle.net">Wordle</a> has the  same dead-simple, addictive quality as Google Ngrams. <strong>Digital humanities needs gateway  drugs. Kudos to the pushers on the Google Books team.</strong></p>
<p>Second, on the concurrent launch of &#8220;<a href="http://culturomics.org/">Culturomics</a>&#8221;: Naming new fields is always contentious, as is declaring precedence. Yes, it was slightly annoying to have the Harvard/MIT scholars behind this coinage and <a href="http://www.sciencemag.org/content/early/2010/12/15/science.1199644">the article</a> that launched it, Michel et al., stake out supposedly new ground without making sufficient reference to <a href="http://digitalhumanities.org/companion/">prior</a> <a href="http://llc.oxfordjournals.org/">work</a> and even (ahem) some <a href="http://victorianbooks.org/">vaguely familiar, if simpler, graphs</a> and <a href="http://www.dancohen.org/2010/10/04/searching-for-the-victorians/">intellectual justifications</a>. Yes, &#8220;Culturomics&#8221; <a href="https://twitter.com/#%21/wragge/status/15752481099227136">sounds like an 80s new wave band</a>. If we&#8217;re going to coin neologisms, let&#8217;s at least go with Sean Gillies&#8217; satirical alternative: <strong><em><a href="http://twitter.com/#%21/sgillies/status/15617171308675072">Freakumanities</a></em></strong>. No, there were <a href="http://sappingattention.blogspot.com/2010/12/missing-humanists.html">no humanities scholars in sight</a> in the Culturomics article. But I&#8217;m also sure that longtime &#8220;humanities computing&#8221; scholars consider advocates of &#8220;digital humanities&#8221; like me Johnnies-come-lately. Luckily, <strong><a href="http://www.foundhistory.org/2010/05/26/why-digital-humanities-is-%E2%80%9Cnice%E2%80%9D/">digital humanities is nice</a></strong>, and so let us all welcome Michel et al. to the fold, applaud their work, and do what we can to learn from their clever formulations. (But c&#8217;mon, Cantabs, at least return the favor by <a href="https://twitter.com/#!/culturomics">following some people on Twitter</a>.)</p>
<p>Third, on the quality and utility of the data: To be sure, there are issues. Some big ones. Mark Davies <a href="http://corpus.byu.edu/coha/compare-culturomics.asp">makes some excellent points</a> about why his <a href="http://corpus.byu.edu/coha/">Corpus of Historical American English</a> (COHA) might be a better choice for researchers, including more nuanced search options and better variety and normalization of the data. Natalie Binder asks <a href="http://thebinderblog.com/2010/12/17/googles-word-engine-isnt-ready-for-prime-time/">some tough questions</a> about Google&#8217;s OCR. On Twitter many of us were finding serious problems with the long &#8220;s&#8221; before 1800 (Danny Sullivan got <a href="http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181">straight to the naughty point</a> with his discourse on the history of the f-bomb). But the  Freakumanities, er, Culturomics guys themselves <a href="http://www.culturomics.org/Resources/A-users-guide-to-culturomics">talk about this problem in their caveats</a>, <a href="http://ngrams.googlelabs.com/info">as does Google</a>.</p>
<p>Moreover, the data will improve. The Google n-grams are already over a year old, and the plan is to release new data as soon as it can be compiled. In addition, unlike text-mining tools like COHA, Google Ngrams is multilingual. For the first time, historians working on  Chinese, French, German, and Spanish sources can do what many of us have been  doing for some time. <strong>Professors love to look a gift horse in the mouth. But let&#8217;s also ride the horse and see where it takes us.</strong></p>
<p>So where does it take us? My initial tests on the viewer and examination of the datasets—which, unlike the public site, allow you to count words not only by overall instances but, critically, by number of <em>pages</em> those instances appear on and number of <em>works</em> they appear in—hint at much work to be done:</p>
<p>1) <strong>The best possibilities for deeper humanities research are likely in the longer n-grams, not in the unigrams</strong>. While everyone obsesses about individual words (<a href="http://www.dancohen.org/2010/10/04/searching-for-the-victorians/">guilty here too of unigramism</a>) or about proper names (which are generally bigrams), more elaborate and interesting interpretations are likelier in the 4- and 5-grams since they begin to provide some context. For instance, if you want to look at the history of marriage, charting the word itself is far less interesting than seeing if it co-occurs with words like &#8220;loving&#8221; or &#8220;arranged.&#8221; (This is something we learned in working on our NEH-funded grant on text mining for historians.)</p>
<p>2) We should remember that some of the best uses of Google&#8217;s n-grams will come from <strong>using this data along with other data</strong>. My gripe with the &#8220;Culturomics&#8221; name was that it implied (from &#8220;genomics&#8221;) that some single massive dataset, like the human genome, will be the be-all and end-all for cultural research. But much of the best digital humanities work has come from mashing up data from different domains. Creative scholars will find ways to use the Google n-grams in concert with other datasets from cultural heritage collections.</p>
<p>3) Despite my occasional griping about the Culturomists, they did some rather clever things with statistics in the latter part of their article to tease out cultural trends. <strong>We historians and humanists should be looking carefully at the more complex formulations of  Michel et al.</strong>, when they move beyond  linguistics and unigram patterns to investigate in shrewd ways topics  like how fleeting fame is and whether the suppression of authors by totalitarian regimes  works. Good stuff.</p>
<p>4) For me, <strong>the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading</strong>, from the bird&#8217;s eye view to the actual texts. Historical trends often need to be investigated in detail (another lesson from our NEH grant), and it&#8217;s not entirely clear if you move from Ngram Viewer to the main Google Books interface that you&#8217;ll get the book scans the data represents. That&#8217;s why I have my students use Mark Davies&#8217; <a href="http://corpus.byu.edu/time/">Time Magazine Corpus</a> when we begin to study historical text mining—they can easily look at specific magazine articles when they need to.</p>
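<p>The marriage example in point 1 can be sketched against the downloadable 5-gram files. The code below assumes each line is tab-separated as n-gram, year, match count, page count, and volume count, which is my reading of the dataset documentation; the function names and the idea of totaling volume counts per year are illustrative choices, not anything Google provides.</p>

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class NgramRow(NamedTuple):
    ngram: str
    year: int
    match_count: int    # overall instances
    page_count: int     # pages the n-gram appears on
    volume_count: int   # works the n-gram appears in


def parse_row(line: str) -> NgramRow:
    """Parse one line, assuming the tab-separated layout described above."""
    ngram, year, matches, pages, volumes = line.rstrip("\n").split("\t")
    return NgramRow(ngram, int(year), int(matches), int(pages), int(volumes))


def cooccurrence_by_year(lines: Iterable[str], target: str, cue: str) -> dict:
    """Total the volume counts, per year, of n-grams containing both words."""
    totals = defaultdict(int)
    for line in lines:
        row = parse_row(line)
        words = row.ngram.lower().split()
        if target in words and cue in words:
            totals[row.year] += row.volume_count
    return dict(totals)
```

<p>Because the same book can contain several matching 5-grams, these totals count n-gram appearances across works rather than distinct works, so they are best read as a rough trend line.</p>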
<p>How do you plan to use the Google Books Ngram Viewer and its associated data? I would love to hear your ideas for smart work in history and the humanities in the comments, and will update this post with my own further thoughts as they occur to me.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>New York Times Covers Victorian Books Project</title>
		<link>http://www.dancohen.org/2010/12/03/new-york-times-covers-victorian-books-project/</link>
		<comments>http://www.dancohen.org/2010/12/03/new-york-times-covers-victorian-books-project/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 21:05:20 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/?p=1272</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New+York+Times+Covers+Victorian+Books+Project&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Books&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2010-12-03&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2010/12/03/new-york-times-covers-victorian-books-project/&amp;rft.language=English"></span>
Patricia Cohen of the New York Times has been working on an excellent series on digital humanities, and her second article focuses on our text mining work on Victorian books, which was directly enabled by a grant from Google and more broadly enabled by a previous grant from the National Endowment for the Humanities to [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New+York+Times+Covers+Victorian+Books+Project&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Books&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2010-12-03&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2010/12/03/new-york-times-covers-victorian-books-project/&amp;rft.language=English"></span>
<p><a href="http://www.nytimes.com/imagepages/2010/12/04/books/04victorian-graphic.html"><img class="alignnone size-full wp-image-1273" title="nytimes_victorian_books_universal_graphic" src="http://www.dancohen.org/wp/wp-content/uploads/2010/12/nytimes_victorian_books_universal_graphic.gif" border="0" alt="" width="500" height="163" /></a></p>
<p><a href="http://topics.nytimes.com/topics/reference/timestopics/people/c/patricia_cohen/index.html">Patricia Cohen</a> of the <em><a href="http://nytimes.com">New York Times</a></em> has been working on an excellent series on digital humanities, and her second article <a href="http://www.nytimes.com/2010/12/04/books/04victorian.html?pagewanted=all">focuses on our text mining work</a> on <a href="http://victorianbooks.org">Victorian books</a>, which was directly enabled by a grant from <a href="http://google.com">Google</a> and more broadly enabled by a previous grant from the <a href="http://neh.gov">National Endowment for the Humanities</a> to explore text mining in history. I&#8217;m glad Cohen (no relation) captured the nuances and caveats as well as the potential of digital methods. I also liked how the graphics department did <a href="http://www.nytimes.com/imagepages/2010/12/04/books/04victorian-graphic.html">a great job converting and explaining some of our graphs</a>.</p>
<p>I previously posted <a href="http://www.dancohen.org/2010/10/04/searching-for-the-victorians/">a rough transcript of my talk</a> on Victorian history and literature that Cohen mentions in the piece. She also covered my work earlier this year in <a href="http://www.nytimes.com/2010/08/24/arts/24peer.html?pagewanted=all">an article on peer review</a> that was much debated in academia.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2010/12/03/new-york-times-covers-victorian-books-project/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Searching for the Victorians</title>
		<link>http://www.dancohen.org/2010/10/04/searching-for-the-victorians/</link>
		<comments>http://www.dancohen.org/2010/10/04/searching-for-the-victorians/#comments</comments>
		<pubDate>Tue, 05 Oct 2010 00:45:09 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Humanities]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/?p=1057</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Searching+for+the+Victorians&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Google&amp;rft.subject=Humanities&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2010-10-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2010/10/04/searching-for-the-victorians/&amp;rft.language=English"></span>
[A rough transcript of my keynote at the Victorians Institute Conference, held at the University of Virginia on October 1-3, 2010. The conference had the theme "By the Numbers." Attended by "analog" Victorianists as well as some budding digital humanists, I was delighted by the incredibly energetic reaction to this talk—many terrific questions and ideas [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Searching+for+the+Victorians&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Google&amp;rft.subject=Humanities&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2010-10-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2010/10/04/searching-for-the-victorians/&amp;rft.language=English"></span>
<p>[<em>A rough transcript of my keynote at the Victorians Institute Conference, held at the University of Virginia on October 1-3, 2010. The conference had the theme "By the Numbers." Attended by "analog" Victorianists as well as some budding digital humanists, I was delighted by the incredibly energetic reaction to this talk—many terrific questions and ideas for doing scholarly text mining from those who may have never considered it before. The talk incorporates work on historical text mining under an NEH grant, as well as the first results of a grant that Fred Gibbs and I were awarded from Google to mine their vast collection of books.</em>]</p>
<p>Why did the Victorians look to mathematics to achieve certainty, and how might we understand the Victorians better with the mathematical methods they bequeathed to us? I want to relate the Victorian debate about the foundations of our knowledge to a debate that we are likely to have in the coming decade, a debate about how we know the past and how we look at the written record that I suspect will be of interest to literary scholars and historians alike. It is a philosophical debate about idealism, empiricism, induction, and deduction, but also a practical discussion about the methodologies we have used for generations in the academy.</p>
<p><strong>Victorians and the Search for Truth</strong></p>
<p>Let me start, however, with the Heavens. This is Neptune. It was seen for the first time through a telescope in 1846.</p>
<p><img class="aligncenter size-full wp-image-1058" title="victorians_keynote_uva_2oct2010.002" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.002.jpg" alt="" width="500" /></p>
<p>At the time, the discovery was hailed as a feat of pure mathematics, since two mathematicians, one from France, Urbain Le Verrier, and one from England, John Couch Adams, had independently calculated Neptune’s position using mathematical formulas. There were dozens of poems written about the discovery, hailing the way these mathematicians had, like &#8220;magicians&#8221; or &#8220;prophets,&#8221; divined the Truth (often written with a capital T) about Neptune.</p>
<p>But in the less-triumphal aftermath of the discovery, it could also be seen as a case of the impact of cold calculation and the power of a good data set. Although pure mathematics, to be sure, were involved—the equations of geometry and gravity—the necessary inputs were countless observations of other heavenly bodies, especially precise observations of perturbations in the orbit of Uranus caused by Neptune. It was intellectual work, but <em>intellectual work informed by a significant amount of data</em>.</p>
<p>The Victorian era saw tremendous advances in both pure and applied mathematics. Both were involved in the discovery of Neptune: the pure mathematics of the ellipse and of gravitational pull; the computational modes of plugging observed coordinates into algebraic and geometrical formulas.</p>
<p>Although often grouped together under the banner of &#8220;mathematics,&#8221; the techniques and attitudes of pure and applied forms diverged significantly in the nineteenth century. By the end of the century, pure mathematics and its associated realm of symbolic logic had become so abstract and removed from what the general public saw as math—that is, numbers and geometric shapes—that Bertrand Russell could famously conclude in 1901 (in a Seinfeldian moment) that mathematics was a science about nothing. It was a set of signs and operations completely divorced from the real world.</p>
<p>Meanwhile, the early calculating machines that would lead to modern computers were proliferating, prodded by the rise of modern bureaucracy and capitalism. Modern statistics arrived, with its very unpure notions of good-enough averages and confidence levels.</p>
<p>The Victorians thus experienced the very modern tension between pure and applied knowledge, art and craft. They were incredibly self-reflective about the foundations of their knowledge. Victorian mathematicians were often <em>philosophers</em> of mathematics as much as practitioners of it. They repeatedly asked themselves: How could they know truth through mathematics? Similarly, as Meegan Kennedy has shown, in putting patient data into tabular form for the first time—thus enabling the discernment of patterns in treatment—Victorian doctors began wrestling with whether their discipline should be data-driven or should remain subject to the &#8220;genius&#8221; of the individual doctor.</p>
<p>Two mathematicians I studied for <a href="http://www.dancohen.org/publications/#equations_from_god"><em>Equations from God</em></a> used their work in mathematical logic to assail the human propensity to come to conclusions using faulty reasoning or a small number of examples, or by an appeal to interpretive genius. George Boole (1815-1864), the humble father of the logic that is at the heart of our computers, was the first professor of mathematics at Queen&#8217;s College, Cork. He had the misfortune of arriving in Cork (from Lincoln, England) on the eve of the famine and increasing sectarian conflict and nationalism.</p>
<p style="text-align: center;"><img class="size-medium wp-image-1177   aligncenter" title="George_Boole" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/George_Boole-245x300.jpg" alt="" width="245" height="300" /></p>
<p>Boole spent the rest of his life trying to find a way to rise above the conflict he saw all around him. He saw his revolutionary mathematical logic as a way to dispassionately analyze arguments and evidence. His seminal work, <em>The Laws of Thought</em>, is as much a work of literary criticism as it is of mathematics. In it, Boole deconstructs texts to find the truth using symbolical modes.</p>
<p>The stained-glass window in Lincoln Cathedral honoring Boole includes the biblical story of Samuel, which the mathematician enjoyed. It&#8217;s a telling expression of Boole&#8217;s worry about how we come to know Truth. Samuel hears the voice of God three times, but each time cannot definitively understand what he is hearing. In his humility, he wishes not to jump to divine conclusions.</p>
<p><img class="aligncenter size-full wp-image-1065" title="victorians_keynote_uva_2oct2010.005" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.005.jpg" alt="" width="500" /></p>
<p>Not jumping to conclusions based on limited experience was also a strong theme in the work of Augustus De Morgan (1806-1871). De Morgan, co-discoverer of symbolic logic and the first professor of mathematics at University College London, had a similar outlook to Boole&#8217;s, but a much more abrasive personality. He rather enjoyed proving people wrong, and also loved to talk about how quickly human beings leap to opinions.</p>
<p style="text-align: center;"><img class="size-full wp-image-1174  aligncenter" title="AugustusDeMorgan" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/AugustusDeMorgan.png" alt="" width="198" height="240" /></p>
<p>De Morgan would give this hypothetical: “Put it to the first comer, what he thinks on the question whether there be volcanoes on the unseen side of the moon larger than those on our side. The odds are, that though he has never thought of the question, he has a pretty stiff opinion in three seconds.&#8221; Human nature, De Morgan thought, was too inclined to make mountains out of molehills, conclusions from scant or no evidence. He put everyone on notice that their deeply held opinions or interpretations were subject to verification by the power of logic and mathematics.</p>
<p>As Walter Houghton highlighted in his reading of the Victorian canon, <em>The Victorian Frame of Mind, 1830-1870</em>, the Victorians were truth-seekers <em>and</em> skeptics. They asked how they could know better, and challenged their own assumptions.</p>
<p><strong>Foundations of Our Own Knowledge</strong></p>
<p>This attitude seems healthy to me as we present-day scholars add digital methods of research to our purely analog ones. Many humanities scholars have been satisfied, perhaps unconsciously, with the use of a limited number of cases or examples to prove a thesis. Shouldn&#8217;t we ask, like the Victorians, what can we do to be most certain about a theory or interpretation? If we use intuition based on close reading, for instance, is that enough?</p>
<p>Should we be worrying that our scholarship might be anecdotally correct but comprehensively wrong? Is 1 or 10 or 100 or 1000 books an adequate sample to know the Victorians? What might we do with <em>all</em> of Victorian literature: not a sample, or a few canonical texts, as in Houghton’s work, but <em>all of it</em>?</p>
<p>These questions were foremost in my mind as Fred Gibbs and I began work on our Google digital humanities grant that is attempting to apply text mining to our understanding of the Victorian age. If Boole and De Morgan were here today, how acceptable would our normal modes of literary and historical interpretation be to them?</p>
<p>As Victorianists, we are rapidly approaching the time when we have access—including, perhaps, computational access—to the full texts not of thousands of Victorian books, or hundreds of thousands, but virtually all books published in the Victorian age. Projects like <a href="http://books.google.com">Google Books</a>, the Internet Archive&#8217;s <a href="http://openlibrary.org">OpenLibrary</a>, and <a href="http://hathitrust.org">HathiTrust</a> will become increasingly important to our work.</p>
<p><img class="aligncenter size-full wp-image-1072" title="victorians_keynote_uva_2oct2010.012" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.012.jpg" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1073" title="victorians_keynote_uva_2oct2010.013" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.013.jpg" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1074" title="victorians_keynote_uva_2oct2010.014" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.014.jpg" alt="" width="500" /></p>
<p>If we were to look at all of these books using the computational methods that originated in the Victorian age, what would they tell us? And would that analysis be somehow more “true” than looking at a small subset of literature, the books we all have read that have often been used as representative of the Victorian whole, or, if not entirely representative, at least indicative of some deeper Truth?</p>
<p>Fred and I have received back from Google a first batch of data. This first run is limited just to words in the titles of books, but even so is rather suggestive of the work that can now be done. This data covers the 1,681,161 books that were published in English in the UK in the long nineteenth century, 1789-1914. We have normalized the data in many ways, and for the most part the charts I&#8217;m about to show you graph the data from zero to one percent of all books published in a year so that they are on the same scale and can be visually compared.</p>
<p>Multiple printings of a book in a single year have been collapsed into one &#8220;expression.&#8221; (For the library nerds in the audience, the data has been partially FRBRized. One could argue that we should have accepted the accentuation of popular titles that went through many printings in a single year, but editions and printings in subsequent years do count as separate expressions. We did not go up to the level of &#8220;work&#8221; in the FRBR scale, which would have collapsed all expressions of a book into one data point.)</p>
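<p>To make the counting concrete, here is a minimal sketch in Python of the two steps just described: collapsing same-year printings into one expression, then computing the percentage of a year&#8217;s books whose titles contain a given word. It is an illustration only, not the pipeline behind the charts. The <code>(title, year)</code> record layout and the function name <code>title_word_share</code> are assumptions, and matching on the title string alone is a simplification (a real catalog would also compare authors and edition metadata).</p>

```python
import re
from collections import defaultdict


def title_word_share(records, word):
    """Return {year: percent of that year's distinct titles containing `word`}.

    `records` is an iterable of (title, year) pairs. This layout is a
    hypothetical stand-in, not the format of the Google data.
    """
    # Collapse multiple printings of one title within one year into a single
    # "expression". The same title in a later year still counts separately.
    expressions = {(title.strip().lower(), year) for title, year in records}

    totals = defaultdict(int)
    hits = defaultdict(int)
    for title, year in expressions:
        totals[year] += 1
        # Compare whole words so that "art" does not match "party".
        if word.lower() in re.findall(r"[a-z]+", title):
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Toy catalog for illustration; the years are not real publication data.
records = [
    ("The French Revolution", 1837),
    ("The French Revolution", 1837),  # second printing in the same year
    ("The French Revolution", 1839),  # later year: a separate expression
    ("Sartor Resartus", 1837),
    ("Past and Present", 1839),
    ("Chartism", 1839),
]
shares = title_word_share(records, "revolution")
print(shares[1837])  # 50.0: one of the two distinct 1837 titles
```

<p>Dividing by each year&#8217;s total is what lets charts for different words share one scale, since the sheer number of books published grows over the century.</p>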
<p>We plan to do much more; in the pipeline are analyses of the use of words in the full texts (not just titles) of those 1.7 million books, a comprehensive exploration of the use of the Bible throughout the nineteenth century, and more. And more could be done to further normalize the data, such as accounting for the changing meaning of words over time.</p>
<p><strong>Validation</strong></p>
<p>So what does the data look like even at this early stage? And does it seem valid? That is where we began our analysis, with graphs of the percent of all books published with certain words in the titles (y-axis) on a year by year basis (x-axis). Victorian intellectual life as it is portrayed in this data set is in many respects consistent with what we already know.</p>
<p>The frequency chart of books with the word &#8220;revolution&#8221; in the title, for example, shows spikes where it should, around the French Revolution and the revolutions of 1848. (Keen-eyed observers will also note spikes for a minor, failed revolt in England in 1817 and the successful 1830 revolution in France.)</p>
<p><img class="aligncenter size-full wp-image-1132" title="chart_revolution" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_revolution.png" alt="" width="500" /></p>
<p>Books about science increase as they should, though with some interesting leveling off in the late Victorian period. (We are aware that the word &#8220;science&#8221; changes over this period, becoming more associated with natural science rather than generalized knowledge.)</p>
<p><img class="aligncenter size-full wp-image-1133" title="chart_science" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_science.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1134" title="chart_scientific" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_scientific.png" alt="" width="500" /></p>
<p>The rise of factories&#8230;</p>
<p><img class="aligncenter size-full wp-image-1136" title="chart_industrial" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_industrial.png" alt="" width="500" /></p>
<p>and the concurrent Victorian nostalgia for the more sedate and communal Middle Ages&#8230;</p>
<p><img class="aligncenter size-full wp-image-1137" title="chart_middle_ages" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_middle_ages.png" alt="" width="500" /></p>
<p>&#8230;and the sense of modernity, a new phase beyond the medieval organization of society and knowledge that many Britons still felt in the eighteenth century.</p>
<p><img class="aligncenter size-full wp-image-1138" title="chart_modern" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_modern.png" alt="" width="500" /></p>
<p><strong>The Victorian Crisis of Faith, and Secularization</strong></p>
<p>Even more validation comes from some basic checks of key Victorian themes such as the crisis of faith. These charts are as striking as any portrayal of the secularization that took place in Great Britain in the nineteenth century.</p>
<p><img class="aligncenter size-full wp-image-1139" title="chart_religion" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_religion.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1140" title="chart_sacred" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_sacred.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1141" title="chart_worship" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_worship.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1142" title="chart_faith" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_faith.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1143" title="chart_divine" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_divine.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1144" title="chart_churches" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_churches.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1145" title="chart_god" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_god.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1146" title="chart_christian" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_christian.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1147" title="chart_bible" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_bible.png" alt="" width="500" /></p>
<p><strong>Correlation Is (not) Truth</strong></p>
<p>So it looks fairly good for this methodology. Except, of course, for some obvious pitfalls. Looking at the charts of a hundred words, Fred noticed a striking correlation between the publication of books on &#8220;belief,&#8221; &#8220;atheism,&#8221; and&#8230;&#8221;Aristotle&#8221;?</p>
<p><img class="aligncenter size-full wp-image-1148" title="chart_atheism_belief_aristotle" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_atheism_belief_aristotle.png" alt="" width="500" /></p>
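<p>The kind of correlation at issue can be measured with Pearson&#8217;s coefficient between two yearly series. The sketch below is a generic implementation, not necessarily how the charts above were compared, and the numbers in it are invented purely for illustration. They are not read off the charts.</p>

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)


# Invented yearly percentages for two title words (illustration only).
belief = [0.10, 0.20, 0.30, 0.40]
aristotle = [0.01, 0.02, 0.03, 0.04]
r = pearson(belief, aristotle)
print(round(r, 6))
```

<p>A coefficient near 1 says only that the two lines rise and fall together. It says nothing about why, which is exactly the pitfall here.</p>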
<p>Obviously, we cannot simply take the data at face value. As I have put it on my blog, we have to be on guard for <a href="../../blog/posts/its_about_russia">oversimplifications that are the equivalent of saying that <em>War and Peace</em> is about Russia</a>. We have to marry these attempts at what Franco Moretti has called &#8220;distant reading&#8221; with more traditional close reading to find rigorous interpretations behind the overall trends.</p>
<p><strong>In Search of New Interpretations</strong></p>
<p>Nevertheless, even at this early stage of the Google grant, there are numerous charts that are suggestive of new research that can be done, or that expand on existing research. Correlation can, if we go from the macro level to the micro level, help us to illustrate some key features of the Victorian age better. For instance, the themes of Jeffrey von Arx&#8217;s <em>Progress and Pessimism: Religion, Politics and History in Late Nineteenth Century Britain</em>, in which he notes the undercurrent of depression in the second half of the century, are strongly supported and enhanced by the data.</p>
<p><img class="aligncenter size-full wp-image-1150" title="chart_progress" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_progress.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1151" title="chart_improvement" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_improvement.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1152" title="chart_hope" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_hope.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1153" title="chart_happiness" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_happiness.png" alt="" width="500" /></p>
<p>And given the following charts, we can imagine writing much more about the decline of certainty in the Victorian age. &#8220;Universal&#8221; is probably the most striking graph of our first data set, but they all show telling slides toward relativism that begin before most interpretations in the secondary literature.</p>
<p><img class="aligncenter size-full wp-image-1154" title="chart_universal" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_universal.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1155" title="chart_virtue" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_virtue.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1156" title="chart_vice" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_vice.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1157" title="chart_truth" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_truth.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1158" title="chart_false" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_false.png" alt="" width="500" /></p>
<p>Rather than looking for what we expect to find, perhaps we can have the computer show us tens, hundreds, or even thousands of these graphs. Many will confirm what we already know, but some will be strikingly new and unexpected. Many of those may show false correlations or have other problems (such as the changing or multiple meaning of words), but some significant minority of them will reveal to us new patterns, and perhaps be the basis of new interpretations of the Victorian age.</p>
<p>What if I were to give you Victorianists hundreds of these charts?</p>
<p><img class="aligncenter size-full wp-image-1108" title="victorians_keynote_uva_2oct2010.050" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.050.jpg" alt="" width="500" /></p>
<p>I believe it is important to keep our eyes open about the power of this technique. At the very least, it can tell us, as Augustus De Morgan would, when we have made mountains out of molehills. If we do explore this new methodology, we might be able to find some charts that pique our curiosity as knowledgeable readers of the Victorians. We’re the ones that can accurately interpret the computational results.</p>
<p>We can see the rise of the modern work lifestyle&#8230;</p>
<p><img class="aligncenter size-full wp-image-1159" title="chart_work" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_work.png" alt="" width="500" /></p>
<p>&#8230;or explore the interaction between love and marriage, an important theme in the recent literature.</p>
<p><img class="aligncenter size-full wp-image-1160" title="chart_love" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_love.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1161" title="chart_marriage" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_marriage.png" alt="" width="500" /></p>
<p>We can look back at the classics of secondary literature, such as Houghton&#8217;s <em>Victorian Frame of Mind</em>, and ask whether those works hold up to the larger scrutiny of virtually all Victorian books, rather than just the limited set of books those authors used. For instance, while in general our initial study supports Houghton&#8217;s interpretations, it also shows relatively few books on heroism, a theme Houghton adopts from Thomas Carlyle.</p>
<p><img class="aligncenter size-full wp-image-1169" title="chart_heroic" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_heroic.png" alt="" width="500" /></p>
<p>And where is the supposed Victorian obsession with theodicy in this chart on books about &#8220;evil&#8221;?</p>
<p><img class="aligncenter size-full wp-image-1193" title="chart_evil" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_evil1.png" alt="" width="500" /></p>
<p>Even more suggestive are the contrasts and anomalies. For instance, publications on &#8220;Jesus&#8221; are relatively static compared to those on &#8220;Christ,&#8221; which drop from nearly 1 in 60 books in 1843 to less than 1 in 300 books 70 years later.</p>
<p><img class="aligncenter size-full wp-image-1163" title="chart_jesus" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_jesus.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1164" title="chart_christ" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_christ.png" alt="" width="500" /></p>
<p>The impact of the ancient world on the Victorians can be contrasted (albeit with a problematic dual modern/ancient meaning for Rome)&#8230;</p>
<p><img class="aligncenter size-full wp-image-1165" title="chart_rome" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_rome.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1166" title="chart_greece" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_greece.png" alt="" width="500" /></p>
<p>&#8230;as can the Victorians&#8217; varying interest in the afterlife.</p>
<p><img class="aligncenter size-full wp-image-1167" title="chart_heaven" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_heaven.png" alt="" width="500" /></p>
<p><img class="aligncenter size-full wp-image-1168" title="chart_hell" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/chart_hell.png" alt="" width="500" /></p>
<p>I hope that these charts have prodded you to consider the anecdotal versus the comprehensive, and the strengths and weaknesses of each. It is time we had the kind of serious debate about measurement and interpretation that the Victorians had, not just in the digital humanities but in the humanities more generally. Can we be so confident in our methods of extrapolating from some literary examples to the universal whole?</p>
<p><img class="aligncenter size-full wp-image-1128" title="victorians_keynote_uva_2oct2010.070" src="http://www.dancohen.org/wp/wp-content/uploads/2010/10/victorians_keynote_uva_2oct2010.070.jpg" alt="" width="500" /></p>
<p>This is a debate that we should have in the present, aided by our knowledge of what the Victorians struggled with in the past.</p>
<p>[Image credits (other than graphs): Wikimedia Commons]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2010/10/04/searching-for-the-victorians/feed/</wfw:commentRss>
		<slash:comments>33</slash:comments>
		</item>
		<item>
		<title>Postdoc in Text Mining at CHNM</title>
		<link>http://www.dancohen.org/2008/04/03/postdoc-in-text-mining-at-chnm/</link>
		<comments>http://www.dancohen.org/2008/04/03/postdoc-in-text-mining-at-chnm/#comments</comments>
		<pubDate>Thu, 03 Apr 2008 20:19:34 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Jobs]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/?p=274</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Postdoc+in+Text+Mining+at+CHNM&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Jobs&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2008-04-03&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2008/04/03/postdoc-in-text-mining-at-chnm/&amp;rft.language=English"></span>
[Yes, we're hiring again. Come join us if this sounds like you!] The Center for History and New Media (CHNM) at George Mason University is seeking a postdoctoral fellow to work on a new text-mining initiative supported by the National Endowment for the Humanities. ABD candidates are also strongly encouraged to apply. This is a [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Postdoc+in+Text+Mining+at+CHNM&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Jobs&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2008-04-03&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2008/04/03/postdoc-in-text-mining-at-chnm/&amp;rft.language=English"></span>
<p>[<em>Yes, we're hiring again. Come join us if this sounds like you!</em>]</p>
<p><a href="http://chnm.gmu.edu">The Center for History and New Media</a> (CHNM) at <a href="http://www.gmu.edu">George Mason University</a> is seeking a postdoctoral fellow to work on a new text-mining initiative supported by the <a href="http://www.neh.gov">National Endowment for the Humanities</a>. ABD candidates are also strongly encouraged to apply. This is a grant-funded, two-year position that is particularly appropriate for someone with interests in computational linguistics, machine learning, or technology and the humanities and social sciences. Specific background and experience are less important than the ability to learn new technical skills quickly. Knowledge of some combination of the following would be particularly helpful: Java, JavaScript, MySQL, PHP, or object-oriented programming. Ability to work in a team is very important. CHNM (http://chnm.gmu.edu), known for innovative work in digital media, is located in Fairfax, Virginia, 15 miles from Washington, DC, and is accessible by public transportation. Please send a cover letter and resume, including relevant programming projects and experience, to <a href="mailto:chnm@gmu.edu">chnm@gmu.edu</a> with subject line &#8220;Text Mining.&#8221; We will begin considering applications on 5/1/2008 and continue until the position is filled. Applications without a cover letter will not be considered.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2008/04/03/postdoc-in-text-mining-at-chnm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enhancing Historical Research With Text-Mining and Analysis Tools</title>
		<link>http://www.dancohen.org/2008/02/04/enhancing-historical-research-with-text-mining-and-analysis-tools/</link>
		<comments>http://www.dancohen.org/2008/02/04/enhancing-historical-research-with-text-mining-and-analysis-tools/#comments</comments>
		<pubDate>Tue, 05 Feb 2008 02:54:41 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Text Mining]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/2008/02/04/enhancing-historical-research-with-text-mining-and-analysis-tools/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Enhancing+Historical+Research+With+Text-Mining+and+Analysis+Tools&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=History&amp;rft.subject=Research&amp;rft.subject=Text+Mining&amp;rft.subject=Tools&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2008-02-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2008/02/04/enhancing-historical-research-with-text-mining-and-analysis-tools/&amp;rft.language=English"></span>
I&#8217;m delighted to announce that beginning this summer the Center for History and New Media will undertake a major two-year study of the potential of text-mining tools for historical (and by extension, humanities) scholarship. The project, entitled &#8220;Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools,&#8221; has just received generous [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Enhancing+Historical+Research+With+Text-Mining+and+Analysis+Tools&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=History&amp;rft.subject=Research&amp;rft.subject=Text+Mining&amp;rft.subject=Tools&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2008-02-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2008/02/04/enhancing-historical-research-with-text-mining-and-analysis-tools/&amp;rft.language=English"></span>
<p><img src="http://www.dancohen.org/wp/wp-content/uploads/2008/02/521804_42ea4f44af_m.jpg" alt="Open Book" align="left" hspace="10" />I&#8217;m delighted to announce that beginning this summer the <a href="http://chnm.gmu.edu">Center for History and New Media</a> will undertake a major two-year study of the potential of text-mining tools for historical (and by extension, humanities) scholarship. The project, entitled &#8220;Scholarship in the Age of Abundance: Enhancing Historical Research With Text-Mining and Analysis Tools,&#8221; has just received generous funding from the <a href="http://www.neh.gov">National Endowment for the Humanities</a>.</p>
<p>In the last decade the library community and other providers of digital collections have created an incredibly rich digital archive of historical and cultural materials. Yet most scholars have not yet figured out ways to take full advantage of the digitized riches suddenly available on their computers. Indeed, the abundance of digital documents has actually exacerbated the problems of some researchers who now find themselves overwhelmed by the sheer quantity of available material. Meanwhile, some of the most profound insights lurking in these digital corpora remain locked up.</p>
<p>For some time computer scientists have been pursuing text mining as a solution to the problem of abundance, and there have even been a few attempts at bringing text-mining tools to the humanities (such as <a href="http://www.monkproject.org/">the MONK project</a>). Yet there is not as much research as one might hope on what non-technically savvy scholars (especially historians) might actually want and use in their research, and how we might integrate sophisticated text analysis into the workflow of these scholars.</p>
<p>We will first conduct a survey of historians to examine closely their use of digital resources and prospect for particularly helpful uses of digital technology. We will then explore three main areas where text mining might help in the research process: locating documents of interest in the sea of texts online; extracting and synthesizing information from these texts; and analyzing large-scale patterns across these texts. A focus group of historians will be used to assess the efficacy of different methods of text mining and analysis in real-world research situations in order to offer recommendations, and even some tools, for the most promising approaches.</p>
<p>In addition to other forms of dissemination, I will of course provide project updates in this space.</p>
<p>[<em>Image credit: <a href="http://flickr.com/photos/mattwright/521804/">Matt Wright</a></em>]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2008/02/04/enhancing-historical-research-with-text-mining-and-analysis-tools/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Why Google Books Should Have an API</title>
		<link>http://www.dancohen.org/2007/09/04/why-google-books-should-have-an-api/</link>
		<comments>http://www.dancohen.org/2007/09/04/why-google-books-should-have-an-api/#comments</comments>
		<pubDate>Tue, 04 Sep 2007 19:20:45 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[APIs]]></category>
		<category><![CDATA[Books]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Open Access]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/2007/09/04/why-google-books-should-have-an-api/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Why+Google+Books+Should+Have+an+API&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=APIs&amp;rft.subject=Books&amp;rft.subject=Google&amp;rft.subject=Open+Access&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-09-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/09/04/why-google-books-should-have-an-api/&amp;rft.language=English"></span>
[This post is a version of a message I sent to the listserv for CenterNet, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their Google Books program, but I have been arguing for an API instead. My sense is that this [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Why+Google+Books+Should+Have+an+API&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=APIs&amp;rft.subject=Books&amp;rft.subject=Google&amp;rft.subject=Open+Access&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-09-04&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/09/04/why-google-books-should-have-an-api/&amp;rft.language=English"></span>
<p><img src="http://www.dancohen.org/wp/wp-content/uploads/2007/09/books.jpeg" alt="No Way Out" align="left" border="0" hspace="10" /><em>[This post is a version of a message I sent to <a href="http://lists.digitalhumanities.org/mailman/listinfo/centernet">the listserv for CenterNet</a>, the consortium of digital humanities centers. Google has expressed interest in helping CenterNet by providing a (limited) corpus of full texts from their <a href="http://books.google.com">Google Books</a> program, but I have been arguing for an <a href="http://www.dancohen.org/2005/11/21/do-apis-have-a-place-in-the-digital-humanities/">API</a> instead. My sense is that this idea has considerable support but that there are also some questions about the utility of an API, including from within Google.]</em></p>
<p>My argument for an API over an extracted corpus of books begins with a fairly simple observation: how are we to choose a particular dataset for Google to compile for us? I&#8217;m a scholar of the Victorian era, so a large corpus from the nineteenth century would be great, but how about those who study the Enlightenment? If we choose novels, what about those (like me) who focus on scientific literature? Moreover, many of us wish to do more expansive horizontal (across genres in a particular age) and vertical (within the same genre but through large spans of time) analyses. How do we accommodate the wishes of everyone who does computational research in the humanities?</p>
<p>Perhaps some of the misunderstanding here is about the kinds of research a humanities scholar might do as opposed to, say, the computational linguist, who might make use of a dataset or corpus (generally a broad and/or normalized one) to assess the nature of (a) language itself, examine frequencies and patterns of words, or address computer science problems such as document classification. Some of these corpora can provide a historian like me with insights as long as the time span involved is long enough and each document includes important metadata such as publication date (e.g., you can trace the rise and fall of certain historical themes using <a href="http://corpus.byu.edu/time/">BYU&#8217;s Time Magazine corpus</a>).</p>
<p>But there are many other analyses that humanities scholars could undertake with an API, especially one that allowed them to first search for books of possible interest and then to operate on the full texts of that ad hoc corpus. An example from my own research: in <a href="http://www.dancohen.org/publications/#equations_from_god">my last book</a> I argued that mathematics was &#8220;secularized&#8221; in the nineteenth century, and part of my evidence was that mathematical treatises, which normally contained religious language in the early nineteenth century, lost such language by the end of the century. By necessity, researching in the pre-Google Books era, my textual evidence was limited&#8211;I could only read a certain number of treatises and chose to focus on the writing of high-profile mathematicians.</p>
<p>How would I go about supporting this thesis today using Google Books? I would of course love to have an exhaustive corpus of mathematical treatises. But in my book I also used published books of poems, sermons, and letters about math. In other words, it&#8217;s hard to know exactly what to assemble in advance&#8211;just treatises would leave out much of the story and evidence.</p>
<p>Ideally, I would like to use an API to find books that matched a complicated set of criteria (it would be even better if I could use regular expressions to find the many variants of religious language and also to find religious language relatively close to mentions of mathematics), and then use get_cache to acquire the full OCRed text of these matching books. From that ad hoc corpus I would want to do some further computational analyses on my own server, such as extracting references to touchstones for the divine vision of mathematics (e.g., Plato&#8217;s later works, geometry rather than number theory), and perhaps even do some aggregate analyses (from which works did British mathematicians most often acquire this religious philosophy of mathematics?). I would also want to examine these patterns over time to see if indeed the bond between religion and mathematics declined in the late Victorian era.</p>
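<p>To make the proximity idea concrete, here is a minimal sketch of the analysis I would run on my own server once the full texts were in hand. Everything in it is an assumption for illustration: the two short vocabularies, the ten-word window, and the <code>(year, full_text)</code> input format are all hypothetical, and no Google Books API call is shown because no such function exists yet.</p>

```python
import re
from collections import Counter

# Hypothetical vocabularies; a real study would need far richer lists.
RELIGIOUS = r"(?:God|divine|Creator|Providence|heaven\w*)"
MATHEMATICAL = r"(?:mathematic\w*|geometr\w*|algebra\w*|calculus)"
# At most ten intervening words between the two kinds of term (an arbitrary window).
GAP = r"(?:\W+\w+){0,10}?\W+"

# Match religious language near mathematics, in either order.
NEAR = re.compile(
    rf"\b{RELIGIOUS}\b{GAP}{MATHEMATICAL}\b|\b{MATHEMATICAL}\b{GAP}{RELIGIOUS}\b",
    re.IGNORECASE,
)

def matches_by_decade(books):
    """Count books per decade that contain at least one proximity match.

    `books` is an iterable of (year, full_text) pairs.
    """
    counts = Counter()
    for year, text in books:
        if NEAR.search(text):
            counts[year // 10 * 10] += 1
    return dict(counts)

# Invented sample sentences standing in for OCRed books.
sample = [
    (1830, "Geometry reveals the mind of God."),
    (1835, "The divine science of geometry."),
    (1890, "A treatise on algebra and its applications."),
]
print(matches_by_decade(sample))
```

<p>Counting matching books per decade is the crudest possible measure; dividing by the number of books examined in each decade would be the obvious next step before reading anything into a decline.</p>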
<p>This is precisely the model I use for my <a href="http://chnm.gmu.edu/tools/syllabi">Syllabus Finder</a>. I first find possible syllabi using an algorithm-based set of searches of Google (via the unfortunately deprecated <a href="http://code.google.com/apis/soapsearch/">SOAP Search API</a>) while also querying local Center for History and New Media databases for matches. Since I can then extract the full texts of matching web pages from Google (using the API&#8217;s cache function), I can do further operations, such as pulling book assignments out of the syllabi (using regular expressions).</p>
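<p>As an illustration of that last step, here is a minimal sketch of pulling book assignments out of a syllabus with a regular expression. The pattern is hypothetical and is not the one the Syllabus Finder uses: it assumes each assignment sits on one line in the form &#8220;Author, Title (Publisher, Year),&#8221; which real syllabi often do not follow.</p>

```python
import re

# Hypothetical pattern: "Author Name, Title (Publisher, 1999)" on one line,
# optionally preceded by a "-" or "*" bullet.
BOOK = re.compile(
    r"^\s*(?:[-*]\s*)?([A-Z][\w.' -]+?),\s+(.+?)\s+\(([^()]+?),\s*(\d{4})\)",
    re.MULTILINE,
)

def extract_books(syllabus_text):
    """Return (author, title, year) tuples for lines that look like book citations."""
    return [
        (m.group(1), m.group(2), int(m.group(4)))
        for m in BOOK.finditer(syllabus_text)
    ]

# Invented syllabus fragment for demonstration.
syllabus = (
    "Required Texts\n"
    "- Thomas Kuhn, The Structure of Scientific Revolutions (University of Chicago Press, 1962)\n"
    "- Week 1: Introduction\n"
    "* Steven Shapin, The Scientific Revolution (University of Chicago Press, 1996)\n"
)
for author, title, year in extract_books(syllabus):
    print(author, "|", title, "|", year)
```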
<p>It seems to me that a model is already in place at Google for such an API for Google Books: <a href="http://research.google.com/university/search/">their special university researcher&#8217;s version of the Search API</a>. That kind of restricted but powerful API program might be ideal because 1) I don&#8217;t think an API would be useful without the get_OCRed_text function, which (let&#8217;s face it) liberates information that is currently very hard to get even though Google has recently released a plain text view of (only some of) its books; and 2) many of us want to ping the Google Books API with more than the standard daily hit limit for Google APIs.</p>
<p><em>[Image credit: the best double-entendre cover I could find on Google Books: </em>No Way Out<em> by Beverly Hastings.]</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2007/09/04/why-google-books-should-have-an-api/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Nora Project Screencast</title>
		<link>http://www.dancohen.org/2007/06/19/nora-project-screencast/</link>
		<comments>http://www.dancohen.org/2007/06/19/nora-project-screencast/#comments</comments>
		<pubDate>Tue, 19 Jun 2007 13:07:40 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/2007/06/19/nora-project-screencast/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Nora+Project+Screencast&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Software&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-06-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/06/19/nora-project-screencast/&amp;rft.language=English"></span>
The Nora text analysis and visualization project has a screencast out explaining how to use a new web interface to their server-based software.]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Nora+Project+Screencast&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Software&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-06-19&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/06/19/nora-project-screencast/&amp;rft.language=English"></span>
<p>The Nora text analysis and visualization project has <a href="http://noraproject.org/nora_ol_video/">a screencast</a> out explaining how to use a new web interface to their server-based software.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2007/06/19/nora-project-screencast/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>American Studies Tagline</title>
		<link>http://www.dancohen.org/2007/06/18/american-studies-tagline/</link>
		<comments>http://www.dancohen.org/2007/06/18/american-studies-tagline/#comments</comments>
		<pubDate>Mon, 18 Jun 2007 13:09:02 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[History]]></category>
		<category><![CDATA[Tagging]]></category>
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/2007/06/18/american-studies-tagline/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=American+Studies+Tagline&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=History&amp;rft.subject=Tagging&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-06-18&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/06/18/american-studies-tagline/&amp;rft.language=English"></span>
Dave Lester provides an interesting visualization of the history of American Studies over the last fifty years by running Lucy Maddox&#8217;s Locating American Studies: The Evolution of a Discipline through a tag cloud creator and then putting it on a slider timeline. Note the rise and fall of Leo Marx&#8217;s influence on the field, among [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=American+Studies+Tagline&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=History&amp;rft.subject=Tagging&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-06-18&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/06/18/american-studies-tagline/&amp;rft.language=English"></span>
<p>Dave Lester provides <a href="http://tagline.davelester.org/">an interesting visualization</a> of the history of American Studies over the last fifty years by running Lucy Maddox&#8217;s <em>Locating American Studies: The Evolution of a Discipline</em> through a tag cloud creator and then putting it on a slider timeline. Note the rise and fall of Leo Marx&#8217;s influence on the field, among other things.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2007/06/18/american-studies-tagline/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Million Books Workshop Wrap-up</title>
		<link>http://www.dancohen.org/2007/05/24/million-books-workshop-wrap-up/</link>
		<comments>http://www.dancohen.org/2007/05/24/million-books-workshop-wrap-up/#comments</comments>
		<pubDate>Thu, 24 May 2007 14:10:30 +0000</pubDate>
		<dc:creator>Dan Cohen</dc:creator>
				<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://www.dancohen.org/2007/05/24/million-books-workshop-wrap-up/</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Million+Books+Workshop+Wrap-up&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-05-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/05/24/million-books-workshop-wrap-up/&amp;rft.language=English"></span>
May has been a month of travel for me (thus the light posting in this space). I gave a talk about Zotero and related developments in the humanities and technology at the Stanford Humanities Center, and spoke at the annual meeting of the American Council of Learned Societies about how digital research is a major [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Million+Books+Workshop+Wrap-up&amp;rft.aulast=Cohen&amp;rft.aufirst=Dan&amp;rft.subject=Text+Mining&amp;rft.source=Dan+Cohen%26%23039%3Bs+Digital+Humanities+Blog&amp;rft.date=2007-05-24&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.dancohen.org/2007/05/24/million-books-workshop-wrap-up/&amp;rft.language=English"></span>
<p>May has been a month of travel for me (thus the light posting in this space). I gave a talk about <a href="http://www.zotero.org/">Zotero</a> and related developments in the humanities and technology at the <a href="http://shc.stanford.edu/">Stanford Humanities Center</a>, and spoke at the annual meeting of the <a href="http://www.acls.org/">American Council of Learned Societies</a> about how digital research is a major emerging theme in scholarship. Finally, I participated in the <a href="http://www.tufts.edu/">Tufts</a> <a href="http://devwiki.perseus.tufts.edu/wiki/Million_Books_Workshop">&#8220;Million Books&#8221; Workshop</a>, which explored the technical feasibility and theoretical validity of extracting evidence and meaning from the large new corpora of online texts. The three main topics were how to get from scanned documents (especially the complicated ones that scholars sometimes encounter, like Sanskrit manuscripts or early modern broadsides, rather than simply formatted texts like modern English books) to machine-readable text that can be searched and analyzed; machine translation of texts; and moving from text to actionable data (e.g., extracting all of the place names from a document or summarizing large masses of text). Some developments worth noting from the workshop:</p>
<p>I had vaguely heard about the open-source optical character recognition (OCR) project <a href="http://www.ocropus.org">OCRopus</a>, but <a href="http://www.iupr.org/tmb/">Thomas Breuel&#8217;s</a> detailed description of the project made it seem extremely promising, especially for scholarly applications. Even after two decades of research and development, the error rate of OCR is still too high for many historical texts, and atrocious for compound texts like <a href="http://www.dancohen.org/publications/#equations_from_god">Victorian mathematical monographs</a> (with all of those equations that end up, improperly and disastrously, as regular text after OCR) or works with vertical text (e.g., Japanese poetry) or images. OCRopus ambitiously plans to support any language written in any direction with any layout. It also breaks down the conversion of scans to text into separate processes that produce <em>probabilities rather than certainties</em>. This is critical. Most OCR packages give you a text result without noting where the software was unsure of a word or letter. Thus you might get &#8220;Cohem&#8221; rather than &#8220;Cohen&#8221; without knowing that the software thought long and hard about the correct interpretation of that last letter. OCRopus instead produces a statistical output that says to any end-user application (like search), &#8220;I&#8217;m sure about &#8216;Cohe&#8217; but the last letter has a 60% probability of being an &#8216;m&#8217; and a 40% probability of being an &#8216;n&#8217;.&#8221; A search for &#8220;Cohen&#8221; could thus return the document as a result even if the &#8220;final&#8221; transcription defaults to &#8220;Cohem.&#8221;</p>
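<p>A minimal sketch of why this matters for search, using the &#8220;Cohem&#8221;/&#8220;Cohen&#8221; numbers from the example above. The data structure here (one dictionary of candidate characters per position) and the 0.25 threshold are my own assumptions for illustration, not the actual OCRopus output format.</p>

```python
from math import prod

# Each position holds the recognizer's candidate characters and probabilities.
# Hypothetical structure; the probabilities follow the example in the text.
lattice = [
    {"C": 1.0}, {"o": 1.0}, {"h": 1.0}, {"e": 1.0},
    {"m": 0.6, "n": 0.4},
]

def best_reading(word_lattice):
    """Return the single most probable transcription of the word."""
    return "".join(max(slot, key=slot.get) for slot in word_lattice)

def match_probability(word_lattice, query):
    """Probability that the scanned word spells `query`, treating positions as independent."""
    if len(query) != len(word_lattice):
        return 0.0
    return prod(slot.get(ch, 0.0) for slot, ch in zip(word_lattice, query))

def search_hit(word_lattice, query, threshold=0.25):
    """Report a hit when the query is plausible enough (threshold is arbitrary)."""
    return match_probability(word_lattice, query) >= threshold

print(best_reading(lattice))
print(search_hit(lattice, "Cohen"))
```

<p>A conventional OCR package would keep only what <code>best_reading</code> returns and throw the rest away; keeping the alternatives is what lets a search for the less likely spelling still find the page.</p>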
<p>OCRopus also uses far more sophisticated methods than current OCR software to find titles, ordered blocks of text (like columns), and marginalia. Brilliantly, rather than outputting XML at the end of its processes, OCRopus outputs to XHTML and CSS3 so that it can much more accurately represent the fonts and layout of the original. Very impressive. The project is just in pre-alpha right now with a 1.0 release to come in the fall of 2008. Unsurprisingly, OCRopus is supported by <a href="http://www.google.com">Google</a>, which plans to use it for <a href="http://books.google.com">Google Book Search</a>. (Right now they have OCR that&#8217;s good enough for search, which doesn&#8217;t need anywhere near 100% accuracy, but they plan to re-OCR their book scans with OCRopus when it&#8217;s ready.)</p>
<p><a href="http://www.cs.jhu.edu/~dasmith/">David Smith</a> spoke about the cutting edge of machine translation (i.e., the use of computational methods to translate text from one language to another). The field seems extremely active right now, and new methods promise better translations in the near future. David spoke of several developments. First, many projects are seeding their software with parallel texts, such as documents from the United Nations or the European Parliament, which are translated very precisely by humans into many languages. Parallel text corpora (with English as one of the parallels) on the order of 20-200 million words (roughly 1-10 million sentences) are available for a number of languages. Unfortunately, the parallel texts often come from genres like laws, parliamentary proceedings, and religious texts (not only the Bible; quite interestingly, <em>Dianetics</em> is one English text that has been translated into virtually every language, including Uzbek). These genres are, of course, less than optimal for widespread translation uses. We might, however, be able to use parallel translated works from Google&#8217;s scans or the <a href="http://www.opencontentalliance.org">Open Content Alliance</a> to help improve the seed corpus.</p>
<p>Second, David noted the resilience of <a href="http://en.wikipedia.org/wiki/N-gram">n-gram</a> analysis, which breaks a document down into word pairs or triads. Usually you can predict the next word in a document by looking at the previous two words and assessing the probability of each word that might follow that pair. Most of the best machine translation services (like Google&#8217;s) now split a text into bi-grams and tri-grams (two- and three-word pieces) and then translate those n-grams into very exact parallels in the target text using an n-gram library. This is better at keeping the style of the text and avoiding the off-sounding literal translations that have dogged the field. David feels that machine translation has reached the point where it can tell a user when a primary source document has been mistranslated by a human, which can be very useful for scholarship.</p>
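<p>The prediction step (guessing the next word from the previous two) can be sketched in a few lines. This is only a toy tri-gram counter over an invented sentence, with made-up function names; it is not a translation system and not how any particular service is implemented.</p>

```python
from collections import Counter, defaultdict

def train_trigrams(text):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        model[(first, second)][third] += 1
    return model

def next_word_probabilities(model, first, second):
    """Relative frequency of each word seen after the pair (first, second)."""
    followers = model.get((first, second), Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

# Invented training text for demonstration.
corpus = (
    "the divine science of geometry and the divine science of number "
    "and the divine origin of truth"
)
model = train_trigrams(corpus)
print(next_word_probabilities(model, "the", "divine"))
```

<p>With only one sentence of training text the probabilities are meaningless, of course; the method earns its keep when the counts come from millions of sentences.</p>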
<p>Finally, <a href="http://www.cs.umass.edu/~mimno/">David Mimno</a> discussed how to move from the text that results from the work of OCR and machine translation (if necessary) into forms that will help with research and analysis in the humanities. David has been doing impressive work in document classification, i.e., computationally assessing a set of digitized texts and figuring out which ones are letters or poems or lab notes, or if the documents are all articles, separating them out into topic clusters. Like machine translation and OCR, when you begin to look under the hood this is an extraordinarily complicated field. The three main techniques—<a href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a> (SVM), <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive Bayes</a> (probably the best-known method, often used in spam filters), and <a href="http://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>—are best viewed mathematically, and so lie beyond the scope of this blog. David is working on the <a href="http://mallet.cs.umass.edu/">Mallet project</a> at the University of Massachusetts, Amherst, which seems promising for document classification (a topic we are increasingly interested in at the <a href="http://chnm.gmu.edu">Center for History and New Media</a> for historical research). The software is still in alpha but I plan to keep an eye on it.</p>
<p>Obviously a lot to think about from the month of May. How do we get these complicated tools to scholars who don&#8217;t have technical skills? How can we use these tools to reveal new, meaningful information about the past, <a href="http://www.dancohen.org/blog/posts/its_about_russia">without reproducing the obvious using computational means</a>? As I felt at the <a href="http://www.dancohen.org/blog/posts/digital_humanities_summit_wrap-up">National Endowment for the Humanities meeting in April</a>, the application of digital methods to the humanities is experiencing a burst of energy and attention in 2007. It will be interesting to see what happens next.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancohen.org/2007/05/24/million-books-workshop-wrap-up/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

