Dan Cohen

Google and the World of Search
Exploring the potential of search engines, data mining applications, web services, APIs, and related matters.



Google Fingers
Posted to Google and the World of Search on 26 June 2006, 12:22 AM EDT

No, it's not another amazing new piece of software from Google, which will type for you (though that would be nice). Just something that I've noticed while looking at many nineteenth-century books in Google's massive digitization project. The following screenshot nicely reminds us that at the root of the word "digitization" is "digit," which is from the Latin word "digitus," meaning finger. It also reminds us that despite our perception of Google as a collection of computer geniuses, and despite their use of advanced scanning technology, their library project involves an almost unfathomable amount of physical labor. I'm glad that here and there, the people doing this difficult work (or at least their fingers) are being immortalized... [Read on...]


10 Most Popular Philosophy Syllabi
Posted to Google and the World of Search on 21 May 2006, 10:44 PM EDT

It's time once again to find the most influential syllabi in a discipline (this time, philosophy) as determined by data gleaned from the Syllabus Finder. As with my earlier analysis of the most popular history syllabi, the following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), the number of times Syllabus Finder users inspected a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and the overall "attractiveness" of a syllabus (defined as the ratio of full reads to mere glances). It goes without saying (but I'll say it) that this methodology is unscientific and gives an advantage to older syllabi, but it still probably provides a good sense of the most visible and viewed syllabi on the web. Anyway, here are the ten most popular philosophy syllabi... [Read on...]
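For readers who want to see the "attractiveness" measure spelled out, here is a minimal sketch in Python. It assumes only what the excerpt states: attractiveness is the ratio of full reads to glances. The `SyllabusStats` record, the field names, and the `min_glances` cutoff are hypothetical, and sorting by the ratio alone is just one plausible ordering, not necessarily how the actual list was compiled.

```python
from dataclasses import dataclass


@dataclass
class SyllabusStats:
    """Hypothetical per-syllabus counters; not the Syllabus Finder's real schema."""
    url: str
    glances: int  # times the syllabus turned up in a search
    reads: int    # times a user clicked through to the syllabus itself


def attractiveness(stats: SyllabusStats) -> float:
    """Ratio of full reads to mere glances; 0.0 if the syllabus was never glanced at."""
    if stats.glances == 0:
        return 0.0
    return stats.reads / stats.glances


def rank_by_attractiveness(all_stats, min_glances=1):
    """Sort syllabi from most to least attractive.

    min_glances is an assumed cutoff that keeps rarely seen syllabi
    from topping the list on the strength of one or two clicks.
    """
    eligible = [s for s in all_stats if s.glances >= min_glances]
    return sorted(eligible, key=attractiveness, reverse=True)
```

A cutoff like `min_glances` matters because a syllabus seen twice and read twice would otherwise score a perfect 1.0.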


Google Book Search Blog
Posted to Google and the World of Search on 9 May 2006, 9:50 AM EDT

For those interested in the Google book digitization project (one of my three copyright-related stories to watch for 2006), Google launched an official blog yesterday. Right now "Inside Google Book Search" seems more like "Outside Google Book Search," with a first post celebrating the joys of books and discovery, and with a set of links lauding the project, touting "success stories," and soliciting participation from librarians, authors, and publishers. Hopefully we'll get more useful insider information about the progress of the project, hints about new ways of searching millions of books, and other helpful tips for scholars in the near future. As I recently wrote in an article in D-Lib Magazine, Google's project has some serious—perhaps fatal—flaws for those in the digital humanities (not so for the competing, but much smaller, Open Content Alliance). In particular, it would be nice to have more open access to the text (rather than mere page images) of pre-1923 books (i.e., those that are out of copyright). Of course, I'm a historian of the Victorian era who wants to scan thousands of nineteenth-century books using my own digital tools, not a giant company that may want to protect its very expensive investment in digitizing whole libraries... [Read on...]


The Single Box Humanities Search
Posted to Google and the World of Search on 17 April 2006, 11:42 AM EDT

I recently polled my graduate students to see where they turn to begin research for a paper. I suppose this shouldn't come as a surprise: the number one answer—by far—was Google. Some might say they're lazy or misdirected, but the allure of that single box—and how well it works for most tasks—is incredibly strong. Try getting students to go to five or six different search engines for gated online databases such as ProQuest Academic and JSTOR—all of which have different search options and produce a complex array of results compared to Google. I was thinking about this recently as I tested the brand new scholarly search engine from Microsoft, Windows Live Academic. Windows Live Academic is a direct competitor to Google Scholar, which has been in business now for over a year but is still in "beta" (like most Google products). Both are trying to provide that much-desired single box for academic researchers. And while those in the sciences may eventually be happy with this new option from Microsoft (though it's currently much rougher than Google's beta, as you'll see), like Google Scholar, Windows Live Academic is a big disappointment for students, teachers, and professors in the humanities. I suspect there are three main reasons for this lack of a high-quality single box humanities search... [Read on...]


Google Adds Topic Clusters to Search Results
Posted to Google and the World of Search on 21 March 2006, 9:47 AM EST

Google has been very conservative about changing their search results page. Indeed, the design of the page and the information presented has changed little since the search engine's public introduction in 1998. Innovations have literally been marginal: Google has added helpful spelling corrections ("Did you mean...?"), related search terms, and news items near the top of the page, and of course the ubiquitous text ads to the right. But the primary search results block has remained fairly untouched. Competitors have come and gone (mostly the latter), promoting new—and they say better—ways of browsing masses of information. But Google's clean, relevant list has brushed off these upstarts. So it surprised me when I was doing some fact checking on a book I'm finishing to see the following search results page:.. [Read on...]


Impact of Field v. Google on the Google Library Project
Posted to Google and the World of Search on 9 February 2006, 11:53 AM EST

I've finally had a chance to read the federal district court ruling in Field v. Google, a case that has not been covered much (except in the technology press), but which has obvious and important implications for the upcoming battle over the legality of Google's library digitization project. The case involved a lawyer who dabbles in online poetry, and who was annoyed that Google's spider cached a version of his copyrighted ode to delicious tea ("Many of us must have it iced, some of us take it hot and combined with milk, and others are not satisfied unless they know that only the rarest of spices and ingredients are contained therein..."). Field sued Google for copyright infringement; Google argued fair use. Field lost the case, with most of his points rejected by the court. The Electronic Frontier Foundation has hailed Google's victory as a significant one, and indeed there are some very good aspects of the ruling for the book copying case. But there also seem to be some major differences between Google's wholesale copying of websites and its wholesale copying of books that the court implicitly recognized. The following seem to be the advantages and disadvantages of this ruling for Google, the University of Michigan, and others who wish to see the library project reach completion... [Read on...]


Google, the Khmer Rouge, and the Public Good
Posted to Google and the World of Search on 7 February 2006, 11:07 AM EST

Like Daniel into the lion's den, Mary Sue Coleman, the President of the University of Michigan, yesterday went in front of the Association of American Publishers to defend her institution's participation in Google's massive book digitization project. Her speech, "Google, the Khmer Rouge and the Public Good," is an impassioned defense of the project, if a bit pithy at certain points. It's worth reading in its entirety, but here are some highlights with commentary... [Read on...]


Wikipedia vs. Encyclopaedia Britannica Keyword Shootout Results
Posted to Google and the World of Search on 6 February 2006, 2:18 PM EST

In my post "Wikipedia vs. Encyclopaedia Britannica for Digital Research", I asked you to compare two lists of significant keywords and phrases, derived from matching articles on George H. W. Bush in Wikipedia and the Encyclopaedia Britannica. Which one is a better keyword profile—a data mining list that could be used to find other documents on the first President Bush in a sea of documents—and which list do you think was derived from Wikipedia? The people have spoken and it's time to open the envelope... [Read on...]


Wikipedia vs. Encyclopaedia Britannica for Digital Research
Posted to Google and the World of Search on 30 January 2006, 12:07 PM EST

In a prior post I argued that the recent coverage of Wikipedia has focused too much on one aspect of the online reference source's openness—the ability of anyone to edit any article—and not enough on another aspect of Wikipedia's openness—the ability of anyone to download or copy the entire contents of its database and use it in virtually any way they want (with some commercial exceptions). I speculated that, as I discovered in my data-mining work with H-Bot, which uses Wikipedia in its algorithms, having an open and free resource such as this could be very important for future digital research—e.g., finding all of the documents about the first President Bush in a giant, untagged corpus on the American presidency. For a piece I'm writing for D-Lib Magazine, I decided to test this theory by pulling out significant keywords and phrases from matching articles in Wikipedia and the Encyclopaedia Britannica on George H. W. Bush to see if one was better than the other for this purpose. Which resource is better? Here are the unedited term lists, derived by running plain text versions of each article through Yahoo's Term Extraction web service. Vote on which one you think is a better profile, and I'll reveal which list belongs to which reference work later this week... [Read on...]
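To make the idea of a "keyword profile" concrete, here is a minimal sketch in Python of how such a term list might be used to pull matching documents out of an untagged corpus. It is an illustration under stated assumptions, not the method used by H-Bot or by Yahoo's Term Extraction service: the function names, the simple substring matching, and the 0.5 threshold are all hypothetical.

```python
def profile_score(text: str, profile: list[str]) -> float:
    """Fraction of profile terms that appear in the text (case-insensitive).

    Plain substring matching is an assumption made for brevity; a real
    system would likely tokenize and weight the terms.
    """
    if not profile:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for term in profile if term.lower() in lowered)
    return hits / len(profile)


def find_matches(corpus: dict[str, str], profile: list[str], threshold: float = 0.5):
    """Return the ids of documents scoring at or above the threshold, best first.

    corpus maps a document id to its plain text; the threshold is an
    arbitrary illustrative value.
    """
    scored = {doc_id: profile_score(text, profile) for doc_id, text in corpus.items()}
    matches = [doc_id for doc_id, score in scored.items() if score >= threshold]
    return sorted(matches, key=lambda doc_id: scored[doc_id], reverse=True)
```

Under this framing, a better profile is one whose terms turn up in documents about the subject and rarely anywhere else.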


How Much Google Knows About You
Posted to Google and the World of Search on 26 January 2006, 8:51 PM EST

As the U.S. Justice Department put pressure on Google this week to hand over their search records in a questionable pursuit of evidence for an overturned pornography law, I wondered: How much information does Google really know about us? Strangely, at nearly the same time an email arrived from Google (one of the Google Friends Newsletters) telling me that they had just launched Google Personal Search Trends. Someone in the legal department must not have vetted that email: Google Personal Search Trends reveals exactly how much they know about you. So, how much?.. [Read on...]


10 Most Popular History Syllabi
Posted to Google and the World of Search on 11 January 2006, 7:38 PM EST

My Syllabus Finder search engine has been in use for three years now, and I thought it would be interesting to look back at the nearly half-million searches and 640,000 syllabi it has handled to see which syllabi have been the most popular. The following list was compiled by running a series of calculations to determine the number of times Syllabus Finder users glanced at a syllabus (had it turn up in a search), the number of times they read a syllabus (actually went from the Syllabus Finder website to the website of the syllabus to do further reading), and the overall "attractiveness" of a syllabus (defined as the ratio of full reads to mere glances). Here are the most popular history syllabi on the web... [Read on...]


The Wikipedia Story That's Being Missed
Posted to Google and the World of Search on 20 December 2005, 2:08 PM EST

With all of the hoopla over Wikipedia in recent weeks (covered in two prior posts), most of the mainstream as well as tech media coverage has focused on the openness of the democratic online encyclopedia. Depending on where you stand, this openness creates either a Wild West of publishing, where anything goes and facts are always changeable, or an innovative mode of mostly anonymous collaboration that has managed to construct in just a few years an information resource that is enormous, often surprisingly good, and frequently referenced. But I believe there is another story about Wikipedia that is being missed, a story unrelated to its (perhaps dubious) openness. This story is about Wikipedia being free, in the sense of the open source movement: the fact that anyone can download the entirety of Wikipedia and use it and manipulate it as they wish. And this more hidden story begins when you ask, Why would Google and Yahoo be so interested in supporting Wikipedia?.. [Read on...]


Alexa Web Search Platform Debuts
Posted to Google and the World of Search on 14 December 2005, 4:01 PM EST

I'm currently working on an article for D-Lib Magazine explaining in greater depth how some of my tools that use search engine APIs work (such as the Syllabus Finder and H-Bot). These APIs, such as the services from Google and Yahoo, allow somewhat more direct access to mammoth web databases than you can get through these companies' more public web interfaces. I thought it would be helpful for the article to discuss some of the advantages and drawbacks of these services, and was just outlining one of my major disappointments with their programming interfaces (namely, that you can't run sophisticated text analysis on their servers, but have to do post-processing on your own server once you get a set of results back) when it was announced that Alexa had released its Web Search Platform. The AWSP allows you to do just what I've been wanting to do on an extremely large (4 billion web page) corpus: scan through it in the same way that employees at Yahoo and Google can do, using advanced algorithms and manipulating as large a set of results as you can handle, rather than mere dozens of relevant pages. Here's what's notable about AWSP for researchers and digital humanists... [Read on...]


Reliability of Information on the Web
Posted to Google and the World of Search on 5 December 2005, 10:51 AM EST

Given the current obsession with the reliability (or more often in media coverage, the unreliability) of information on the web—the New York Times weighed in on the matter yesterday, and USA Today carried a scathing op-ed last week—I feel lucky that an article Roy Rosenzweig and I wrote entitled "Web of Lies? Historical Information on the Internet" happens to appear today in First Monday. If you're interested in the subject, it's probably best to read the full article, but I'll provide a quick summary of our argument here... [Read on...]


Clifford Lynch and Jonathan Band on Google Book Search
Posted to Google and the World of Search on 28 November 2005, 10:38 PM EST

The November 2005 Washington DC Area Forum on Technology and the Humanities focused on "Massive Digitization Programs and Their Long-Term Implications: Google Print, the Open Content Alliance, and Related Developments." The two speakers at the forum, Clifford Lynch and Jonathan Band, are among the most intelligent and thought-provoking commentators on the significance of Google's Book Search project (formerly known as Google Print, with the Google Print Library Project being the company's attempt to digitize millions of books at the University of Michigan, Stanford, Harvard, Oxford, and the New York Public Library). These are my notes from the forum, highlighting not the basics of the project, which have been covered well in the mainstream media, but angles and points that may interest the readers of this blog... [Read on...]