Category: Spam

A reCAPTCHA Dilemma?

Here’s a possible conundrum worthy of the New York Times’s ethicist, Randy Cohen (no relation to yours truly). I have been a major proponent of reCAPTCHA, the red and yellow box at the bottom of my blog posts that uses words from books scanned by the Internet Archive/Open Content Alliance as a system to prevent comment spam. As visitors decipher the words in that box to add a comment, they also help to turn old texts into accurate, useful transcriptions. My glee about killing two birds with one stone has soured a bit after discovering something unsettling: I still get comment spam on my blog, and a lot of it–thousands and thousands of bogus comments.

My investigation of these comments–checking IP addresses, looking at patterns of posting and the links therein, and reading other discussions of how solid reCAPTCHA’s technology is (e.g., it doesn’t seem susceptible to a “relay attack,” where a puzzle is redirected by the spammer to an unsuspecting person logging onto another site)–leads me to the depressing conclusion that these comments are not posted by bots or unwitting third parties. Rather, they are added by hand, one at a time, intentionally. Real human beings are figuring out the blurry words from those old books to insert vaguely plausible comments (“Nice post! Check out my site for more on the same topic.”).

I suppose it’s good news that the spammers are being used as human OCR. By my calculations they’ve decoded, word by word, about 50 pages of text on my blog alone. (Real commenters have transcribed about half a page.) But I suspect–and would be happy to be proven wrong in real comments, below–that many of the actual people solving the reCAPTCHA are being paid pennies an hour by spam overlords to boost the Google rankings of their clients by adding keyword-rich linked comments to sites with high PageRank.

So in a sense, reCAPTCHA leads to a kind of indirect outsourcing similar to sending a book to be “rekeyed” by low-paid, third-world typists.

October 8, 2007 6 Comments

When Machines Are the Audience

I recently received an email from someone at the Woodrow Wilson Center that began in the following way: “Dear Sir/Madam: I was wondering if you might share the following fellowship opportunity with the members of your list…The Africa Program is pleased to announce that it is now accepting applications…” The email was, of course, tagged as spam by my email software, since it looked suspiciously like what the U.S. Secret Service calls a 419 fraud scheme, or a scam where someone (generally from Africa) asks you to send them your bank account information so they can smuggle cash out of their country (the transfer then occurs in the opposite direction, in case you were wondering). Checking the email against a statistical list of high-likelihood spam triggers identified the repeated use of words such as “application,” “generous,” “Africa,” and “award,” as well as the phrase “submitted electronically” and the opening “Dear Sir/Madam.”

The email piqued my curiosity because over the past year I’ve started altering some of my email writing to avoid precisely this problem of a “false positive” spam label, e.g., never sending just an attachment with no text (a classic spam trigger) and avoiding the use of phrases such as “Hey, you’ve got to look at this.” In other words, I’ve semi-consciously started writing for a new audience: machines. One of the central theories of humanities disciplines such as literature and history is that our subjects write for an audience (or audiences). What happens when machines are part of this audience?
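For the curious, here is a rough sketch in Python of how a trigger list like that might score a message. The words and phrases are the ones flagged in the Wilson Center email; the weights and the cutoff are invented purely for illustration and are not taken from my email software or any real spam filter, which would rely on far more elaborate statistics.

```python
# Hypothetical weights: the triggers are the ones named in the post,
# but the numbers are made up for illustration only.
TRIGGER_WEIGHTS = {
    "dear sir/madam": 2.0,
    "submitted electronically": 1.5,
    "application": 1.0,
    "generous": 1.0,
    "africa": 1.0,
    "award": 1.0,
}

# Assumed cutoff; a real filter would tune this against known mail.
SPAM_THRESHOLD = 4.0


def spam_score(text):
    """Add up weight * occurrences for every trigger found in the text."""
    lowered = text.lower()
    score = 0.0
    for trigger, weight in TRIGGER_WEIGHTS.items():
        # Plain substring counting, so "application" also matches "applications".
        score += weight * lowered.count(trigger)
    return score


def looks_like_spam(text):
    """Flag a message whose score reaches the assumed threshold."""
    return spam_score(text) >= SPAM_THRESHOLD


opening = (
    "Dear Sir/Madam: The Africa Program is pleased to announce "
    "that it is now accepting applications"
)
print(spam_score(opening), looks_like_spam(opening))
```

Even this toy version shows the problem: an entirely legitimate fellowship announcement piles up points simply by using the vocabulary of its genre.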

As the Woodrow Wilson Center email shows, the fact that digital text is machine readable suddenly makes the use of specific words problematic, because keyword searches can uncover these words (and perhaps act on them) much more easily than anyone could in a world of paper. It would be easy to find, for instance, all of the emails about Monica Lewinsky in the 40 million Clinton White House emails saved by the National Archives because “Lewinsky” is such an unusual word. Flipping that logic around, if I were currently involved in a White House scandal, I would studiously avoid the use of any identifying keywords (e.g., “Abramoff”) in my email correspondence.

In other cases, this keyword visibility is desirable. For instance, if I were a writer today thinking about my Word files, I would consider including or excluding certain words from each file for future research (either by myself or by others). Indeed, the “smart folder” technology in Apple’s Spotlight search or the upcoming Windows Vista search can automatically group documents based on the presence of a keyword or set of keywords. When people ask me how they can create a virtual network of websites on a historical topic, I often respond by saying that they could include at the bottom of each web page in the network a unique invented string of characters (e.g., “medievalhistorynetwork”). After Google indexes all of the web pages with this string, you could easily create a specialized search engine that scans only these particular sites.
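The marker-string idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how Google or any custom search service works internally: it assumes you already have the text of each page in hand, and the URLs below are hypothetical placeholders.

```python
# The invented marker from the example above.
NETWORK_MARKER = "medievalhistorynetwork"


def pages_in_network(pages, marker=NETWORK_MARKER):
    """Return the URLs (sorted) whose page text contains the marker string.

    `pages` maps a URL to the text of that page. Matching is
    case-insensitive, since a marker typed into a footer may be capitalized.
    """
    needle = marker.lower()
    return sorted(url for url, text in pages.items() if needle in text.lower())


# Hypothetical pages, standing in for whatever a crawler or index returns.
sample_pages = {
    "http://example.org/manuscripts": "Notes on manuscripts. medievalhistorynetwork",
    "http://example.org/recipes": "A page about soup.",
    "http://example.net/crusades": "Sources on the Crusades. MedievalHistoryNetwork",
}

print(pages_in_network(sample_pages))
```

The string works as a marker only because it is invented: an ordinary phrase such as “medieval history” would sweep in every unrelated page that happens to use it.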

“Machine audience consciousness” has probably already infected many other realms of our writing. Have some other examples? Let me know and I’ll post them here.

March 2, 2006 2 Comments