A reCAPTCHA Dilemma?

Here’s a possible conundrum worthy of the New York Times’s ethicist, Randy Cohen (no relation to yours truly). I have been a major proponent of reCAPTCHA, the red and yellow box at the bottom of my blog posts that uses words from books scanned by the Internet Archive/Open Content Alliance as a system to prevent comment spam. At the same time visitors decipher the words in that box to add a comment, they help to turn old texts into accurate, useful transcriptions. My glee about killing two birds with one stone has soured a bit since I discovered something unsettling: I still get comment spam on my blog, and a lot of it–thousands and thousands of bogus comments.

My investigation of these comments–checking IP addresses, looking at patterns of posting and the links therein, and reading other discussions of how solid reCAPTCHA’s technology is (e.g., it doesn’t seem susceptible to a “relay attack,” where a puzzle is redirected by the spammer to an unsuspecting person logging onto another site)–leads me to the depressing conclusion that these comments are not done by bots or unwitting third parties. Rather, they are added by hand, one at a time, intentionally. Real human beings are figuring out the blurry words from those old books to insert vaguely plausible comments (“Nice post! Check out my site for more on the same topic.”).

I suppose it’s good news that the spammers are being used as human OCR. By my calculations they’ve decoded, word by word, about 50 pages of text on my blog alone. (Real commenters have transcribed about a half a page.) But I suspect–and would be happy to be proven wrong in real comments, below–that many of the actual people solving the reCAPTCHA are being paid pennies an hour by spam overlords to boost the Google rankings of their clients by adding keyword-rich linked comments to sites with high PageRank.

So in a sense, reCAPTCHA leads to a kind of indirect outsourcing similar to sending a book to be “rekeyed” by low-paid, third-world typists.

October 8, 2007 6 Comments

In Blogs, Google, Spam

Comments

Alexis says:

October 8, 2007 at 1:10 pm

Sounds like China’s World of Warcraft gold farmers?

Are you certain they are real people? It’d be interesting to mirror this site elsewhere, but using a different captcha technology. If the spam stopped, perhaps it really would indicate a flaw in recaptcha. I wonder if there are wide scale statistics on amount of spam received by users of different anti-spam techniques. Hmmm….

Dan Cohen says:

October 8, 2007 at 3:26 pm

Exactly like China’s WoW workers. I do think they are real people, judging by the input patterns of the spam, the fact that reCAPTCHA is very hard to crack in an automated way, and especially because the blog posts that are spammed are those with the highest PageRank and thus the most potential to pass along “Google juice” (i.e., not an even distribution, as you might imagine from a bot).

Jeanne says:

October 9, 2007 at 9:47 am

So… is that inherently a bad thing? With the advent of services like Amazon Mechanical Turk (http://www.mturk.com/mturk/welcome) – this sort of low pay per action computer work is only going to increase. I am sort of cheered by the thought that if they are paying folks to hand enter CAPTCHAs in general as an avenue to generate spam that some of the time they are hand transcribing books.

I definitely have been getting more hand entered comment spam (in addition to trackback spam that reCAPTCHA doesn’t prevent) – but for now it is still manageable.

I guess the next step is slightly more controlled communities – ones in which you must register in order to comment. There are definitely existing popular sites that moderate the first few comments posted by a new member and therefore discourage manual spam of the type you are describing – but then permit unmoderated comment posting after the user has proven they are a real person with a desire to contribute. That model sounds like a fairly sustainable one. My gut tells me that you would spend about the same time approving new posts by new ‘blog members’ as you used to spend yanking hand written spam posts – but perhaps it gives you the opportunity to build more of a community around your blog? Would some folks NOT comment if they had to register? Probably… but there is no perfect answer to all this. Of course this also begs the question if the hand-spammers wouldn’t just register along with ‘real’ people and see if they could sneak under the radar.

Ben Maurer says:

October 9, 2007 at 10:04 am

Hi,

I’m one of the engineers who works on reCAPTCHA. We’ve found that sometimes the way spam detection works in wordpress can be confusing.

Please see our FAQ on this matter:

reCAPTCHA WordPress Plugin

The reCAPTCHA WordPress plugin uses a CAPTCHA to prevent comment spam. Here is how to add reCAPTCHA to your WordPress blog:

1. Download the zip file.
2. Unzip the recaptcha folder into your WordPress wp-content/plugins directory.
3. Activate the plugin on the Options | Plugins Management page of your WordPress admin site. A web form will prompt you to enter a public and private API key. You can sign up for the keys at the reCAPTCHA site using the link provided, and then enter them in the text fields to activate the plugin.
4. That’s it! Your reCAPTCHA widget should now appear on the comments page.

FAQ
HELP, I’m still seeing comment spam

There are two common issues that make reCAPTCHA appear to be broken, but are actually not problems.

* Moderation emails: reCAPTCHA marks comments as spam, so if you get moderation emails when spam comments are sent, you will get moderation emails for all spam comments with reCAPTCHA. We highly recommend turning off moderation emails with reCAPTCHA.
* Trackbacks and Pingbacks: reCAPTCHA can’t do anything about pingbacks and trackbacks. You can disable pingbacks and trackbacks in Options | Discussion | Allow link notifications from other Weblogs (pingbacks and trackbacks).

We’ve looked at the logs for your site and only seen about 40 solutions from your blog (all correct). If you’re still having spam issues, please contact us at support@recaptcha.net.

Wally Grotophorst says:

October 9, 2007 at 12:38 pm

I too have reCAPTCHA installed…saw your post and went to my Akismet log and see that in the past 48 hours I have 63 comment spams that Akismet caught. The locations of the offending IP addresses are all over the place: Ashburn, VA, Littleton, CO, Netherlands, Czech Republic, etc.

I ran a test: did a couple of comments 1) bad info in reCAPTCHA box and 2) no info in reCAPTCHA box. Both comments were tagged as spam and my Akismet filter caught them. When I installed reCAPTCHA I thought it tossed comments if the images weren’t successfully “read” by the commenter. I was wrong about that; it apparently just tags them as spam.

Doesn’t this indicate that spammers aren’t necessarily interacting with reCAPTCHA at all? I guess I need to know if your blog software is picking up that the comments are spam or are they getting posted as viewable/legit comments?

Dan Cohen says:

October 9, 2007 at 1:16 pm

Thanks, Wally and Ben, for clearing this up. I did a few more tests myself, and now realize that the spammers aren’t actually solving the CAPTCHAs–unfortunately for the book transcription side of this equation. They are just inputting the spam into the comment text box and submitting it; it then gets automatically tagged as spam without pinging the reCAPTCHA servers. It seems to me that WordPress simply shouldn’t accept comments (as spam or not spam) in cases where the CAPTCHA isn’t solved. But I guess it’s OK to put these messages into the Akismet spam deletion box for automatic purging.

So, no ethical dilemma, though I was trending anyway toward Jeanne’s no-problem-here enjoyment of the spammers doing the OCR for us.
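
For readers who want the distinction spelled out, here is a minimal Python sketch of the two policies discussed above: tagging an unsolved-CAPTCHA comment as spam (what the plugin appears to do, per Wally's test) versus refusing it outright. This is not the plugin's actual code. The names `Comment`, `Outcome`, `captcha_solved`, and `handle_comment` are hypothetical, and the local string comparison is only a stand-in for the real check against reCAPTCHA's servers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PUBLISHED = "published"    # visible on the post
    SPAM_QUEUE = "spam_queue"  # stored, tagged as spam, purged later
    REJECTED = "rejected"      # never stored at all


@dataclass
class Comment:
    body: str
    captcha_answer: Optional[str]  # None when the CAPTCHA box was left empty


def captcha_solved(comment: Comment, expected: str) -> bool:
    """Hypothetical stand-in for verification; assumes a simple local comparison."""
    answer = comment.captcha_answer
    return answer is not None and answer.strip().lower() == expected.lower()


def handle_comment(comment: Comment, expected: str,
                   reject_unsolved: bool = False) -> Outcome:
    """Decide what happens to a submitted comment.

    reject_unsolved=False models the observed behavior (tag as spam);
    reject_unsolved=True models the stricter policy of not accepting it.
    """
    if captcha_solved(comment, expected):
        return Outcome.PUBLISHED
    return Outcome.REJECTED if reject_unsolved else Outcome.SPAM_QUEUE


if __name__ == "__main__":
    spam = Comment(body="Nice post! Check out my site.", captcha_answer=None)
    print(handle_comment(spam, expected="old books"))
    print(handle_comment(spam, expected="old books", reject_unsolved=True))
```

Under the first policy the blog owner still sees every bogus submission in the spam queue, which is what made it look as though the CAPTCHAs were being solved; under the second, those submissions would never be recorded.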
