reCAPTCHA

Everyone knows what a captcha is by now, but in case you are blissfully unaware, let me explain. You are perusing an article about how it’s perfectly acceptable for grown men to read all the Harry Potter novels when you feel the need to set the record straight. You click on ‘Post Comment’, ready to troll the author and otherwise tell him what a big nerd he is, when something like this appears on your screen.

So now you need to stop what you are doing and decode some text before you can proceed to explain why everybody who is commenting besides you is a moron.

We all need to go through this ritual because the internet is a horrible place where everybody is trying to sell you budget viagra. Spam bots scour the web looking for places to comment and leave messages like “BUY VIAGGRA FOR DICK LONG!! LADIES SAY YES!!”. Since these spam bots aren’t actual people, they can post millions of comments around the web in the time it would take us to string a proper sales pitch together. But the lack of a human brain is also the spam bot’s downfall.

You see, computers have trouble in certain areas that are second nature to humans. Spatial recognition, context, speaking: these are all challenging tasks that we take for granted. Optical Character Recognition, or OCR, allows computers to scan an image of text and convert it to actual text, in effect translating the image. Computers can actually do this fairly well, but problems arise when the font is not recognized, the letters are smudged, or the image quality is low. And that’s why captchas work: spam bots aren’t able to correctly decipher the fuzzy images of words, so they can’t get into the commenting sections of websites.

Enter reCAPTCHA, a clever new kind of captcha that wants to put the human brain to noble use. It is a Google project that aims to digitize years of printed books and New York Times newspapers and preserve them on the web. After the pages are scanned and run through OCR, reCAPTCHA takes the words the OCR has trouble with and shows them to human eyes to decipher. You enter the text and essentially transcribe a word that the computer couldn’t, so now it knows the correct answer.

This is all well and good, but what does this mean for security? If the computer doesn’t know the correct answer, then how does it know that you are a legitimate human? reCAPTCHA uses two words: one that it knows the answer to and one that it wants the answer to. If you submit the correct answer for the known word, it assumes your answer for the other is correct and validates you. I initially worried about there being an easy word and a difficult word. What is stopping a spam bot with OCR from bypassing the captcha? The OCR will correctly identify the easy word, then mess up on the difficult word, and reCAPTCHA will accept the mistake as a correct transcription as long as the easy word is a match! Let me reiterate. Unless the known word is one that OCR has been proven to fail on and that a human has already transcribed, the captcha is not an effective antispam measure. Google claims that both words used are unreadable by OCR, and that is probably true, but playing around with the web service turns up a whole lot of easy words that I have a hard time believing can’t be solved by bots.
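To make the two-word check concrete, here is a minimal sketch in Python. It is not Google’s actual code; the names `Challenge`, `verify`, and `unknown_votes` are made up for illustration, and the vote tally for the unknown word is an assumption about how guesses might be stored. It only models the behavior described above: the known word decides whether you pass, and whatever you type for the other word is taken on faith.

```python
from dataclasses import dataclass, field


@dataclass
class Challenge:
    """Hypothetical two-word challenge: one known word, one unknown word."""
    control_answer: str                                 # the word the service already knows
    unknown_votes: dict = field(default_factory=dict)   # guesses collected for the other word


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def verify(challenge: Challenge, control_guess: str, unknown_guess: str) -> bool:
    """Pass the user only if the known word matches.

    The guess for the unknown word is never checked (nothing exists to
    check it against); it is simply recorded as a vote.
    """
    if normalize(control_guess) != normalize(challenge.control_answer):
        return False
    key = normalize(unknown_guess)
    challenge.unknown_votes[key] = challenge.unknown_votes.get(key, 0) + 1
    return True


challenge = Challenge(control_answer="overlooks")
# Known word is right, so anything typed for the unknown word is accepted.
print(verify(challenge, "Overlooks", "total nonsense"))
print(challenge.unknown_votes)
```

In this sketch, a bot (or a bored human) that gets the known word right can feed any text it likes into the unknown slot, which is exactly the hole described above.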

Now, I’m not claiming to be smarter than Google and the entire reCAPTCHA project, so I will defer to their claims of security. It seems like a legit enough process, yet it is still a process that inherently allows us to game the system. Any time you see a reCAPTCHA-branded captcha, it’s time to start having fun.

This is an actual captcha from their website that was approved.

Oops. I just made the New York Times racist. You can play around with it yourself here.