Details: Jennifer Tam, Jiri Simsa, Sean Hyde, and Luis von Ahn. Breaking Audio CAPTCHAs. In Advances in Neural Information Processing Systems (NIPS).
First off, the very idea of an audio captcha intimidates me when I imagine the whole process an automated program has to go through to decipher one. When I chanced upon this paper, I was sure I wouldn’t be able to make head or tail of it, so I did not touch it for a month. It sat open in my browser because I was reluctant to close it. Today I finally read it, to some extent. I can now just about tell the head from the tail, but I still don’t know anything in depth, since all the speech processing stuff went over my head. Still, here is my attempt to summarize what I understood.
Audio captchas ask the user to listen to a clip containing words to be identified, mixed with enough noise to make the job hard for a machine. As I understand it, the target users are mostly the visually impaired, since they can’t use image-based captchas. This paper analyzed audio captchas from three sources (Google, Digg and reCAPTCHA) and estimated which of them is the most robust, using different machine learning algorithms (AdaBoost, SVM and k-Nearest Neighbour). Along the way, the authors also learned about the strengths and weaknesses of the algorithms themselves.
So, basically, the focus of this paper was on understanding how much a machine can learn in its attempt to break audio captchas. As with image captchas, the experiment was done in two stages: speech segmentation and recognition. Machine learning was applied to the captcha segments to perform automatic speech recognition (ASR). The features were generated using popular speech feature extraction techniques: MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction) and RASTA-PLP (Relative Spectral Transform PLP). These names were scary enough that I did not go beyond them to find out more details 🙂
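To make the segmentation stage a little less scary for myself, here is a tiny Python sketch of one way to picture it. This is my own toy illustration, not the paper's method: I am assuming that the spoken parts are simply the loudest fixed-size windows of the clip, and I use mean squared amplitude as a crude stand-in for real features like MFCC or PLP. The function names and the toy signal are all made up for the example.

```python
def split_into_windows(samples, window_size):
    """Cut a list of audio samples into consecutive fixed-size windows.

    A trailing partial window is dropped so every window has the same length.
    """
    count = len(samples) // window_size
    return [samples[i * window_size:(i + 1) * window_size] for i in range(count)]


def window_energy(window):
    """Mean squared amplitude of one window (a crude loudness measure)."""
    return sum(s * s for s in window) / len(window)


def loudest_windows(samples, window_size, how_many):
    """Return the start indices of the `how_many` loudest windows, in time order."""
    windows = split_into_windows(samples, window_size)
    ranked = sorted(
        range(len(windows)),
        key=lambda i: window_energy(windows[i]),
        reverse=True,
    )
    return sorted(i * window_size for i in ranked[:how_many])


# Toy signal (assumption): quiet background with two loud bursts.
quiet = [0.01, -0.01] * 4  # 8 low-amplitude samples
loud = [0.9, -0.9] * 4     # 8 high-amplitude samples
signal = quiet + loud + quiet + loud + quiet

starts = loudest_windows(signal, window_size=8, how_many=2)
print(starts)
```

A real captcha would make this much harder, of course, since the background noise is chosen to be about as loud as the speech.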
Oh my God, creating the training data for these machine learning algorithms was a big exercise in itself. It got me wondering about all those speech processing experiments that involve huge amounts of training data. Patience, man! Patience! In the end, all their speech samples are transformed into a set of fixed-size segments, each labelled as noise, digit or letter. These were used to train the classifiers, one type of captcha at a time.
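The labelled-segments idea can be pictured with a small k-Nearest Neighbour classifier, the simplest of the three algorithms mentioned above. Again, this is only a hedged sketch under my own assumptions: the two-number feature vectors are invented for illustration (real ones would come from MFCC, PLP or RASTA-PLP), and the helper names are hypothetical, not from the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training segments.

    `training` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up feature vectors (assumption): each segment is reduced to two numbers.
training = [
    ([0.1, 0.1], "noise"), ([0.0, 0.2], "noise"), ([0.2, 0.0], "noise"),
    ([0.9, 0.1], "digit"), ([1.0, 0.0], "digit"), ([0.8, 0.2], "digit"),
    ([0.1, 0.9], "letter"), ([0.0, 1.0], "letter"), ([0.2, 0.8], "letter"),
]

print(knn_predict(training, [0.85, 0.1]))
print(knn_predict(training, [0.05, 0.95]))
```

Seeing it this way also makes conclusion 3 below feel intuitive to me: every new word in the vocabulary is another label that needs its own pile of labelled segments.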
So, I’ve blah-ed enough, and I’ll jump straight to the paper’s conclusions:
1. An audio captcha that consists of a finite vocabulary and background noise should have multiple speakers and noise similar to the speakers.
2. Captchas that have longer solutions and multiple speakers tend to be difficult to solve.
3. Having a large vocabulary will make it difficult to collect enough training data for those who aim to break these captchas.
4. To develop an audio captcha with an improved human pass rate, the authors plan to take advantage of the human mind’s ability to understand distorted audio through context clues.
And finally, it made an interesting read. Some cool work. 🙂
The paper can be read here.