It took me about a week of my Dashain vacation to make this simple CAPTCHA breaker (5 days, to be exact). I had been studying image processing in my CS undergrad for some time, and wanted to put my skills to use, although, I must admit, it's not as complicated or as full of image-processing bits as I'd have liked. I had only learned the basics when I started this. If I were to do it now, I would definitely do it differently.
I decoded a simple text-based CAPTCHA with an acceptable level of accuracy (acceptable to me, that is). I did not generate any CAPTCHAs for the purpose; instead, I grabbed them from a live website. A now-defunct WordPress plugin called Captcha on Login generated these CAPTCHAs, and the plugin is still used by several websites.
With the help of the selenium module, I put together a Python script to take a screenshot of the website whenever the CAPTCHA was shown, crop the image down to just the CAPTCHA, and save it to my local disk. In this way, I prepared a training and testing set for the neural network I was going to use for recognizing the letters.
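The grabbing step can be sketched roughly like this. The helper converts the element geometry Selenium reports into the box Pillow's `crop()` expects; the URL and CSS selector in the comments are placeholders, not the real site's:

```python
def crop_box(location, size):
    # Selenium reports an element's position and size as dicts
    # ({"x": ..., "y": ...} and {"width": ..., "height": ...});
    # Pillow's crop() wants a (left, top, right, bottom) tuple.
    return (location["x"], location["y"],
            location["x"] + size["width"],
            location["y"] + size["height"])

# Hypothetical usage with Selenium (the URL and selector are made up):
#   from io import BytesIO
#   from PIL import Image
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("http://example.com/wp-login.php")
#   elem = driver.find_element("css selector", "img#captcha")
#   shot = Image.open(BytesIO(driver.get_screenshot_as_png()))
#   shot.crop(crop_box(elem.location, elem.size)).save("captcha_0001.gif")
```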
The reason I call it a simple text-based CAPTCHA is that all the characters have a uniform color, they are not connected, and they use a constant font. The background color is also uniform and completely different from the color of the characters, and there are no distortions, rotations, etc. That means a lot of the hard work that goes into locating text and extracting characters from the background was unnecessary. I only needed to focus on segmentation, and that isn't difficult either, given that the characters are not connected and the background and text colors are clearly separable.
The first image is what the CAPTCHA looks like. It's easier to work with a grayscale image than with a colored one, so I converted it into a grayscale version, which you can see in the second image. Most of the work was accomplished using the Pillow imaging library in Python.
How did I do it? First, I converted the image to GIF so that the available colors were reduced to 256 (GIF supports 8 bits per pixel, which means a frame can use at most 256 colors). As you can see, the characters clearly have a color distinct from the background. So I built a color histogram of the converted image and looked for a moderately common color. In the CAPTCHA above, orange (the background) is the most common. Using the histogram, I tried to guess which values belonged to the characters' pixels (the red ones), since the characters in every CAPTCHA were red. Then I created a new image the same size as the original with a white background, went through the original looking for pixels of the color I had chosen, and marked the corresponding position in the new image as black. I repeated this until I had correctly isolated the red pixels; it turned out that any value below 52 gave me just the characters from the original image. And so I could successfully convert the original CAPTCHA, background and all, into a binarized image.
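As a sketch of that binarization step: the threshold of 52 is the one from the post, while the palette reduction and grayscale conversion below simply use Pillow's defaults, which may differ from the original code:

```python
from PIL import Image

def binarize(path, threshold=52):
    # Reduce to a 256-color palette (as a GIF would), then to grayscale.
    img = Image.open(path).convert("P").convert("L")
    # Start from an all-white image and copy over only the pixels
    # dark enough to belong to a character.
    out = Image.new("L", img.size, 255)
    for x in range(img.width):
        for y in range(img.height):
            if img.getpixel((x, y)) < threshold:
                out.putpixel((x, y), 0)
    return out
```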
The next and major step is segmentation. The neural network can only recognize one character at a time, which means I need to feed it one character's image at a time. The grayscale image we now have needs to be divided into a separate image for each character it contains, and that is done with the help of vertical slicing.
This is actually pretty easy in this case because the characters are not connected at all. If they were, I'd have to put more effort into untangling them. As it is, I can keep track of where each character starts and ends and just extract that part.
The way I did that is to iterate through the columns of the image, and within each column, through its rows. I check the value of the pixel at that row and column; if it is 255, I know it is background. If it is not 255, I know I'm inside a character, and I mark the current column as the starting location. I keep iterating until I find the next column in which every row has a pixel value of 255, meaning the column contains only white background, and I mark that column as the end location for that character.
In this way, I build an array with the starting and ending point of each character. The problem with this, however, is that sometimes there are gaps in the pixels, i.e. some characters might be missing a pixel or two, because when we binarized the image earlier, we only blackened pixels with a value under 52, so some pixels get missed. So I added a check to make sure that a new character really is a new character and not a gap between the pixels of the same character: if the difference between the end location and the start location is greater than 5, I took it as a new character; otherwise, I assumed it belonged to the same character. I then cropped the image using the starting and ending point of each character.
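A sketch of that column scan, including the width check: the 5-pixel threshold is the one from the post, but the code itself is my paraphrase of the described logic, not the original implementation:

```python
def find_spans(img, min_width=5):
    # Scan columns left to right over a binarized Pillow image ("L" mode).
    # A column with any non-white pixel is inside a character;
    # an all-white column is background.
    spans, start = [], None
    for x in range(img.width):
        has_ink = any(img.getpixel((x, y)) != 255 for y in range(img.height))
        if has_ink and start is None:
            start = x  # entering a character
        elif not has_ink and start is not None:
            if spans and x - start <= min_width:
                # Too narrow to be a new character: treat it as a stray
                # fragment of the previous one and extend that span.
                spans[-1] = (spans[-1][0], x)
            else:
                spans.append((start, x))
            start = None
    if start is not None:  # character touching the right edge
        spans.append((start, img.width))
    return spans
```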
This is what I would have done differently had I done it today. Instead of such a naive assumption about pixel distance, I would have used a Gaussian blur, morphological processing such as erosion and dilation, or line thickening to tackle this issue. Unfortunately, when I did this, I had yet to learn about these things. So one thing you can do to improve this approach, and maybe increase accuracy, is to try these out instead of my naive assumption.
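For instance, a morphological closing built from Pillow's rank filters would fill pixel-sized gaps in the strokes before segmentation. This is the alternative being suggested, not what the original code did:

```python
from PIL import ImageFilter

def close_gaps(binary_img, size=3):
    # For black characters on a white background, MinFilter dilates the
    # black strokes and MaxFilter erodes them back. Dilation followed by
    # erosion is a morphological closing: small holes inside the strokes
    # get filled while the overall character shape is preserved.
    return (binary_img
            .filter(ImageFilter.MinFilter(size))
            .filter(ImageFilter.MaxFilter(size)))
```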
Since the neural network expected a uniform input, I created 60×60 images with a white background and placed the cropped image of each character in the middle. So each cropped character was saved at 60×60, centered on a white background, and thus all of the characters were successfully segmented.
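That centering step is a small paste onto a fresh white canvas; a minimal sketch with Pillow:

```python
from PIL import Image

def center_on_canvas(char_img, size=60):
    # Paste the cropped character into the middle of a white square,
    # so every sample fed to the network has the same 60x60 shape.
    canvas = Image.new("L", (size, size), 255)
    w, h = char_img.size
    canvas.paste(char_img, ((size - w) // 2, (size - h) // 2))
    return canvas
```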
Now the only part left is to train the neural network. Each of the segmented characters was saved into its respective directory, identified by its label in serial order (0, 1, 2, …, 9 for digits; 10, 11, 12, …, 35 for lowercase letters; and 36, 37, 38, …, 61 for uppercase letters). Some of the digits and letters were not used in the CAPTCHA, so some of the folders turned out to be empty.
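That labeling scheme maps directly onto a 62-character string, so the directory number for any character is just its index:

```python
import string

# 0-9 for digits, 10-35 for lowercase letters, 36-61 for uppercase letters
LABELS = string.digits + string.ascii_lowercase + string.ascii_uppercase

def label_of(ch):
    # Directory/output-node number for a character.
    return LABELS.index(ch)
```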
The neural network is implemented the same way as it was before. Using 3600 input nodes, 2700 hidden nodes, 62 output nodes, and a learning rate of 0.005 gave me the best result. A single pass did not give a satisfactory result, so I trained the network for about 30 epochs. With a training set of 1350 samples and a test set of 50 samples, it was able to recognize individual characters with 90% accuracy. In the end, the CAPTCHA breaker was tested on 200 CAPTCHAs and it recognized almost 80% of them. That figure covers the whole pipeline: reading the image, converting it to grayscale, segmenting the characters, feeding each extracted character to the neural network, and concatenating the network's outputs. For comparison, I also tried running Tesseract directly on the CAPTCHAs, and it gave me 25% accuracy.
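The post refers to an earlier write-up for the network itself. A from-scratch single-hidden-layer network in that spirit, with the sizes and learning rate from the post but everything else my own sketch (sigmoid activations, simple per-sample weight updates), might look like:

```python
import numpy as np

class SimpleNN:
    # Sizes from the post: 3600 inputs (60x60 pixels), 2700 hidden, 62 outputs.
    def __init__(self, n_in=3600, n_hidden=2700, n_out=62, lr=0.005, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights, scaled by layer fan-in.
        self.w1 = rng.normal(0.0, n_in ** -0.5, (n_hidden, n_in))
        self.w2 = rng.normal(0.0, n_hidden ** -0.5, (n_out, n_hidden))
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def _forward(self, x):
        hidden = self._sigmoid(self.w1 @ x)
        return hidden, self._sigmoid(self.w2 @ hidden)

    def train(self, x, target):
        # One backpropagation step for a single (input, target) pair.
        hidden, out = self._forward(x)
        err_out = target - out
        err_hidden = self.w2.T @ err_out  # error attributed back to hidden layer
        self.w2 += self.lr * np.outer(err_out * out * (1 - out), hidden)
        self.w1 += self.lr * np.outer(err_hidden * hidden * (1 - hidden), x)

    def predict(self, x):
        # Index of the most activated output node (= character label).
        return int(np.argmax(self._forward(x)[1]))
```

One epoch is then a pass of `train()` over every (flattened 60×60 image, one-hot target) pair; the post repeats that for roughly 30 epochs.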
As I've mentioned before, it was a simple CAPTCHA, and I broke it with so little effort that it should be a lesson: don't roll your own CAPTCHA. Everyone should be using reCAPTCHA. Now go ahead, look for sites that have weak CAPTCHAs and try breaking them. Here's one for you to try your skills on.
All the code and the datasets are available in my GitHub repository. For your convenience:
- Code to grab CAPTCHA from the website
- Code to prepare the CSV dataset for the characters
- Code to train the Neural Network
- Code for the main program (i.e. read, segment, resize, feed into Neural Network, and concatenate output)
- Training set for the segmented characters: Actual Images | CSV File
- Testing set for the segmented characters: Actual Images | CSV File
- Testing set for the whole CAPTCHA: Actual Images