Breaking the 4Chan CAPTCHA | nullpt.rs

Submitted by Style Pass, 2024-11-29 21:00:04

I took on this project as a learning experience, to deepen my knowledge of machine learning in general and TensorFlow in particular. By the end, I wanted a trained machine learning model that runs in the browser and reliably solves the 4Chan CAPTCHA (at least 80% accuracy, with >90% preferred). These goals were achieved, so let's talk about how I got there!

I've heard many times that the hardest part of any machine learning problem is getting the data to train your model. That was certainly true here, for several reasons. There are two parts to this problem: getting the CAPTCHAs, and getting solutions to those CAPTCHAs.

After looking at the HTTP requests in the browser console when requesting a new CAPTCHA, I found that it makes a request to https://sys.4chan.org/captcha?framed=1&board={board}, where {board} is the name of the board we're trying to post on. The response is an HTML document that contains a script tag with a window.parent.postMessage() call with some JSON. On a hunch, I tried to remove the framed=1 parameter, and found that this causes it to just spit out the raw JSON. That should be easier to work with. The JSON looks like this:

Some of these keys are pretty obvious. ttl and cd are the least obvious to me. I know from experience that the 4Chan CAPTCHA only displays for about 2 minutes before it expires and you have to request a new one, so that's what ttl must be. But what about cd? Let's make another request, shortly after the first one:
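Making two requests a short time apart and comparing ttl and cd could be scripted roughly as follows. This is a minimal sketch, not the code used in the project: the function names, the User-Agent header, and the two-second delay are my own assumptions, and the server may well require cookies or other headers that this sketch does not send. Only the endpoint, the board and framed parameters, and the ttl and cd keys come from the text above.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the requests seen in the browser console.
CAPTCHA_ENDPOINT = "https://sys.4chan.org/captcha"


def captcha_url(board: str, framed: bool = False) -> str:
    """Build the CAPTCHA request URL for a board.

    Leaving out framed=1 is what makes the server return raw JSON
    instead of an HTML page wrapping a postMessage() call.
    """
    params = {}
    if framed:
        params["framed"] = "1"
    params["board"] = board
    return f"{CAPTCHA_ENDPOINT}?{urlencode(params)}"


def fetch_captcha(board: str, timeout: float = 10.0) -> dict:
    """Request a CAPTCHA without framed=1 and parse the JSON body."""
    request = urllib.request.Request(
        captcha_url(board),
        # Assumption: a browser-like User-Agent; the real requirements
        # of the server are not known here.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def compare_requests(board: str, delay_seconds: float = 2.0) -> None:
    """Fetch twice, a short time apart, and print ttl and cd side by side."""
    first = fetch_captcha(board)
    time.sleep(delay_seconds)
    second = fetch_captcha(board)
    for key in ("ttl", "cd"):
        # .get() avoids a KeyError if a key is absent from a response.
        print(key, first.get(key), second.get(key))
```

Calling compare_requests with a board name (for example "g", chosen here only for illustration) prints each key with its value from the first and second response, which makes it easy to see which values change between requests.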
