Automated Scraping with GPT-4, Part 1

submited by
Style Pass
2023-03-19 05:30:04

Like most people I know, I’ve been watching the pace of improvements to LLMs like ChatGPT and GPT-4 with a mix of awe and trepidation. I’ve been wanting a small project to get to explore the APIs better, and recently decided I’d try to see if I could use it to automate web scraping.

For context, I’ve written a lot of web scrapers. For the better part of thirteen years, I ran Open States, a project that scraped state legislative websites to make them more accessible to the public. The biggest challenge in running a project like that is keeping up with the constant changes to the websites you’re scraping.

Writing web scrapers is a translation task, you take a piece of HTML and transform it to a structured data format. From what I understand, LLMs should be good at this task. They also seem to parse HTML and JSON well enough that existing models should be useful to generate a scraper.

That same schema could then be applied to pages from other states like https://www.ncleg.gov/Members/Biography/H/339 with no more than a single customization: a CSS selector to limit what portion of the HTML was sent to the API.

Leave a Comment