One common use of AI models is parsing raw data into a structured format. In web scraping, this means we can have a model extract the data we need without writing a parser by hand. This is especially useful when a website changes its layout frequently, since it reduces how often the parsing logic has to be updated.
We're using https://books.toscrape.com/ as the target website. The goal is to get the book data as nicely structured JSON. We'll skip the fetching part and go straight to the parsing stage, so I'm saving the raw HTML (the entire body) in a static .txt file.
During development, I'm also creating a simplified version of the HTML: a second .txt file that contains only the part of the page we're targeting, not the whole body tag content. This keeps us from burning through too many tokens while we're still experimenting.
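If you'd rather not trim the HTML by hand, the simplified file can be generated with a small script. The following is only a minimal sketch, not the method used in this post. It assumes each book on the page is wrapped in an `<article class="product_pod">` element (check the saved HTML to confirm), and the function names and the four-characters-per-token rule of thumb are my own illustrative choices.

```python
import re

# Assumption: every book block is an <article> whose class list contains
# "product_pod". Adjust the pattern if the saved HTML looks different.
ARTICLE_RE = re.compile(
    r'<article\b[^>]*class="[^"]*\bproduct_pod\b[^"]*"[^>]*>.*?</article>',
    re.DOTALL | re.IGNORECASE,
)


def simplify_html(raw_html: str, max_items: int = 3) -> str:
    """Keep only the first few book blocks and collapse runs of whitespace."""
    blocks = ARTICLE_RE.findall(raw_html)[:max_items]
    return "\n".join(re.sub(r"\s+", " ", block).strip() for block in blocks)


def rough_token_estimate(text: str) -> int:
    """Very rough heuristic (about four characters per token), not a real tokenizer."""
    return len(text) // 4


if __name__ == "__main__":
    # Tiny inline sample standing in for the saved .txt file.
    sample = (
        '<body><nav>menu</nav>'
        '<article class="product_pod">\n  <h3>Book A</h3>\n</article>'
        '<article class="product_pod"><h3>Book B</h3></article>'
        '</body>'
    )
    small = simplify_html(sample, max_items=1)
    print(small)
    print(rough_token_estimate(sample), "->", rough_token_estimate(small))
```

A regular expression is enough here because the book blocks are not nested inside each other. For messier pages, a proper HTML parser would be the safer choice.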
Before reaching for a regular prompt, we should check whether the AI model offers a dedicated feature for turning a string into structured data, such as function calling or a structured-output mode. If it doesn't, a simple prompt can still do the job.
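For the plain-prompt fallback, the work splits into two parts: building a prompt that asks for JSON only, and checking that the reply really is the JSON we asked for. Here's a minimal sketch of both, with the model call itself left out because it depends on the provider you use. The field names (`title`, `price`, `availability`) and the function names are assumptions for illustration, not something fixed by the target site or by any particular API.

```python
import json

# Hypothetical fields we want for each book; change them to suit your needs.
FIELDS = ("title", "price", "availability")

# Built this way to avoid a literal Markdown fence inside the example.
FENCE = "`" * 3


def build_prompt(html: str) -> str:
    """Build a plain-text prompt asking the model to reply with JSON only."""
    return (
        "Extract every book from the HTML below.\n"
        "Respond with only a JSON array of objects, each with exactly "
        "these keys: " + ", ".join(FIELDS) + ".\n\n"
        "HTML:\n" + html
    )


def parse_reply(reply: str) -> list:
    """Parse the model's reply and check that it has the expected shape."""
    text = reply.strip()
    # Models sometimes wrap JSON in a Markdown code fence; remove it if present.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]
        text = "\n".join(lines)

    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of books")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("expected each book to be a JSON object")
        missing = [key for key in FIELDS if key not in item]
        if missing:
            raise ValueError(f"book is missing keys: {missing}")
    return data
```

Send the string from `build_prompt` to whichever model you're using, then pass its text reply to `parse_reply`. Validating the reply matters more with a plain prompt than with a dedicated structured-output feature, because nothing forces the model to stick to the requested format.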