
Building a web-scraper in Postgres - The Hidden Blog

2025-01-16 21:00:03

For a small talk I wanted to demonstrate that you can pack more business logic into Postgres than some people would guess. It might not be a good idea in most cases, but it's certainly fun.

The most important part is the Postgres extension pgsql-http, which allows us to make HTTP calls directly from within Postgres. It is not included by default, so we'll have to build a custom Docker image with that dependency installed. A simple Dockerfile like the following will do.
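Here is a sketch of such a Dockerfile. It assumes Postgres 16 and builds pgsql-http from its source repository; adjust the Postgres version (and the matching `postgresql-server-dev-*` package) to whatever image you use.

```dockerfile
# Sketch: build pgsql-http from source on top of the official image.
FROM postgres:16

RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      build-essential postgresql-server-dev-16 \
      libcurl4-openssl-dev git ca-certificates \
 && git clone https://github.com/pramsey/pgsql-http.git /tmp/pgsql-http \
 && make -C /tmp/pgsql-http install \
 && apt-get purge -y build-essential git \
 && rm -rf /var/lib/apt/lists/* /tmp/pgsql-http
```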

Once the image is running you have to load the extension. You can check the pg_extension table to verify that it was installed successfully.
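In SQL that looks like this (the extension installs under the name `http`):

```sql
CREATE EXTENSION IF NOT EXISTS http;

-- Confirm the extension is registered.
SELECT extname, extversion FROM pg_extension WHERE extname = 'http';
```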

First we need a tasks table: a list of tasks that the scraper should work on. Each task has a status such as "available", "in_flight", "finished" or "failed".
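A minimal sketch of such a table; the column names besides status are assumptions:

```sql
CREATE TABLE tasks (
    id         bigserial PRIMARY KEY,
    url        text NOT NULL,               -- the page to fetch (assumed column)
    status     text NOT NULL DEFAULT 'available'
        CHECK (status IN ('available', 'in_flight', 'finished', 'failed')),
    result     text,                        -- raw response body (assumed column)
    created_at timestamptz NOT NULL DEFAULT now()
);
```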

This alone wouldn't do anything, so we need a trigger that acts on every new insert into the task table. If a task with the status "available" is inserted, we want to process it. This is the main part of the scraper where we fetch the results from the web, check the status code, parse the response and spawn additional tasks if needed.
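A hedged sketch of what that trigger could look like, assuming a tasks table with url, status and result columns (the names are assumptions) and pgsql-http's `http_get` function, which returns a row containing the status code and body:

```sql
CREATE OR REPLACE FUNCTION process_task() RETURNS trigger AS $$
DECLARE
    res http_response;
BEGIN
    -- Mark the task as in flight, then fetch the URL from within Postgres.
    UPDATE tasks SET status = 'in_flight' WHERE id = NEW.id;
    res := http_get(NEW.url);

    IF res.status = 200 THEN
        UPDATE tasks SET status = 'finished', result = res.content
        WHERE id = NEW.id;
        -- Parse res.content here and INSERT further 'available' tasks;
        -- each insert fires this trigger again, which drives the scraper.
    ELSE
        UPDATE tasks SET status = 'failed' WHERE id = NEW.id;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER task_inserted
    AFTER INSERT ON tasks
    FOR EACH ROW
    WHEN (NEW.status = 'available')
    EXECUTE FUNCTION process_task();
```

Note that doing HTTP calls inside a trigger blocks the inserting transaction until the request finishes, which is part of why this is fun rather than advisable.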
