Emacs link scraping (2021 edition)

Submitted by Style Pass on 2021-05-31 09:00:10

A recent Hacker News post, Ask HN: Favorite Blogs by Individuals, led me to dust off my old but trusty command to extract comment links. I use it to dissect these wonderful references more effectively.

You see, I wrote this command back in 2015, so it's likely due for a revisit. The enlive package continues to do a fine job fetching, parsing, and querying HTML, so that part stays. Let's improve my code instead… we can shed a few redundant bits and maybe use newer libraries and features.

We start by writing a function that looks for a URL in the clipboard and subsequently fetches, parses, and extracts all links found in the target page.
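To make the parse-and-extract step concrete, here is a minimal sketch. It is not the command from this post: it swaps enlive for Emacs' built-in dom.el and libxml support so it runs without installing anything, and it parses an HTML string instead of fetching a page. The name my/extract-hrefs is made up for illustration.

```elisp
;; Sketch only: built-in dom.el + libxml stand in for enlive here,
;; and the HTML comes from a string rather than a fetched page.
(require 'dom)

(defun my/extract-hrefs (html)
  "Return the href of every <a> element in HTML, in document order.
Hypothetical helper; needs an Emacs built with libxml2 support."
  (let ((dom (with-temp-buffer
               (insert html)
               (libxml-parse-html-region (point-min) (point-max)))))
    ;; Anchors without an href yield nil, so drop those.
    (delq nil
          (mapcar (lambda (node) (dom-attr node 'href))
                  (dom-by-tag dom 'a)))))
```

Calling it on a snippet with two linked anchors and one bare anchor should give back just the two href values. Relative links come through untouched at this stage; filtering is a separate step.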

Let's chat about (current-kill 0) for a sec. Nothing has changed from my previous usage, but let's just say that building interactive commands that work with your current clipboard (or kill ring, in Emacs terminology) is super handy (see clone git repo from clipboard).
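A clipboard-driven command usually wants a guard up front. The sketch below shows one way to do it, with a couple of assumptions: the helper name my/clipboard-url is hypothetical, and the "starts with http" check is a rough heuristic rather than real URL validation.

```elisp
(require 'subr-x) ; string-trim

(defun my/clipboard-url ()
  "Return the URL at the head of the kill ring, or signal a `user-error'.
Hypothetical helper; the http-prefix test is an assumed heuristic."
  ;; (current-kill 0) returns the most recent kill without rotating the ring.
  (let ((text (string-trim (current-kill 0))))
    (unless (string-prefix-p "http" text)
      (user-error "No URL in clipboard"))
    text))
```

Failing early with user-error keeps the message in the echo area instead of dropping into the debugger, which feels right for an interactive command.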

Moving on to sanitizing and filtering URLs… Links often have trailing slashes. Let's flush them. string-remove-suffix to the rescue. This and other handy string-manipulating functions have been built into Emacs since 24.4 as part of subr-x.el.
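Here is roughly what that cleanup could look like. Treat it as a sketch: my/clean-links is a made-up name, and keeping only links that start with "http" and dropping duplicates are assumptions about what "filtering" should mean here.

```elisp
(require 'subr-x) ; string-remove-suffix
(require 'seq)    ; seq-filter, seq-uniq

(defun my/clean-links (links)
  "Return LINKS with one trailing slash removed, non-http entries
dropped, and duplicates removed.  Hypothetical helper."
  (seq-uniq
   (seq-filter (lambda (link) (string-prefix-p "http" link))
               ;; string-remove-suffix strips at most one occurrence.
               (mapcar (lambda (link) (string-remove-suffix "/" link))
                       links))))
```

Stripping the slash before deduplicating matters: "https://a.com/" and "https://a.com" only collapse into one entry once they look the same.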
