Removing tens of thousands of web pages

submited by
Style Pass
2024-05-08 08:30:06

From 1997 onwards, we had web pages for security announcements. We had to manually prepare a .data and a .wml file which then generated a web page for each security announcement (DSA or DLA). We have listed the 6 most recent messages in a short list that was created from these files. Most of the work that went into the Debian web pages was creating these files.

Our search engine often listed the pages with security announcements instead of a more relevant web page for a particular topic.

At DebConf Kosovo (2022) I started with a proof of concept and wrote a script, that generates this list without using the .data/.wml files in the Git repository, but instead reading the primary sources of security information[1]. This new list now includes links to the security tracker and the email of the announcement.

When I looked at the OVAL files and the apache logs of our web server, I saw that more than 99% of the web traffic was generated by these XML files (134TB of 135TB total in two weeks). They were not compressed and were around 50MB in size. With the help of Carsten Schönert we managed to modify the python scripts that generate this OVAL file without using the .data/.wml files and now we only provide bzip2 compressed XML files[2].

Leave a Comment