Case study: Archiving web sites with wget and puppeteer

Submitted by Style Pass on 2021-06-30 03:00:05

At Netlife I helped build norceresearch.no for NORCE, a research institution formed by merging several well-established research institutions. While we were building the new site, it became clear that the web sites it would replace also needed to be archived.

The web sites to be archived were cmr.no, teknova.no, norut.no, agderforskning.no, uni.no, iris.no, prototech.no and polytech.no. Some sites were soon to be phased out, while others were archived just in case. These sites were built and hosted by various firms including NORCE itself.

My approach had two parts. One, I would get in touch with the various web site administrators and request a copy of the codebase and the database. Two, I would apply the Swiss army knife that is wget for creating static exports of the sites. wget is a command line tool initially released in January 1996. It’s feature-rich, mature and a real workhorse.
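To make the second part concrete, here is a minimal sketch of what a static export with wget could look like when driven from Python. It is not the exact command used for these sites: the flag selection, the `https://` scheme, the one-second wait, and the helper names `build_wget_command` and `archive_site` are all assumptions made for illustration.

```python
import subprocess
from typing import List


def build_wget_command(domain: str, output_dir: str) -> List[str]:
    """Return a wget argument list for a static mirror of one site.

    Hypothetical helper: the flags below are a common starting point,
    not the command the original project ran.
    """
    return [
        "wget",
        "--mirror",             # recursive download with timestamping
        "--convert-links",      # rewrite links so the copy works offline
        "--adjust-extension",   # save HTML pages with an .html suffix
        "--page-requisites",    # also fetch the CSS, JS and images pages need
        "--no-parent",          # never ascend above the starting directory
        "--wait=1",             # assumed politeness delay between requests
        f"--domains={domain}",  # stay on the site's own domain
        f"--directory-prefix={output_dir}",
        f"https://{domain}/",   # assumes the site is served over HTTPS
    ]


def archive_site(domain: str, output_dir: str) -> int:
    """Run wget for one site and return its exit code."""
    return subprocess.run(build_wget_command(domain, output_dir)).returncode
```

Calling `archive_site("iris.no", "archive")` would then start a crawl and write the export under `archive/`. Keeping the command construction separate from the call that runs it makes it easy to inspect or adjust the flags per site before starting a long crawl.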

I started by sending a flurry of e-mails, then got to work archiving iris.no with wget. As luck (or bad luck) would have it, that turned out to be no easy feat.
