URL parser performance

submited by
Style Pass
2023-11-21 12:30:09

“URL” is this mythical concept of a string that identifies a resource online and yet there is no established standard for its syntax. There are instead multiple ones out of which one is on purpose “moving” so it never actually makes up its mind but instead keeps changing.

This then leads to there being basically no two URL parsers that treat URLs the same, to the extent that mixing parsers is considered a security risk.

The browsers have established their WHATWG URL Specification as a “living document”, saying how browsers should parse URLs, gradually taking steps away from the earlier established RFC 3986 and RFC 3987 attempts.

The WHATWG standard keeps changing and the world that tries to stick to RFC 3986 still needs to sometimes get and adapt to WHATWG influences in order to interoperate with the browser-centric part of the web. It leaves URL parsing and “URL syntax” everywhere in a sorry state.

In the curl project we decided in 2018 to help mitigate the mixed URL parser problem by adding a URL parser API so that applications that use libcurl can use the same parser for all its URL parser needs and thus avoid the dangerous mixing part.

Leave a Comment