Scan HTML faster with SIMD instructions: Chrome edition

submitted by
Style Pass
2024-06-08 05:30:03

Modern processors have instructions to process several bytes at once. Effectively all processors have the capability of processing 16 bytes at once. These instructions are called SIMD, for single instruction, multiple data.

It was once an open question whether these instructions could be useful to accelerate common tasks such as parsing HTML or JSON. However, the work on JSON parsing, as in the simdjson parser, has shown rather decisively that SIMD instructions could, indeed, be helpful in breaking speed records.

Inspired by such work, the engine underlying the Google Chrome browser (Chromium) has adopted SIMD parsing of HTML inputs. It is the result of the excellent work by a Google engineer, Anton Bikineev.

The approach is used to quickly jump to four specific characters: <, &, \r and \0. You can implement something that looks a lot like it using regular C++ code as follows:
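A minimal sketch of such a scalar scan, assuming a hypothetical function name `naive_advance_string` and a pointer-pair interface (neither is taken from Chromium's source). It walks the input one character at a time and stops at the first of the four characters, or at `end` if none occurs.

```cpp
#include <cassert>

// Hypothetical name and signature, chosen for illustration only.
// Advances `start` to the first '<', '&', '\r' or '\0' in [start, end),
// or to `end` if none of these four characters occurs.
inline void naive_advance_string(const char *&start, const char *end) {
  assert(start <= end);  // precondition: a valid (possibly empty) range
  for (; start < end; ++start) {
    const char c = *start;
    if (c == '<' || c == '&' || c == '\r' || c == '\0') {
      return;  // `start` now points at the character of interest
    }
  }
}
```

Called with `start` pointing at the beginning of `"hello world &amp; more"`, the function leaves `start` on the `&` at offset 12.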

A ‘naive’ approach using the SIMD instructions available on ARM processors looks as follows. Basically, you just do more or less the same thing as the naive regular/scalar approach, except that instead of taking one character at a time, you take 16 characters at a time.
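A sketch of what such a 16-bytes-at-a-time scan might look like with ARM NEON intrinsics. The function name `simd_advance_string`, the helper `is_special`, and the lane-weight trick used to locate the first match are illustrative assumptions, not Chromium's actual code. The NEON path is compiled only on 64-bit ARM; elsewhere the function falls back to the scalar loop, so the sketch should build on any platform.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: true for the four characters of interest.
inline bool is_special(char c) {
  return c == '<' || c == '&' || c == '\r' || c == '\0';
}

// Hypothetical name and signature, chosen for illustration only.
// Advances `start` to the first special character in [start, end),
// or to `end` if there is none.
inline void simd_advance_string(const char *&start, const char *end) {
  assert(start <= end);  // precondition: a valid (possibly empty) range
#if defined(__ARM_NEON) && defined(__aarch64__)
  const uint8x16_t lt = vdupq_n_u8('<');
  const uint8x16_t amp = vdupq_n_u8('&');
  const uint8x16_t cr = vdupq_n_u8('\r');
  const uint8x16_t nul = vdupq_n_u8('\0');
  // Lane i carries the weight 16 - i, so the largest weight among the
  // matching lanes identifies the first matching lane.
  static const uint8_t kLaneWeights[16] = {16, 15, 14, 13, 12, 11, 10, 9,
                                           8,  7,  6,  5,  4,  3,  2,  1};
  const uint8x16_t weights = vld1q_u8(kLaneWeights);
  while (end - start >= 16) {
    // Load 16 characters and compare them against each target at once.
    const uint8x16_t data =
        vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    const uint8x16_t hit =
        vorrq_u8(vorrq_u8(vceqq_u8(data, lt), vceqq_u8(data, amp)),
                 vorrq_u8(vceqq_u8(data, cr), vceqq_u8(data, nul)));
    // Matching lanes are 0xFF; keep their weights and take the maximum.
    const uint8_t m = vmaxvq_u8(vandq_u8(hit, weights));
    if (m != 0) {
      start += 16 - m;  // index of the first matching lane
      return;
    }
    start += 16;
  }
#endif
  // Scalar loop: handles the last few bytes, or the whole input when
  // NEON is not available.
  for (; start < end; ++start) {
    if (is_special(*start)) {
      return;
    }
  }
}
```

Each comparison yields 0xFF in the lanes that match. Masking with the weights 16, 15, ..., 1 and taking the horizontal maximum gives 16 minus the index of the first match, which avoids looping over the lanes one by one.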
