GitHub - expr-fi/fastlwc: SIMD-enhanced word counter

2022-09-22 20:00:22

As you're here, you've probably seen those posts. Arguably, this one might be a bit more boring, as there's no brief introduction to a weird language here; it's all just the C you already know (or don't), and the algorithm used is just as simple as the task at hand (more on that below). Still, the entries mentioned above were intriguing, and the goal here is to find out exactly how fast is fast when it comes to wc.

A word, as far as wc is concerned, is a non-zero-length string of characters, delimited by white space (POSIX.1-2017). As some implementations have noted, this definition makes no distinction between printable and non-printable characters, yet GNU wc, for example, ignores non-printable (non-whitespace) characters when counting words. Historical implementations regarded only spaces, tabs and newlines as whitespace – a straightforward approach, which can still be seen in use today. POSIX.1, however, recommends the use of an equivalent of isspace() from the C standard library.
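To make the two whitespace definitions concrete, here is a minimal sketch of a byte-wise word counter that takes the whitespace test as a parameter. The names `count_words`, `is_space_historical` and `is_space_posix` are made up for this illustration; this is not the code used by fastlwc or by any particular wc, and it does not model GNU wc's handling of non-printable characters.

```c
#include <ctype.h>
#include <stddef.h>

/* Historical definition: only space, tab and newline delimit words. */
static int is_space_historical(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* POSIX-recommended definition: defer to isspace() in the current locale. */
static int is_space_posix(unsigned char c)
{
    return isspace(c) != 0;
}

/*
 * Count words in buf[0..len): a word is a maximal run of bytes for which
 * is_space() returns 0. Hypothetical helper, for illustration only.
 */
static size_t count_words(const char *buf, size_t len,
                          int (*is_space)(unsigned char))
{
    size_t words = 0;
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        if (is_space((unsigned char)buf[i])) {
            in_word = 0;          /* a delimiter ends the current word */
        } else if (!in_word) {
            in_word = 1;          /* first byte of a new word */
            words++;
        }
    }
    return words;
}
```

The two predicates can disagree on the same input: a vertical tab between two letters splits them into two words under isspace() in the "C" locale, but not under the historical definition.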

All non-historical implementations support multibyte characters, including multibyte whitespace characters (when using a multibyte locale). However, some of them only do so when given the -m option, which prints input lengths in characters instead of bytes. Other implementations recognise multibyte whitespace characters when counting words regardless of that option. This not only alters the behaviour, but also has considerable performance implications when wc is given no options, i.e. the default use case, as highlighted below. (This disparity has been mistakenly attributed to other causes in previous discussions.)
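As a rough illustration of why recognising multibyte whitespace costs more, the following sketch decodes each character with mbrtowc() and tests it with iswspace(), instead of testing single bytes. The function name `count_words_mb` is hypothetical, and treating an invalid or truncated sequence as a single non-whitespace byte is an assumption made for this sketch; real implementations may handle such input differently.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count words in buf[0..len), decoding multibyte characters according to
 * the current LC_CTYPE locale. Hypothetical helper, for illustration only.
 */
static size_t count_words_mb(const char *buf, size_t len)
{
    mbstate_t st;
    size_t words = 0, i = 0;
    int in_word = 0;

    memset(&st, 0, sizeof st);    /* start in the initial shift state */

    while (i < len) {
        wchar_t wc = L'\0';
        size_t n = mbrtowc(&wc, buf + i, len - i, &st);
        int space;

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Assumption: an invalid or truncated sequence counts as
             * one non-whitespace byte; reset the decoder and move on. */
            memset(&st, 0, sizeof st);
            n = 1;
            space = 0;
        } else {
            if (n == 0)
                n = 1;            /* embedded NUL occupies one byte */
            space = iswspace((wint_t)wc) != 0;
        }

        if (space) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
        i += n;
    }
    return words;
}
```

The caller would normally run `setlocale(LC_CTYPE, "")` first so that the environment's multibyte locale takes effect; without that call the program stays in the "C" locale. Either way, every character goes through a decoding step and a wide-character classification, which is more work per input byte than a plain byte comparison.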