Some Strategies For Fast Lexical Analysis when Parsing Programming Languages

submited by

Style Pass

2024-04-28 17:00:10

Abstract: Some techniques I've used to make lexical analysis faster: enhancing state machine traversals with additional inner-loop operations to minimize branch mispredictions, and restricting location-tracking-for-diagnostics to only file byte-offsets.

I've generally found the use of parser generators like lex & yacc to be painful; they generate source code, which complicates the build process (e.g. on Windows where developers don't have them available by default); yacc is twitchy (shift-reduce conflicts, error handling, etc.); the regular expressions in lex/flex can get annoying (e.g. for strings with backslash-escaped elements).

But the proof is in the pudding: what if we can just do faster than flex, and the speed of lexing matters? Here's the results I got in 2006 when I did this work with a C parser:

Runtime performance lexing ~7,500,000 lines of C code using MSVC 6 in 2006: lexer input reading performance flex (default) fread 8KB blocks 13.56 s stb_lex (w/symbol hashing) fread whole file 4.84 s stb_lex fread whole file 4.23 s flex -F (fast) fread 8K blocks 3.07 s flex -f (full) fread 8K blocks 2.92 s handcoded fread whole file 2.45 s handcoded mmap mmap() 2.14 s wc fread 64K blocks 1.73 s