After discovering the SQLite file format and implementing the .tables command in part 1 and part 2 of this series, we're ready to tackle the next big

Build your own SQLite, Part 3: SQL parsing 101

submited by

Style Pass

2024-11-18 21:30:20

After discovering the SQLite file format and implementing the .tables command in part 1 and part 2 of this series, we're ready to tackle the next big challenge: writing our own SQL parser from scratch.

As the SQL dialect supported by SQLite is quite large and complex, we'll initially limit ourselves to a subset that comprises only the select statement, in a striped-down form. Only expressions of the form select <columns> from <table> will be supported, where <columns> is either * or a comma-separated list of columns names (with an optional as alias), and <table> is the name of a table.

Our SQL parser will follow a conventional 2 steps process: lexical analysis (or tokenization) and syntax analysis (or parsing).

The lexical analysis step takes the input SQL string and groups individual characters into tokens, which are meaningful units of the language. For example, the character sequence S-E-L-E-C-T will be grouped into a single token of type select, and the sequence * will be grouped into a token of type star. This stage is also responsible for discarding whitespace and normalizing the case of the input.

The syntax analysis step takes the stream of tokens produced by the lexical analysis, and tries to match them against the syntax rules of the language. Its output is an abstract syntax tree (AST), which is a hierarchical representation of the input SQL.