How to Build a Query Engine

submited by
Style Pass
2024-10-22 16:30:03

In our early days, all the queries doing sequence analytics and feeding the output to visualization layer were built using SQL underneath. Rather quickly, the code for generating these queries become complex and the queries themselves were massive and slow. The attempts to use MATCH_RECOGNIZE (a row pattern matching capability added in SQL:2016 and now available in multiple SQL query engines) were somewhat disappointing and we came to the conclusion that we must take a different path.

We didn’t yet have SOL back then yet. Just a small glimpse of a DSL (Domain Specific Language), prototyped by our CEO, Misha, in his Observable Notebooks. The critical part of the DSL (and the one we struggled the most while trying to express it via SQL) was a capability to match and reference sub-sequences of events in a flexible way. Essentially being able to use array semantics within a window of events for a specific actor. The fundamental issue with SQL is that it is not useful for expressing operations across rows (unlike across columns), which makes this a challenging task. Similar observation was made by Michael Stonebraker and Andy Pavlo in their excellent paper reviewing the landscape of database management systems, where they state that “(…) RDBMSs cannot efficiently store and analyze arrays despite new SQL/MDA enhancements.”.

When dealing with clickstreams or other types of sequence events, this sort of the capability matters a lot.  We looked at various available options as of 2022 and concluded that we really need a dedicated language to solve this problem, though borrowing multiple ideas from existing concepts, such as SASE+, Cypher syntax or regex capture groups.

Leave a Comment