This is a note on Louis Brandy and Nathan Bronson's talk at Systems @scale - https://www.facebook.com/atscaleevents/videos/5279535778820968/.  Data ge

Fast SQL from Schemaless ingestion

submited by
Style Pass
2022-07-01 21:00:06

This is a note on Louis Brandy and Nathan Bronson's talk at Systems @scale - https://www.facebook.com/atscaleevents/videos/5279535778820968/.

Data generated from the customers are often schemaless (e.g. JSON), but fast SQL for analytical queries are desired out of this unstructured data. Having fast query performance and less rigidity of the data schema are usually contradictory goals. Let's see how Rockset solves this problem.

The overall strategy taken by Rockset is very similar to Databricks' Photon, as both use vectorized query engines. The query engine operates against a batch (array) of inputs of a specific type, so C++ templates can be specialized for fast execution in those tight loops inside the query engine. Rockset infers the type by examining the query text and column names.

Vectorization is a good fit for dynamic schemaless workloads because you can batch type checks (e.g. here's an array of integers) before dispatching the work to the query engine, compared to doing type checks per row.

Leave a Comment