This rust library compresses and decompresses sequences of numerical data very well. It currently supports the following data types: i32, i64, u32, u6

mwlon / quantile-compression

submited by
Style Pass
2021-06-23 23:00:04

This rust library compresses and decompresses sequences of numerical data very well. It currently supports the following data types: i32, i64, u32, u64, f32, f64. Smaller data types like i16 can be efficiently compressed by casting to i32. Timestamp support may come soon in the future.

For natural data, it typically shrinks data to 25-40% smaller than what gzip -9 produces, compresses much faster, and decompresses equally quickly.

The intended use case for this algorithm is compressing columnar data, especially for use by Spark and other execution engines.

This works by describing each number with a range and an offset. The range specifies an inclusive range [lower, upper] that the number might be in, and the offset specifies the exact position within that range. The compressor chooses a prefix for each range via Huffman codes.

For data sampled from a random distribution, this compression algorithm can reduce byte size to near the theoretical limit of the distribution's Shannon entropy. Ideally it encodes a number k in b bits if 2^-b ~= P(k). We can plot Q(k) = 2^-b to see how close quantile compression gets to the ideal in this example with max_depth=3:

Leave a Comment
Related Posts