Emin Bahadır Tülüce

submited by
Style Pass
2021-06-12 01:00:06

In a typical serialization library, encoding a collection gets carried out in the same way for lists (ordered) and sets (order-agnostic). Regardless of the collection type, the elements are simply written one after the other. This means that the encoded bytes implicitly carry the information of the order of elements. Is it possible to isolate this information, eliminate it from the encoding, therefore end up with a more compact byte representation?

Bottom line up front: Yes, it is possible to store a set using less than the regular amount of bytes. Both theoretically and practically. However, the practical algorithm is not so straightforward hence it's being classified as a "compression" algorithm.

First, let me demonstrate a very simple example to show how this is theoretically possible. I'll be using a "multiset" instead of a "set" for the order-agnostic collection to keep the example simpler.

We were able to drop half of the possible collections in the multiset case since we don't respect the order of elements. This then allowed us to use one less bit during the encoding.

Leave a Comment