On Multi-Set Hashing

submited by
Style Pass
2021-07-25 19:30:08

A typical hash function takes some data, usually just bytes, and outputs a succint digest of bits. A slight change to the data creates a large change in the digest.

Here, I’m asking if you can design a hash function for a set of objects. The order you hash the objects in shouldn’t matter. For example,

A simple solution is to sort your objects before passing them to the hash function. For example, you’d sort $\{c, a, b\}$ to get the ordering $\{a, b, c\}$, and then pass that to the hash function. The initial ordering no longer matters, because you convert to a standard ordering before using the hash function.

Hash functions can work incrementally. If you want to hash a 2GB file, you don’t need to keep the whole file in memory. Instead, you can feed the hash function small chunks of the file, and end up with the same result as feeding in the entire 2GB of data at once.

Having to sort our objects before feeding them into the hash function loses this incremental property. We have to keep a large sorted collection around until we see all the objects.

Leave a Comment