How databases handle 10 million devices in high-cardinality benchmarks

If you're working with large amounts of data, you've likely heard about high-cardinality or run into issues related to it. It might sound like an intimidating topic if you're unfamiliar with it, but this article explains what cardinality is and why it crops up so often with databases of all types. IoT and monitoring are the use cases where high-cardinality is most likely to be a concern. Still, a solid understanding of the concept helps when planning general-purpose database schemas and understanding the common factors that can influence database performance.

Cardinality typically refers to the number of elements in a set, i.e. the set's size. In the context of a time series database (TSDB), rows will usually have columns that categorize the data and act as tags. Assume you have 1000 IoT devices in 20 locations, each running one of 5 firmware versions and reporting input from 5 types of sensors. The cardinality of this set is 500,000 (1000 x 20 x 5 x 5). In some cases this can quickly become unmanageable: even adding and tracking one new firmware version for the devices would increase the set to 600,000 (1000 x 20 x 6 x 5).
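To make the arithmetic concrete, here is a minimal Python sketch (the tag names and value counts are simply the hypothetical fleet from the example above) that computes cardinality as the product of the distinct values per tag:

```python
from math import prod

# Hypothetical tag columns and the number of distinct values each takes,
# matching the example fleet above.
tag_values = {
    "device_id": 1000,      # 1000 IoT devices
    "location": 20,         # 20 locations
    "firmware_version": 5,  # 5 firmware versions
    "sensor_type": 5,       # 5 sensor types per device
}

# Worst-case cardinality is the product of the per-tag counts.
print(prod(tag_values.values()))  # 1000 * 20 * 5 * 5 = 500000

# Rolling out one more firmware version multiplies the total accordingly.
tag_values["firmware_version"] = 6
print(prod(tag_values.values()))  # 1000 * 20 * 6 * 5 = 600000
```

Strictly speaking, the product is an upper bound: not every combination of tag values will actually appear in the data, but it is the combination space the database has to be prepared to handle.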

In these scenarios, experience shows that we will eventually want insights into more kinds of information about the devices, such as application errors, device state, metadata, configuration, and so on. Each new tag or category we add to our data set multiplies the total cardinality, so it grows exponentially with the number of tags, as the short sketch below illustrates.
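As a rough illustration of that compounding growth (the new tags and their value counts below are made-up for the example, and each tag is assumed independent of the others):

```python
from math import prod

per_tag_values = [1000, 20, 5, 5]  # the original fleet: 500,000 combinations

# Hypothetical additional tags: error codes, device states, config versions.
for name, distinct_values in [("error_code", 50),
                              ("device_state", 4),
                              ("config_version", 10)]:
    per_tag_values.append(distinct_values)
    print(f"after adding {name}: {prod(per_tag_values):,} combinations")
```

Three extra tags take the example from 500,000 to 1,000,000,000 combinations. In a database, high-cardinality boils down to the following two conditions: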
