Ted is writing things


So you decided to use differential privacy (DP) to publish or share some statistical data. You ran your first pipeline or query, all excited, but then… You're getting useless results. Maybe all your data seems to have disappeared. Or the statistics look very unreliable: the noise completely drowns out the useful signal. Don't lose hope! This situation is common the first time people try to use differential privacy. And chances are that you can fix it with a few simple changes.

In this post, we'll cover five basic strategies to improve the utility of your anonymized data. These are far from the only tricks you can use, but they're a good place to start. And none of them requires you to sacrifice any privacy guarantees.

DP algorithms produce better results when run over more data. Remember: the noise they introduce is proportional to the maximum contribution of a single person. It doesn't depend on the size of the input data. So, the more people you have in a statistic, the smaller the relative noise will be. Individual contributions will "vanish into the crowd", and so will the added uncertainty.
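You can see this effect with a minimal sketch of the Laplace mechanism for a counting query. This is illustrative code, not a production DP implementation: `dp_count` and its parameters are hypothetical names, and it assumes each person contributes at most one record (so the sensitivity is 1). Note that the noise scale is computed from the sensitivity and the privacy parameter only — the true count never enters into it.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(true_count, epsilon=1.0, sensitivity=1):
    """Noisy count via the Laplace mechanism.

    The noise scale depends only on sensitivity/epsilon,
    NOT on the size of the underlying data.
    """
    noise = rng.laplace(scale=sensitivity / epsilon)
    return true_count + noise

# Same mechanism, same privacy guarantee, very different relative error.
for true_count in [10, 1_000, 100_000]:
    noisy = dp_count(true_count)
    rel_error = abs(noisy - true_count) / true_count
    print(f"count={true_count:>7}  noisy={noisy:>10.1f}  relative error={rel_error:.2%}")
```

Running this, the absolute noise stays roughly the same across all three counts, while the relative error drops by orders of magnitude as the count grows.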

Increasing the total amount of input data will improve utility, but you may not be able to get more data. Luckily, there are other ways to take advantage of this property. What matters is the amount of data that contributes to each statistic you produce. In other words, the finer you slice and dice your data, the worse your utility will be. If you can, slice the data into coarser partitions. For example, calculate weekly statistics rather than daily statistics. Or aggregate your data per-country rather than per-city. You get the idea.
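Here is a small sketch of the weekly-vs-daily idea, under the same assumption as before (each person contributes at most one record, so the sensitivity per partition is 1 and each disjoint partition gets the same noise). The data here is made up for illustration. The key point: coarser partitions get the same absolute noise per published statistic, but applied to larger true values.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical daily visit counts for four weeks (28 days).
daily_counts = rng.integers(20, 60, size=28)

epsilon = 1.0  # same privacy parameter in both cases

# Fine partitions: one noisy statistic per day.
noisy_daily = daily_counts + rng.laplace(scale=1 / epsilon, size=28)

# Coarse partitions: sum each week first, then add the same noise
# per published statistic. The totals are ~7x larger, the noise isn't.
weekly_counts = daily_counts.reshape(4, 7).sum(axis=1)
noisy_weekly = weekly_counts + rng.laplace(scale=1 / epsilon, size=4)

daily_rel = np.abs(noisy_daily - daily_counts) / daily_counts
weekly_rel = np.abs(noisy_weekly - weekly_counts) / weekly_counts
print(f"mean relative error, daily:  {daily_rel.mean():.2%}")
print(f"mean relative error, weekly: {weekly_rel.mean():.2%}")
```

The weekly statistics carry roughly seven times less relative noise, at the cost of temporal resolution — that's the trade-off you're choosing when you pick coarser partitions.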
