In Praise of Small Data

submitted by
Style Pass
2023-11-20 11:30:05

Lately there has been a lot of talk about “Big Data” (petabytes and so forth) and how much valuable information is just waiting to be extracted from it.

There are diminishing returns to the amount of information you can extract from data. The tenth gigabyte is worth much less than the second gigabyte. The hundredth gigabyte is worth less than the tenth.

How much less? In regression analysis, the information you can derive from a data set is related to the square root of the size of the data set. That is, if you double the size of your data set, you only glean about 40% more information. On its face, a hundred gigabytes seems like it should have a hundred times more information than a single gigabyte, but from the standpoint of statistics, it only has ten times as much. So the hundredth gigabyte has roughly a tenth of the informational value of the second gigabyte.
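To make the arithmetic concrete, here is a minimal Python sketch. It assumes information scales exactly as the square root of the data size, which is a simplification of the rule of thumb above, and the function names are made up for illustration.

```python
import math


def relative_information(gigabytes: float) -> float:
    """Information relative to a 1 GB data set, ASSUMING a pure square-root law."""
    return math.sqrt(gigabytes)


def marginal_information(n: int) -> float:
    """Extra information contributed by the n-th gigabyte under the same assumption."""
    return relative_information(n) - relative_information(n - 1)


# Gain from doubling the data set (1 GB -> 2 GB).
doubling_gain = relative_information(2) / relative_information(1) - 1

# Information in 100 GB compared with 1 GB.
hundredfold = relative_information(100) / relative_information(1)

# Value of the hundredth gigabyte compared with the second.
ratio = marginal_information(100) / marginal_information(2)

print(f"doubling the data adds {doubling_gain:.0%} more information")
print(f"100 GB carries {hundredfold:.0f}x the information of 1 GB")
print(f"the 100th GB is worth {ratio:.2f} of the 2nd GB")
```

Under that assumption the script reports a 41% gain from doubling, a tenfold gain from a hundredfold increase, and a ratio of about 0.12 for the hundredth gigabyte versus the second, which is in line with the "roughly a tenth" figure.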

And then there are diminishing returns to the value of information itself. Suppose you wish to estimate the effectiveness of a new drug (or something). Suppose the actual effect is 1.53824, in whatever units you like. You might be able to make a decision knowing just the first decimal place (that is, you estimate the effect to be around 1.5). The second decimal place might be nice to have, but unless you are working in the physical sciences, it is unlikely that knowing the fifth decimal place to be a 4 instead of an 8 will have any effect on anyone’s perception of reality.
