Someone at work posted a link to this Quanta Magazine article. It describes a novel, and seemingly straight-forward way to estimate the number o

Go deh!: Recreating the CVM algorithm for estimating distinct elements gives problems

submited by
Style Pass
2024-11-18 03:30:04

 Someone at work posted a link to this Quanta Magazine article. It describes a novel, and seemingly straight-forward way to estimate the number of distinct elements in a datastream. 

I looked at the description and decided to follow their text. They carefully described each round of the algorithm which I coded up and then looked for the generalizations and implemented a loop over alll items in the stream ....

It did not work! I got silly numbers. I could download Hamlet split it into words, (around 32,000), do len(set(words) to get the exact number of distinct words, (around 7,000), then run it through the algorithm and get a stupid result with tens of digits for the estimated number of distinct words. I re-checked my implementation of the Quanta-described algorithm and couldn't see any mistake, but I had originally noticed a link to the original paper. I did not follow it at first as original papers can be heavily into maths notation and I prefer reading algorithms described in code/pseudocode. 

I looked at Algorithm 1 as a probable candidate to decypher into Python, but the description was cryptic. Heres that description taken from the paper:

Leave a Comment