Mainly Tech projects on Python and Electronic Design Automation.
Monday, May 27, 2024
Recreating the CVM algorithm for estimating distinct elements gives problems
[画像:a Jamaican teaching Python in the Blue Mountains. Image 2 of 4]
Someone at work posted a link to this Quanta Magazine article. It describes a novel, and seemingly straight-forward way to estimate the number of distinct elements in a datastream.
Quanta describes the algorithm, and as an example gives "counting the number of distinct words in Hamlet".
Following Quanta
I looked at the description and decided to follow their text. They carefully described each round of the algorithm which I coded up and then looked for the generalizations and implemented a loop over alll items in the stream ....
It did not work! I got silly numbers. I could download Hamlet split it into words, (around 32,000), do len(set(words) to get the exact number of distinct words, (around 7,000), then run it through the algorithm and get a stupid result with tens of digits for the estimated number of distinct words.
I re-checked my implementation of the Quanta-described algorithm and couldn't see any mistake, but I had originally noticed a link to the original paper. I did not follow it at first as original papers can be heavily into maths notation and I prefer reading algorithms described in code/pseudocode.
I decided to take a look at the original.
The CVM Original Paper
I scanned the paper.
I read the paper.
I looked at Algorithm 1 as a probable candidate to decypher into Python, but the description was cryptic. Heres that description taken from the paper:
AI To the rescue!?
I had a brainwave💡lets chuck it at two AI's and see what they do. I had Gemini and I had Copilot to hand and asked them each to express Algorithm 1 as Python. Gemini did something, and Copilot finally did something but I first had to open the page in Microsoft Edge.
There followed hours of me reading and cross-comparing between the algorithm and the AI's. If I did not understand where something came from I would ask the generating AI; If I found an error I would first, (and second and...), try to get the AI to make a fix I suggested.
At this stage I was also trying to get a feel for how the AI's could help me, (now way past what I thought the algorithm should be, just to see what it would take to get those AI's to cross T's and dot I's on a good solution).
Not a good use of time! I now know that asking questions to update one of the 20 to 30 lines of the Python function might fix that line, but unfix another line you had fixed before. Code from the AI does not have line numbers making it difficult to state what needs changing, and where.They can suggest type hints and create the beginnings of docstrings, but, for example, it pulled out the wrong authors for the name of the algorithm.
In line 1 of the algorithm, the initialisation of thresh is clearly shown, I thought, but both AI's had difficulty getting the Python right. eventually I cut-n-pasted the text into each AI, where they confidentially said "OF course...", made a change, and then I had to re-check for any other changes.
My Code
I first created this function:
I tested it with Hamlet data and it made OK estimates.
Elated, I took a break.
Hacker News
The next evening I decided to do a search to see If anyone else was talking about the algorithm and found a thread on Hacker News that was right up my street. People were discussing those same problems found in the Quanta Article - and getting similar ginormous answers. They had one of the original Authors of the paper making comments! And others had created code from the actual paper and said it was also easier than the Quanta description.
The author mentioned that no less than Donald Knuth had taken an interest in their algorithm and had noted that the expression starting `X = ...` four lines from the end could, thoretically, make no change to X, and the solution was to encase the assignment in a while loop that only exited if len(X) < thresh.
Code update
I decided to add that change:
thresh
In the code above, the variable thresh, (threshhold), named from Algorithm 1, is used in the Quanta article to describe the maximum storage available to keep items from the stream that have been seen before. You must know the length of the stream - m, epsilon, and delta to calculate thresh.
If you were to have just the stream and thresh as the arguments you could return both the estimate of the number of distinct items in the stream as well as counting the number of total elements in the stream.
Epsilon could be calculated from the numbers we now know.
Testing
The algorithm generates an estimate based on random sampling, so I run it multiple times for the same input and report the mean estimate from those runs.
Sample output
Looks good!
Wikipedia
Another day, and I decide to start writing this blog post. I searched again and found the Wikipedia article on what it called the Count-distinct problem.
Looking through it, It had this wrong description of the CVM algorithm:
The, (or a?), problem with the wikipedia entry is that it shows
{\displaystyle p\leftarrow {\frac {p}{2}}}...within the while loop. You need an enclosing if |B| >= s for the while loop and the assignment to p outside the while loop, but inside this new if statement.
STOp PRESS! 02 May 2024: The Wikipedia entry was actually correct.
It's tough!
Both Quanta Magazine, and whoever added the algorithm to Wikipedia got the algorithm wrong.
I've written around two hundred tasks on site Rosettacode.org for over a decade. Others had to read my description and create code in their chosen language to implement those tasks. I have learnt from the feedback I got on talk pages to hone that craft, but details matter. Examples matter. Constructive feedback matters.
END.
About Me
Followers
Subscribe Now: google
Go deh too!
whos.amung.us
Blog Archive
-
2014
(18)
- October (1)
- September (1)
- August (2)
- July (2)
- June (3)
- May (2)
- March (3)
- February (3)
- January (1)
-
2013
(14)
- November (2)
- October (2)
- August (1)
- July (2)
- June (1)
- May (1)
- March (1)
- February (3)
- January (1)
-
2009
(42)
- December (1)
- November (4)
- October (3)
- September (3)
- August (4)
- July (1)
- June (4)
- May (6)
- April (6)
- March (3)
- February (5)
- January (2)
-
2008
(28)
- December (4)
- November (3)
- October (3)
- September (3)
- August (4)
- July (2)
- June (2)
- May (1)
- April (2)
- March (2)
- February (2)