Counting sounds simple, but it is very difficult to implement in practice.
Imagine that you are sent to a pristine tropical rainforest to conduct a wildlife census. Whenever you see an animal, take a photo.
Digital cameras only record the total number of animals tracked, but if you are interested in the number of unique animals, there is no statistics.
So, what is the best way to obtain this unique animal population?
At this point, you must say, start counting from now on, and finally compare each new species from the photo to the list.
However, this common counting method is sometimes not suitable for information amounts up to billions of entries.
Computer scientists from the Indian Statistical Institute, UNL, and the National University of Singapore have proposed a new algorithm - CVM.
It can approximate the number of different entries in a long list, and only needs to remember a small number of entries.
##Paper address: https://arxiv.org/pdf/2301.10191
This algorithm works for any list in which an item appears one at a time, such as text in a speech, merchandise on a conveyor belt, or cars on the interstate.
The CVM algorithm is named after the first letters of the three authors and has made significant progress in solving the "different elements problem".
This problem has troubled computer scientists for more than 40 years.
It requires an efficient way to monitor a stream of elements (the total number of which may exceed available memory) and estimate the number of unique elements in it.
So, how does the CVM algorithm solve the problem?
Suppose you are listening to the audiobook of "Hamlet".
This drama has a total of 30,557 words. How many are different?
To find the answer, you can pause while listening and write each word in alphabetical order, then skip words already on the list, and finally, just count the list on each word count.
This method is feasible, but it tests one's "memory" too much.
Researcher Vinodchandran Variyam said, "In a typical data flow situation, there may be millions of items to track. You may not want to store all the information.
This is where cloud server algorithms can provide a simpler approach."
The trick is "randomization".
Vinodchandran Variyam helped invent a CVM algorithm for estimating the number of distinct elements in a data stream
Go back to "Hamlet" and assume that your "effective memory" can only hold 100 words.
Once the audio starts playing, you write down the first 100 words you hear and skip any repeated words.
When you have finished recording 100 words, all that’s left is to toss a coin for each word –
Heads, keep word. If it is the reverse side, delete it.
After this preliminary round, you will be left with about 50 different words.
Now you continue with what the team calls Round 1, continuing to read Hamlet and adding new words.
If you encounter a word again that is already on the list, flip the coin again until you have 100 words in your memory whiteboard.
Then, roughly half of the words are randomly deleted again based on the results of 100 coin tosses. Round 1 ends here.
Next, enter the second round, Round 2.
Like the first round, we're going to increase the difficulty of a word - when you encounter a repeated word, flip the coin again.
The condition is, if it's the other side, delete it like before. But if it’s heads, flip the coin again. The word is retained only when it appears heads for the second time.
Once the memory board is full, end the round and then delete about half of the words again based on the 100 tosses.
In Round 3, you need to flip a coin heads three times in a row to retain a word.
In the fourth round, keep one word on the front four times in a row, and so on.
Finally, in round k, you will listen to the entire play of "Hamlet".
The point of this exercise is to ensure that each word has the same probability of occurrence: 1/2 (k).
Suppose, at the end of the Hamlet audio, you have 61 words in your list and it took six rounds to complete.
You can estimate the number of different words by dividing 61 by the probability 1/2 (6) - the final result in this game is 3904.
Researchers Chakraborty, Variyam and Meel mathematically proved that the accuracy of the CVM algorithm is proportional to the amount of memory Proportional to the size of the quantity.
And "Hamlet" has exactly 3967 unique words. (By ordinary counting method)
In the experiment using 100 word memory, the average estimate of the results of the 5 rounds of experiments is 3955 words.
With 1,000 words in memory, the average memory capacity increased to 3,964.
Variyam said, "If (the amount of memory) is large enough to accommodate all words, then we can achieve 100% accuracy."
William Kuszmau of Harvard University said, "This is a good example of how even very basic and widely studied problems can sometimes have simple but not obvious answers. Solutions are still to be discovered."
The above is the detailed content of Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for 'Hamlet'. For more information, please follow other related articles on the PHP Chinese website!