How to implement a more comprehensive Golang version of the cuckoo filter

This article comes from the Go language tutorial column. It describes a more comprehensive Golang implementation of the cuckoo filter, and I hope it helps anyone who needs one.

"Determining whether a value is in a huge set" (hereinafter collectively referred to as set membership testing) is a common data processing problem. In past experience, if a certain false positive rate is allowed, Bloom filters are the first choice, but now we have a better choice: cuckoo filters.
A recent project at work needed a filter. After some research, I found that the cuckoo filter was more cost-effective than the Bloom filter for our scenario.
To settle the technology choice, I read the original paper. Once I had decided on the cuckoo filter, I found almost no comprehensive Golang implementations. Several highly starred implementations on GitHub have flaws and do not maximize space utilization, so I ported and optimized a Golang library based on the original paper and its original C++ implementation. The details are below.
The code is here. Stars, usage, contributions and bug reports are welcome: github.com/linvon/cuckoo-filter

Cuckoo filter

There are already many introductory articles about cuckoo filters online, so I won't repeat them in detail here. I will only cover the key points needed for what follows.

If you want more detail, refer to the original paper, or see my Chinese translation of it.

What is a cuckoo filter?

A cuckoo filter is a filter based on cuckoo hashing. It is essentially a cuckoo hash table that stores a hash of each inserted item rather than the item itself.

If you know Bloom filters, you know the principle: several hash functions map an item to different positions in a bit array, and a query checks those bits to decide whether the item may be present.

The cuckoo filter hashes the item, keeps a fixed number of bits of the hash (the fingerprint), and stores them in an array. A query checks whether a fingerprint with the same bits is present in the array.

Why choose cuckoo filter?

Both filters store hashed data, and the cuckoo filter essentially stores a multi-bit hash per item. So why is the cuckoo filter better?

First, the cuckoo hash table is more compact, so it saves space.
Second, a Bloom filter query computes several hash functions, while a cuckoo filter query needs only one hash, so lookups are very efficient.
Third, the cuckoo filter supports deletion, while the Bloom filter does not.

The advantages are clear, but what about the disadvantages? Compared with the Bloom filter:

The cuckoo filter uses a candidate (alternate) bucket scheme. The candidate bucket and the preferred bucket can each be derived from the other by XORing the bucket position with the stored fingerprint. This correspondence requires the number of buckets to be a power of 2.
A Bloom filter insert computes the hashes and writes the bits directly. In a cuckoo filter, the computed position may already hold a fingerprint, in which case a stored item has to be kicked out to its candidate bucket. As the buckets fill up, collisions become more likely and insertion takes longer, so insertion performance is much worse than that of the Bloom filter.
Inserting duplicate elements: a Bloom filter is unaffected by duplicates, since it just sets bits that are already set. A cuckoo filter stores another copy of the fingerprint and kicks out existing values to make room, so there is an upper limit on how many times the same element can be inserted.
Deletion is not perfect: besides the limit on repeated insertion above, deletion has a related problem. It is only exact when each item was inserted exactly once. Deleting an element that was never inserted may remove another item's fingerprint by accident, for the same reason that false positives occur. If an element was inserted multiple times, each deletion removes only one copy, so you either need to know how many times it was inserted or run the deletion in a loop until it fails.
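
The behaviour described in this list (kicking on insert, the limit on duplicates, deleting one copy at a time) can be seen in a minimal sketch. This is a toy written for illustration, not the code of the library discussed in this article: the 8-bit fingerprint, the FNV hash, the `maxKicks` limit and all names are assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const bucketSize = 4 // b: fingerprints per bucket
const maxKicks = 500 // relocation attempts before the filter is declared full (assumed value)

// Filter is a toy cuckoo filter with 8-bit fingerprints; 0 marks an empty slot.
type Filter struct {
	buckets [][bucketSize]uint8
	mask    uint32 // numBuckets-1, valid because numBuckets is a power of two
}

// NewFilter assumes numBuckets is a power of two.
func NewFilter(numBuckets uint32) *Filter {
	return &Filter{buckets: make([][bucketSize]uint8, numBuckets), mask: numBuckets - 1}
}

func hash64(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

// locate returns the preferred bucket and the non-zero fingerprint of item.
func (f *Filter) locate(item []byte) (uint32, uint8) {
	h := hash64(item)
	fp := uint8(h)
	if fp == 0 {
		fp = 1
	}
	return uint32(h>>32) & f.mask, fp
}

// alt maps either bucket of a fingerprint to the other one via XOR.
func (f *Filter) alt(i uint32, fp uint8) uint32 {
	return (i ^ uint32(hash64([]byte{fp}))) & f.mask
}

// find returns the slot in bucket i holding fp, or -1.
func (f *Filter) find(i uint32, fp uint8) int {
	for s, v := range f.buckets[i] {
		if v == fp {
			return s
		}
	}
	return -1
}

// put stores fp in a free slot of bucket i, if there is one.
func (f *Filter) put(i uint32, fp uint8) bool {
	s := f.find(i, 0)
	if s >= 0 {
		f.buckets[i][s] = fp
	}
	return s >= 0
}

func (f *Filter) Insert(item []byte) bool {
	i, fp := f.locate(item)
	if f.put(i, fp) || f.put(f.alt(i, fp), fp) {
		return true
	}
	// Both buckets are full: evict a random resident to its other bucket, repeatedly.
	for n := 0; n < maxKicks; n++ {
		s := rand.Intn(bucketSize)
		fp, f.buckets[i][s] = f.buckets[i][s], fp
		i = f.alt(i, fp)
		if f.put(i, fp) {
			return true
		}
	}
	return false // gave up; the fingerprint still in hand is dropped in this toy version
}

func (f *Filter) Contains(item []byte) bool {
	i, fp := f.locate(item)
	return f.find(i, fp) >= 0 || f.find(f.alt(i, fp), fp) >= 0
}

// Delete removes one copy of the item's fingerprint.
func (f *Filter) Delete(item []byte) bool {
	i, fp := f.locate(item)
	for _, b := range []uint32{i, f.alt(i, fp)} {
		if s := f.find(b, fp); s >= 0 {
			f.buckets[b][s] = 0
			return true
		}
	}
	return false
}

func main() {
	f := NewFilter(1024)
	fmt.Println(f.Insert([]byte("apple")), f.Contains([]byte("apple")))
	fmt.Println(f.Delete([]byte("apple")), f.Contains([]byte("apple")))
}
```

In this sketch, inserting the same item in a loop stops succeeding after at most 2b copies, because the item only has two candidate buckets of b slots each. Each `Delete` call then removes a single copy.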

With the advantages and disadvantages listed, let's summarize. Set membership testing is read-heavy and write-light in most scenarios, and repeated insertion is meaningless. Although cuckoo filter deletion is not perfect, it is better than nothing, and the filter also offers better query and storage efficiency. In most cases it is the more cost-effective choice.

Practical Guide

Detailed Implementation

Let's start with the concepts. The filter consists of many buckets, and each bucket stores hashed values of the inserted items, keeping only a fixed number of bits of each (the fingerprint).

There are n buckets in the filter, and n is calculated from the number of items to be stored. A hash function determines which bucket an item should go into, and each additional hash function gives the item one more candidate bucket. When an insertion hits a full bucket, a currently stored item is kicked out into its candidate bucket. In theory, more hash functions give higher space utilization, but in actual tests k=2 hash functions already reach 98% utilization.

Each bucket stores multiple fingerprints, up to the bucket size, and different fingerprints may map to the same bucket. The larger the bucket, the higher the space utilization, but each query then scans more fingerprints in the bucket, so a false positive is more likely. To compensate, the fingerprint length has to be increased to lower the collision rate and maintain the false positive rate.

The paper mentions several parameters needed to implement a cuckoo filter, mainly:

Number of hash functions (k): 2 is enough
Bucket size (b): how many fingerprints each bucket stores
Fingerprint size (f): how many bits of the key's hash each fingerprint stores

Chapter 5 of the paper uses experimental data to show how to choose the most suitable construction parameters. We can draw the following conclusions:

The filter cannot be filled to 100%. There is a maximum load factor α, so the storage space amortized over each item is f/α
With the total filter size held constant, the larger the bucket, the higher the load factor, that is, the higher the space utilization. However, the more fingerprints each bucket stores, the higher the probability of a collision during a query. To maintain the same false positive rate, larger buckets need longer fingerprints

Based on this theory, the relevant experimental results are:

With k=2 hash functions and bucket size b=1 (a directly mapped hash table), the load factor α is 50%; with bucket size b=2, 4 or 8 it rises to 84%, 95% and 98% respectively
To guarantee a false positive rate r, we need $2b/2^f \leq r$, so the fingerprint size is approximately $f \geq \log_2(2b/r) = \log_2(1/r) + \log_2(2b)$, and the amortized cost of each item is $C \leq [\log_2(1/r) + \log_2(2b)]/\alpha$
The experimental data shows that when r>0.002, two entries per bucket give slightly better results than four entries per bucket; when r is reduced to 0.00001, four entries per bucket minimizes space
Semi-sorted buckets save 1 bit of storage per item, but they only apply to filters with b=4
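
As a worked example of the formulas above, the following sketch computes the smallest whole-bit fingerprint size for a target false positive rate and the resulting cost per item. The function names are made up for illustration, and the load factors are simply the experimental figures quoted in this list.

```go
package main

import (
	"fmt"
	"math"
)

// loadFactor returns the load factor α quoted above for k=2 and b in {1, 2, 4, 8}.
func loadFactor(b int) float64 {
	switch b {
	case 1:
		return 0.50
	case 2:
		return 0.84
	case 4:
		return 0.95
	case 8:
		return 0.98
	}
	return math.NaN() // no figure quoted for other bucket sizes
}

// fingerprintBits returns the smallest integer f with 2b/2^f <= r.
func fingerprintBits(r float64, b int) int {
	return int(math.Ceil(math.Log2(2 * float64(b) / r)))
}

// bitsPerItem returns the amortized cost f/α for that fingerprint size.
func bitsPerItem(r float64, b int) float64 {
	return float64(fingerprintBits(r, b)) / loadFactor(b)
}

func main() {
	for _, b := range []int{2, 4, 8} {
		fmt.Printf("r=0.001 b=%d: f=%d bits, %.2f bits per item\n",
			b, fingerprintBits(0.001, b), bitsPerItem(0.001, b))
	}
}
```

For r=0.001 and b=4 this gives f=13, since $\log_2(8/0.001) \approx 12.97$, and about 13.7 bits per item at α=95%.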

This tells us how to choose the parameters when constructing our cuckoo filter:

First, we use two hash functions, which is enough to achieve sufficient space utilization. The false positive rate we need determines the bucket size, although the choice of b is not absolute: even if r>0.002, you can use b=4 in order to enable semi-sorted buckets. From b we can then calculate the fingerprint size f needed to reach the target false positive rate, which fixes all the filter parameters.

Comparing the conclusion above with the $1.44\log_2(1/r)$ bits per item of the Bloom filter, we find that with semi-sorting enabled the cuckoo filter uses less space when r<0.03; without semi-sorting, the threshold drops to about r<0.003.
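
The comparison with the Bloom filter can be checked numerically. The sketch below uses the continuous forms of the formulas (no rounding of f to whole bits) with b=4 and α=0.95; the function names are hypothetical and the figures are only as accurate as those formulas.

```go
package main

import (
	"fmt"
	"math"
)

// bloomBits is the space-optimal Bloom filter cost per item, 1.44*log2(1/r).
func bloomBits(r float64) float64 {
	return 1.44 * math.Log2(1/r)
}

// cuckooBits is [log2(1/r) + log2(2b)]/α, minus one bit per item when semi-sorting is on.
func cuckooBits(r float64, b int, alpha float64, semiSort bool) float64 {
	f := math.Log2(1/r) + math.Log2(2*float64(b))
	if semiSort {
		f-- // semi-sorted buckets save 1 bit per fingerprint (b=4 only)
	}
	return f / alpha
}

func main() {
	for _, r := range []float64{0.05, 0.01, 0.001} {
		fmt.Printf("r=%g: bloom %.2f, cuckoo %.2f, cuckoo+semi-sort %.2f bits per item\n",
			r, bloomBits(r), cuckooBits(r, 4, 0.95, false), cuckooBits(r, 4, 0.95, true))
	}
}
```

At r=0.01 only the semi-sorted variant beats the Bloom filter, while at r=0.001 both cuckoo variants do, which matches the thresholds mentioned above.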

Some advanced explanations

Optimization of hash algorithm

Although we specified two hash functions, in practice one is enough, because the paper describes an alternative way to compute the candidate bucket: the second bucket index is obtained by XORing the first bucket index with a hash of the fingerprint stored at that position. If you are worried about still having to compute the fingerprint hash and the position hash separately, a single 64-bit hash suffices, with the high 32 bits used for the position and the low 32 bits for the fingerprint.
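
A minimal sketch of this trick follows. The FNV hash, the function names and the multiplicative constant (the MurmurHash2 multiplier, used here as a cheap hash of the fingerprint) are choices made for illustration, and `numBuckets` is assumed to be a power of two.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hash64(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

// split takes the bucket index from the high 32 bits of one 64-bit hash
// and an f-bit fingerprint from the low 32 bits.
func split(item []byte, numBuckets uint32, f uint) (index, fp uint32) {
	h := hash64(item)
	index = uint32(h>>32) & (numBuckets - 1)
	fp = uint32(h) & (1<<f - 1)
	if fp == 0 {
		fp = 1 // keep 0 free to mark empty slots
	}
	return index, fp
}

// altIndex returns the other candidate bucket. Applying it twice with the
// same fingerprint returns the original index, so no second hash of the
// item is ever needed.
func altIndex(index, fp, numBuckets uint32) uint32 {
	return (index ^ (fp * 0x5bd1e995)) & (numBuckets - 1)
}

func main() {
	const numBuckets, f = 1 << 10, 9 // illustrative values
	i1, fp := split([]byte("hello"), numBuckets, f)
	i2 := altIndex(i1, fp, numBuckets)
	fmt.Println(i1, i2, fp, altIndex(i2, fp, numBuckets) == i1)
}
```

Because the alternate index depends only on the current index and the fingerprint, an item can be relocated during a kick without access to the original key.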

Why can semi-sorted buckets only be used when b=4?

The essence of semi-sorting is to take four bits of each fingerprint. Four bits can be written as one hexadecimal digit, so the four-bit parts of the b fingerprints in a bucket form b hexadecimal digits. The order of fingerprints within a bucket does not matter, so the digits can be sorted. After listing all possible sorted combinations in order, we store only the index of the combination and recover the actual values by looking that index up.

We can count all the possible combinations with the following functions

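    import (
        "fmt"
        "math"
    )

    // getNum counts the non-decreasing sequences of b values, each f bits wide.
    // k is the current position and base is the smallest value still allowed.
    func getNum(base, k, b, f int, cnt *int) {
        for i := base; i < 1<<f; i++ {
            if k+1 < b {
                getNum(i, k+1, b, f, cnt)
            } else {
                *cnt++
            }
        }
    }

    // getNextPow2 rounds n up to the next power of two.
    func getNextPow2(n uint64) uint {
        n--
        n |= n >> 1
        n |= n >> 2
        n |= n >> 4
        n |= n >> 8
        n |= n >> 16
        n |= n >> 32
        n++
        return uint(n)
    }

    // getNumOfKindAndBit prints the number of combinations and the bits needed to index them.
    func getNumOfKindAndBit(b, f int) {
        cnt := 0
        getNum(0, 0, b, f, &cnt)
        fmt.Printf("Num of kinds: %v, Num of needed bits: %v\n", cnt, math.Log2(float64(getNextPow2(uint64(cnt)))))
    }

When b=4 (with f=4 bits taken from each fingerprint), there are 3876 combinations in total, which is less than 4096, so 12 bits are enough to store every combination index. Storing the four 4-bit values directly takes 4×4=16 bits, so this saves 4 bits, that is, one bit per fingerprint.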
Note that when b is 2, the same number of bits is needed whether or not semi-sorting is enabled, so it is pointless. If b is too large, the index table that has to be stored also grows rapidly, which badly hurts query performance. Therefore b=4 is the most cost-effective option.
In addition, four bits of each fingerprint are chosen for the encoding because four bits are exactly one hexadecimal digit, which is convenient to store.
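
The counts for different bucket sizes can also be cross-checked with a closed form: a sorted bucket is a multiset of b values drawn from $2^f$ possibilities, so there are $\binom{2^f+b-1}{b}$ of them. The sketch below is an independent check with hypothetical function names, not part of the library.

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

// kinds returns C(2^f + b - 1, b), the number of distinct sorted buckets
// holding b values of f bits each.
func kinds(b, f int) uint64 {
	n := int64(1)<<uint(f) + int64(b) - 1
	return new(big.Int).Binomial(n, int64(b)).Uint64()
}

// indexBits returns how many bits are needed to number all of those buckets.
func indexBits(b, f int) int {
	return bits.Len64(kinds(b, f) - 1)
}

func main() {
	for _, b := range []int{2, 4, 8} {
		fmt.Printf("b=%d: %d kinds, %d index bits vs %d direct bits\n",
			b, kinds(b, 4), indexBits(b, 4), b*4)
	}
}
```

For 4-bit values this gives 136 kinds (8 bits, no saving) for b=2, 3876 kinds (12 bits instead of 16) for b=4, and 490314 kinds (19 bits instead of 32) for b=8, where the lookup table becomes far larger.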

Parameter selection when using semi-sorting

When using semi-sorting, make sure that $\lceil b(f-1)/8 \rceil < \lceil bf/8 \rceil$; otherwise the space occupied is the same whether or not semi-sorting is used.
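
The condition is easy to test for a given pair of parameters. This small sketch assumes, as the formula implies, that each bucket is padded to a whole number of bytes; the function names are hypothetical.

```go
package main

import "fmt"

// bucketBytes rounds a bucket of the given bit length up to whole bytes.
func bucketBytes(bitsPerBucket int) int {
	return (bitsPerBucket + 7) / 8
}

// semiSortSavesSpace reports whether dropping one bit per fingerprint
// actually shrinks the byte-aligned bucket: ceil(b(f-1)/8) < ceil(bf/8).
func semiSortSavesSpace(b, f int) bool {
	return bucketBytes(b*(f-1)) < bucketBytes(b*f)
}

func main() {
	for f := 8; f <= 13; f++ {
		fmt.Printf("b=4 f=%d: semi-sorting saves space: %v\n", f, semiSortSavesSpace(4, f))
	}
}
```

With b=4, for instance, f=9 passes the check (4 bytes instead of 5 per bucket) while f=8 does not (4 bytes either way).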

Filter size selection

The total bucket size of the filter must be a power of 2, so when sizing a filter, try to make size/α close to, but no larger than, a power of two: $size/\alpha \lesssim 2^n$, where size is the amount of data you want the filter to hold. If necessary, choose smaller filters and use several of them to reach the target capacity.
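
To see why this matters, the sketch below rounds size/α up to the next power of two and reports how full the filter ends up. The helper names are hypothetical, and treating size/α as the number of fingerprint slots is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// nextPow2 rounds n up to the next power of two.
func nextPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len64(n-1)
}

// slotsFor returns the power-of-two capacity needed to hold size items
// at load factor alpha.
func slotsFor(size uint64, alpha float64) uint64 {
	return nextPow2(uint64(math.Ceil(float64(size) / alpha)))
}

func main() {
	for _, size := range []uint64{990000, 1000000} {
		slots := slotsFor(size, 0.95)
		fmt.Printf("size=%d: %d slots, actual load %.0f%%\n",
			size, slots, 100*float64(size)/float64(slots))
	}
}
```

With α=0.95, 990,000 items fit in $2^{20}$ slots, but 1,000,000 items need size/α ≈ 1,052,632 slots, which rounds up to $2^{21}$ and leaves the filter less than half full.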

Golang implementation

This part is mainly about the Golang library.
After looking through the Golang implementations of the cuckoo filter on GitHub, I found that the existing ones have some shortcomings:

Most libraries fix b and f, which fixes the false positive rate as well, so they adapt poorly
Every library sizes f in whole bytes, so it can only be adjusted in multiples of 8 bits, which makes tuning the false positive rate inconvenient
No library implements semi-sorted buckets, which greatly reduces the advantage over Bloom filters

Because my own scenario needs better space efficiency and a custom false positive rate, I ported the C++ implementation from the original paper and made some optimizations, mainly:

Support adjustment parameters
Support semi-sorted buckets
Compress space into compact bit array, store fingerprints bitwise
Support binary serialization
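
To illustrate what storing fingerprints bitwise in a compact array means, here is a minimal sketch of a bit-packed array. It is not the library's actual data structure: the type, its methods and the bit-by-bit loops are assumptions chosen for clarity rather than speed.

```go
package main

import "fmt"

// PackedArray stores fixed-width values back to back in a byte slice,
// so a 9-bit fingerprint costs 9 bits instead of 16.
type PackedArray struct {
	data  []byte
	width uint // bits per value, 1..32
}

// NewPackedArray allocates room for n values of the given bit width.
func NewPackedArray(n int, width uint) *PackedArray {
	return &PackedArray{data: make([]byte, (uint(n)*width+7)/8), width: width}
}

// Set writes the low width bits of v at position i, one bit at a time.
func (p *PackedArray) Set(i int, v uint32) {
	pos := uint(i) * p.width
	for k := uint(0); k < p.width; k++ {
		byteIdx, bit := (pos+k)/8, (pos+k)%8
		if v>>k&1 == 1 {
			p.data[byteIdx] |= 1 << bit
		} else {
			p.data[byteIdx] &^= 1 << bit
		}
	}
}

// Get reads back the value stored at position i.
func (p *PackedArray) Get(i int) uint32 {
	pos := uint(i) * p.width
	var v uint32
	for k := uint(0); k < p.width; k++ {
		if p.data[(pos+k)/8]>>((pos+k)%8)&1 == 1 {
			v |= 1 << k
		}
	}
	return v
}

func main() {
	p := NewPackedArray(10, 9) // ten 9-bit fingerprints
	p.Set(3, 0x1AB)
	fmt.Println(len(p.data), p.Get(3) == 0x1AB, p.Get(2), p.Get(4))
}
```

Ten 9-bit values occupy 12 bytes here instead of 20 with byte-aligned 16-bit storage, and because the whole table is a single byte slice, writing it out for binary serialization is straightforward.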

