This article comes from the Go language tutorial column. I have implemented a more complete Golang version of the cuckoo filter, and I hope it helps anyone who needs one.
"Determining whether a value is in a huge set" (hereinafter collectively referred to as set membership testing) is a common data processing problem. In past experience, if a certain false positive rate is allowed, Bloom filters are the first choice, but now we have a better choice: cuckoo filters.
A recent project at work needed a filter. After some research, I found that the cuckoo filter is more cost-effective than the Bloom filter in our scenario.
To settle the final technology choice, I read the original paper. Once I had decided on the cuckoo filter, I found that there was almost no complete Golang implementation. The popular, high-star implementations on GitHub each have some flaws and do not maximize space utilization, so I ported and optimized a Golang library based on the original paper and its original C implementation. The details are below.
The code address is here, welcome to star, use, contribute and debug: github.com/linvon/cuckoo-filter
There are already many introductory articles about cuckoo filters on the Internet, so I won't repeat them at length here. I will only cover the key points needed for what follows.
If you want more detail, you can refer to the original paper, or check out my Chinese translation of it.
The cuckoo filter is a filter built on the cuckoo hashing algorithm. It is essentially a cuckoo hash table that stores hash values of the stored items.
If you know Bloom filters, you know that they apply several hash functions to each stored item, map the results to positions in a bit array, and check those bits at query time to decide whether the item exists.
The cuckoo filter hashes the stored item, takes a fixed number of bits from the hash value, and stores them in an array. At query time, it checks whether the same bits are present in the array.
Both store hash values, and both essentially keep a few bits of hash per item. So why is the cuckoo filter better?
First, because the cuckoo hash table is more compact, it can save more space.
Second, a Bloom filter query computes several hash functions, while the cuckoo filter needs only one hash, so queries are very efficient.
Third, the cuckoo filter supports deletion, while the Bloom filter does not.
The advantages are there, but what are the disadvantages? Compared with the Bloom filter
With the advantages and disadvantages listed, let's summarize. Set membership testing is usually read-heavy and write-light, and inserting the same item repeatedly is pointless. Although deletion in the cuckoo filter is not perfect, it is better than nothing, and the filter also offers better query and storage efficiency. In most cases it is the more cost-effective choice.
Let's start with the structure of a cuckoo filter. The filter consists of many buckets, and each bucket stores hashed values of the inserted items, keeping only a fixed number of bits of each.
There are n buckets in the filter, and n is calculated from the number of items to be stored. A hash function tells us which bucket an item belongs in, and each additional hash function gives the item one more candidate bucket. When an insertion collides with an item that is already stored, the stored item is kicked out into its candidate bucket. In theory, more hash functions give higher space utilization, but in actual tests k=2 hash functions already reach a utilization rate of 98%.
Each bucket stores multiple fingerprints, up to the bucket size, and different fingerprints may map to the same bucket. The larger the bucket, the higher the space utilization, but each query also scans more fingerprints in the same bucket, so false positives become more likely. To keep the false positive rate unchanged, the number of bits stored per fingerprint must then be increased to lower the collision rate.
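The bucket structure and the kick-out step described above can be sketched roughly as follows. This is a minimal illustration and not the code of the library discussed in this article: the names Filter, Insert, Contains and Delete, the 8-bit fingerprint, b=4, the FNV hash, the multiplier used in altIndex and the limit of 500 kicks are all assumptions made for the example. The second candidate bucket is derived with the XOR method from the paper.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const bucketSize = 4 // b: fingerprints per bucket (assumed)
const maxKicks = 500 // relocation limit (assumed)

// Filter stores 8-bit fingerprints; 0 marks an empty slot.
type Filter struct {
	buckets [][bucketSize]uint8
	mask    uint32
}

// NewFilter expects numBuckets to be a power of two.
func NewFilter(numBuckets uint32) *Filter {
	return &Filter{buckets: make([][bucketSize]uint8, numBuckets), mask: numBuckets - 1}
}

// locate returns the first candidate bucket and the fingerprint of item.
func (f *Filter) locate(item []byte) (uint32, uint8) {
	h := fnv.New64a()
	h.Write(item)
	sum := h.Sum64()
	fp := uint8(sum)
	if fp == 0 {
		fp = 1 // 0 is reserved for "empty"
	}
	return uint32(sum>>32) & f.mask, fp
}

// altIndex maps either candidate bucket of fp to the other one.
func (f *Filter) altIndex(i uint32, fp uint8) uint32 {
	return (i ^ (uint32(fp) * 0x5bd1e995)) & f.mask
}

// slot returns the position of fp in bucket i, or -1 if it is absent.
func (f *Filter) slot(i uint32, fp uint8) int {
	for s, v := range f.buckets[i] {
		if v == fp {
			return s
		}
	}
	return -1
}

// put writes fp into a free slot of bucket i, if there is one.
func (f *Filter) put(i uint32, fp uint8) bool {
	s := f.slot(i, 0)
	if s >= 0 {
		f.buckets[i][s] = fp
	}
	return s >= 0
}

// Insert tries both candidate buckets, then relocates stored fingerprints.
func (f *Filter) Insert(item []byte) bool {
	i, fp := f.locate(item)
	if f.put(i, fp) || f.put(f.altIndex(i, fp), fp) {
		return true
	}
	for n := 0; n < maxKicks; n++ {
		s := rand.Intn(bucketSize)
		fp, f.buckets[i][s] = f.buckets[i][s], fp // kick out a random victim
		i = f.altIndex(i, fp)                     // the victim's other bucket
		if f.put(i, fp) {
			return true
		}
	}
	return false // too full; this sketch simply drops the last victim
}

// Contains checks both candidate buckets for the item's fingerprint.
func (f *Filter) Contains(item []byte) bool {
	i, fp := f.locate(item)
	return f.slot(i, fp) >= 0 || f.slot(f.altIndex(i, fp), fp) >= 0
}

// Delete clears one stored copy of the item's fingerprint.
func (f *Filter) Delete(item []byte) bool {
	i, fp := f.locate(item)
	if f.slot(i, fp) < 0 {
		i = f.altIndex(i, fp)
	}
	s := f.slot(i, fp)
	if s >= 0 {
		f.buckets[i][s] = 0
	}
	return s >= 0
}

func main() {
	f := NewFilter(1024)
	f.Insert([]byte("apple"))
	fmt.Println(f.Contains([]byte("apple"))) // true
	f.Delete([]byte("apple"))
	fmt.Println(f.Contains([]byte("apple"))) // false
}
```

A real implementation would keep the last kicked-out fingerprint instead of dropping it when Insert fails, and would not fix the fingerprint at one byte.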
In the paper, several parameters required to implement the cuckoo filter are mentioned, mainly
Read the paper in detail. In Chapter 5, the author relies on experimental data to tell us how to choose the most appropriate construction parameters. We can get the following conclusion
Based on the above theoretical basis, the relevant experimental data obtained are:
In this way we can decide how to choose the parameters when constructing our cuckoo filter:
First, we use two hash functions, which is enough to achieve sufficient space utilization. The false positive rate we need determines the bucket size b, although the choice of b is not absolute: even if r>0.002, you can still use b=4 in order to enable semi-sorted buckets. We can then calculate, from b, the fingerprint size f needed to reach the target false positive rate, and with that all filter parameters are determined.
Comparing the above conclusion with the $1.44\log_2(1/r)$ bits per item of the Bloom filter, we find that with semi-sorting enabled the cuckoo filter uses less space when r<0.03. Without semi-sorting, the threshold drops to about r<0.003.
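To make this comparison concrete, here is a small sketch. It assumes the bound from the original paper that a fingerprint needs about $\log_2(2b/r)$ bits, a load factor of 0.95 (an assumed round value), and a saving of one bit per fingerprint from semi-sorting. The function names are made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// fingerprintBits returns the smallest whole number f with 2b/2^f <= r.
func fingerprintBits(b int, r float64) int {
	return int(math.Ceil(math.Log2(2 * float64(b) / r)))
}

// cuckooBitsPerItem estimates the space per stored item at load factor alpha,
// ignoring the rounding of f to a whole number of bits.
func cuckooBitsPerItem(b int, r, alpha float64, semiSort bool) float64 {
	bits := math.Log2(2 * float64(b) / r)
	if semiSort {
		bits-- // semi-sorting saves one bit per fingerprint (b=4)
	}
	return bits / alpha
}

// bloomBitsPerItem is the space-optimal Bloom filter cost quoted in the text.
func bloomBitsPerItem(r float64) float64 {
	return 1.44 * math.Log2(1/r)
}

func main() {
	const b, alpha = 4, 0.95 // alpha is an assumption
	for _, r := range []float64{0.1, 0.01, 0.001} {
		fmt.Printf("r=%v f=%d cuckoo=%.2f semi-sorted=%.2f bloom=%.2f\n",
			r, fingerprintBits(b, r),
			cuckooBitsPerItem(b, r, alpha, false),
			cuckooBitsPerItem(b, r, alpha, true),
			bloomBitsPerItem(r))
	}
}
```

Under these assumptions, the semi-sorted filter is already smaller than the Bloom filter at r=0.01, while the plain cuckoo filter only becomes smaller somewhere between r=0.01 and r=0.001, which agrees with the thresholds above.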
Optimization of hash algorithm
Although we said that two hash functions are required, in practice one is enough, because the paper describes a way to compute the alternate bucket: the second bucket index can be obtained by XORing the first bucket index with a hash of the fingerprint stored at that location. If you are worried about still having to hash the item twice, once for the fingerprint and once for the location, a single 64-bit hash will do: the high 32 bits are used to compute the location and the low 32 bits are used to compute the fingerprint.
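A short sketch of this idea follows. The FNV-1a hash, the function names split and altIndex, and the multiplier used to scramble the fingerprint are assumptions for illustration. The number of buckets is assumed to be a power of two so that masking works, and a fingerprint of 0 is remapped to 1 on the assumption that 0 marks an empty slot.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// split hashes the item once: the high 32 bits select the bucket,
// the low 32 bits (truncated to fpBits) become the fingerprint.
func split(item []byte, numBuckets uint32, fpBits uint) (index, fp uint32) {
	h := fnv.New64a()
	h.Write(item)
	sum := h.Sum64()
	index = uint32(sum>>32) & (numBuckets - 1)
	fp = uint32(sum) & (1<<fpBits - 1)
	if fp == 0 {
		fp = 1 // assumed convention: 0 means "empty slot"
	}
	return index, fp
}

// altIndex XORs a bucket index with a scrambled fingerprint.
// Applying it twice returns the original index.
func altIndex(index, fp, numBuckets uint32) uint32 {
	return (index ^ (fp * 0x5bd1e995)) & (numBuckets - 1)
}

func main() {
	const numBuckets = 1 << 16
	i1, fp := split([]byte("hello"), numBuckets, 12)
	i2 := altIndex(i1, fp, numBuckets)
	fmt.Println(i1, i2, fp, altIndex(i2, fp, numBuckets) == i1)
}
```

Because the alternate index depends only on the current index and the fingerprint, a fingerprint can be moved between its two buckets without access to the original item, which is what makes kicking possible.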
Why can semi-sorted buckets only be used when b=4?
The essence of semi-sorting is to take four bits of each fingerprint, which can be written as one hexadecimal digit. The four-bit parts of the b fingerprints in a bucket then form b hexadecimal digits. Because the order of fingerprints within a bucket does not matter, we can sort these digits, enumerate all possible sorted combinations in order, and store only the index of the combination; looking up that index recovers the actual stored values.
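We can count all possible combinations with the following functions (they need the fmt and math packages):

    import (
        "fmt"
        "math"
    )

    // getNum counts the non-decreasing sequences of b values, each f bits wide.
    // base is the smallest value allowed at position k.
    func getNum(base, k, b, f int, cnt *int) {
        for i := base; i < 1<<f; i++ {
            if k+1 < b {
                getNum(i, k+1, b, f, cnt)
            } else {
                *cnt++
            }
        }
    }

    // getNextPow2 rounds n up to the next power of two.
    func getNextPow2(n uint64) uint {
        n--
        n |= n >> 1
        n |= n >> 2
        n |= n >> 4
        n |= n >> 8
        n |= n >> 16
        n |= n >> 32
        n++
        return uint(n)
    }

    // getNumOfKindAndBit prints the number of combinations and the bits needed to index them.
    func getNumOfKindAndBit(b, f int) {
        cnt := 0
        getNum(0, 0, b, f, &cnt)
        fmt.Printf("Num of kinds: %v, Num of needed bits: %v\n", cnt, math.Log2(float64(getNextPow2(uint64(cnt)))))
    }

When b=4, there are 3876 possible combinations in total, which is less than 4096, so 12 bits are enough to store every combination index. Storing the four-bit parts directly would need 4×4 = 16 bits, so this saves 4 bits per bucket, that is, one bit per fingerprint.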
It turns out that when b is 2, the number of bits needed is the same with or without semi-sorting, so it is pointless. If b is too large, the index table that must be stored also grows rapidly, which costs a lot of query performance. Therefore, b=4 is the most cost-effective option.
In addition, four bits of each fingerprint are chosen for encoding because they can be written as a single hexadecimal digit, which is convenient to store.
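The encoding itself can be sketched as a lookup table, as below. This is only an illustration under assumptions: the function names are made up, the table is searched with a binary search rather than whatever the library actually does, and the inputs are assumed to be four values below 16.

```go
package main

import (
	"fmt"
	"sort"
)

// buildTable lists every non-decreasing sequence of four 4-bit values,
// packed into a uint16, in ascending order.
func buildTable() []uint16 {
	var table []uint16
	for a := 0; a < 16; a++ {
		for b := a; b < 16; b++ {
			for c := b; c < 16; c++ {
				for d := c; d < 16; d++ {
					table = append(table, uint16(a<<12|b<<8|c<<4|d))
				}
			}
		}
	}
	return table
}

// encode sorts the four nibbles and returns the table index of the result.
func encode(table []uint16, nibbles [4]uint8) uint16 {
	sort.Slice(nibbles[:], func(i, j int) bool { return nibbles[i] < nibbles[j] })
	packed := uint16(nibbles[0])<<12 | uint16(nibbles[1])<<8 |
		uint16(nibbles[2])<<4 | uint16(nibbles[3])
	return uint16(sort.Search(len(table), func(i int) bool { return table[i] >= packed }))
}

// decode turns a table index back into the four sorted nibbles.
func decode(table []uint16, index uint16) [4]uint8 {
	p := table[index]
	return [4]uint8{uint8(p >> 12), uint8(p >> 8 & 0xF), uint8(p >> 4 & 0xF), uint8(p & 0xF)}
}

func main() {
	table := buildTable()
	idx := encode(table, [4]uint8{9, 1, 15, 1})
	fmt.Println(len(table), idx, decode(table, idx))
}
```

The table has 3876 entries, so every index fits in 12 bits, while the four nibbles stored directly would take 16 bits. Decoding returns the nibbles in sorted order, which is fine because the position of a fingerprint inside its bucket carries no information.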
Parameter selection when using semi-sorting
When using semi-sorting, you should ensure that $\lceil b(f-1)/8 \rceil < \lceil bf/8 \rceil$, otherwise the space occupied will be the same whether you use semi-sorting or not.
Filter size selection
The total number of buckets in the filter must be a power of 2, so when setting the filter size, try to make $size/\alpha$ equal to or slightly less than $2^n$, where size is the amount of data you want one filter to store and α is the load factor. If necessary, choose a smaller filter and use several filters together to reach the target capacity.
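Both sizing rules can be checked with a few lines of code. The sketch below is an assumption about how one might do the arithmetic, not the library's actual sizing code: semiSortSavesSpace tests the byte-rounding condition for one bucket, and numBuckets rounds the required bucket count up to a power of two for a given load factor.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// semiSortSavesSpace reports whether a bucket of b fingerprints of f bits
// occupies fewer whole bytes once semi-sorting removes one bit per fingerprint.
func semiSortSavesSpace(b, f int) bool {
	return (b*(f-1)+7)/8 < (b*f+7)/8
}

// nextPow2 rounds n up to the nearest power of two.
func nextPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len64(n-1)
}

// numBuckets returns a power-of-two bucket count that can hold size items
// at load factor alpha with b fingerprints per bucket.
func numBuckets(size, b int, alpha float64) uint64 {
	slots := uint64(math.Ceil(float64(size) / alpha))
	return nextPow2((slots + uint64(b) - 1) / uint64(b))
}

func main() {
	fmt.Println(semiSortSavesSpace(4, 9), semiSortSavesSpace(4, 8))
	for _, size := range []int{990000, 1000000} {
		n := numBuckets(size, 4, 0.95)
		load := float64(size) / float64(n*4)
		fmt.Printf("size=%d buckets=%d load=%.2f\n", size, n, load)
	}
}
```

With b=4 and an assumed α of 0.95, 990,000 items fit in 262,144 buckets, but 1,000,000 items push the count to 524,288 buckets and leave the table about half empty. This is why a size just below the power-of-two boundary, or several smaller filters, is the better choice.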
Golang implementation
This part is mainly related to the Golang library
After looking through the golang implementation of cuckoofilter on Github, I found that the existing implementations have some shortcomings:
- Most libraries fix b and f, so the false positive rate is fixed as well and the library adapts poorly
- In all libraries f is measured in bytes, so it can only be adjusted in multiples of 8 bits, which makes it inconvenient to tune the false positive rate
- No library implements semi-sorted buckets, which greatly reduces the advantage over Bloom filters
Because my own scenario requires better space efficiency and a custom false positive rate, I ported the C implementation of the original paper and made some optimizations, mainly including:
- Support for adjustable parameters
- Support for semi-sorted buckets
- Space compressed into a compact bit array, with fingerprints stored bitwise
- Support for binary serialization