The Bloom Filter was proposed by a guy named Bloom in 1970.
It can actually be viewed as a data structure consisting of a binary vector (or bit array) and a series of random mapping functions (hash functions).
Its advantage is that space efficiency and query time are much better than ordinary algorithms. Its disadvantage is that it has a certain misrecognition rate and difficulty in deletion.
Let’s take a picture first
Bloom filter algorithm The main idea is to use n hash functions to perform hashing to obtain different hash values. According to the hash, they are mapped to different index positions of the array (the length of this array may be very long), and then the corresponding index bits are The value on is set to 1.
To determine whether the element appears in the set is to use k different hash functions to calculate the hash value and see whether the value at the corresponding index position of the hash value is 1. If there is one that is not 1 , indicating that the element does not exist in the collection.
But it is also possible to judge that the element is in the set, but the element is not. The 1s above all index positions of this element are set by other elements, which leads to a certain probability of misjudgment (this is why the above is live The root cause may be in a collection, because there will be certain hash conflicts).
Note: The lower the false positive rate, the lower the corresponding performance will be.
The Bloom filter can be used to determine whether an element is (possibly) in a set, and compared to other data structures, the Bloom filter There are huge advantages in both space and time.
Note the word above: possible. There is a suspense reserved here, which will be analyzed in detail below.
Judge whether the given data exists
Prevent cache penetration (judge whether the requested data is valid to avoid directly bypassing the cache to request the database), etc., mailbox spam filtering, blacklist function, etc. wait.
After reading the algorithm idea of Bloom filter, let’s start to explain the specific implementation.
Let me first give an example. Suppose there are two strings, Wangcai and Xiaoqiang. They have been hashed three times respectively, and then the index of the corresponding array (assuming the array length is 16) is calculated based on the hash result. The value of the position is set to 1. Let’s first look at the phrase Wangcai:
After three hashes of Wangcai, the values are 2, 4, and 6 respectively. Then we can get The index values are 2, 4, and 6 respectively, so the values of the index (2, 4, 6) positions of the array are set to 1, and the rest are regarded as 0. Now suppose that you need to find Wangcai, and also go through these three hashes and then It is found that the values of the positions corresponding to indexes 2, 4, and 6 are all 1, then it can be judged that prosperous wealth may exist.
Then insert Xiaoqiang into the Bloom filter. The actual process is the same as above. Assume that the obtained subscripts are 1, 3, 5
Putting aside the existence of Wangcai, Xiaoqiang looks like this in the Bloom filter at this time. The actual array combining Wangcai and Xiaoqiang looks like this:
Now there is a data: 9527. The current requirement is to determine whether 9527 exists. Assume that the subscripts obtained by hashing 9527 three times are: 5, 6, and 7. It turns out that the value of the position with subscript 7 is 0, so it can be definitely judged that 9527 must not exist.
Then came another domestic 007. After three hashes, the subscripts obtained were: 2, 3, and 5. It turned out that the values corresponding to the subscripts 2, 3, and 5 were all 1, so it can be roughly It is judged that domestic 007 may exist. But in fact, after our demonstration just now, domestic 007 does not exist at all. The reason why the values of index positions 2, 3, and 5 are 1 is because of other data settings.
Speaking of which, I wonder if everyone understands the role of the Bloom filter.
As Java programmers, we are really happy. We use many frameworks and tools, and they are basically encapsulated. Bloom filters , we will use the tool class packaged by Google. Of course there are other methods that you can explore.
First add dependencies
<!--布隆过滤依赖--> <dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>25.1-jre</version> </dependency>
Code implementation
import com.google.common.hash.BloomFilter; import com.google.common.hash.Funnels; import java.nio.charset.Charset; public class BloomFilterDemo { public static void main(String[] args) { /** * 创建一个插入对象为一亿,误报率为0.01%的布隆过滤器 * 不存在一定不存在 * 存在不一定存在 * ---------------- * Funnel 对象:预估的元素个数,误判率 * mightContain :方法判断元素是否存在 */ BloomFilter<CharSequence> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charset.forName("utf-8")), 100000000, 0.0001); bloomFilter.put("死"); bloomFilter.put("磕"); bloomFilter.put("Redis"); System.out.println(bloomFilter.mightContain("Redis")); System.out.println(bloomFilter.mightContain("Java")); } }
The specific explanation has been written in the comments. By now I believe everyone must understand the Bloom filter and how to use it.
Let’s simulate this scenario: solving cache penetration through Bloom filters.
First of all, do you know what cache penetration is?
Cache penetration means that the user accesses a data that is not in the cache or the database. Because it does not exist in the cache, it will access the database if the concurrency is high. It is easy to defeat the database
So how does the Bloom filter solve this problem? he
的原理是这样子的:将数据库中所有的查询条件,放入布隆过滤器中,当一个查询请求过来时,先经过布隆过滤器进行查,如果判断请求查询值存在,则继续查;如果判断请求查询不存在,直接丢弃。
其代码如下:
String get(String key) { String value = redis.get(key); if (value == null) { if(!bloomfilter.mightContain(key)){ return null; }else{ value = db.get(key); redis.set(key, value); } } return value; }
The above is the detailed content of How to quickly determine whether an element is in a collection in java. For more information, please follow other related articles on the PHP Chinese website!