The Bloom filter is a very space-efficient probabilistic data structure. It uses a bit array (BitSet) to represent a set: a fixed number of hash functions map each element to positions in the bit array, and those positions are used to check whether an element belongs to the set.
For each element, multiple hash functions generate multiple hash values, and the corresponding bits in the bit array are set to 1. On lookup, if the bits for all of the hash values are 1, the element may be in the set; if at least one of them is 0, the element is definitely not in the set. This allows efficient lookups in a small amount of space, at the cost of a nonzero false positive rate.
A typical Bloom filter is characterized by three parameters: the size of the bit array (the number of bits, not the number of stored elements); the number of hash functions; and the fill factor, that is, the ratio of the number of inserted elements to the size of the bit array. Together these determine the false positive rate.
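To make the relationship between these parameters concrete, the following is a minimal sketch that estimates the false positive rate with the standard approximation p ≈ (1 - e^(-kn/m))^k. The symbols m (bits in the array), k (number of hash functions), and n (inserted elements), as well as the class and method names, are introduced here for illustration and do not come from the original article.

```java
// Sketch: estimate the false positive rate of a Bloom filter.
// Uses the standard approximation p ~ (1 - e^(-k*n/m))^k, which assumes
// the hash functions spread elements uniformly over the bit array.
final class FalsePositiveEstimate {

    // m = bits in the array, k = number of hash functions, n = elements inserted
    static double estimate(long m, int k, long n) {
        double exponent = -((double) k * n) / m;
        return Math.pow(1.0 - Math.exp(exponent), k);
    }

    public static void main(String[] args) {
        long n = 1_000_000; // illustrative element count
        int k = 7;          // illustrative number of hash functions
        for (long bitsPerElement : new long[] {4, 8, 10, 16}) {
            long m = bitsPerElement * n;
            System.out.printf("m/n = %2d bits, k = %d -> p ~ %.5f%n",
                    bitsPerElement, k, estimate(m, k, n));
        }
    }
}
```

Running it prints the estimate for a few bits-per-element ratios, which shows how quickly the rate drops as the bit array grows relative to the number of elements.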
As shown in the figure above, the basic operations of a Bloom filter are initializing the bit array and hash functions, inserting elements, and checking whether an element is in the set. Each element is mapped to multiple positions in the bit array by multiple hash functions. When checking membership, all of the corresponding bits must be 1 before the element is considered to be possibly in the set.
Spam filtering: Set the positions corresponding to the hash values of all blacklisted email addresses to 1 in the Bloom filter. For each new email, check whether the positions for its hash values are all 1. If so, the email is probably spam (subject to false positives); otherwise the sender is definitely not on the blacklist and the email can be treated as normal;
URL deduplication: Set the positions corresponding to the hash values of each crawled URL to 1 in the Bloom filter. For each new URL, check whether the positions for its hash values are all 1. If so, the URL has probably been crawled already; otherwise it definitely has not and needs to be crawled;
Cache breakdown: Set the positions corresponding to the hash values of all data existing in the cache to 1 in the Bloom filter. For each queried key, check whether the positions for its hash values are all 1. If so, the key probably exists in the cache; otherwise it is definitely not cached, so it needs to be queried from the database and added to the cache.
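The URL deduplication scenario above can be sketched as follows. This is a minimal illustration, not a production crawler component: the class name `UrlDeduplicator`, the method `markIfNew`, the bit array size, and the choice of `Arrays.hashCode` plus an FNV-1a hash combined by double hashing are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Sketch of URL deduplication with a small Bloom filter (all names are illustrative).
final class UrlDeduplicator {
    private final BitSet bits;
    private final int size;      // number of bits in the array
    private final int numHashes; // number of positions checked per URL

    UrlDeduplicator(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Returns true if the URL is definitely new (and records it);
    // returns false if it has probably been seen before.
    boolean markIfNew(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data) | 1; // odd step so the probe positions differ
        boolean isNew = false;
        for (int i = 0; i < numHashes; i++) {
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                isNew = true; // at least one bit was 0: never inserted before
                bits.set(index);
            }
        }
        return isNew;
    }

    // 32-bit FNV-1a hash, used here as the second hash function
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        UrlDeduplicator seen = new UrlDeduplicator(1 << 16, 5);
        String[] frontier = {
            "https://example.com/a", "https://example.com/b", "https://example.com/a"
        };
        for (String url : frontier) {
            System.out.println(url + " -> "
                    + (seen.markIfNew(url) ? "crawl" : "skip (probably seen)"));
        }
    }
}
```

Combining the check and the insert in one pass avoids hashing each URL twice. The trade-off is that a false positive causes a URL that was never crawled to be skipped, so the bit array should be sized generously for the expected number of URLs.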
Note that the false positive rate of a Bloom filter decreases as the bit array grows, but a larger array also increases memory overhead, and using more hash functions increases the computation needed for each operation. To make Bloom filters easier to understand, the following Java code implements a simple one:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bitSet;        // bit set that records the hashed positions
    private final int bitSetSize;       // size of the bit set in bits
    private final int numHashFunctions; // number of hash functions

    // Constructor: computes the bit set size and the number of hash functions
    // from the expected number of elements and the target false positive rate
    public BloomFilter(int expectedNumItems, double falsePositiveRate) {
        this.bitSetSize = optimalBitSetSize(expectedNumItems, falsePositiveRate);
        this.numHashFunctions = optimalNumHashFunctions(expectedNumItems, bitSetSize);
        this.bitSet = new BitSet(bitSetSize);
    }

    // Computes the optimal bit set size from the expected number of elements
    // and the false positive rate: m = ceil(-n * ln(p) / (ln 2)^2)
    private int optimalBitSetSize(int expectedNumItems, double falsePositiveRate) {
        return (int) Math.ceil(expectedNumItems * (-Math.log(falsePositiveRate) / Math.pow(Math.log(2), 2)));
    }

    // Computes the optimal number of hash functions from the expected number of
    // elements and the bit set size: k = ceil((m / n) * ln 2).
    // The cast to double avoids integer division truncating m / n.
    private int optimalNumHashFunctions(int expectedNumItems, int bitSetSize) {
        return (int) Math.ceil(((double) bitSetSize / expectedNumItems) * Math.log(2));
    }

    // Adds an element to the Bloom filter
    public void add(String item) {
        // Compute the hash values
        int[] hashes = createHashes(item.getBytes(StandardCharsets.UTF_8), numHashFunctions);
        // Set the bit for each hash value to true
        for (int hash : hashes) {
            bitSet.set(Math.abs(hash % bitSetSize), true);
        }
    }

    // Checks whether an element may be present in the Bloom filter
    public boolean contains(String item) {
        // Compute the hash values
        int[] hashes = createHashes(item.getBytes(StandardCharsets.UTF_8), numHashFunctions);
        // Check that the bit for every hash value is true
        for (int hash : hashes) {
            if (!bitSet.get(Math.abs(hash % bitSetSize))) {
                return false;
            }
        }
        return true;
    }

    // Computes the hash values for the given data. The values are derived from
    // two deterministic hashes of the data (double hashing), so the same item
    // always maps to the same positions in add() and contains().
    private int[] createHashes(byte[] data, int numHashes) {
        int[] hashes = new int[numHashes];
        int hash1 = Arrays.hashCode(data);
        int hash2 = 0x811C9DC5; // 32-bit FNV-1a hash of the data
        for (byte b : data) {
            hash2 ^= (b & 0xff);
            hash2 *= 16777619;
        }
        for (int i = 0; i < numHashes; i++) {
            // Combine the two hashes; int overflow simply wraps around
            hashes[i] = hash1 + i * hash2;
        }
        return hashes;
    }
}