Explore the power of Bloom Filters using Node.js and Redis

In the right use case, bloom filters look like magic. That's a bold statement, but in this tutorial we'll explore this strange data structure, how to best use it, and some practical examples using Redis and Node.js.

The Bloom filter is a probabilistic, one-way data structure. The word "filter" can be confusing in this context; filter means it's an active thing, a verb, but it might be easier to think of it as storage, a noun. With a simple bloom filter you can do two things:

Add an item.
Check if an item has not been added before.

These are important limitations to understand: you cannot remove items, nor can you list the items in a Bloom filter. Additionally, you cannot determine with certainty whether an item has been added to the filter in the past. This is where the probabilistic nature of Bloom filters comes into play: false positives are possible, but false negatives are not. If the filter is set up correctly, the chance of a false positive is very small.
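
To make the two operations concrete, here is a minimal in-memory sketch. It is not how bloom-redis is implemented: the TinyBloom name, the choice of deriving bit positions from a single MD5 digest, and the tiny 1024-bit size are all assumptions made for illustration.

```javascript
// Minimal in-memory Bloom filter sketch (illustration only, not bloom-redis).
var crypto = require('crypto');

function TinyBloom(size, numHashes) {
  this.size = size;           // number of bits in the filter
  this.numHashes = numHashes; // number of bit positions set per item
  this.bits = Buffer.alloc(Math.ceil(size / 8)); // every bit starts at 0
}

// Derive `numHashes` bit positions from one MD5 digest (double hashing).
TinyBloom.prototype.positions = function (item) {
  var digest = crypto.createHash('md5').update(String(item)).digest();
  var h1 = digest.readUInt32BE(0);
  var h2 = digest.readUInt32BE(4);
  var out = [];
  for (var i = 0; i < this.numHashes; i += 1) {
    out.push((h1 + i * h2) % this.size);
  }
  return out;
};

// Adding an item sets each of its bit positions to 1.
TinyBloom.prototype.add = function (item) {
  var bits = this.bits;
  this.positions(item).forEach(function (pos) {
    bits[pos >> 3] |= 1 << (pos & 7);
  });
};

// An item "may have been added" only if all of its bits are set.
TinyBloom.prototype.contains = function (item) {
  var bits = this.bits;
  return this.positions(item).every(function (pos) {
    return (bits[pos >> 3] & (1 << (pos & 7))) !== 0;
  });
};

var filter = new TinyBloom(1024, 3);
filter.add('kyle');
console.log(filter.contains('kyle'));       // true: an added item is always reported
console.log(filter.contains('k3w1d00d47')); // very likely false, but a false positive is possible
```

Notice that there is no remove or list method: once a bit is set, nothing records which item set it.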

Variants of Bloom filters exist that add additional functionality such as removal or scaling, but they also add complexity and limitations. Before moving on to variations, it is important to first understand a simple bloom filter. This article only introduces simple Bloom filters.

In exchange for these limitations, you get a number of benefits: fixed size, one-way hashing of the stored items, and fast lookups.

When you set up a Bloom filter, you need to specify a size for it. This size is fixed, so whether the filter holds one item or a billion, it will never grow beyond the specified size. As you add more items to the filter, the likelihood of a false positive increases. With a smaller filter, the false positive rate rises faster than it would with a larger one.
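
The relationship between size, item count and false positives can be estimated with the standard approximation (1 - e^(-k*n/m))^k, where m is the number of bits, k the number of hash functions and n the number of items added. The sketch below is only an estimate under that formula; the filter sizes and k = 7 are arbitrary values chosen for the comparison.

```javascript
// Approximate false positive rate for a filter of `m` bits using `k` hash
// functions after `n` items have been added: (1 - e^(-k*n/m))^k
function falsePositiveRate(m, k, n) {
  return Math.pow(1 - Math.exp(-k * n / m), k);
}

var k = 7; // arbitrary number of hash functions for this comparison
[1000, 5000, 10000].forEach(function (n) {
  console.log(
    'n=' + n,
    '| 10,000-bit filter:', falsePositiveRate(10000, k, n).toExponential(2),
    '| 100,000-bit filter:', falsePositiveRate(100000, k, n).toExponential(2)
  );
});
```

For every n, the smaller filter reports the higher rate, and both rates climb as n grows.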

Bloom filters are built on the concept of one-way hashing. Much like correctly storing passwords, Bloom filters use a hashing algorithm to determine the unique identifier of the item passed into it. A hash is essentially irreversible and is represented by a seemingly random string of characters. Therefore, if someone gains access to a bloom filter, it will not directly reveal anything.

Finally, Bloom filters are fast. Checking a filter involves far fewer comparisons than other methods, and the filter is small enough to be kept in memory, avoiding performance-impacting database hits.

Now that you understand the limitations and advantages of Bloom filters, let's look at some situations where they can be used.

Setup

We will illustrate Bloom filters using Redis and Node.js. Redis is the storage medium for Bloom filters; it's fast, in-memory, and has specific commands (GETBIT, SETBIT) that make implementation more efficient. I assume you have Node.js, npm, and Redis installed on your system. Your Redis server should be running on the default port on localhost for our example to work properly.

In this tutorial, we will not implement a filter from scratch. Instead, we'll focus on practical uses of a pre-built module from npm: bloom-redis. bloom-redis has a very concise set of methods: add, contains, and clear.

As mentioned before, Bloom filters require a hashing algorithm to generate an item's unique identifier. bloom-redis uses the well-known MD5 algorithm, which works fine, although it is not an ideal fit for a Bloom filter (a bit slow, a bit overkill).

Unique username

Usernames, especially those that identify the user in a URL, need to be unique. If you build an application that allows users to change their username, you will likely want to make sure that a username has never been used before, to avoid confusion and username-related attacks.

Without Bloom filters, you would need to reference a table containing every username ever used, which can be prohibitively expensive at scale. Bloom filters allow you to add an item every time a user adopts a new name. When a user checks to see if a username is taken, all you need to do is check the Bloom filter. It can tell you with absolute certainty if the requested username has not been added previously. The filter may incorrectly report that a username has been taken when in fact it has not, but this merely errs on the side of caution and does no real harm (other than that a user may not be able to claim "k3w1d00d47").

To illustrate this, let's build a quick REST server using Express. First, create a package.json file and then run the following terminal commands.

npm install bloom-redis --save

npm install express --save

npm install redis --save

The default size for bloom-redis is 2 MB. This errs on the side of caution, but it's quite large. Setting the size of the Bloom filter is critical: too large and you waste memory, too small and the false positive rate will be too high. The math involved in determining the size is complex and beyond the scope of this tutorial, but luckily there is a Bloom filter size calculator that does the job without having to crack a textbook.
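
If you would rather compute the numbers than use a calculator, the usual sizing formulas are m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions, for n expected items and a target false positive probability p. The helper below is a small sketch of those formulas; the function name bloomParameters and the rounding choices (round the size up, round the hash count to the nearest integer) are assumptions.

```javascript
// Size a Bloom filter for `n` expected items and a target false positive
// probability `p`, using the standard formulas:
//   bits   m = -n * ln(p) / (ln 2)^2
//   hashes k = (m / n) * ln 2
function bloomParameters(n, p) {
  var size = Math.ceil(-n * Math.log(p) / (Math.LN2 * Math.LN2));
  var numHashes = Math.round((size / n) * Math.LN2);
  return { size: size, numHashes: numHashes };
}

// 100,000 usernames with a one-in-a-million false positive target
console.log(bloomParameters(100000, 1e-6)); // { size: 2875518, numHashes: 20 }
```

The size is measured in bits, so 2875518 bits works out to roughly 350 KB of storage in Redis.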

Now, create app.js as follows:

var
  Bloom         =   require('bloom-redis'),
  express       =   require('express'),
  redis         =   require('redis'),
  
  app,
  client,
  filter;

//setup our Express server
app = express();

//create the connection to Redis
client = redis.createClient();


filter = new Bloom.BloomFilter({ 
  client    : client, //make sure the Bloom module uses our newly created connection to Redis
  key       : 'username-bloom-filter', //the Redis key
  
  //calculated size of the Bloom filter.
  //This is where your size / probability trade-offs are made
  //http://hur.st/bloomfilter?n=100000&p=1.0E-6
  size      : 2875518, // ~350kb
  numHashes : 20
});

app.get('/check', function(req,res,next) {
  //check to make sure the query string has 'username'
  if (typeof req.query.username === 'undefined') {
    //skip this route, go to the next one - will result in a 404 / not found
    next('route');
  } else {
    filter.contains(
      req.query.username, // the username from the query string
      function(err, result) {
        if (err) {
          next(err); //if an error is encountered, send it to the client
        } else {
          res.send({
            username : req.query.username,
            //if the result is false, then we know the item has *not* been used
            //if the result is true, then we can assume that the item has been used
            status : result ? 'used' : 'free'
          });
        }
      }
    );
  }
});


app.get('/save',function(req,res,next) {
  if (typeof req.query.username === 'undefined') {
    next('route');
  } else {
    //first, we need to make sure that it's not yet in the filter
    filter.contains(req.query.username, function(err, result) {
      if (err) { next(err); } else {
        if (result) {
          //true result means it already exists, so tell the user
          res.send({ username : req.query.username, status : 'not-created' });
        } else {
          //we'll add the username passed in the query string to the filter
          filter.add(
            req.query.username, 
            function(err) {
              //The callback arguments to `add` provides no useful information, so we'll just check to make sure that no error was passed
              if (err) { next(err); } else {
                res.send({ 
                  username : req.query.username, status : 'created' 
                });
              }
            }
          );
        }
      }
    });
  }
});

app.listen(8010);

To run this server: node app.js. Point your browser to http://localhost:8010/check?username=kyle. The response should be: {"username":"kyle","status":"free"}.

Now, let's save that username by pointing your browser to http://localhost:8010/save?username=kyle. The response will be: {"username":"kyle","status":"created"}. If you return to http://localhost:8010/check?username=kyle, the response will be {"username":"kyle","status":"used"}. Similarly, going back to http://localhost:8010/save?username=kyle will result in {"username":"kyle","status":"not-created"}.

From the terminal you can see the size of the filter: redis-cli strlen username-bloom-filter.

Now, for one item, it should read 338622.

Now, go ahead and try to add more usernames using the /save route. You can try as many as you want.

If you check the size again, you may find that it has increased slightly, but not with every addition. Curious, right? Internally, the Bloom filter sets individual bits (1s and 0s) at different positions in the string stored at username-bloom-filter. These positions are not contiguous, so if you set a bit at index 0 and then set a bit at index 10,000, everything in between will be 0. For practical purposes, it's not important to understand the precise mechanics of each operation at first; just know that this is normal and that you will never store more in Redis than the size you specified.
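
The growth pattern is easier to see in isolation. The sketch below imitates, in memory, how a Redis string behaves under SETBIT: the string is extended with zero bytes just far enough to hold the highest bit written so far. No Redis connection is made; the setBit helper and the value buffer are stand-ins invented for this illustration.

```javascript
// In-memory imitation of how a Redis string grows under SETBIT.
// Illustration only: no Redis connection is made here.
var value = Buffer.alloc(0);

function setBit(offset) {
  var byteIndex = offset >> 3; // 8 bits per byte
  if (byteIndex >= value.length) {
    // pad with zero bytes up to the byte that holds this bit
    value = Buffer.concat([value, Buffer.alloc(byteIndex + 1 - value.length)]);
  }
  // Redis numbers bits starting from the most significant bit of each byte
  value[byteIndex] |= 0x80 >> (offset & 7);
}

setBit(0);
console.log(value.length); // 1
setBit(10000);
console.log(value.length); // 1251
setBit(5000);
console.log(value.length); // 1251 (no growth: the bit falls inside the existing string)
```

This is why strlen only changes when an item happens to set a bit beyond the current end of the string.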

Fresh content

Fresh content on a website keeps users coming back, so how do you show a user new content every time? Using a traditional database approach, you would add a new row to a table containing the user identifier and the story identifier, and then query that table when deciding whether to display a piece of content. As you might imagine, your database will grow very quickly, especially as both your users and your content grow.

In this case, the consequences of a false positive (e.g. not showing a piece of unseen content) are very small, making Bloom filters a viable option. At first glance, you might think that each user needs their own Bloom filter, but we'll use a simple concatenation of a user identifier and a content identifier, and then insert that string into our filter. This way we can use a single filter for all users.

In this example, let's build another basic Express server that displays content. Each time you access the route /show-content/any-username (any-username is any URL-safe value), a new piece of content will be displayed until the site runs out of content. In the example, the content is the first line of each of the top ten Project Gutenberg books.

We need to install another npm module. Run from terminal: npm install async --save

Your new app.js file:

var
  async         =   require('async'),
  Bloom         =   require('bloom-redis'),
  express       =   require('express'),
  redis         =   require('redis'),
  
  app,
  client,
  filter,
  
  // From Project Gutenberg - opening lines of the top 10 public domain ebooks
  // https://www.gutenberg.org/browse/scores/top
  openingLines = {
    'pride-and-prejudice' : 
      'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.',
    'alices-adventures-in-wonderland' : 
      'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it'
  };

If you pay careful attention to the round-trip time in your browser's developer tools, you will find that the more often you request the route with the same username, the longer it takes. While checking the filter takes a fixed amount of time, in this case we are checking for the presence of more and more items. Bloom filters are limited in what they can tell you, so you have to test for the presence of each item individually. In our example this is fairly cheap, but testing hundreds of items this way would be inefficient.
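
The lookup logic behind this behaviour can be sketched without Express or Redis. In the sketch below, a plain Set stands in for the Bloom filter so the code runs on its own (unlike a real filter, it never returns a false positive). The nextContent helper and the ':' separator between username and content identifier are assumptions for illustration.

```javascript
// Stand-in for the Bloom filter: a Set offers the same add/contains idea but
// has no false positives. It is used here only so the sketch runs without Redis.
var seen = new Set();
var filter = {
  add: function (item) { seen.add(item); },
  contains: function (item) { return seen.has(item); }
};

// Content identifiers, in display order (two keys from the listing above).
var contentIds = ['pride-and-prejudice', 'alices-adventures-in-wonderland'];

// Returns the next unseen content id for a user, or null when none is left.
function nextContent(username) {
  for (var i = 0; i < contentIds.length; i += 1) {
    // one filter entry per user/content pair, so a single filter serves everyone
    var key = username + ':' + contentIds[i];
    if (!filter.contains(key)) {
      filter.add(key);
      return contentIds[i];
    }
  }
  return null;
}

console.log(nextContent('kyle'));  // pride-and-prejudice
console.log(nextContent('kyle'));  // alices-adventures-in-wonderland
console.log(nextContent('kyle'));  // null
console.log(nextContent('alice')); // pride-and-prejudice
```

Each call walks the content list from the start, so a user who has seen more content triggers more filter checks per request, which is the slowdown described above.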

Outdated data

In this example, we will build a small Express server that does two things: accept new data via POST, and display the current data (using a GET request). When new data is POSTed to the server, the application checks whether it is present in the filter. If it isn't, we will add it to a set in Redis; otherwise we will return null. A GET request fetches the current data from Redis and sends it to the client.

This differs from the previous two situations in that false positives are not acceptable. We will use the Bloom filter as a first line of defense. Given the properties of Bloom filters, we can only be sure that something is not in the filter, so in that case we can go ahead and let the data in. If the Bloom filter reports that the data might be in the filter, we check against the actual data source.

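So, what do we gain? We gain the speed of not having to check the actual source every time. In situations where the data source is slow (external APIs, small databases, the middle of a flat file), the speed increase is really needed. To demonstrate the speed, we add a realistic delay of 150 ms in the example. We'll also use console.time / console.timeEnd to log the difference between a Bloom filter check and a non-Bloom filter check.

In this example, we'll also use an extremely constrained number of bits: just 1024. It will fill up quickly. As it fills, it will report more and more false positives, and you will see the response time increase as the false positive rate climbs.

This server uses the same modules as before, plus body-parser to read the POST body (npm install body-parser --save), so set the app.js file to: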

var
  async           =   require('async'),
  Bloom           =   require('bloom-redis'),
  bodyParser      =   require('body-parser'),
  express         =   require('express'),
  redis           =   require('redis'),
  
  app,
  client,
  filter,
  
  currentDataKey  = 'current-data',
  usedDataKey     = 'used-data';
  
app = express();
client = redis.createClient();

filter = new Bloom.BloomFilter({ 
  client    : client,
  key       : 'stale-bloom-filter',
  //for illustration purposes, this is a super small filter. It should fill up at around 500 items, so for a production load, you'd need something much larger!
  size      : 1024,
  numHashes : 20
});

app.post(
  '/',
  bodyParser.text(),
  function(req,res,next) {
    var
      used;
      
    console.log('POST -', req.body); //log the current data being posted
    console.time('post'); //start measuring the time it takes to complete our filter and conditional verification process
    
    //async.series is used to manage multiple asynchronous function calls.
    async.series([
      function(cb) {
        filter.contains(req.body, function(err,filterStatus) {
          if (err) { cb(err); } else {
            used = filterStatus;
            cb(err);
          }
        });
      },
      function(cb) {
        if (used === false) {
          //Bloom filters do not have false negatives, so we need no further verification
          cb(null);
        } else {
          //it *may* be in the filter, so we need to do a follow up check
          //for the purposes of the tutorial, we'll add a 150ms delay in here since Redis can be fast enough to make it difficult to measure and the delay will simulate a slow database or API call
          setTimeout(function() {
            console.log('possible false positive');
            client.sismember(usedDataKey, req.body, function(err, membership) {
              if (err) { cb(err); } else {
                //sismember returns 0 if a member is not part of the set and 1 if it is.
                //This transforms those results into booleans for consistent logic comparison
                used = membership === 0 ? false : true;
                cb(err);
              }
            });
          }, 150);
        }
      },
      function(cb) {
        if (used === false) {
          console.log('Adding to filter');
          filter.add(req.body,cb);
        } else {
          console.log('Skipped filter addition, [false] positive');
          cb(null);
        }
      },
      function(cb) {
        if (used === false) {
          client.multi()
            .set(currentDataKey,req.body) //unused data is set for easy access to the 'current-data' key
            .sadd(usedDataKey,req.body) //and added to a set for easy verification later
            .exec(cb); 
        } else {
          cb(null);
        }
      }
      ],
      function(err) { //final callback: runs after the last step, or as soon as a step fails
        if (err) { next(err); } else {
          console.timeEnd('post'); //logs the amount of time since the console.time call above
          res.send({ saved : !used }); //returns if the item was saved, true for fresh data, false for stale data.
        }
      }
    );
});

app.get('/',function(req,res,next) {
  //just return the fresh data
  client.get(currentDataKey, function(err,data) {
    if (err) { next(err); } else {
      res.send(data);
    }
  });
});

app.listen(8012);

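Since POSTing to a server with a browser can be tricky, let's use curl to test.

curl --data "Your data goes here" --header "Content-Type: text/plain" http://localhost:8012/

A quick bash script can be used to show what filling up the entire filter looks like: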

#!/bin/bash
for i in `seq 1 500`;
do
  curl --data "data $i" --header "Content-Type: text/plain" http://localhost:8012/
done

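Watching a filter fill up, or looking at a full one, is interesting. Since this one is small, you can easily view it with redis-cli. By running redis-cli get stale-bloom-filter from the terminal between adding items, you will see the individual bytes increase. A full filter will be \xff for every byte. At this point, the filter will always return positive.

Conclusion

Bloom filters aren't a panacea, but in the right situation, a Bloom filter can provide a fast, efficient complement to other data structures.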
