I believe concurrency is familiar to everyone. This article mainly introduces to you the relevant information about using async and enterproxy to control the number of concurrencies. The article introduces it in detail through sample code, which is of certain significance to everyone's study or work. The reference learning value, friends who need it, please study together below.
Let’s talk about concurrency and parallelism
Concurrency, in the operating system, means that several programs are started in a period of time Between running and completion, these programs are all running on the same processor, but only one program is running on the processor at any point in time.
Concurrency is something we often mention. Whether it is a web server or app, concurrency is everywhere. In the operating system, it refers to several programs in a period of time between starting and running, and these programs They all run on the same processor, and only one program is running on the processor at any point in time. Many websites have limits on the number of concurrent connections, so when requests are sent too quickly, the return value will be empty or an error will be reported. What's more, some websites may block your IP because you send too many concurrent connections and think you are making malicious requests.
Compared with concurrency, parallelism may be a lot unfamiliar. Parallelism refers to a group of programs executing at an independent and asynchronous speed, which is not equal to overlap in time (occurring at the same moment). Multiple CPU cores can be implemented by increasing the number of CPU cores. Programs (tasks) are carried out simultaneously. That's right, parallel tasks can be performed simultaneously
Use enterproxy to control the number of concurrencies
enterproxy is a tool that Pu Lingda made a major contribution to , bringing about a change in the thinking of event-based programming, using the event mechanism to decouple complex business logic, solving the criticism of the coupling of callback functions, turning serial waiting into parallel waiting, and improving execution efficiency in multi-asynchronous collaboration scenarios
How do we use enterproxy to control the number of concurrencies? Usually if we don't use enterproxy and homemade counters, if we grab three sources:
This deeply nested, serial way
var render = function (template, data) { _.template(template, data); }; $.get("template", function (template) { // something $.get("data", function (data) { // something $.get("l10n", function (l10n) { // something render(template, data, l10n); }); }); });
Remove this deep nesting in the past Method, our conventional way of writing is to maintain a counter by ourselves
(function(){ var count = 0; var result = {}; $.get('template',function(data){ result.data1 = data; count++; handle(); }) $.get('data',function(data){ result.data2 = data; count++; handle(); }) $.get('l10n',function(data){ result.data3 = data; count++; handle(); }) function handle(){ if(count === 3){ var html = fuck(result.data1,result.data2,result.data3); render(html); } } })();
Here, enterproxy can play the role of this counter. It helps you manage whether these asynchronous operations are completed. After completion, it will automatically call you to provide processing function, and pass the captured data as parameters
var ep = new enterproxy(); ep.all('data_event1','data_event2','data_event3',function(data1,data2,data3){ var html = fuck(data1,data2,data3); render(html); }) $.get('http:example1',function(data){ ep.emit('data_event1',data); }) $.get('http:example2',function(data){ ep.emit('data_event2',data); }) $.get('http:example3',function(data){ ep.emit('data_event3',data); })
enterproxy also provides APIs required for many other scenarios. You can learn this API by yourself enterproxy
Use async to control the number of concurrency
If we have 40 requests to make, many websites may think you are making a malicious request because you send too many concurrent connections. Block your IP.
So we always need to control the number of concurrency, and then slowly crawl these 40 links.
Use mapLimit in async to control the number of concurrency at one time to 5, and only capture 5 links at a time.
async.mapLimit(arr, 5, function (url, callback) { // something }, function (error, result) { console.log("result: ") console.log(result); })
We should first know what concurrency is, why we need to limit the number of concurrencies, and what solutions are available. Then you can go to the documentation to see how to use the API. The async documentation is a great way to learn these syntaxes.
Simulate a set of data, the data returned here is false, and the return delay is random.
var concurreyCount = 0; var fetchUrl = function(url,callback){ // delay 的值在 2000 以内,是个随机的整数 模拟延时 var delay = parseInt((Math.random()* 10000000) % 2000,10); concurreyCount++; console.log('现在并发数是 ' , concurreyCount , ' 正在抓取的是' , url , ' 耗时' + delay + '毫秒'); setTimeout(function(){ concurreyCount--; callback(null,url + ' html content'); },delay); } var urls = []; for(var i = 0;i<30;i++){ urls.push('http://datasource_' + i) }
Then we use async.mapLimit to crawl concurrently and obtain the results.
async.mapLimit(urls,5,function(url,callback){ fetchUrl(url,callbcak); },function(err,result){ console.log('result: '); console.log(result); })
The simulation is excerpted from alsotang
The following results are obtained after running the output
We found that the number of concurrency starts from 1, but increases to At 5 o'clock, it is no longer increasing. When there are tasks, continue to crawl, and the number of concurrent connections is always controlled at 5.
Complete the node simple crawler system
Because the concurrency controlled by eventproxy is used in the tutorial example of "node teaching is not guaranteed" by senior alsotang Quantity, let’s complete a simple node crawler that uses async to control the number of concurrencies.
The crawling target is the homepage of this site (manual face protection)
The first step, first we need to use the following modules:
url: used for url parsing, here url.resolve() is used to generate a legal domain name
async: a practical module that provides powerful functions and asynchronous JavaScript work
cheerio: A fast, flexible, jQuery core implementation specially customized for the server
superagent: A very convenient client request in nodejs Agent module
Install the dependent module through npm
The second step is to introduce the dependent module through require and determine the crawling object URL:
var url = require("url"); var async = require("async"); var cheerio = require("cheerio"); var superagent = require("superagent"); var baseUrl = 'http://www.chenqaq.com';
Step 3: Use superagent to request the target URL, and use cheerio to process baseUrl to get the target content URL, and save it in the array arr
superagent.get(baseUrl) .end(function (err, res) { if (err) { return console.error(err); } var arr = []; var $ = cheerio.load(res.text); // 下面和jQuery操作是一样一样的.. $(".post-list .post-title-link").each(function (idx, element) { $element = $(element); var _url = url.resolve(baseUrl, $element.attr("href")); arr.push(_url); }); // 验证得到的所有文章链接集合 output(arr); // 第四步:接下来遍历arr,解析每一个页面需要的信息 })
We need a function to verify the captured url object , it’s very simple. We only need a function to traverse arr and print it out:
function output(arr){ for(var i = 0;i<arr.length;i++){ console.log(arr[i]); } }
第四步:我们需要遍历得到的URL对象,解析每一个页面需要的信息。
这里就需要用到async控制并发数量,如果你上一步获取了一个庞大的arr数组,有多个url需要请求,如果同时发出多个请求,一些网站就可能会把你的行为当做恶意请求而封掉你的ip
async.mapLimit(arr,3,function(url,callback){ superagent.get(url) .end(function(err,mes){ if(err){ console.error(err); console.log('message info ' + JSON.stringify(mes)); } console.log('「fetch」' + url + ' successful!'); var $ = cheerio.load(mes.text); var jsonData = { title:$('.post-card-title').text().trim(), href: url, }; callback(null,jsonData); },function(error,results){ console.log('results '); console.log(results); }) })
得到上一步保存url地址的数组arr,限制最大并发数量为3,然后用一个回调函数处理 「该回调函数比较特殊,在iteratee方法中一定要调用该回调函数,有三种方式」
callback(null)
调用成功
callback(null,data)
调用成功,并且返回数据data追加到results
callback(data)
调用失败,不会再继续循环,直接到最后的callback
好了,到这里我们的node简易的小爬虫就完成了,来看看效果吧
嗨呀,首页数据好少,但是成功了呢。
参考资料
Node.js 包教不包会 - alsotang
enterproxy
async
async Documentation
上面是我整理给大家的,希望今后会对大家有帮助。
相关文章:
The above is the detailed content of How to control the number of concurrencies using async and enterproxy. For more information, please follow other related articles on the PHP Chinese website!