Home > Web Front-end > JS Tutorial > Use async and enterproxy to control the number of concurrencies

Use async and enterproxy to control the number of concurrencies

小云云
Release: 2018-01-03 09:06:48
Original
1480 people have browsed it

I believe that concurrency is familiar to everyone. This article mainly introduces to you the relevant information about using async and enterproxy to control the number of concurrency. The article introduces it in detail through sample code, which is of certain significance to everyone's study or work. Referring to the value of learning, friends in need can follow the editor to learn together.

Let’s talk about concurrency and parallelism

Concurrency, in the operating system, means that several programs are in a period of time from starting to running to completion, and this Several programs are running on the same processor, but only one program is running on the processor at any time.

Concurrency is something we often mention. Whether it is a web server or app, concurrency is everywhere. In the operating system, it refers to several programs in a period of time between starting and running, and these programs They all run on the same processor, and only one program is running on the processor at any point in time. Many websites have limits on the number of concurrent connections, so when requests are sent too quickly, the return value will be empty or an error will be reported. What's more, some websites may block your IP because you send too many concurrent connections and think you are making malicious requests.

Compared with concurrency, parallelism may be a lot unfamiliar. Parallelism refers to a group of programs executing at an independent and asynchronous speed, which is not equal to overlap in time (occurring at the same moment). Multiple CPU cores can be implemented by increasing the number of CPU cores. Programs (tasks) are carried out simultaneously. That's right, multitasking can be done in parallel

Use enterproxy to control the number of concurrencies

enterproxy is a tool that Pu Lingda contributed to, bringing a kind of event-based programming Changes in thinking, using the event mechanism to decouple complex business logic, solving the criticism of the coupling of callback functions, turning serial waiting into parallel waiting, improving execution efficiency in multi-asynchronous collaboration scenarios

How do we use enterproxy control Number of concurrencies? Usually if we don't use enterproxy and homemade counters, if we grab three sources:

This deeply nested, serial way

 var render = function (template, data) {
 _.template(template, data);
 };
$.get("template", function (template) {
 // something
 $.get("data", function (data) {
 // something
 $.get("l10n", function (l10n) {
 // something
 render(template, data, l10n);
 });
 });
});
Copy after login

Remove this deep nesting in the past Method, our conventional way of writing is to maintain a counter by ourselves

(function(){
 var count = 0;
 var result = {};
 
 $.get('template',function(data){
 result.data1 = data;
 count++;
 handle();
 })
 $.get('data',function(data){
 result.data2 = data;
 count++;
 handle();
 })
 $.get('l10n',function(data){
 result.data3 = data;
 count++;
 handle();
 })

 function handle(){
 if(count === 3){
  var html = fuck(result.data1,result.data2,result.data3);
  render(html);
 }
 }
})();
Copy after login

Here, enterproxy can play the role of this counter. It helps you manage whether these asynchronous operations are completed. After completion, it will automatically call you to provide processing function, and pass the captured data as parameters

var ep = new enterproxy();
ep.all('data_event1','data_event2','data_event3',function(data1,data2,data3){
 var html = fuck(data1,data2,data3);
 render(html);
})

$.get('http:example1',function(data){
 ep.emit('data_event1',data);
})

$.get('http:example2',function(data){
 ep.emit('data_event2',data);
})

$.get('http:example3',function(data){
 ep.emit('data_event3',data);
})
Copy after login

enterproxy also provides APIs required for many other scenarios, you can learn this API by yourself enterproxy

Use async control Number of concurrent connections

If we have 40 requests to send, many websites may block your IP because you send too many concurrent connections and think you are making malicious requests.
So we always need to control the number of concurrencies, and then slowly crawl these 40 links.

Use mapLimit in async to control the number of concurrency at one time to 5, and only capture 5 links at a time.

 async.mapLimit(arr, 5, function (url, callback) {
 // something
 }, function (error, result) {
 console.log("result: ")
 console.log(result);
 })
Copy after login

We should first know what concurrency is, why we need to limit the number of concurrencies, and what solutions are available. Then you can go to the documentation to see how to use the API. The async documentation is a great way to learn these syntaxes.

Simulate a set of data, the data returned here is false, and the return delay is random.

var concurreyCount = 0;
var fetchUrl = function(url,callback){
 // delay 的值在 2000 以内,是个随机的整数 模拟延时
 var delay = parseInt((Math.random()* 10000000) % 2000,10);
 concurreyCount++;
 console.log('现在并发数是 ' , concurreyCount , ' 正在抓取的是' , url , ' 耗时' + delay + '毫秒');
 setTimeout(function(){
 concurreyCount--;
 callback(null,url + ' html content');
 },delay);
}

var urls = [];
for(var i = 0;i<30;i++){
 urls.push('http://datasource_' + i)
}
Copy after login

Then we use async.mapLimit to concurrently crawl and obtain the results.

async.mapLimit(urls,5,function(url,callback){
 fetchUrl(url,callbcak);
},function(err,result){
 console.log('result: ');
 console.log(result);
})
Copy after login

The simulation is excerpted from alsotang

The following results are obtained after running the output

We found that the number of concurrency starts from 1, but increases to At 5 o'clock, it is no longer increasing. When there are tasks, continue to crawl, and the number of concurrent connections is always controlled at 5.

Complete the node simple crawler system

Because the eventproxy used in the tutorial example of "Node includes teaching but not meeting" by alsotang senior controls the number of concurrency, we will complete a project using async A simple crawler that controls the number of concurrent nodes.

The crawling target is the homepage of this site (manual face protection)

The first step, first we need to use the following modules:

  • url: used for url parsing, here url.resolve() is used to generate a legal domain name

  • async: a practical module that provides powerful functions and asynchronous JavaScript work

  • cheerio: A fast, flexible, jQuery core implementation specially customized for the server

  • superagent: A very convenient client request in nodejs Agent module

Install dependent modules through npm


The second step is to introduce dependent modules through require to determine crawling Object URL:

var url = require("url");
var async = require("async");
var cheerio = require("cheerio");
var superagent = require("superagent");
var baseUrl = 'http://www.chenqaq.com';
Copy after login

Step 3: Use superagent to request the target URL, and use cheerio to process baseUrl to get the target content URL, and save it in the array arr

superagent.get(baseUrl)
 .end(function (err, res) {
 if (err) {
  return console.error(err);
 }
 var arr = [];
 var $ = cheerio.load(res.text);
 // 下面和jQuery操作是一样一样的..
 $(".post-list .post-title-link").each(function (idx, element) {
  $element = $(element);
  var _url = url.resolve(baseUrl, $element.attr("href"));
  arr.push(_url);
 });
 // 验证得到的所有文章链接集合
 output(arr);
 // 第四步:接下来遍历arr,解析每一个页面需要的信息
})
Copy after login

We need a function to verify the crawl The URL object is very simple. We only need a function to traverse arr and print it out:

function output(arr){
 for(var i = 0;i<arr.length;i++){
  console.log(arr[i]);
 }
}
Copy after login

Step 4: We need to traverse the obtained URL object and parse the information required for each page.

这里就需要用到async控制并发数量,如果你上一步获取了一个庞大的arr数组,有多个url需要请求,如果同时发出多个请求,一些网站就可能会把你的行为当做恶意请求而封掉你的ip

async.mapLimit(arr,3,function(url,callback){
 superagent.get(url)
  .end(function(err,mes){
   if(err){
    console.error(err);
    console.log('message info ' + JSON.stringify(mes));
   }
   console.log('「fetch」' + url + ' successful!');
   var $ = cheerio.load(mes.text);
   var jsonData = {
    title:$('.post-card-title').text().trim(),
    href: url,
   };
   callback(null,jsonData);
  },function(error,results){
   console.log('results ');
   console.log(results);
  })
 })
Copy after login

得到上一步保存url地址的数组arr,限制最大并发数量为3,然后用一个回调函数处理 「该回调函数比较特殊,在iteratee方法中一定要调用该回调函数,有三种方式」

  • callback(null) 调用成功

  • callback(null,data) 调用成功,并且返回数据data追加到results

  • callback(data) 调用失败,不会再继续循环,直接到最后的callback

好了,到这里我们的node简易的小爬虫就完成了,来看看效果吧

嗨呀,首页数据好少,但是成功了呢。

参考资料

Node.js 包教不包会 - alsotang

enterproxy

async

async Documentation

相关推荐:

如何设置apache的并发数量_PHP教程

如何设置apache的并发数量

PHP通过加锁实现并发抢购功能

The above is the detailed content of Use async and enterproxy to control the number of concurrencies. For more information, please follow other related articles on the PHP Chinese website!

Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template