When Node.js fetches a web page, the response body arrives in several chunks via the 'data' event. If you want to match against the whole page, you have to accumulate those chunks and only operate on the complete data in the 'end' event!
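Here is a minimal sketch of that pattern (the URL and the search string are just placeholders):

```js
var http = require('http');

// accumulate the chunked response, then match against the whole body
http.get('http://www.example.com/', function(res) {
    res.setEncoding('utf-8');
    var body = '';
    res.on('data', function(chunk) {
        body += chunk; // fired several times, once per chunk
    });
    res.on('end', function() {
        // only now is the full page available for a global match
        console.log(~body.indexOf('example') ? 'found' : 'not found');
    });
});
```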
For example, suppose you want to check whether www.baidu.com appears on each page. Without further ado, here is the code:
```js
// require modules
var http = require("http"),
    fs = require('fs'),
    url = require('url');

// append a result line to the given file
var writeRes = function(p, r) {
    fs.appendFile(p, r, function(err) {
        if (err) console.log(err);
        else console.log(r);
    });
},
// send a request, check the page content, and write the result to a file
postHttp = function(arr, num) {
    console.log('entry ' + num + '!');
    var a = arr[num].split(" - ");
    if (!a[0] || !a[1]) { return; }
    var address = url.parse(a[1]),
        options = {
            host: address.host,
            path: address.path,
            hostname: address.hostname,
            method: 'GET',
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36'
            }
        };
    var req = http.request(options, function(res) {
        if (res.statusCode == 200) {
            res.setEncoding('UTF-8');
            var data = '';
            // the body arrives in several chunks; accumulate them here...
            res.on('data', function(rd) {
                data += rd;
            });
            // ...and only match against the full page once the response ends
            res.on('end', function() {
                if (!~data.indexOf("www.baidu.com")) {
                    return writeRes('./no2.txt', a[0] + '--' + a[1] + '\n');
                } else {
                    return writeRes('./has2.txt', a[0] + '--' + a[1] + '\n');
                }
            });
        } else {
            res.resume(); // drain the unused body so the socket is freed
            writeRes('./error2.txt', a[0] + '--' + a[1] + '--' + res.statusCode + '\n');
        }
    });
    req.on('error', function(e) {
        writeRes('./error2.txt', a[0] + '--' + a[1] + '--' + e + '\n');
    });
    req.end();
},
// read the file listing the pages to crawl
openFile = function(path, coding) {
    fs.readFile(path, coding, function(err, data) {
        if (err) return console.log(err);
        var res = data.split("\n");
        for (var i = 0, rl = res.length; i < rl; i++) {
            if (!res[i]) continue;
            postHttp(res, i);
        }
    });
};

openFile('./sites.log', 'utf-8');
```
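For reference, judging from the `split(" - ")` in `postHttp`, each line of `sites.log` is expected to hold a label and a URL separated by " - ", something like:

```
site1 - http://www.example.com/
site2 - http://www.example.org/index.html
```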
The code above should be easy to follow; if anything is unclear, leave me a message. The specific details will depend on your own use case in practice.
Next, to put Node.js's web-crawling ability in context, let me compare it with PHP and Python.
First, PHP. The advantages: there are plenty of frameworks online for fetching and parsing HTML, so you can just pick up a tool and go, which saves a lot of worry. The disadvantages: first of all, speed/efficiency is a problem. Once, when I was downloading movie posters from a scheduled crontab job with no optimization, so many PHP processes were spawned that memory was completely exhausted. The syntax is also long-winded: too many keywords and symbols, not concise enough, and it feels like the language was never carefully designed, which makes it tedious to write.
Node.js. Its advantage is efficiency, efficiency, and efficiency. Since network I/O is asynchronous, a single process is roughly as powerful as hundreds of concurrent processes, while memory and CPU usage stay very low. If there is no complex computation over the crawled data, the system bottleneck basically comes down to bandwidth and the I/O speed of writing to databases such as MySQL. The flip side of the advantage is the disadvantage: asynchronous networking means callbacks. If the business logic is linear, for example you must wait for the previous page to finish and hand over its data before fetching the next page, and there are several more layers of dependency on top of that, you end up with terrible multi-layer callbacks, and the code structure and logic turn into a mess. Of course, you can use step or other flow-control tools to solve these problems, as sketched below.
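To illustrate that nesting, here is a hedged sketch (the URLs and the `fetch` helper are made up for the example) of fetching page 2 only after page 1 completes:

```js
var http = require('http');

// tiny helper: fetch a URL and hand the full body to a callback
function fetch(u, cb) {
    http.get(u, function(res) {
        var body = '';
        res.setEncoding('utf-8');
        res.on('data', function(c) { body += c; });
        res.on('end', function() { cb(null, body); });
    }).on('error', cb);
}

// linear business logic forces nesting: each page depends on the previous one
fetch('http://www.example.com/page1', function(err, p1) {
    if (err) return console.log(err);
    fetch('http://www.example.com/page2', function(err, p2) {
        if (err) return console.log(err);
        // every additional dependent page adds another indentation level...
        console.log(p1.length, p2.length);
    });
});
```

Flow-control libraries such as step exist precisely to flatten this kind of nesting back into a linear sequence.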
Finally, let's talk about Python. If you have no extreme efficiency requirements, Python is what I recommend! First of all, Python's syntax is very concise; the same statement costs far fewer keystrokes. Second, Python is very well suited to data processing, with conveniences such as packing and unpacking of function arguments, list comprehensions, and matrix handling.