Writing a crawler example using NodeJS-JS Tutorial-php.cn

This article mainly explains to you how to use NodeJS to learn crawlers, and explains the usage and effects by crawling the Encyclopedia of Embarrassing Things. Let's learn together.

1. Preface analysis

Usually we use Python/.NET language to implement crawlers, but now as a front-end developer, naturally Requires proficiency in NodeJS. Let's use NodeJS language to implement a crawler for Encyclopedia of Embarrassing Things. In addition, some of the codes used in this article are es6 syntax.

The dependent libraries required to implement this crawler are as follows.

request: Use get or post methods to obtain the source code of the web page. cheerio: Parse the web page source code and obtain the required data.

This article first introduces the dependency libraries required by the crawler and their use, and then uses these dependency libraries to implement a web crawler for Encyclopedia of Embarrassing Things.

2. request library

request is a lightweight http library that is very powerful and easy to use. You can use it to implement Http requests, and supports HTTP authentication, custom request headers, etc. Below is an introduction to some of the functions in the request library.

Install the request module as follows:

npm install request

Copy after login

After request is installed, it can be used. Now use request to request Baidu web pages.

const req = require(&#39;request&#39;);
req(&#39;http://www.baidu.com&#39;, (error, response, body) => {
 if (!error && response.statusCode == 200) {
 console.log(body)
 }
})

Copy after login

When the options parameter is not set, the request method defaults to a get request. The specific method I like to use the request object is as follows:

req.get({
 url: &#39;http://www.baidu.com&#39;
},(err, res, body) => {
 if (!err && res.statusCode == 200) {
 console.log(body)
 }
});

Copy after login

However, many times, we often do not get the information we need by directly requesting the html source code obtained from a URL. In general, request headers and web page encoding need to be taken into consideration.

Web page request header web page encoding

The following describes how to add a web page request header and set the correct encoding when requesting.

req.get({
 url : url,
 headers: {
  "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
  "Host" : "www.zhihu.com",
  "Upgrade-Insecure-Requests" : "1"
 },
 encoding : &#39;utf-8&#39;
}, (err, res, body)=>{
 if(!err)
  console.log(body);
})

Copy after login

Set the options parameter, add the headers attribute to set the request header; add the encoding attribute to set the encoding of the web page. It should be noted that if encoding: null, then the content obtained by the get request is a Buffer object, that is, the body is a Buffer object.

The functions introduced above are sufficient to meet the following needs

3. cheerio library

cheerio is a Server-side Jquery is loved by developers for its features such as being light, fast, and easy to learn. It is very easy to learn the cheerio library after having a basic knowledge of Jquery. It can quickly locate elements in web pages, and its rules are the same as Jquery's method of locating elements; it can also modify the content of elements in html and obtain their data in a very convenient form. The following mainly introduces cheerio to quickly locate elements in web pages and obtain their contents.

First install the cheerio library

npm install cheerio

Copy after login

The following is a piece of code, and then explains the usage of the cheerio library. Analyze the homepage of the blog park, and then extract the titles of the articles on each page.

First analyze the home page of the blog park. As shown below:

After analyzing the html source code, first obtain all titles through .post_item, and then obtain each .post_item For analysis, use a.titlelnk to match the a tag of each title. The following is implemented through code.

const req = require(&#39;request&#39;);
const cheerio = require(&#39;cheerio&#39;);

req.get({
 url: &#39;https://www.cnblogs.com/&#39;
 }, (err, res, body) => {
 if (!err && res.statusCode == 200) {
  let cnblogHtmlStr = body;
  let $ = cheerio.load(cnblogHtmlStr);
  $(&#39;.post_item&#39;).each((index, ele) => {
  let title = $(ele).find(&#39;a.titlelnk&#39;);
  let titleText = title.text();
  let titletUrl = title.attr(&#39;href&#39;);
  console.log(titleText, titletUrl);
  });
 }
 });

Copy after login

Of course, the cheerio library also supports chain calls, and the above code can also be rewritten as:

let cnblogHtmlStr = body;
let $ = cheerio.load(cnblogHtmlStr);
let titles = $(&#39;.post_item&#39;).find(&#39;a.titlelnk&#39;);
titles.each((index, ele) => {
 let titleText = $(ele).text();
 let titletUrl = $(ele).attr(&#39;href&#39;);
 console.log(titleText, titletUrl);

Copy after login

The above code is very simple, so I won’t elaborate on it in words. Below I summarize some of the points that I think are more important.

Use the find() method to obtain the node set A. If you use the elements in the A set as the root node again to locate its child nodes and obtain the content and attributes of the child elements, you need to The sub-elements in the A collection are $(A[i]) wrapped, just like $(ele) above. Use $(ele) in the above code. In fact, you can also use $(this), but since I am using the arrow function of es6, I changed each This pointer of the callback function in the method, therefore, I use $(ele); The cheerio library also supports chain calls, such as the above $('.post_item').find('a .titlelnk'), it should be noted that the cheerio object A calls the method find(). If A is a collection, then each sub-element in the collection A calls find( ) method and put back a result combined. If A calls text(), then each child element in A's collection calls text() and returns a string that is the union of the contents of all child elements ( merge directly, without delimiters).

Finally, I will summarize some of my more commonly used methods.

first() last() children([selector]): This method is similar to find, except that this method only searches child nodes, while find searches the entire descendant nodes.

4. Embarrassing Encyclopedia Crawler

通过上面对 request 和 cheerio 类库的介绍，下面利用这两个类库对糗事百科的页面进行爬取。

1、在项目目录中，新建 httpHelper.js 文件，通过 url 获取糗事百科的网页源码，代码如下：

//爬虫
const req = require(&#39;request&#39;);

function getHtml(url){
 return new Promise((resolve, reject) => {
  req.get({
   url : url,
   headers: {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Referer" : "https://www.qiushibaike.com/"
   },
   encoding : &#39;utf-8&#39;
  }, (err, res, body)=>{
   if(err) reject(err);
   else resolve(body);
  })
 });
}
exports.getHtml = getHtml;

Copy after login

2、在项目目录中，新建一个 Splider.js 文件，分析糗事百科的网页代码，提取自己需要的信息，并且建立一个逻辑通过更改 url 的 id 来爬取不同页面的数据。

const cheerio = require(&#39;cheerio&#39;);
const httpHelper = require(&#39;./httpHelper&#39;);
function getQBJok(htmlStr){
 let $ = cheerio.load(htmlStr);
 let jokList = $(&#39;#content-left&#39;).children(&#39;p&#39;);
 let rst = [];
 jokList.each((i, item)=>{
  let node = $(item);
  let titleNode = node.find(&#39;h2&#39;);
  let title = titleNode ? titleNode.text().trim() : &#39;匿名用户&#39;;
  let content = node.find(&#39;.content span&#39;).text().trim();
  let likeNumber = node.find(&#39;i[class=number]&#39;).text().trim();
  rst.push({
   title : title,
   content : content,
   likeNumber : likeNumber
  });
 });
 return rst;
}
async function splider(index = 1){
 let url = `https://www.qiushibaike.com/8hr/page/${index}/`;
 let htmlStr = await httpHelper.getHtml(url);
 let rst = getQBJok(htmlStr);
 return rst;
}
splider(1);

Copy after login

在获取糗事百科网页信息的时候，首先在浏览器中对源码进行分析，定位到自己所需要标签，然后提取标签的文本或者属性值，这样就完成了网页的解析。

Splider.js 文件入口是 splider 方法，首先根据传入该方法的 index 索引，构造糗事百科的 url，接着获取该 url 的网页源码，最后将获取的源码传入 getQBJok 方法，进行解析，本文只解析每条文本笑话的作者、内容以及喜欢个数。

直接运行 Splider.js 文件，即可爬取第一页的笑话信息。然后可以更改 splider 方法的参数，实现抓取不同页面的信息。

在上面已有代码的基础上，使用 koa 和 vue2.0 搭建一个浏览文本的页面，效果如下：

源码已上传到 github 上。下载地址：https://github.com/StartAction/SpliderQB ;

项目运行依赖 node v7.6.0 以上，首先从 Github 上面克隆整个项目。

git clone https://github.com/StartAction/SpliderQB.git

Copy after login

克隆之后，进入项目目录，运行下面命令即可。

node app.js

Copy after login

上面是我整理给大家的，希望今后会对大家有帮助。

利用jquery.page.js如何实现分页效果

The above is the detailed content of Writing a crawler example using NodeJS. For more information, please follow other related articles on the PHP Chinese website!