Node crawls Lagou.com data and exports it to an excel file-JS Tutorial-php.cn

This article mainly introduces the node crawling Lagou.com data and exporting it as an excel file. It has a certain reference value. Now I share it with everyone. Friends in need can refer to it

Preface

I have been learning node.js intermittently before. Today I will practice using Lagou.com and learn about the recent recruitment market through data! I am relatively new to node, and I hope I can learn and make progress together with everyone.

1. Summary

We first need to clarify the specific needs:

can be crawled through node index city position Related information
You can also enter node index start to directly crawl our predefined city and position arrays, and loop to crawl different job information in different cities
Store the final crawling results in the local ./data directory
Generate the corresponding excel file and store it locally

2. Related modules used by crawlers

fs: used to read and write system files and directories
async: process control
superagent: client request proxy module
node-xlsx: Export files in a certain format as excel

3. The main steps of the crawler:

Initialize the project

New project directory

Create the project in the appropriate disk directory Directory node-crwl-lagou

Initialization project

Enter the node-crwl-lagou folder

Execute npm init and initialize the package.json file

Install dependent packages

npm install async

npm install superagent

npm install node-xlsx

Processing of command line input

For the content entered on the command line, you can use process.argv to obtain it. It will return an array, and each item in the array is the content entered by the user.
Distinguish between node index regional position and node index start two inputs, the simplest is to determine the length of process.argv, if the length is four, directly call the crawler main program To crawl the data, if the length is three, we need to piece together the URL through the predefined city and position arrays, and then use async.mapSeries to call the main program in a loop. The homepage code for command analysis is as follows:

if (process.argv.length === 4) {
  let args = process.argv
  console.log('准备开始请求' + args[2] + '的' + args[3] + '职位数据');
  requsetCrwl.controlRequest(args[2], args[3])
} else if (process.argv.length === 3 && process.argv[2] === 'start') {
  let arr = []
  for (let i = 0; i <p>The predefined city and position arrays are as follows: </p><pre class="brush:php;toolbar:false">{
    "city": ["北京","上海","广州","深圳","杭州","南京","成都","西安","武汉","重庆"],
    "position": ["前端","java","php","ios","android","c++","python",".NET"]
}

Copy after login

The next step is the analysis of the main program part of the crawler.

Analyze the page and find the request address

First we open the homepage of Lagou.com, enter the query information (such as node), and then check the console to find the relevant request, as shown in the figure:

Node crawls Lagou.com data and exports it to an excel file

This post requesthttps://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false is what we need, through three requests Parameters to obtain different data, a simple analysis can tell: the parameter first is to mark whether the current page is the first page, true is yes, false is no; the parameter pn is the current The page number; the parameter kd is the query input content.

Requesting data through superagent

First of all, it needs to be clear that the entire program is asynchronous, and we need to use async.series to call it in sequence.
View the response returned by the analysis:

Node crawls Lagou.com data and exports it to an excel file

You can see that content.positionResult.totalCount is the total number of pages we need
We use superagent to directly call the post request , the console will prompt the following information:

{'success': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '122.xxx.xxx.xxx'}

Copy after login

This is actually one of the anti-crawler strategies. We only need to add a request header to it. The method of obtaining the request header is very simple, as follows:

Node crawls Lagou.com data and exports it to an excel file

Then use superagent to call the post request. The main code is as follows:

// 先获取总页数
    (cb) => {
      superagent
        .post(`https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=${city}&kd=${position}&pn=1`)
        .send({
          'pn': 1,
          'kd': position,
          'first': true
        })
        .set(options.options)
        .end((err, res) => {
          if (err) throw err
          // console.log(res.text)
          let resObj = JSON.parse(res.text)
          if (resObj.success === true) {
            totalPage = resObj.content.positionResult.totalCount;
            cb(null, totalPage);
          } else {
            console.log(`获取数据失败:${res.text}}`)
          }
        })
    },

Copy after login

After getting the total number of pages, we can passTotal number of pages/15Get the pn parameter, loop to generate all URLs and store them in urls:

(cb) => {
      for (let i=0;Math.ceil(i<totalpage><p>With all the URLs, it is not difficult to crawl to all the data. Continue to use the superagent's post method to loop requests For all URLs, every time the data is obtained, a json file is created in the data directory and the returned data is written. This seems simple, but there are two points to note: </p>
<ol class=" list-paddingleft-2">
<li><p>In order to prevent too many concurrent requests from causing the IP to be blocked: when looping URLs, you need to use the async.mapLimit method to control the concurrency to 3. After each request, it will take two seconds before sending the next request</p></li>
<li><p>在async.mapLimit的第四个参数中，需要通过判断调用主函数的第三个参数是否存在来区分一下是那种命令输入，因为对于<code>node index start</code>这个命令，我们使用得是async.mapSeries，每次调用主函数都传递了<code>(city, position, callback)</code>，所以如果是<code>node index start</code>的话，需要在每次获取数据完后将null传递回去，否则无法进行下一次循环</p></li>
</ol>
<p>主要代码如下：</p>
<pre class="brush:php;toolbar:false">// 控制并发为3
    (cb) => {
      async.mapLimit(urls, 3, (url, callback) => {
        num++;
        let page = url.split('&')[3].split('=')[1];
        superagent
          .post(url)
          .send({
            'pn': totalPage,
            'kd': position,
            'first': false
          })
          .set(options.options)
          .end((err, res) => {
            if (err) throw err
            let resObj = JSON.parse(res.text)
            if (resObj.success === true) {
              console.log(`正在抓取第${page}页，当前并发数量：${num}`);
              if (!fs.existsSync('./data')) {
                fs.mkdirSync('./data');
              }
              // 将数据以.json格式储存在data文件夹下
              fs.writeFile(`./data/${city}_${position}_${page}.json`, res.text, (err) => {
                if (err) throw err;
                // 写入数据完成后，两秒后再发送下一次请求
                setTimeout(() => {
                  num--;
                  console.log(`第${page}页写入成功`);
                  callback(null, 'success');
                }, 2000);
              });
            }
          })
      }, (err, result) => {
        if (err) throw err;
        // 这个arguments是调用controlRequest函数的参数，可以区分是那种爬取（循环还是单个）
        if (arguments[2]) {
          ok = 1;
        }
        cb(null, ok)
      })
    },
    () => {
      if (ok) {
        setTimeout(function () {
          console.log(`${city}的${position}数据请求完成`);
          indexCallback(null);
        }, 5000);
      } else {
        console.log(`${city}的${position}数据请求完成`);
      }
      // exportExcel.exportExcel() // 导出为excel
    }

Copy after login

导出的json文件如下：
Node crawls Lagou.com data and exports it to an excel file

json文件导出为excel

将json文件导出为excel有多种方式，我使用的是node-xlsx这个node包，这个包需要将数据按照固定的格式传入，然后导出即可，所以我们首先做的就是先拼出其所需的数据格式：

function exportExcel() {
  let list = fs.readdirSync('./data')
  let dataArr = []
  list.forEach((item, index) => {
    let path = `./data/${item}`
    let obj = fs.readFileSync(path, 'utf-8')
    let content = JSON.parse(obj).content.positionResult.result
    let arr = [['companyFullName', 'createTime', 'workYear', 'education', 'city', 'positionName', 'positionAdvantage', 'companyLabelList', 'salary']]
    content.forEach((contentItem) => {
      arr.push([contentItem.companyFullName, contentItem.phone, contentItem.workYear, contentItem.education, contentItem.city, contentItem.positionName, contentItem.positionAdvantage, contentItem.companyLabelList.join(','), contentItem.salary])
    })
    dataArr[index] = {
      data: arr,
      name: path.split('./data/')[1] // 名字不能包含 \ / ? * [ ]
    }
  })

// 数据格式
// var data = [
//   {
//     name : 'sheet1',
//     data : [
//       [
//         'ID',
//         'Name',
//         'Score'
//       ],
//       [
//         '1',
//         'Michael',
//         '99'
//
//       ],
//       [
//         '2',
//         'Jordan',
//         '98'
//       ]
//     ]
//   },
//   {
//     name : 'sheet2',
//     data : [
//       [
//         'AA',
//         'BB'
//       ],
//       [
//         '23',
//         '24'
//       ]
//     ]
//   }
// ]

// 写xlsx
  var buffer = xlsx.build(dataArr)
  fs.writeFile('./result.xlsx', buffer, function (err)
    {
      if (err)
        throw err;
      console.log('Write to xls has finished');

// 读xlsx
//     var obj = xlsx.parse("./" + "resut.xls");
//     console.log(JSON.stringify(obj));
    }
  );
}

Copy after login

导出的excel文件如下，每一页的数据都是一个sheet，比较清晰明了：
Node crawls Lagou.com data and exports it to an excel file