Detailed explanation of steps to use nodeJs crawler-JS Tutorial-php.cn

php中世界最好的语言

Release: 2018-05-21 15:30:12

This article walks through the steps for using a Node.js crawler and points out what to watch for along the way, illustrated with practical examples.

Background

Recently I decided to review the Node.js material I had read before and to write a few crawlers to kill the boredom. I ran into some problems while crawling, so I am recording them here for future reference.

Dependencies

cheerio, a widely used library, processes the crawled content; superagent sends the requests; and log4js records the logs.

Log configuration

Without further ado, let’s go directly to the code:

const log4js = require('log4js');
log4js.configure({
 appenders: {
  cheese: {
   type: 'dateFile',
   filename: 'cheese.log',
   pattern: '-yyyy-MM-dd.log',
   // Always include the date pattern in the log file name
   alwaysIncludePattern: true,
   maxLogSize: 1024,
   backups: 3 }
 },
 categories: { default: { appenders: ['cheese'], level: 'info' } }
});
const logger = log4js.getLogger('cheese');
logger.level = 'INFO';
module.exports = logger;

The code above exports a logger object. In a business file, require this module and call logger.info() and the other level methods to add log entries; a log file is generated for each day. Plenty of related material is available online.

Crawling content and processing

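const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio');
const superagent = require('superagent');
const EventProxy = require('eventproxy');
const logger = require('./logger'); // the logger module above; the path is an assumption

const ep = new EventProxy();

// cityItemUrl and getDesInfos are defined elsewhere in the crawler. getDesInfos is
// assumed to emit 'getDirInfoComplete' with its result once a city has been processed.
superagent.get(cityItemUrl).end((err, res) => {
  if (err) {
    return console.error(err);
  }
  const $ = cheerio.load(res.text);
  // Parse the current page and collect the city links on it
  const cityInfoEle = $('.newslist1 li a');
  // Register the handler once, outside the loop; it runs after one event per link
  ep.after('getDirInfoComplete', cityInfoEle.length, (dirInfos) => {
    const file = path.join(__dirname, './imgs.json');
    const content = JSON.parse(fs.readFileSync(file));
    dirInfos.forEach((element) => {
      logger.info(`Current record: ${JSON.stringify(element)}`);
      Object.assign(content, element);
    });
    fs.writeFileSync(file, JSON.stringify(content));
  });
  cityInfoEle.each((idx, element) => {
    const $element = $(element);
    const sceneURL = $element.attr('href'); // page URL
    const sceneName = $element.attr('title'); // city name
    if (!sceneName) {
      // Still emit, so that the expected number of events is reached
      ep.emit('getDirInfoComplete', {});
      return;
    }
    logger.info(`Parsed destination: ${sceneName}, URL: ${sceneURL}`);
    getDesInfos(sceneURL, sceneName); // fetch the city's details
  });
});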

superagent requests the page. Once the request succeeds, cheerio loads the page content, and jQuery-like selectors locate the target resources.

Several resources are loaded, so eventproxy is used to coordinate the events: an event is emitted each time one resource has been processed, and the data is handled once all the events have fired.
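
To illustrate the counting idea behind ep.after, here is a minimal sketch built on Node's built-in EventEmitter. It is not eventproxy's implementation: afterAll is a hypothetical helper, the event name is reused from the crawler code, and the payloads are made up.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: call `done` with all payloads once `eventName`
// has been emitted `times` times (similar in spirit to ep.after).
function afterAll(emitter, eventName, times, done) {
  const results = [];
  if (times === 0) {
    done(results);
    return;
  }
  const onEvent = (payload) => {
    results.push(payload);
    if (results.length === times) {
      emitter.removeListener(eventName, onEvent);
      done(results);
    }
  };
  emitter.on(eventName, onEvent);
}

const emitter = new EventEmitter();
let collected = null;
afterAll(emitter, 'getDirInfoComplete', 2, (infos) => {
  collected = infos;
});

// Each processed resource emits one event; the callback fires after the second
emitter.emit('getDirInfoComplete', { beijing: [] });
emitter.emit('getDirInfoComplete', { shanghai: [] });
console.log(collected);
```

The handler is registered before any event is emitted, which is why the crawler should also register ep.after once rather than inside the loop.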

That is the most basic crawler. The following sections cover areas that may cause problems or need special attention.

Read and write local files

Create folder

function mkdirSync(dirname) {
 if (fs.existsSync(dirname)) {
  return true;
 }
 if (mkdirSync(path.dirname(dirname))) {
  fs.mkdirSync(dirname);
  return true;
 }
 return false;
}
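
On Node.js 10.12 and later, the built-in recursive option of fs.mkdirSync should cover the same case as the function above. A minimal sketch, using a temporary directory and made-up folder names purely for the demo:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Demo location (an assumption): a fresh temporary directory
const baseDir = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
const target = path.join(baseDir, 'imgs', 'beijing');

// Creates every missing parent folder; calling it again on an
// existing folder does not throw
fs.mkdirSync(target, { recursive: true });
fs.mkdirSync(target, { recursive: true });

console.log(fs.existsSync(target));
```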

Read and write files

// fs, path, and logger are required as in the crawler code; dirInfos is the collected data
const file = path.join(__dirname, './dir.json');
const content = JSON.parse(fs.readFileSync(file));
dirInfos.forEach((element) => {
  logger.info(`Current record: ${JSON.stringify(element)}`);
  Object.assign(content, element);
});
fs.writeFileSync(file, JSON.stringify(content));
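
One thing to watch for here: fs.readFileSync throws if the JSON file does not exist yet. The sketch below shows one possible way to handle the first run by starting from an empty object. mergeIntoJsonFile is a hypothetical helper, and the temporary path and sample records are made up for the demo.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: merge `items` into the JSON object stored at `file`,
// starting from {} when the file does not exist yet.
function mergeIntoJsonFile(file, items) {
  const content = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf8'))
    : {};
  items.forEach((item) => Object.assign(content, item));
  fs.writeFileSync(file, JSON.stringify(content, null, 2));
  return content;
}

// Demo with a temporary file (location chosen only for illustration)
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
const file = path.join(tmpDir, 'dir.json');
mergeIntoJsonFile(file, [{ beijing: 'a.jpg' }]);
const merged = mergeIntoJsonFile(file, [{ shanghai: 'b.jpg' }]);
console.log(merged);
```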

Batch download resources

Downloaded resources may include pictures, audio, etc.

Use Bagpipe to handle asynchronous concurrency:

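const Bagpipe = require('bagpipe');
// Run at most 10 downloads at the same time
const bagpipe = new Bagpipe(10);
// url is the resource address and dstpath is the local target path
bagpipe.push(downloadImage, url, dstpath, (err, data) => {
  if (err) {
    console.log(err);
    return;
  }
  console.log(`[${dstpath}]: ${data}`);
});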
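
To make clear what limiting concurrency means, here is a minimal limiter sketch in plain JavaScript. It is not Bagpipe's implementation: createLimiter is a hypothetical helper, and the demo tasks are finished by hand instead of performing real downloads.

```javascript
// Hypothetical minimal limiter: at most `limit` tasks are in flight;
// the rest wait in a queue until a running task calls back.
function createLimiter(limit) {
  const queue = [];
  let active = 0;
  const next = () => {
    if (active >= limit || queue.length === 0) return;
    active += 1;
    const { task, callback } = queue.shift();
    task((err, data) => {
      active -= 1;
      callback(err, data);
      next();
    });
  };
  return {
    push(task, callback) {
      queue.push({ task, callback });
      next();
    },
    get active() { return active; },
    get pending() { return queue.length; },
  };
}

const limiter = createLimiter(2);
const finishers = []; // lets the demo complete tasks by hand
const finished = [];
for (let i = 0; i < 5; i += 1) {
  limiter.push(
    (done) => finishers.push(() => done(null, i)),
    (err, data) => finished.push(data)
  );
}
console.log(limiter.active, limiter.pending); // 2 running, 3 waiting
finishers[0](); // finishing one task starts the next queued one
console.log(limiter.active, limiter.pending);
```

Every task must eventually call its callback, otherwise its slot is never released and the queued tasks never start. The same holds for functions passed to bagpipe.push.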

Download the resource and use a stream to write the file:

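const fs = require('fs');
const request = require('request');

function downloadImage(src, dest, callback) {
  // Only absolute http(s) URLs can be downloaded; always call back so the queue moves on
  if (!src || !/^https?:\/\//.test(src)) {
    callback(new Error(`Invalid URL: ${src}`));
    return;
  }
  // Send a HEAD request first, so an unreachable resource fails before a file is created
  request.head(src, (err) => {
    if (err) {
      callback(err);
      return;
    }
    request(src).pipe(fs.createWriteStream(dest)).on('close', () => callback(null, dest));
  });
}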

Encoding

Sometimes, after content loaded directly with cheerio.load is written to a file, the text turns out to be entity-encoded (for example, Chinese characters appear as HTML entities). To avoid this, pass the decodeEntities option:

const $ = cheerio.load(buf, { decodeEntities: false });

This turns off the entity encoding.
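
For text that has already been written in encoded form, the entities can also be converted back afterwards. This is a minimal sketch, assuming the encoded text consists of numeric HTML entities such as &#x4E2D;. decodeNumericEntities is a hypothetical helper and does not handle named entities like &amp;.

```javascript
// Hypothetical helper: turn &#xHHHH; (hex) and &#DDDD; (decimal)
// entities back into the characters they stand for.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-fA-F]+);/g, (match, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (match, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

// The same two Chinese characters, once as hex and once as decimal entities
const decoded = decodeNumericEntities('&#x4E2D;&#x6587; and &#20013;&#25991;');
console.log(decoded);
```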

P.S. I could not get the encoding library or iconv-lite to convert the UTF-8 encoded characters into Chinese, probably because I am not familiar enough with their APIs. I will look into this later.

Finally, here is a regular expression that matches all DOM tags:

const reg = /<.*?>/g;
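
A short usage sketch for this pattern, with made-up HTML strings. Note that `.` does not match line breaks, so a tag that spans several lines is left in place. The [\s\S] variant in the last line is one possible alternative, not part of the original pattern.

```javascript
const reg = /<.*?>/g;

// Strip the tags from a single-line fragment
const html = '<p>Hello <b>crawler</b></p>';
const text = html.replace(reg, '');
console.log(text);

// A tag that contains a line break is not matched by `.`;
// [\s\S] also matches line breaks.
const multiLine = '<a\n href="#">link</a>';
const partlyStripped = multiLine.replace(reg, '');
const fullyStripped = multiLine.replace(/<[\s\S]*?>/g, '');
console.log(JSON.stringify(partlyStripped));
console.log(JSON.stringify(fullyStripped));
```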
