Implementing a Simple Crawler with Node.js (JS tutorial, PHP中文網)

高洛峰

Published: 2017-03-19 17:24:40

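Why write a crawler in node? Mainly because of cheerio, a library that is fully compatible with jQuery syntax. If you already know jQuery, it is a pleasure to use.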

#Dependencies

cheerio: jQuery for Node.js
http: wraps an HTTP server and a simple HTTP client
iconv-lite: fixes garbled text when crawling gb2312 pages

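Initial implementation

Since we are going to crawl a website's content, we should first look at how the site is put together.

The target site chosen here is 電影天堂, and the goal is to crawl the download links of all the latest movies.

Analyzing the page

The page structure is as follows:

Each movie title sits in an a tag with the class ulink. Moving further up, the outermost box has the class co_content8.

OK, time to get to work.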

#First, import the dependencies and set the URL to crawl

var cheerio = require('cheerio');
var http = require('http');
var iconv = require('iconv-lite');

var url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html';


Core code

index.js

http.get(url, function(sres) {
  var chunks = [];
  sres.on('data', function(chunk) {
    chunks.push(chunk);
  });
  // chunks holds the page's raw HTML; once it is transcoded and passed to cheerio.load,
  // we get back a variable that implements the jQuery interface, which we name `$`.
  // Everything after that is plain jQuery.
  sres.on('end', function() {
    var titles = [];
    // This page is encoded as gb2312, so it has to be transcoded or the text comes out garbled.
    // Basis: the page's <meta> tag.
    var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
    var $ = cheerio.load(html, {decodeEntities: false});
    $('.co_content8 .ulink').each(function (idx, element) {
      var $element = $(element);
      titles.push({
        title: $element.text()
      })
    })    
    console.log(titles);     
  });
});
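The listing collects every chunk and calls Buffer.concat before decoding, rather than decoding each chunk as it arrives. A minimal offline sketch of why that matters is given here. It assumes UTF-8 in place of gb2312, because Buffer's built-in toString has no gb2312 support; the principle is the same for any multi-byte encoding. The sample text and the split point are made up for illustration.

```javascript
// Offline sketch: why chunks are joined before decoding.
// UTF-8 stands in for gb2312 here (an assumption made for this example only).
var whole = Buffer.from('電影', 'utf8'); // 6 bytes, 3 per character

// Simulate a response that arrives in two chunks, splitting the first character.
var sampleChunks = [whole.subarray(0, 2), whole.subarray(2)];

// Decoding each chunk on its own corrupts the character that was split.
var piecewise = sampleChunks.map(function (c) {
  return c.toString('utf8');
}).join('');

// Joining the raw bytes first and decoding once restores the text.
var joined = Buffer.concat(sampleChunks).toString('utf8');

console.log(piecewise === '電影'); // false
console.log(joined === '電影');    // true
```

A network response can be cut at any byte boundary, so decoding once over the joined buffer, as the listing does with iconv.decode, is the safer order.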


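Run node index, and the movie titles are printed to the console.

We now have the movie titles. But what if we want the titles from several pages? Editing the URL by hand for each one is not practical. There is a way, of course, so read on.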

Getting movie titles from multiple pages

We only need to wrap the previous code in a function and call it recursively. The core code in index.js becomes:
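Before the full listing, here is the recursive pattern on its own, as a rough offline sketch. The names fetchPage and crawlPages are hypothetical and do not come from the article. fetchPage is a stub standing in for http.get, so the example runs without network access, and the limit of 3 pages is an assumption.

```javascript
// Offline sketch of the recursive paging pattern.
// fetchPage is a hypothetical stand-in for http.get: it "responds" immediately
// with one fake title per page so the example runs without a network.
function fetchPage(pageUrl, callback) {
  callback([{ title: 'title from ' + pageUrl }]);
}

// Visits pages i..lastPage one after another, then hands all titles to done.
function crawlPages(baseUrl, i, lastPage, collected, done) {
  fetchPage(baseUrl + i + '.html', function (pageTitles) {
    var all = collected.concat(pageTitles);
    if (i < lastPage) {
      crawlPages(baseUrl, i + 1, lastPage, all, done); // move on to the next page
    } else {
      done(all); // last page reached, report everything collected
    }
  });
}

var result = [];
crawlPages('list_23_', 1, 3, [], function (all) {
  result = all;
});
console.log(result.length); // 3
```

The next request only starts once the previous response has been handled, so pages are fetched one at a time and in order. The article's listing does the same thing with a real http.get call and a shared titles array.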

var index = 1; // page counter
var url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_';
var titles = []; // holds the collected titles

function getTitle(url, i) {
  console.log("Fetching page " + i);
  http.get(url + i + '.html', function(sres) {
    var chunks = [];
    sres.on('data', function(chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function() {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312');
      var $ = cheerio.load(html, {decodeEntities: false});
      $('.co_content8 .ulink').each(function (idx, element) {
        var $element = $(element);
        titles.push({
          title: $element.text()
        })
      })  
      // The end of this function was lost in extraction. The lines below are a
      // reconstruction, assuming a limit of 3 pages (the figure mentioned later).
      if (i < 3) {
        getTitle(url, ++i); // recurse with the next page number
      } else {
        console.log(titles);
      }
    });
  });
}

getTitle(url, index);

Run it to see the titles collected from all the pages.

#Getting the movie download links

Done by hand, this takes one step per movie: we have to click through to each movie's detail page to find the download address.

So how do we do this with node?

#As usual, analyze the page layout first

To pinpoint the download links, we first need to find the element whose id is Zoom. The links are in the a tags inside the tr elements under that element.

So let's define another function, getBtLink(), to fetch the download links.

var btLink = []; // collected download links (declaration assumed; it is not shown in the source)

function getBtLink(urls, n) { // urls holds the detail page addresses (read from each entry's title field below)
  console.log("Fetching url number " + n);
  http.get('http://www.ygdy8.net' + urls[n].title, function(sres) {
    var chunks = [];
    sres.on('data', function(chunk) {
      chunks.push(chunk);
    });
    sres.on('end', function() {
      var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); // transcode
      var $ = cheerio.load(html, {decodeEntities: false});
      $('#Zoom td').children('a').each(function (idx, element) {
        var $element = $(element);
        btLink.push({
          bt: $element.attr('href')
        })
      })
      // The end of this function was lost in extraction. The lines below are a
      // reconstruction, assuming the recursion walks every entry of urls.
      if (n < urls.length - 1) {
        getBtLink(urls, ++n); // recurse with the next detail page
      } else {
        console.log(btLink);
      }
    });
  });
}

Run node index again.

With that, we have collected the download links of all the movies on 3 pages. Simple, isn't it?

Saving the data

Having crawled this data, we naturally want to save it. Here I chose MongoDB to store it.

The data-saving function save():

function save() {
  var MongoClient = require('mongodb').MongoClient; // import the dependency
  // mongo_url is not defined in the article; it is assumed to hold the MongoDB connection string
  MongoClient.connect(mongo_url, function (err, db) {
    if (err) {
      console.error(err);
      return;
    }
    console.log("Connected to the database");
    var collection = db.collection('node-reptitle');
    collection.insertMany(btLink, function (err, result) { // insert the data
      if (err) {
        console.error(err);
      } else {
        console.log("Data saved");
      }
      db.close(); // close only after the insert has finished
    });
  });
}


The operations here are simple, so there is no need to bring in mongoose. Run node index again.

That is all there is to this Node.js crawler. I hope you manage to crawl the data you are after ;)