
How can Node crawl headline videos in batches and save them (code implementation)

不言 · Original · 2018-09-19 17:02:25 · 2624 views

This article explains how to crawl and save Toutiao (headline) videos in batches with Node, including the code implementation. It should be a useful reference for anyone who needs to do something similar.

Introduction

The usual routine for batch-crawling videos or images is to use a crawler to collect a set of file links and then save the files one by one with a method such as fs.writeFile. The video links on Toutiao, however, cannot be found in the server-rendered HTML that a plain crawler fetches. When the page is rendered on the client, certain JS files compute the video URL from a known key or hash of the video and inject it into the video tag; this is one of the site's anti-crawling measures.

When we browse these pages ourselves, we can see the computed file address in the browser's developer tools (inspect element), but manually copying video links one by one is clearly impractical for batch downloads. Fortunately, puppeteer can drive Chrome programmatically, which lets us crawl the page as the browser finally renders it.

Project Start

Command
npm i
npm start

Note: installing puppeteer downloads a bundled Chromium build, so it can take a while; please be patient.

Configuration file
// Configuration
module.exports = {
  originPath: 'https://www.ixigua.com', // page to crawl
  savePath: 'D:/videoZZ' // local directory to save videos into
}
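
One practical detail: fs.writeFile cannot create missing directories, so the save path should exist before any download starts. A minimal sketch, assuming the configuration above is saved as config.js:

const fs = require('fs')
const config = require('./config')

// create the save directory on first run so later writeFile calls do not fail
if (!fs.existsSync(config.savePath)) {
  fs.mkdirSync(config.savePath)
}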

Technical points

puppeteer

Official API

puppeteer provides a high-level API to control Chrome or Chromium.

Main features of puppeteer:

  • Generate PDFs and screenshots of web pages

  • Crawl SPA applications and generate pre-rendered content (i.e. "SSR", server-side rendering)

  • Crawl content from websites

  • Automate form submission, UI testing, keyboard input, etc.

APIs used in this project:

  • puppeteer.launch() launches a browser instance

  • browser.newPage() creates a new page

  • page.goto() navigates to the specified URL

  • page.screenshot() takes a screenshot

  • page.waitFor() waits for a timeout, a selector, or a function before continuing

  • page.$eval() runs a function on the first element matching a selector, similar to document.querySelector

  • page.$$eval() runs a function on all elements matching a selector, similar to document.querySelectorAll

  • page.$('#id .className') gets a single element handle, similar in spirit to a jQuery selection

Code example

const puppeteer = require('puppeteer');
 
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
 
  await browser.close();
})();
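
The official example above only takes a screenshot. For this project, the interesting part is reading the video addresses out of the client-rendered DOM, which the APIs listed above make possible. A minimal sketch of that step, assuming the rendered page exposes video tags with a usable src attribute (the selectors and the title placeholder are assumptions, not the site's actual markup):

const puppeteer = require('puppeteer')
const config = require('./config')

const getVideoList = async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(config.originPath)
  // wait until the client-side JS has rendered at least one video element
  await page.waitFor('video')
  // collect the computed src of every video tag on the page
  const videos = await page.$$eval('video', els =>
    els.map((el, i) => ({ title: `video-${i}`, src: el.src }))
  )
  await browser.close()
  return videos
}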
Video file download method

  • Main video download method

const downloadVideo = async video => {
  // skip videos that have already been downloaded
  if (!fs.existsSync(`${config.savePath}/${video.title}.mp4`)) {
    await getVideoData(video.src, 'binary').then(fileData => {
      console.log('Downloading video:', video.title)
      savefileToPath(video.title, fileData).then(res =>
        console.log(`${res}: ${video.title}`)
      )
    })
  } else {
    console.log(`Video file already exists: ${video.title}`)
  }
}
  • Get video data

getVideoData (url, encoding) {
  return new Promise((resolve, reject) => {
    let req = http.get(url, function (res) {
      let result = ''
      // read the response as a binary string when an encoding is given
      encoding && res.setEncoding(encoding)
      res.on('data', function (d) {
        result += d
      })
      res.on('end', function () {
        resolve(result)
      })
      res.on('error', function (e) {
        reject(e)
      })
    })
    // network-level errors (DNS failure, connection reset, etc.) are emitted on the request object
    req.on('error', reject)
  })
}
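
Note that http.get only handles http:// addresses; if the computed video URL turns out to be served over https, Node's https module has to be used instead. A hedged sketch of a protocol-aware variant (the selection helper is illustrative, not part of the original project):

const http = require('http')
const https = require('https')

// pick the right client module based on the protocol of the resolved video URL
const getClient = url => (url.startsWith('https') ? https : http)

// inside getVideoData, the request would then become:
// let req = getClient(url).get(url, function (res) { /* ... */ })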
  • Save the video data locally

savefileToPath (fileName, fileData) {
  let fileFullName = `${config.savePath}/${fileName}.mp4`
  return new Promise((resolve, reject) => {
    fs.writeFile(fileFullName, fileData, 'binary', function (err) {
      if (err) {
        console.log('savefileToPath error:', err)
        return reject(err)
      }
      resolve('Downloaded')
    })
  })
}
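
Putting the pieces together: the video list collected by puppeteer can be fed to downloadVideo one item at a time. A short usage sketch, kept sequential to avoid hammering the server (getVideoList is the hypothetical helper from the earlier sketch):

const run = async () => {
  const videos = await getVideoList()
  for (const video of videos) {
    await downloadVideo(video)
  }
}

run().catch(err => console.error(err))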
Target website: 西瓜视频
Project function: download the latest 20 videos published under the Toutiao account [Weichen Finance]
Project address: Github address
