
Detailed explanation of how to use Node.js to develop a simple image crawling function

青灯夜游 · 2022-06-30

How do you crawl with Node? The following article walks through using Node.js to develop a simple image-crawling feature. I hope it helps!


The main purpose of a crawler is to collect specific data that is publicly available on the Internet. With this data we can analyze trends, make comparisons, or train models for deep learning, and so on. In this article we introduce node-crawler, a Node.js package built specifically for web crawling, and use it to complete a simple crawler case: grabbing the images on a web page and downloading them locally.

Text

node-crawler is a lightweight Node.js crawler tool that balances efficiency and convenience. It supports distributed crawler systems, hard-coded configuration, and HTTP forward proxies. Moreover, since it is written entirely in Node.js, it natively supports non-blocking asynchronous I/O, which greatly benefits the crawler's pipelined operation. It also supports quick DOM selection (you can use jQuery syntax), which is a killer feature for grabbing specific parts of a web page: there is no need to hand-write regular expressions, which improves crawler development efficiency.

Installation and introduction

First we create a new project, with index.js as the entry file.

Then install the crawler library, node-crawler.

```shell
# pnpm
pnpm add crawler
# npm
npm i -S crawler
# yarn
yarn add crawler
```

Then use require to bring it in.

```js
// index.js
const Crawler = require("crawler");
```

Create an instance

```js
// index.js
let crawler = new Crawler({
    timeout: 10000,
    jQuery: true,
});

function getImages(uri) {
    crawler.queue({
        uri,
        callback: (err, res, done) => {
            if (err) throw err;
        },
    });
}
```

Next we start writing the method that fetches the images from an HTML page. Once Crawler is instantiated, it is used mainly by placing into its queue a link together with a callback method. The callback is invoked after each request has been processed.

One thing worth explaining here: Crawler uses the request library under the hood, so the list of parameters available for configuration in Crawler is a superset of the request library's parameters; that is, every configuration option in the request library applies to Crawler.
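As a sketch of what that means in practice, the options object below mixes Crawler's own fields with two options inherited from request (custom headers and an HTTP proxy). The header and proxy values are illustrative placeholders, not settings the article requires:

```javascript
// Options object to pass to new Crawler(...).
// timeout and jQuery are Crawler fields; headers and proxy come from request.
const crawlerOptions = {
    timeout: 10000,
    jQuery: true,
    headers: {
        "User-Agent": "Mozilla/5.0", // request option: send a browser-like UA
    },
    // proxy: "http://127.0.0.1:8080", // request option: route via an HTTP proxy
};
```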

Element Capture

Perhaps you noticed the jQuery parameter just now. You guessed it: with it, DOM elements can be captured using jQuery syntax.

```js
// index.js
let data = [];

function getImages(uri) {
    crawler.queue({
        uri,
        callback: (err, res, done) => {
            if (err) throw err;
            let $ = res.$;
            try {
                let $imgs = $("img");
                Object.keys($imgs).forEach(index => {
                    let img = $imgs[index];
                    const { type, name, attribs = {} } = img;
                    let src = attribs.src || "";
                    if (type === "tag" && src && !data.includes(src)) {
                        let fileSrc = src.startsWith("http") ? src : `https:${src}`;
                        let fileName = src.split("/")[src.split("/").length - 1];
                        downloadFile(fileSrc, fileName); // method to download the image
                        data.push(src);
                    }
                });
            } catch (e) {
                console.error(e);
            } finally {
                // call done() exactly once so the queue can move on
                done();
            }
        },
    });
}
```

You can see that $ was used to capture the img tags in the response. We then process the link of each captured image and strip out the file name so that it can be saved and named later. An array is also defined here to hold the captured image addresses: if the same address shows up in a later capture, the download is not processed again.
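The link completion and name stripping done inline above can be sketched as two small helpers, under the same assumptions the callback makes (protocol-relative `//…` sources, file name as the last path segment). The helper names are ours, not part of node-crawler:

```javascript
// Prefix protocol-relative sources (e.g. "//cdn.../a.png") with "https:".
function toAbsoluteSrc(src) {
    return src.startsWith("http") ? src : `https:${src}`;
}

// Take the last path segment as the file name to save under.
function fileNameFromSrc(src) {
    const parts = src.split("/");
    return parts[parts.length - 1];
}

console.log(toAbsoluteSrc("//cdn.example.com/imgs/logo.png")); // → https://cdn.example.com/imgs/logo.png
console.log(fileNameFromSrc("https://cdn.example.com/imgs/logo.png")); // → logo.png
```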

The following is the information printed by $("img") on the Juejin homepage HTML:

(Screenshot: the $("img") output printed for the Juejin homepage)

Download pictures

Before downloading, we need to install another Node.js package: axios. Yes, you read that right, axios is not only for the front end; it can also be used on the back end. Because downloading images requires handling the response as a data stream, responseType is set to stream. Then we can use the pipe method to save the streamed data to a file.

```js
const { default: axios } = require("axios");
const fs = require("fs");

async function downloadFile(uri, name) {
    let dir = "./imgs";
    if (!fs.existsSync(dir)) {
        fs.mkdirSync(dir); // synchronous, no await needed
    }
    let filePath = `${dir}/${name}`;
    let res = await axios({
        url: uri,
        responseType: "stream",
    });
    let ws = fs.createWriteStream(filePath);
    res.data.pipe(ws);
    res.data.on("close", () => {
        ws.close();
    });
}
```

Because there may be many images, if we want to put them all in one folder we need to check whether that folder exists and create it if it does not. Then we use the createWriteStream method to save the fetched data stream into the folder as a file.

Now we can try it out. For example, let's capture the images in the HTML of the Juejin homepage:

```js
// index.js
getImages("https://juejin.cn/")
```

Run the entry file:

```shell
node index.js
```

After execution, we can see that all the images in the static HTML have been captured and saved locally.

Conclusion

Finally, note that this code does not work for SPAs (Single Page Applications). A single-page application has only one, nearly empty, HTML file, and all the page content is rendered dynamically, so there is nothing useful in the static HTML to capture. In that case, no matter what framework is used, you can instead look at the application's data requests directly and collect the information you want from its API responses.
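As a sketch of that idea: if an SPA loads its content from a JSON API, you can parse the API response instead of the HTML. The response shape below is entirely hypothetical; a real site's endpoint and field names would have to be found in the browser's network panel:

```javascript
// Hypothetical JSON payload, as an SPA's data endpoint might return it.
const apiResponse = {
    data: [
        { title: "post-1", cover: "https://cdn.example.com/covers/1.png" },
        { title: "post-2", cover: "" }, // some items may have no cover image
        { title: "post-3", cover: "https://cdn.example.com/covers/3.png" },
    ],
};

// Collect the non-empty image URLs from the payload.
function imageUrlsFromApi(res) {
    return res.data.map(item => item.cover).filter(Boolean);
}

console.log(imageUrlsFromApi(apiResponse)); // → two cover URLs
```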

One more thing: many people use the request library when downloading images. That works, and even takes less code, but I want to point out that this library was deprecated in 2020. It is better to replace it with a library that is still updated and maintained.




Statement: This article is reproduced from juejin.cn. If there is any infringement, please contact admin@php.cn for deletion.