What is the puppeteer crawler? How crawlers work

青灯夜游
2018-11-19 17:58:58

This article introduces the puppeteer crawler and explains how crawlers work. It may be a useful reference for readers who need one.

What is a crawler?

A crawler is also called a web robot. You probably use search engines every day, and crawlers are an important part of them: they fetch the content that gets indexed. Big data and data analysis are very popular nowadays, so where does the data come from? Much of it can be collected by web crawlers. The rest of this article discusses how they work.

[Figure: flow chart of the crawler]

The working principle of the crawler

The figure shows the flow chart of a crawler. A crawl starts from a seed URL. The crawler downloads the web page, then parses and stores its content. At the same time, the URLs found in the parsed page are deduplicated and added to the queue of URLs waiting to be crawled. The crawler then takes the next URL from the queue and repeats the steps above. Simple, isn't it?
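The loop described above can be sketched roughly as follows. This is only a minimal illustration, not a production crawler: `crawl`, `extractLinks`, `fetchPage`, `maxPages` and the in-memory `fakeSite` are all names assumed for this example, and link extraction uses a naive regular expression where a real crawler would use an HTML parser.

```javascript
// Minimal crawl loop: seed URL -> download -> store -> deduplicate -> enqueue.
// `fetchPage` is a hypothetical downloader supplied by the caller; it must
// return the page's HTML as a string (or a Promise of one).
async function crawl(seedUrl, fetchPage, maxPages = 100) {
  const queue = [seedUrl];          // URLs waiting to be crawled
  const seen = new Set([seedUrl]);  // every URL ever queued, for deduplication
  const store = new Map();          // url -> downloaded HTML

  while (queue.length > 0 && store.size < maxPages) {
    const url = queue.shift();      // take the next waiting URL
    const html = await fetchPage(url);
    store.set(url, html);           // "parse and store" reduced to storing the HTML

    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {        // skip URLs that were already queued
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return store;
}

// Naive link extraction with a regular expression (an assumption made to keep
// the sketch short). Relative links are resolved against the page URL.
function extractLinks(html, baseUrl) {
  const links = [];
  for (const match of html.matchAll(/href="([^"]+)"/g)) {
    try {
      links.push(new URL(match[1], baseUrl).href);
    } catch {
      // ignore hrefs that are not valid URLs
    }
  }
  return links;
}

// Usage with an in-memory "site" standing in for real HTTP downloads.
const fakeSite = new Map([
  ['https://example.cn/', '<a href="/a">a</a> <a href="/b">b</a>'],
  ['https://example.cn/a', '<a href="/">home</a> <a href="/b">b</a>'],
  ['https://example.cn/b', '<p>no links</p>'],
]);

crawl('https://example.cn/', async (url) => fakeSite.get(url) || '')
  .then((pages) => console.log([...pages.keys()]));
```

Because the `seen` set records every URL that has ever been queued, the link from `/a` back to the home page is not crawled a second time.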

Breadth-first (BFS) or depth-first (DFS) strategy

As mentioned above, after crawling a web page the crawler selects the next URL from the waiting queue. How should it choose? Should it pick a URL found in the page it has just crawled, or continue with the sibling URLs of the current URL? Sibling URLs here are URLs that came from the same web page. This choice is what distinguishes one crawling strategy from another.

[Figure: relationship diagram of the web pages]

Breadth First Strategy (BFS)

The breadth-first strategy first crawls all the URLs found in the current web page, and only then crawls the URLs found in those pages. If the relationship diagram above represents the links between web pages, BFS crawls in the order (A->(B,D,F,G)->(C,E)).

Depth First Strategy (DFS)

The depth-first strategy crawls a web page and then keeps following the URLs parsed from that page until that branch is exhausted.
(A->B->C->D->E->F->G)
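The two strategies differ only in how the waiting list is consumed, which a small sketch can make concrete. The original figure is not reproduced here, so the `links` graph below is an assumption, picked so that depth-first traversal yields the order given above; `bfsOrder` and `dfsOrder` are illustrative names.

```javascript
// Hypothetical link graph (the original figure is unavailable):
// page A links to B, D, F, G; B links to C; D links to E.
const links = {
  A: ['B', 'D', 'F', 'G'],
  B: ['C'],
  C: [],
  D: ['E'],
  E: [],
  F: [],
  G: [],
};

// Breadth-first: the waiting list is a queue (first in, first out).
function bfsOrder(graph, start) {
  const order = [];
  const seen = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const page = queue.shift();
    order.push(page);
    for (const next of graph[page]) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

// Depth-first: the waiting list is a stack (last in, first out).
function dfsOrder(graph, start) {
  const order = [];
  const seen = new Set();
  const stack = [start];
  while (stack.length > 0) {
    const page = stack.pop();
    if (seen.has(page)) continue;
    seen.add(page);
    order.push(page);
    // Push in reverse so the first link on the page is crawled first.
    for (const next of [...graph[page]].reverse()) {
      if (!seen.has(next)) stack.push(next);
    }
  }
  return order;
}

console.log(bfsOrder(links, 'A').join('->')); // A->B->D->F->G->C->E
console.log(dfsOrder(links, 'A').join('->')); // A->B->C->D->E->F->G
```

Swapping `queue.shift()` for `stack.pop()` is essentially the whole difference: a queue finishes one level of links before going deeper, while a stack follows the most recently discovered link first.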

Download page

Downloading a web page seems very simple: it is like entering a link in the browser, which displays the page once the download completes. In practice, of course, it is not that simple.

Simulated login

Some web pages show their content only after you log in. How does a crawler log in? In essence, logging in means obtaining access credentials (a cookie, a token, ...).

const request = require('request');

let cookie = '';
const j = request.jar(); // cookie jar shared by all requests

async function login() {
  // Reuse the cookie if we have already logged in.
  if (cookie) {
    return cookie;
  }
  return new Promise((resolve, reject) => {
    request.post({
      url: 'url',
      form: {
        m: 'username',
        p: 'password',
      },
      jar: j,
    }, function (err, res, body) {
      if (err) {
        reject(err);
        return;
      }
      // Read back the cookies the server set during login.
      cookie = j.getCookieString('url');
      resolve(cookie);
    });
  });
}

Here is a simple example: log in to obtain the cookie, then send the cookie with every request.
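Sending the cookie with each request could look roughly like the sketch below. It is written under stated assumptions rather than taken from the original: `createAuthedFetch` is a hypothetical helper, `login` is any async function that resolves to a cookie string, and `fetchImpl` defaults to the global `fetch` available in Node 18+ but is injectable so that the example runs offline with stand-ins.

```javascript
// Wraps an HTTP client so every request carries the login cookie.
// `login` and `fetchImpl` are assumptions of this sketch (see the lead-in).
function createAuthedFetch(login, fetchImpl = fetch) {
  let cookiePromise = null; // cached so that login runs only once

  return async function authedFetch(url, options = {}) {
    if (!cookiePromise) {
      // Forget a failed login so that the next request can retry it.
      cookiePromise = login().catch((err) => {
        cookiePromise = null;
        throw err;
      });
    }
    const cookie = await cookiePromise;
    const headers = { ...(options.headers || {}), Cookie: cookie };
    return fetchImpl(url, { ...options, headers });
  };
}

// Usage with stand-ins for the real login and the real HTTP client.
const fakeLogin = async () => 'sid=abc123';
const fakeFetch = async (url, options) => ({ url, sentCookie: options.headers.Cookie });

const authedFetch = createAuthedFetch(fakeLogin, fakeFetch);
authedFetch('https://example.cn/page/1')
  .then((res) => console.log(res.sentCookie));
```

Caching the promise rather than the cookie string means that concurrent requests issued before the first login finishes share a single login call.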

Get web content

Some web content is rendered on the server side. There is no CGI endpoint to request the data from, so the content can only be parsed out of the HTML. For other websites, getting the content is not that simple. On a site like LinkedIn you cannot just download the page: it has to be executed in a browser before the final HTML structure exists. How can this be solved? Since the page needs a browser to execute it, is there a programmable browser? There is: Puppeteer, the open-source headless browser project from the Google Chrome team. With a headless browser you can simulate a user's visit, obtain the final rendered page, and crawl its content.

Use puppeteer to simulate login

const puppeteer = require('puppeteer');

let page; // shared with the crawl functions below

async function login(username, password) {
  const browser = await puppeteer.launch();
  page = await browser.newPage();
  await page.setViewport({ width: 1400, height: 1000 });
  await page.goto('https://example.cn/login');
  console.log(page.url());
  // page.type(selector, text, options) focuses the element before typing.
  await page.type('input[type=text]', username, { delay: 100 });
  await page.type('input[type=password]', password, { delay: 100 });
  // Start waiting for the navigation before clicking, so it is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.$eval('input[type=submit]', el => el.click()),
  ]);
  return page;
}

After executing login(), you can read content from the HTML just as if you had logged in through a browser. Of course, you can also request the CGI directly.

async function crawlData(index, data) {
  // cinfo (company id and encoded name) is assumed to be defined elsewhere.
  let dataUrl = `https://example.cn/company/contacts?count=20&page=${index}&query=&dist=0&cid=${cinfo.cid}&company=${cinfo.encodename}&forcomp=1&searchTokens=&highlight=false&school=&me=&webcname=&webcid=&jsononly=1`;
  await page.goto(dataUrl);
  // ...
}

Some websites expect the same cookie on every request. You can crawl these with a headless browser as well, so you do not have to worry about cookies each time you crawl.

Final thoughts

Of course, there is more to crawlers than this. Much of the work lies in analyzing the target website and finding a suitable crawling strategy. As for puppeteer, it is not only useful for crawlers: because it is a programmable headless browser, it can also be used for automated testing and more.

Statement:
This article is reproduced from segmentfault.com. If there is any infringement, please contact admin@php.cn to have it removed.