Web crawling: a summary of ways to implement web crawlers in PHP

Released: 2016-07-13 10:14:55

Source: http://www.ido321.com/1158.html

To capture content from a web page, we need to parse its DOM tree, find the specified node, and then extract the content we need. Doing all of this by hand is somewhat cumbersome, so I have summarized several commonly used, easy-to-use libraries for web scraping in PHP. If you are familiar with jQuery selectors, these libraries will feel quite simple.
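To make the "parse the DOM tree, then find the node" step concrete, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath classes instead of any of the libraries below. The sample markup and the function name `find_divs_by_class` are assumptions made purely for illustration.

```php
<?php
// Hypothetical helper: return the text of every <div> whose class attribute
// is exactly $class. Uses only PHP's bundled DOM extension.
// Assumption: $class does not contain a double quote.
function find_divs_by_class(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Roughly equivalent to the CSS selector div[class="focus"] when $class is "focus".
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up sample page, standing in for a downloaded document.
$sample = '<html><body><div class="focus">First</div><p>skip</p>'
        . '<div class="focus">Second</div><div class="other">Third</div></body></html>';
print_r(find_divs_by_class($sample, 'focus'));
```

The libraries listed below wrap this kind of lookup behind jQuery-style selectors, which is why they are usually less work than raw XPath.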

1. Ganon

Project address: http://code.google.com/p/ganon/

Documentation: http://code.google.com/p/ganon/w/list

Test: grab all div elements whose class attribute is "focus" on my website's homepage, and output each one's class value

<?php
include 'ganon.php';
$html = file_get_dom('http://www.ido321.com/');
foreach ($html('div[class="focus"]') as $element) {
    echo $element->class, "<br>\n";
}
?>

2. phpQuery

Project address: http://code.google.com/p/phpquery/

Documentation: https://code.google.com/p/phpquery/wiki/Manual

Test: grab the article elements on my website's homepage, then print the HTML content of the h2 tag inside each one

<?php
include 'phpQuery/phpQuery.php';
phpQuery::newDocumentFile('http://www.ido321.com/');
$artlist = pq("article");
foreach ($artlist as $title) {
    echo pq($title)->find('h2')->html() . "<br/>";
}
?>

3. Simple-Html-Dom

Project address: http://simplehtmldom.sourceforge.net/

Documentation: http://simplehtmldom.sourceforge.net/manual.htm

Test: crawl all links on the homepage of my website

<?php
include 'simple_html_dom.php';
// A DOM can be created from either a URL or a file
$html = file_get_html('http://www.ido321.com/');

// Find all images
// foreach ($html->find('img') as $element)
//     echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
?>

4. Snoopy

Project address: http://sourceforge.net/projects/snoopy/

Documentation: the README file bundled with the Snoopy download

Test: fetch my website's homepage

<?php
include 'Snoopy.class.php';
$url = 'http://www.ido321.com';
$snoopy = new Snoopy;
$snoopy->fetch($url);   // fetch the whole page
echo $snoopy->results;  // print the result
// The calls below also store their output in $snoopy->results:
// $snoopy->fetchtext($url);   // fetch the text content only (HTML stripped)
// $snoopy->fetchlinks($url);  // fetch the links
// $snoopy->fetchform($url);   // fetch the forms
?>

5. Writing a crawler by hand

If you have solid coding skills, you can write a web crawler by hand. There are countless articles online that describe this approach, so I won't go into detail here. If you want to know more, search Baidu for "PHP web scraping".
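As a rough idea of what a hand-written crawler involves, here is a minimal sketch built on PHP's bundled DOM extension. It is not taken from the original article: the function names `extract_links` and `crawl`, the injectable `$fetch` callable, and the `example.test` pages are all assumptions chosen so the loop can run without network access.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

// Hypothetical breadth-first crawl. $fetch is any callable that maps a URL to
// an HTML string (or null on failure). Returns the URLs visited, in order.
function crawl(string $startUrl, callable $fetch, int $maxPages = 10): array
{
    $queue = [$startUrl];
    $visited = [];
    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already processed
        }
        $visited[$url] = true;
        $html = $fetch($url);
        if ($html === null || $html === '') {
            continue; // fetch failed or page is empty
        }
        foreach (extract_links($html) as $link) {
            // Only absolute http(s) links are followed; resolving relative URLs is left out.
            if (preg_match('#^https?://#i', $link) && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

// In-memory stand-in for the web; these pages are made up.
$pages = [
    'http://example.test/'  => '<a href="http://example.test/a">A</a> <a href="/relative">R</a>',
    'http://example.test/a' => '<a href="http://example.test/">Home</a> <a href="http://example.test/b">B</a>',
    'http://example.test/b' => '<p>No links here</p>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? null;
};
print_r(crawl('http://example.test/', $fetch));
```

For real pages, `$fetch` could wrap `file_get_contents($url)` or a cURL call. Keeping the download step separate from the crawl loop also makes the loop easy to test.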

P.S. Resource sharing

For common open source crawler projects, please visit: http://blog.chinaunix.net/uid-22414998-id-3774291.html

Using a PHP web crawler to collect part of a website's content

OP, you can use the simple_html_dom class to collect the data. As for how to use it: if you know jQuery, I believe you will understand it at a glance. Good luck.

Extracting keywords and abstracts from crawled pages for search

Use strip_tags($string) to remove the HTML tags from the fetched content.
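As an illustration of how strip_tags() might be used to build a short abstract from crawled HTML, here is a minimal sketch. The helper name `make_abstract`, the byte-based length limit, and the sample markup are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: turn an HTML fragment into a plain-text abstract of
// roughly $maxLength bytes (byte-based, so mostly-ASCII text is assumed).
function make_abstract(string $html, int $maxLength = 60): string
{
    // Put a space before each tag so words in adjacent elements do not run
    // together once the tags are removed.
    $spaced = str_replace('<', ' <', $html);
    // strip_tags() removes the markup; html_entity_decode() turns &amp; etc. back into characters.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES, 'UTF-8');
    // Collapse the runs of whitespace left behind.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $maxLength) {
        return $text;
    }
    // Cut at the last space before the limit so a word is not split in half.
    $cut = substr($text, 0, $maxLength);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . '...';
}

// Made-up sample fragment.
$sample = '<h2>PHP &amp; crawling</h2><p>Fetch a page, <b>strip</b> the tags, keep the text.</p>';
echo make_abstract($sample, 30), "\n";
```

Keyword extraction would need more than this (for example, word-frequency counting on the stripped text), but the tag-stripping step stays the same.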