PHP implements crawling Baidu search results and analyzes the data structure

PHP中文网 (reposted)
2020-09-24 18:04:03, 6356 views

PHP web crawler practice: crawl Baidu search results and analyze the data structure

Baidu's search engine has an anti-crawler mechanism. I will first try fetching the page directly with Guzzle. The code is as follows:

<?php
/**
 * Created by Benjiemin
 * Date: 2020/3/5
 * Time: 14:58
 */
require './vendor/autoload.php';
use QL\QueryList;
// Open the page, keeping cookies in a shared jar
$jar = new \GuzzleHttp\Cookie\CookieJar;
$client = new \GuzzleHttp\Client(['cookies' => true]);
$ql = $client->request('GET', 'https://www.baidu.com', [
    'cookies' => $jar
]);
if ($ql->getStatusCode() != 200) {
    echo 'Unexpected site status';
    die;
}
echo $ql->getBody();

Baidu intercepted the request and served a redirect page instead. Next I add browser-like request headers and try again.

The modified request is as follows:

$ql = $client->request('GET', 'https://www.baidu.com', [
    'cookies' => $jar,
    'headers' => [
        'Accept-Encoding' => 'gzip, deflate, br',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language' => 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cache-Control' => 'no-cache',
        'Connection' => 'keep-alive',
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    ]
]);

I tested it and the page loaded.
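
A 200 status code alone does not prove that the real page came back, because the first attempt above also passed the status check while showing the redirect page. Below is a minimal sketch of a stricter check. It assumes the genuine home page contains the search box with id `kw`, the same element the browser script later in this article types into. `looksLikeBaiduHome` is a hypothetical helper name, and both HTML strings are made-up samples.

```php
<?php
// Hypothetical helper: returns true when the HTML contains an <input id="kw">,
// the search box that the browser script later in this article types into.
function looksLikeBaiduHome(string $html): bool
{
    if (trim($html) === '') {
        return false;
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    return $xpath->query('//input[@id="kw"]')->length > 0;
}

// Made-up sample inputs for illustration only.
$homePage = '<html><body><form><input id="kw" name="wd"></form></body></html>';
$blockedPage = '<html><body><p>Verification required</p></body></html>';

var_dump(looksLikeBaiduHome($homePage));    // bool(true)
var_dump(looksLikeBaiduHome($blockedPage)); // bool(false)
```

With Guzzle, the check would be applied to `(string) $ql->getBody()`.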

Continuing, I entered a keyword and searched, but the request was blocked by Baidu's security check. Plain GuzzleHttp was clearly not going to be enough, so I switched to my go-to tools: jaeger/querylist and jaeger/querylist-puppeteer.

Installation steps:

1. Install dependencies

Before starting, make sure PHP's proc_open function is enabled; otherwise the installation cannot complete.

composer require jaeger/querylist
composer require jaeger/querylist-puppeteer

2. Install nodejs

yum install nodejs

3. Install npm

4. Install @nesk/puphpeteer

npm install @nesk/puphpeteer

5. Enable proc_open in PHP
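
proc_open is part of PHP core and is typically unavailable only when it is listed in the `disable_functions` directive of php.ini. Here is a minimal sketch for checking this before installing; `isFunctionUsable` is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: reports whether a function exists and is not
// listed in the disable_functions directive of php.ini.
function isFunctionUsable(string $name): bool
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return function_exists($name) && !in_array($name, $disabled, true);
}

if (isFunctionUsable('proc_open')) {
    echo "proc_open is enabled\n";
} else {
    // php_ini_loaded_file() returns false when no php.ini is loaded.
    $iniFile = php_ini_loaded_file() ?: 'php.ini';
    echo "proc_open is disabled; remove it from disable_functions in {$iniFile}\n";
}
```

Run the script with the same PHP binary that will run the crawler, since the CLI and a web server can load different php.ini files.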

The crawler code is as follows:

<?php
/**
 * Created by Benjiemin
 * Date: 2020/3/5
 * Time: 14:58
 */
require './vendor/autoload.php';
use QL\QueryList;
use QL\Ext\Chrome;
$ql = QueryList::getInstance();
// Register the plugin; the default registered method name is: chrome
$ql->use(Chrome::class);
$ql->chrome(function ($page, $browser) {
    $page->goto('https://www.baidu.com');
    // Deliberately long delay so you can watch the Chrome browser start up
    sleep(3);
    // Type the keyword
    $wd = '简庆旺博客';
    $page->type("input[id='kw']", $wd);
    sleep(1);
    // Click the search button
    $page->click("input[type='submit']");
    // Wait for the search results
    sleep(3);
    // Get the rendered page
    $html = $page->content();
    // Extract the results with jQuery-style selectors
    $rules = array(
        'title' => ['#content_left h3 a', 'text'], // title
        'url' => ['#content_left h3 a', 'href'], // redirect URL
        'description' => ['div .c-abstract', 'text'], // description
    );
    $ql = QueryList::html($html);
    $rt = $ql->rules($rules)->query()->getData();
    // Print the result set; if needed, store $rt in a database or process it further here
    print_r($rt->all());
    sleep(10);
    $browser->close();
    // The return value must be the page's HTML content
    return $html;
}, [
    'headless' => false, // Launch a visible Chrome window for easier debugging
    'devtools' => false, // Whether to open the browser's developer tools
])->find('title')->text();

$rt is my result set. Each entry carries the title, url, and description fields defined in $rules.
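
To make the shape of the result set concrete, here is a minimal sketch that applies the same three selectors to a made-up HTML fragment. It uses PHP's built-in DOMDocument and DOMXPath instead of QueryList so that it runs without dependencies. The sample markup, the `extractResults` name, and the XPath translations of the CSS selectors are assumptions for illustration; Baidu's real markup may differ.

```php
<?php
// Hypothetical helper: extracts title, url and description lists from HTML
// and combines them by position, one row per search result.
function extractResults(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The XML declaration hints UTF-8 so non-ASCII text is decoded correctly.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);

    // Assumed XPath equivalents of '#content_left h3 a' and 'div .c-abstract'.
    $links = $xpath->query('//*[@id="content_left"]//h3//a');
    $abstracts = $xpath->query(
        '//div//*[contains(concat(" ", normalize-space(@class), " "), " c-abstract ")]'
    );

    $rows = [];
    for ($i = 0; $i < $links->length; $i++) {
        $link = $links->item($i);
        $abstract = $abstracts->item($i); // null when there is no matching abstract
        $rows[] = [
            'title' => trim($link->textContent),
            'url' => $link->getAttribute('href'),
            'description' => $abstract ? trim($abstract->textContent) : null,
        ];
    }
    return $rows;
}

// Made-up sample markup for illustration only.
$sample = <<<HTML
<div id="content_left">
  <div class="result">
    <h3><a href="https://example.com/a">First result</a></h3>
    <div class="c-abstract">Summary of the first result.</div>
  </div>
  <div class="result">
    <h3><a href="https://example.com/b">Second result</a></h3>
    <div class="c-abstract">Summary of the second result.</div>
  </div>
</div>
HTML;

print_r(extractResults($sample));
```

Each row is an associative array with title, url, and description keys. Because this sketch pairs the lists by position, the descriptions drift out of step when a result has no abstract; extracting per result container would avoid that.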

Statement:
This article is reproduced from cnblogs.com. If there is any infringement, please contact admin@php.cn for removal.