PHP and phpSpider: How to deal with the JS challenge of website anti-crawling?
As Internet technology has developed, websites' defenses against crawler scripts have become more and more powerful. Websites often rely on Javascript for anti-crawling, because Javascript can generate page content dynamically, making it difficult for simple crawler scripts to obtain complete data. This article introduces how to use PHP and phpSpider to deal with the JS challenge of website anti-crawling.
phpSpider is a lightweight crawler framework based on PHP. It provides a simple, easy-to-use API and a rich feature set, suitable for all kinds of web page crawling tasks. Its advantage is that it can simulate browser behavior, including executing Javascript code, which allows us to bypass a website's JS anti-crawler mechanism.
First, we need to install phpSpider. You can install it through Composer by executing the following command in the project directory:
composer require dungsit/php-spider
After the installation is complete, we can use phpSpider to write crawler scripts in the project.
First, we need to create a new phpSpider instance and set the target URL to crawl, the HTTP headers, and so on. Here is an example:
<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;

$configs = array(
    'name'      => 'example',
    'log_show'  => true,
    'domains'   => array(
        'example.com',
    ),
    'scan_urls' => array(
        'http://www.example.com',
    ),
    'list_url_regexes' => array(
        "http://www.example.com/\w+",
    ),
    'content_url_regexes' => array(
        "http://www.example.com/[a-z]+/\d+",
    ),
    'fields' => array(
        array(
            'name'     => 'title',
            'selector' => '//h1',
            'required' => true,
        ),
        array(
            'name'     => 'content',
            'selector' => '//div[@class="content"]',
            'required' => true,
        ),
    ),
);

$spider = new phpspider($configs);
$spider->start();
In the example above, the scan_urls field specifies the starting page URL(s) to crawl, the list_url_regexes field specifies the URL regular expression of list pages, and the content_url_regexes field specifies the URL regular expression of content pages. In the fields array, each entry sets the name of a field to capture, its selector, and whether the field is required.
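As a quick stand-alone sanity check (a hypothetical helper, not part of phpSpider), the two URL patterns above can be exercised with preg_match. The delimiters and anchors below are added for this check; note that the backslashes in \w and \d must survive PHP string quoting:

```php
<?php
// Hypothetical stand-alone check of the URL patterns used above.
// phpSpider treats list_url_regexes / content_url_regexes as regular expressions.
$list_regex    = '#^http://www\.example\.com/\w+$#';
$content_regex = '#^http://www\.example\.com/[a-z]+/\d+$#';

var_dump(preg_match($list_regex, 'http://www.example.com/news'));       // int(1): list page
var_dump(preg_match($content_regex, 'http://www.example.com/news/42')); // int(1): content page
var_dump(preg_match($content_regex, 'http://www.example.com/news'));    // int(0): not a content page
```

If the backslashes are dropped (e.g. "http://www.example.com/w+"), the pattern matches only a literal "w" followed by plus signs, which is a common copy-paste bug with these configs.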
Since our goal is to bypass the website's JS anti-crawler mechanism, we need a phpSpider plug-in that executes Javascript code. The ExecuteJsPlugin plug-in provides this capability by simulating a browser environment, building on the browser-emulation library Goutte. Here is an example of using the ExecuteJsPlugin plugin in phpSpider:
<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;
use phpspider\plugins\execute_js\ExecuteJsPlugin;

// Set the target site's domain and user agent (UA)
requests::set_global('domain', 'example.com');
requests::set_global('user_agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3');

$configs = array(
    'name'      => 'example',
    'log_show'  => true,
    'domains'   => array(
        'example.com',
    ),
    'scan_urls' => array(
        'http://www.example.com',
    ),
    'list_url_regexes' => array(
        "http://www.example.com/\w+",
    ),
    'content_url_regexes' => array(
        "http://www.example.com/[a-z]+/\d+",
    ),
    'fields' => array(
        array(
            'name'     => 'title',
            'selector' => '//h1',
            'required' => true,
        ),
        array(
            'name'     => 'content',
            'selector' => '//div[@class="content"]',
            'required' => true,
        ),
    ),
    'plugins' => array(
        new ExecuteJsPlugin(),
    ),
);

$spider = new phpspider($configs);
$spider->start();
In the example above, we first import the ExecuteJsPlugin plugin. We then set the domain name and user agent (UA) of the target website so that phpSpider simulates browser requests when visiting it. Finally, we register an ExecuteJsPlugin instance in the plugins field.
With this plug-in enabled, the selectors in fields can match elements that only exist after the page's Javascript has run. For example, the XPath selector '//div[@class="content"]/q' selects the q child elements of div elements whose class attribute is "content". Because the plugin renders the page first, phpSpider can extract data that a plain HTTP request would miss.
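A fields entry using that selector might look like the sketch below; the field name is a placeholder, and the repeated flag (which tells phpSpider to capture every matching element rather than just the first) is used here under the assumption that several q elements exist:

```php
'fields' => array(
    array(
        'name'     => 'quotes', // placeholder field name
        // XPath: <q> children of <div class="content">, available
        // only after ExecuteJsPlugin has run the page's Javascript
        'selector' => '//div[@class="content"]/q',
        'required' => false,
        'repeated' => true, // capture every matching element
    ),
),
```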
To sum up, the phpSpider framework together with the ExecuteJsPlugin plug-in lets us deal with the JS challenge of website anti-crawling. By simulating browser behavior, we can bypass the website's JS anti-crawler mechanism and obtain the required data. I hope this article is helpful for your crawler development.
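phpSpider also exposes callbacks for post-processing captured data; for example, an on_extract_field callback (one of phpSpider's hooks, shown here as a sketch) can clean up a field after the Javascript-rendered page has been parsed:

```php
<?php
// Sketch: tidy up the extracted "content" field.
// phpSpider invokes this callback once per captured field;
// $fieldname and $data are supplied by the framework.
$spider->on_extract_field = function ($fieldname, $data, $page) {
    if ($fieldname === 'content') {
        $data = trim(strip_tags($data)); // drop markup, trim whitespace
    }
    return $data;
};
```

Register the callback on the $spider instance before calling $spider->start().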
Code sample source: https://github.com/nmred/phpspider
The above is the detailed content of PHP and phpSpider: How to deal with the JS challenge of website anti-crawling?. For more information, please follow other related articles on the PHP Chinese website!