How to use PHP and phpSpider to crawl the entire website content?
In the modern Internet era, acquiring information has become increasingly important. For projects that need large amounts of data, crawling a site's full content is an effective approach. phpSpider has matured into a powerful PHP crawler framework that helps developers collect website data more conveniently. This article explains how to use PHP and phpSpider to crawl an entire website's content, with corresponding code examples.
1. Preliminary preparations
Before we start, we need to install PHP and Composer.
php -r "copy('https://install.phpcomposer.com/installer', 'composer-setup.php');"
php composer-setup.php
php -r "unlink('composer-setup.php');"
Then enter your project directory and initialize it with Composer:

cd your-project
composer init
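You can confirm that both tools are available on your PATH before continuing (your-project above is just a placeholder directory name):

php -v
composer --version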
2. Install phpSpider
In the project directory, run the following command to install phpSpider:
composer require phpspider/phpspider
3. Write the code
Now we can start writing the crawl script. Here's an example that crawls the entire site of a given website.
<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;
use phpspider\core\selector;

$configs = array(
    'name' => 'Full-site content crawl',
    'log_show' => true,
    // Only URLs under these domains will be crawled.
    'domains' => array(
        'example.com',
        'www.example.com'
    ),
    // Entry pages where the crawl starts.
    'scan_urls' => array(
        'http://www.example.com'
    ),
    // URL patterns for list (category) pages.
    'list_url_regexes' => array(
        "http://www.example.com/category/.*"
    ),
    // URL patterns for content (article) pages.
    'content_url_regexes' => array(
        "http://www.example.com/article/\d+\.html"
    ),
    // Fields to extract from each content page.
    'fields' => array(
        array(
            'name' => 'title',
            'selector' => "//title",
            'required' => true
        ),
        array(
            'name' => 'content',
            'selector' => "//div[@class='content']",
            'required' => true
        )
    )
);

$spider = new phpspider($configs);

// Post-process each extracted field: strip HTML tags from the content field.
$spider->on_extract_field = function($fieldName, $data) {
    if ($fieldName == 'content') {
        $data = strip_tags($data);
    }
    return $data;
};

$spider->start();
In the above code, we first include the phpspider library and define the crawl configuration. In the configuration, 'domains' lists the domain names that are allowed to be crawled, 'scan_urls' lists the starting pages for the crawl, and 'list_url_regexes' and 'content_url_regexes' specify the URL patterns for list pages and content pages respectively.
Next, we define the fields to be extracted: 'name' specifies the field name, 'selector' specifies the field's XPath (or CSS) selector in the web page, and 'required' specifies whether the field is mandatory.
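If you want to try a selector outside of a full crawl, the selector class imported in the script exposes a static select() method. The snippet below is a minimal sketch, assuming a select($html, $expression) signature with XPath as the default selector type; the HTML fragment is purely hypothetical:

<?php
require 'vendor/autoload.php';

use phpspider\core\selector;

// Hypothetical HTML fragment, used only to try out the XPath expressions from the config above.
$html = "<html><head><title>Example article</title></head>"
      . "<body><div class='content'><p>Hello world</p></div></body></html>";

// Apply the same XPath expressions used in the 'fields' configuration.
echo selector::select($html, "//title") . "\n";
echo selector::select($html, "//div[@class='content']") . "\n";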
During the fetching process, we can process the fetched fields through the $spider->on_extract_field callback function. In the above example, we removed the HTML tags in the content field through the strip_tags function.
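The same hook can clean several fields at once. For instance, here is a sketch that also normalizes whitespace in the title field (not part of the original script):

$spider->on_extract_field = function($fieldName, $data) {
    if ($fieldName == 'content') {
        // Remove HTML tags from the article body.
        $data = strip_tags($data);
    } elseif ($fieldName == 'title') {
        // Collapse runs of whitespace and trim the page title.
        $data = trim(preg_replace('/\s+/', ' ', $data));
    }
    return $data;
};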
Finally, we start the crawler through the $spider->start() method.
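If you would rather persist the results than only print them, phpSpider's configuration also accepts an 'export' entry in its demo scripts. The fragment below is a sketch assuming CSV export is supported by your phpSpider version, with a hypothetical output path:

$configs = array(
    // ... the same settings as above ...
    'export' => array(
        'type' => 'csv',               // assumed export type; check your phpSpider version's docs
        'file' => './data/example.csv' // hypothetical output path
    )
);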
4. Run the script
In the command line, enter the project directory and run the crawl script you just wrote:
php your_script.php
The script will start crawling the entire site content of the specified website and output the results to the command-line window.
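Because a full-site crawl can produce a large amount of log output, you may want to redirect it to a file instead; this is a generic shell pattern, not a phpSpider feature:

php your_script.php > crawl.log 2>&1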
Summary
By using PHP and phpSpider, we can easily crawl the entire website content. When writing a crawl script, we need to define the crawl configuration and set the corresponding XPath or CSS selector according to the web page structure. At the same time, we can also process the captured data through callback functions to meet specific needs.