When it comes to writing a crawler, most people think of Python first. In fact, PHP can also be used to write crawlers. PHP has always been simple and easy to use: in my own test, I was able to write a simple crawler in about 10 minutes using the PHPspider framework.
1. PHP environment installation
Like Python, PHP needs a runtime environment. You can use PHP downloaded from the official website, or an integrated environment such as XAMPP or PHPstudy. An integrated environment is recommended, because it saves you from installing the MySQL database separately.
2. Composer installation
Composer is a dependency management tool for PHP, similar to pip in Python.
The Chinese official website is https://www.phpcomposer.com/. Just download and install it. Then press Win+R, run cmd, and enter the composer command. If Composer prints its version and a list of available commands, the installation is successful.
3. PHPspider installation
Create a folder in any location. For example, to capture data from Jianshu, you can create a jianshu folder on the D drive, then enter the folder in cmd and run the command:
composer require owner888/phpspider
If Composer finishes without errors and a vendor folder appears in the jianshu folder, the installation was successful.
4. Start writing the first crawler
Now open the jianshu folder and you will find some new files and folders in it, created by Composer. You can ignore them. Create a PHP file in this folder, for example spider.php, and start coding.
The development documentation is here: https://doc.phpspider.org/demo-start.html
I won't cover the basics here and will go straight to the code, because this is a 10-minute quick tutorial.
The matching method uses XPath syntax.
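To get a feel for XPath matching on its own, outside the framework, the following minimal sketch uses PHP's built-in DOMDocument and DOMXPath classes. The HTML fragment is made up for illustration and is not Jianshu's actual markup; the two expressions are the same ones used as selectors in the spider configuration of this tutorial.

```php
<?php
// Made-up HTML fragment for illustration only; not Jianshu's actual markup.
$html = <<<'HTML'
<html><body>
  <h1 class="title">Hello crawler</h1>
  <div class="show-content-free"><p>First paragraph.</p></div>
  <div class="sidebar">Ignored</div>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly formed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The same expressions that appear as 'selector' values in the spider config.
$titles = $xpath->query("//h1[@class='title']");
$bodies = $xpath->query("//div[@class='show-content-free']");

// textContent returns the text of the node and all of its descendants.
$title = trim($titles->item(0)->textContent);
$content = trim($bodies->item(0)->textContent);

echo $title, PHP_EOL;   // Hello crawler
echo $content, PHP_EOL; // First paragraph.
```

The div with class sidebar is not returned by either query, because its class value does not equal the one in the expression.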
<?php
// Load Composer's autoloader from the vendor folder next to this file.
require __DIR__ . '/vendor/autoload.php';

use phpspider\core\phpspider;

/* Do NOT delete this comment */

$configs = array(
    'name' => '简书', // Jianshu
    'log_show' => false,
    'tasknum' => 1,
    // Database configuration
    'db_config' => array(
        'host' => '127.0.0.1',
        'port' => 3306,
        'user' => 'root',
        'pass' => '',
        'name' => 'demo',
    ),
    'export' => array(
        'type' => 'db',
        // If no rows are added to the table, check that the table
        // structure and the field names match.
        'table' => 'jianshu',
    ),
    // List of domains to crawl
    'domains' => array(
        'jianshu.com',
        'www.jianshu.com'
    ),
    // Starting point of the crawl
    'scan_urls' => array(
        'https://www.jianshu.com/c/V2CqjW?utm_medium=index-collections&utm_source=desktop'
    ),
    // List page URL patterns
    'list_url_regexes' => array(
        "https://www.jianshu.com/c/\d+"
    ),
    // Content page URL patterns
    // \d+ stands for the variable part of the URL
    'content_url_regexes' => array(
        "https://www.jianshu.com/p/\d+",
    ),
    'max_try' => 5,
    // Fields to extract from each content page
    'fields' => array(
        array(
            'name' => "title",
            'selector' => "//h1[@class='title']",
            'required' => true,
        ),
        array(
            'name' => "content",
            'selector' => "//div[@class='show-content-free']",
            'required' => true,
        ),
    ),
);

$spider = new phpspider($configs);
$spider->start();
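The URL patterns in the configuration are regular expressions, so it can help to check which URLs a pattern accepts before starting a crawl. The sketch below does this with PHP's preg_match. The helper urlMatches is hypothetical, the sample URLs are made up, and applying the pattern unanchored with # as the delimiter is an assumption about how the framework uses it, not something confirmed by its documentation.

```php
<?php
// Hypothetical helper: reports whether a URL matches a pattern.
// Assumption: the pattern is applied as an unanchored PCRE regex.
function urlMatches(string $pattern, string $url): bool
{
    return preg_match('#' . $pattern . '#', $url) === 1;
}

// Same pattern as in content_url_regexes.
$contentPattern = 'https://www.jianshu.com/p/\d+';

// Made-up sample URLs: one numeric identifier, one made of letters.
$numeric = urlMatches($contentPattern, 'https://www.jianshu.com/p/12345');
$letters = urlMatches($contentPattern, 'https://www.jianshu.com/p/abcdef');

var_dump($numeric); // bool(true)
var_dump($letters); // bool(false)
```

Note that \d+ only matches digits. An identifier made of letters, like the V2CqjW in the scan URL, would not match it, so if the spider finds no pages, a broader pattern such as [0-9a-zA-Z]+ may be worth trying.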
Let's briefly explain what the selectors mean:
//h1[@class='title']
Gets all h1 nodes whose class value is title.
//div[@class='show-content-free']
Gets all div nodes whose class value is show-content-free.
After finishing the code, remember to create the corresponding database and data table according to the content to be captured. The table's columns must match the field names.
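As a rough sketch of what a matching table could look like, the snippet below builds a CREATE TABLE statement whose columns mirror the field names title and content. The helper buildCreateTableSql, the extra id column, the TEXT column type, and the utf8mb4 charset are all assumptions for illustration; adjust them to your data.

```php
<?php
// Hypothetical helper: builds a CREATE TABLE statement whose columns
// mirror the 'name' values of the spider's fields.
function buildCreateTableSql(string $table, array $fieldNames): string
{
    // An auto-increment id column is an assumed convenience, not a requirement.
    $columns = array('`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY');
    foreach ($fieldNames as $name) {
        // TEXT is an assumed type; choose one that fits the captured data.
        $columns[] = '`' . $name . '` TEXT';
    }
    return 'CREATE TABLE IF NOT EXISTS `' . $table . '` ('
        . implode(', ', $columns)
        . ') DEFAULT CHARSET=utf8mb4';
}

$sql = buildCreateTableSql('jianshu', array('title', 'content'));
echo $sql, PHP_EOL;

// To run it against the 'demo' database from db_config (assumes MySQL is
// running, the database exists, and the pdo_mysql extension is enabled):
// $pdo = new PDO('mysql:host=127.0.0.1;port=3306;dbname=demo', 'root', '');
// $pdo->exec($sql);
```

The printed statement can also be pasted into phpMyAdmin or any other MySQL client instead of being run through PDO.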
Then open cmd and enter:
php -f d:\jianshu\spider.php
Once the run finishes, open the database table and take a look. Has everything been captured?