PHP crawler practice: extract required data from Baidu search results
With the rapid development of the Internet, the era of information explosion has arrived. In an era like this, search engines have become our main tool for obtaining information, and the massive amounts of data provided by these search engines are beyond our imagination. However, for researchers or data analysts in some specific fields, the information they need may only be a small part of the data in these search results. In this case, we need to use a crawler to get exactly the data we want.
In this article, we will use PHP to write a simple crawler program to extract the data we need from Baidu search results. The core of this program is to use PHP's cURL library to simulate HTTP requests, and then use regular expressions and other methods to parse the HTML page.
Before we start writing the crawler program, we need to clarify a few questions: what data we want to obtain, which URL we need to request, and what format the data we are going to parse is in.
To decide what data we need, let's take the keyword "PHP crawler" as an example and search for it on Baidu. We can then define our goal as extracting the title, description, and URL of each result from the Baidu search results.
The first step is to determine the URL we need to request. In our example, it is https://www.baidu.com/s?wd=php crawler. Typing "php crawler" into the Baidu search bar takes us to this URL.
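The URL above contains a space, so a program has to percent-encode the keyword before sending the request. The following is a minimal sketch of one way to do that; the helper name build_search_url is hypothetical and does not come from the original program.

```php
<?php
// Hypothetical helper (an assumption, not part of the original article):
// build the Baidu search URL for a keyword.
function build_search_url(string $keyword): string
{
    // PHP_QUERY_RFC3986 makes http_build_query() encode a space as %20.
    $query = http_build_query(['wd' => $keyword], '', '&', PHP_QUERY_RFC3986);
    return 'https://www.baidu.com/s?' . $query;
}

echo build_search_url('php crawler'), PHP_EOL;
// prints https://www.baidu.com/s?wd=php%20crawler
```

Using http_build_query rather than string concatenation also keeps non-ASCII keywords, such as Chinese search terms, correctly encoded.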
Next, we need to understand the format of the data we are going to parse. In our case, the search results exist in the form of HTML code similar to the following:
www.example.com PHP 爬虫是什么? - PHP 入门教程 - 极客学院
In the above HTML code snippet, each search result is nested within its own container tag. Each search result has a title, with the link address nested within the link tag inside it. Each search result also has a description, and a URL contained in a tag with class="c-showurl".
Now that we have clarified the format of the data we want to obtain and the format of the HTML data we need to parse, we can start writing our crawler program.
We divide our PHP crawler code into three steps: sending the HTTP request to obtain the HTML page, parsing the HTML page, and returning the results.
We can use PHP's cURL library to send an HTTP request and obtain the HTML page of Baidu search results. In this example, we store the URL of the search page in the $url variable. We then create a cURL handle and set its options: the URL, the request headers (to simulate a request sent by a browser), a proxy if needed, the timeout, and the GET request method. Finally, we execute the handle to obtain the HTML page.
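The request step described above might look roughly like the following sketch. The function name fetch_html, the header values, the timeouts, and the commented-out proxy address are all illustrative assumptions rather than the original program's settings.

```php
<?php
// Sketch of the request step. fetch_html is a hypothetical name; the header
// values and timeouts below are assumptions chosen for illustration.
function fetch_html(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPGET        => true,   // request method: GET
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,      // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds for the whole request
        CURLOPT_ENCODING       => '',     // accept any encoding cURL supports
        CURLOPT_HTTPHEADER     => [       // browser-like request headers
            'User-Agent: Mozilla/5.0 (compatible; ExampleCrawler/1.0)',
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8',
        ],
        // CURLOPT_PROXY => 'http://127.0.0.1:8080', // hypothetical proxy
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $html;
}

// Usage (performs a real network request, so it is left commented out):
// $url  = 'https://www.baidu.com/s?wd=' . rawurlencode('php crawler');
// $html = fetch_html($url);
```

Setting CURLOPT_RETURNTRANSFER is what makes curl_exec return the page as a string. Without it the body is written directly to the output and there is nothing to parse.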
Parse HTML page
After obtaining the HTML page of Baidu search results, we need to parse it to obtain the information we need. In this example, we will use PHP's regular expressions to parse an HTML page.
The following is the regular expression we use to extract the title, description and link from the HTML page:
.*?.*? \s*(.*?)\s*.*?.*? (.*?)
In the above code, we use a PHP regular expression to match all search results. We then loop over the matches and extract the title, description, and link of each one. Since the title and description extracted from the HTML still contain HTML tags, we use the strip_tags function to remove them.
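To make the parsing step concrete, here is a small sketch that runs offline against a hand-written HTML fragment. The fragment, the tag names in the pattern (h3, and div with class "c-abstract"), and the function name parse_results are assumptions made for illustration; only class="c-showurl" is taken from the description above, and Baidu's live markup may differ.

```php
<?php
// Sketch of the parsing step. The markup assumed by $pattern is hypothetical,
// apart from class="c-showurl", which the article mentions.
function parse_results(string $html): array
{
    $pattern = '~<h3[^>]*>\s*<a[^>]*>(.*?)</a>\s*</h3>'           // 1: title
             . '.*?<div[^>]*class="c-abstract"[^>]*>(.*?)</div>'  // 2: description
             . '.*?<a[^>]*class="c-showurl"[^>]*>(.*?)</a>~s';    // 3: URL

    // PREG_SET_ORDER groups the captures per search result.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $results = [];
    foreach ($matches as $m) {
        $results[] = [
            // strip_tags() removes inline markup such as <em> highlights.
            'title'       => trim(strip_tags($m[1])),
            'description' => trim(strip_tags($m[2])),
            'url'         => trim(strip_tags($m[3])),
        ];
    }
    return $results;
}

// Hand-written sample input (an assumption, not real Baidu output).
$sample = <<<'HTML'
<div class="result">
  <h3 class="t"><a href="https://www.example.com/a">PHP <em>crawler</em> basics</a></h3>
  <div class="c-abstract">How to write a <em>crawler</em> in PHP.</div>
  <a class="c-showurl" href="https://www.example.com/a">www.example.com/a</a>
</div>
<div class="result">
  <h3 class="t"><a href="https://www.example.org/b">Parsing HTML with regular expressions</a></h3>
  <div class="c-abstract">Notes on preg_match_all.</div>
  <a class="c-showurl" href="https://www.example.org/b">www.example.org/b</a>
</div>
HTML;

print_r(parse_results($sample));
```

Regular expressions are brittle against markup changes, so a pattern like this usually needs adjusting whenever the page layout changes.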
We have now obtained the data we need; all that remains is to return it. We encapsulate the entire crawler program in a function that returns the results as an array.
The function receives a keyword as a parameter. Calling it returns the titles, descriptions, and links of the Baidu search results for that keyword.
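A function of this shape could be sketched as follows. The names search_baidu and default_fetch, the optional $fetch parameter, the header value, and the markup assumed by the pattern are all hypothetical. The injectable fetcher is a design choice of this sketch, added so that the example can run without network access.

```php
<?php
// Sketch only: all names and the assumed markup are illustrative assumptions.

// Default fetcher: performs the real HTTP request with cURL.
function default_fetch(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_HTTPHEADER     => ['User-Agent: Mozilla/5.0 (compatible; ExampleCrawler/1.0)'],
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $html;
}

// Takes a keyword, returns a list of ['title', 'description', 'url'] arrays.
function search_baidu(string $keyword, ?callable $fetch = null): array
{
    $fetch = $fetch ?? 'default_fetch';
    $url   = 'https://www.baidu.com/s?'
           . http_build_query(['wd' => $keyword], '', '&', PHP_QUERY_RFC3986);
    $html  = $fetch($url);

    // Assumed markup: h3 title link, c-abstract description, c-showurl URL.
    $pattern = '~<h3[^>]*>\s*<a[^>]*>(.*?)</a>\s*</h3>'
             . '.*?<div[^>]*class="c-abstract"[^>]*>(.*?)</div>'
             . '.*?<a[^>]*class="c-showurl"[^>]*>(.*?)</a>~s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $results = [];
    foreach ($matches as $m) {
        $results[] = [
            'title'       => trim(strip_tags($m[1])),
            'description' => trim(strip_tags($m[2])),
            'url'         => trim(strip_tags($m[3])),
        ];
    }
    return $results;
}

// Offline usage with a stub fetcher that returns hand-written HTML.
$stub = function (string $url): string {
    return '<h3 class="t"><a href="https://www.example.com/a">PHP <em>crawler</em></a></h3>'
         . '<div class="c-abstract">A short description.</div>'
         . '<a class="c-showurl" href="https://www.example.com/a">www.example.com/a</a>';
};
print_r(search_baidu('php crawler', $stub));

// Live usage (network request): $results = search_baidu('php crawler');
```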
In this article, we wrote a simple crawler program using PHP to extract the required data from Baidu search results. This program uses PHP's cURL library to simulate HTTP requests and uses methods such as regular expressions to parse HTML pages. Through this example, we can gain an in-depth understanding of how crawlers work and how to write crawlers using PHP. In actual projects, we can modify this program according to our needs to obtain the data we need.