PHP crawler practice: extract required data from Baidu search results

PHPz Original: 2023-06-13 10:22:40

With the rapid growth of the Internet, we live in an era of information explosion. Search engines have become our main tool for finding information, and they return enormous amounts of data. For researchers or data analysts in a specific field, however, only a small part of those search results may be relevant. In such cases, a crawler lets us extract exactly the data we want.

In this article, we will use PHP to write a simple crawler program that extracts the data we need from Baidu search results. The core of the program is to use PHP's cURL library to send HTTP requests that simulate a browser, and then to parse the returned HTML page with regular expressions and other methods.

Ideas

Before we start writing the crawler program, we need to clarify a few questions:

Goal: What data do we want to crawl from the Baidu search results page?
URL: Which URL do we need to request to get the data?
Data format: What is the format of the data on the Baidu search results page?

To decide what data we need, let's take the keyword "PHP crawler" as an example. If we search for this keyword on Baidu, the results page shows the following information:

Total number of search results
Title of each search result
Description of each search result
The URL of each search result

We can therefore define our goal as extracting the title, description and URL of each result from the Baidu search results.

The first step in obtaining the data is to determine the URL to request. In our example it is https://www.baidu.com/s?wd=php crawler. Typing "php crawler" into the Baidu search bar takes us to this URL; when requesting it from code, the space in the keyword must be URL-encoded.
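As a minimal sketch of building that URL in code, the snippet below uses PHP's built-in http_build_query to encode the keyword. The function name buildSearchUrl is a hypothetical helper, not part of the original program.

```php
<?php
// Hypothetical helper: builds the Baidu search URL for a keyword.
function buildSearchUrl(string $keyword): string
{
    // http_build_query URL-encodes the value; by default a space becomes "+".
    return 'https://www.baidu.com/s?' . http_build_query(['wd' => $keyword]);
}

echo buildSearchUrl('php crawler'), PHP_EOL;
// prints https://www.baidu.com/s?wd=php+crawler
```

Encoding matters even more for non-ASCII keywords, which http_build_query percent-encodes byte by byte.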

Next, we need to understand the format of the data we are going to parse. In our case, each search result is embedded in the HTML of the page, and its visible text looks similar to the following:

   www.example.com  PHP 爬虫是什么? - PHP 入门教程 - 极客学院 
  2天前 -  PHP 爬虫是一种方便快捷的数据采集方式 ... 目前的爬虫主要是通过python 爬虫实现。相比于 PHP，PHP 一般用作...

In the above snippet, each search result is nested within its own container tag. Each search result has a title, and the link address is nested within the anchor (a) tag inside that title. Each search result has a description, which sits in its own element. Each search result also has a display URL, contained within an anchor tag that carries class="c-showurl".
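To make that structure concrete, here is a sketch that walks a hypothetical fragment with PHP's DOMDocument and DOMXPath instead of regular expressions. Only the c-showurl class comes from the description above; the result and c-abstract class names, the h3 title element and the sample values are assumptions for illustration, and Baidu's real markup may differ.

```php
<?php
// Hypothetical markup; only "c-showurl" is taken from the article.
$html = <<<HTML
<div class="result">
  <h3><a href="https://www.example.com/a">First title</a></h3>
  <div class="c-abstract">First description</div>
  <a class="c-showurl" href="https://www.example.com/a">www.example.com</a>
</div>
<div class="result">
  <h3><a href="https://www.example.org/b">Second title</a></h3>
  <div class="c-abstract">Second description</div>
  <a class="c-showurl" href="https://www.example.org/b">www.example.org</a>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = [];

// Visit each result container and read the three parts relative to it.
foreach ($xpath->query('//div[contains(@class, "result")]') as $node) {
    $titleLink = $xpath->query('.//h3/a', $node)->item(0);
    $abstract  = $xpath->query('.//div[contains(@class, "c-abstract")]', $node)->item(0);
    $showUrl   = $xpath->query('.//a[contains(@class, "c-showurl")]', $node)->item(0);

    $results[] = [
        'title'       => $titleLink ? trim($titleLink->textContent) : '',
        'link'        => $titleLink ? $titleLink->getAttribute('href') : '',
        'description' => $abstract ? trim($abstract->textContent) : '',
        'showurl'     => $showUrl ? trim($showUrl->textContent) : '',
    ];
}

print_r($results);
```

A DOM parser tolerates attribute reordering and extra whitespace better than a regular expression, at the cost of requiring the dom extension.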

Now that we have clarified the format of the data we want to obtain and the format of the HTML data we need to parse, we can start writing our crawler program.

Writing code

We divide our PHP crawler code into three steps:

Get the HTML page of Baidu search results
Parse the HTML page
Return the parsed data in the form of an array

Get the HTML page of Baidu search results

We can use PHP's cURL library to send the HTTP request and obtain the HTML page of the Baidu search results. In this example, we store the URL of the search page in the $url variable. We then create a cURL handle and set a number of options on it: the URL, the request headers (to simulate a request sent by a browser), an optional proxy, a timeout, and GET as the request method. Finally, we execute the handle to obtain the HTML page.
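The original listing for this step is not reproduced here, so the following is only a sketch of such a request under stated assumptions. The function name fetchHtml, the header values, the timeout and the commented-out proxy address are all placeholders, not the article's exact settings.

```php
<?php
// Hypothetical helper: fetches a page with cURL and returns the HTML,
// or null when the request fails.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // overall timeout in seconds
        CURLOPT_HTTPGET        => true,            // explicit GET request
        CURLOPT_HTTPHEADER     => [
            // Browser-like headers; the values are placeholders.
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8',
        ],
        // CURLOPT_PROXY => 'http://127.0.0.1:8080', // optional, hypothetical proxy
    ]);

    $result = curl_exec($ch);   // false on failure because RETURNTRANSFER is set
    curl_close($ch);            // has no effect since PHP 8.0; kept to mirror the article

    return $result === false ? null : $result;
}
```

Calling fetchHtml($url) with the search URL would return the page as a string. Returning null on failure lets the caller distinguish a failed request from an empty page.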

Parse HTML page

After obtaining the HTML page of Baidu search results, we need to parse it to obtain the information we need. In this example, we will use PHP's regular expressions to parse an HTML page.

The following is the regular expression we use to extract the title, description and link from the HTML page:

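    // Group 1 is the link, group 2 the title, group 3 the description
    preg_match_all(
        '/...\s*(.*?)\s*...(.*?).../',
        $result,
        $matches
    );

    // Extract the title, description and link of each search result
    $data = [];
    for ($i = 0; $i < count($matches[0]); $i++) {
        $data[] = [
            'title' => strip_tags($matches[2][$i]),       // remove HTML tags from the title
            'description' => strip_tags($matches[3][$i]), // remove HTML tags from the description
            'link' => $matches[1][$i]
        ];
    }

    // Close the cURL handle
    curl_close($ch);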

In the above code, we use a PHP regular expression to match all search results. We then loop over the matches and extract the titles, descriptions and links we need. Since the title and description taken from the HTML still contain HTML tags, we use the strip_tags function to remove them.
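The matching and the loop can be tried out on a small sample. The sketch below is not the article's exact pattern: both the markup and the regular expression are assumptions chosen to mirror the structure described earlier, with group 1 as the link, group 2 as the title and group 3 as the description.

```php
<?php
// Hypothetical markup that mimics the layout described in the article.
$result = <<<HTML
<div class="result">
  <h3 class="t"><a href="https://www.example.com/php-crawler"> What is a <em>PHP</em> crawler? </a></h3>
  <div class="c-abstract">A <em>PHP</em> crawler collects data from web pages.</div>
  <a class="c-showurl" href="https://www.example.com/php-crawler">www.example.com</a>
</div>
HTML;

// Assumed pattern: group 1 = link, group 2 = title, group 3 = description.
// The "s" modifier lets "." match the newlines between tags.
$pattern = '/<h3[^>]*>\s*<a[^>]*href="(.*?)"[^>]*>\s*(.*?)\s*<\/a>.*?'
         . '<div class="c-abstract[^"]*"[^>]*>(.*?)<\/div>/s';

preg_match_all($pattern, $result, $matches);

$data = [];
foreach (array_keys($matches[0]) as $i) {
    $data[] = [
        'title'       => trim(strip_tags($matches[2][$i])), // drop tags such as <em>
        'description' => trim(strip_tags($matches[3][$i])),
        'link'        => $matches[1][$i],
    ];
}

print_r($data);
```

Patterns like this are brittle: a change in attribute order or nesting on the real page would make the match fail silently and return no results.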

Return the results

The code above has already obtained the data we need, so all that remains is to return it. We encapsulate the entire crawler program in a function that returns the obtained data in the form of an array.
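A sketch of such a wrapper function follows. The names searchBaidu, parseSearchResults and RESULT_PATTERN are hypothetical, and the pattern assumes the illustrative markup (an h3 title link followed by a c-abstract div) rather than Baidu's actual page.

```php
<?php
// Hypothetical pattern: group 1 = link, group 2 = title, group 3 = description.
const RESULT_PATTERN = '/<h3[^>]*>\s*<a[^>]*href="(.*?)"[^>]*>\s*(.*?)\s*<\/a>.*?'
    . '<div class="c-abstract[^"]*"[^>]*>(.*?)<\/div>/s';

// Parses a results page and returns one array entry per search result.
function parseSearchResults(string $html): array
{
    $data = [];
    if (preg_match_all(RESULT_PATTERN, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $data[] = [
                'title'       => trim(strip_tags($m[2])),
                'description' => trim(strip_tags($m[3])),
                'link'        => $m[1],
            ];
        }
    }
    return $data;
}

// Fetches the results page for a keyword and returns the parsed data.
// Returns an empty array when the request fails.
function searchBaidu(string $keyword): array
{
    $url = 'https://www.baidu.com/s?' . http_build_query(['wd' => $keyword]);

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_HTTPHEADER     => ['User-Agent: Mozilla/5.0'], // placeholder header
    ]);
    $result = curl_exec($ch);
    curl_close($ch);

    return $result === false ? [] : parseSearchResults($result);
}
```

A call such as print_r(searchBaidu('PHP crawler')) would then print the array of results. Keeping the parsing in its own function makes it possible to test the pattern against saved HTML without sending a request.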
