
PHP crawler practice: extract required data from Baidu search results

PHPz (original)
2023-06-13 10:22:40

The Internet produces far more information than anyone can read, and search engines have become our main tool for finding it. For researchers and data analysts in specific fields, however, only a small part of what a search returns is relevant. In such cases, a crawler can collect exactly the data we want.

In this article, we will use PHP to write a simple crawler program to extract the data we need from Baidu search results. The core of this program is to use PHP's cURL library to simulate HTTP requests, and then use regular expressions and other methods to parse the HTML page.

Ideas

Before we start writing the crawler program, we need to clarify a few questions:

  1. Goal: What data do we want to crawl from the Baidu search results page?
  2. URL: Which URL do we need to get the data?
  3. Data format: What is the format of the data on Baidu search results page?

When thinking about what data we need to obtain, let’s take the keyword “PHP crawler” as an example. If we search this keyword on Baidu, we can see the following information:

  • Total number of search results
  • Title of each search result
  • Description of each search result
  • The URL of each search result

We can therefore define our goal as extracting the title, description and URL of each result from the Baidu search results.

The first step is to pin down the URL to request. In our example it is https://www.baidu.com/s?wd=php crawler, which is where Baidu takes us after we type "php crawler" into the search bar. When the request is sent from code, the space in the keyword must be URL-encoded (for example as wd=php+crawler).
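A minimal sketch of building that URL in PHP, assuming the keyword arrives as a plain string. The helper name build_search_url is hypothetical; the base URL and the wd parameter are taken from the example above.

```php
<?php
// Hypothetical helper: builds the Baidu search URL for a keyword.
// The base URL and the "wd" parameter come from the article's example.
function build_search_url(string $keyword): string
{
    // http_build_query() URL-encodes the value, so spaces and
    // non-ASCII characters are safe to send.
    return 'https://www.baidu.com/s?' . http_build_query(['wd' => $keyword]);
}

echo build_search_url('php crawler'), PHP_EOL;
// prints: https://www.baidu.com/s?wd=php+crawler
```

Encoding the keyword in one place keeps the rest of the crawler free of escaping concerns, including for Chinese keywords.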

Next, we need to understand the format of the data we are going to parse. The search results are delivered as HTML; a single result renders roughly as follows:

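www.example.com What is a PHP crawler? - PHP Beginner Tutorial - 极客学院

2 days ago - A PHP crawler is a quick and convenient way to collect data ... At present, crawlers are mainly implemented in Python. Compared with PHP, PHP is generally used as...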

In the HTML markup behind such a result, each search result is nested within its own container tag. Each result has a title element, and the link address is nested within the <a> tag inside it. Each result has a description in a separate element. Each result also has a display URL, held in an <a> tag that carries class="c-showurl".
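To make that structure concrete, here is a small sketch that walks one result with PHP's DOM extension instead of regular expressions. The markup is an assumed reconstruction: only class="c-showurl" is named in the text, while the c-container, t and c-abstract names and all sample values are illustrative and should be checked against the live page.

```php
<?php
// Assumed markup for a single result; the real Baidu page may differ.
$html = <<<HTML
<div class="result c-container">
  <h3 class="t"><a href="https://www.example.com/php-crawler">What is a <em>PHP crawler</em>?</a></h3>
  <div class="c-abstract">A PHP crawler is a convenient way to collect data.</div>
  <a class="c-showurl" href="https://www.example.com/">www.example.com</a>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Locate the container first, then query relative to it.
$container = $xpath->query('//div[contains(@class, "c-container")]')->item(0);
$titleLink = $xpath->query('.//h3/a', $container)->item(0);
$summary   = $xpath->query('.//div[contains(@class, "c-abstract")]', $container)->item(0);
$showUrl   = $xpath->query('.//a[contains(@class, "c-showurl")]', $container)->item(0);

$record = [
    'title'       => trim($titleLink->textContent), // textContent drops nested tags
    'link'        => $titleLink->getAttribute('href'),
    'description' => trim($summary->textContent),
    'display_url' => trim($showUrl->textContent),
];

print_r($record);
```

The rest of the article uses regular expressions; this DOM variant is only meant to show how the four pieces of a result relate to each other.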

Now that we have clarified the format of the data we want to obtain and the format of the HTML data we need to parse, we can start writing our crawler program.

Writing code

We divide our PHP crawler code into three steps:

  1. Get the HTML page of Baidu search results
  2. Parse the HTML page
  3. Return the parsed data in the form of an array

Get the HTML page of Baidu search results

We can use PHP's cURL library to send the HTTP request and obtain the HTML page of Baidu search results. In this example, we store the URL of the search page in the $url variable. We then create a cURL handle, set its options (URL, request headers, proxy, timeout, and GET as the request method), and finally execute the handle to obtain the HTML page.


      

In this example, we use many of the options provided by the cURL library. For example, set the request header to simulate the HTTP request sent by the browser, set the request method to GET, set the timeout, etc.
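The request step described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the article's original listing: the helper names build_curl_options and fetch_html are hypothetical, and the header values, the 10-second timeout and the proxy address are placeholders.

```php
<?php
// Hypothetical helper: collects the cURL options described in the text.
// Header values, timeout and the optional proxy are illustrative assumptions.
function build_curl_options(string $url, ?string $proxy = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPGET        => true, // request method: GET
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
        CURLOPT_HTTPHEADER     => [
            'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36',
            'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8',
        ],
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy; // e.g. "127.0.0.1:8080"
    }
    return $options;
}

// Fetches a page and returns its HTML, or null if the request failed.
function fetch_html(string $url, ?string $proxy = null): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $proxy));
    $result = curl_exec($ch);
    curl_close($ch); // has no effect since PHP 8.0; kept for older versions
    return $result === false ? null : $result;
}
```

Calling fetch_html('https://www.baidu.com/s?wd=php+crawler') would then return the page source as a string, or null when the request fails or times out.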

Parse HTML page

After obtaining the HTML page of Baidu search results, we need to parse it to obtain the information we need. In this example, we will use PHP's regular expressions to parse an HTML page.

The following is the regular expression we use to extract the title, description and link from the HTML page:

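    // NOTE: the tag and class names in this pattern (h3, c-abstract) are
    // assumptions; check them against the live page before relying on them.
    // $result holds the HTML returned by curl_exec($ch).
    preg_match_all(
        '/<h3[^>]*>\s*<a[^>]*href="(.*?)"[^>]*>\s*(.*?)\s*<\/a>.*?<div[^>]*class="[^"]*c-abstract[^"]*"[^>]*>(.*?)<\/div>/s',
        $result,
        $matches
    );

    // Extract the title, description and link of each search result
    $data = [];
    for ($i = 0; $i < count($matches[0]); $i++) {
        $data[] = [
            'title'       => strip_tags($matches[2][$i]), // strip HTML tags from the title
            'description' => strip_tags($matches[3][$i]), // strip HTML tags from the description
            'link'        => $matches[1][$i],
        ];
    }

    // Close the cURL handle
    curl_close($ch);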

In the above code, we use a PHP regular expression to match all search results. We then loop over the matches and extract the titles, descriptions and links we need. Since the title and description we get from the page still contain HTML tags, we use the strip_tags function to remove them.
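The matching step can be tried offline on a small, made-up HTML string. In this sketch the helper name parse_results, the sample markup and the class names in the pattern are all assumptions, so the pattern will need adjusting to whatever the real page contains.

```php
<?php
// Hypothetical helper: extracts link, title and description from result HTML.
// The tag and class names in the pattern are assumptions about the markup.
function parse_results(string $html): array
{
    $pattern = '/<h3[^>]*>\s*<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>\s*<\/h3>'
             . '.*?<div[^>]*class="[^"]*c-abstract[^"]*"[^>]*>(.*?)<\/div>/s';

    // PREG_SET_ORDER groups the captures per match rather than per group.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $data = [];
    foreach ($matches as $m) {
        $data[] = [
            'title'       => trim(strip_tags($m[2])),
            'description' => trim(strip_tags($m[3])),
            'link'        => $m[1],
        ];
    }
    return $data;
}

// Made-up sample containing two results.
$sample = '<div class="result c-container">'
        . '<h3 class="t"><a href="https://a.example/1">First <em>PHP</em> result</a></h3>'
        . '<div class="c-abstract">About <em>crawlers</em>.</div></div>'
        . '<div class="result c-container">'
        . '<h3 class="t"><a href="https://b.example/2">Second result</a></h3>'
        . '<div class="c-abstract">More text.</div></div>';

print_r(parse_results($sample));
```

Regular expressions are brittle against markup changes, so a pattern like this is best treated as something to re-check whenever the page layout changes.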

Return the results

In the above code, we have obtained the data we need, and now we only need to return the results in the form of an array. We encapsulate our entire crawler program into a function, and return the obtained data in the form of an array:

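    // NOTE: only the tail of the original listing survived. The function name,
    // the cURL setup and the tag/class names in the pattern are reconstructed
    // assumptions.
    function getBaiduResults($keyword)
    {
        $url = 'https://www.baidu.com/s?wd=' . urlencode($keyword);

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $result = (string) curl_exec($ch); // empty string if the request failed

        preg_match_all(
            '/<h3[^>]*>\s*<a[^>]*href="(.*?)"[^>]*>\s*(.*?)\s*<\/a>.*?<div[^>]*class="[^"]*c-abstract[^"]*"[^>]*>(.*?)<\/div>/s',
            $result,
            $matches
        );

        $data = [];
        for ($i = 0; $i < count($matches[0]); $i++) {
            $data[] = [
                'title'       => strip_tags($matches[2][$i]),
                'description' => strip_tags($matches[3][$i]),
                'link'        => $matches[1][$i],
            ];
        }

        curl_close($ch);

        return $data;
    }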

The function receives a keyword as its parameter; calling it returns the titles, descriptions and links of the Baidu search results for that keyword.
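Once the function returns, the array can be consumed like any other PHP array. The sketch below uses made-up sample data in the same shape (title, description, link) so that it runs without a network request; the helper name format_results is hypothetical.

```php
<?php
// Sample data in the shape the crawler function returns; the values are made up.
$results = [
    [
        'title'       => 'What is a PHP crawler?',
        'description' => 'A short introduction.',
        'link'        => 'https://www.example.com/a',
    ],
    [
        'title'       => 'Writing a crawler with cURL',
        'description' => 'Sending requests from PHP.',
        'link'        => 'https://www.example.com/b',
    ],
];

// Hypothetical helper: renders each result as "n. title - link".
function format_results(array $results): string
{
    $lines = [];
    foreach ($results as $i => $row) {
        $lines[] = sprintf('%d. %s - %s', $i + 1, $row['title'], $row['link']);
    }
    return implode(PHP_EOL, $lines);
}

echo format_results($results), PHP_EOL;
```

In a real run, $results would be the return value of the crawler function for a keyword such as "php crawler".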

Conclusion

In this article, we wrote a simple crawler program in PHP to extract the required data from Baidu search results. The program uses PHP's cURL library to simulate HTTP requests and regular expressions to parse the HTML page. The example shows the basic workflow of a crawler: request a page, parse it, and return structured data. In actual projects, we can modify this program to obtain whatever data we need.
