PHP cURL: a custom get function for crawling web pages

Suppose we request a web page with the GET method. Once we have the content of the page, we can match the parts we need.

We can wrap cURL in a function, say get. Given a URL, it requests the specified web page and returns that page's HTML. The code is as follows:

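function get($url) {
    // Initialize cURL
    $ch = curl_init();
    // The URL to request, passed in as the parameter
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the fetched data instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Do not include header information in the output
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Time out if the request takes longer than 10 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Execute the request
    $output = curl_exec($ch);
    // Release the handle
    curl_close($ch);
    // Return the content
    return $output;
}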
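Note that curl_exec returns false when the request fails, so get can hand back false instead of HTML. A minimal sketch of a variant that makes the failure explicit is shown here; the name getOrNull, the $timeoutSeconds parameter, and the choice to return null on error are assumptions made for illustration, not part of the original tutorial.

```php
<?php
// Hypothetical variant of get(): returns the page body as a string,
// or null when the request fails (name and null convention are assumptions).
function getOrNull(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Do not include response headers in the output
    curl_setopt($ch, CURLOPT_HEADER, false);
    // Give up if the whole request takes longer than $timeoutSeconds seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $output = curl_exec($ch);
    if ($output === false) {
        // curl_error() describes what went wrong (bad scheme, DNS failure, timeout, ...)
        error_log('cURL request failed: ' . curl_error($ch));
        return null;
    }

    // The handle is freed automatically when $ch goes out of scope.
    return $output;
}
```

A caller can then write `$content = getOrNull($url); if ($content === null) { /* skip this page */ }` before running any regular expression on the result.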

We now use the get function we wrote to request a news list page and grab each title and URL.

First, pass a URL to the get function to fetch the HTML of the corresponding web page.

The URL is the news list page of the New Media Observation Network: http://www.xmtnews.com/events.

The area to collect is the one marked in red:

1. Get the HTML of the red area

This area starts at the following HTML code:

It ends at the following code:

Use preg_match with a regular expression to match the HTML of the red area, and assign the matched HTML to the variable $area.

The matching regular expression is as follows:

(.*?)<\/div>/mis'
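As a sketch of this step, the snippet below runs preg_match on a small hand-written HTML string. The markup, the class name news-list, and the helper name extractArea are all assumptions made for illustration; the real page's markup will differ.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first
// <div class="news-list"> ... </div>, or null when nothing matches.
function extractArea(string $html): ?string
{
    // s: let "." match newlines; i: case-insensitive.
    // (.*?) is lazy, so the match stops at the first closing </div>.
    $pattern = '/<div class="news-list">(.*?)<\/div>/is';
    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];
    }
    return null;
}

// Made-up sample page used only for this demonstration.
$sample = <<<'HTML'
<body>
<div class="news-list">
<h3><a href="/a/1">First</a></h3>
</div>
<p>footer</p>
</body>
HTML;

$area = extractArea($sample);
var_dump($area);
```

Because the capture group is lazy, it ends at the first closing div tag it meets. If the target area contains nested div elements, the end marker in the pattern needs to be something more specific than a bare closing div.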

2. Match the title and title URL in the red area

We found that every title is a link inside an h3 tag (the pattern below closes on </a></h3>). We use preg_match_all with a regular expression to match them.

preg_match_all('/(.*?)<\/a><\/h3>/mis', $area, $find);

The matched URLs and titles are placed in $find. Print the $find array to see the matching results.

If necessary, you can also loop over the results to read and display the title and URL of each row.

The complete code is as follows:
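The loop can be sketched as follows. The sample HTML standing in for $area, the full pattern with two capture groups, and the $rows array are assumptions for illustration. The sketch also passes PREG_SET_ORDER, which is a swap from the default ordering used in the tutorial's call, so that each element of $find holds one complete match.

```php
<?php
// Made-up HTML standing in for $area; the real markup is an assumption.
$area = <<<'HTML'
<h3><a href="/news/1">First headline</a></h3>
<h3><a href="/news/2">Second headline</a></h3>
HTML;

// Group 1 captures the href value, group 2 captures the link text.
// PREG_SET_ORDER makes each element of $find one complete match.
preg_match_all('/<h3><a href="(.*?)">(.*?)<\/a><\/h3>/is', $area, $find, PREG_SET_ORDER);

// Collect one url/title pair per matched row.
$rows = [];
foreach ($find as $item) {
    $rows[] = [
        'url'   => $item[1],
        'title' => trim(strip_tags($item[2])),
    ];
}

// Display the title and URL of each row.
foreach ($rows as $row) {
    echo $row['title'], ' => ', $row['url'], PHP_EOL;
}
```

With the default ordering, $find[1] would instead hold all URLs and $find[2] all titles, and the loop would index both arrays by the same position.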

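(.*?)<\/div>/mis', $content, $match);
// Assign the content matched by the regular expression to $area
$area = $match[1];
preg_match_all('/(.*?)<\/a><\/h3>/', $area, $find);
var_dump($find);

function get($url) {
    // Initialize cURL
    $ch = curl_init();
    // The URL to request, passed in as the parameter
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the fetched data instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Do not include header information in the output
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Time out if the request takes longer than 10 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Execute the request
    $output = curl_exec($ch);
    // Release the handle
    curl_close($ch);
    // Return the content
    return $output;
}
?>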

php.cn: public welfare online PHP training, helping PHP learners grow quickly.
