PHP cURL: a custom get function for crawling web pages

Suppose we request a web page with an HTTP GET request. Once we have the page content, we can use regular expressions to match the parts we need.

We can wrap cURL in a function, here named get. Pass in a URL, and the function requests that page and returns its HTML. The code is as follows:

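function get($url) {
    // Initialize a cURL handle
    $ch = curl_init();
    // URL to request, passed in as the parameter
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Do not include the response headers in the output
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Abort if the whole request takes longer than 10 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Execute the request
    $output = curl_exec($ch);
    // Release the handle
    curl_close($ch);
    // Return the content (false if the request failed)
    return $output;
}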
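Note that curl_exec returns false when the request fails, so get can return false instead of HTML. As a minimal sketch, we could write a variant that checks for this. The name get_checked and the choice to return null on failure are assumptions for illustration and are not part of the original function:

```php
<?php
// Hypothetical variant of get(): returns the page body, or null on failure.
function get_checked(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $output = curl_exec($ch);
    if ($output === false) {
        // curl_error() describes the last error on this handle
        error_log('cURL error: ' . curl_error($ch));
        curl_close($ch);
        return null;
    }

    curl_close($ch);
    return $output;
}
```

The caller can then test the result with `=== null` before running any regular expression on it.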

We now use the get function we wrote to request a news list page and grab each title and its URL.

First, we pass a URL to get and receive the HTML of the page at that URL.

The URL is the news list page of the New Media Observation Network: http://www.xmtnews.com/events.

We want to collect the list area marked in red on the page:

1. Get the HTML of the red area

This area starts at the following HTML code:

<section class="ov">

Ends at the following code:

<div class="hr-10"></div>

Use preg_match with a regular expression to extract the HTML of the red area, then assign the matched HTML to the variable $area.

The regular expression is as follows:

'/<section class="ov">(.*?)<div class="hr-10"><\/div>/mis'
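To see how this expression behaves, here is a small sketch that runs it against a hand-written sample instead of the real page. The sample HTML in $content is an assumption made up for illustration; only the two boundary tags come from the text above:

```php
<?php
// Hypothetical sample standing in for the downloaded page.
$content = <<<PAGE
<body>
<section class="ov">
<h3><a href="/news/1" title="First" class="headers" target="_blank">First</a></h3>
</section>
<div class="hr-10"></div>
</body>
PAGE;

$area = '';
// The s flag lets . match newlines, so the area can span several lines.
if (preg_match('/<section class="ov">(.*?)<div class="hr-10"><\/div>/mis', $content, $match)) {
    // $match[0] is the whole match; $match[1] is the part between the two boundaries.
    $area = $match[1];
}

echo $area;
```

Checking the return value of preg_match before reading $match[1] avoids an undefined index when the page layout changes and the boundaries are no longer found.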

2. Match the titles and title URLs in the red area

We found that every title is inside an <h3> tag, so we use preg_match_all with a regular expression that captures both the link and the title text.

preg_match_all('/<h3><a href="(.*?)" title=".*?" class="headers" target="_blank">(.*?)<\/a><\/h3>/mis', $area, $find);

The matches are stored in $find: $find[1] holds the URLs and $find[2] holds the titles. Print the $find array to see the results.

If necessary, you can also loop over the results to read and display the title and URL of each row.

The complete code is as follows:
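Such a loop could be sketched as follows. The two sample headlines in $area and the $rows array are assumptions for illustration; on the real page, $area would come from the preg_match step:

```php
<?php
// Hypothetical sample of the extracted area, with two list entries.
$area = '<h3><a href="/news/1" title="First" class="headers" target="_blank">First headline</a></h3>' . "\n"
      . '<h3><a href="/news/2" title="Second" class="headers" target="_blank">Second headline</a></h3>';

preg_match_all(
    '/<h3><a href="(.*?)" title=".*?" class="headers" target="_blank">(.*?)<\/a><\/h3>/mis',
    $area,
    $find
);

// $find[1] holds the URLs and $find[2] the titles, in the same order.
$rows = [];
foreach ($find[1] as $i => $url) {
    $rows[] = [
        'url'   => $url,
        'title' => trim(strip_tags($find[2][$i])),
    ];
}

foreach ($rows as $row) {
    echo $row['title'] . ' => ' . $row['url'] . PHP_EOL;
}
```

strip_tags and trim are applied to the title only as a precaution, in case a title contains nested markup or surrounding whitespace.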

<?php

$content = get('http://www.xmtnews.com/events');

preg_match('/<section class="ov">(.*?)<div class="hr-10"><\/div>/mis', $content, $match);

// Assign the content captured by the regular expression to $area
$area = $match[1];

preg_match_all('/<h3><a href="(.*?)" title=".*?" class="headers" target="_blank">(.*?)<\/a><\/h3>/mis', $area, $find);

var_dump($find);

function get($url) {

   // Initialize a cURL handle
   $ch = curl_init();

   // URL to request, passed in as the parameter
   curl_setopt($ch, CURLOPT_URL, $url);

   // Return the response as a string instead of printing it
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

   // Do not include the response headers in the output
   curl_setopt($ch, CURLOPT_HEADER, 0);

   // Abort if the whole request takes longer than 10 seconds
   curl_setopt($ch, CURLOPT_TIMEOUT, 10);

   // Execute the request
   $output = curl_exec($ch);

   // Release the handle
   curl_close($ch);

   // Return the content (false if the request failed)
   return $output;
}
?>
