This is a series that I can't finish in a day or two, so I will publish it one part at a time.
General outline:
1. curl data collection series: single-page collection function get_html
2. curl data collection series: multi-page parallel collection function get_htmls
3. curl data collection series: regular expression processing function get_matches
4. curl data collection series: code separation
5. curl data collection series: parallel logic control function web_spider
Single-page collection is the most commonly used function in data collection. When a server restricts access, it is sometimes the only method that works. It is slow but easy to control, so a reusable curl function for it is very important.
We are all familiar with Baidu and NetEase, so we will use the homepages of these two websites as examples.
The simplest way to write it:
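$url = 'http://www.baidu.com';
$ch = curl_init($url);
// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up after 5 seconds.
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}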
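The snippet above prints nothing at all when the request fails. As a minimal sketch of how the reason for a failure could be surfaced, the following uses curl_errno and curl_error; the function name fetch_with_error and the shape of its return array are assumptions made for illustration, not part of this series:

```php
<?php
// Hypothetical helper (not from the original series): fetch a URL and
// report why the request failed, if it did.
function fetch_with_error($url, $timeout = 5) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $html = curl_exec($ch);        // string on success, false on failure
    // Read the error details before the handle is closed.
    $errno = curl_errno($ch);      // 0 means no error
    $error = curl_error($ch);      // human-readable message, '' if none
    curl_close($ch);
    return array('html' => $html, 'errno' => $errno, 'error' => $error);
}
```

A caller could then check `$result['html'] === false` and log `$result['error']` instead of silently getting an empty page.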
Because this is used so often, you can wrap it in a function using curl_setopt_array:
function get_html($url, $options = array()) {
    // These two options are always set, overriding any caller-supplied values.
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return false;
    }
    return $html;
}
Usage:
$url = 'http://www.baidu.com';
echo get_html($url);
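One thing to note about get_html as written: it assigns CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT after receiving $options, so a caller cannot change the timeout. Below is a minimal sketch of one alternative, assuming you want caller-supplied values to take precedence. The names merge_curl_options and get_html_with_defaults are hypothetical and not part of this series:

```php
<?php
// Hypothetical helper: combine caller options with defaults.
// The array union operator (+) keeps the left-hand value when a key
// exists on both sides, and it preserves the integer CURLOPT_* keys
// (array_merge would renumber them).
function merge_curl_options(array $options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    );
    return $options + $defaults;
}

// Hypothetical variant of get_html that lets the caller override defaults.
function get_html_with_defaults($url, array $options = array()) {
    $ch = curl_init($url);
    curl_setopt_array($ch, merge_curl_options($options));
    $html = curl_exec($ch);   // string on success, false on failure
    curl_close($ch);
    return $html;
}
```

With this variant, `get_html_with_defaults($url, array(CURLOPT_TIMEOUT => 10))` would use a 10-second timeout, while a call without options would still fall back to 5 seconds.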
Sometimes you need to pass some specific parameters to get the correct page. For example, now you want to get the NetEase page:
$url = 'http://www.163.com';
echo get_html($url);
You will see a blank page with nothing on it. To find out why, write a function that uses curl_getinfo:
function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    // The body is discarded; the request is made only to populate the transfer info.
    curl_exec($ch);
    // Must be read before the handle is closed.
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}
$url = 'http://www.163.com';
var_dump(get_info($url));
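The var_dump output is long, and only a few fields matter for this diagnosis. As a rough sketch, the helpers below pick out the status code, the requested URL, and the redirect target from the array that curl_getinfo returns. The function names is_redirect and summarize_info are hypothetical, and treating only 301, 302, 303, 307 and 308 as redirects is an assumption of this sketch:

```php
<?php
// Hypothetical helper: decide from a curl_getinfo() array whether the
// response was an HTTP redirect.
function is_redirect(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    return in_array($code, array(301, 302, 303, 307, 308), true);
}

// Hypothetical helper: one-line summary of the fields that matter here.
function summarize_info(array $info) {
    $code   = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    $url    = isset($info['url']) ? $info['url'] : '';
    $target = isset($info['redirect_url']) ? $info['redirect_url'] : '';
    $line = $code . ' ' . $url;
    // Only show the target when the response really was a redirect.
    if (is_redirect($info) && $target !== '') {
        $line .= ' -> ' . $target;
    }
    return $line;
}
```

Calling `echo summarize_info(get_info($url));` would then print the status code and, for a redirect, where the server wants the client to go.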
You can see that http_code is 302, meaning the request was redirected. In this case you need to pass an extra option:
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url,$options);
You will notice that this page is different from the one you see in a browser on your computer. Why?
It seems the options are still not enough for the server to tell what kind of device the client is, so it returns a generic version.
It seems a user agent has to be sent as well, via CURLOPT_USERAGENT:
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url,$options);
OK, now the page comes out. The get_html function can handle this kind of extension simply through its options parameter.
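If the same options are needed on many requests, they can be bundled once. The sketch below is only an illustration: the name browser_options is hypothetical, and the CURLOPT_MAXREDIRS cap of 5 is an assumption added here so that a redirect loop cannot run indefinitely; it is not something the examples above set.

```php
<?php
// Hypothetical helper: options that follow redirects (with a cap on the
// number of hops) and identify the client with a browser user agent.
function browser_options($user_agent = null, $max_redirects = 5) {
    if ($user_agent === null) {
        // Same user agent string as in the NetEase example.
        $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
    }
    return array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => (int) $max_redirects,
        CURLOPT_USERAGENT      => $user_agent,
    );
}
```

The NetEase call would then become `echo get_html('http://www.163.com', browser_options());`.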
Of course, there are other ways to achieve this. If you already know the exact URL of the NetEase page, you can simply collect it directly:
$url = 'http://www.163.com/index.html';
echo get_html($url);
This also collects the page normally.