curl data collection: using the single-page collection function get_html


Released: 2016-07-21 15:11:30


This is a series that I can't finish in a day or two, so I will publish the parts one at a time.

General outline:

1. curl data collection series: single-page collection function get_html

2. curl data collection series: multi-page parallel collection function get_htmls

3. curl data collection series: regular expression processing function get_matches

4. curl data collection series: code separation

5. curl data collection series: parallel logic control function web_spider

Single-page collection is the most commonly used operation in data collection. When the server restricts access, it is sometimes the only method available. It is slow but easy to control, so it is worth writing a reusable curl function for it.

Baidu and NetEase are familiar to everyone, so the homepages of these two sites serve as the examples.

The simplest version:

$url = 'http://www.baidu.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // give up after 5 seconds
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
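When curl_exec returns false, the snippet above prints nothing and gives no hint why. A minimal sketch of how the failure could be reported, using curl_errno and curl_error; the helper name fetch_with_error and the deliberately invalid URL are assumptions made for illustration, not part of the original series:

```php
<?php
// Hypothetical helper (not from the original article): returns the body
// together with curl's error details so a failure can be diagnosed.
function fetch_with_error($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $html = curl_exec($ch);
    $result = array(
        'html'  => $html,            // string on success, false on failure
        'errno' => curl_errno($ch),  // 0 when no error occurred
        'error' => curl_error($ch),  // empty string when no error occurred
    );
    curl_close($ch);
    return $result;
}

// An unsupported scheme makes the request fail without touching the network.
$result = fetch_with_error('notaprotocol://example.invalid/');
if ($result['html'] === false) {
    echo 'curl error ', $result['errno'], ': ', $result['error'], "\n";
}
```

The error details have to be read before curl_close is called, which is why the helper collects them first.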

Since this is used so often, you can wrap it in a function with curl_setopt_array:

function get_html($url, $options = array()) {
    // These two options are always set, overriding any caller-supplied values.
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch); // page body on success, false on failure
    curl_close($ch);
    return $html;
}

Usage:

$url = 'http://www.baidu.com';
echo get_html($url);
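One detail of get_html is that it always forces the timeout to 5 seconds, even when the caller passes CURLOPT_TIMEOUT. If caller-supplied values should win instead, one possible approach is PHP's array union operator (+), which keeps the left-hand value when a key exists on both sides. array_merge is not suitable here because it renumbers integer keys, and the CURLOPT_* constants are integers. The helper name with_default_options is an assumption made for this sketch:

```php
<?php
// Hypothetical helper: fill in defaults only where the caller gave no value.
function with_default_options(array $options = array())
{
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    );
    // Array union keeps the left operand's value when a key exists in both.
    return $options + $defaults;
}

$merged = with_default_options(array(CURLOPT_TIMEOUT => 10));
// $merged[CURLOPT_TIMEOUT] is 10; CURLOPT_RETURNTRANSFER is filled in as true.
```

Inside get_html, the two assignment lines could then be replaced by a single call such as $options = with_default_options($options);.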

Sometimes you need to pass specific options to get the correct page. For example, suppose you want to fetch the NetEase homepage:

$url = 'http://www.163.com';
echo get_html($url);

You will see a blank page with nothing on it. Use curl_getinfo in a helper function to see what happened:

function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    $info = curl_getinfo($ch); // transfer details: http_code, url, total_time, ...
    curl_close($ch);
    return $info;
}
$url = 'http://www.163.com';
var_dump(get_info($url));
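The array returned by curl_getinfo can also be inspected in code rather than by reading the var_dump output. A minimal sketch, assuming the array carries the http_code and redirect_url keys; the helper name describe_status and the sample array are invented for illustration and are not real output from any site:

```php
<?php
// Hypothetical helper: describe the status recorded in a curl_getinfo() array.
function describe_status(array $info)
{
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    if ($code >= 300 && $code < 400) {
        // redirect_url holds the target when curl did not follow the redirect.
        $target = !empty($info['redirect_url']) ? $info['redirect_url'] : 'unknown';
        return "redirect ($code) to $target";
    }
    if ($code >= 200 && $code < 300) {
        return "success ($code)";
    }
    return "other ($code)";
}

// Hand-built sample array standing in for real curl_getinfo() output.
$sample = array('http_code' => 302, 'redirect_url' => 'http://example.com/new');
echo describe_status($sample), "\n";
```

With the get_info function defined, describe_status(get_info($url)) would give a one-line summary of the response.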

The output shows http_code 302: the request was redirected. In this case you need to pass an additional option:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url,$options); 

You will notice that this page differs from the one a desktop browser shows. Why?

The options are apparently still not enough for the server to determine what kind of device the client is, so it returns a generic version.

It looks like CURLOPT_USERAGENT needs to be set:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url, $options);

OK, now the page comes out. The get_html function can be extended in this way, through its options, to cover most such needs.
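The options used so far can be collected into one reusable array. The sketch below also sets CURLOPT_MAXREDIRS so that a redirect loop cannot run indefinitely. The helper name browser_options and the limit of 5 redirects are assumptions for illustration, not values from the original series:

```php
<?php
// Hypothetical helper: options that make the request look like a desktop browser.
function browser_options($userAgent, $maxRedirects = 5)
{
    return array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many redirects
        CURLOPT_USERAGENT      => $userAgent,
    );
}

$ua = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
$options = browser_options($ua);
// With get_html from the article defined: echo get_html($url, $options);
```

Keeping the options in one place makes it easy to reuse the same settings for every page collected from a site.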

Of course, there are other ways to achieve this. When you know the exact URL of the NetEase page, you can simply collect it directly:

$url = 'http://www.163.com/index.html';
echo get_html($url);

This also collects the page normally.