Using the regex helper function get_matches for curl-based data collection

WBOY

Release： 2016-07-21 15:11:34

This post builds on the previous two blog posts:

Using the single-page collection function get_html for curl-based data collection

Using the parallel page collection function get_htmls for curl-based data collection

We now have the HTML we need. The next step is to process it and extract the data we want to collect.

HTML cannot be parsed as strictly as XML, because HTML documents often contain unpaired tags and are loosely structured, so a helper class is needed. Simple HTML DOM (simplehtmldom) is a parsing class that operates on HTML documents in a jQuery-like way. It makes it very convenient to get the data you want, but unfortunately it is slow, and it is not the focus here. I mainly use regular expressions to match the data I need to collect, which gets the information quickly.
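As a rough illustration of the regex approach, here is a minimal sketch that pulls the links out of an HTML fragment with preg_match_all. The HTML string and the pattern are assumptions made up for this example, not taken from the original post, and a simple pattern like this will not cope with every way a link can be written.

```php
<?php
// Hypothetical HTML fragment standing in for a page fetched with get_html.
$html = '<p><a href="http://www.baidu.com">Baidu</a> and '
      . '<a href="http://www.hao123.com">hao123</a></p>';

// Illustrative pattern (an assumption): capture the href value and the
// link text of each <a> tag. "!" is used as the delimiter so that "/"
// does not need escaping.
$pattern = '!<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>!is';

// PREG_SET_ORDER groups the captures per match: $m[1] is the URL,
// $m[2] is the link text.
$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = array();
if ($count) {
    foreach ($matches as $m) {
        $links[$m[1]] = $m[2];
    }
}

print_r($links);
```

No DOM tree is built here: the pattern is applied directly to the raw string, which is where the speed advantage comes from.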

get_html can check the data it returns, but get_htmls cannot, so I wrote the following two functions to make debugging and calling easier:

The code is as follows:

function get_matches($pattern,$html,$err_msg,$multi=false,$flags=0,$offset=0){
    if(!$multi){
        // Single match: stop at the first occurrence
        if(!preg_match($pattern,$html,$matches,$flags,$offset)){
            echo $err_msg."! Error message: ".get_preg_err_msg()."\n";
            return false;
        }
    }else{
        // Multiple matches: collect every occurrence
        if(!preg_match_all($pattern,$html,$matches,$flags,$offset)){
            echo $err_msg."! Error message: ".get_preg_err_msg()."\n";
            return false;
        }
    }
    return $matches;
}

// Translate the code from preg_last_error() into a readable message
function get_preg_err_msg(){
    $error_code = preg_last_error();
    switch($error_code){
        case PREG_NO_ERROR:
            $err_msg = 'PREG_NO_ERROR';
            break;
        case PREG_INTERNAL_ERROR:
            $err_msg = 'PREG_INTERNAL_ERROR';
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            $err_msg = 'PREG_BACKTRACK_LIMIT_ERROR';
            break;
        case PREG_RECURSION_LIMIT_ERROR:
            $err_msg = 'PREG_RECURSION_LIMIT_ERROR';
            break;
        case PREG_BAD_UTF8_ERROR:
            $err_msg = 'PREG_BAD_UTF8_ERROR';
            break;
        case PREG_BAD_UTF8_OFFSET_ERROR:
            $err_msg = 'PREG_BAD_UTF8_OFFSET_ERROR';
            break;
        default:
            return 'Unknown error !';
    }
    return $err_msg.': '.$error_code;
}


It can be called like this:

$url = 'http://www.baidu.com';
$html = get_html($url);
$matches = get_matches('!!',$html,'No link found',true);
if($matches){
    var_dump($matches);
}

Or call it like this:

$urls = array('http://www.baidu.com','http://www.hao123.com');
$htmls = get_htmls($urls);
foreach($htmls as $html){
    $matches = get_matches('!!',$html,'No link found',true);
    if($matches){
        var_dump($matches);
    }
}

Whether the collection is single-page or multi-page, PHP still ends up processing one page at a time to extract the required information. With get_matches, the return value can be tested as true or false, so the caller knows whether valid data was obtained. Because I ran into the regex backtrack limit being exceeded while using regular expressions, get_preg_err_msg was added to report the regex error.
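To make the backtrack-limit problem concrete, the following sketch deliberately triggers it and reads the error code with preg_last_error. The pattern, the subject string and the very small limit are illustrative assumptions. JIT is switched off only so that pcre.backtrack_limit applies predictably in this demonstration.

```php
<?php
// Illustrative settings (assumptions for this demo, not recommended values).
ini_set('pcre.jit', '0');               // use the interpreter, not the JIT
ini_set('pcre.backtrack_limit', '100'); // deliberately tiny limit

// The nested quantifiers force heavy backtracking, because the subject
// contains neither "!" nor "?" and so can never match.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');
$code   = preg_last_error();

// On an engine error preg_match returns false (not 0), so a plain
// "no match" and a failed match can be told apart.
if ($result === false && $code === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit was exhausted\n";
}
```

Without a check like this, a pattern that hits the limit looks the same as a page that simply contains no match, which is why reporting the error code helps when debugging.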

Data collection usually starts from a list page and then follows the content-page links found there, sometimes through further levels. This leads to many nested loops, and the code becomes hard to control. So can we separate the code that collects list pages from the code that collects content pages (or deeper levels), or even simplify the loops?
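One possible direction, shown here only as a sketch and not as the author's eventual solution, is to give each level its own parsing function and let a small driver connect them. All names here (parse_list_page, parse_content_page, collect_titles, the $pages array) are hypothetical, and the in-memory pages stand in for HTML that would normally come from get_html or get_htmls.

```php
<?php
// Hypothetical in-memory "site" standing in for fetched pages.
$pages = array(
    'list.html' => '<ul><li><a href="a.html">A</a></li><li><a href="b.html">B</a></li></ul>',
    'a.html'    => '<html><head><title>Article A</title></head></html>',
    'b.html'    => '<html><head><title>Article B</title></head></html>',
);

// Level 1: extract the content-page links from a list page.
function parse_list_page($html) {
    if (!preg_match_all('!<a\s[^>]*href="([^"]+)"!i', $html, $m)) {
        return array();
    }
    return $m[1];
}

// Level 2: extract the title from a content page.
function parse_content_page($html) {
    if (!preg_match('!<title>(.*?)</title>!is', $html, $m)) {
        return null;
    }
    return trim($m[1]);
}

// Driver: the only place that knows how the two levels are connected.
// $fetch is any callable that maps a URL to its HTML.
function collect_titles($list_url, $fetch) {
    $titles = array();
    foreach (parse_list_page($fetch($list_url)) as $url) {
        $titles[$url] = parse_content_page($fetch($url));
    }
    return $titles;
}

$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : '';
};

$titles = collect_titles('list.html', $fetch);
print_r($titles);
```

Each parsing function can then be tested against a saved page on its own, and adding a further level means adding one more function rather than one more nested loop.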