
Learning PHP: a cURL crawler example

Author: *文 (original), 2017-12-22 09:59:34

We often need to fetch resources from a website in bulk, and that calls for a crawler. At its core, a crawler uses cURL to simulate HTTP requests and then parses the data that comes back. This article introduces PHP's cURL functions by writing a simple web crawler.

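First, some commonly used functions.

curl_init initializes a cURL session
curl_setopt sets a cURL parameter, i.e. a transfer option
curl_exec executes the request
curl_close closes a cURL session

Those are the four main ones.

curl_errno returns the last error number; PHP defines constants for many of these error codes
curl_error returns a string containing the most recent error of the current session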
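
As a minimal sketch of how these functions fit together, the helper below runs one request and reports a failure through curl_errno and curl_error. The name fetch_url and the 10-second timeout are assumptions made for this illustration; they are not part of PHP or of the original article.

```php
<?php
// Hypothetical helper: fetch a URL and return array(body, error message).
// On success the error is null; on failure the body is null.
function fetch_url($url)
{
    $ch = curl_init();                               // start a cURL session
    curl_setopt($ch, CURLOPT_URL, $url);             // the URL to request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // assumed limit, in seconds
    $body = curl_exec($ch);                          // false on failure
    $error = null;
    if ($body === false) {
        // curl_errno() gives the numeric code, curl_error() the message
        $error = sprintf('cURL error %d: %s', curl_errno($ch), curl_error($ch));
        $body = null;
    }
    curl_close($ch);                                 // end the session
    return array($body, $error);
}
```

A caller would typically check the second element first and only use the body when it is null.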


Let's go straight to the examples; the explanations are in the code comments.


1. Download a web page, replace "Baidu" in its content with "Diaosi", and print the result
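
The listing for this example is not shown here, so the following is only a sketch of what the heading describes, not the author's code. The function names download_page and replace_in_page and the URL in the usage comment are assumptions; on the real site the text to replace may be the Chinese name rather than "Baidu".

```php
<?php
// Hypothetical helper: download a page and return its HTML, or false on failure.
function download_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the page in a variable
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Replace every occurrence of $search in the page content.
function replace_in_page($html, $search, $replace)
{
    return str_replace($search, $replace, $html);
}

// Example usage (performs a real request, so it is left commented out):
// $html = download_page('http://www.baidu.com');
// if ($html !== false) {
//     echo replace_in_page($html, 'Baidu', 'Diaosi');
// }
```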


2. Call a web service to query the current weather in Beijing


3. Simulate a login and fetch a page that requires it

<?php
$data = array('username' => 'promonkey',
    'password' => '1q2w3e',
    'remember' => 1);
// The form fields as a URL-encoded string (this replaces the array above)
$data = 'username=zjzhoufy@126.com&password=1q2w3e&remember=1';
$curlobj = curl_init();            // initialize
curl_setopt($curlobj, CURLOPT_URL, "http://www.imooc.com/user/login");     // URL of the page to request
curl_setopt($curlobj, CURLOPT_RETURNTRANSFER, true);           // return the response instead of printing it
// Cookie settings: these must be in place before the first request
date_default_timezone_set('PRC'); // set the timezone before using cookies
curl_setopt($curlobj, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($curlobj, CURLOPT_COOKIEFILE, ''); // an empty string enables the in-memory cookie engine, so the login cookie is sent again below
curl_setopt($curlobj, CURLOPT_HEADER, 0);
curl_setopt($curlobj, CURLOPT_FOLLOWLOCATION, 1); // let cURL follow redirects
curl_setopt($curlobj, CURLOPT_POST, 1);
curl_setopt($curlobj, CURLOPT_POSTFIELDS, $data);
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array("Content-type: application/x-www-form-urlencoded; charset=utf-8",
    "Content-length: ".strlen($data)
    ));
curl_exec($curlobj);   // execute the login request
curl_setopt($curlobj, CURLOPT_URL, "http://www.imooc.com/space/index");
curl_setopt($curlobj, CURLOPT_HTTPGET, true);  // switch back to GET
curl_setopt($curlobj, CURLOPT_HTTPHEADER, array("Content-type: text/xml"
    ));
$output=curl_exec($curlobj);  // execute
curl_close($curlobj);          // close the cURL session
echo $output;
?>
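
The example above reuses one handle for both requests. When the login and the later requests happen on different handles or in different script runs, the cookies can be kept in a file instead. The sketch below builds the options for such a request; the helper name cookie_request_options and the cookie file path are assumptions made for illustration.

```php
<?php
// Hypothetical helper: build the options for a request that shares cookies
// with other requests through the file $cookieFile.
function cookie_request_options($url, $cookieFile, array $postFields = array())
{
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies saved earlier
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is released
    );
    if ($postFields) {
        $options[CURLOPT_POST] = true;
        // http_build_query() URL-encodes the form fields
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }
    return $options;
}

// Example usage (commented out because it performs real requests):
// $jar = sys_get_temp_dir() . '/cookies.txt';
// $ch = curl_init();
// curl_setopt_array($ch, cookie_request_options('http://www.imooc.com/user/login', $jar,
//     array('username' => 'promonkey', 'password' => '1q2w3e', 'remember' => 1)));
// curl_exec($ch);
// unset($ch); // releasing the handle writes the cookies to $jar
```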


4. Fetch the personal space page of a site after logging in, with a custom implementation of redirect following

<?php
function curl_redir_exec($ch)
{
    static $curl_loops = 0;
    static $curl_max_loops = 20; // assumed redirect limit; the original value is not shown
    if ($curl_loops++ >= $curl_max_loops)
    {
        $curl_loops = 0;
        return FALSE;
    }
    curl_setopt($ch, CURLOPT_HEADER, true); // headers must be enabled to read the redirect target
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    // split the response into headers and body
    $h_len = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $header = substr($data,0,$h_len);
    $data = substr($data,$h_len); // the body starts right after the headers
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($http_code == 301 || $http_code == 302) {
        $matches = array();
        preg_match('/Location:(.*?)\n/i', $header, $matches); // header names are case-insensitive
        $url = @parse_url(trim(array_pop($matches)));
        // print_r($url);
        if (!$url)
        {
            // couldn't process the url to redirect to
            $curl_loops = 0;
            return $data;
        }
        // fill in the parts the Location header leaves out from the last URL
        $last_url = parse_url(curl_getinfo($ch, CURLINFO_EFFECTIVE_URL));
        if (!isset($url['scheme']))
            $url['scheme'] = $last_url['scheme'];
        if (!isset($url['host']))
            $url['host'] = $last_url['host'];
        if (!isset($url['path']))
            $url['path'] = $last_url['path'];
        $new_url = $url['scheme'] . '://' . $url['host'] . $url['path'] . (isset($url['query'])?'?'.$url['query']:'');
        curl_setopt($ch, CURLOPT_URL, $new_url);
        return curl_redir_exec($ch);
    } else {
        $curl_loops=0;
        return $data;
    }
}
?>
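
The URL handling inside the function above can be looked at in isolation. The sketch below pulls it out into a pure function that needs no network access; the name resolve_redirect is an assumption. Like the listing above, it ignores ports and does not resolve relative paths that lack a leading slash.

```php
<?php
// Hypothetical helper: extract the Location header from raw response headers
// and resolve it against the URL that was just requested.
// Returns the absolute URL, or null when there is no usable Location header.
function resolve_redirect($headers, $lastUrl)
{
    // Header names are case-insensitive; /m lets ^ and $ match per line.
    if (!preg_match('/^Location:[ \t]*(.*?)[ \t\r]*$/mi', $headers, $matches)
        || $matches[1] === '') {
        return null;
    }
    $url = parse_url($matches[1]);
    if ($url === false) {
        return null;
    }
    $last = parse_url($lastUrl);
    // Fill in whatever the Location value leaves out from the previous URL.
    $scheme = isset($url['scheme']) ? $url['scheme'] : $last['scheme'];
    $host   = isset($url['host'])   ? $url['host']   : $last['host'];
    $path   = isset($url['path'])   ? $url['path']
            : (isset($last['path']) ? $last['path'] : '/');
    $query  = isset($url['query'])  ? '?' . $url['query'] : '';
    return $scheme . '://' . $host . $path . $query;
}

// Example usage with a made-up response:
// $headers = "HTTP/1.1 302 Found\r\nLocation: /space/index?id=1\r\n\r\n";
// echo resolve_redirect($headers, 'http://www.imooc.com/user/login');
```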


5. Download a file from an FTP server to the local machine
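
The listing for this example is not shown here. As a rough sketch of one way to do it, cURL can write the response straight into a local file through CURLOPT_FILE, and CURLOPT_USERPWD carries the credentials. The helper name download_to_file, the 300-second timeout, and the server and credentials in the usage comment are all made up for illustration.

```php
<?php
// Hypothetical helper: stream a remote file straight into a local file.
// $userpwd is an optional "user:password" string for the server.
// Returns true on success, false on failure.
function download_to_file($url, $localPath, $userpwd = null)
{
    $fp = fopen($localPath, 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);     // write the body to $fp instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 300);  // assumed upper bound, in seconds
    if ($userpwd !== null) {
        curl_setopt($ch, CURLOPT_USERPWD, $userpwd);
    }
    $ok = curl_exec($ch);                    // true on success when writing to a file
    curl_close($ch);
    fclose($fp);
    return $ok === true;
}

// Example usage with a made-up server and credentials:
// download_to_file('ftp://ftp.example.com/pub/readme.txt', 'readme.txt', 'user:secret');
```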


6. Download an HTTPS resource on the network
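
The listing for this example is not shown here either. The sketch below is one possible setup, with certificate verification left on rather than disabled; the helper name https_options and the optional CA bundle parameter are assumptions.

```php
<?php
// Hypothetical helper: options for fetching an HTTPS URL with certificate checks on.
// $caBundle is an optional path to a CA certificate file.
function https_options($url, $caBundle = null)
{
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => true, // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,    // check that the certificate matches the host name
    );
    if ($caBundle !== null) {
        $options[CURLOPT_CAINFO] = $caBundle;
    }
    return $options;
}

// Example usage (commented out because it performs a real request):
// $ch = curl_init();
// curl_setopt_array($ch, https_options('https://www.example.com/'));
// $body = curl_exec($ch);
// curl_close($ch);
```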


Simulating HTTP requests with native PHP

Using cURL just to simulate a simple HTTP request is a bit wasteful. PHP itself can already do this.


Suppose you need to simulate POST, GET and other requests on the server side, that is, from within a PHP program. In other words: given an array, how do you POST or GET it to another address? It is easy with cURL, but what if the cURL library is not available? PHP already has a function for this: stream_context_create(), which is covered next.


The code shows it best:

$data = array(
    'foo'=>'bar', 
    'baz'=>'boom', 
    'site'=>'www.nowamagic.net', 
    'name'=>'nowa magic'); 
$data = http_build_query($data); 
//$postdata = http_build_query($data);
$options = array(
    'http' => array(
        'method' => 'POST',
        'header' => 'Content-type:application/x-www-form-urlencoded',
        'content' => $data
        //'timeout' => 60 * 60 // timeout (in seconds)
    )
);
$url = "http://www.nowamagic.net/test2.php";
$context = stream_context_create($options);
$result = file_get_contents($url, false, $context);
echo $result;

The code of http://www.nowamagic.net/test2.php is:

$data = $_POST;
echo '<pre>';
print_r( $data );
echo '</pre>';

The running result is:

Array
(
    [foo] => bar
    [baz] => boom
    [site] => www.nowamagic.net
    [name] => nowa magic
)


Some key points to explain:


The program above uses the http_build_query() function to build the URL-encoded query string.
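
To make the encoding concrete, this small sketch prints the string that http_build_query() produces for the same array. Note that the space in "nowa magic" becomes a plus sign. The $getUrl variable is added only for illustration.

```php
<?php
$data = array(
    'foo'  => 'bar',
    'baz'  => 'boom',
    'site' => 'www.nowamagic.net',
    'name' => 'nowa magic');
$query = http_build_query($data);
echo $query, "\n";
// prints: foo=bar&baz=boom&site=www.nowamagic.net&name=nowa+magic

// For a GET request the same string is appended to the URL instead:
$getUrl = 'http://www.nowamagic.net/test2.php?' . $query;
```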


stream_context_create() creates the context options used when opening a stream, for example to send a POST request, go through a proxy, or send extra headers. Here is another example:

$context = stream_context_create(array( 
    'http' => array( 
        'method'  => 'POST', 
        'header'  => sprintf("Authorization: Basic %s\r\n", base64_encode($username.':'.$password)). 
        "Content-type: application/x-www-form-urlencoded\r\n", 
        'content' => http_build_query(array('status' => $message)), 
        'timeout' => 5, 
    ), 
)); 
$ret = file_get_contents('http://twitter.com/statuses/update.xml', false, $context);


The context created by stream_context_create() can be used with both streams and the file system. It is most useful for functions such as file_get_contents(), file_put_contents() and readfile(), which take a file name rather than a file handle. Adding headers is only part of what it can do: you can also define a proxy, a timeout and so on, which puts PHP's built-in web access on a par with cURL for requests like these.
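
As a sketch of the proxy and timeout options mentioned above, the context below combines them with an extra header and then reads the stored options back with stream_context_get_options(). The proxy address is a placeholder, not a real server, and no request is sent.

```php
<?php
// Build a context for a GET request that goes through a proxy with a timeout.
$options = array(
    'http' => array(
        'method'          => 'GET',
        'proxy'           => 'tcp://proxy.example.com:8080', // placeholder address
        'request_fulluri' => true,  // send the full URI in the request line, as many proxies expect
        'timeout'         => 5,     // seconds
        'header'          => "Accept-Language: en\r\n",
    ),
);
$context = stream_context_create($options);

// stream_context_get_options() returns the options stored in the context.
$stored = stream_context_get_options($context);
echo $stored['http']['proxy'], "\n";

// The context would then be passed to a stream function, for example:
// $html = file_get_contents('http://www.nowamagic.net', false, $context);
```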


stream_context_create() creates and returns a stream context with the given options applied. It can be used to set the timeout, proxy server, request method and headers for fopen(), file_get_contents() and similar functions.


stream_context_create() can also handle timeouts for file_get_contents() through the timeout option:

$opts = array(
    'http'=>array(
        'method'=>"GET",
        'timeout'=>60,
    )
);
// create the stream context
$context = stream_context_create($opts);
$html =file_get_contents('http://www.nowamagic.net', false, $context);
// with fopen(), output all remaining data at the file pointer:
// fpassthru($fp); // call this before fclose()
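
When the timeout is reached, or the request fails for another reason, file_get_contents() returns false and raises a warning. The sketch below wraps the call so the caller only has to check for null; the helper name get_with_timeout is an assumption.

```php
<?php
// Hypothetical helper: GET a URL with a timeout.
// Returns the body as a string, or null on failure (including a timeout).
function get_with_timeout($url, $seconds)
{
    $context = stream_context_create(array(
        'http' => array('method' => 'GET', 'timeout' => $seconds),
    ));
    // @ suppresses the warning file_get_contents() raises on failure;
    // the false return value is checked instead.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Example usage (commented out because it performs a real request):
// $html = get_with_timeout('http://www.nowamagic.net', 60);
// if ($html === null) {
//     echo "request failed or timed out\n";
// }
```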

