ThinkPHP5 + Beanbun: simple crawling of movie URLs and pictures

零到壹度 (original)
2018-03-30 10:59:49

This article shows how to combine ThinkPHP5 with the Beanbun crawler library to grab movie URLs and pictures from a movie listing site.

First, create two tables: one for the first-level URLs (one row per movie detail page) and one for the pictures found under each of those URLs.

Data tables (think_dy2018 and think_dy2018imgs)

CREATE TABLE `think_dy2018` (
  `id` int(7) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `movieName` varchar(255) NOT NULL COMMENT 'movie name',
  `movieUrl` varchar(520) NOT NULL COMMENT 'movie detail page URL',
  `addTime` int(11) NOT NULL COMMENT 'time added',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8808 DEFAULT CHARSET=utf8 COMMENT='dywz data collection';

CREATE TABLE `think_dy2018imgs` (
  `id` int(8) unsigned NOT NULL AUTO_INCREMENT COMMENT 'image id',
  `urlID` int(7) NOT NULL COMMENT 'related movie id',
  `imgUrl` varchar(520) DEFAULT NULL COMMENT 'image URL',
  `create_time` int(10) NOT NULL COMMENT 'time the image was added',
  PRIMARY KEY (`id`),
  KEY `urlID` (`urlID`)
) ENGINE=InnoDB AUTO_INCREMENT=1279 DEFAULT CHARSET=utf8 COMMENT='image URLs';

Next comes the ThinkPHP5 side. Install the Beanbun library first, then inspect the markup of the movie site. Once Beanbun has downloaded a page, a regular expression picks out the movieName and movieUrl values you want.
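To make the regular-expression step concrete, here is a minimal standalone sketch that runs the same pattern used in the controller below against a small HTML fragment. The fragment and the helper name extractMovies are made up for illustration; the real list-page markup may differ, so treat this as a way to experiment with the pattern rather than a finished parser.

```php
<?php
// Hypothetical helper (not part of the original controller):
// extract movie names and hrefs from list-page HTML.
function extractMovies(string $html): array
{
    // A bold link inside a cell with height="26".
    // The U flag makes quantifiers lazy; the s flag lets "." match newlines.
    $pattern = '/<td height="26">\s*<b>\s*<a href="(.+)".*>(.*)<\/a>\s*<\/b>/sU';
    preg_match_all($pattern, $html, $m, PREG_SET_ORDER);

    $movies = [];
    foreach ($m as $match) {
        $movies[] = [
            'movieName' => strip_tags($match[2]),
            'movieUrl'  => $match[1],
        ];
    }
    return $movies;
}

// Made-up sample markup, not copied from the real site.
$sample = '<td height="26"><b><a href="/html/gndy/dyzz/1.html" class="ulink">Movie A</a></b></td>'
        . '<td height="26"><b><a href="/html/gndy/dyzz/2.html" class="ulink">Movie B</a></b></td>';

$movies = extractMovies($sample);
print_r($movies);
```

The hrefs captured this way are relative, which is why the controller passes each one through Helper::formatUrl() together with the URL of the page it came from.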

ThinkPHP5 code

1. Run getList() to collect movieName and movieUrl from the first-level (list) pages.
2. Run getImage() to collect the large pictures from the detail page behind each movieUrl.
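As a small illustration of the input to step 1, the seed list is simply the index page plus the numbered list pages. The sketch below builds it in isolation; buildSeedUrls is a hypothetical helper name, and the last page number (173) is whatever the site exposed when the article was written, so it will likely need adjusting.

```php
<?php
// Hypothetical helper: build the list-page URLs used as crawl seeds.
function buildSeedUrls(int $lastPage): array
{
    $base = 'http://www.ygdy8.net/html/gndy/dyzz/';
    $urls = [$base . 'index.html'];              // page 1 is the index page
    for ($i = 2; $i <= $lastPage; $i++) {
        $urls[] = $base . 'list_23_' . $i . '.html';
    }
    return $urls;
}

$seeds = buildSeedUrls(173);
echo count($seeds), PHP_EOL;
echo $seeds[1], PHP_EOL;
```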

<?php
/*
+-------------------------------------------------------------------------------------------
+ Title        : Crawler controller
+ Version      : V1.0.0.2
+ Initial-Time : 2018/3/27 + sgw
+ Last-time    : 2018/3/27 + sgw
+ Desc         : Crawls movie information from a website
+-------------------------------------------------------------------------------------------
*/
namespace app\index\controller;

use Beanbun\Beanbun;
use Beanbun\Lib\Helper;
use GuzzleHttp\Client;
use think\Controller;
use think\Db;

class Robot extends Controller
{
    /**
     * Crawl the first-level (list) pages.
     */
    public function getList()
    {
        $beanbun = new Beanbun;

        // Seed URLs: the index page plus list pages 2..173
        $urlList = ['http://www.ygdy8.net/html/gndy/dyzz/index.html'];
        for ($i = 2; $i <= 173; $i++) {
            $urlList[] = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_' . $i . '.html';
        }
        $beanbun->seed = $urlList;

        // Callback that runs after each page has been downloaded
        $beanbun->afterDownloadPage = function ($beanbun) {
            // Treat a very short response as a failed download
            if (strlen($beanbun->page) < 100) {
                $beanbun->error();
            }

            // Convert the downloaded page from GB2312 to UTF-8
            $contents = mb_convert_encoding($beanbun->page, 'UTF-8', 'GB2312');
            file_put_contents('66.html', $contents); // debug copy of the most recent page

            $pattern = '/<td height="26">\s*<b>\s*<a href="(.+)".*>(.*)<\/a>\s*<\/b>/sU';
            preg_match_all($pattern, $contents, $m);

            // Build rows from the matches and insert them in one batch
            if ($m[0]) {
                $hrefs  = $m[1];
                $titles = $m[2];
                $data   = [];

                foreach ($hrefs as $key => $href) {
                    // Resolve the relative href against the current page URL
                    $url    = Helper::formatUrl($href, $beanbun->url);
                    $data[] = [
                        'movieName' => strip_tags($titles[$key]),
                        'movieUrl'  => $url,
                        'addTime'   => time(),
                    ];
                }

                Db::name('dy2018')->insertAll($data);
            }
        };

        // Start crawling
        $beanbun->start();
    }
       
    /**
     * Crawl the images on each movie detail page.
     * Reads id and movieUrl from the movie table, then scrapes the page at each movieUrl for images.
     * A page may contain several images, so each one is stored in a loop with the same urlID.
     */
    public function getImage()
    {
        // column('id,movieUrl') returns an array of movieUrl values keyed by id
        $movies = Db::table('think_dy2018')->column('id,movieUrl');

        foreach ($movies as $id => $movieUrl) {
            $html = self::https_request($movieUrl);
            if ($html === false) {
                continue; // request failed, skip this movie
            }

            $pattern = '/<img border="0"\s+src="(.+)".*>/sU';
            preg_match_all($pattern, $html, $m);

            if ($m[0]) {
                foreach ($m[1] as $imgUrl) {
                    $data = [
                        'imgUrl'      => $imgUrl,
                        'urlID'       => $id,
                        'create_time' => time(),
                    ];

                    Db::name('dy2018imgs')->insert($data);
                }
            }
        }
    }
  
    /**
     * General-purpose cURL request helper.
     * @param  string     $url  Request URL
     * @param  array|null $data POST data; when empty, a GET request is sent
     * @return mixed      Response body, or false on failure
     */
    private static function https_request($url, $data = null)
    {
        // Initialize a cURL session
        $curl = curl_init();

        // Set the request options, including the URL
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // skip verification of the server certificate
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false); // skip the certificate host-name check

        if (!empty($data)) {
            curl_setopt($curl, CURLOPT_POST, 1);           // send the request as POST
            curl_setopt($curl, CURLOPT_POSTFIELDS, $data); // POST payload
        }
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);     // return the body instead of printing it
        $response = curl_exec($curl);                      // run the request and read the response
        curl_close($curl);                                 // release the cURL handle
        return $response;
    }
}

That is the complete code.
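The image-extraction step in getImage() can also be tried without a database or network. The sketch below applies the same image pattern to a made-up HTML fragment and builds rows shaped like the think_dy2018imgs table. The helper name buildImageRows, the sample markup, the movie id 42, and the fixed timestamp are all assumptions for illustration.

```php
<?php
// Hypothetical helper: turn one detail page into rows for the image table.
function buildImageRows(int $movieId, string $html, int $now): array
{
    // Same pattern as getImage(): <img border="0" src="..."> with lazy matching
    $pattern = '/<img border="0"\s+src="(.+)".*>/sU';
    preg_match_all($pattern, $html, $m);

    $rows = [];
    foreach ($m[1] as $imgUrl) {
        $rows[] = [
            'imgUrl'      => $imgUrl,
            'urlID'       => $movieId, // every image points back to the same movie
            'create_time' => $now,
        ];
    }
    return $rows;
}

// Made-up sample markup, not copied from the real site.
$sample = '<p><img border="0" src="http://img.example.com/poster.jpg" alt=""></p>'
        . '<p><img border="0" src="http://img.example.com/still.jpg" alt=""></p>';

$rows = buildImageRows(42, $sample, 1522378789);
print_r($rows);
```

Collecting the rows for a page first, as this sketch does, would also make it possible to write them with a single insertAll() call instead of one insert() per image, if the number of queries becomes a concern.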
