I recently needed to collect some data. Saving pages by hand with the browser's "Save As" is tedious, and the result is hard to store and retrieve, so I wrote a small crawler to fetch things from the web. So far it has crawled nearly one million pages, and I am now working out how to process this data.
The structure of the crawler:
The principle of a crawler is actually very simple: analyze a downloaded page, find the links in it, download those links, analyze and download again, and repeat the cycle. For data storage, a database is the first choice because it makes retrieval easy, and the development language only needs to support regular expressions. I chose MySQL for the database, so I chose PHP for the script: it supports Perl-compatible regular expressions, connects to MySQL conveniently, can download over HTTP, and can be deployed on both Windows and Linux.
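To make that cycle concrete, here is a minimal sketch of the download-extract-repeat loop. It is an illustration only, not the actual implementation (a complete standalone demo appears at the end of this article), and it uses file_get_contents and an in-memory queue purely for brevity:

<?php
# Minimal sketch of the crawl cycle (illustration only, not the actual implementation).
$queue = array('http://news.sina.com.cn');    # seed URL (the example used later in this article)
$seen  = array();                             # URLs that have already been processed

while ($queue) {
    $url = array_shift($queue);
    if (isset($seen[$url])) { continue; }     # never download the same URL twice
    $seen[$url] = true;

    $html = @file_get_contents($url);         # download the page
    if ($html === false) { continue; }
    # ... store $html in the database here ...

    # Extract href values and queue them (relative links would still need to be made absolute).
    if (preg_match_all("#<a[^>]+href=(['\"])(.*)\\1#isU", $html, $m)) {
        foreach ($m[2] as $link) {
            $queue[] = $link;
        }
    }
}
?>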
Regular expression:
Regular expressions are a basic tool for processing text. To extract links and images from HTML, the regular expressions used are as follows.
The code is as follows:
"#] href=(['"])(. )\1#isU" Processing links
"#] src=(['"])(. )\1#isU" Processing images
Other issues:
Another thing to watch out for when writing a crawler is that URLs which have already been downloaded must not be downloaded again, and the links between some pages form loops, so this has to be handled. My approach is to compute the MD5 value of each processed URL and store it in the database, so it is easy to check whether a URL has already been downloaded. There are of course better algorithms; if you are interested, look them up online.
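A minimal sketch of that check might look like the following, assuming a hypothetical url_history table keyed by the MD5 of the URL (the table and column names are illustrative, not the ones in the actual source; an index or primary key on the MD5 column is what keeps the lookup fast even with many records):

<?php
# Assumed schema (illustration only):
#   CREATE TABLE url_history (url_md5 CHAR(32) PRIMARY KEY, url TEXT);
$mysqli = new mysqli('localhost', 'user', 'password', 'net_spider');   # placeholder credentials

# Has this URL been downloaded before?
function url_seen($mysqli, $url) {
    $md5  = md5($url);
    $stmt = $mysqli->prepare("SELECT 1 FROM url_history WHERE url_md5 = ?");
    $stmt->bind_param('s', $md5);
    $stmt->execute();
    $stmt->store_result();
    $seen = $stmt->num_rows > 0;
    $stmt->close();
    return $seen;
}

# Remember a URL so it is never downloaded again.
function mark_url_seen($mysqli, $url) {
    $md5  = md5($url);
    $stmt = $mysqli->prepare("INSERT IGNORE INTO url_history (url_md5, url) VALUES (?, ?)");
    $stmt->bind_param('ss', $md5, $url);
    $stmt->execute();
    $stmt->close();
}
?>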
Related protocols:
Crawlers also have their own protocol: the robots.txt file defines which parts of a website crawlers are allowed to traverse. However, due to limited time, I did not implement this feature.
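For reference, a very rough sketch of such a robots.txt check might look like the following; it only honors Disallow rules in the User-agent: * group and is purely illustrative, not part of the author's code:

<?php
# Return false if $path is disallowed by the site's robots.txt (greatly simplified).
function robots_allowed($url_base, $path) {
    $robots = @file_get_contents($url_base . '/robots.txt');
    if ($robots === false) {
        return true;                                        # no robots.txt: assume allowed
    }
    $applies = false;
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        $line = trim($line);
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');   # only honor the catch-all group
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }
    return true;
}

# Example: robots_allowed('http://news.sina.com.cn', '/some/path.html');
?>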
Other notes:
PHP supports class-based programming, and these are the main classes of the crawler I wrote (a rough skeleton sketch follows the list):
1. URL handling: web_site_info, mainly used to process URLs, parse domain names, etc.
2. Database operations: mysql_insert.php, which handles everything related to the database.
3. History handling, which records the URLs that have already been processed.
4. The crawler class itself.
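Based on that description, the overall structure could be sketched roughly as follows; the class and method bodies here are placeholders, not the real source (see the download link below for the actual implementation):

<?php
# Illustrative skeletons only; the real classes are in the downloadable source.
class web_site_info {                 # URL handling: parse a URL, extract its domain, etc.
    public function get_domain($url) {
        $parts = parse_url($url);
        return isset($parts['host']) ? $parts['host'] : '';
    }
}

class mysql_insert {                  # database operations: store pages and history
    private $mysqli;
    public function __construct($host, $user, $pass, $db) {
        $this->mysqli = new mysqli($host, $user, $pass, $db);
    }
    public function insert_page($url, $html) { /* INSERT INTO ... */ }
}

class history {                       # history handling: remember processed URLs
    public function is_seen($url)    { /* look up md5($url) */ }
    public function mark_seen($url)  { /* store md5($url) */ }
}

class spider {                        # the crawler itself: download, extract links, recurse
    public function crawl($url, $depth) { /* fetch page, extract links, crawl deeper */ }
}
?>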
Existing problems and deficiencies:
This crawler runs well when the amount of data is small, but once the data volume grows, the history-processing class becomes inefficient. Indexing the relevant fields in the database improved the speed somewhat, but the data still has to be read constantly, which may be related to how PHP arrays are implemented: loading 100,000 history records at a time is very slow.
It does not support multi-threading and can only process one URL at a time.
PHP itself has a memory limit at runtime. Once, while crawling to a depth of 20, the program ran out of memory and was killed.
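If you run into the same limit, one common workaround is to raise PHP's memory ceiling for the script, for example:

<?php
# Raise the memory limit for this script only.
ini_set('memory_limit', '512M');
?>

or start the script with an explicit limit from the command line:

php -d memory_limit=512M -f spider.php 20 http://news.sina.com.cn

The more fundamental fix is to keep less state in memory, e.g. checking history in the database rather than in a PHP array.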
The source code can be downloaded from the URL below.
http://xiazai.jb51.net/201506/other/net_spider.rar
When using it, first create the net_spider database in MySQL, then use db.sql to create the related tables, and finally set the MySQL username and password in config.php.
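The exact contents of config.php ship with the download; purely as an illustration, such a configuration file usually looks something like this (the constant names are assumptions, not necessarily those used in the source):

<?php
# Hypothetical config.php layout; check the downloaded source for the real names.
define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PASS', 'your_password');
define('DB_NAME', 'net_spider');
?>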
Finally
The code is as follows:
php -f spider.php depth (numeric value) url
and the crawler will start working. For example:
The code is as follows:
php -f spider.php 20 http://news.sina.com.cn
Now I feel that writing a crawler is not that complicated; what is hard is storing and retrieving the data. My largest data table is already 15 GB, and I am still figuring out how to process it; querying it with MySQL already feels inadequate. I really admire Google on this point.
Attached below is a standalone single-file spider demo. The code is as follows:

<?php
# Download a page with cURL; return the body, or false on failure / 404.
function curl_get($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);             # body only, no response headers
    $result = curl_exec($ch);
    $code   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);                                 # always release the handle
    if($code != '404' && $result){
        return $result;
    }
    return false;
}

# Extract the href values of <a> tags from a downloaded page.
function get_page_urls($spider_page_result, $base_url){
    $get_url_result = preg_match_all("/<[aA][^>]*?href=[\'\"]{0,1}([^>\'\" ]+)[^>]*?>/", $spider_page_result, $out);
    if($get_url_result){
        return $out[1];
    }else{
        return;
    }
}

# Convert relative paths to absolute URLs.
function xdtojd($base_url, $url_list){
    if(is_array($url_list)){
        $result_url_list = array();
        foreach($url_list as $url_item){
            if(preg_match("/^(http:\/\/|https:\/\/|javascript:)/", $url_item)){
                $result_url_list[] = $url_item;
            }else{
                if(preg_match("/^\//", $url_item)){
                    $real_url = $base_url.$url_item;
                }else{
                    $real_url = $base_url."/".$url_item;
                }
                $result_url_list[] = $real_url;
            }
        }
        return $result_url_list;
    }else{
        return;
    }
}

# Drop URLs that belong to other sites.
function other_site_url_del($jd_url_list, $url_base){
    if(is_array($jd_url_list)){
        $all_url_list = array();
        foreach($jd_url_list as $all_url){
            if(strpos($all_url, $url_base) === 0){
                $all_url_list[] = $all_url;
            }
        }
        return $all_url_list;
    }else{
        return;
    }
}

# Drop URLs that have already been recorded in the URL list file.
function url_same_del($array_url){
    if(is_array($array_url)){
        $insert_url = array();
        $pizza = file_get_contents("/tmp/url.txt");
        if($pizza){
            $pizza = explode("\r\n", $pizza);
            foreach($array_url as $array_value_url){
                if(!in_array($array_value_url, $pizza)){
                    $insert_url[] = $array_value_url;
                }
            }
            if($insert_url){
                foreach($insert_url as $key => $insert_url_value){
                    # Treat URLs that differ only in query-string values as duplicates.
                    $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
                    foreach($pizza as $pizza_value){
                        $update_pizza_value = preg_replace('/=[^&]*/', '=leesec', $pizza_value);
                        if($update_insert_url == $update_pizza_value){
                            unset($insert_url[$key]);
                            break;
                        }
                    }
                }
            }
        }else{
            # First run: the URL list file is still empty, so only deduplicate the new batch.
            $insert_url     = $array_url;
            $insert_new_url = array();
            $insert_url_bf  = array();
            foreach($insert_url as $insert_url_value){
                $insert_new_url[] = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
            }
            $insert_new_url = array_unique($insert_new_url);
            foreach($insert_new_url as $key => $insert_new_url_val){
                $insert_url_bf[] = $insert_url[$key];
            }
            $insert_url = $insert_url_bf;
        }
        return $insert_url;
    }else{
        return;
    }
}

$current_url = $argv[1];
$fp_puts = fopen("/tmp/url.txt", "ab");   # append handle: record newly found URLs
$fp_gets = fopen("/tmp/url.txt", "r");    # read handle: fetch the next URL to crawl
$url_base_url = parse_url($current_url);
if(empty($url_base_url['scheme'])){
    $url_base = "http://".$url_base_url['host'];
}else{
    $url_base = $url_base_url['scheme']."://".$url_base_url['host'];
}
do{
    $spider_page_result = curl_get(trim($current_url));   # download the current page
    if(!$spider_page_result){
        continue;
    }
    $url_list = get_page_urls($spider_page_result, $url_base);
    if(!$url_list){
        continue;
    }
    $jd_url_list    = xdtojd($url_base, $url_list);                  # make URLs absolute
    $result_url_arr = other_site_url_del($jd_url_list, $url_base);   # keep same-site URLs only
    $result_url_arr = url_same_del($result_url_arr);                 # drop URLs already recorded
    #var_dump($result_url_arr);                                      # debug output
    if(is_array($result_url_arr)){
        $result_url_arr = array_unique($result_url_arr);
        foreach($result_url_arr as $new_url){
            fputs($fp_puts, $new_url."\r\n");
        }
    }
}while($current_url = fgets($fp_gets, 1024));   # keep reading URLs until the list is exhausted
?>