Implementing the simplest crawler prototype in PHP

巴扎黑
Release: 2016-11-24 13:41:00

The simplest crawler model works like this: given an initial URL, the crawler downloads the page, extracts the URLs it contains, and then repeats the process using each of those URLs as a new starting point.
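As a rough sketch of that loop, independent of how a page is fetched or parsed, the model can be written as a queue of URLs still to visit. The names `crawlFrom`, `$fetchLinks`, and `$maxPages`, and the `example.test` URLs, are illustrative assumptions and not part of the original program. The "seen" set and the page limit are additions that keep the sketch from revisiting pages or running forever.

```php
<?php
/**
 * Breadth-first crawl loop (illustrative sketch).
 * $fetchLinks is a hypothetical callback: it takes a URL and returns
 * the array of absolute URLs found on that page.
 */
function crawlFrom(string $seedUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $queue = [$seedUrl];         // URLs waiting to be crawled
    $seen = [$seedUrl => true];  // every URL ever queued, to avoid repeats
    $visited = [];               // URLs crawled so far, in order

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Usage with an in-memory "site" standing in for real HTTP requests.
$site = [
    'http://example.test/'  => ['http://example.test/a', 'http://example.test/b'],
    'http://example.test/a' => ['http://example.test/', 'http://example.test/b'],
    'http://example.test/b' => [],
];
$order = crawlFrom('http://example.test/', function (string $url) use ($site): array {
    return $site[$url] ?? [];
});
print_r($order); // the three pages, each visited once, starting from the seed
```

Passing the fetch step in as a callback keeps the loop testable without network access; a real crawler would plug in a function that downloads the page and extracts its links.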

The following is the simplest crawler model implemented in PHP.

<?php
/**
 * Crawler program -- prototype
 *
 * BookMoth 2009-02-21
 */
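/**
 * Fetch the HTML content of the given URL.
 *
 * @param string $url
 * @return string|false  page content (at most 1 MB), or false on failure
 */
function _getUrlContent($url){
    // Opening http:// URLs this way requires allow_url_fopen to be enabled.
    $handle = @fopen($url, "r");
    if($handle){
        $content = stream_get_contents($handle, 1024*1024);
        fclose($handle);
        return $content;
    }else{
        return false;
    }
}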
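/**
 * Extract links from HTML content.
 *
 * @param string $web_content
 * @return array  href values of the <a> tags found (empty if none)
 */
function _filterUrl($web_content){
    // Match <a ... href=...>, case-insensitively; quotes around the value are optional.
    $reg_tag_a = '/<a\s[^>]*?href=[\'"]?([^>\'" ]*)[^>]*>/i';
    $result = preg_match_all($reg_tag_a, $web_content, $match_result);
    if($result){
        return $match_result[1];
    }
    return array();
}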
/**
 * Convert relative paths to absolute URLs.
 *
 * @param string $base_url
 * @param array $url_list
 * @return array|null
 */
function _reviseUrl($base_url,$url_list){
    $url_info = parse_url($base_url);
    $base_url = $url_info["scheme"].'://';
    if(!empty($url_info["user"]) && !empty($url_info["pass"])){
        $base_url .= $url_info["user"].":".$url_info["pass"]."@";
    }
    $base_url .= $url_info["host"];
    if(!empty($url_info["port"])){
        $base_url .= ":".$url_info["port"];
    }
    if(!empty($url_info["path"])){
        $base_url .= $url_info["path"];
    }
    // Drop trailing slashes so the join below does not produce "//".
    $base_url = rtrim($base_url, '/');
    if(is_array($url_list)){
        $result = array();
        foreach ($url_list as $url_item) {
            if(preg_match('/^http/',$url_item)){
                // Already a complete URL
                $result[] = $url_item;
            }else {
                // Incomplete URL: naively append it to the base URL
                $real_url = $base_url.'/'.$url_item;
                $result[] = $real_url;
            }
        }
        return $result;
    }else {
        return;
    }
}
/**
 * Crawler: fetch one page and return the links found on it.
 *
 * @param string $url
 * @return array|null
 */
function crawler($url){
    $content = _getUrlContent($url);
    if($content){
        $url_list = _reviseUrl($url,_filterUrl($content));
        if($url_list){
            return $url_list;
        }else {
            return ;
        }
    }else{
        return ;
    }
}
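/**
 * Main program, for testing.
 *
 */
function main(){
    $current_url = "http://hao123.com/"; // seed URL
    $fp_puts = fopen("url.txt","ab");    // append discovered URLs to the list
    $fp_gets = fopen("url.txt","r");     // read saved URLs back from the list
    do{
        $result_url_arr = crawler($current_url);
        if($result_url_arr){
            foreach ($result_url_arr as $url) {
                fputs($fp_puts,$url."\r\n");
            }
        }
        // fgets keeps the line ending, so trim it before using the URL.
        $line = fgets($fp_gets,1024);
        $current_url = ($line === false) ? false : trim($line);
    }while ($current_url); // keep reading URLs until the list is exhausted
    fclose($fp_gets);
    fclose($fp_puts);
}
main();
?>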
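
The `_reviseUrl` function above simply appends every non-`http` link to the base URL, so root-relative links (`/top`), protocol-relative links (`//host/x`), fragments (`#top`), and `mailto:` links all come out wrong. A somewhat more careful resolver might look like the following sketch. The function name `resolveLink` is a hypothetical helper, not part of the original program, and it deliberately skips `../` normalisation and `user:pass` handling.

```php
<?php
/**
 * Resolve $link against $baseUrl (illustrative sketch).
 * Returns null for links a crawler would normally skip: empty links,
 * fragment-only links, and non-HTTP schemes such as mailto:.
 */
function resolveLink(string $baseUrl, string $link): ?string
{
    $link = trim($link);
    if ($link === '' || $link[0] === '#') {
        return null;
    }
    if (preg_match('#^https?://#i', $link)) {
        return $link;                          // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return null;                           // mailto:, javascript:, tel:, ...
    }
    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return null;                           // base URL is not usable
    }
    if (substr($link, 0, 2) === '//') {
        return $base['scheme'] . ':' . $link;  // protocol-relative
    }
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
    $path = $base['path'] ?? '/';
    if ($link[0] === '/') {
        return $origin . $link;                // root-relative
    }
    if ($link[0] === '?') {
        return $origin . $path . $link;        // query-only: keep the base path
    }
    // Path-relative: replace everything after the last "/" of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

echo resolveLink('http://hao123.com/news/index.html', 'a.html'), "\n"; // http://hao123.com/news/a.html
echo resolveLink('http://hao123.com/news/index.html', '/top'), "\n";   // http://hao123.com/top
```

Returning `null` for unusable links lets the caller filter them out before writing to `url.txt`, instead of queueing addresses that can never be fetched.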
source:php.cn