When people think of crawlers, Python usually comes to mind first, but not everyone knows Python. Can crawlers be written in other languages? Of course they can. This article shows how to write a crawler in PHP.
Get the HTML content of the page
1. Use the file_get_contents function to read the entire file into a string.
// Signature: only the first argument is required.
file_get_contents(path, include_path, context, start, max_length);
// Read a remote page into a string.
$html = file_get_contents('https://fengkui.net/');
In this way, the HTML content of the entire page is read into a string, which can then be parsed.
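The third argument of file_get_contents accepts a stream context, which is one way to set a timeout and a User-Agent for the request. The sketch below is a minimal illustration; the helper name fetch_html, the 10-second default timeout, and the User-Agent string are assumptions and do not come from the original article.

```php
<?php
/**
 * Fetch a URL with file_get_contents and a custom stream context.
 * fetch_html is a hypothetical helper name used only for illustration.
 *
 * @param string $url     Page to fetch.
 * @param int    $timeout Timeout in seconds (assumed default).
 * @return string|null    The body, or null if the request failed.
 */
function fetch_html(string $url, int $timeout = 10): ?string
{
    // The "http" options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)', // assumed value
        ],
    ]);

    // file_get_contents returns false on failure; @ suppresses the warning.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

Calling fetch_html('https://fengkui.net/') would return the page body as a string, or null if the request fails, so the caller can check for failure before parsing.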
2. Use cURL to make a request and obtain the HTML
/**
 * Fetch a page's HTML with cURL.
 *
 * @param  string $url The URL to fetch.
 * @return string|false The response body, or false on failure.
 */
function curlHtml($url)
{
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_ENCODING       => "",   // send Accept-Encoding for every encoding cURL can decode
        CURLOPT_FOLLOWLOCATION => true, // needed for CURLOPT_MAXREDIRS to take effect
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
        CURLOPT_CUSTOMREQUEST  => "GET",
        // Accept-Encoding is left to CURLOPT_ENCODING so the response is always decoded.
        CURLOPT_HTTPHEADER     => array(
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language: zh-CN,zh;q=0.9",
            "Cache-Control: no-cache",
            "Connection: keep-alive",
            "Pragma: no-cache",
            "Upgrade-Insecure-Requests: 1",
            "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
        ),
        // Certificate checks are disabled for convenience; enable them in production.
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => 0,
    ));
    $response = curl_exec($curl);
    $err = curl_error($curl);
    curl_close($curl);
    return $err ? false : $response;
}
With cURL we can also perform more advanced operations, such as simulating a browser login.
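As one illustration of such an operation, the sketch below posts a login form and stores the session cookie in a cookie jar file so that later requests can reuse it. The helper names (session_options, session_request), the example.com URLs, and the form field names are hypothetical; a real site's login form will differ and may require extra fields such as a CSRF token.

```php
<?php
/**
 * Build cURL options for a request that shares one cookie jar.
 * Passing $postFields turns the request into a form POST.
 * (Hypothetical helper, not part of the original article.)
 */
function session_options(string $url, string $cookieFile, ?array $postFields = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect that often follows a login
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies saved by earlier requests are sent from here
    ];
    if ($postFields !== null) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }
    return $options;
}

/**
 * Execute a request built by session_options.
 *
 * @return string|false The response body, or false on failure.
 */
function session_request(array $options)
{
    $curl = curl_init();
    curl_setopt_array($curl, $options);
    $response = curl_exec($curl);
    unset($curl); // release the handle so the cookie jar is flushed to disk
    return $response;
}

// Usage (hypothetical URLs and field names):
// $jar = tempnam(sys_get_temp_dir(), 'cookies');
// session_request(session_options('https://example.com/login', $jar,
//     ['username' => 'user', 'password' => 'secret']));
// $html = session_request(session_options('https://example.com/profile', $jar));
```

Keeping the option-building separate from the request makes it easy to inspect what will be sent before any network traffic happens.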
Parse the page HTML and obtain the required data
1. Use regular expressions to extract the content
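/**
 * Extract the inner HTML of the first matching tag with a regular expression.
 *
 * @param  string $html  The crawled page content.
 * @param  string $tag   The tag to look for.
 * @param  string $attr  The attribute name to match.
 * @param  string $value The value the attribute must contain.
 * @return string The content of the first match, or '' if nothing matches.
 */
function get_tag_data($html, $tag, $attr, $value)
{
    // Escape the inputs so characters such as "." or "/" are matched literally.
    $tag   = preg_quote($tag, '/');
    $attr  = preg_quote($attr, '/');
    $value = preg_quote($value, '/');
    // [^>]* keeps the match inside one opening tag; \b stops "div" from matching "<divider".
    $regex = "/<{$tag}\\b[^>]*\\b{$attr}=\"[^\"]*{$value}[^\"]*\"[^>]*>(.*?)<\\/{$tag}>/is";
    return preg_match($regex, $html, $matches) ? $matches[1] : '';
}

$str = '<div class="feng">冯奎博客</div>';
$value = get_tag_data($str, 'div', 'class', 'feng'); // '冯奎博客'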
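get_tag_data returns only the first match. When a page contains several matching elements, a variant that returns every match can be useful. The sketch below is one way to write it; get_all_tag_data is a hypothetical name, the sample markup is made up, and like any regular-expression approach it assumes simple, non-nested tags.

```php
<?php
/**
 * Return the inner HTML of every tag whose attribute contains the given value.
 * (Hypothetical helper for illustration.)
 *
 * @return string[] One entry per match; an empty array if nothing matches.
 */
function get_all_tag_data(string $html, string $tag, string $attr, string $value): array
{
    // Escape the inputs so they are matched literally inside the pattern.
    $tag   = preg_quote($tag, '/');
    $attr  = preg_quote($attr, '/');
    $value = preg_quote($value, '/');

    // [^>]* keeps the attribute search inside a single opening tag.
    $regex = "/<{$tag}\\b[^>]*\\b{$attr}=\"[^\"]*{$value}[^\"]*\"[^>]*>(.*?)<\\/{$tag}>/is";
    preg_match_all($regex, $html, $matches);

    // $matches[1] holds the first capture group of every match.
    return $matches[1];
}

$html = '<li class="item">PHP</li><li class="item">Python</li><li class="other">Go</li>';
print_r(get_all_tag_data($html, 'li', 'class', 'item')); // the two "item" entries: PHP and Python
```

For nested or irregular markup, a real parser such as the XPath approach in the next section is usually more reliable than a regular expression.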
2. Use XPath to parse the data
XPath (XML Path Language) is a language for addressing parts of an XML document. See the Baidu Encyclopedia entry on XPath for details and usage. Example usage:
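/**
 * Process crawled HTML with XPath.
 *
 * @param  string     $html The crawled page content.
 * @param  string     $path The XPath expression.
 * @param  int|string $tag  1 = text content, 2 = HTML including tags, any other value = name of the attribute to read.
 * @param  bool       $type When true (default), a single result is returned on its own instead of in an array.
 * @return array|string
 */
function get_html_data($html, $path, $tag = 1, $type = true)
{
    $dom = new \DOMDocument();
    // Load the HTML from a string and force UTF-8; @ silences warnings from malformed markup.
    @$dom->loadHTML("<?xml encoding='UTF-8'>" . $html);
    $dom->normalize(); // normalize the document
    $xpath = new \DOMXPath($dom); // DOMXPath runs queries against the DOM
    $contents = $xpath->query($path); // all nodes matching the expression
    $data = [];
    foreach ($contents as $value) {
        if ($tag == 1) {
            $data[] = $value->nodeValue; // content without tags
        } elseif ($tag == 2) {
            $data[] = $dom->saveHTML($value); // content including tags
        } else {
            // Attribute value; an empty string if the node lacks the attribute.
            $attrNode = $value->attributes ? $value->attributes->getNamedItem($tag) : null;
            $data[] = $attrNode ? $attrNode->nodeValue : '';
        }
    }
    // Unwrap a single result only when the caller asked for it.
    if ($type && count($data) == 1) {
        $data = $data[0];
    }
    return $data;
}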
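To make the XPath approach concrete, the sketch below runs a few queries against a small, made-up HTML fragment using DOMDocument and DOMXPath directly. The sample markup and the helper name xpath_values are assumptions for illustration only.

```php
<?php
/**
 * Run an XPath query against an HTML string and return text or attribute values.
 * (Hypothetical helper for illustration.)
 *
 * @param string      $html HTML to parse.
 * @param string      $path XPath expression.
 * @param string|null $attr Attribute to read; null returns the node text.
 * @return string[]
 */
function xpath_values(string $html, string $path, ?string $attr = null): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    if ($nodes === false) {
        return []; // the expression could not be evaluated
    }

    $values = [];
    foreach ($nodes as $node) {
        if ($attr === null) {
            $values[] = trim($node->textContent);
        } elseif ($node instanceof DOMElement && $node->hasAttribute($attr)) {
            $values[] = $node->getAttribute($attr);
        }
    }
    return $values;
}

$html = '<div class="post"><h2>冯奎博客</h2>'
      . '<a href="/a.html">First</a><a href="/b.html">Second</a></div>';

print_r(xpath_values($html, '//div[@class="post"]/h2'));        // text of the heading
print_r(xpath_values($html, '//div[@class="post"]/a'));         // text of each link
print_r(xpath_values($html, '//div[@class="post"]/a', 'href')); // target of each link
```

Here `//div[@class="post"]` selects every div whose class attribute is exactly "post", and the trailing `/a` or `/h2` step selects its direct children of that name.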