In fact, since PHP 5, PHP has shipped with a powerful class for parsing and generating XML: the DOMDocument class we are going to talk about today. Even so, I suspect most people still reach for regular expressions to parse page content when crawling web pages. After learning this class today, you can try PHP's built-in approach for parsing and analysis next time.
Parsing HTML
// Parse HTML
$baidu = file_get_contents('https://www.baidu.com');

$doc = new DOMDocument();
@$doc->loadHTML($baidu);

// Baidu search box
$inputSearch = $doc->getElementById('kw');
var_dump($inputSearch);
// object(DOMElement)#2
// ....

echo $inputSearch->getAttribute('name'), PHP_EOL; // wd

// Get the links of all images
$allImageLinks = [];
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
    $allImageLinks[] = $img->getAttribute('src');
}

print_r($allImageLinks);
// Array
// (
//     [0] => //www.baidu.com/img/baidu_jgylogo3.gif
//     [1] => //www.baidu.com/img/bd_logo.png
//     [2] => http://s1.bdstatic.com/r/www/cache/static/global/img/gs_237f015b.gif
// )

// Analyze the links with parse_url
foreach ($allImageLinks as $link) {
    print_r(parse_url($link));
}
// Array
// (
//     [host] => www.baidu.com
//     [path] => /img/baidu_jgylogo3.gif
// )
// Array
// (
//     [host] => www.baidu.com
//     [path] => /img/bd_logo.png
// )
// Array
// (
//     [scheme] => http
//     [host] => s1.bdstatic.com
//     [path] => /r/www/cache/static/global/img/gs_237f015b.gif
// )
Doesn't it feel clear and object-oriented? It is a bit like using an ORM library for database operations for the first time. Let's go through it piece by piece.
$baidu = file_get_contents('https://www.baidu.com');

$doc = new DOMDocument();
@$doc->loadHTML($baidu);
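The first step is loading the document content, which is easy to understand: the loadHTML() method loads HTML directly from a string. The @ operator in front of the call suppresses the warnings that the parser emits for markup it considers invalid, which real-world pages often contain. DOMDocument also provides several other loading methods: load() loads XML from a file, loadXML() loads XML from a string, and loadHTMLFile() loads HTML from a file.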
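As an alternative to the @ operator, the loading step can be sketched with libxml_use_internal_errors(), which collects parser messages instead of printing them. This is only a minimal sketch: the inline $html string is a hypothetical stand-in for a downloaded page, and whether any messages are reported depends on the markup and the libxml version.

```php
<?php
// Hypothetical inline markup standing in for a downloaded page.
$html = '<!DOCTYPE html><html><body><p id="greeting">Hello</p></body></html>';

// Collect libxml messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Fetch whatever the parser reported, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($loaded);
echo count($errors), ' parser message(s)', PHP_EOL;
echo $doc->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Compared with @, this keeps the messages available for logging, which may help when a crawled page parses into something unexpected.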
// Baidu search box
$inputSearch = $doc->getElementById('kw');
var_dump($inputSearch);
// object(DOMElement)#2
// ....

echo $inputSearch->getAttribute('name'), PHP_EOL; // wd
Next, we use the same DOM API as front-end JavaScript to work with elements in the HTML. In this example, getElementById() returns the DOMElement object whose id matches the given value, here Baidu's search box. From that object you can then read its value, attributes, and so on.
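To try the lookup without a network request, here is a small sketch on an inline string. The markup is hypothetical; only the id "kw" and the name "wd" mirror the example above. It also shows that getElementById() returns null when no element matches, which is worth checking before calling methods on the result.

```php
<?php
// Hypothetical markup; the id "kw" and name "wd" mirror the article's example.
$html = '<html><body><form><input id="kw" name="wd" class="s_ipt"></form></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$input   = $doc->getElementById('kw');
$missing = $doc->getElementById('no-such-id');

// Guard against a missing element before reading attributes.
if ($input !== null) {
    echo $input->tagName, PHP_EOL;
    echo $input->getAttribute('name'), PHP_EOL;
    echo $input->getAttribute('class'), PHP_EOL;
}

var_dump($missing); // NULL, because no element has that id
```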
// Get the links of all images
$allImageLinks = [];
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
    $allImageLinks[] = $img->getAttribute('src');
}

print_r($allImageLinks);
// Array
// (
//     [0] => //www.baidu.com/img/baidu_jgylogo3.gif
//     [1] => //www.baidu.com/img/bd_logo.png
//     [2] => http://s1.bdstatic.com/r/www/cache/static/global/img/gs_237f015b.gif
// )

// Analyze the links with parse_url
foreach ($allImageLinks as $link) {
    print_r(parse_url($link));
}
// Array
// (
//     [host] => www.baidu.com
//     [path] => /img/baidu_jgylogo3.gif
// )
// Array
// (
//     [host] => www.baidu.com
//     [path] => /img/bd_logo.png
// )
// Array
// (
//     [scheme] => http
//     [host] => s1.bdstatic.com
//     [path] => /r/www/cache/static/global/img/gs_237f015b.gif
// )
This example collects all the image links in the HTML document. Compared with regular expressions it is much more convenient, the code is self-explanatory, and there is no need to worry about a pattern failing to match. Combined with parse_url(), another function built into PHP, you can also break each link apart and extract the parts you want.
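The same idea can be sketched on a self-contained string. The example.com URLs below are placeholders, not taken from any real page. The sketch passes PHP_URL_HOST as the second argument to parse_url() so that only the host is returned, and it skips img elements that have no src attribute.

```php
<?php
// Hypothetical markup with placeholder URLs.
$html = '<html><body>'
    . '<img src="//www.example.com/img/logo.png">'
    . '<img src="http://static.example.com/cache/pixel.gif">'
    . '<img alt="no source">'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$hosts = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // Skip images that carry no src attribute at all.
    if (!$img->hasAttribute('src')) {
        continue;
    }
    // Ask parse_url() for the host component only.
    $host = parse_url($img->getAttribute('src'), PHP_URL_HOST);
    if (is_string($host)) {
        $hosts[] = $host;
    }
}

print_r($hosts);
```

As in the output above, a protocol-relative link such as //www.example.com/img/logo.png still yields a host, just without a scheme.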
Parsing XML is similar to parsing HTML: both are handled through the methods provided by DOMDocument and DOMElement. So what if we want to generate XML in a standard format? That is also very simple. There is no need to concatenate strings; you can build the document with this class through object operations.
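Since the article only shows HTML parsing, here is a minimal sketch of the XML side, assuming a made-up books document. It loads the string with loadXML() and reads attributes and text with the same methods used for HTML.

```php
<?php
// Hypothetical XML payload used only for illustration.
$xmlString = <<<'XML'
<books>
  <book id="b1"><title>First Book</title></book>
  <book id="b2"><title>Second Book</title></book>
</books>
XML;

$doc = new DOMDocument();
$doc->loadXML($xmlString);

// Map each book's id attribute to the text of its <title> child.
$titles = [];
foreach ($doc->getElementsByTagName('book') as $book) {
    $title = $book->getElementsByTagName('title')->item(0);
    $titles[$book->getAttribute('id')] = $title->textContent;
}

print_r($titles);
```

One difference worth knowing: for XML documents, getElementById() only finds an element when its attribute has been declared as an ID (for example through a DTD), so this sketch walks the elements with getElementsByTagName() instead.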
Generating XML
// Generate an XML document
$xml = new DOMDocument('1.0', 'UTF-8');

$node1 = $xml->createElement('First', 'This is First Node.');
$node1->setAttribute('type', '1');

$node2 = $xml->createElement('Second');
$node2->setAttribute('type', '2');

$node2_child = $xml->createElement('Second-Child', 'This is Second Node Child.');
$node2->appendChild($node2_child);

$xml->appendChild($node1);
$xml->appendChild($node2);

print $xml->saveXML();

/*
<?xml version="1.0" encoding="UTF-8"?>
<First type="1">This is First Node.</First>
<Second type="2"><Second-Child>This is Second Node Child.</Second-Child></Second>
*/
In fact, with even a little front-end JavaScript background, the meaning of this code is easy to see. createElement() creates a DOMElement object, to which you then add attributes and content. appendChild() adds a child node to the current DOMElement or DOMDocument. Finally, saveXML() serializes the document to an XML string. Note that this example appends two elements at the top level of the document for brevity; well-formed XML has exactly one root element, so a real document would normally wrap them in a single parent.
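A variation with a single root element might look like the sketch below. The root name Nodes and the text "Tom & Jerry" are hypothetical, chosen only for illustration. The sketch sets formatOutput for indented output and uses createTextNode() for the text that contains an ampersand, because the value passed to createElement() is not escaped.

```php
<?php
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->formatOutput = true; // indent the serialized output

// Hypothetical root element wrapping the two nodes.
$root = $xml->createElement('Nodes');
$xml->appendChild($root);

$first = $xml->createElement('First');
$first->setAttribute('type', '1');
// createTextNode() escapes special characters such as & on output.
$first->appendChild($xml->createTextNode('Tom & Jerry'));
$root->appendChild($first);

$second = $xml->createElement('Second');
$second->setAttribute('type', '2');
$second->appendChild($xml->createElement('Second-Child', 'This is Second Node Child.'));
$root->appendChild($second);

$output = $xml->saveXML();
echo $output;
```

Because there is only one root element, the result can be loaded back with loadXML() by a strict parser.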
Summary
After these two simple examples, I believe you are already interested in how DOMDocument handles parsing XML-like documents. I have not run any tests on how its performance compares with regular-expression parsing. Under normal circumstances, though, a website's HTML document will not be very large. After all, every site cares about its own loading speed, and a very large document would mean a poor user experience. So for everyday crawler analysis and processing, this set of interfaces should be perfectly adequate.
Test code:
https://github.com/zhangyue0503/dev-blog/blob/master/php/202002/source/PHP%E4%B8%AD%E4%BD%BF%E7%94%A8DOMDocument%E6%9D%A5%E5%A4%84%E7%90%86HTML%E3%80%81XML%E6%96%87%E6%A1%A3.php
Reference document:
https://www.php.net/manual/zh/class.domdocument.php