Detailed explanation of the basic use of XPath in Python crawlers

This article introduces the basic use of XPath in Python crawlers and is shared here as a reference.

1. Introduction

XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in XML documents. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

2. Installation

pip3 install lxml


3. Usage

1. Import

from lxml import etree


2. Basic usage

from lxml import etree
wb_data = """
    <p>
      <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
       </ul>
     </p>
    """
# Parse the (incomplete) HTML string into an element tree
html = etree.HTML(wb_data)
print(html)
# Serialize the tree back to bytes, then decode for printing
result = etree.tostring(html)
print(result.decode("utf-8"))


As the output below shows, printing html gives a Python object (an lxml Element), while etree.tostring(html) serializes it back to markup. In the process lxml repairs the incomplete HTML: it adds the missing html and body tags and closes the tag that was left open.

<Element html at 0x39e58f0>
<html><body><p>
      <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
       </li></ul>
     </p>
    </body></html>

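The repair behaviour described above can be checked on a smaller input. The following is a minimal sketch, not the article's own code: the fragment is hypothetical, and it uses a div wrapper because lxml's HTML parser may close a p tag as soon as it meets a ul.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <li>.
fragment = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

root = etree.HTML(fragment)                      # lenient HTML parser
repaired = etree.tostring(root).decode("utf-8")  # bytes -> str

print(root.tag)   # tag name of the root element the parser created
print(repaired)   # serialized markup, including the tags the parser added
```

Comparing the printed markup with the input shows which tags were added during parsing.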

3. Get the content of a tag (basic use). Note that when selecting all a tags, the path must not end with a forward slash after a; otherwise an error is raised.

Method one

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a')
print(html)
for i in html_data:
    print(i.text)

Print:

<Element html at 0x12fe4b8>
first item
second item
third item
fourth item
fifth item

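The note about the trailing slash can be demonstrated directly. This is a small sketch on a hypothetical one-item document; it assumes lxml reports the malformed path as etree.XPathEvalError.

```python
from lxml import etree

# Hypothetical sample document with a single link.
root = etree.HTML('<div><ul><li><a href="link1.html">first item</a></li></ul></div>')

try:
    root.xpath('//li/a/')          # trailing slash: not a valid expression
    outcome = "no error"
except etree.XPathEvalError as exc:
    outcome = "XPathEvalError"
    print("rejected:", exc)

# The same path without the trailing slash is accepted.
texts = [a.text for a in root.xpath('//li/a')]
print(outcome, texts)
```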

Method two (append /text() directly to the tag whose content you want)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a/text()')
print(html)
for i in html_data:
    print(i)

Print:

<Element html at 0x138e4b8>
first item
second item
third item
fourth item
fifth item

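The two methods give the same result for simple tags, but they can differ once a tag has child elements. The sketch below uses a hypothetical a tag containing a span to compare the .text attribute, the text() step, and the XPath string() function.

```python
from lxml import etree

# Hypothetical link whose text is split by a child <span>.
root = etree.HTML('<div><a href="link2.html">second item<span>vv</span> tail</a></div>')
a = root.xpath('//a')[0]

first_text = a.text                # only the text before the first child element
direct_texts = a.xpath('text()')   # every text node that is a direct child of <a>
all_text = a.xpath('string()')     # all descendant text concatenated

print(repr(first_text))
print(direct_texts)
print(repr(all_text))
```

When a tag may contain nested markup, string() or joining the text nodes is usually the safer way to get the full visible text.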

4. Open and read an HTML file

# Use parse to open the HTML file
html = etree.parse('test.html')
html_data = html.xpath('//*')
# The result is a list, so iterate over it
print(html_data)
for i in html_data:
    print(i.text)


html = etree.parse('test.html')
html_data = etree.tostring(html, pretty_print=True)
res = html_data.decode('utf-8')
print(res)

Print:

<p>
   <ul>
     <li class="item-0"><a href="link1.html">first item</a></li>
     <li class="item-1"><a href="link2.html">second item</a></li>
     <li class="item-inactive"><a href="link3.html">third item</a></li>
     <li class="item-1"><a href="link4.html">fourth item</a></li>
     <li class="item-0"><a href="link5.html">fifth item</a></li>
   </ul>
</p>

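By default etree.parse uses the XML parser, so it expects a well-formed file like the one printed above. For real-world pages with unclosed tags, one option is to pass an etree.HTMLParser. The sketch below assumes that approach; the temporary file and its contents are hypothetical stand-ins for test.html.

```python
import os
import tempfile

from lxml import etree

# Write a small, deliberately sloppy HTML file (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as fh:
    fh.write(
        '<div><ul><li><a href="link1.html">first item</a>'
        '<li><a href="link2.html">second item</a></ul></div>'
    )
    path = fh.name

try:
    # The HTML parser tolerates the unclosed <li> tags.
    tree = etree.parse(path, etree.HTMLParser())
    hrefs = tree.xpath('//li/a/@href')
finally:
    os.remove(path)

print(hrefs)
```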

5. Print an attribute of the tags under a specified path (iterating over the result gives each attribute value, in the same way as the tag content)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a/@href')
for i in html_data:
    print(i)


Print:

link1.html
link2.html
link3.html
link4.html
link5.html
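Attribute values can also be read from the Element objects themselves, which is convenient when both the link and its text are needed. A minimal sketch on a hypothetical two-item list, using the get method of lxml elements:

```python
from lxml import etree

# Hypothetical two-item list.
root = etree.HTML(
    '<div><ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '</ul></div>'
)

# Option 1: let XPath return the attribute values directly.
hrefs = root.xpath('//li/a/@href')

# Option 2: select the elements, then read attribute and text from each one.
links = [(a.get('href'), a.text) for a in root.xpath('//li/a')]

print(hrefs)
print(links)
```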
6. An xpath query returns its matches as a list (of Element objects, or of strings for text and attributes), so to get at the content we need to iterate over that list.

Find the a tag whose href attribute equals link2.html, using an absolute path.

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a[@href="link2.html"]/text()')
print(html_data)
for i in html_data:
    print(i)


Print:

['second item']

second item
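The bracketed condition is an XPath predicate, and it can test any attribute of any step in the path. The sketch below is illustrative only: the sample list is a shortened, hypothetical version of the data above, and it filters once on href and once on class.

```python
from lxml import etree

# Shortened, hypothetical version of the sample list.
root = etree.HTML("""<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul></div>""")

# Predicate on the <a> tag's own attribute.
by_href = root.xpath('//a[@href="link2.html"]/text()')

# Predicate on the parent <li>, then step down to <a>.
by_class = root.xpath('//li[@class="item-1"]/a/text()')

# The result is always a list, so guard before taking the first match.
first = by_href[0] if by_href else None

print(by_href, by_class, first)
```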

7. The examples above all use absolute paths (each search starts from the root). Below we use a relative path, for example to find the content of the a tags under all li tags.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a/text()')
print(html_data)
for i in html_data:
    print(i)


Print:

['first item', 'second item', 'third item', 'fourth item', 'fifth item']

first item
second item
third item
fourth item
fifth item
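One detail worth knowing when mixing these path styles: a path that starts with // searches the whole document even when it is called on a sub-element, while a path that starts with .// stays inside that element. The following sketch uses a hypothetical document with one link inside a list and one outside it.

```python
from lxml import etree

# Hypothetical document: one link inside a list, one outside it.
root = etree.HTML(
    '<div><ul><li><a href="link1.html">first item</a></li></ul></div>'
    '<div><a href="other.html">outside</a></div>'
)

absolute = root.xpath('/html/body/div/ul/li/a/text()')  # full path from the root
anywhere = root.xpath('//a/text()')                     # <a> at any depth

ul = root.xpath('//ul')[0]
scoped = ul.xpath('.//a/text()')    # only descendants of this <ul>
unscoped = ul.xpath('//a/text()')   # still searches the whole document

print(absolute, anywhere, scoped, unscoped)
```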
8. Above we used an absolute path (a leading /) to get the href attribute values of the a tags. Next we use a relative path to get the href attribute of the a tag under each li tag. The code below writes // before @href, which also matches href attributes on any descendants of a; a single slash (//li/a/@href) returns the same values for this sample.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a//@href')
print(html_data)
for i in html_data:
    print(i)


Print:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

link1.html
link2.html
link3.html
link4.html
link5.html
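The difference between a single and a double slash before an attribute is easiest to see with an attribute that appears at several levels. In this sketch the class attribute on the ul is a hypothetical addition made for the comparison.

```python
from lxml import etree

# Hypothetical list where both the <ul> and its <li> children carry a class.
root = etree.HTML(
    '<div><ul class="menu">'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul></div>'
)

own = root.xpath('//ul/@class')     # class of the <ul> itself
deep = root.xpath('//ul//@class')   # class of <ul> and of every descendant

single = root.xpath('//li/a/@href')
double = root.xpath('//li/a//@href')  # same here: <a> has no descendants with href

print(own, deep)
print(single == double)
```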
9. Checking for a specific attribute value works the same way under a relative path as under an absolute path.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a[@href="link2.html"]')
print(html_data)
for i in html_data:
    print(i.text)


Print:

[<Element a at 0x...>]

second item

10. Find the text of the a tag in the last li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()]/a/text()')
print(html_data)
for i in html_data:
    print(i)


Print:

['fifth item']

fifth item

11. Find the text of the a tag in the penultimate li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()-1]/a/text()')
print(html_data)
for i in html_data:
    print(i)


Print:

['fourth item']

fourth item
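Positional predicates such as [1], [last()] and [last()-1] are evaluated among siblings under the same parent, so on a page with several lists //li[last()] returns one item per list. The sketch below uses two hypothetical lists to show this, and wraps the path in parentheses to take the last li of the whole document instead.

```python
from lxml import etree

# Two hypothetical lists of different lengths.
root = etree.HTML("""<div>
<ul><li>a1</li><li>a2</li><li>a3</li></ul>
<ul><li>b1</li><li>b2</li></ul>
</div>""")

first = root.xpath('//li[1]/text()')              # first <li> of each list
per_list_last = root.xpath('//li[last()]/text()')  # last <li> of each list
penultimate = root.xpath('//li[last()-1]/text()')  # second-to-last of each list
overall_last = root.xpath('(//li)[last()]/text()')  # last <li> in the document

print(first, per_list_last, penultimate, overall_last)
```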
12. Suppose the XPath for a tag you want to extract from a page looks like this:

//*[@id="kw"]


Explanation: using a relative path, find every tag, whatever its name, whose id attribute equals kw.
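As a quick illustration of this expression, here is a sketch on a hypothetical form. Only the id value kw comes from the example above; the other attributes are made up for the demonstration.

```python
from lxml import etree

# Hypothetical form; only the id "kw" is taken from the expression above.
root = etree.HTML(
    '<div><form>'
    '<input id="kw" name="wd" value="">'
    '<input id="su" type="submit">'
    '</form></div>'
)

matches = root.xpath('//*[@id="kw"]')   # * matches an element of any tag name
tags = [el.tag for el in matches]
name = matches[0].get('name') if matches else None

print(tags, name)
```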

Commonly used expressions (Scrapy selectors)

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# HtmlXPathSelector is deprecated and missing from recent Scrapy releases;
# Selector covers the same use, so only Selector is imported here.
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
  <head lang="en">
    <meta charset="UTF-8">
    <title></title>
  </head>
  <body>
    <ul>
      <li class="item-"><a id='i1' href="link.html">first item</a></li>
      <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
      <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <p><a href="llink2.html">second item</a></p>
  </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# Uncomment one query at a time to try it.
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or (note: this looks for a/span one level deeper, under any child of li)
#     # v = item.xpath('*/a/span')
#     print(v)

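Several of the expressions listed above can also be tried with lxml alone. This is a sketch rather than a drop-in replacement: the sample markup is a reduced, hypothetical version of the page above, and it assumes that plain lxml needs the EXSLT regular-expressions namespace passed explicitly for the re: prefix, which Scrapy selectors are understood to register on their own.

```python
from lxml import etree

# EXSLT namespace that lxml uses to resolve the re: prefix.
RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Reduced, hypothetical version of the sample page.
root = etree.HTML("""<ul>
<li><a id="i1" href="link.html">first item</a></li>
<li><a id="i2" href="llink.html">first item</a></li>
<li><a href="llink2.html">second item</a></li>
</ul>""")

with_id = root.xpath('//a[@id]/@href')                        # has an id at all
contains = root.xpath('//a[contains(@href, "link")]/@href')   # substring match
starts = root.xpath('//a[starts-with(@href, "link")]/@href')  # prefix match
regex = root.xpath(r'//a[re:test(@id, "^i\d+$")]/@href', namespaces=RE_NS)

print(with_id, contains, starts, regex)
```

contains() matches all three links here because "link" occurs inside "llink", while starts-with() matches only the first one.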
