This article mainly introduces the basic use of xpath in python crawler. Now I will share it with you and give you a reference. Let’s take a look together
1. Introduction
XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in XML documents. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
2. Installation
##
pip3 install lxml
##3 , use 1, import
from lxml import etree
2, basically use
from lxml import etree wb_data = """ <p> <ul> <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> </ul> </p> """ html = etree.HTML(wb_data) print(html) result = etree.tostring(html) print(result.decode("utf-8"))
Judging from the results below, our printer html is actually a python object, and etree.tostring(html) is the basic writing method of incomplete html, which completes the label that is missing arms and legs.
<Element html at 0x39e58f0> <html><body><p> <ul> <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> </li></ul> </p> </body></html>
3. Get the content of a certain tag (basic use). Note that to get all the content of the a tag, there is no need to add a after it. Forward slash, otherwise an error will be reported.
Writing method one
html = etree.HTML(wb_data) html_data = html.xpath('/html/body/p/ul/li/a') print(html) for i in html_data: print(i.text) <Element html at 0x12fe4b8> first item second item third item fourth item fifth item
Writing method two (just add a /text() directly after the tag where you need to find the content)
html = etree.HTML(wb_data) html_data = html.xpath('/html/body/p/ul/li/a/text()') print(html) for i in html_data: print(i) <Element html at 0x138e4b8> first item second item third item fourth item fifth item
4. Open and read the html file
#使用parse打开html的文件 html = etree.parse('test.html') html_data = html.xpath('//*')<br>#打印是一个列表,需要遍历 print(html_data) for i in html_data: print(i.text)
##
html = etree.parse('test.html') html_data = etree.tostring(html,pretty_print=True) res = html_data.decode('utf-8') print(res) 打印: <p> <ul> <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a></li> </ul> </p>
html = etree.HTML(wb_data) html_data = html.xpath('/html/body/p/ul/li/a/@href') for i in html_data: print(i)
link1.html
link2.htmllink3.htmllink4.html link5.htmlIt is found that the a tag attribute under the absolute path is equal to link2.html.6. We know that we use xpath to get ElementTree objects one by one, so if we need to find the content, we need to traverse to get the data. list.
html = etree.HTML(wb_data) html_data = html.xpath('/html/body/p/ul/li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]/text()') print(html_data) for i in html_data: print(i)
['second item']
second item7. Above we find all absolute paths (each one is searched from the root), below we find relative paths, for example, find the a tag content under all li tags.
html = etree.HTML(wb_data) html_data = html.xpath('//li/a/text()') print(html_data) for i in html_data: print(i)
['first item', 'second item', 'third item', 'fourth item' , 'fifth item']
first itemsecond itemthird itemfourth itemfifth item8. Above we used the absolute path to find all the attributes of the a tag that are equal to the href attribute value. We used /---absolute path. Next we use the relative path to find the li tag under the l relative path. The value of the href attribute under the a tag. Note that double // is required after the a tag.
html = etree.HTML(wb_data) html_data = html.xpath('//li/a//@href') print(html_data) for i in html_data: print(i)
['link1.html', 'link2.html', 'link3.html', ' link4.html', 'link5.html']
link1.htmllink2.htmllink3.htmllink4.htmllink5.html
9. The methods of checking specific attributes under relative paths are similar to those under absolute paths, or they can be said to be the same.
html = etree.HTML(wb_data) html_data = html.xpath('//li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]') print(html_data) for i in html_data: print(i.text)
[
second item
10. Find the href attribute of the a tag in the last li tag
html = etree.HTML(wb_data) html_data = html.xpath('//li[last()]/a/text()') print(html_data) for i in html_data: print(i)
['fifth item']
fifth item11. Find the href attribute of the a tag in the penultimate li tag
html = etree.HTML(wb_data) html_data = html.xpath('//li[last()-1]/a/text()') print(html_data) for i in html_data: print(i)
['fourth item']
fourth item12. If you are extracting a page If the xpath path of a certain tag is as follows:
//*[@id="kw"]
Commonly used
#!/usr/bin/env python # -*- coding:utf-8 -*- from scrapy.selector import Selector, HtmlXPathSelector from scrapy.http import HtmlResponse html = """<!DOCTYPE html> <html> <head lang="en"> <meta charset="UTF-8"> <title></title> </head> <body> <ul> <li class="item-"><a id='i1' href="link.html" rel="external nofollow" rel="external nofollow" >first item</a></li> <li class="item-0"><a id='i2' href="llink.html" rel="external nofollow" >first item</a></li> <li class="item-1"><a href="llink2.html" rel="external nofollow" rel="external nofollow" >second item<span>vv</span></a></li> </ul> <p><a href="llink2.html" rel="external nofollow" rel="external nofollow" >second item</a></p> </body> </html> """ response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8') # hxs = HtmlXPathSelector(response) # print(hxs) # hxs = Selector(response=response).xpath('//a') # print(hxs) # hxs = Selector(response=response).xpath('//a[2]') # print(hxs) # hxs = Selector(response=response).xpath('//a[@id]') # print(hxs) # hxs = Selector(response=response).xpath('//a[@id="i1"]') # print(hxs) # hxs = Selector(response=response).xpath('//a[@href="link.html" rel="external nofollow" rel="external nofollow" ][@id="i1"]') # print(hxs) # hxs = Selector(response=response).xpath('//a[contains(@href, "link")]') # print(hxs) # hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]') # print(hxs) # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]') # print(hxs) # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract() # print(hxs) # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract() # print(hxs) # hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract() # print(hxs) # hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first() # print(hxs) # ul_list = Selector(response=response).xpath('//body/ul/li') # for item in ul_list: # v = item.xpath('./a/span') # # 或 # # v = item.xpath('a/span') # # 或 # # v = item.xpath('*/a/span') # print(v)
Summary of two methods for python crawlers to use real browsers to open web pages
The above is the detailed content of Detailed explanation of the basic use of xpath in python crawler. For more information, please follow other related articles on the PHP Chinese website!