Basic XPath usage in Python web crawlers: a detailed walkthrough

This article introduces the basic usage of XPath in Python web crawlers. It is shared here as a reference.

1. Introduction

XPath is a language for locating information in XML documents. It can be used to navigate the elements and attributes of an XML document. XPath is a major component of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

2. Installation

pip3 install lxml


3. Usage

1. Import

from lxml import etree


from lxml import etree

# Sample markup; the last <li> is left unclosed, so the HTML is incomplete.
wb_data = """
    <p>
      <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
       </ul>
     </p>
    """
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))

As the output below shows, the html we printed is actually a Python object (an lxml Element). etree.tostring(html) serializes it back to markup, and because the parser repairs incomplete HTML, the missing tags have been filled in.

<Element html at 0x39e58f0>
<html><body><p>
      <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
       </li></ul>
     </p>
    </body></html>
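To see the repair step in isolation, here is a minimal sketch. The two-item fragment and the variable names are made up for illustration; only etree.HTML and etree.tostring come from the text above.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper, and the last <li> is unclosed.
broken = "<ul><li>one</li><li>two</ul>"

root = etree.HTML(broken)                      # lenient HTML parser
repaired = etree.tostring(root).decode("utf-8")

print(root.tag)    # tag of the root element the parser created
print(repaired)    # serialized tree, with the missing tags added
```

The parser wraps the fragment in html and body elements and closes the dangling li, which is why later queries can start from /html/body even though the input never contained those tags.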

3. Get the content of a given tag (basic usage). Note that to select all the a tags, you must not add a trailing slash after a; otherwise an error is raised.

Method 1

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a')
print(html)
for i in html_data:
    print(i.text)

Output:

<Element html at 0x12fe4b8>
first item
second item
third item
fourth item
fifth item
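The trailing-slash warning can be checked directly. This is a small sketch on a made-up one-item list, using a relative path for brevity; it catches etree.XPathError, the base class of lxml's XPath exceptions, rather than assuming one specific subclass.

```python
from lxml import etree

# Hypothetical sample: a one-item list, not the article's full wb_data.
doc = etree.HTML('<ul><li><a href="link1.html">first item</a></li></ul>')

try:
    doc.xpath("//li/a/")              # trailing slash: not a valid expression
    outcome = "no error"
except etree.XPathError as exc:       # base class of lxml's XPath errors
    outcome = type(exc).__name__
    print("rejected:", exc)

texts = [a.text for a in doc.xpath("//li/a")]   # same path without the slash
print(outcome, texts)
```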

Method 2 (simply append /text() to the tag whose content you want)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a/text()')
print(html)
for i in html_data:
    print(i)

Output:

<Element html at 0x138e4b8>
first item
second item
third item
fourth item
fifth item
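One practical difference between the two methods: method 1 returns elements, while method 2 returns strings. In lxml, the strings returned by text() are by default "smart" strings that remember the element they came from, so you can still get back to the tag. A minimal sketch on a shortened, made-up list:

```python
from lxml import etree

# Hypothetical two-item sample list.
doc = etree.HTML(
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '</ul>'
)

pairs = []
for text in doc.xpath("//li/a/text()"):
    parent = text.getparent()         # the <a> element this text belongs to
    pairs.append((str(text), parent.get("href")))

print(pairs)
```

If you only need the text, plain iteration as in method 2 is enough; getparent() is useful when the text and an attribute of the same tag have to be collected together.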

4. Open and read an HTML file

# Open the HTML file with parse
html = etree.parse('test.html')
html_data = html.xpath('//*')
# The result is a list, so iterate over it
print(html_data)
for i in html_data:
    print(i.text)

html = etree.parse('test.html')
html_data = etree.tostring(html, pretty_print=True)
res = html_data.decode('utf-8')
print(res)

Output:

<p>
   <ul>
     <li class="item-0"><a href="link1.html">first item</a></li>
     <li class="item-1"><a href="link2.html">second item</a></li>
     <li class="item-inactive"><a href="link3.html">third item</a></li>
     <li class="item-1"><a href="link4.html">fourth item</a></li>
     <li class="item-0"><a href="link5.html">fifth item</a></li>
   </ul>
</p>
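By default etree.parse uses lxml's XML parser, so the file has to be well-formed. For real-world pages that are not, one option is to pass an HTMLParser. The sketch below writes a made-up stand-in for test.html to a temporary directory so that it can run on its own; the file content is an assumption, not the article's file.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for test.html, with an unclosed <li> on purpose.
markup = (
    '<ul><li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></ul>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)

    # Passing an HTMLParser makes parse() as lenient as etree.HTML().
    tree = etree.parse(path, etree.HTMLParser())
    texts = tree.xpath("//li/a/text()")

print(texts)
```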

5. Print the attributes of the a tags under a given path (iterate over the result to get each attribute value, just as you would for tag content)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a/@href')
for i in html_data:
    print(i)

Output:

link1.html
link2.html
link3.html
link4.html
link5.html

6. As we have seen, an XPath query returns its matches as a list (of element objects, or of strings for text and attribute queries), so to get at the content we still need to iterate over that list.

Find the text of the a tag whose href attribute equals link2.html, using an absolute path.

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a[@href="link2.html"]/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['second item']
second item
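When the attribute value comes from a variable, lxml lets you pass it as an XPath variable instead of building the expression by string concatenation. A minimal sketch; the helper name text_for_href and the shortened sample list are made up for illustration:

```python
from lxml import etree

# Hypothetical two-item sample list.
doc = etree.HTML(
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '</ul>'
)


def text_for_href(tree, href):
    """Return the text of every <a> whose href equals the given value."""
    # $target is an XPath variable; lxml fills it from the keyword argument.
    return tree.xpath("//a[@href=$target]/text()", target=href)


print(text_for_href(doc, "link2.html"))
print(text_for_href(doc, "missing.html"))
```

A query that matches nothing returns an empty list rather than raising, so the same loop works in both cases.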

7. So far every query used an absolute path (searching from the root each time). Next we use relative paths, for example to find the text of the a tags under all li tags.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['first item', 'second item', 'third item', 'fourth item', 'fifth item']
first item
second item
third item
fourth item
fifth item

8. Above, we used an absolute path (a single /) to get the href attribute values of all the a tags. Now we use a relative path to get the href values of the a tags under the li tags. The example writes //@href after a; a single /@href works as well, because href is an attribute of the a tag itself.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a//@href')
print(html_data)
for i in html_data:
    print(i)

Output:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
link1.html
link2.html
link3.html
link4.html
link5.html
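The difference between /@href and //@href only shows up when the a tag has descendants with attributes of their own: a single slash reads the attribute of the a tag itself, while a double slash also looks at everything nested inside it. A small sketch on made-up markup (the img element is an addition for illustration):

```python
from lxml import etree

# Hypothetical sample: the first link contains a nested <img>.
doc = etree.HTML(
    '<ul>'
    '<li><a href="link1.html"><img src="icon.png">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '</ul>'
)

single = doc.xpath("//li/a/@href")     # href of the <a> tags themselves
double = doc.xpath("//li/a//@href")    # href of <a> and of any descendant

direct_src = doc.xpath("//li/a/@src")  # <a> has no src attribute
nested_src = doc.xpath("//li/a//@src") # found on the nested <img>

print(single, double, direct_src, nested_src)
```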

9. Filtering on a specific attribute value works the same way with relative paths as with absolute paths.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a[@href="link2.html"]')
print(html_data)
for i in html_data:
    print(i.text)

Output:

[<Element a at 0x216e468>]
second item

10. Find the text of the a tag in the last li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()]/a/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['fifth item']
fifth item

11. Find the text of the a tag in the second-to-last li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()-1]/a/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['fourth item']
fourth item
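Positional predicates such as last() count from 1, not 0, and they can be combined with position() to select a range. The sketch below rebuilds a five-item list similar to the sample data; the way the markup is generated and the variable names are assumptions for illustration.

```python
from lxml import etree

# Build a five-item list similar to the sample data (generated for brevity).
names = ["first", "second", "third", "fourth", "fifth"]
markup = "<ul>" + "".join(
    f'<li><a href="link{n}.html">{name} item</a></li>'
    for n, name in enumerate(names, start=1)
) + "</ul>"
doc = etree.HTML(markup)

first = doc.xpath("//li[1]/a/text()")                # XPath positions start at 1
last = doc.xpath("//li[last()]/a/text()")
second_last = doc.xpath("//li[last()-1]/a/text()")
first_two = doc.xpath("//li[position()<3]/a/text()")

print(first, last, second_last, first_two)
```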

12. When you need the XPath of a particular tag on a page (for example, as copied from a browser's developer tools), it typically looks like this:

//*[@id="kw"]


Explanation: use a relative path to search all tags and select those whose id attribute equals kw.
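As a quick check of how this expression behaves, here is a sketch on a made-up form. The markup and the attribute values other than id="kw" are assumptions for illustration, not taken from any real page.

```python
from lxml import etree

# Hypothetical page fragment containing an element with id="kw".
page = etree.HTML(
    '<form>'
    '<input id="kw" name="wd" value="">'
    '<input id="su" type="submit" value="Search">'
    '</form>'
)

matches = page.xpath('//*[@id="kw"]')   # any tag, filtered by its id attribute
print([(el.tag, el.get("name")) for el in matches])
```

Because the expression uses * instead of a tag name, it keeps working if the element type changes, as long as the id stays the same.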

Commonly used expressions

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
  <head lang="en">
    <meta charset="UTF-8">
    <title></title>
  </head>
  <body>
    <ul>
      <li class="item-"><a id='i1' href="link.html">first item</a></li>
      <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
      <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <p><a href="llink2.html">second item</a></p>
  </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# Uncomment any pair of lines below to try that expression.
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or, one level deeper (a/span under any child element of the li)
#     # v = item.xpath('*/a/span')
#     print(v)
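The predicates in this cheat sheet are ordinary XPath, so most of them carry over to plain lxml. The sketch below reuses a shortened version of the sample markup (an assumption for brevity). One difference to be aware of: Scrapy's selectors register the re: prefix for you, whereas with lxml you pass the EXSLT regular-expression namespace yourself.

```python
from lxml import etree

# Shortened, hypothetical version of the cheat sheet's sample markup.
doc = etree.HTML(
    '<ul>'
    '<li><a id="i1" href="link.html">first item</a></li>'
    '<li><a id="i2" href="llink.html">first item</a></li>'
    '<li><a href="llink2.html">second item</a></li>'
    '</ul>'
)

# EXSLT regular-expression namespace, needed for re:test() in plain lxml.
RE_NS = {"re": "http://exslt.org/regular-expressions"}

contains = doc.xpath('//a[contains(@href, "link")]/@href')
starts = doc.xpath('//a[starts-with(@href, "link")]/@href')
by_regex = doc.xpath(r'//a[re:test(@id, "^i\d+$")]/@href', namespaces=RE_NS)

print(contains)   # every href that contains "link"
print(starts)     # only hrefs that begin with "link"
print(by_regex)   # hrefs of <a> tags whose id looks like i1, i2, ...
```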
