Four ways to get all links in the current page with Python, compared


黄舟

Released: 2017-08-20 10:28:38


This article shows how to get all the links in the current page with Python. It compares four commonly used approaches with example code, and also covers how to get the links inside an iframe. Readers who need this can refer to the examples below.

The four approaches to getting all links in the current page are shown below for reference:

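'''
Get all links in the current page
'''
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.testweb.com'
r = requests.get(url)
r.encoding = 'gb2312'  # encoding of the example site; adjust for other pages

# Method 1: re (crude and brute-force: matches any quoted href value in the raw text)
matches = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
for link in matches:
    print(link)
print()

# Method 2: BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a', href=True):  # href=True skips <a> tags that have no href
    print(a['href'])
print()

# Method 3: lxml.etree (XPath); //@href matches href on any element, not only <a>
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):
    print(link)
print()

# Method 4: selenium (has to open a browser!)
driver = webdriver.Firefox()
driver.get(url)
# find_elements_by_tag_name was removed in Selenium 4; use find_elements with By
for link in driver.find_elements(By.TAG_NAME, "a"):
    print(link.get_attribute("href"))
driver.quit()  # quit() ends the browser session; close() only closes the current window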
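
The four approaches do not necessarily return the same list. As a rough illustration that runs without network access, the sketch below applies the article's regular expression to a small made-up HTML string and compares the result with an anchor-only parse. The standard library's `html.parser` stands in for the tag-based BeautifulSoup and Selenium loops (an assumption, chosen so the example has no third-party dependencies); `SAMPLE_HTML`, `AnchorCollector`, `regex_links`, and `anchor_links` are hypothetical names, not part of the original code.

```python
import re
from html.parser import HTMLParser

# Made-up page used only for illustration.
SAMPLE_HTML = """
<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body>
<a href="http://www.testweb.com/news/">News</a>
<a href='/about.html'>About</a>
<a name="top">Anchor without href</a>
</body></html>
"""

# Same pattern as in the article: any double- or single-quoted href value.
HREF_RE = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags only, skipping anchors without one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def regex_links(html):
    """Return every quoted href value found in the raw text."""
    return HREF_RE.findall(html)


def anchor_links(html):
    """Return href values that belong to <a> tags."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    print(regex_links(SAMPLE_HTML))   # includes the stylesheet href
    print(anchor_links(SAMPLE_HTML))  # only hrefs from <a> tags
```

The regex picks up the `<link>` stylesheet as well as the two anchors, while the anchor-only parse returns just the two anchors. The XPath `//@href` behaves like the regex in this respect, since it selects the `href` attribute of every element; `//a/@href` would restrict it to anchors.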


Note: if the page contains an iframe, none of the four methods as written above will return the tags of the page loaded inside the iframe, because that page is a separate document. In that case:

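# Then open every iframe and look for all <a> tags inside it
from urllib.parse import urljoin

all_urls = set()  # site roots (scheme + host) found inside the iframes
for iframe in soup.find_all('iframe', src=True):
    url_ifr = urljoin(url, iframe['src'])  # src of the current iframe; may be relative to the page
    rr = requests.get(url_ifr)
    rr.encoding = 'gb2312'
    soup_ifr = BeautifulSoup(rr.text, 'lxml')
    for a in soup_ifr.find_all('a', href=True):
        link = a['href']
        # Keep only the scheme and host, e.g. http://www.testweb.com (https also accepted)
        m = re.match(r'https?://.*?(?=/)', link)
        # print(link)
        if m:
            all_urls.add(m.group(0))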
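
The loop above involves two URL-handling steps that are easy to get wrong: an iframe's `src` may be relative to the parent page, and the regular expression keeps only the scheme and host of each link. A minimal sketch of both steps using the standard library's `urllib.parse` follows. It is an alternative to the regex, not the article's original approach, and `resolve_src` and `site_root` are hypothetical helper names.

```python
from urllib.parse import urljoin, urlsplit


def resolve_src(page_url, src):
    """Turn a possibly relative iframe src into an absolute URL."""
    return urljoin(page_url, src)


def site_root(link):
    """Return 'scheme://host' for http(s) links, or None for anything else."""
    parts = urlsplit(link)
    if parts.scheme in ("http", "https") and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}"
    return None


if __name__ == "__main__":
    page = "http://www.testweb.com/index.html"
    print(resolve_src(page, "frames/menu.html"))
    print(site_root("http://www.testweb.com/news/1.html"))
    print(site_root("javascript:void(0)"))
```

Unlike the regular expression, which needs a `/` after the host in order to match, `site_root` also handles a bare link such as `http://www.testweb.com`, and it returns `None` for relative paths and `javascript:` links.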

