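A Brief Guide to Using BeautifulSoup and Selenium

Reviewing the requests library

I have not used requests for a long time, and since I am about to write a simple crawler, here is a quick refresher.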

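import requests

# Placeholder credentials; substitute your own
r = requests.get('https://api.github.com/user', auth=('user@example.com', 'your_password'))
print(r.status_code)  # status code, 200 on success
print(r.json())       # response body parsed as JSON
print(r.text)         # response body as text
print(r.headers)      # response headers
print(r.encoding)     # encoding, usually utf-8

# When the content to write is large, avoid exhausting memory by reading
# a fixed number of bytes, or one line, at a time.
# Read one line at a time; chunk_size=512 is the default
for chunk in r.iter_lines():
    print(chunk)
# Read one block of 512 bytes at a time
for chunk in r.iter_content(chunk_size=512):
    print(chunk)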

Note that iter_lines and iter_content both return bytes, so when writing to a file, whether text or an image, the file must be opened in wb mode.
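As a minimal sketch of the point about wb mode, the helper below writes an iterable of byte chunks to a file. The name save_stream and the fake chunk list are assumptions made for illustration; the list stands in for r.iter_content(chunk_size=512) so the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def save_stream(chunks, path):
    """Write an iterable of byte chunks to path in binary mode; return bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for r.iter_content(chunk_size=512); hypothetical data, no network needed
fake_chunks = [b'\x89PNG', b'', b'more bytes']
target = Path(tempfile.gettempdir()) / 'save_stream_demo.bin'
size = save_stream(fake_chunks, target)
print(size, target.read_bytes())
```

With a real response this would be called as save_stream(r.iter_content(chunk_size=512), 'out.bin'). Passing stream=True to requests.get is what lets the body be fetched piece by piece rather than all at once.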

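Using BeautifulSoup

Now to the main topic. I had heard of this well-known library long ago. Writing crawlers with regular expressions was not much trouble, but the matches were sometimes inaccurate. BeautifulSoup extracts data from HTML tags accurately. It is a bit slower, but it is simple and pleasant to use.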

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# One thing to note: the second argument names the parser and should always be
# given, otherwise a warning is raised. lxml is recommended.
soup = BeautifulSoup(html_doc, 'lxml')

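Continuing from the code above, here are some simple operations. Accessing a tag name as an attribute returns the first element that matches. It is shorthand for the find method.

soup.a
soup.find('a')

The two statements above are equivalent.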
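The equivalence can be checked directly. This is a small sketch on a made-up HTML fragment (not the html_doc above), using the html.parser backend that ships with Python so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
html = '<p class="title"><b>Title</b></p><a id="link1" href="/one">One</a><a id="link2" href="/two">Two</a>'
soup = BeautifulSoup(html, 'html.parser')

first_by_attr = soup.a          # attribute-style access
first_by_find = soup.find('a')  # explicit call to find
print(first_by_attr == first_by_find)
print(first_by_attr['id'])

# When no such tag exists, both forms give None instead of raising
missing = soup.table
print(missing)
```

Because a missing tag yields None, chained access such as soup.table.tr raises AttributeError, which is why the crawler later in this article wraps such lookups in try/except.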

# soup.body is a Tag object holding all the HTML inside the body tag
print(soup.body)

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

# Get all the text inside body, without the tags
print(soup.body.text)
# Equivalent to
soup.body.get_text()
# Or iterate over strings, a generator of all the text nodes
for string in soup.body.strings:
    print(string, end='')

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

# Get the text inside the tag
print(soup.title.string)

The Dormouse's story

# A Tag's get method returns an attribute's value by name; this gets the
# class attribute of the first p tag
print(soup.p.get('class'))
# Equivalent to
print(soup.p['class'])

['title']

# Show all attributes of the a tag, as a dictionary
print(soup.a.attrs)

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

# The tag's name
soup.title.name

title
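The two ways of reading an attribute differ when the attribute is absent. The sketch below uses a made-up one-tag document and the built-in html.parser to show this: get returns None, while subscript access raises KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical single-tag document for illustration
soup = BeautifulSoup(
    '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>',
    'html.parser',
)
link = soup.a

print(link.get('href'))   # attribute that exists
print(link.get('title'))  # absent attribute: get returns None
print(link['class'])      # class is multi-valued, so it comes back as a list

try:
    link['title']         # absent attribute: subscript access raises
    raised = False
except KeyError:
    raised = True
print(raised)
```

In a crawler, get is usually the safer choice when an attribute may be missing from some of the pages.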

find_all

The find_all and find methods are probably the ones used most. find_all finds every element that matches and returns a list; find returns the first item of that list. find_all has a limit parameter that caps the length of the list (the number of matches returned). With limit=1 the result is the same element find would return, except that find_all still wraps it in a list.

find_all(self, name=None, attrs={}, recursive=True, text=None,
         limit=None, **kwargs)

find_all also has a shorthand.

soup.find_all('a', id='link1')
soup('a', id='link1')

The two forms above are equivalent; the second is the shorthand.

name

name is the tag to search for; the first call below finds all p tags. Besides a string, it also accepts a regular expression, a list, a function, or True.

# Pass a string
soup.find_all('p')

# Pass a regular expression
import re
# Tag names that start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Tag names that contain t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# Pass a list to search for several tags at once
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing True places no restriction on the name, so every tag matches.

recursive

When you call a tag's find_all() method, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

# title is not a direct child of html, but all descendants are searched
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

# With recursive=False, only direct children are searched
soup.html.find_all("title", recursive=False)
# []

# title is a direct child of head, so the parameter makes no difference here
a = soup.head.find_all("title", recursive=False)
# [<title>The Dormouse's story</title>]

keyword and attrs

Use keyword arguments to add one or more conditions and narrow the search.

# Find all a tags whose id is link1
soup.find_all('a', id='link1')

To search by class, note that class is a reserved word in Python. Use class_ instead, pass the class name as the second positional argument without a keyword, or pass a dictionary through attrs.

soup.find_all('p', class_='story')
soup.find_all('p', 'story')
soup.find_all('p', attrs={"class": "story"})

The three forms above are equivalent. class_ accepts a string, a regular expression, a function, or True.

text

text searches by text value; the string parameter appears to give the same result.

a = soup.find_all(text='Elsie')
# Or, with version 4.4 and later, use string
a = soup.find_all(string='Elsie')

The text parameter also accepts a string, a regular expression, True, or a list.

CSS selectors

CSS selectors work too. Just use the select method, which always returns a list.

Here are a few common operations.

# All div tags
soup.select('div')
# All elements whose id is username
soup.select('#username')
# All elements whose class is story
soup.select('.story')
# All span elements inside a div, possibly with other elements in between
soup.select('div span')
# All span elements directly inside a div, with nothing in between
soup.select('div > span')
# All input tags that have an id attribute, whatever its value
soup.select('input[id]')
# All input tags whose id attribute has the value user
soup.select('input[id="user"]')
# Several selectors at once: elements whose id is link1 or link2 both match
soup.select("#link1, #link2")

A small crawler example

The sections above covered the basics of requests and beautifulsoup4, which is already enough to write some simple crawlers. Let's try.

This example comes from Automate the Boring Stuff with Python by Al Sweigart (US).

The crawler downloads images in bulk from the XKCD comic site, and you can specify how many pages to download.

import os
import requests
from bs4 import BeautifulSoup

# exist_ok=True: no error if the folder already exists
os.makedirs('xkcd', exist_ok=True)

url = 'https://xkcd.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def save_img(img_url, limit=1):
    r = requests.get(img_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    try:
        img = 'https:' + soup.find('div', id='comic').img.get('src')
    except AttributeError:
        print('Image Not Found')
    else:
        print('Downloading', img)
        response = requests.get(img, headers=headers)
        with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024*1024):
                f.write(chunk)

    # Count down by one for each page processed
    limit -= 1
    # Find the URL of the previous comic
    if limit > 0:
        try:
            prev = 'https://xkcd.com' + soup.find('a', rel='prev').get('href')
        except AttributeError:
            print('Link Not Exist')
        else:
            save_img(prev, limit)


if __name__ == '__main__':
    save_img(url, limit=20)
    print('Done!')

Let's look at the result.

Downloading
Downloading
Downloading
Downloading
Downloading
Downloading
Downloading
Downloading
Downloading
...
Done!

Multithreaded download

A single thread is rather slow, so we can use several threads. While fetching prev we saw that the page URLs follow a regular pattern: https://xkcd.com/ followed by a number, with only the final number changing. That makes it easy to iterate over them with range.

import os
import threading
import requests
from bs4 import BeautifulSoup

os.makedirs('xkcd', exist_ok=True)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def download_imgs(start, end):
    for url_num in range(start, end):
        img_url = 'https://xkcd.com/' + str(url_num)
        r = requests.get(img_url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        try:
            img = 'https:' + soup.find('div', id='comic').img.get('src')
        except AttributeError:
            print('Image Not Found')
        else:
            print('Downloading', img)
            response = requests.get(img, headers=headers)
            with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)


if __name__ == '__main__':
    # Download comics 1 to 30, 10 per thread
    threads = []
    for i in range(1, 30, 10):
        thread_obj = threading.Thread(target=download_imgs, args=(i, i + 10))
        threads.append(thread_obj)
        thread_obj.start()

    # Block until every thread has finished
    for thread in threads:
        thread.join()

    # Printed only after all threads have finished downloading
    print('Done!')
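The way the page numbers are divided among threads can be tried out without touching the network. The sketch below is only an illustration: split_range and fake_download are hypothetical names, and fake_download records page numbers instead of fetching anything. Unlike a bare i + 10, the helper clamps the last block, so the split stays correct when the total is not a multiple of the block size.

```python
import threading


def split_range(start, stop, step):
    """Return (begin, end) pairs that cover [start, stop) in blocks of at most step."""
    return [(i, min(i + step, stop)) for i in range(start, stop, step)]


results = []
lock = threading.Lock()


def fake_download(begin, end):
    # Stand-in for the real per-page download: just record the page numbers
    for page in range(begin, end):
        with lock:
            results.append(page)


blocks = split_range(1, 31, 10)  # pages 1..30, 10 per thread
threads = [threading.Thread(target=fake_download, args=block) for block in blocks]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker before reading results

print(blocks)
print(sorted(results) == list(range(1, 31)))
```

Swapping fake_download for a real download function with the same (begin, end) signature would give one thread per block.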


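A first look at Selenium

Selenium is used for automated testing. Before using it you need to download a browser driver; I only downloaded the ones for Firefox and Chrome, and a quick web search will turn them up. Next, copy the downloaded file into the browser's installation directory. For Firefox, put the driver in C:\Program Files (x86)\Mozilla Firefox and add that path to the PATH environment variable. Likewise, put the Chrome driver in C:\Program Files (x86)\Google\Chrome\Application and add that path as well. Finally, restart your IDE and you are ready to go.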

Simulating a Baidu search

The example below opens Chrome, visits the Baidu home page, types The Zen of Python, and then clicks the 百度一下 (search) button; pressing Enter works as well. Keys holds the keys that cannot be written as strings, such as the arrow keys, Tab, Enter, Esc, F1 through F12, and Backspace. The script then waits 3 seconds, navigates to the Zhihu home page, goes back to Baidu, and finally quits (closes) the browser.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Chrome()
# Open the Baidu home page in Chrome
browser.get('https://www.baidu.com/')
# Find the input box
input_area = browser.find_element_by_id('kw')
# Type into it
input_area.send_keys('The Zen of Python')
# Find the "百度一下" button
search = browser.find_element_by_id('su')
# Click it
search.click()
# Or press Enter instead
# input_area.send_keys('The Zen of Python', Keys.ENTER)
time.sleep(3)
browser.get('https://www.zhihu.com/')
time.sleep(2)
# Go back to the Baidu search
browser.back()
time.sleep(2)
# Quit the browser
browser.quit()


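send_keys simulates typing. An element's clear() method clears the input. Some other methods that simulate clicking the browser's own buttons are listed below.

browser.back()     # Back button
browser.forward()  # Forward button
browser.refresh()  # Refresh button
browser.close()    # Close the current window
browser.quit()     # Quit the browser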

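Lookup methods

The commonly used methods for finding an element are listed below.

Method	WebElement returned
find_element_by_id(id)	Element whose id attribute matches
find_element_by_name(name)	Element whose name attribute matches
find_element_by_class_name(name)	Element whose CSS class matches
find_element_by_tag_name(tag)	Element with the given tag name, such as div
find_element_by_css_selector(selector)	Element matching the CSS selector
find_element_by_xpath(xpath)	Element matching the XPath
find_element_by_link_text(text)	a tag whose text exactly matches the given text
find_element_by_partial_link_text(text)	a tag whose text contains the given text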

Logging in to CSDN

The code below simulates entering an account name and password and clicking the login button. The whole process is quite fast.

browser = webdriver.Chrome()
browser.get('https://passport.csdn.net/account/login')
# Placeholder account; substitute your own
browser.find_element_by_id('username').send_keys('user@example.com')
browser.find_element_by_id('password').send_keys('**********')
browser.find_element_by_class_name('logging').click()
