Simple use of BeautifulSoup and Selenium

Review of requests library

I haven't used requests for a long time, and since I will be writing a simple crawler later, here is a quick, casual review.

import requests

r = requests.get('https://api.github.com/user', auth=('haiyu19931121@163.com', '**********'))
print(r.status_code)  # status code, 200 on success
print(r.json())       # body parsed as JSON
print(r.text)         # body as text
print(r.headers)      # response headers
print(r.encoding)     # encoding, usually utf-8

# When the content is large, avoid exhausting memory by handling
# a fixed number of bytes or a single line at a time.
# Read one line at a time; chunk_size=512 is the default
for chunk in r.iter_lines():
    print(chunk)

# Read one block at a time, 512 bytes each
for chunk in r.iter_content(chunk_size=512):
    print(chunk)

Note that iter_lines and iter_content return bytes. To write them to a file, whether the content is text or an image, open the file in wb mode.
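As a rough sketch of how this looks for a download: the helper names write_chunks and download are my own, not part of requests, and the chunk size is an arbitrary choice. Note that the body is only fetched lazily when the request is made with stream=True; without it, requests downloads the whole body before iter_content runs.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of bytes objects to `path`; return the byte count."""
    total = 0
    with open(path, 'wb') as f:  # bytes require binary mode
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total


def download(url, path, chunk_size=8192):
    """Stream the response body at `url` into the file at `path`."""
    # stream=True defers fetching the body until it is iterated
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        return write_chunks(r.iter_content(chunk_size=chunk_size), path)
```

Calling download(some_url, 'page.html') would then return the number of bytes written; write_chunks is kept separate so it can be tried without a network connection.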

Using Beautifulsoup

Now to the main topic. I had heard of this well-known library for a long time. Writing crawlers with regular expressions was never much trouble, but the matches were sometimes inaccurate. Beautiful Soup extracts data from HTML tags accurately. It is a bit slow, but simple and easy to use.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# One thing to note: the second argument specifies the parser.
# Always pass it, otherwise there will be a warning. lxml is recommended.
soup = BeautifulSoup(html_doc, 'lxml')

Continuing from the code above, here are some simple operations. Accessing a tag name as an attribute returns the first element that matches. It is shorthand for the find method.

soup.p
soup.find('p')

The two statements above are equivalent.
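The shorthand is easy to check on a tiny document. This is only a small sketch: the HTML string is made up for illustration, and I use the built-in html.parser here so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

html = '<div><p id="first">one</p><p id="second">two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Attribute access and find() return the very same first match
print(soup.p is soup.find('p'))   # True
print(soup.p['id'])               # first

# Both give None when nothing matches
print(soup.table)                 # None
print(soup.find('table'))         # None
```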

# soup.body is a Tag object: all the HTML inside the body tag
print(soup.body)

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

# Get all the text inside body, without tags
print(soup.body.text)
# Equivalent to the following
soup.body.get_text()
# It can also be written like this; strings is a generator over all the text
for string in soup.body.strings:
    print(string, end='')

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

# Get the text inside this tag.
print(soup.title.string)

The Dormouse's story

# A Tag's get method returns the value of an attribute by name.
# This gets the value of the class attribute of the first p tag
print(soup.p.get('class'))
# Equivalent to the following
print(soup.p['class'])

['title']

# Show all attributes of the a tag, given as a dictionary
print(soup.a.attrs)

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

# The name of the tag
print(soup.title.name)

title

find_all

The most commonly used methods are undoubtedly find_all and find. The former finds every element that meets the conditions and returns a list; the latter returns the first element of that list. find_all has a limit parameter that caps the length of the list (the number of matches returned). With limit=1 it behaves like find, except that the result is still a list.

find_all also has a shorthand.

soup.find_all('a', id='link1')
soup('a', id='link1')

The two forms above are equivalent; the second is the shorthand.
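To make the relationship between find, find_all, limit and the call shorthand concrete, here is a minimal sketch on a made-up list. The HTML and variable names are only for illustration, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

all_items = soup.find_all('li')
first_two = soup.find_all('li', limit=2)
only_one = soup.find_all('li', limit=1)   # still a list, with one element
first = soup.find('li')                   # a single Tag, or None

print(len(all_items), len(first_two))     # 3 2
print(only_one[0] is first)               # True
print(soup('li') == all_items)            # True: calling soup is find_all
```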

find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs)

name

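name is the tag you want to search for. The first example below finds all p tags. Besides a string, you can also pass a regular expression, a list, a function, or True.

# Pass a string
soup.find_all('p')

# Pass a regular expression
import re

# The tag name must start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# The tag name only has to contain t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# Pass a list to search for several tags at once
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]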

If you pass True, there is no restriction on the tag name and every tag is matched.
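The True and function forms are the ones not shown so far, so here is a small sketch of both. The HTML and the has_no_attrs function are my own illustration, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p class="x">one</p><p>two</p><span>three</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# True matches every tag (text nodes are not tags, so they are skipped)
print([tag.name for tag in soup.find_all(True)])
# ['div', 'p', 'p', 'span']


def has_no_attrs(tag):
    """Match tags that carry no attributes at all."""
    return not tag.attrs


# A function is called with each tag and keeps those it returns True for
print([tag.name for tag in soup.find_all(has_no_attrs)])
# ['p', 'span']
```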

recursive

When you call a tag's find_all() method, Beautiful Soup searches all descendants of that tag. If you only want to search the tag's direct children, use the parameter recursive=False.

# title is not a direct child of html, but all descendants are searched
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

# With the parameter set to False, only direct children are searched
soup.html.find_all("title", recursive=False)
# []

# title is a direct child of head, so the parameter makes no difference here
a = soup.head.find_all("title", recursive=False)
# [<title>The Dormouse's story</title>]

keyword and attrs

Use keyword arguments to add one or more conditions and narrow the search.

# Find all a tags whose id is link1
soup.find_all('a', id='link1')

class is a reserved word in Python, so it cannot be used as a keyword argument when searching by class. Use class_ instead, pass the class as the second positional argument, or use attrs with a dictionary.

soup.find_all('p', class_='story')
soup.find_all('p', 'story')
soup.find_all('p', attrs={"class": "story"})

The three forms above are equivalent. class_ accepts strings, regular expressions, functions, and True.
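A short sketch of a few of those class_ forms, on made-up HTML and with html.parser so that lxml is not required. One detail worth seeing in action is that an element with several classes matches if any one of them matches.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="story intro">a</p><p class="story">b</p>'
        '<p class="title">c</p><p>d</p>')
soup = BeautifulSoup(html, 'html.parser')

# A string matches any one of the element's classes
print(len(soup.find_all('p', class_='story')))            # 2
# A regular expression is tried against each class
print(len(soup.find_all('p', class_=re.compile('^t'))))   # 1
# True matches every element that has a class attribute
print(len(soup.find_all('p', class_=True)))               # 3
```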

text

Searches by text content. The string parameter gives the same result; it is the newer name for text, added in Beautiful Soup 4.4.0.

a = soup.find_all(text='Elsie')
# Or, on version 4.4 and later, use string
a = soup.find_all(string='Elsie')

The text parameter also accepts strings, regular expressions, True, and lists.
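Here is a small sketch of the list and regular expression forms. It assumes Beautiful Soup 4.4 or later (for the string spelling), uses made-up HTML, and uses html.parser so that lxml is not required. Notice that the result changes from text nodes to tags once a tag name is given.

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/e">Elsie</a><a href="/l">Lacie</a><a href="/t">Tillie</a>'
soup = BeautifulSoup(html, 'html.parser')

# Without a tag name, the matches are the text nodes themselves
print(soup.find_all(string=['Elsie', 'Tillie']))   # ['Elsie', 'Tillie']
print(soup.find_all(string=re.compile('ie$')))     # ['Elsie', 'Lacie', 'Tillie']

# With a tag name, the matches are tags whose .string matches
print(soup.find_all('a', string='Lacie'))          # [<a href="/l">Lacie</a>]
```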

CSS Selector

You can also use CSS selectors through the select method, which always returns a list.

Here are several common operations.

# All div tags
soup.select('div')
# All elements whose class is username
soup.select('.username')
# The element whose id is story
soup.select('#story')
# All span elements inside a div, at any depth
soup.select('div span')
# All span elements that are direct children of a div
soup.select('div > span')
# All input tags that have an id attribute, whatever its value
soup.select('input[id]')
# All input tags whose id attribute is user
soup.select('input[id="user"]')
# Several selectors at once: elements whose id is link1 or link2 both match
soup.select("#link1, #link2")
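The sample document has no div or input, so here is a sketch that runs the same kinds of selectors against a made-up login form. The HTML is purely illustrative, html.parser is used so that lxml is not required, and select_one (which returns the first match instead of a list) is an extra I added.

```python
from bs4 import BeautifulSoup

html = '''
<div id="login">
  <form><span class="hint">name</span><input id="user" name="u"></form>
  <span class="hint">outer</span>
  <input name="no-id">
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('div span')))      # 2: spans at any depth below div
print(len(soup.select('div > span')))    # 1: direct children only
print(len(soup.select('.hint')))         # 2: class selector
print(len(soup.select('#login')))        # 1: id selector
print(len(soup.select('input[id]')))     # 1: only one input has an id
print(soup.select_one('input[id="user"]')['name'])   # u
```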

A small crawler example

The sections above covered the basic usage of requests and beautifulsoup4. That is already enough to write a simple crawler, so let's try one.

This example comes from "Automate the Boring Stuff with Python" by Al Sweigart.

This crawler downloads comics from the XKCD site in bulk. You can specify how many pages to download.

import os
import requests
from bs4 import BeautifulSoup

# With exist_ok=True, no error is raised if the folder already exists
os.makedirs('xkcd', exist_ok=True)
url = 'https://xkcd.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def save_img(img_url, limit=1):
    r = requests.get(img_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    try:
        img = 'https:' + soup.find('div', id='comic').img.get('src')
    except AttributeError:
        print('Image Not Found')
    else:
        print('Downloading', img)
        response = requests.get(img, headers=headers)
        with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024*1024):
                f.write(chunk)

    # Decrement by 1 after each page
    limit -= 1
    # Find the URL of the previous comic
    if limit > 0:
        try:
            prev = 'https://xkcd.com' + soup.find('a', rel='prev').get('href')
        except AttributeError:
            print('Link Not Exist')
        else:
            save_img(prev, limit)


if __name__ == '__main__':
    save_img(url, limit=20)
    print('Done!')

Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
...
Done!
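The parsing step of the crawler can be tried offline. Below is a sketch under a few assumptions of mine: the SAMPLE fragment is hypothetical and only shaped like what the selectors above expect, parse_page is a name I made up, and urllib.parse.urljoin replaces the string concatenation so that both scheme-relative (//host/path) and root-relative (/path) links are resolved against the page URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page fragment, shaped like what the crawler above looks for
SAMPLE = '''
<div id="comic"><img src="//imgs.example.com/comics/sample.png"></div>
<a rel="prev" href="/41/">Prev</a>
'''


def parse_page(html, base_url):
    """Return (image_url, prev_url); either is None when not found."""
    soup = BeautifulSoup(html, 'html.parser')
    comic = soup.find('div', id='comic')
    img = comic.find('img', src=True) if comic else None
    prev = soup.find('a', rel='prev', href=True)
    img_url = urljoin(base_url, img['src']) if img else None
    prev_url = urljoin(base_url, prev['href']) if prev else None
    return img_url, prev_url


print(parse_page(SAMPLE, 'https://example.com/42/'))
# ('https://imgs.example.com/comics/sample.png', 'https://example.com/41/')
print(parse_page('<p>nothing here</p>', 'https://example.com/42/'))
# (None, None)
```

Returning None instead of relying on AttributeError keeps the "not found" case explicit, which is a matter of taste rather than a correction.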

Multi-threaded download

The single-threaded version is a bit slow, so we can use multiple threads. While fetching prev we saw that the URLs of the pages are very regular: each one is https://xkcd.com/ followed by a number, and only that final number differs, so we can simply iterate over them with range.

import os
import threading
import requests
from bs4 import BeautifulSoup

os.makedirs('xkcd', exist_ok=True)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def download_imgs(start, end):
    for url_num in range(start, end):
        img_url = 'https://xkcd.com/' + str(url_num)
        r = requests.get(img_url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        try:
            img = 'https:' + soup.find('div', id='comic').img.get('src')
        except AttributeError:
            print('Image Not Found')
        else:
            print('Downloading', img)
            response = requests.get(img, headers=headers)
            with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)


if __name__ == '__main__':
    # Download comics 1 through 30, 10 per thread
    threads = []
    for i in range(1, 30, 10):
        thread_obj = threading.Thread(target=download_imgs, args=(i, i + 10))
        threads.append(thread_obj)
        thread_obj.start()

    # Block until every thread has finished
    for thread in threads:
        thread.join()

    # Printed only after all threads have finished downloading
    print('Done!')

Let's take a look at the result.

[Image: output of the multi-threaded download]
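The manual thread bookkeeping could also be handed to concurrent.futures from the standard library. This is only a sketch of that alternative, not the approach used above: split_range is a hypothetical helper of mine, and fake_download is a stand-in for download_imgs that makes no network requests and just reports which pages it would fetch.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, step):
    """Yield (lo, hi) half-open sub-ranges that together cover range(start, stop)."""
    for lo in range(start, stop, step):
        yield lo, min(lo + step, stop)


def fake_download(bounds):
    """Stand-in for download_imgs: return the page numbers it would fetch."""
    lo, hi = bounds
    return list(range(lo, hi))


with ThreadPoolExecutor(max_workers=3) as pool:
    # map() keeps the input order; leaving the with-block waits for all workers
    results = list(pool.map(fake_download, split_range(1, 31, 10)))

print([(pages[0], pages[-1]) for pages in results])
# [(1, 10), (11, 20), (21, 30)]
```

The with-block plays the role of the start/join loops, and min() in split_range keeps the last batch from running past the end when the total is not a multiple of the batch size.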

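A first look at Selenium

Selenium is used for automated testing. Before using it you need to download a browser driver; I only downloaded the ones for Firefox and Chrome. They are easy to find with a quick web search. Next, copy the downloaded file into the browser's installation directory. For Firefox, for example, put the driver in C:\Program Files (x86)\Mozilla Firefox and add this path to the PATH environment variable. Likewise, put the Chrome driver in C:\Program Files (x86)\Google\Chrome\Application and add that path to the environment variable. Finally, restart your IDE and start using it. (Since Selenium 4.6, the bundled Selenium Manager can usually find or download a matching driver automatically, so this manual step is often unnecessary.)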

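Simulating a Baidu search

The following example opens the Chrome browser, visits the Baidu home page, types The Zen of Python into the search box, and then clicks the "百度一下" (search) button; pressing Enter works just as well. Keys contains the keys that cannot be represented as strings, such as the arrow keys, Tab, Enter, Esc, F1 to F12, and Backspace. The script then waits 3 seconds, navigates to the Zhihu home page, goes back to Baidu, and finally quits (closes) the browser.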

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Chrome()
# Open the Baidu home page in Chrome
browser.get('https://www.baidu.com/')
# Find the input box
# (find_element_by_id was removed in Selenium 4.3; there, use find_element(By.ID, 'kw'))
input_area = browser.find_element_by_id('kw')
# Type into the box
input_area.send_keys('The Zen of Python')
# Find the "百度一下" button
search = browser.find_element_by_id('su')
# Click it
search.click()
# Or press Enter instead
# input_area.send_keys('The Zen of Python', Keys.ENTER)
time.sleep(3)
browser.get('https://www.zhihu.com/')
time.sleep(2)
# Go back to the Baidu search
browser.back()
time.sleep(2)
# Quit the browser
browser.quit()

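send_keys simulates typing. An element's clear() method clears the input. Some other methods that simulate clicking the browser's buttons are listed below.

browser.back()     # Back button
browser.forward()  # Forward button
browser.refresh()  # Refresh button
browser.close()    # Close the current window
browser.quit()     # Quit the browser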

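Lookup methods

Commonly used methods for finding an Element are listed below. These find_element_by_* helpers were removed in Selenium 4.3; newer versions use find_element(By.ID, value) and the other By constants instead.

Method	WebElement returned
find_element_by_id(id)	The element whose id attribute matches
find_element_by_name(name)	The element whose name attribute matches
find_element_by_class_name(name)	The element whose CSS class matches
find_element_by_tag_name(tag)	The element with the given tag name, such as div
find_element_by_css_selector(selector)	The element matching the CSS selector
find_element_by_xpath(xpath)	The element matching the XPath expression
find_element_by_link_text(text)	The a tag whose text exactly matches the given text
find_element_by_partial_link_text(text)	The a tag whose text contains the given text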

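Logging in to CSDN

The following code simulates entering the account and password and clicking the login button. The whole process is quite fast.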

browser = webdriver.Chrome()
browser.get(&#39;https://passport.csdn.net/account/login&#39;)
browser.find_element_by_id(&#39;username&#39;).send_keys(&#39;haiyu19931121@163.com&#39;)
browser.find_element_by_id(&#39;password&#39;).send_keys(&#39;**********&#39;)
browser.find_element_by_class_name('logging').click()

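Most of the above is a listing of APIs. Some of it is my own understanding, and some is taken directly from the official documentation.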

by @sunhaiyu

2017.7.13
