Installing and Introducing the Python BeautifulSoup Library

I. Introduction

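Previous articles showed how to use Python to parse page source code in order to crawl blogs, Wikipedia InfoBoxes, and images. Those articles are:
[Python Learning] Easily crawling the Wikipedia programming language infobox
[Python Learning] A simple web crawler for blog articles and ideas
[Python Learning] Easily crawling the images in an image site's gallery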

The core code is as follows:

# coding=utf-8
# Note: this is Python 2 code (urllib.urlopen, print statement, unicode)
import urllib
import re

# Download the static HTML page
url='http://www.csdn.net/'
content = urllib.urlopen(url).read()
open('csdn.html','w+').write(content)
# Get the title
title_pat=r'(?<=<title>).*?(?=</title>)'
title_ex=re.compile(title_pat,re.M|re.S)
title_obj=re.search(title_ex, content)
title=title_obj.group()
print title
# Get the hyperlink text
href = r'<a href=.*?>(.*?)</a>'
m = re.findall(href,content,re.S|re.M)
for text in m:
    print unicode(text,'utf-8')
    break # print only the first link text

The output is as follows:

>>>
CSDN.NET - 全球最大中文IT社区，为IT专业技术人员提供最全面的信息传播和服务平台
登录
>>>

The core code for downloading an image is as follows:

import os
import urllib
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"
urllib._urlopener = AppURLopener()
url = "http://creatim.allyes.com.cn/imedia/csdn/20150228/15_41_49_5B9C9E6A.jpg"
filename = os.path.basename(url)
urllib.urlretrieve(url , filename)

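However, the approach above of analyzing raw HTML to crawl site content has many drawbacks, for example:
1. The regular expressions are tied to the HTML source code rather than to a more abstract structure, so a small change in the page structure can break the program.

2. The program has to analyze content based on the actual HTML source, so it may run into HTML features such as the character entity &amp;, and it has to handle different kinds of content explicitly, such as icon hyperlinks, subscripts, and so on.

3. Regular expressions are not very readable, and with more complex HTML and query expressions they become messy.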
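
The first two drawbacks are easy to reproduce. The sketch below is written for Python 3 using only the standard library (an assumption; the crawler code above is Python 2). It reuses the link pattern from the crawler on two hypothetical HTML snippets that differ only in attribute order, then decodes a character entity with html.unescape.

```python
import html
import re

# The link pattern from the crawler above: it assumes href is the first attribute.
LINK_PATTERN = re.compile(r'<a href=.*?>(.*?)</a>', re.S | re.M)

# Hypothetical snippets, made up for this illustration.
original = '<a href="/login">Sign in &amp; continue</a>'
# The same link after a cosmetic change: a class attribute now precedes href.
changed = '<a class="btn" href="/login">Sign in &amp; continue</a>'

print(LINK_PATTERN.findall(original))  # ['Sign in &amp; continue']
print(LINK_PATTERN.findall(changed))   # []

# The regex returns the entity undecoded, so a separate decoding step is needed.
raw_text = LINK_PATTERN.findall(original)[0]
link_text = html.unescape(raw_text)
print(link_text)  # Sign in & continue
```

The pattern silently returns nothing for the second snippet, which is the kind of breakage a parser-based approach is meant to avoid.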

The book "Python Basics Tutorial (2nd Edition)" offers two solutions: the first uses the Tidy program (through a Python library) together with XHTML parsing; the second uses the BeautifulSoup library.

II. Installing and Introducing the Beautiful Soup Library

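Beautiful Soup is an HTML/XML parser written in Python. It handles irregular markup gracefully and generates a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying that tree, which can save a great deal of programming time. As the book puts it: "You did not write those bad web pages; you are only trying to get some data out of them. Now you no longer need to care what the HTML looks like, because the parser takes care of that for you."
After downloading Beautiful Soup, install it by running: python setup.py install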

The following briefly illustrates how to use BeautifulSoup with the "Alice in Wonderland" example from the official documentation:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object and print it in standard indented format
soup = BeautifulSoup(html_doc)
print(soup.prettify())

The output is printed in a standard indented structure, as follows:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Below is a quick introduction to the BeautifulSoup library (reference: the official documentation):

# Get the title
print soup.title
# <title>The Dormouse's story</title>
print soup.title.name
# title
print unicode(soup.title.string)
# The Dormouse's story

# Get the first <p> tag and the first <a> tag
print soup.p
# <p class="title"><b>The Dormouse's story</b></p>
print soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Find all <a> tags (links) in the document
print soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
print soup.find(id='link3')
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

To get all the text content in the document, the code is as follows:

# Get all the text content from the document
print soup.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

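Along the way you may also run into two typical error messages:
1. ImportError: No module named BeautifulSoup
After successfully installing the BeautifulSoup 4 library, "from BeautifulSoup import BeautifulSoup" may raise this error.

The reason is that the BeautifulSoup 4 package is named bs4, so it must be imported with "from bs4 import BeautifulSoup".
2. TypeError: an integer is required
You may hit this error when using "print soup.title.string" to get the title's value.

This appears to be a bug in IDLE; running the same code from the command line produces no error (reference: stackoverflow). The problem can also be worked around with the following code:
print unicode(soup.title.string)
print str(soup.title.string)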

III. Introduction to Common Beautiful Soup Methods

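Beautiful Soup transforms a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
1. Tag
A Tag object corresponds to a tag in the original XML or HTML document. It has many methods and attributes, the most important of which are its name and its attributes. Usage is as follows: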

#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="start"><b>The Dormouse's story</b></p>
"""
soup = BeautifulSoup(html)
tag = soup.p
print tag
# <p class="title" id="start"><b>The Dormouse's story</b></p>
print type(tag)
# <class 'bs4.element.Tag'>
print tag.name
# p (the tag name)
print tag['class']
# [u'title']
print tag.attrs
# {u'class': [u'title'], u'id': u'start'}

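With BeautifulSoup, every tag has a name, accessible through .name. A tag may also have any number of attributes; they are handled the same way as a dictionary, and the whole set can be read directly through .attrs. For modifying and deleting attributes, see the documentation.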
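
Since attributes behave like dictionary entries, modifying and deleting them can be sketched as follows. This is a minimal illustration, assuming Python 3 with the bs4 package and the built-in html.parser parser (the article's own code is Python 2); the data-note attribute is a made-up name for the example.

```python
from bs4 import BeautifulSoup

markup = '<p class="title" id="start"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.p

tag["id"] = "intro"        # modify an existing attribute
tag["data-note"] = "demo"  # add a new attribute (hypothetical name)
del tag["class"]           # delete an attribute

print(tag.get("class"))    # None, because the attribute no longer exists
print(tag.attrs)           # {'id': 'intro', 'data-note': 'demo'}
```

Using tag.get() rather than tag["class"] avoids a KeyError when an attribute is missing, just as with an ordinary dictionary.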
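2. NavigableString
Strings are often contained inside tags, and Beautiful Soup uses the NavigableString class to wrap them. A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in navigating the tree and searching the tree. A NavigableString can be converted directly to a Unicode string with unicode().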

print unicode(tag.string)
# The Dormouse's story
print type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
print tag
# <p class="title" id="start"><b>No longer bold</b></p>

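Here tag = soup.p refers to "<p class="title" id="start"><b>The Dormouse's story</b></p>", and tag.string is the text it contains. The string contained in a tag cannot be edited in place, but it can be replaced with replace_with().
NavigableString objects support most, but not all, of the attributes defined in navigating the tree and searching the tree. In particular, because a string cannot contain anything else (whereas a tag can contain strings or other tags), strings do not support the .contents or .string attributes or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, call unicode() on it to convert it to an ordinary Unicode string. Otherwise the string keeps a reference to the entire Beautiful Soup parse tree even after you have finished using Beautiful Soup, which wastes memory.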
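
The conversion described above can be sketched in Python 3, where str() plays the role of Python 2's unicode(). This is an illustration only, assuming the bs4 package and the built-in html.parser parser; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")
text_node = soup.b.string

print(isinstance(text_node, NavigableString))  # True
print(hasattr(text_node, "contents"))          # False: a string has no children

# Convert to a plain string before keeping it around outside Beautiful Soup.
plain = str(text_node)
print(type(plain) is str)  # True
print(plain)               # The Dormouse's story
```

After the conversion, plain is an ordinary string that holds no link back to the parse tree.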

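3. BeautifulSoup
The BeautifulSoup object represents the document as a whole. For most purposes you can treat it as a Tag object; it supports most of the methods described in navigating the tree and searching the tree.
Note: because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is still sometimes useful to look at its .name, so it has been given the special .name "[document]", available as soup.name.
Beautiful Soup also defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString that add something extra to the string.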
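
Both points can be checked with a short sketch, assuming Python 3 with the bs4 package and the built-in html.parser parser. The markup is a made-up minimal document, and Doctype is imported from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype, NavigableString

markup = "<!DOCTYPE html><html><body><p>Hi</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# The BeautifulSoup object carries the special name "[document]".
print(soup.name)  # [document]

# The doctype is stored as a Doctype node, a NavigableString subclass.
doctype_nodes = [node for node in soup.contents if isinstance(node, Doctype)]
print(len(doctype_nodes))                             # 1
print(isinstance(doctype_nodes[0], NavigableString))  # True
```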
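4. Comment
Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but a few special objects are left over, and the one you are most likely to need to worry about is the comment. The Comment object is a special type of NavigableString.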

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
print type(comment)
# <class 'bs4.element.Comment'>
print unicode(comment)
# Hey, buddy. Want to buy a used parser?
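
Because Comment is a NavigableString subclass, an isinstance check is one way to separate comments from ordinary text. The sketch below assumes Python 3 with the bs4 package and the built-in html.parser parser, and the markup is a made-up snippet.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
print(comment_texts)  # [' hidden note ']

# Keep only the direct children of <p> that are not comments.
visible = [str(node) for node in soup.p.contents if not isinstance(node, Comment)]
print(visible)        # ['Visible text']
```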

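Having introduced these four objects, the following briefly covers navigating the tree, searching the tree, and some commonly used functions.
5. Navigating the tree
A Tag may contain multiple strings or other Tags; these are the Tag's children. BeautifulSoup provides many attributes for navigating and iterating over a tag's children. Using the Alice example from the official documentation:
The simplest way to navigate the document is to give the name of the tag you want, as follows: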

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>

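Note: accessing a tag name as an attribute only returns the first tag with that name. The access can be chained through the tree; for example, soup.body.b gets the first <b> tag beneath the <body> tag.
To get all the <a> tags, use find_all(). The earlier articles on crawling Wikipedia and other HTML with Python often used it together with regular expressions.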

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

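Child nodes: when analyzing HTML you often need to examine a tag's children. A tag's .contents attribute returns its children as a list. Strings have no .contents attribute, because a string cannot have children.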

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]

A tag's .children generator lets you loop over its children:

for child in title_tag.children:
    print(child)
    # The Dormouse's story

Descendants: similarly, the .descendants attribute lets you iterate recursively over all of a tag's descendants:

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
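
The difference between .children and .descendants shows up when the two are counted side by side. This is a small sketch assuming Python 3 with the bs4 package and the built-in html.parser parser, using only the <head> part of the Alice example.

```python
from bs4 import BeautifulSoup

markup = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(markup, "html.parser")
head_tag = soup.head

children = list(head_tag.children)
descendants = list(head_tag.descendants)

print(len(children))     # 1: only the <title> tag
print(len(descendants))  # 2: the <title> tag and the string inside it
```

The <head> tag has a single direct child, while the string inside <title> is reached only through the recursive walk.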
