html - Extracting the content of a tag with Python

Question

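I scraped a web page, part of which looks like this: I used the following code: {code...} But it only outputs: 奥迪阿萨德. Nothing after the first item is output. How can I fix this?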

黄舟 · Answer

lxml's element.text returns only the text that precedes the element's first child, which is why everything after the first item is missing. A small getText helper that walks every text node solves this:

# require lxml
# version: python2
def getText(elem):
    rc = []
    for node in elem.itertext():
        rc.append(node.strip())
    return ''.join(rc)

Here is a complete script; you only need to adjust the last line for your case:

#coding=utf-8
import codecs
from lxml import etree

def getText(elem):
    rc = []
    for node in elem.itertext():
        rc.append(node.strip())
    return ''.join(rc)

f=codecs.open("1.html","r","utf-8")
content=f.read()
f.close()
tree=etree.HTML(content)
# xpath returns lxml.etree._Element objects, which can be passed straight to getText.
node=tree.xpath("//p[@class='content']")[0]
print getText(node).encode('gbk')

This getText is only a simple implementation: it strips the whitespace around every text node before joining them. For example, the following XML prints abdc, which should meet your needs.


    ab dc
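
To make the difference between .text and itertext() concrete, here is a minimal Python 3 sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; Element.text and itertext() behave the same way for this purpose. The sample markup and the name get_text are assumptions made for illustration, not the asker's actual page.

```python
import xml.etree.ElementTree as ET

def get_text(elem):
    """Join the stripped text of elem and all of its descendants."""
    return ''.join(chunk.strip() for chunk in elem.itertext())

# Hypothetical sample: leading text, a child element, then tail text.
sample = "<p>ab<b> d</b>c</p>"
root = ET.fromstring(sample)

first_text = root.text      # only the text before the first child element
all_text = get_text(root)   # text of the element, its descendants, and their tails

print(first_text)  # ab
print(all_text)    # abdc
```

Note that stripping each fragment also removes spaces between words that sit on either side of a tag; joining with ' ' instead of '' may suit pages where that matters.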

巴扎黑 · Answer

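#!/usr/bin/env python3
from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
with open("1.html", "r", encoding="utf-8") as f:
    html = BeautifulSoup(f.read(), "html.parser")
node = html.select(".content")[0]
print(node.prettify())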

html.select(".content") may need a more specific selector to narrow down the match. This is only a rough sketch of how BeautifulSoup works; for specific needs, see the Beautiful Soup documentation.
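
Since prettify() prints the element's markup rather than just its text, a sketch of pulling out only the text with Beautiful Soup may be useful. It assumes the bs4 package is installed; the inline HTML string is a made-up stand-in for 1.html, and the names in it are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of 1.html.
html_doc = '<p class="content">Audi<br/>BMW<br/>Benz</p>'

soup = BeautifulSoup(html_doc, "html.parser")
node = soup.select(".content")[0]

# get_text() collects every text fragment inside the element,
# not just the one before the first child tag.
text = node.get_text(separator=" ", strip=True)
print(text)  # Audi BMW Benz
```

The separator argument controls what is placed between fragments, so it can be set to "\n" if each item should land on its own line.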