html - Extracting the content of a tag with Python
阿神 2017-04-17 13:09:53

I scraped a web page. Part of its content is as follows:

I am using the following code:

#coding=utf-8
import codecs
from lxml import etree
f=codecs.open("1.html","r","utf-8")
content=f.read()
f.close()
tree=etree.HTML(content)
node=tree.xpath("//p[@class='content']")[0]
print node.text.encode('gbk')

But it only prints 奥迪阿萨德; none of the content after the first one is printed. How can I solve this?


All replies (2)
黄舟

lxml's element.text returns only the text that precedes the element's first child element, which is why the output stops there. A getText helper like the following solves the problem:

# requires lxml
# written for Python 2
def getText(elem):
    rc = []
    for node in elem.itertext():
        rc.append(node.strip())
    return ''.join(rc)

In your script, only the last line needs to change:

#coding=utf-8
import codecs
from lxml import etree

def getText(elem):
    rc = []
    for node in elem.itertext():
        rc.append(node.strip())
    return ''.join(rc)

f=codecs.open("1.html","r","utf-8")
content=f.read()
f.close()
tree=etree.HTML(content)
# xpath returns lxml.etree._Element objects, which can be passed straight to getText.
node=tree.xpath("//p[@class='content']")[0]
print getText(node).encode('gbk')

This getText is only a simple implementation. For example, given the following XML it prints abdc (strip() also drops the space after b), which should meet your requirements.

<p class="content">
    a<em>b <em>d</em></em>c
</p>
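
To see why node.text stops early, here is a minimal Python 3 sketch. It assumes a made-up sample string, since the page fragment from the question is not shown, and it uses the standard library's xml.etree.ElementTree, whose .text, .tail and itertext() behave like lxml's for this purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real page fragment from the question is not available.
SAMPLE = '<p class="content">first line<br/>second line<br/>third line</p>'

node = ET.fromstring(SAMPLE)

# .text holds only the text before the first child element.
first_only = node.text

# Text that follows a child element is stored on that child's .tail.
tails = [child.tail for child in node]

# itertext() yields .text and every .tail in document order;
# joining with a newline keeps the pieces apart instead of gluing them together.
full_text = "\n".join(piece.strip() for piece in node.itertext() if piece.strip())

print(first_only)
print(tails)
print(full_text)
```

Joining with a separator rather than '' is a design choice: it avoids words running together, as happens in the abdc example above.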
巴扎黑
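#!/usr/bin/env python3
from bs4 import BeautifulSoup

# Read the file as UTF-8 and name the parser explicitly to avoid the bs4 warning.
with open("1.html", "r", encoding="utf-8") as f:
    html = BeautifulSoup(f.read(), "html.parser")
node = html.select(".content")[0]
# prettify() prints the node's markup; node.get_text() would return only its text.
print(node.prettify())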

html.select(".content") may need a more specific selector to narrow the match. This is only a rough sketch of how BeautifulSoup works; for specific needs, check the manual: Beautiful Soup Documentation
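
Since the question asks for the text rather than the markup, one possible sketch uses BeautifulSoup's get_text(). The sample string is hypothetical, because the original page fragment is not shown, and the newline separator is just one reasonable choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page fragment from the question is not available.
SAMPLE = '<p class="content">first line<br/>second line<br/>third line</p>'

soup = BeautifulSoup(SAMPLE, "html.parser")
node = soup.select(".content")[0]

# get_text() collects every text node under the element;
# separator and strip control how the pieces are joined and trimmed.
text = node.get_text(separator="\n", strip=True)
print(text)
```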
