I scraped a web page, and part of its content is as follows:
I am using the following code:
# coding=utf-8
import codecs
from lxml import etree
# Read the saved page as UTF-8 text
f = codecs.open("1.html", "r", "utf-8")
content = f.read()
f.close()
tree = etree.HTML(content)
node = tree.xpath("//p[@class='content']")[0]
# Python 2: encode the unicode text as GBK before printing
print node.text.encode('gbk')
But it only prints: 奥迪阿萨德. Nothing after the first piece of text is printed. How can I fix this?
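lxml's element.text returns only the text that comes before the element's first child, so anything after the first child tag is left out. That is why you see this problem. The text that follows a child tag is stored on that child as its tail. A small getText helper that also collects each child's text and tail solves this. Once it is defined, you only need to change the last line of your code so that it prints the helper's result instead of node.text.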
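The helper's code did not survive in this post, so here is a minimal sketch of what such a getText could look like. It is an assumption about the approach, not the original implementation. The sample markup is hypothetical, and the demo uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml elements expose the same text and tail attributes.

```python
import xml.etree.ElementTree as ET


def getText(element):
    """Return all text inside element, including text that follows child tags."""
    parts = [element.text or ""]
    for child in element:
        parts.append(getText(child))    # text inside the child, recursively
        parts.append(child.tail or "")  # text between this child and the next tag
    return "".join(parts)


# Hypothetical sample markup; the original page is not shown in the question.
sample = ET.fromstring('<p class="content">a<b>b<d>d</d></b>c</p>')

first_only = sample.text     # only the text before the first child
full_text = getText(sample)  # text of the whole subtree
print(first_only)
print(full_text)
```

With lxml, the same helper can be called on the node from the question, for example by replacing node.text with getText(node) in the last line. Both lxml and the standard library also provide itertext(), so "".join(node.itertext()) is a built-in alternative.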
The getText here is just a simple implementation. For example, for the sample XML text it prints abdc, which should meet your requirements.
With BeautifulSoup you could instead select the element with html.select(".content"). This may need more selectors to narrow down the match. In addition, I have only roughly described how BeautifulSoup works. For specific needs, you can check the manual: Beautiful Soup Documentation
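As a rough illustration of the BeautifulSoup route, the sketch below selects the element by class and reads all of its text. The markup is a hypothetical stand-in for the scraped page, and the choice of the built-in "html.parser" is an assumption; the answer does not say which parser it used.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the original page is not shown in the question.
markup = '<div><p class="content">first<br/>second<br/>third</p></div>'

soup = BeautifulSoup(markup, "html.parser")
# select() takes a CSS selector and returns a list of matching tags
node = soup.select(".content")[0]
# get_text() gathers every string inside the tag, not just the first one
joined = node.get_text()
lines = node.get_text(separator="\n")
print(joined)
print(lines)
```

Passing a separator keeps the pieces apart, which may be closer to how the text appears on the page when it was split by line-break tags.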