python - Reading XML with the start event returns incomplete elements
PHPz 2017-04-17 13:03:48

I want to read the child nodes (review_id, summary, and so on) of every item node in an XML file. A sample item looks like this:

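    <item>
      <review_id>0095693</review_id>
      <summary>书本内容</summary>
      <polarity>P</polarity>
      <text>书本的内容很好,对我很有帮助,就是字体的颜色是紫色的,看就了会觉得不清晰。</text>
      <category>book</category>
    </item>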

The full example can be downloaded here. My environment is Mac OS X 10.9.2 with Python 2.7.6, and the code is as follows:

    import sys
    import os
    from xml.etree.ElementTree import iterparse, tostring

    def count_pos_neg(itemfile):
        pos_count = 0
        neg_count = 0
        try:
            for event, elem in iterparse(itemfile, events=["start",]):
                if elem.tag == "item":
                    try:
                        if processItem(elem)['polarity'] == "P":
                            pos_count += 1
                        else:
                            neg_count += 1
                    except Exception, e:
                        print >> sys.stderr, "Ignoring item: %s" % e
                    elem.clear()
        except SyntaxError, se:
            print >> sys.stderr, se
        return pos_count, neg_count

    def processItem(item):
        """ Process a review.
        Implement custom code here. Use 'item.find('tagname').text' to access the properties of a review.
        """
        category = item.find("category").text
        polarity = item.find("polarity").text
        text = item.find("text").text
        summary = item.find("summary").text
        return {'polarity':polarity, 'summary':summary, 'text':text, 'category':category }

    if __name__ == "__main__":
        pc, nc = count_pos_neg(itemfile)

The problem is that on every 55th item node an AttributeError is raised, with this message:

Ignoring item: 'NoneType' object has no attribute 'text'

When I parse with events=('end',) instead, no error occurs. Does that mean the earlier error is caused by parsing on start events?


All replies (1)
伊谢尔伦

The documentation says:

Note

iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for “end” events instead.
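
One way to see what this note describes is to feed the parser in pieces by hand. The sketch below uses XMLPullParser, which exists only in Python 3.4 and later (not in the asker's Python 2.7). The function name children_seen_at_start and the chunk contents are illustrative assumptions, not part of the original code.

```python
from xml.etree.ElementTree import XMLPullParser


def children_seen_at_start(first_chunk, rest, child_tag="category"):
    """Return (found_at_start, found_after_rest) for the first <item>.

    first_chunk ends right after the <item> start tag, mimicking a read
    boundary that falls between an element's start tag and its children.
    """
    parser = XMLPullParser(events=("start",))
    parser.feed(first_chunk)

    # Newer Python/Expat builds provide flush(); call it when available so
    # that already-fed data is parsed before events are read.
    flush = getattr(parser, "flush", None)
    if flush is not None:
        flush()

    item = None
    for _event, elem in parser.read_events():
        if elem.tag == "item":
            item = elem
            break
    if item is None:
        raise ValueError("no <item> start tag in first_chunk")

    # At the "start" event the child has not been parsed yet.
    found_at_start = item.find(child_tag) is not None

    # Once the remaining data is parsed, the same element object is filled in.
    parser.feed(rest)
    parser.close()
    found_after_rest = item.find(child_tag) is not None
    return found_at_start, found_after_rest


if __name__ == "__main__":
    print(children_seen_at_start(
        "<items><item>",
        "<category>book</category><polarity>P</polarity></item></items>",
    ))
```

The first value shows that `find()` returns None at the start event, which is what produces "'NoneType' object has no attribute 'text'" in the question's code.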

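When a start event fires, the element's children may not have been parsed yet, so you should use the end event instead. iterparse reads the file in fixed-size chunks, so an item whose opening tag lands near the end of a chunk is reported before its children arrive, which would explain why only some of the items fail.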
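
A minimal sketch of the counting loop rewritten for end events follows. It is written for Python 3 rather than the asker's Python 2.7, and the names process_item and SAMPLE, as well as the "N" polarity value, are assumptions made for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data; only the tag names come from the question.
SAMPLE = (
    b"<items>"
    b"<item><summary>s1</summary><polarity>P</polarity>"
    b"<text>t1</text><category>book</category></item>"
    b"<item><summary>s2</summary><polarity>N</polarity>"
    b"<text>t2</text><category>book</category></item>"
    b"</items>"
)


def process_item(item):
    """Collect the fields of one fully parsed <item> element."""
    return {
        "polarity": item.findtext("polarity"),
        "summary": item.findtext("summary"),
        "text": item.findtext("text"),
        "category": item.findtext("category"),
    }


def count_pos_neg(source):
    """Count items with polarity "P" versus everything else.

    source may be a filename or a binary file object.
    """
    pos_count = 0
    neg_count = 0
    # "end" fires after the closing tag, so all children are present.
    for _event, elem in iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        if process_item(elem)["polarity"] == "P":
            pos_count += 1
        else:
            neg_count += 1
        # Free the item's children once it has been counted.
        elem.clear()
    return pos_count, neg_count


if __name__ == "__main__":
    print(count_pos_neg(io.BytesIO(SAMPLE)))
```

Calling `elem.clear()` at the end event is safe because the element is complete by then. Doing it at the start event, as in the question, clears an element that is still being built.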
