Character encoding issues when lxml processes XML

黄舟

Released: 2017-04-18 09:16:02

To keep the problem simple, the XML content is reduced to the following form:

<?xml version="1.0" encoding="gbk"?>
<DOCUMENT>
<da><![CDATA[中文，就是任性]]></da>
</DOCUMENT>

The document declares its encoding as gbk, and one of its nodes contains Chinese text. Extracting that node's value with lxml raised the following exception:

lxml.etree.XMLSyntaxError: Extra content at the end of the document

The Python script that triggers it is:

from io import BytesIO
from lxml import etree
# The declaration says gbk, but the bytes handed to the parser are UTF-8.
tst = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文，就是任性]]></da></DOCUMENT>'
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))

Before the simplification, the full document raised a different exception:

lxml.etree.XMLSyntaxError: input conversion failed due to input error, bytes 0x8B 0x2C 0xE6 0x9D

Whichever exception appears, my guess was that it is related to the character encoding.
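
That guess can be checked without lxml. The sketch below is only an illustration: `decodes_as` is a hypothetical helper, not part of lxml or of the original script. It tests whether the bytes passed to the parser can be decoded with the encoding named in the declaration. A successful decode does not prove the encoding is right, but a failure shows the declaration and the bytes disagree.

```python
def decodes_as(data, encoding):
    """Return True if `data` can be decoded with `encoding`.

    Hypothetical helper for illustration. Passing this check is
    necessary, not sufficient, for the declaration to be correct.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


text = u'中文，就是任性'
utf8_bytes = text.encode('utf-8')  # what the script actually feeds the parser
gbk_bytes = text.encode('gbk')     # what the gbk declaration promises

print(decodes_as(utf8_bytes, 'gbk'))
print(decodes_as(gbk_bytes, 'gbk'))
```

If the first check fails while the second passes, the bytes are simply not in the encoding the declaration names, which fits the "input conversion failed" message above.
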
After several unsuccessful attempts, I came across a post on Stack Overflow which pointed out that the problem is tied to the encoding value in the XML declaration. So I tried adding one line of code:

from io import BytesIO
from lxml import etree

tst = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文，就是任性]]></da></DOCUMENT>'
# Make the declaration match the UTF-8 bytes produced below.
tst = tst.replace('encoding="gbk"', 'encoding="utf-8"')
for event, element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))

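The only change is the replace statement, which swaps encoding="gbk" for encoding="utf-8" so that the declaration matches the UTF-8 bytes actually passed to the parser. With that, the script finally produced the expected result: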

da, 中文，就是任性
DOCUMENT, None
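
The literal replace above only works when the declaration reads exactly encoding="gbk". As a rough sketch of a more general variant, the hypothetical helper `to_utf8_bytes` below (not from the original script or from lxml) rewrites whatever encoding the leading XML declaration names to utf-8, with either quote style, and returns UTF-8 bytes. It assumes the document is already available as a decoded text string, as `tst` is in the script above.

```python
import re

# Encoding pseudo-attribute inside a leading XML declaration.
# Group 1: everything up to the opening quote; group 2: the quote character.
_DECL_ENCODING = re.compile(
    r'\A(<\?xml[^>]*?\bencoding\s*=\s*)(["\'])[A-Za-z0-9._-]+\2'
)


def to_utf8_bytes(xml_text):
    """Return `xml_text` as UTF-8 bytes with a matching declaration.

    Hypothetical helper for illustration. Text without an encoding
    declaration is returned unchanged apart from the UTF-8 encoding.
    """
    fixed = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', xml_text, count=1)
    return fixed.encode('utf-8')


sample = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文，就是任性]]></da></DOCUMENT>'
data = to_utf8_bytes(sample)  # bytes whose declaration now says utf-8
```

The resulting `data` can be wrapped in `BytesIO` and passed to `etree.iterparse` exactly as in the script above.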
