Convert XML/HTML Entities into Unicode String in Python
Question: How can I convert a string containing HTML entities into a Unicode string in Python? For example, the string "&#462;" should be converted to "ǎ", the letter a with a tone mark (u'\u01ce').
Answer:
In Python 2, the standard library's HTMLParser module has an undocumented method called unescape() on the HTMLParser class. It converts both named and numeric HTML entities into their Unicode equivalents.
<code class="python">import HTMLParser

h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010')   # u'\xa9 2010'
h.unescape('&#169; 2010')   # u'\xa9 2010'</code>
In Python 3.4 and above, use the documented unescape() function from the html module instead:
<code class="python">import html

html.unescape('&copy; 2010')   # '© 2010'
html.unescape('&#169; 2010')   # '© 2010'</code>
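Applied to the example from the question, a minimal sketch for Python 3.4+ might look like the following. The helper name entities_to_text is only an illustrative wrapper and is not part of the standard library; the actual work is done by html.unescape().

```python
import html


def entities_to_text(s):
    """Decode named, decimal, and hexadecimal character references.

    Hypothetical convenience wrapper around html.unescape().
    """
    return html.unescape(s)


# The decimal reference from the question: 462 == 0x01CE
decoded = entities_to_text("&#462;")
print(decoded == "\u01ce")  # True

# Hexadecimal and named references are handled the same way
print(entities_to_text("&#x1ce; &amp; &copy;"))  # ǎ & ©
```

Because html.unescape() already accepts all three reference forms, no separate handling for named and numeric entities should be needed.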