I am scraping Chinese text with lxml and the result comes back mangled. I don't know how to deal with it...
import requests
from lxml import etree

comUrl = "http://m.51job.com/search/codetail.php?coid=4108723"
res = requests.get(comUrl)
html = etree.HTML(res.text)
p = html.xpath("//aside")[1].xpath("./p")  # result: [<Element p at 0x7bf01c8>, <Element p at 0x78f4408>, <Element p at 0x69db388>]
p[0].xpath("./span/text()")  # this is the text I want to scrape
The result comes back like this: [u'\xe6\x80\xa7\xe8\xb4\xa8']
It is a unicode object, but its content is the byte sequence of an encoded str. How do I convert this into Chinese?
Normally it should be either '\xe6\x80\xa7\xe8\xb4\xa8' (a UTF-8 encoded str) or u'\u6027\u8d28' (unicode).
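When this happens, it is usually because requests has guessed the wrong encoding for the page. If the Content-Type header of a text response declares no charset, requests typically falls back to ISO-8859-1, so res.text ends up holding the page's UTF-8 bytes decoded one byte per character.
So just tell requests which encoding to use: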
res.encoding = 'utf-8'  # set this before passing res.text to etree.HTML
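
If the text has already been decoded the wrong way, it can usually be repaired after the fact. Here is a minimal sketch, assuming the string was produced by decoding UTF-8 bytes as ISO-8859-1 (Latin-1), which is what the output in the question looks like. The helper name fix_mojibake is made up for illustration and is not part of requests or lxml.

```python
# -*- coding: utf-8 -*-

def fix_mojibake(text):
    """Undo a UTF-8 -> Latin-1 mis-decoding (hypothetical helper).

    Latin-1 maps code points 0-255 one-to-one onto bytes, so encoding
    with it recovers the original bytes, which are then decoded as UTF-8.
    If the text was not mis-decoded this way, it is returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The garbled value from the question.
garbled = u'\xe6\x80\xa7\xe8\xb4\xa8'
repaired = fix_mojibake(garbled)

# The repaired value is the two-character string the asker expected.
assert repaired == u'\u6027\u8d28'
```

Setting res.encoding up front is still the cleaner fix, since it avoids the bad decode in the first place. Another option is to pass the raw bytes in res.content to etree.HTML so the parser can detect the encoding itself.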