Recently, while scraping a certain website, most pages came through fine, but a small number were garbled. After a few days of debugging I found the cause: illegal (undecodable) bytes in the content. Here are my notes.
1. Under normal circumstances, you can detect the encoding of a file or page with chardet:

import chardet
thischarset = chardet.detect(strs)["encoding"]

Alternatively, grab the charset=xxxx declaration directly from the page's HTML.
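To illustrate the second approach, here is a minimal sketch that pulls the charset declaration out of the raw HTML with a regular expression. The helper name `sniff_charset` is hypothetical, and real pages may instead declare the encoding in the HTTP Content-Type header, so treat this as best-effort:

```python
import re

def sniff_charset(html_bytes):
    """Best-effort extraction of charset=... from a page's markup.

    Hypothetical helper for illustration; returns None when no
    declaration is found in the first 2 KB of the document.
    """
    # Decode loosely first so the regex can run on text; the <meta> tag
    # itself is ASCII even when the page body is not.
    head = html_bytes[:2048].decode("ascii", "ignore")
    m = re.search(r'charset\s*=\s*["\']?\s*([-\w]+)', head, re.IGNORECASE)
    return m.group(1).lower() if m else None

html = b'<meta http-equiv="Content-Type" content="text/html; charset=GBK">'
print(sniff_charset(html))  # -> gbk
```

This also matches the HTML5 form `<meta charset="utf-8">`, since the pattern only looks for the charset= key itself.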
2. Even when the correct encoding is specified, special characters in the content can still come out garbled. When illegal bytes in the content are the cause, you can have decode() ignore them:

strs = strs.decode("UTF-8", "ignore").encode("UTF-8")

The second argument of decode() sets the error-handling strategy for illegal bytes; the default is "strict", which raises an exception.
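The snippet above is Python 2 style (decode on a str). A minimal Python 3 sketch of the same idea, using a UTF-8 byte string with a deliberately spliced-in illegal byte:

```python
# Build UTF-8 bytes with a stray 0xFF inserted, as can happen when
# scraping pages with mixed or corrupted encodings. Each Chinese
# character here is 3 bytes, so slicing at 3 keeps characters intact.
good = "中文内容".encode("utf-8")
raw = good[:3] + b"\xff" + good[3:]

# The default error handler ("strict") raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict mode failed:", e.reason)

# "ignore" silently drops the illegal byte; "replace" would substitute
# U+FFFD instead, which keeps a visible marker of the damage.
cleaned = raw.decode("utf-8", "ignore")
print(cleaned)  # -> 中文内容
```

Whether "ignore" or "replace" is the better choice depends on whether you want silently clean output or a visible trace of where bytes were dropped.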