
A Python fix for garbled Chinese characters when scraping web pages

高洛峰
Release: 2017-02-24 15:31:42

Over the past few days, while scraping a certain website, most pages came through fine, but a small number contained garbled characters. After a few days of debugging, I finally found that the cause was some illegal characters in the content. I am recording the fix here.

1. Under normal circumstances, you can use chardet to detect the encoding of the file or page:

import chardet

# strs holds the raw bytes of the file or page
thischarset = chardet.detect(strs)["encoding"]

Alternatively, read the encoding directly from the charset=xxxx declaration in the page.
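As a rough illustration of the second option, here is a minimal sketch that pulls the charset=xxxx declaration out of a page with a regular expression. It assumes Python 3 and that the page is available as raw bytes; the function name sniff_charset, the 2048-byte search window, and the "utf-8" fallback are my own choices for the example, not part of the original approach.

```python
import re

# Matches charset=xxxx (with or without quotes). The pattern works on raw
# bytes because the encoding is not known yet at this point.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared near the top of raw, or default if none is found.

    sniff_charset is a hypothetical helper written for this example.
    """
    match = _CHARSET_RE.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


page = b'<html><head><meta charset="GBK"><title>x</title></head></html>'
print(sniff_charset(page))
```

A declared charset can be wrong or missing, so it is probably safest to treat this value as a hint and fall back to detection when decoding fails.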

2. Even when the correct encoding is specified, special characters in the content can still produce garbled text. This is caused by illegal characters in the content; you can have the decoder ignore them:

strs = strs.decode("UTF-8","ignore").encode("UTF-8")

The second parameter of decode specifies how illegal characters are handled. It defaults to "strict", which raises an exception (UnicodeDecodeError); "ignore" silently drops the offending bytes instead.
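To make the difference between the error handlers concrete, here is a small sketch assuming Python 3, where decode is called on a bytes object. The sample bytes and the helper name decode_with are made up for the illustration; "replace" is a third standard handler that the text above does not mention, shown only for comparison.

```python
# "中文" encoded as UTF-8, followed by a stray byte that is not valid UTF-8.
raw = "中文".encode("utf-8") + b"\xff"


def decode_with(data, errors):
    """Decode data as UTF-8 with the given error handler (hypothetical helper)."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError as exc:
        # Only "strict" (the default) reaches this branch.
        return "UnicodeDecodeError: " + exc.reason


print(decode_with(raw, "strict"))   # reports the decoding error
print(decode_with(raw, "ignore"))   # drops the illegal byte
print(decode_with(raw, "replace"))  # substitutes U+FFFD for the illegal byte
```

Note that "ignore" loses data without any warning, so it may be worth logging which pages needed it.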


source:php.cn