Detailed explanation of Python3's solution to difficult character encoding problems

PHPz

Released: 2017-04-02 13:23:49

One of the most important improvements in Python 3 is that it fixes the pitfalls that strings and character encoding left behind in Python 2. Why is encoding in Python 2 so painful? The trouble comes from two flaws in its string design:
- ASCII is the default encoding, which makes processing Chinese text very awkward.
- Strings are split, rather artificially, into two types, unicode and str, which misleads developers.

Of course, these are not bugs; with enough care you can avoid the pitfalls. Python 3, however, solves both problems cleanly.

First, Python 3 sets the default string encoding to UTF-8 (in Python 2, sys.getdefaultencoding() returned 'ascii'):

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

Second, text and binary data are now clearly distinguished and are represented by str and bytes respectively. All text is represented by the str type, which can hold any character in the Unicode character set, while binary data is represented by a new type, bytes.

str

>>> a = "a"
>>> a
'a'
>>> type(a)
<class 'str'>
>>> b = "禅"
>>> b
'禅'
>>> type(b)
<class 'str'>
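
Because a str holds Unicode characters rather than bytes, its length and indexing are counted in characters. The following is a minimal sketch of that idea; the variable names are illustrative only and do not come from the original article.

```python
# A str is a sequence of Unicode code points, not bytes.
s = "Python之禅"

char_count = len(s)            # counts characters: 6 ASCII letters + 2 Chinese
code_point = ord("禅")         # the character's Unicode code point as an int
rebuilt = chr(code_point)      # chr() is the inverse of ord()

print(char_count)              # 8
print(hex(code_point))         # 0x7985
print(rebuilt == "禅")         # True
print(s[-1] == "禅")           # True: indexing a str yields a one-character str
```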

bytes

In Python 3, putting a b prefix in front of the opening quotation mark marks the literal as an object of type bytes. A bytes object is simply a sequence of bytes. A bytes literal may contain characters in the ASCII range, and any other byte value written as a hexadecimal escape such as \xe7, but it cannot contain non-ASCII characters such as Chinese directly.

>>> c = b'a'
>>> c
b'a'
>>> type(c)
<class 'bytes'>

>>> d = b'\xe7\xa6\x85'
>>> d
b'\xe7\xa6\x85'
>>> type(d)
<class 'bytes'>

>>> e = b'禅'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
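
Since a bytes literal cannot hold "禅" directly, the bytes have to come from somewhere else. As a rough sketch (the variable names are made up for illustration), here are a few standard ways to obtain the same three bytes shown above, which are the UTF-8 encoding of "禅":

```python
# Four ways to obtain the same three bytes (the UTF-8 encoding of "禅").
from_literal = b"\xe7\xa6\x85"          # hex escapes inside a bytes literal
from_hex = bytes.fromhex("e7a685")      # from a string of hex digits
from_ints = bytes([0xE7, 0xA6, 0x85])   # from integers in range(256)
from_text = "禅".encode("utf-8")        # by encoding a str

all_equal = from_literal == from_hex == from_ints == from_text
print(all_equal)                        # True
print(len("禅"), len(from_text))        # 1 3  (one character, three bytes)
print(list(from_text))                  # [231, 166, 133]
```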

The bytes type supports many of the same operations as str, such as slicing, indexing, concatenation with + and repetition with *. Note that indexing a bytes object returns an integer rather than a one-byte bytes object. However, + cannot be applied to a str and a bytes object together, although Python 2 allowed it.

>>> b"a"+b"c"
b'ac'
>>> b"a"*2
b'aa'
>>> b"abcdef\xd6"[1:]
b'bcdef\xd6'
>>> b"abcdef\xd6"[-1]
214
>>> b"a" + "b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
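
When str and bytes do need to be combined, one of them has to be converted explicitly first. The sketch below shows both directions, along with the difference between indexing and slicing; the names are illustrative, and "ascii" is chosen here only because the sample data is plain ASCII.

```python
data = b"abcdef\xd6"

first = data[0]      # indexing yields an int (97, the byte value of "a")
head = data[0:1]     # slicing yields a bytes object (b"a")

# Mixing the two types requires an explicit conversion in one direction.
as_text = b"a".decode("ascii") + "b"     # bytes -> str, giving "ab"
as_bytes = b"a" + "b".encode("ascii")    # str -> bytes, giving b"ab"

try:
    b"a" + "b"
    mixing_failed = False
except TypeError as exc:
    # The exact wording of the message differs between Python 3 versions.
    mixing_failed = True
    print("TypeError:", exc)

print(first, head, as_text, as_bytes)
```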

encode and decode

Conversion between str and bytes can be done using the encode and decode methods.

encode converts characters (str) into bytes. By default it uses UTF-8.

>>> s = "Python之禅"
>>> s.encode()
b'Python\xe4\xb9\x8b\xe7\xa6\x85'
>>> s.encode("gbk")
b'Python\xd6\xae\xec\xf8'

decode converts bytes back into characters (str). It also uses UTF-8 by default, so bytes produced with another encoding must be decoded with that same encoding named explicitly.

>>> b'Python\xe4\xb9\x8b\xe7\xa6\x85'.decode()
'Python之禅'
>>> b'Python\xd6\xae\xec\xf8'.decode("gbk")
'Python之禅'
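
The examples above decode with the same codec that was used to encode. As a small sketch of what can happen otherwise (variable names are illustrative), decoding the GBK bytes as UTF-8 raises UnicodeDecodeError, and the standard errors="replace" option trades the exception for replacement characters:

```python
s = "Python之禅"
utf8_bytes = s.encode("utf-8")
gbk_bytes = s.encode("gbk")

# A round trip works when the same codec is used in both directions.
round_trip_ok = (utf8_bytes.decode("utf-8") == s
                 and gbk_bytes.decode("gbk") == s)
print(round_trip_ok)

# Decoding with the wrong codec: these GBK bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError as exc:
    wrong_codec_failed = True
    print("UnicodeDecodeError:", exc.reason)

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lossy = gbk_bytes.decode("utf-8", errors="replace")
print(ascii(lossy))  # ascii() keeps the output printable on any terminal
```

The lossy result no longer equals the original text, so errors="replace" is better suited to diagnostics than to data that must survive intact.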
