Python Basics: Learning Character Encoding

Foreword

Character encoding is an easy place to run into problems. Keep a few rules in mind:

1. A file must be opened with the same encoding it was saved in.

2. To execute a program, its file is first read into memory.

3. Unicode is the parent encoding; it can only be encoded into other encodings.

UTF-8, GBK, and the like are sub-encodings; they can only be decoded into Unicode.

1. What is character encoding

We know that computers only understand binary, so everything we write must be converted into binary before the computer can process it. How are the characters we write converted into binary? The process relies on a standard that maps each character to a specific number, one-to-one. That standard is called a character encoding.

Characters ------ (Character encoding) -------> Numbers
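The mapping above can be sketched with Python's built-in ord() and chr(). This is only an illustration; the sample characters are arbitrary choices, not part of the original text.

```python
# Map characters to numbers and back using Python's built-ins.
chars = ["A", "a", "中"]

# ord() gives the number (Unicode code point) assigned to a character.
numbers = [ord(c) for c in chars]

# chr() is the inverse mapping: number -> character.
restored = [chr(n) for n in numbers]

# The binary form the computer ultimately works with (8 binary digits).
binary_of_A = format(ord("A"), "08b")

print(numbers)       # [65, 97, 20013]
print(binary_of_A)   # 01000001
```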

2. Character encoding development history

　1.ASCII code

Computers originated in the United States, and so did character encoding. English needs only 26 letters plus some special symbols, unlike Chinese, where even primary school students have to know thousands of characters. So the Americans adopted ASCII (American Standard Code for Information Interchange) as their character encoding. One byte represents one character, and 1 byte = 8 bits, which allows 2 to the 8th power, or 256, different values. Initially only 7 bits were used, giving 128 characters, which was enough for English (and kept costs down). Later the 8th bit was put to use for Latin characters, in extensions such as Latin-1. At that point the byte was full, and English-speaking and Latin-alphabet countries could play happily.
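A minimal sketch of the 7-bit and 8-bit ranges described above, using Python's built-in "ascii" and "latin-1" codecs. The sample strings are assumptions chosen for illustration.

```python
# 7 bits give 2**7 = 128 values; a full byte gives 2**8 = 256.
ascii_size = 2 ** 7
byte_size = 2 ** 8

# Every ASCII character fits in one byte whose value is below 128.
encoded = "Hello".encode("ascii")
all_seven_bit = all(b < 128 for b in encoded)

# A Latin character outside ASCII (U+00E9, e with acute accent)
# cannot be encoded as ASCII...
try:
    "\u00e9".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ...but an 8-bit extension such as Latin-1 uses the upper half of the byte.
latin1_byte = "\u00e9".encode("latin-1")

print(ascii_size, byte_size)   # 128 256
print(list(encoded))           # [72, 101, 108, 108, 111]
print(latin1_byte)             # b'\xe9'
```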

　2.GBK

Although China's technology lagged behind that of the United States at the time, we were eager to catch up, so in 1980 the State Administration of Standards issued a character encoding for Chinese, GB2312, which was later extended into GBK. It uses two bytes to represent a Chinese character, so there are at most 2 to the 16th power, or 65,536, combinations, which is enough for Chinese characters.

At the same time, other countries released their own national character encoding standards, such as Japan's Shift_JIS and South Korea's EUC-KR.
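As a rough illustration of these national standards, the sketch below encodes one character with each of Python's built-in "gbk", "shift_jis", and "euc_kr" codecs. The sample characters are assumptions chosen for the example.

```python
# Each national standard assigns its own two-byte sequences.
zh = "中".encode("gbk")          # Chinese character in GBK
ja = "あ".encode("shift_jis")    # Japanese hiragana in Shift_JIS
ko = "한".encode("euc_kr")       # Korean hangul in EUC-KR

# Two bytes allow at most 2**16 combinations.
two_byte_limit = 2 ** 16

# The standards do not cover each other's characters:
# Shift_JIS has no slot for Korean hangul.
try:
    "한".encode("shift_jis")
    cross_failed = False
except UnicodeEncodeError:
    cross_failed = True

print(zh, len(zh))        # b'\xd6\xd0' 2
print(len(ja), len(ko))   # 2 2
print(two_byte_limit)     # 65536
```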

　3.Unicode

It is said that at the peak there were hundreds of character encodings, none compatible with the others. Every country stood its ground, but that did not help the world interoperate, so Unicode came into being. Unicode, known as the universal code, was first published in 1991 by the Unicode Consortium and is kept in sync with the ISO/IEC 10646 standard. It originally used two bytes per character, giving 65,536 combinations, which already covered most of the world's languages; the code space has since been extended beyond that.

　4.utf-8

Although Unicode is good, there is a problem: English text that could be expressed in one byte per character now needs two, so the storage space doubles. That is obviously not ideal, so UTF-8 was created. It uses only 1 byte for English characters and 3 bytes for most Chinese characters.
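The byte counts can be checked directly. In this sketch, "utf-16-le" stands in for the fixed two-byte form described above; that choice is an assumption made for illustration.

```python
english = "a"
chinese = "中"

# Two bytes per character (UTF-16, little-endian, no byte-order mark).
fixed = [len(c.encode("utf-16-le")) for c in (english, chinese)]

# UTF-8: the length depends on the character.
variable = [len(c.encode("utf-8")) for c in (english, chinese)]

print(fixed)                     # [2, 2]
print(variable)                  # [1, 3]
print(chinese.encode("utf-8"))   # b'\xe4\xb8\xad'
```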

Fixed-width Unicode (two bytes for every character) is simple and crude. Converting characters to numbers is fast, but it takes up a lot of storage space.

UTF-8 uses different lengths to represent different characters, which saves space, but conversion is not as fast as with fixed-width Unicode.

Memory uses Unicode: the whole point of memory is speed, so it is worth sacrificing a little space to keep things fast.

Disk storage and network transmission use UTF-8: disk I/O and network I/O latency is far greater than the cost of UTF-8 conversion, and bandwidth should be saved as much as possible in network transmission.
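The trade-off can be made concrete with a mostly English sample text. This is a sketch under the assumption that "utf-16-le" represents the fixed two-byte form; the sample text itself is made up.

```python
# A mostly English text with two Chinese characters at the end.
text = "hello world " * 100 + "中文"

fixed = text.encode("utf-16-le")   # two bytes per character here
compact = text.encode("utf-8")     # one byte per English character

# With a fixed width, the n-th character sits at byte offset 2 * n,
# so it can be located without scanning the preceding bytes.
n = len(text) - 1
nth_char = fixed[2 * n:2 * n + 2].decode("utf-16-le")

print(len(text))      # 1202
print(len(fixed))     # 2404
print(len(compact))   # 1206
```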

3. Python interpreter execution

The first stage: the Python interpreter starts, which is comparable to starting a text editor.

The second stage: the Python interpreter acts as a text editor, opens the t.py file, and reads its contents from the hard disk into memory.

The third stage: the Python interpreter interprets and executes the t.py code that was just loaded into memory.

In the second stage, the t.py file was saved with some character encoding, and the Python interpreter must use the same encoding when it opens the file (the default source encoding is ASCII in Python 2 and UTF-8 in Python 3). If the file was saved in an encoding different from the interpreter's default, you need to write a #coding: declaration at the top of the file, for example #coding: gbk. It tells the Python interpreter to read the file with the encoding named in the declaration instead of its default, so that no error occurs.
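The effect of the #coding: declaration can be observed with the standard library's tokenize.detect_encoding, which reads the declaration from source bytes. This is a sketch only: the source text below is a hypothetical stand-in for t.py and is held in memory rather than on disk.

```python
import io
import tokenize

# Hypothetical contents of t.py, saved in GBK with a coding declaration.
source_bytes = "# coding: gbk\nname = '中文'\n".encode("gbk")

# detect_encoding() reads the declaration from the first lines of the source.
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# Without a declaration, Python 3 falls back to UTF-8.
default, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)

# Decoding with the declared encoding recovers the original text.
source_text = source_bytes.decode(declared)

# Decoding the GBK bytes with the default encoding fails.
try:
    source_bytes.decode(default)
    wrong_failed = False
except UnicodeDecodeError:
    wrong_failed = True

print(declared, default)   # gbk utf-8
print(wrong_failed)        # True
```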

In the third stage, the interpreter reads the code that has been loaded into memory (Unicode by default) and executes it. When execution reaches an operation such as defining a variable, new memory space is allocated. Note that this newly allocated space is not necessarily Unicode: the user can specify an encoding when defining the variable. The allocated memory is just space and can hold data in any encoding. In Python 3, for example, a str is stored as Unicode, while a bytes object holds whatever encoded data was produced for it.
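A small Python 3 sketch of that point: the same text held as Unicode and as bytes in two different encodings. The variable names are made up for the example.

```python
# A str holds Unicode: len() counts characters, not bytes.
name = "中文"

# The same text stored as bytes in two different encodings.
name_gbk = name.encode("gbk")
name_utf8 = name.encode("utf-8")

print(type(name).__name__, len(name))           # str 2
print(type(name_gbk).__name__, len(name_gbk))   # bytes 4
print(len(name_utf8))                           # 6
```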

4. Encoding and decoding

Saving a file means writing data from memory to the hard disk.

Reading a file means reading data from the hard disk into memory.

To convert one sub-encoding into another, it must first be converted to the parent encoding and then from the parent encoding to the other sub-encoding.

Decoding (decode) is the process of converting a sub-encoding into the parent encoding, Unicode.

Encoding (encode) is the process of converting Unicode into another encoding.
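The two-step conversion can be sketched in Python 3 as follows; the GBK byte string is an assumed sample input.

```python
# Bytes as they might sit in a file saved as GBK (the text "中文").
gbk_bytes = b"\xd6\xd0\xce\xc4"

# Step 1: decode the sub-encoding into the parent encoding (Unicode).
unicode_text = gbk_bytes.decode("gbk")

# Step 2: encode the Unicode text into the other sub-encoding.
utf8_bytes = unicode_text.encode("utf-8")

print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'
```

There is no direct GBK-to-UTF-8 method on bytes; the conversion always passes through str.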

As mentioned before, when a file is read into memory it becomes Unicode (this is the default and can be changed explicitly). Reading a file from the hard disk is therefore the process of decoding the UTF-8 on disk into Unicode.

Saving a file goes from memory to the hard disk: the disk stores UTF-8, so the Unicode in memory has to be encoded into UTF-8.
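Saving and reading can be tried with open() and its encoding parameter. This is a minimal sketch; the temporary file and its name are assumptions made so the example is self-contained.

```python
import os
import tempfile

text = "中文"
path = os.path.join(tempfile.mkdtemp(), "t.txt")   # hypothetical file name

# Saving: the Unicode text in memory is encoded to UTF-8 on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# The raw bytes that actually ended up on disk.
with open(path, "rb") as f:
    on_disk = f.read()

# Reading with the same encoding decodes the bytes back to Unicode.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Opening with a different encoding than the one used for saving fails.
try:
    with open(path, "r", encoding="ascii") as f:
        f.read()
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

os.remove(path)

print(on_disk)           # b'\xe4\xb8\xad\xe6\x96\x87'
print(mismatch_failed)   # True
```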

5. The difference between Python2 and Python3

1. The default encoding of Python 2 is ASCII. Opening a file saved in UTF-8 that contains non-ASCII characters raises an error, so you should add #coding: utf-8 at the top of the file.

A str in Python 2 is really a byte sequence, that is, the already-encoded result. Conceptually, the literal is first read as Unicode (as if it had a u prefix) and then encoded into bytes.

There are two string types in Python 2, str and unicode. Prefixing a string literal with u makes it a unicode object instead of a str.

2. The default encoding of Python 3 is UTF-8, so files saved in UTF-8 can be opened directly.

A str in Python 3 is Unicode.

Python 3 also has two string types, bytes and str: bytes holds raw bytes, and str is Unicode.
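The Python 3 side of this distinction can be checked directly. The sketch below covers Python 3 only, and the variable names are made up.

```python
text = "中文"                  # str: Unicode
data = text.encode("utf-8")    # bytes: the encoded result

is_str = isinstance(text, str)
is_bytes = isinstance(data, bytes)

# Python 3 never mixes the two types implicitly.
try:
    text + data
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Going back from bytes to str requires an explicit decode.
round_trip = data.decode("utf-8")

print(is_str, is_bytes)   # True True
print(mixing_failed)      # True
```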

6. Print to the terminal

First of all, you should know that the default encoding of the Windows terminal (on a Chinese-language system) is GBK.

The terminal is also an application running in memory, so printing with print() goes from memory to memory. A Unicode string is converted to the terminal's encoding when printed, so it displays correctly. In Python 2, however, every string without the u prefix is bytes. If the terminal uses GBK while those bytes are in UTF-8 (as declared in the file) or the default ASCII, the terminal misreads them and the output is garbled or raises an error.
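The mismatch can be simulated in Python 3 without an actual Windows terminal. In this sketch, shown_on_gbk_terminal is a hypothetical helper that decodes raw bytes the way a GBK terminal would interpret them; it is not a real terminal API.

```python
import sys

text = "中文"

# The encoding Python believes the current terminal uses (varies by system).
terminal_encoding = sys.stdout.encoding


def shown_on_gbk_terminal(raw: bytes) -> str:
    """Hypothetical helper: interpret raw bytes as a GBK terminal would."""
    return raw.decode("gbk", errors="replace")


# Bytes encoded as GBK match the terminal and display correctly.
ok = shown_on_gbk_terminal(text.encode("gbk"))

# Bytes encoded as UTF-8 are misread by the GBK terminal.
garbled = shown_on_gbk_terminal(text.encode("utf-8"))

print(terminal_encoding)
print(ok == text)        # True
print(garbled == text)   # False
```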

This is my current understanding. If I find an error or something unclear, I will revise it. Character encoding really is full of pitfalls.
