
Python character encoding explanation

巴扎黑
2017-08-08 15:48:48

This article goes over the well-worn topic of character encoding from a Python point of view. It is shared here as a reference. Let's take a look.

Preface

Character encoding is very easy to get wrong, so keep a few things in mind:

1. Whichever encoding a file was saved with is the encoding that must be used to open it

2. To execute a program, the file is first read into memory

3. Unicode is the parent encoding and can only be encoded into other encoding formats

UTF-8 and GBK are child encodings and can only be decoded into Unicode
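The rules above can be tried out in a few lines of Python 3. This is only a minimal sketch: the sample string and the variable names are illustrative assumptions, not something from the original article.

```python
# Rule 1: bytes must be decoded with the encoding that produced them.
text = "中文"                  # sample text (an arbitrary choice)
saved = text.encode("gbk")     # "save" the text as GBK bytes

round_trip_ok = saved.decode("gbk") == text   # same encoding: works

try:
    saved.decode("utf-8")      # wrong encoding for these bytes
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

# Rule 3: Unicode text (str) can only be encoded,
# and encoded data (bytes) can only be decoded.
str_can_decode = hasattr(str, "decode")       # False in Python 3
bytes_can_encode = hasattr(bytes, "encode")   # False in Python 3

print(round_trip_ok, mismatch_detected, str_can_decode, bytes_can_encode)
```

In Python 3 the type system itself enforces the parent/child relationship: str has only encode(), and bytes has only decode().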

1. What is character encoding

We know that computers can only recognize binary, so the code we write has to be converted into binary before the computer can work with it. How are the characters we write converted into binary? The process uses a standard that maps each character one-to-one to a specific number. That standard is called a character encoding.

Character------(Character encoding)------->Number
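As a small illustration of the character-to-number mapping, Python's built-in ord() and chr() convert between a character and its Unicode number. The sample characters and the name char_numbers are arbitrary choices for this sketch.

```python
# Map a few sample characters to their numbers and back again.
char_numbers = {ch: ord(ch) for ch in "Aa中"}

for ch, number in char_numbers.items():
    # chr() is the inverse of ord(): number -> character
    assert chr(number) == ch
    print(f"{number} -> {bin(number)}")
```

The binary form printed for each number is the raw value only; how that value is laid out in bytes depends on the encoding, which the following sections cover.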

2. Development history of character encoding

1.ASCII code

Computers originated in the United States, and so did character encoding. English needs only 26 letters plus some special symbols, unlike Chinese, where even primary school students have to know thousands of characters. So the Americans used ASCII (American Standard Code for Information Interchange) as their character encoding. One byte represents one character, and 1 byte = 8 bits, which allows 2 to the 8th power, or 256, different values. Initially only the first 7 bits were used, giving 128 characters, which was enough for English (and kept costs down, of course). Later, Latin characters were assigned to the values that use the 8th bit. At that point the byte was full, and English-speaking and Latin-script countries could play happily.

2.GBK

China did not want to be left behind, so in 1980 the State Administration of Standards issued a character encoding for Chinese, GB2312, which was later extended into GBK. It uses two bytes to represent a Chinese character, so there are at most 2 to the 16th power, or 65,536, combinations, which is enough for Chinese characters.

At the same time, other countries released their own national character encoding standards, such as Japan's Shift_JIS and South Korea's EUC-KR.

3.Unicode

It is said that at their peak there were hundreds of character encodings, none compatible with the others. Every country went its own way, which did nothing for interoperability, so Unicode came into being. In the early 1990s the Unicode Consortium published Unicode (version 1.0 dates from 1991), kept in step with the International Organization for Standardization's ISO/IEC 10646, as a universal code. It originally used two bytes per character, giving 65,536 combinations, which already covered most of the world's languages. The code space has since been enlarged beyond that.

4.utf-8

Although Unicode is good, there is a problem. English text that could be expressed in one byte per character now needs two bytes, so the storage space doubles. This is obviously not ideal, so UTF-8 was created, which uses only 1 byte for English characters and 3 bytes for most Chinese characters.
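The variable width of UTF-8 can be checked directly. This is a minimal sketch; the sample characters (written as escapes so the source stays ASCII) and the name utf8_lengths are assumptions made for illustration.

```python
# Sample characters: Latin letter, accented letter, Chinese character, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {size} byte(s)")
```

The four samples take 1, 2, 3 and 4 bytes respectively, which is the full range of lengths UTF-8 uses.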

5. In its fixed-width two-byte form, Unicode represents every character with the same number of bytes, which is simple and crude. Converting between characters and numbers is fast, but it takes up more storage space

UTF-8 uses different lengths for different characters, which saves space, but conversion is not as fast as with fixed-width Unicode

Memory uses Unicode as its character encoding. Memory exists for speed, so it is worth sacrificing a little space to keep things fast

Hard disks and network transmission use UTF-8, because disk I/O and network I/O latency is far greater than the cost of UTF-8 conversion, and network transmission should save as much bandwidth as possible
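The space trade-off described above can be measured. In this sketch, UTF-16-LE stands in for the article's "two bytes per character" form of Unicode; that choice is an assumption, and the helper name encoded_sizes is hypothetical. CPython's real in-memory string layout is more nuanced than a fixed two bytes.

```python
def encoded_sizes(text):
    """Return the byte count of text in a fixed-width and a variable-width encoding."""
    return {
        "utf-16-le": len(text.encode("utf-16-le")),  # 2 bytes per character for these samples
        "utf-8": len(text.encode("utf-8")),          # 1 to 4 bytes per character
    }

english = encoded_sizes("hello")
chinese = encoded_sizes("中文")

print("english:", english)
print("chinese:", chinese)
```

For English text UTF-8 halves the size, while for Chinese text it is somewhat larger than the two-byte form. That is why UTF-8 pays off most on storage and networks carrying a lot of ASCII.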

3. Python interpreter execution

First stage: The Python interpreter starts, which is equivalent to starting a text editor

Second stage: Acting as a text editor, the Python interpreter opens the t.py file and reads its contents from the hard disk into memory

Third stage: The Python interpreter interprets and executes the t.py code that was just loaded into memory

In the second stage, the t.py file was saved with some character encoding, and the Python interpreter must use that same encoding when it opens the file (the default source encoding is ASCII in Python 2 and UTF-8 in Python 3). If the file was saved with an encoding that differs from the interpreter's default, write a # coding: declaration naming that encoding at the top of the file. This tells the interpreter to read the file with the declared encoding instead of its default, so that nothing goes wrong.
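How Python 3 picks the encoding of a source file can be observed with the standard library's tokenize.detect_encoding(), which reads the coding declaration the same way the interpreter does. This is a sketch only: the helper name source_encoding and the two in-memory "files" are assumptions for illustration.

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding Python would use to read these source bytes."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# A source file saved as GBK, with a matching declaration on its first line.
declared_source = "# coding: gbk\nname = '中文'\n".encode("gbk")
# A source file with no declaration at all.
undeclared_source = "name = 'abc'\n".encode("utf-8")

declared_encoding = source_encoding(declared_source)
default_encoding = source_encoding(undeclared_source)

# With the right encoding known, the bytes decode back to the original text.
decoded_source = declared_source.decode(declared_encoding)

print(declared_encoding, default_encoding)
```

Without a declaration, Python 3 falls back to UTF-8, so a GBK-encoded file containing Chinese text needs the declaration to be read correctly.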

In the third stage, the interpreter reads the code that has been loaded into memory (Unicode by default) and executes it. When execution reaches an operation such as defining a variable, new memory space is allocated. Note that this new space does not necessarily hold Unicode: the programmer can choose the encoding when defining the value. The allocated space is just space and can store data in any encoding. In Python 3, for example, a str holds Unicode text, while a bytes object holds data in whatever encoding produced it.
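A short Python 3 sketch of this point: the same text can be held in memory as Unicode (str) or as bytes in a chosen encoding. The variable names and sample text are illustrative assumptions.

```python
text = "中文"                     # str: Unicode text
as_gbk = text.encode("gbk")       # bytes: the same text encoded as GBK
as_utf8 = text.encode("utf-8")    # bytes: the same text encoded as UTF-8

# len() counts characters for str, but bytes for bytes objects.
print(type(text).__name__, len(text))
print(type(as_gbk).__name__, len(as_gbk))
print(type(as_utf8).__name__, len(as_utf8))
```

All three variables live in memory at the same time, and only the first one is Unicode. The other two are plain byte sequences whose meaning depends on knowing which encoding produced them.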

4. Encoding and decoding

Saving a file means writing it from memory to the hard disk

Reading a file means reading it from the hard disk into memory

Unicode is the parent encoding, and UTF-8 and GBK are child encodings. To convert one child encoding into another, it must first be converted to the parent encoding and then from the parent encoding to the other child encoding

Decoding (decode) is the process of converting a child encoding into the parent encoding, Unicode

Encoding (encode) is the process of converting Unicode into another encoding
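Converting between two child encodings by way of Unicode can be sketched as follows. The function name transcode and the sample text are hypothetical, chosen only for this illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one child encoding to another via Unicode."""
    unicode_text = data.decode(source_encoding)   # child -> parent (decode)
    return unicode_text.encode(target_encoding)   # parent -> child (encode)

gbk_bytes = "中文".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
back_to_gbk = transcode(utf8_bytes, "utf-8", "gbk")

print(len(gbk_bytes), len(utf8_bytes), back_to_gbk == gbk_bytes)
```

There is no direct GBK-to-UTF-8 step here. The intermediate str is the Unicode form that both encodings share.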

As mentioned before, when a file is read into memory it becomes Unicode (this is the default and can be changed by explicit instruction). Reading a file from the hard disk is therefore the process of decoding the UTF-8 on disk into Unicode.

Saving a file is the process of writing it from memory to the hard disk. The hard disk stores UTF-8, so the Unicode in memory has to be encoded into UTF-8.
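The save and read steps can be sketched with Python 3's open(), passing the encoding explicitly. The file name sample.txt and the sample text are assumptions. When the encoding argument is omitted, the default generally depends on the platform, so naming it is the safer habit.

```python
import os
import tempfile

text = "中文 and English"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")   # hypothetical file name

    # Save: Unicode in memory is encoded to UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Look at what is actually stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Read: UTF-8 bytes on disk are decoded back to Unicode in memory.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(len(text), len(raw), loaded == text)
```

The bytes on disk are longer than the character count because each Chinese character takes three bytes in UTF-8.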

5. The difference between Python2 and Python3

1. The default encoding of Python 2 is ASCII, so opening a file saved as UTF-8 that contains non-ASCII characters raises an error. Add # coding: utf-8 at the top of the file

str in Python 2 is treated as bytes, so a Python 2 str is already the result of encoding. To get Unicode instead, put a u in front of the string literal. The resulting unicode object can then be encoded into bytes

There are two string types in Python 2, str and unicode. Adding a 'u' in front of a string literal makes it unicode

2. The default encoding of Python 3 is UTF-8, so files saved as UTF-8 can be opened directly

str in Python 3 is Unicode

Python 3 also has two string types, bytes and str: bytes holds bytes and str is Unicode
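The Python 3 side of this comparison can be checked directly (Python 2 behaviour is not reproduced here). This is a minimal sketch with illustrative variable names. It relies on the u prefix still being accepted in Python 3, where it produces an ordinary str.

```python
text = "abc"     # str: Unicode
data = b"abc"    # bytes: raw bytes

same_type = type(text) is type(data)   # str and bytes are distinct types
compare_equal = (text == data)         # never equal, even with the same letters

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The Python 2 style u'' prefix is accepted, but the result is a normal str.
legacy_literal = u"中文"

print(same_type, compare_equal, mixing_allowed, type(legacy_literal).__name__)
```

Because mixing raises TypeError, every conversion between the two types has to go through an explicit encode() or decode() call.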

6. Print to the terminal

First of all, you need to know that on Chinese-language Windows the terminal's default encoding is GBK

The terminal is also an application running in memory, so printing with print() goes from memory to memory. Unicode strings therefore print correctly, provided the terminal's encoding can represent the characters. In Python 2, however, every string without the 'u' prefix is bytes. If the terminal uses GBK while the Python 2 string was encoded with the declared UTF-8 (or the default ASCII), the output is garbled or an error occurs when printing in the terminal.
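The effect of the terminal's encoding can be simulated in Python 3 without a real terminal. In this sketch an in-memory stream stands in for the terminal; the helper name bytes_sent_to_terminal is hypothetical, and the Python 2 situation is only imitated by decoding UTF-8 bytes as GBK.

```python
import io

def bytes_sent_to_terminal(text, terminal_encoding):
    """Return the bytes print() would write to a stream using terminal_encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=terminal_encoding, newline="")
    print(text, end="", file=stream)   # print() encodes str with the stream's encoding
    stream.flush()
    return raw.getvalue()

# Unicode text printed to a GBK terminal is encoded as GBK on the way out.
gbk_output = bytes_sent_to_terminal("中文", "gbk")

# A terminal encoding that cannot represent the characters raises an error.
try:
    bytes_sent_to_terminal("中文", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Imitating Python 2: bytes already encoded as UTF-8 are shown by a GBK terminal.
utf8_bytes = "中文".encode("utf-8")
shown_by_gbk_terminal = utf8_bytes.decode("gbk", errors="replace")
garbled = shown_by_gbk_terminal != "中文"

print(len(gbk_output), ascii_failed, garbled)
```

In a real session, sys.stdout.encoding reports which encoding print() uses for the current terminal.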

This is my current understanding. If I later find errors or unclear wording, I will revise it. Alas, character encoding is full of pitfalls
