How to solve Chinese garbled characters in Java

As computers developed and spread, countries around the world designed their own encoding schemes to suit their own languages and scripts. The result was a proliferation of encodings in which the same binary value could be interpreted as different symbols. Unicode was created to solve this incompatibility.

Unicode

Unicode is also known as the unified code, universal code, or single code. It was created to overcome the limitations of traditional character encoding schemes: it assigns a unified, unique code to every character in every language, to meet the needs of cross-language and cross-platform text conversion and processing. You can think of Unicode as a "large character container" that holds all the world's symbols, each with its own unique code, which fundamentally solves the problem of garbled characters. In short, Unicode is an encoding of all symbols [2].

Unicode developed alongside the Universal Character Set standard and is also published in book form. It is an industry standard that organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way. Unicode is still being revised and so far includes more than 100,000 characters. It is widely accepted by the industry and widely used in the internationalization and localization of software.

As noted above, Unicode was created to overcome the limitations of traditional character encoding schemes. Traditional encodings share a common problem: they cannot support multilingual environments, which is unacceptable in the open environment of the Internet. Today almost all computer systems support the basic Latin alphabet, and each also supports various other encodings. To stay compatible with them, Unicode reserves its first 256 codes for the characters defined by ISO 8859-1, so that converting existing Western European text requires no special handling. In addition, a large number of identical characters are encoded more than once under different codes, so that older, complicated encodings can be converted to and from Unicode without losing any information [1].

Implementation method

The Unicode code of a character is fixed, but in actual transmission that code is implemented in different ways, both because system platforms are designed differently and in order to save space. An implementation of Unicode is called a Unicode Transformation Format (UTF for short) [1].

Unicode is a character set with three main implementations: UTF-8, UTF-16, and UTF-32. Because UTF-8 is the current mainstream implementation and UTF-16 and UTF-32 are used comparatively rarely, the rest of this article focuses on UTF-8.
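To make the difference between implementations concrete, the following minimal Java sketch counts how many bytes the same character occupies in UTF-8 and in UTF-16. The class and method names are illustrative only, and UTF-32 is left out because it is not among the charsets that `StandardCharsets` guarantees on every Java runtime.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // "A" is U+0041; the second string is the single character U+4E25.
        for (String s : new String[] {"A", "\u4E25"}) {
            System.out.println(s + ": UTF-8 = " + utf8Length(s)
                    + " byte(s), UTF-16 = " + utf16Length(s) + " byte(s)");
        }
    }
}
```

The Latin letter takes one byte in UTF-8 and two in UTF-16, while the Chinese character takes three bytes in UTF-8 and two in UTF-16: the code is the same, only the byte representation changes.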

UCS

Any discussion of Unicode should also mention UCS. UCS (Universal Character Set) is the standard character set defined by ISO 10646 (also written ISO/IEC 10646). It includes all other character sets and guarantees two-way compatibility with them: if you convert any text string to UCS and then back to its original encoding, no information is lost.

UCS not only assigns a code to each character, but also gives it an official name. Hexadecimal numbers representing a UCS or Unicode value are usually preceded by "U+", for example "U+0041" represents the character "A".
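As a small illustration, Java exposes both the code and the official name of a character. The sketch below formats a code in "U+" notation and looks up its name with `Character.getName`; the helper name `uPlus` is an assumption made for this example.

```java
class CodePointNames {
    // Formats a code in the conventional "U+XXXX" notation (at least four hex digits).
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int cp = "A".codePointAt(0);
        // Prints: U+0041 LATIN CAPITAL LETTER A
        System.out.println(uPlus(cp) + " " + Character.getName(cp));
    }
}
```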

Little endian & Big endian

Because system platforms are designed differently, they may interpret the bytes of a character differently (in their byte order, for example), so the same byte stream can be read as different content. Suppose a character has the hexadecimal value 4E59, which is split into the bytes 4E and 59. A platform that reads the low-order byte first parses the stream as 594E and finds the character 奎, while a platform that reads the high-order byte first parses it as 4E59 and finds the character 乙. In other words, a 乙 saved on one platform becomes 奎 on the other. To avoid this confusion, Unicode encoding distinguishes two byte orders: big-endian, in which the high-order byte comes first, and little-endian, in which the low-order byte comes first. This raises a question: how does a computer know which byte order a given file uses?

The Unicode specification defines a character that is placed at the very start of a file to indicate the byte order. This character is named "ZERO WIDTH NO-BREAK SPACE" and is represented by FEFF. That is exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file uses big-endian order; if the first two bytes are FF FE, the file uses little-endian order.
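The check described above can be sketched in Java as follows. This is a minimal example rather than a complete detector: it only handles two-byte UTF-16 data, and the class and method names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Returns the UTF-16 charset indicated by a leading byte order mark,
    // or null when the data does not start with FE FF or FF FE.
    static Charset detectUtf16(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Decodes UTF-16 text that starts with a byte order mark, skipping the mark itself.
    static String decodeWithBom(byte[] data) {
        Charset cs = detectUtf16(data);
        if (cs == null) {
            throw new IllegalArgumentException("no UTF-16 byte order mark found");
        }
        return new String(data, 2, data.length - 2, cs);
    }

    public static void main(String[] args) {
        byte[] raw = {0x4E, 0x59};
        // The same two bytes give two different characters depending on byte order.
        System.out.println(new String(raw, StandardCharsets.UTF_16BE)); // U+4E59
        System.out.println(new String(raw, StandardCharsets.UTF_16LE)); // U+594E

        // With a byte order mark in front, both layouts decode to the same character.
        byte[] big = {(byte) 0xFE, (byte) 0xFF, 0x4E, 0x59};
        byte[] little = {(byte) 0xFF, (byte) 0xFE, 0x59, 0x4E};
        System.out.println(decodeWithBom(big).equals(decodeWithBom(little))); // true
    }
}
```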

UTF-8

UTF-8 is a variable-length character encoding for Unicode. It uses 1 to 4 bytes per symbol, with the length depending on the symbol. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so existing software that processes ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

The UTF-8 encoding rules are as follows:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, UTF-8 encoding is therefore identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code.

The conversion table is as follows:

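Unicode code range (hexadecimal) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx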

With the conversion table above, UTF-8's encoding rules are easy to read: if the first bit of the first byte is 0, that byte is a complete character by itself; if it is 1, the number of consecutive leading 1s tells you how many bytes the character occupies.
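The leading-bit rule can be sketched as a small Java helper. The class and method names are illustrative, and the helper only inspects the first byte; it does not validate the continuation bytes that follow.

```java
class Utf8LeadByte {
    // Returns how many bytes the UTF-8 sequence starting with this byte occupies,
    // or -1 if the byte cannot start a sequence (for example a 10xxxxxx continuation byte).
    static int sequenceLength(byte first) {
        int b = first & 0xFF;                 // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;     // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;     // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;     // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xE4)); // 3
        System.out.println(sequenceLength((byte) 0xB8)); // -1
    }
}
```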

Take the Chinese character 严 (yan, "strict") as an example to demonstrate UTF-8 encoding [3].

The Unicode code of 严 is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad any leftover positions with 0. This gives the UTF-8 encoding of 严 as "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
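The fill-in procedure above can be written out as bit operations. The following Java sketch is a hand-rolled encoder for illustration only (in practice `String.getBytes(StandardCharsets.UTF_8)` does this work); the class and method names are assumptions, and surrogate codes are not rejected.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ManualUtf8 {
    // Encodes one Unicode code by filling the x bits of the matching UTF-8 template.
    // Sketch only: codes in the surrogate range D800-DFFF are not rejected.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};                  // 0xxxxxxx
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                  // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))};               // 10xxxxxx
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                 // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),         // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))};               // 10xxxxxx
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                     // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),            // 10xxxxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),             // 10xxxxxx
            (byte) (0x80 | (cp & 0x3F))};                   // 10xxxxxx
    }

    // Renders bytes as upper-case hexadecimal without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] builtIn = "\u4E25".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E4B8A5
        System.out.println(Arrays.equals(manual, builtIn)); // true
    }
}
```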

Conversion between Unicode and UTF-8

The example above shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5. The two differ and must be converted by a program. On Windows, the simplest and most intuitive tool for this is Notepad.

The "Encoding (E)" drop-down at the bottom of Notepad's save dialog offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

ANSI: Notepad's default encoding. It uses ASCII for English files and GB2312 for Simplified Chinese files. Note that different ANSI encodings are incompatible with one another: when information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded file.

Unicode: the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses little-endian order.

Unicode big endian: the UCS-2 encoding in big-endian order.

UTF-8: see the UTF-8 section above.

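Saving a file that contains only the character 严 in each of these encodings and inspecting the bytes in a hex viewer gives the following results: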

ANSI: the two bytes "D1 CF" are exactly the GB2312 encoding of 严.

Unicode: four bytes "FF FE 25 4E", where "FF FE" indicates little-endian storage and the actual encoding is "25 4E".

Unicode big endian: four bytes "FE FF 4E 25", where "FE FF" indicates big-endian storage and the actual encoding is "4E 25".

UTF-8: six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that the file is UTF-8 encoded, and the last three bytes "E4 B8 A5" are the encoding of 严 itself, stored in the same order as the encoding.
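These byte sequences can be reproduced in Java by encoding the character U+FEFF followed by U+4E25. This is a sketch, not a description of how Notepad works internally: the class and method names are assumptions, and the ANSI (GB2312) case is left out because that charset is not among those guaranteed by `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NotepadBytes {
    // Encodes U+FEFF followed by the text and renders the bytes as spaced hexadecimal.
    // The leading U+FEFF becomes the byte order mark in the output.
    static String hexWithBom(String text, Charset cs) {
        byte[] bytes = ("\uFEFF" + text).getBytes(cs);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yan = "\u4E25";
        System.out.println(hexWithBom(yan, StandardCharsets.UTF_16LE)); // FF FE 25 4E
        System.out.println(hexWithBom(yan, StandardCharsets.UTF_16BE)); // FE FF 4E 25
        System.out.println(hexWithBom(yan, StandardCharsets.UTF_8));    // EF BB BF E4 B8 A5
    }
}
```

When reading such a file back in Java, the leading U+FEFF may need to be stripped explicitly, since decoding with `StandardCharsets.UTF_8` keeps it as the first character of the resulting string.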