A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common character sets include ASCII, GB2312, BIG5, GB 18030, and Unicode. For a computer to process text in these character sets accurately, the characters must be encoded so that the computer can recognize and store them.
Chinese has a very large number of characters and is written in two forms with different writing rules: Simplified Chinese and Traditional Chinese. Computers were originally designed around single-byte English characters, so the encoding of Chinese characters is the technical basis of Chinese information exchange. This article discusses several typical character sets in chronological order, including several representative Chinese character sets, and looks at the origin, characteristics, and technical features of each.
ASCII character set
1. Origin of the name
ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet.
2. Features
It is mainly used to display modern English and other Western European languages. It is the most common single-byte encoding system today and is equivalent to the international standard ISO 646.
3. Contents
Control characters: carriage return, backspace, line feed, etc.
Printable characters: uppercase and lowercase English letters, Arabic numerals, and Western punctuation and symbols.
4. Technical characteristics
Each character is represented by 7 bits, for a total of 128 characters.
5. Extended ASCII character set
A 7-bit character set can only support 128 characters. To represent more of the characters commonly used in Europe, ASCII was extended. The extended ASCII character set uses 8 bits per character, for a total of 256 characters.
The symbols added by the extended ASCII character set include box-drawing symbols, mathematical symbols, Greek letters, and special Latin symbols.
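As a small illustration of the 7-bit limit, the following Python sketch checks whether every character of a string has a code below 128. Python is used here only for illustration, and the helper name is_ascii_text is hypothetical, not something from the original text.

```python
def is_ascii_text(text: str) -> bool:
    """Return True if every character fits in 7 bits (code below 128)."""
    return all(ord(ch) < 128 for ch in text)


# The letter 'A' is stored as the single byte 0x41.
encoded = "A".encode("ascii")
print(encoded.hex())                  # 41
print(is_ascii_text("Hello"))         # True
print(is_ascii_text("caf\u00e9"))     # False: U+00E9 is outside the 7-bit range
```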
GB2312 character set
1. Origin of the name
GB2312, also known as the GB2312-80 character set, has the full name "Chinese Character Coded Character Set for Information Exchange: Basic Set". It was issued by the former State Administration of Standards of China and took effect on May 1, 1981.
2. Features
GB2312 is China's national standard character set for Simplified Chinese. The Chinese characters it contains cover 99.75% of usage frequency, which basically meets the needs of computer processing of Chinese characters. It is widely used in mainland China and Singapore.
3. Contents
GB2312 contains simplified Chinese characters together with general symbols, serial numbers, digits, Latin letters, Japanese kana, Greek letters, Russian letters, pinyin symbols, and Chinese phonetic (zhuyin) letters, a total of 7445 graphic characters. Of these, 6763 are Chinese characters (3755 first-level and 3008 second-level), and 682 are full-width characters such as Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.
4. Technical features
(1) Partition representation:
GB2312 divides the characters it contains into zones, and each zone holds 94 Chinese characters or symbols. This representation is also called the zone-position code.
The zones are assigned as follows: zones 01-09 hold special symbols; zones 16-55 hold first-level Chinese characters, sorted by pinyin; zones 56-87 hold second-level Chinese characters, sorted by radical and stroke count; zones 10-15 and 88-94 are unassigned.
(2) Double-byte representation
Each character is stored in two bytes. The first byte is customarily called the "high byte" and the second byte the "low byte".
The high byte uses 0xA1-0xF7 (the zone number 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (the position number 01-94 plus 0xA0).
5. Encoding example
Take "啊", the first Chinese character in the GB2312 character set, as an example. Its zone number is 16 and its position number is 01, so its zone-position code is 1601. Most programs add 0xA0 to both the high byte and the low byte to obtain the internal code used for processing, which here is 0xB0A1. The calculation is: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1.
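The calculation above can be sketched in Python as follows. The function name quwei_to_gb2312 is hypothetical and chosen for this illustration; the result is checked against Python's built-in gb2312 codec.

```python
def quwei_to_gb2312(zone: int, position: int) -> bytes:
    """Convert a zone/position pair (each 1-94) to the two GB2312 bytes."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must both be in 1..94")
    # Both bytes are offset by 0xA0, as described in the text.
    return bytes([0xA0 + zone, 0xA0 + position])


code = quwei_to_gb2312(16, 1)
print(code.hex().upper())       # B0A1
print(code.decode("gb2312"))    # 啊
```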
BIG5 character set
1. Origin of the name
Also known as the Big Five code (大五码 or 五大码), it was created in 1984 by Taiwan's Institute for Information Industry together with five software companies: Acer, MiTAC, JiaJia (佳佳), Zero One, and FIC, hence the name Big Five.
Big5 was created for two reasons. First, vendors in Taiwan at the time had each released their own encodings, such as the ETen code (倚天码), IBM PS55, and the Wang code (王安码), which were incompatible with one another. Second, the Taiwan government had not yet released an official Chinese character encoding, and mainland China's GB2312 encoding did not include Traditional Chinese characters.
2. Features
The Big5 character set contains a total of 13,053 Chinese characters. This character set is used in Taiwan, China. Curiously, two characters are included twice: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).
3. Character encoding method
Big5 uses double-byte storage, encoding each character in two bytes. The first byte is called the "high byte" and the second byte the "low byte". The high byte ranges over 0xA1-0xF9, and the low byte ranges over 0x40-0x7E and 0xA1-0xFE.
The character types in each encoding range are as follows: 0xA140-0xA3BF holds punctuation marks, Greek letters, and special symbols, and within it 0xA259-0xA261 holds the single characters for two-syllable units of measurement: 兙兛兞兝兡兣嗧瓩糎; 0xA440-0xC67E holds commonly used Chinese characters, sorted first by stroke count and then by radical; 0xC940-0xF9D5 holds less commonly used Chinese characters, sorted in the same way.
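The byte ranges above can be turned into a simple structural check. This is a minimal Python sketch with a hypothetical helper name (is_big5_pair); it only tests whether two bytes fall inside the stated ranges, not whether a character is actually assigned at that position.

```python
def is_big5_pair(high: int, low: int) -> bool:
    """Check a byte pair against the Big5 high/low byte ranges given above."""
    high_ok = 0xA1 <= high <= 0xF9
    low_ok = 0x40 <= low <= 0x7E or 0xA1 <= low <= 0xFE
    return high_ok and low_ok


# Encode one common character with Python's built-in big5 codec
# and confirm that its two bytes fall inside the ranges.
high, low = "中".encode("big5")
print(is_big5_pair(high, low))      # True
print(is_big5_pair(0xA4, 0x7F))     # False: 0x7F is not a valid low byte
```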
4. Limitations of Big5
Although Big5 contains more than 10,000 characters, it does not cover many characters in everyday use, such as those found in personal names, place names, dialects, chemistry, and biology. It also does not include Japanese hiragana and katakana.
For example, Taiwan treats "着" as a variant form of "著", so "着" is not included. Some radicals from the Kangxi Dictionary (such as "亠", "疒", "辵", "癶") and characters common in personal names (such as "堃", "煊", "栢", "喆") are not included in Big5 either.
GB18030 character set
1. Origin of the name
The full name of GB 18030 is GB18030-2000 "Extension of the Basic Set of the Chinese Coded Character Set for Information Interchange". It is a national Chinese character encoding standard released by the Chinese government on March 17, 2000. Software released on the Chinese market after August 31, 2001 must comply with this standard.
2. Features
The GB 18030 standard was developed with broad participation and review by well-known information technology companies at home and abroad, and was jointly issued and put into force by the Ministry of Information Industry and the former State Administration of Quality and Technical Supervision.
The GB 18030 standard solves the problem of encoding a large character set made up of Chinese characters, Japanese kana, Korean, and the scripts of China's ethnic minorities. Its total encoding space exceeds 1.5 million code positions, and it contains 27,484 Chinese characters, covering Chinese, Japanese, Korean, and Chinese minority scripts. It meets the needs of information exchange in East Asia (mainland China, Hong Kong, Taiwan, Japan, and South Korea) for a multilingual, large-repertoire, multi-purpose, unified encoding format. It is compatible with Unicode 3.0 and includes the Unicode extension block "CJK Unified Ideographs Extension A". It is also compatible with the earlier national character encoding standards GB2312 and GB13000.1.
3. Encoding method
The GB 18030 standard encodes characters in three forms: single-byte, double-byte, and four-byte. The single-byte part uses 0x00 to 0x7F (matching the corresponding ASCII codes). In the double-byte part, the first byte ranges from 0x81 to 0xFE, and the second byte ranges from 0x40 to 0x7E and from 0x80 to 0xFE. The four-byte part uses 0x30 to 0x39, which are not used in GB/T 11383, as the extension suffix for double-byte codes. The resulting four-byte codes range from 0x81308130 to 0xFE39FE39: the first and third bytes are each in 0x81 to 0xFE, and the second and fourth bytes are each in 0x30 to 0x39.
4. Contents
The double-byte part mainly contains all 20,902 CJK Chinese characters of GB13000.1, 13 related punctuation marks, the ideographic description characters, 80 supplementary Chinese characters and radicals/components, and the double-byte-encoded euro sign, among others. The four-byte part contains all remaining characters of GB 13000.1, including CJK Unified Ideographs Extension A, that are not covered by the double-byte part.
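To make the three forms concrete, the following Python sketch classifies one encoded character by the byte ranges listed above. The function name classify_gb18030 is hypothetical, and the check is purely structural; the sample characters are encoded with Python's built-in gb18030 codec.

```python
def classify_gb18030(seq: bytes) -> str:
    """Classify one encoded character using the byte ranges described above."""
    if len(seq) == 1 and seq[0] <= 0x7F:
        return "single-byte"
    if len(seq) == 2 and 0x81 <= seq[0] <= 0xFE and (
        0x40 <= seq[1] <= 0x7E or 0x80 <= seq[1] <= 0xFE
    ):
        return "double-byte"
    if (
        len(seq) == 4
        and all(0x81 <= b <= 0xFE for b in (seq[0], seq[2]))
        and all(0x30 <= b <= 0x39 for b in (seq[1], seq[3]))
    ):
        return "four-byte"
    return "invalid"


# An ASCII letter, a common Chinese character, and a supplementary-plane character.
for ch in ("A", "中", "\U00020000"):
    print(f"U+{ord(ch):04X}", classify_gb18030(ch.encode("gb18030")))
```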
Unicode Character Set
1. Origin of the name
Unicode corresponds to the Universal Multiple-Octet Coded Character Set (UCS) defined by ISO/IEC 10646. It is a character encoding system developed by an organization called the Unicode Consortium to support the exchange, processing, and display of written text in the world's languages. Development began in 1990 and the encoding was officially announced in 1994. At the time of writing, the latest version was Unicode 4.1.0, released on March 31, 2005.
2. Features
Unicode is a character encoding used on computers. It assigns a single, unique binary code to every character in every language, to meet the needs of cross-language and cross-platform text conversion and processing.
3. Encoding method
The Unicode standard always writes code points as hexadecimal numbers prefixed with "U+". For example, the code of the letter "A" is 0041 (hexadecimal) and the code of the character "€" is 20AC (hexadecimal), so the code of "A" is written as "U+0041".
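The notation can be reproduced with a one-line Python helper. The name code_point_notation is hypothetical; it formats a character's code point as at least four uppercase hexadecimal digits after "U+".

```python
def code_point_notation(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point_notation("A"))    # U+0041
print(code_point_notation("€"))    # U+20AC
```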
4. UTF-8 encoding
UTF-8 is one of the ways of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of converting Unicode code points into a particular byte format.
UTF-8 makes it easier to transmit text in different languages between computers over a network, because it allows Unicode text to pass correctly through existing systems that were built to handle single-byte characters.
UTF-8 stores Unicode characters in a variable number of bytes. ASCII letters still take 1 byte; accented letters, Greek letters, and Cyrillic letters take 2 bytes; commonly used Chinese characters take 3 bytes; and supplementary-plane characters take 4 bytes.
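The byte counts above can be checked with Python's built-in utf-8 codec. This is a small sketch; the helper name utf8_length and the sample characters are chosen here for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


print(utf8_length("A"))             # 1  (ASCII letter)
print(utf8_length("\u00e9"))        # 2  (accented letter é)
print(utf8_length("Ω"))             # 2  (Greek letter)
print(utf8_length("Я"))             # 2  (Cyrillic letter)
print(utf8_length("中"))            # 3  (common Chinese character)
print(utf8_length("\U00020000"))    # 4  (supplementary-plane character)
```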
5. UTF-16 and UTF-32 encoding
UTF-32, UTF-16, and UTF-8 are all character encoding schemes for the Unicode coded character set. UTF-16 encodes each Unicode code point as a sequence of one or two 16-bit code units; UTF-32 represents each Unicode code point as a 32-bit integer of the same value.
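A short Python sketch of the difference, using the built-in big-endian codecs so that no byte order mark is added. The helper names utf16_units and utf32_value are hypothetical.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2


def utf32_value(ch: str) -> int:
    """Return the 32-bit integer that UTF-32 stores for one character."""
    return int.from_bytes(ch.encode("utf-32-be"), "big")


print(utf16_units("A"))                  # 1
print(utf16_units("\U00020000"))         # 2  (a surrogate pair)
print(hex(utf32_value("€")))             # 0x20ac
print(hex(utf32_value("\U00020000")))    # 0x20000
```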
Solutions to garbled-text problems in PHP applications
1) Use the <meta http-equiv="content-type" content="text/html; charset=xxx"> tag to set the page encoding
This tag declares which character set encoding the client's browser should use to display the page. xxx can be GB2312, GBK, UTF-8 (note that MySQL spells it differently: UTF8), and so on. Most pages can therefore use this method to tell the browser which encoding to use, which avoids encoding errors and garbled characters. Sometimes, however, this tag has no effect: whatever xxx is, the browser always uses the same encoding. The reason is explained below.
Note that the <meta> tag is part of the HTML content and is only a declaration; it can take effect only once the server has passed the HTML to the browser.
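To see why a wrong charset declaration produces garbled text, the following Python sketch decodes the same bytes with two different encodings. Python is used only for illustration; the variable names are hypothetical, and errors="replace" is used so that byte pairs with no GBK mapping do not raise an exception.

```python
text = "中文"
raw = text.encode("utf-8")      # the bytes the server actually sends

# The browser is told (or guesses) GBK: the bytes are paired up differently
# and the result is garbled.
garbled = raw.decode("gbk", errors="replace")

# The browser is told UTF-8: the original text is recovered.
correct = raw.decode("utf-8")

print(garbled)
print(correct)    # 中文
```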
2) header("content-type:text/html; charset=xxx");
The header() function sends the string in the parentheses as an HTTP header. With the content shown above, it does basically the same job as the <meta> tag, and the two look similar when compared. The difference is that when this function is used, the browser always uses the xxx encoding you asked for, which makes this function very useful. To see why, we need to look at the difference between the HTTP header and the HTML content:
The HTTP header is a string that the server sends, using the HTTP protocol, before it sends the HTML content to the browser. The <meta> tag belongs to the HTML content, so what header() sends reaches the browser first. Put simply, header() has a higher priority (if it can be put that way). If a PHP page contains both header("content-type:text/html;charset=xxx") and the <meta> tag, the browser honors only the former, the HTTP header, and ignores the meta. Of course, this function can only be used inside PHP pages.
One question remains: why does the former always work while the latter sometimes does not? That is why Apache is discussed next.
3) AddDefaultCharset
The conf folder in the Apache root directory contains Apache's main configuration file, httpd.conf.
Open httpd.conf in a text editor. Line 708 (this may differ between versions) contains AddDefaultCharset xxx, where xxx is the encoding name. This line sets the character set in the HTTP header of every web page file on the server to the default xxx. Having this line is equivalent to adding header("content-type: text/html; charset=xxx") to every file. This explains why the browser may always use gb2312 even though the meta tag specifies utf-8.
If the page itself calls header("content-type:text/html; charset=xxx"), the default character set is replaced by the one you set, which is why this function always works. If you put a "#" in front of AddDefaultCharset xxx to comment the line out, and the page does not call header("content-type..."), then it is the meta tag's turn to take effect.
The priority order described above, from highest to lowest, is:
header("content-type:text/html; charset=xxx")
AddDefaultCharset xxx
<meta http-equiv="content-type" content="text/html; charset=xxx">
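The priority order can be modeled in a few lines. This is only a sketch of the decision described in this section, written in Python rather than PHP and not taken from any browser or server. The function names and the three parameters are hypothetical; the charset is read from the header value with the standard library's email.message.Message.

```python
from email.message import Message
from typing import Optional


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if present."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_charset(
    script_header: Optional[str],     # value passed to header() by the page
    server_default: Optional[str],    # AddDefaultCharset value, if enabled
    meta_charset: Optional[str],      # charset declared in the meta tag
) -> Optional[str]:
    """Pick the charset by the priority: header(), server default, meta."""
    return charset_from_header(script_header) or server_default or meta_charset


print(effective_charset("text/html; charset=utf-8", "gb2312", "big5"))    # utf-8
print(effective_charset(None, "gb2312", "big5"))                          # gb2312
print(effective_charset(None, None, "big5"))                              # big5
```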
If you are a web programmer, it is recommended that you add header("content-type:text/html;charset=xxx") to each of your pages, so that they display correctly on any server and are highly portable.
4) default_charset setting in php.ini:
default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment out this line and let the browser choose the encoding from the charset in the page header rather than forcing one. This way, web services in multiple languages can be provided on the same server.