Detailed introduction to Unicode and JavaScript, with code examples
黄舟
Release: 2017-03-14 15:21:38
Unicode is a widely used character set, so how well does JavaScript support it? This article discusses the JavaScript language's support for the Unicode character set. I hope readers come away understanding the concept and usage of character sets in JavaScript at a fundamental level.
1. What is Unicode?
Unicode originated from a very simple idea: include all the characters in the world in one set. As long as the computer supports this character set, it can display all characters, and there will no longer be garbled characters.
It starts from 0 and assigns a number to each symbol, which is called a "code point". For example, the symbol for code point 0 is null (indicating that all binary bits are 0).
U+0000 = null
In the notation above, U+ indicates that the hexadecimal number immediately following it is a Unicode code point.
At the time of writing, the latest version of Unicode is 7.0, which contains 109,449 symbols in total, 74,500 of them Chinese, Japanese, and Korean (CJK) characters. Roughly speaking, more than two-thirds of the world's existing symbols come from East Asian scripts. For example, the code point of the Chinese character for "good" is 597D in hexadecimal.
U+597D = 好
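The mapping between a character and its code point can be checked directly in JavaScript. The following is a minimal sketch, assuming an ES2015+ runtime that provides String.prototype.codePointAt and String.fromCodePoint; the helper name toUPlus is made up for this illustration.

```javascript
// Format a code point in the conventional U+XXXX notation (at least four hex digits).
// toUPlus is a hypothetical helper name, not a built-in.
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

const good = "好";
const codePoint = good.codePointAt(0); // 22909 in decimal, 0x597D in hexadecimal

console.log(toUPlus(codePoint));               // U+597D
console.log(String.fromCodePoint(0x597d));     // 好
console.log(toUPlus("\u0000".codePointAt(0))); // U+0000
```

The number itself is all that Unicode assigns; how that number is stored as bytes is a separate question.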
With so many symbols, Unicode was not defined all at once but in partitions. Each partition holds 65,536 (2^16) code points and is called a plane. There are currently 17 planes, numbered 0 to 16, so the whole character set has room for 17 × 2^16 = 1,114,112 code points, and any code point fits in 21 bits.
The first 65,536 code points make up the basic plane (the Basic Multilingual Plane, abbreviated BMP). Its code points run from 0 to 2^16 - 1, written in hexadecimal as U+0000 to U+FFFF. The most common characters all live in this plane, which was the first one Unicode defined and published.
The remaining characters are placed in the supplementary planes (planes 1 to 16; the first of them is the Supplementary Multilingual Plane, abbreviated SMP), with code points ranging from U+010000 to U+10FFFF.
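The plane arithmetic is easy to verify with a few lines of JavaScript. This is only an illustrative sketch: the constant and function names (PLANE_SIZE, PLANE_COUNT, planeOf) are invented for the example and are not part of any standard API.

```javascript
const PLANE_SIZE = 0x10000; // 65536 = 2^16 code points per plane
const PLANE_COUNT = 17;     // planes 0 through 16

// Return the plane number (0 to 16) that a code point belongs to.
function planeOf(codePoint) {
  const limit = PLANE_SIZE * PLANE_COUNT;
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint >= limit) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(PLANE_SIZE * PLANE_COUNT); // 1114112, covering U+0000 through U+10FFFF
console.log(planeOf(0x597d));   // 0  (basic plane)
console.log(planeOf(0xffff));   // 0  (last code point of the basic plane)
console.log(planeOf(0x10000));  // 1  (first supplementary code point)
console.log(planeOf(0x10ffff)); // 16 (last code point overall)
```

In other words, the plane number is just the part of the code point above its lowest 16 bits.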
2. UTF-32 and UTF-8
Unicode only specifies the code point of each character. Which sequence of bytes represents that code point is a matter of the encoding method.
The most intuitive encoding represents every code point with four bytes whose content corresponds one-to-one with the code point. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point 597D is padded with two leading zero bytes.
U+0000 = 0x0000 0000
U+597D = 0x0000 597D
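The four-byte layout shown above can be reproduced with a DataView. This sketch assumes big-endian byte order (most significant byte first), which matches the way the values are written above; the function names encodeUtf32BE and toHex are hypothetical.

```javascript
// Encode one code point as four bytes, most significant byte first (UTF-32BE).
function encodeUtf32BE(codePoint) {
  const bytes = new Uint8Array(4);
  new DataView(bytes.buffer).setUint32(0, codePoint, false); // false = big-endian
  return bytes;
}

// Render bytes as two-digit hex pairs for display.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(encodeUtf32BE(0x0000))); // 00 00 00 00
console.log(toHex(encodeUtf32BE(0x597d))); // 00 00 59 7d
```

Because every code point takes exactly four bytes, the n-th character of a UTF-32 string always starts at byte offset 4 × n.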
The advantage of UTF-32 is that the conversion rule is simple and intuitive, and lookup is efficient because every character has the same width. The disadvantage is that it wastes space: the same English text is four times as large as in ASCII. This drawback is serious enough that UTF-32 is rarely used in practice for storing or transmitting text, and the HTML5 standard advises against encoding web pages as UTF-32.
What people really needed was a space-saving encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding in which each character takes from 1 to 4 bytes. The more commonly used a character is, the fewer bytes it takes. The first 128 characters take only 1 byte each and are encoded exactly as in ASCII.
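The variable length of UTF-8 can be observed with the standard TextEncoder API, which always produces UTF-8. This is a rough sketch that assumes a runtime where TextEncoder is available globally (modern browsers and recent Node.js versions); utf8Length is a made-up helper name.

```javascript
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Number of bytes a string occupies when encoded as UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

console.log(utf8Length("A"));         // 1 (the same single byte as in ASCII, 0x41)
console.log(utf8Length("\u00e9"));    // 2 (é, U+00E9)
console.log(utf8Length("\u597d"));    // 3 (好, U+597D)
console.log(utf8Length("\u{1F600}")); // 4 (U+1F600, a supplementary-plane character)

// The same English text in UTF-8 versus four bytes per code point (UTF-32).
const text = "hello";
console.log(utf8Length(text));     // 5
console.log([...text].length * 4); // 20
```

For plain English text, UTF-8 is as compact as ASCII, while UTF-32 takes four times the space.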