JavaScript character set encoding and decoding in detail

高洛峰

Released: 2017-02-04 09:31:58

1. Character set

1) Characters and bytes

A character is the general term for letters, ideographs, and symbols of all kinds, including those that show up as garbled text. Depending on the encoding, one character occupies 1 to n bytes; one byte is 8 bits, and each bit is either 0 or 1.

2) Character Set

A character set is a collection of characters, and each character set contains a different number of them. Common character sets include ASCII, GB2312, and Unicode.

3) Character Encoding

Encoding converts symbols into computer-readable binary; decoding converts binary back into human-readable symbols.

Most character sets correspond to one encoding method (for example, GBK corresponds to GBK encoding), but there are many Unicode encodings, including UTF-8, UTF-16, UTF-32 and UTF-7.

The encoding most commonly used for web pages is UTF-8. UTF-8 uses one to four bytes to encode each character. It is a superset of ASCII, so existing ASCII text needs no conversion.
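The one-to-four-byte range can be checked directly. The following is a minimal sketch, assuming an environment that provides the standard TextEncoder (modern browsers and Node.js); the function name utf8ByteLengths is made up for illustration.

```javascript
// Report how many bytes UTF-8 needs for each character of a string.
// utf8ByteLengths is a hypothetical helper name, not a built-in.
function utf8ByteLengths(text) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  // Array.from iterates by code point, so characters outside the BMP stay whole.
  return Array.from(text).map((ch) => [ch, encoder.encode(ch).length]);
}

// "A" takes 1 byte, "©" 2 bytes, "码" 3 bytes, "😀" 4 bytes.
console.log(utf8ByteLengths("A©码😀"));
```

The ASCII letter keeps its single-byte form, which is why plain ASCII text is already valid UTF-8.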

2. Number bases in the browser

1) Use decimal and hexadecimal in HTML attributes

In HTML, a decimal character reference is written like "&#56;" (the character "8"), and a hexadecimal one like "&#x5a;" (the character "Z"). The hexadecimal form has an extra x compared with the decimal form, and its digits include the six extra characters a~f to represent 10~15.
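As an illustration of the two notations, the sketch below turns a string into HTML numeric character references, "&#NN;" in decimal or "&#xHH;" in hexadecimal. The helper name toHtmlNumericRefs is an assumption made for this example.

```javascript
// Encode every character of a string as an HTML numeric character reference.
// toHtmlNumericRefs is a hypothetical helper; base must be 10 or 16.
function toHtmlNumericRefs(text, base = 10) {
  if (base !== 10 && base !== 16) {
    throw new RangeError("base must be 10 or 16");
  }
  const prefix = base === 16 ? "&#x" : "&#"; // the hex form carries the extra "x"
  return Array.from(text)
    .map((ch) => prefix + ch.codePointAt(0).toString(base) + ";")
    .join("");
}

console.log(toHtmlNumericRefs("8"));     // &#56;
console.log(toHtmlNumericRefs("Z", 16)); // &#x5a;
```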

2) Use decimal and hexadecimal in CSS attributes

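CSS written inside an HTML style attribute can use HTML's decimal form, because the HTML parser decodes it first. CSS also has its own hexadecimal escape, a backslash followed by the hexadecimal code, such as "\6c" for "l".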
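To make the "\6c" form concrete, here is a rough sketch that converts between plain text and CSS hexadecimal escapes. Both function names are hypothetical, and the decoder covers only the hexadecimal escape (one to six hex digits, optionally followed by one whitespace character), not the rest of CSS escaping.

```javascript
// Encode each character as a CSS hex escape; the trailing space ends the escape
// so that a following hex digit is not swallowed into it.
function toCssEscapes(text) {
  return Array.from(text)
    .map((ch) => "\\" + ch.codePointAt(0).toString(16) + " ")
    .join("");
}

// Decode CSS hex escapes back to characters (simplified sketch).
function decodeCssEscapes(text) {
  return text.replace(/\\([0-9a-fA-F]{1,6})\s?/g, (match, hex) => {
    const codePoint = parseInt(hex, 16);
    // Out-of-range values become U+FFFD instead of throwing.
    if (codePoint === 0 || codePoint > 0x10ffff) return "\ufffd";
    return String.fromCodePoint(codePoint);
  });
}

console.log(toCssEscapes("l"));            // \6c followed by a space
console.log(decodeCssEscapes("\\6c ink")); // link
```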

3) JavaScript encoding encapsulation

JavaScript strings accept octal and hexadecimal escapes, and code written with them can be executed directly through eval. Octal is written as "\56" and hexadecimal as "\x5c".

Chinese characters in the code can only be hex-encoded as Unicode escapes, written in the form "\u4ee3\u7801".
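The hexadecimal and Unicode escape forms can be tried as follows. This is a small sketch; toUnicodeEscapes is a made-up helper name. The octal form is left out on purpose, since octal escapes such as "\56" are rejected in strict mode and in template literals.

```javascript
// Convert every UTF-16 code unit of a string to a \uXXXX escape.
// toUnicodeEscapes is a hypothetical helper, not a built-in.
function toUnicodeEscapes(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += "\\u" + text.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return out;
}

console.log("\x5c" === "\\");          // true: \x5c is the backslash
console.log("\u4ee3\u7801");           // 代码
console.log(toUnicodeEscapes("代码")); // \u4ee3\u7801
```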

The book "Web Front-end Hacking Technology Revealed" wraps encoding and decoding into two methods, which rely mainly on the two calls below.

The core calls are "str.charCodeAt(index).toString(base)" and "String.fromCharCode(parseInt(code, base))".

The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

The static String.fromCharCode() method returns a string created from the specified sequence of UTF-16 code units.
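The two core calls can be combined into a pair of helpers along these lines. This is only a sketch of the idea, not the book's actual code; the names encodeToBase and decodeFromBase are assumptions.

```javascript
// Encode: one number string per UTF-16 code unit, written in the given base.
function encodeToBase(text, base = 16) {
  const codes = [];
  for (let i = 0; i < text.length; i++) {
    codes.push(text.charCodeAt(i).toString(base));
  }
  return codes;
}

// Decode: parse each number string in the same base and rebuild the string.
function decodeFromBase(codes, base = 16) {
  return codes
    .map((code) => String.fromCharCode(parseInt(code, base)))
    .join("");
}

console.log(encodeToBase("代码"));             // [ '4ee3', '7801' ]
console.log(encodeToBase(".", 8));             // [ '56' ]
console.log(decodeFromBase(["4ee3", "7801"])); // 代码
```

Because both helpers work on UTF-16 code units, a character outside the Basic Multilingual Plane is encoded as two values (a surrogate pair) and decodes back correctly as long as both are kept.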

You can also encode and decode online with the "MonyerJS" web page.

4) HTML automatic decoding mechanism

For example, if you write "hello" in a web page as hexadecimal character references (such as "&#x68;&#x65;&#x6c;&#x6c;&#x6f;"), the browser automatically decodes it into "hello".

The well-known space entity "&nbsp;" relies on the same mechanism.
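The numeric part of this decoding can be approximated in JavaScript. The sketch below handles only decimal and hexadecimal references that end with a semicolon; a real HTML parser also handles named entities such as the non-breaking space and several edge cases. The name decodeNumericRefs is hypothetical.

```javascript
// Decode HTML numeric character references (&#NN; and &#xHH;) in a string.
// decodeNumericRefs is a simplified, hypothetical stand-in for the browser's decoder.
function decodeNumericRefs(html) {
  return html.replace(/&#(?:x([0-9a-f]+)|([0-9]+));/gi, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Values outside the Unicode range become U+FFFD instead of throwing.
    if (codePoint > 0x10ffff) return "\ufffd";
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericRefs("&#x68;&#x65;&#x6c;&#x6c;&#x6f;")); // hello
console.log(decodeNumericRefs("&#56;"));                          // 8
```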

3. Browser encoding

There are three pairs of functions in JavaScript that can encode and decode strings, namely:

escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.

They differ mainly in which characters they leave unencoded.

1) There are 69 characters that are not encoded by escape

*, +, -, ., /, @, _, 0~9, a~z, A~Z

When escape encodes Unicode values outside the range 0 to 255, it outputs the %u**** format.

2) There are 82 characters that are not encoded by encodeURI

!, #, $, &, ', (, ), *, +, ,, -, ., /, :, ;, =, ?, @, _, ~, 0~9, a~z, A~Z

3) There are 71 characters that encodeURIComponent does not encode

!, ', (, ), *, -, ., _, ~, 0~9, a~z, A~Z
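The three counts can be verified by running every ASCII character through each function and counting the ones that come back unchanged. This is a small sketch, with countUnencodedAscii as a made-up helper name. It assumes the legacy escape function is available, which is the case in browsers and Node.js.

```javascript
// Count the ASCII characters (codes 0-127) that an encoder returns unchanged.
// countUnencodedAscii is a hypothetical helper used only for this check.
function countUnencodedAscii(encode) {
  let count = 0;
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (encode(ch) === ch) count++;
  }
  return count;
}

console.log(countUnencodedAscii(escape));             // 69
console.log(countUnencodedAscii(encodeURI));          // 82
console.log(countUnencodedAscii(encodeURIComponent)); // 71

const sample = "a b&c/d?e=1#f";
console.log(encodeURI(sample));          // a%20b&c/d?e=1#f
console.log(encodeURIComponent(sample)); // a%20b%26c%2Fd%3Fe%3D1%23f
console.log(escape("代码"));             // %u4EE3%u7801
console.log(encodeURIComponent("代码")); // %E4%BB%A3%E7%A0%81
```

The sample shows the practical consequence: encodeURI keeps URL delimiters such as "&", "/", "?", "=", and "#" intact, while encodeURIComponent encodes them, so the latter suits individual query values.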
