How to Extract UCS-2 Code Points from UTF-8 Strings?

Barbara Streisand

Released: 2024-11-01 17:45:30

Determining UCS-2 Code Points for UTF-8 Characters

Some programs need the UCS-2 code points of the characters in a UTF-8 string. There are two ways to obtain them: use built-in conversion utilities, or decode the UTF-8 byte sequences directly, which requires understanding the encoding format.
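As a minimal sketch of the built-in route, assuming Python 3: the standard `bytes.decode` method and the `ord` function do all the work. The helper name `ucs2_code_points` is made up for this illustration, and rejecting characters above U+FFFF is one possible policy, not the only one.

```python
def ucs2_code_points(data: bytes) -> list:
    """Return the code points of a UTF-8 byte string, limited to UCS-2.

    Hypothetical helper for illustration. Raises ValueError for any
    character above U+FFFF, which UCS-2 cannot represent.
    """
    code_points = []
    # bytes.decode raises UnicodeDecodeError on malformed UTF-8.
    for ch in data.decode("utf-8"):
        cp = ord(ch)  # ord() returns the Unicode code point of one character
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the UCS-2 range")
        code_points.append(cp)
    return code_points


# "A", "é" (U+00E9) and "€" (U+20AC) encoded as UTF-8
sample = "A\u00e9\u20ac".encode("utf-8")
print([f"U+{cp:04X}" for cp in ucs2_code_points(sample)])
# prints ['U+0041', 'U+00E9', 'U+20AC']
```

When a built-in decoder is available, this is usually the safer choice, since it also rejects malformed input.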

UTF-8 encodes characters using a variable-length byte sequence. Each code point is represented by 1 to 4 bytes, depending on its value. The following ranges apply:

U+0000 to U+007F: 1 byte (0xxxxxxx)
U+0080 to U+07FF: 2 bytes (110xxxxx 10xxxxxx)
U+0800 to U+FFFF: 3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
U+10000 to U+10FFFF: 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
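The ranges can be expressed as a small lookup, sketched here in Python. The function name `utf8_length` is an assumption for illustration; the loop cross-checks it against Python's own UTF-8 encoder at each range boundary.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses to encode code point cp (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# Compare with the built-in encoder at the edges of every range.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```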

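To determine how many bytes encode a character, examine the leading bits of its first byte:

0xxxxxxx (0x00 to 0x7F): 1 byte
110xxxxx (0xC0 to 0xDF): 2 bytes
1110xxxx (0xE0 to 0xEF): 3 bytes
11110xxx (0xF0 to 0xF7): 4 bytes
10xxxxxx (0x80 to 0xBF): Continuation byte, never the first byte of a sequence
11111xxx (0xF8 to 0xFF): Invalid in UTF-8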
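The first-byte check comes down to a few bit masks. The following Python sketch uses a hypothetical function name, `sequence_length`, and returns 0 for a continuation byte purely as a convention of this example.

```python
def sequence_length(first_byte: int) -> int:
    """Classify a byte by its leading bits (illustrative helper).

    Returns the length of the UTF-8 sequence the byte starts,
    or 0 if it is a continuation byte. Raises ValueError for 11111xxx.
    """
    if (first_byte & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (first_byte & 0xC0) == 0x80:  # 10xxxxxx: continuation byte
        return 0
    if (first_byte & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:  # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:  # 11110xxx
        return 4
    raise ValueError(f"0x{first_byte:02X} cannot appear in UTF-8")


# The euro sign U+20AC is encoded as E2 82 AC, so its first byte announces 3 bytes.
print(sequence_length("\u20ac".encode("utf-8")[0]))  # prints 3
```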

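Once the byte count is known, the code point can be extracted through bit manipulation: mask off the length prefix of the first byte, then append the low six bits of each continuation byte. Note that UCS-2 covers only U+0000 to U+FFFF, so it cannot represent any character that UTF-8 encodes in 4 bytes.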
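Putting the pieces together, a hand-written decoder might look like the Python sketch below. The name `decode_ucs2` and the choice to raise `ValueError` are assumptions of this example. It is deliberately simplified: it does not reject overlong encodings or surrogate code points, which a production decoder should.

```python
def decode_ucs2(data: bytes) -> list:
    """Decode UTF-8 bytes into UCS-2 code points by bit manipulation (sketch)."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # 0xxxxxxx: 7 payload bits, no continuation
            cp, extra = b, 0
        elif (b & 0xE0) == 0xC0:      # 110xxxxx: 5 payload bits + 1 continuation
            cp, extra = b & 0x1F, 1
        elif (b & 0xF0) == 0xE0:      # 1110xxxx: 4 payload bits + 2 continuations
            cp, extra = b & 0x0F, 2
        elif (b & 0xF8) == 0xF0:      # 11110xxx: well-formed value is U+10000 or above
            raise ValueError("4-byte sequence does not fit in UCS-2")
        else:                         # stray continuation byte or 11111xxx
            raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")

        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)  # append the low 6 bits of each continuation byte
        result.append(cp)
        i += 1 + extra
    return result


sample = "A\u00e9\u20ac".encode("utf-8")  # bytes 41, C3 A9, E2 82 AC
print([f"U+{cp:04X}" for cp in decode_ucs2(sample)])
# prints ['U+0041', 'U+00E9', 'U+20AC']
```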
