How to Convert UTF-8 Characters to UCS-2 Code Points in PHP?

Linda Hamilton

Released: 2024-11-03 02:09:29

Converting UTF-8 Characters to UCS-2 Code Points

This article explains how to extract the UCS-2 code point of each character in a UTF-8 string, and gives an implementation that works in PHP 4 and PHP 5.

Understanding UTF-8

UTF-8 is a character encoding standard that represents Unicode characters using one to four bytes. To determine the number of bytes for a particular character, examine the leading byte:

0xxxxxxx: 1-byte character
110xxxxx: 2-byte character
1110xxxx: 3-byte character
11110xxx: 4-byte character
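
The leading-byte rules above can be sketched as a small helper. The name utf8_sequence_length is hypothetical (it is neither a PHP built-in nor part of the original article), and the sketch inspects only the first byte; it does not validate the bytes that follow.

<code class="php">// Hypothetical helper: return how many bytes the UTF-8 sequence that starts
// with the first byte of $byte occupies, or 0 if that byte cannot start one.
function utf8_sequence_length($byte) {
    $b = ord($byte); // ord() looks at the first byte only
    if (($b & 0x80) == 0x00) {
        return 1; // 0xxxxxxx
    } elseif (($b & 0xE0) == 0xC0) {
        return 2; // 110xxxxx
    } elseif (($b & 0xF0) == 0xE0) {
        return 3; // 1110xxxx
    } elseif (($b & 0xF8) == 0xF0) {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx (continuation byte) or an invalid lead byte
}</code>

For instance, utf8_sequence_length("\xE2\x82\xAC") returns 3, because 0xE2 matches the 1110xxxx pattern.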

Converting to UCS-2

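UCS-2 is a fixed-width, two-byte encoding that can represent only the code points U+0000 through U+FFFF (the Basic Multilingual Plane). It is often confused with UTF-16, which extends it with surrogate pairs to cover the rest of Unicode. To convert from UTF-8, strip the marker bits from each byte (every continuation byte has the form 10xxxxxx and carries six payload bits) and combine what remains, according to the number of bytes in the sequence: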

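1-byte character: The code point is the byte value itself.
2-byte character: Mask the first byte with 0x1F and shift it left by 6 bits, then bitwise OR it with the second byte masked with 0x3F.
3-byte character: Mask the first byte with 0x0F and shift it left by 12 bits, mask the second byte with 0x3F and shift it left by 6 bits, then bitwise OR both with the third byte masked with 0x3F.
4-byte character: The code point lies above U+FFFF and cannot be represented in UCS-2.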
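
As a worked example of these rules, the snippet below decodes "é" (bytes 0xC3 0xA9) and "€" (bytes 0xE2 0x82 0xAC) by hand. The variable names are illustrative only. Each byte is masked before it is shifted, so that the UTF-8 marker bits do not leak into the result.

<code class="php">// "é" is the 2-byte sequence 0xC3 0xA9.
// 0xC3 = 11000011 carries the payload 00011; 0xA9 = 10101001 carries 101001.
$e_acute = ((0xC3 & 0x1F) << 6) | (0xA9 & 0x3F);
printf("U+%04X\n", $e_acute); // U+00E9

// "€" is the 3-byte sequence 0xE2 0x82 0xAC.
$euro = ((0xE2 & 0x0F) << 12) | ((0x82 & 0x3F) << 6) | (0xAC & 0x3F);
printf("U+%04X\n", $euro); // U+20AC</code>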

Implementation in PHP 4/5

For PHP versions 4 or 5, you can implement a function to perform this conversion:

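<code class="php">// Returns the UCS-2 code point of the first UTF-8 character in $utf8, or null if it cannot be decoded.
function utf8_char_to_ucs2($utf8) {
    $len = strlen($utf8);
    if ($len == 0) {
        return null;
    }
    $b0 = ord($utf8[0]);
    if (!($b0 & 0x80)) {
        // 1 byte: 0xxxxxxx
        return $b0;
    } elseif (($b0 & 0xE0) == 0xC0 && $len >= 2) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return (($b0 & 0x1F) << 6) | (ord($utf8[1]) & 0x3F);
    } elseif (($b0 & 0xF0) == 0xE0 && $len >= 3) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return (($b0 & 0x0F) << 12) | ((ord($utf8[1]) & 0x3F) << 6) | (ord($utf8[2]) & 0x3F);
    } else {
        // 4-byte sequences and malformed or truncated input cannot be mapped to UCS-2
        return null;
    }
}</code>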

Example Usage

<code class="php">$utf8 = "hello";
// Indexing byte by byte is safe here only because every character is ASCII (one byte each).
for ($i = 0; $i < strlen($utf8); $i++) {
    $ucs2_codepoint = utf8_char_to_ucs2($utf8[$i]);
    printf("Code point for '%s': %d\n", $utf8[$i], $ucs2_codepoint);
}</code>

This will output:

Code point for 'h': 104
Code point for 'e': 101
Code point for 'l': 108
Code point for 'l': 108
Code point for 'o': 111
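
The loop above works because "hello" is pure ASCII, so each byte is one character. Text that contains multi-byte characters has to be walked one sequence at a time instead. The following is a minimal sketch of that idea; utf8_string_to_ucs2 is a hypothetical name, and the function does not validate continuation bytes, so it assumes reasonably well-formed input and stores null for anything it cannot map into UCS-2.

<code class="php">// Hypothetical helper: decode a UTF-8 string into an array of UCS-2 code points.
// 4-byte sequences and undecodable bytes produce null entries.
function utf8_string_to_ucs2($utf8) {
    $points = array();
    $len = strlen($utf8);
    $i = 0;
    while ($i < $len) {
        $b0 = ord($utf8[$i]);
        if ($b0 < 0x80) {
            $points[] = $b0;
            $i += 1;
        } elseif (($b0 & 0xE0) == 0xC0 && $i + 1 < $len) {
            $points[] = (($b0 & 0x1F) << 6) | (ord($utf8[$i + 1]) & 0x3F);
            $i += 2;
        } elseif (($b0 & 0xF0) == 0xE0 && $i + 2 < $len) {
            $points[] = (($b0 & 0x0F) << 12)
                      | ((ord($utf8[$i + 1]) & 0x3F) << 6)
                      | (ord($utf8[$i + 2]) & 0x3F);
            $i += 3;
        } elseif (($b0 & 0xF8) == 0xF0) {
            $points[] = null; // 4-byte sequence: beyond the UCS-2 range
            $i += 4;
        } else {
            $points[] = null; // stray continuation byte, invalid lead byte, or truncated input
            $i += 1;
        }
    }
    return $points;
}

// "h", "é" and "€" written as raw bytes so the example does not depend on file encoding.
foreach (utf8_string_to_ucs2("h\xC3\xA9\xE2\x82\xAC") as $cp) {
    echo $cp === null ? "n/a" : sprintf("U+%04X", $cp), "\n";
}</code>

With this input the loop prints U+0068, U+00E9 and U+20AC, one per line.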
