Converting UTF-8 Characters to UCS-2 Code Points
In this article, we explore how to extract the UCS-2 code points of characters within a UTF-8 string. We will provide a detailed explanation of the process and an implementation in PHP versions 4 or 5.
Understanding UTF-8
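UTF-8 is a character encoding standard that represents Unicode characters using one to four bytes. To determine the number of bytes for a particular character, examine the leading byte: a byte of the form 0xxxxxxx is a complete one-byte character, 110xxxxx starts a two-byte sequence, 1110xxxx starts a three-byte sequence, and 11110xxx starts a four-byte sequence. Every following byte of a sequence has the form 10xxxxxx.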
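The leading-byte test can be sketched as a small helper. The function name `utf8_sequence_length` is a hypothetical one chosen for this illustration; it only classifies the first byte and does not validate the bytes that follow.

```php
<?php
// Hypothetical helper (not part of PHP): returns the length in bytes of the
// UTF-8 sequence that starts with the first byte of $byte, or 0 if that
// byte cannot start a sequence.
function utf8_sequence_length($byte) {
    $b = ord($byte);
    if (($b & 0x80) == 0x00) { return 1; } // 0xxxxxxx
    if (($b & 0xE0) == 0xC0) { return 2; } // 110xxxxx
    if (($b & 0xF0) == 0xE0) { return 3; } // 1110xxxx
    if (($b & 0xF8) == 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx continuation byte, or an invalid lead byte
}
```

A return value of 4 signals a character that UCS-2 cannot represent, and 0 signals a byte that should not appear at the start of a character.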
Converting to UCS-2
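UCS-2 is a fixed-width 16-bit encoding that covers only the Basic Multilingual Plane (U+0000 to U+FFFF). It is often confused with UTF-16, which extends it with surrogate pairs to reach the remaining code points. The conversion from UTF-8 to UCS-2 depends on the number of bytes per character: for one byte, the code point is the byte value itself; for two bytes, it is the low 5 bits of the first byte followed by the low 6 bits of the second; for three bytes, it is the low 4 bits of the first byte followed by the low 6 bits of each of the next two. A four-byte sequence encodes a code point above U+FFFF and cannot be represented in UCS-2.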
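As a worked illustration of the bit arithmetic, the snippet below decodes one two-byte and one three-byte character by hand. The sample characters and variable names are chosen for this example; the bytes are written as escapes so the result does not depend on the encoding of the source file.

```php
<?php
// "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
$two = "\xC3\xA9";
// 0xC3 & 0x1F = 0x03, shifted left by 6 = 0xC0; 0xA9 & 0x3F = 0x29
$cp2 = ((ord($two[0]) & 0x1F) << 6) | (ord($two[1]) & 0x3F);
printf("U+%04X\n", $cp2); // prints U+00E9

// "€" (U+20AC) is encoded as the three bytes 0xE2 0x82 0xAC.
$three = "\xE2\x82\xAC";
// 0xE2 & 0x0F = 0x2 -> 0x2000; 0x82 & 0x3F = 0x02 -> 0x80; 0xAC & 0x3F = 0x2C
$cp3 = ((ord($three[0]) & 0x0F) << 12)
     | ((ord($three[1]) & 0x3F) << 6)
     | (ord($three[2]) & 0x3F);
printf("U+%04X\n", $cp3); // prints U+20AC
```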
Implementation in PHP 4/5
For PHP versions 4 or 5, you can implement a function to perform this conversion:
<code class="php">function utf8_char_to_ucs2($utf8) {
    // $utf8 must hold the complete byte sequence of one character.
    $lead = ord($utf8[0]);
    if (!($lead & 0x80)) {
        // 0xxxxxxx: one byte (ASCII); the byte value is the code point
        return $lead;
    } elseif (($lead & 0xE0) == 0xC0) {
        // 110xxxxx 10xxxxxx: two bytes
        return (($lead & 0x1F) << 6) | (ord($utf8[1]) & 0x3F);
    } elseif (($lead & 0xF0) == 0xE0) {
        // 1110xxxx 10xxxxxx 10xxxxxx: three bytes
        return (($lead & 0x0F) << 12) | ((ord($utf8[1]) & 0x3F) << 6) | (ord($utf8[2]) & 0x3F);
    } else {
        // Invalid lead byte, or a four-byte sequence beyond the UCS-2 range
        return null;
    }
}</code>
Example Usage
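<code class="php">$utf8 = "hello";
// Stepping one byte at a time is only correct for ASCII input such as this;
// a multi-byte character must be passed to the function as a whole.
for ($i = 0; $i < strlen($utf8); $i++) {
    $ucs2_codepoint = utf8_char_to_ucs2($utf8[$i]);
    printf("Code point for '%s': %d\n", $utf8[$i], $ucs2_codepoint);
}</code>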
This will output:
Code point for 'h': 104
Code point for 'e': 101
Code point for 'l': 108
Code point for 'l': 108
Code point for 'o': 111
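The loop in the example advances one byte at a time, which works for ASCII text only. For strings that contain multi-byte characters, the index has to advance by the length of each sequence. The following is a minimal sketch of that idea, not a full validator: the function name `utf8_string_to_ucs2` is hypothetical, and it does not reject overlong encodings or surrogate code points.

```php
<?php
// Hypothetical helper: decodes a UTF-8 string into an array of UCS-2 code
// points. An entry is null when the sequence is malformed, truncated, or
// encodes a character outside the UCS-2 range.
function utf8_string_to_ucs2($utf8) {
    $out = array();
    $len = strlen($utf8);
    $i = 0;
    while ($i < $len) {
        $b = ord($utf8[$i]);
        if ($b < 0x80) {
            $n = 1; $cp = $b;            // 0xxxxxxx
        } elseif (($b & 0xE0) == 0xC0) {
            $n = 2; $cp = $b & 0x1F;     // 110xxxxx
        } elseif (($b & 0xF0) == 0xE0) {
            $n = 3; $cp = $b & 0x0F;     // 1110xxxx
        } elseif (($b & 0xF8) == 0xF0) {
            $n = 4; $cp = null;          // 11110xxx: beyond UCS-2
        } else {
            $n = 1; $cp = null;          // stray continuation or invalid byte
        }
        // Fold in the continuation bytes, six payload bits at a time.
        for ($k = 1; $k < $n; $k++) {
            if ($i + $k >= $len || (ord($utf8[$i + $k]) & 0xC0) != 0x80) {
                $cp = null;              // truncated or malformed sequence
                $n = $k;                 // resume at the offending byte
                break;
            }
            if ($cp !== null) {
                $cp = ($cp << 6) | (ord($utf8[$i + $k]) & 0x3F);
            }
        }
        $out[] = $cp;
        $i += $n;
    }
    return $out;
}

// "h", "é" (0xC3 0xA9) and "€" (0xE2 0x82 0xAC), written as byte escapes.
print_r(utf8_string_to_ucs2("h\xC3\xA9\xE2\x82\xAC"));
```

Returning null for unrepresentable characters mirrors the behavior of `utf8_char_to_ucs2`, so callers can handle both functions the same way.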