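This question is about obtaining the UCS-2 code points for the characters of a given UTF-8 string. The task is to decode each character into its corresponding code point, regardless of language or script. UCS-2 can only represent code points up to U+FFFF (the Basic Multilingual Plane), so characters that take four bytes in UTF-8 have no UCS-2 equivalent.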
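UTF-8 Code Point Representation
UTF-8 stores each code point in 1 to 4 bytes, depending on the code point value:
U+0000 to U+007F: 1 byte (0xxxxxxx)
U+0080 to U+07FF: 2 bytes (110xxxxx 10xxxxxx)
U+0800 to U+FFFF: 3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
U+10000 to U+10FFFF: 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)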
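The relationship between a code point's value and its UTF-8 length can be sketched as a small C function. The name utf8_encoded_length is hypothetical, and returning 0 for surrogates and out-of-range values is an assumed convention, not something the original specifies.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs to encode cp,
   or 0 if cp cannot be encoded (surrogate or above U+10FFFF). */
int utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x7F)                   return 1;
    if (cp <= 0x7FF)                  return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not valid in UTF-8 */
    if (cp <= 0xFFFF)                 return 3;
    if (cp <= 0x10FFFF)               return 4; /* outside the UCS-2 range */
    return 0;
}
```

Any code point for which this returns 4 lies beyond U+FFFF and therefore cannot be represented in UCS-2.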
Determining Byte Count
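To determine how many bytes a character occupies, examine the high bits of its first byte:
0xxxxxxx: 1 byte
110xxxxx: 2 bytes
1110xxxx: 3 bytes
11110xxx: 4 bytes (beyond the UCS-2 range)
A byte of the form 10xxxxxx is a continuation byte and never starts a character.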
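As a minimal sketch, the first-byte check might look like this in C. The function name utf8_sequence_length is hypothetical. It inspects only the bit pattern of the lead byte, so it does not reject the overlong lead bytes 0xC0 and 0xC1 or verify the bytes that follow.

```c
/* Hypothetical helper: sequence length implied by a UTF-8 lead byte,
   or 0 if the byte cannot start a sequence. */
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx, not representable in UCS-2 */
    return 0;                            /* 10xxxxxx continuation byte, or invalid */
}
```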
Example C Code
Here is sample C code that converts a single UTF-8 character to a UCS-2 code point:
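<code class="c">#include <stddef.h>  /* wchar_t */

/* ERROR is left undefined in the original; U+FFFD (the Unicode
   replacement character) is an assumed sentinel value. */
#define ERROR ((wchar_t)0xFFFD)

/* Decodes the UTF-8 character starting at utf8 into a UCS-2 code point.
   Continuation bytes are not validated. */
wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
    if (!(utf8[0] & 0x80))              /* 0xxxxxxx */
        return (wchar_t)utf8[0];
    else if ((utf8[0] & 0xE0) == 0xC0)  /* 110xxxxx */
        return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
    else if ((utf8[0] & 0xF0) == 0xE0)  /* 1110xxxx */
        return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
    else
        return ERROR;  /* uh-oh, UCS-2 can't handle code points this high */
}</code>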
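Since the question concerns a whole string, the same bit manipulation can be applied in a loop. The following is a sketch under stated assumptions, not a definitive implementation: the name utf8_to_ucs2, the uint16_t output type (chosen because wchar_t is 32 bits wide on many platforms), and the (size_t)-1 error return are all choices made for this example. It checks continuation bytes but does not reject overlong encodings or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the NUL-terminated UTF-8 string src into
   out, which has room for cap code units. Returns the number of code
   units written, or (size_t)-1 if the input is malformed, contains a
   code point above U+FFFF, or does not fit in out. */
size_t utf8_to_ucs2(const unsigned char *src, uint16_t *out, size_t cap)
{
    size_t n = 0;

    while (*src) {
        uint32_t cp;
        int extra; /* number of continuation bytes expected */

        if (*src < 0x80)                { cp = *src;        extra = 0; }
        else if ((*src & 0xE0) == 0xC0) { cp = *src & 0x1F; extra = 1; }
        else if ((*src & 0xF0) == 0xE0) { cp = *src & 0x0F; extra = 2; }
        else return (size_t)-1; /* 4-byte sequence or stray continuation byte */
        src++;

        for (int i = 0; i < extra; i++, src++) {
            /* Each continuation byte must look like 10xxxxxx; this test
               also fails on a premature terminating NUL. */
            if ((*src & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*src & 0x3F);
        }

        if (n == cap)
            return (size_t)-1; /* output buffer too small */
        out[n++] = (uint16_t)cp;
    }
    return n;
}
```

For example, decoding the bytes 41 C3 A9 E2 82 AC ("A", "é", "€") with this function yields the three code units 0x0041, 0x00E9 and 0x20AC.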
Alternative Solutions
Instead of decoding by hand, you can use an existing conversion library such as iconv, or the Unicode facilities provided by your programming language.
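As a rough sketch of the iconv route, the POSIX functions iconv_open, iconv and iconv_close can perform the conversion. The wrapper name utf8_to_ucs2le_iconv is hypothetical, and the encoding name "UCS-2LE" is an assumption: encoding names are implementation-defined, and although glibc and GNU libiconv accept this one, other systems may not. On some platforms you may also need to link with -liconv.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: converts in_len bytes of UTF-8 into little-endian
   UCS-2 in out (capacity out_cap bytes). Returns the number of bytes
   written, or (size_t)-1 on failure. */
size_t utf8_to_ucs2le_iconv(const char *in, size_t in_len,
                            char *out, size_t out_cap)
{
    /* "UCS-2LE" is an assumed encoding name; check your iconv's list. */
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in; /* iconv takes char **, but does not modify the input bytes */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1; /* invalid input, unconvertible character, or buffer too small */
    return out_cap - out_left;
}
```

Each UCS-2 code unit occupies two bytes in the output, so the returned byte count is twice the number of characters converted.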