How can I convert characters in a UTF-8 string to their corresponding UCS-2 code points?

Barbara Streisand

Released: 2024-10-30 02:15:02

Converting Characters in a UTF-8 String to UCS-2 Code Points

The goal is to obtain the UCS-2 code point for each character in a UTF-8 string. UCS-2 is a fixed-width 16-bit encoding, so it can represent only code points in the Basic Multilingual Plane (U+0000 to U+FFFF). Characters outside that range, which UTF-8 encodes in 4 bytes, have no UCS-2 representation.

UTF-8 Byte Representation

UTF-8 stores each code point in 1 to 4 bytes, depending on the code point's value. The x bits carry the code point itself:

1 byte: 0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
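
To make the bit layouts above concrete, here is a minimal sketch that goes in the opposite direction and packs a 16-bit code point into UTF-8 bytes. The function name `ucs2_to_utf8` is hypothetical (it is not part of the original answer), and the sketch assumes the caller never passes a surrogate value (0xD800 to 0xDFFF), which it does not reject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: encode one 16-bit code
 * point as UTF-8 using the 1-, 2-, and 3-byte layouts.
 * Returns the number of bytes written to out (1 to 3).
 * Assumption: cp is not a surrogate (0xD800..0xDFFF); this is not checked. */
size_t ucs2_to_utf8(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For example, U+20AC (the euro sign) falls in the 3-byte range and is encoded as the bytes E2 82 AC.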

Determining Byte Count

To determine the byte count for a character, examine the first byte:

Leading 0: 1-byte character
Leading 110: 2-byte character
Leading 1110: 3-byte character
Leading 11110: 4-byte character
Leading 10: Continuation (non-initial) byte of a multibyte character
Leading 11111: Invalid byte; it never appears in well-formed UTF-8
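
The classification above can be sketched as a small function that masks the leading bits of a byte. The name `utf8_seq_len` is hypothetical, and returning 0 for bytes that cannot start a character is a convention chosen for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: return the length in bytes
 * (1 to 4) of the UTF-8 sequence that starts with the given lead byte,
 * or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;  /* 10xxxxxx (continuation byte) or 11111xxx (invalid) */
}
```

A result of 4 tells the caller that the character lies outside the range UCS-2 can represent.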

Example C Code

Here is sample C code that converts a single UTF-8 encoded character to its UCS-2 code point:

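<code class="c">#include <stddef.h>  /* wchar_t */

/* Sentinel for input that has no UCS-2 code point. U+FFFF is a
   noncharacter, so it never occurs as real text. */
#define ERROR ((wchar_t)0xFFFF)

/* Decodes the single UTF-8 sequence starting at utf8. Assumes the
   sequence is complete; continuation bytes are not validated. */
wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
  if(!(utf8[0] & 0x80))      // 0xxxxxxx
    return (wchar_t)utf8[0];
  else if((utf8[0] & 0xE0) == 0xC0)  // 110xxxxx
    return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
  else if((utf8[0] & 0xF0) == 0xE0)  // 1110xxxx
    return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
  else
    return ERROR;  // 4-byte sequence (beyond UCS-2), stray continuation byte, or invalid lead byte
}</code>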
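
The function above handles one character at a time. As a rough sketch of how the same bit patterns extend to a whole string, the following self-contained variant walks a NUL-terminated UTF-8 string and writes one 16-bit value per character. The names `utf8_to_ucs2_string` and `UCS2_INVALID` are hypothetical, and the sketch makes some simplifying assumptions: it checks continuation bytes but does not reject overlong encodings or surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error value for this sketch. */
#define UCS2_INVALID ((size_t)-1)

/* Decode a NUL-terminated UTF-8 string into UCS-2 code points.
 * Returns the number of code points written to out, or UCS2_INVALID if
 * the input is malformed, contains a 4-byte sequence (not representable
 * in UCS-2), or does not fit in cap entries.
 * Simplification: overlong encodings and surrogates are not rejected. */
size_t utf8_to_ucs2_string(const char *s, uint16_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0) {
        uint16_t cp;
        size_t extra;  /* number of continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else return UCS2_INVALID;  /* 4-byte lead, stray continuation, or invalid byte */
        p++;

        while (extra-- > 0) {
            /* Each continuation byte must look like 10xxxxxx; a premature
               NUL terminator fails this check too, so we never read past it. */
            if ((*p & 0xC0) != 0x80)
                return UCS2_INVALID;
            cp = (uint16_t)((cp << 6) | (*p & 0x3F));
            p++;
        }

        if (n == cap)
            return UCS2_INVALID;  /* output buffer too small */
        out[n++] = cp;
    }
    return n;
}
```

Called on the bytes of "Aé€" (41 C3 A9 E2 82 AC), this yields the three code points 0x0041, 0x00E9, and 0x20AC.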

Alternative Solutions

Instead of decoding by hand, you can use an existing conversion library such as iconv, or the Unicode conversion facilities that your programming language provides.
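
As a rough illustration of the iconv route, the sketch below converts a UTF-8 string to little-endian UCS-2 bytes using the POSIX `iconv_open`, `iconv`, and `iconv_close` calls. The wrapper name `utf8_to_ucs2le_iconv` is hypothetical, and the encoding name "UCS-2LE" is an assumption: glibc and GNU libiconv accept it, but other iconv implementations may use different names. On some platforms you may also need to link with `-liconv`.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper, for illustration only: convert a NUL-terminated
 * UTF-8 string to UCS-2 (little-endian) bytes with iconv.
 * Returns the number of bytes written to out, or (size_t)-1 on failure
 * (unsupported encoding name, invalid input, unrepresentable character,
 * or output buffer too small). */
size_t utf8_to_ucs2le_iconv(const char *in, char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    /* Assumption: the implementation knows the name "UCS-2LE". */
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;  /* iconv takes char **; the input is not modified */
    size_t src_left = strlen(in);
    char *dst = out;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Each character occupies two bytes in the output, low byte first, so "é" (U+00E9) becomes the bytes E9 00.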
