How can I convert characters in a UTF-8 string to their corresponding UCS-2 code points?

Barbara Streisand
Release: 2024-10-30 02:15:02

Converting Characters in a UTF-8 String to UCS-2 Code Points

The goal is to obtain the UCS-2 code point of each character in a UTF-8 string. Because UCS-2 is a fixed-width 16-bit encoding, it can represent only code points U+0000 through U+FFFF (the Basic Multilingual Plane); characters that UTF-8 encodes as 4-byte sequences lie outside that range and cannot be converted.

UTF-8 Byte Representation

UTF-8 stores each code point in 1 to 4 bytes, depending on the code point value. The x bits carry the code point, most significant bits first:

  • 1 byte: 0xxxxxxx
  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
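
To make the bit patterns above concrete, here is a minimal sketch that goes in the opposite direction and packs a code point into UTF-8 bytes. The function name utf8_encode is hypothetical and not part of the original article, and surrogate code points are not rejected, for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                  /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {          /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {      /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* not a valid Unicode code point */
}
```

For example, U+20AC (the euro sign) falls in the 3-byte range and is encoded as the bytes E2 82 AC.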

Determining Byte Count

To determine the byte count for a character, examine the first byte:

  • Leading 0: 1-byte character
  • Leading 110: 2-byte character
  • Leading 1110: 3-byte character
  • Leading 11110: 4-byte character
  • Leading 10: Non-initial byte of a multibyte character
  • Leading 11111: Invalid character
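
The rules above can be sketched as a small lookup on the first byte. The name utf8_seq_len is an assumption made for illustration; it returns 0 for bytes that cannot start a character.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` cannot begin a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;  /* 10xxxxxx (continuation byte) or 11111xxx (invalid) */
}
```

Each mask keeps only the prefix bits being tested, so the order of the checks does not matter.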

Example C Code

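Here is sample C code that converts a single UTF-8 character to a UCS-2 code point. It handles only 1- to 3-byte sequences, because 4-byte sequences encode code points above U+FFFF: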

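<code class="c">#include <wchar.h>

/* Assumed sentinel (not defined in the original): U+FFFF is a noncharacter. */
#define ERROR ((wchar_t)0xFFFF)

/* Decodes one character; assumes utf8 points at a complete, well-formed sequence. */
wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
{
  if(!(utf8[0] & 0x80))      // 0xxxxxxx
    return (wchar_t)utf8[0];
  else if((utf8[0] & 0xE0) == 0xC0)  // 110xxxxx
    return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
  else if((utf8[0] & 0xF0) == 0xE0)  // 1110xxxx
    return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
  else
    return ERROR;  // 4-byte sequence (beyond UCS-2), continuation byte, or invalid lead byte
}</code>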
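
To convert a whole string rather than one character, the same decoding step can be applied in a loop. The following is a minimal sketch under a few assumptions: the function name utf8_to_ucs2 and its return convention are hypothetical, the input is NUL-terminated, and overlong encodings are not rejected. Unlike the single-character version, it checks that every continuation byte has the 10xxxxxx form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into UCS-2.
   Returns the number of code points written, or -1 if the input is
   malformed, contains a code point above U+FFFF, or does not fit. */
long utf8_to_ucs2(const unsigned char *in, uint16_t *out, size_t out_cap)
{
    size_t n = 0;
    while (*in) {
        uint32_t cp;
        size_t extra;  /* number of continuation bytes that follow */
        if (!(in[0] & 0x80))             { cp = in[0];        extra = 0; }
        else if ((in[0] & 0xE0) == 0xC0) { cp = in[0] & 0x1F; extra = 1; }
        else if ((in[0] & 0xF0) == 0xE0) { cp = in[0] & 0x0F; extra = 2; }
        else return -1;  /* 4-byte sequence, stray continuation, or invalid */

        for (size_t i = 1; i <= extra; i++) {
            /* A NUL terminator fails this test too, so a truncated
               sequence is reported without reading past the end. */
            if ((in[i] & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (in[i] & 0x3F);
        }
        if (n == out_cap) return -1;  /* output buffer too small */
        out[n++] = (uint16_t)cp;
        in += extra + 1;
    }
    return (long)n;
}
```

For the input bytes 68 C3 A9 E2 82 AC ("hé€") this yields the three code points 0x0068, 0x00E9, and 0x20AC.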

Alternative Solutions

Instead of decoding by hand, you can use an existing library such as iconv, or the encoding-conversion functions provided by your programming language.
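
As a rough illustration of the iconv route, the sketch below uses the POSIX iconv_open, iconv, and iconv_close calls. The wrapper name utf8_to_ucs2_iconv is hypothetical, and the encoding name "UCS-2LE" is an assumption: glibc and GNU libiconv accept it, but the set of supported names varies between implementations. On some platforms the program must be linked with -liconv.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: convert a NUL-terminated UTF-8 string to UCS-2.
   Returns the number of 16-bit units written, or -1 on failure. */
long utf8_to_ucs2_iconv(const char *utf8, uint16_t *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");  /* to-encoding, from-encoding */
    if (cd == (iconv_t)-1)
        return -1;  /* this conversion is not supported here */

    char *inbuf = (char *)utf8;  /* iconv takes char **, input is not modified */
    size_t inleft = strlen(utf8);
    char *outbuf = (char *)out;
    size_t outbytes = out_cap * sizeof(uint16_t);
    size_t outleft = outbytes;

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;  /* invalid input, unrepresentable character, or no room */

    /* The output is little-endian bytes; assemble them into host-order
       16-bit values in place so the result is the same on any host. */
    size_t n = (outbytes - outleft) / 2;
    unsigned char *bytes = (unsigned char *)out;
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
    return (long)n;
}
```

Requesting an explicit byte order avoids having to deal with a byte order mark in the output.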

source:php.cn