Unicode Encoding for String Literals in C++11
C++11 introduced new character and string literal types that extend the language's support for Unicode encodings. There are now four character types (char, wchar_t, char16_t, char32_t) and five string literal types ("", L"", u8"", u"", U""), and specific rules govern how escapes and encodings behave in each of them.
Encoding Compatibility
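The \x escape can be used in every string literal type and inserts a code-unit value written in hexadecimal. The \u and \U escapes (universal character names) name a Unicode code point rather than a raw value, and the compiler converts that code point to the encoding of the containing string. The result is only portable in the UTF-encoded literals (u8"", u"", and U""); in ordinary and wide literals it depends on the implementation's execution character set.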
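As a minimal sketch of that conversion, the block below writes U+00E9 once with \u and inspects the code units each UTF literal type ends up holding. The constant names are made up for this illustration, and the static_cast to unsigned char is only there so the check works whether u8"" yields char (C++11) or char8_t (C++20).

```cpp
// U+00E9 written as a universal character name. The compiler converts the
// code point to the encoding of the literal it appears in.
constexpr unsigned utf8_first  = static_cast<unsigned char>(u8"\u00E9"[0]);
constexpr unsigned utf8_second = static_cast<unsigned char>(u8"\u00E9"[1]);
constexpr unsigned utf16_unit  = u"\u00E9"[0];
constexpr unsigned utf32_unit  = U"\u00E9"[0];

static_assert(utf8_first == 0xC3 && utf8_second == 0xA9,
              "UTF-8 stores U+00E9 as the two bytes C3 A9");
static_assert(utf16_unit == 0x00E9, "UTF-16 stores U+00E9 in one char16_t");
static_assert(utf32_unit == 0x00E9, "UTF-32 stores U+00E9 in one char32_t");

// \x inserts a raw code-unit value and is accepted in every literal type.
constexpr unsigned narrow_hex = static_cast<unsigned char>("\x41"[0]);
constexpr unsigned wide_hex   = L"\x41"[0];

static_assert(narrow_hex == 0x41 && wide_hex == 0x41,
              "\\x stores the given value unchanged");
```

Because every check is a static_assert, simply compiling the file in C++11 mode or later verifies the claims.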
String Length and Encoding
Although the number of code units needed for a given text varies with the encoding, the array behind a string literal has fixed-width elements, each holding exactly one code unit. The array's size is therefore the number of code units the chosen encoding produces, plus one for the terminating null, and not the number of characters.
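The effect on array sizes can be sketched with a small helper. code_units is a hypothetical name introduced for this example, not a standard function; it simply reads the array bound that the compiler fixed for the literal.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// N is the array bound the compiler chose once the encoding was known.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The same four code points (z, U+00DF, U+6C34, U+1F34C) in each UTF literal.
static_assert(code_units(u8"z\u00DF\u6C34\U0001F34C") == 10,
              "UTF-8: 1 + 2 + 3 + 4 code units");
static_assert(code_units(u"z\u00DF\u6C34\U0001F34C") == 5,
              "UTF-16: 1 + 1 + 1 + 2 code units");
static_assert(code_units(U"z\u00DF\u6C34\U0001F34C") == 4,
              "UTF-32: one code unit per code point");

// Element width is fixed by the literal type, not by the text it holds.
static_assert(sizeof(u"z"[0]) == sizeof(char16_t), "char16_t elements");
static_assert(sizeof(U"z"[0]) == sizeof(char32_t), "char32_t elements");
```

The text is identical in all three literals; only the number of elements changes.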
UTF-Encoding Semantics
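u"" string literals are UTF-16 encoded and consist of char16_t code units, u8"" string literals are UTF-8 encoded and consist of char code units, and U"" string literals are UTF-32 encoded and consist of char32_t code units. UTF-8 and UTF-16 are variable-length: a code point takes one to four code units in UTF-8 and one or two in UTF-16, while UTF-32 always uses exactly one.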
Lone Surrogates
Surrogate code points (0xD800-0xDFFF) cannot be named with \u or \U; a program that does so is ill-formed. Surrogates exist only as UTF-16 code units: a code point above 0xFFFF is written with \U, and in a u"" literal the compiler stores it as a surrogate pair.
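To illustrate, the sketch below lets the compiler encode U+1F34C in a UTF-16 literal and then checks the two code units it produced. combine_surrogates is a hypothetical helper written for this example; it assumes it is given a valid high and low surrogate and does no validation.

```cpp
// U+1F34C lies above U+FFFF, so the compiler generates a surrogate pair
// from the single \U escape.
constexpr char16_t high = u"\U0001F34C"[0];
constexpr char16_t low  = u"\U0001F34C"[1];

static_assert(high == 0xD83C, "high (lead) surrogate");
static_assert(low == 0xDF4C, "low (trail) surrogate");

// Recombines a surrogate pair into the code point it encodes.
// Assumes hi is in 0xD800-0xDBFF and lo is in 0xDC00-0xDFFF.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000
         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
         + (static_cast<char32_t>(lo) - 0xDC00);
}

static_assert(combine_surrogates(high, low) == 0x1F34C,
              "the pair decodes back to the original code point");
```

Writing either half of the pair directly as a four-digit \u escape is what the rule above forbids, so the pair can only come from a \U escape or from \x values.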
Encoding Awareness
Standard string manipulation functions are not aware of Unicode encoding semantics and treat a UTF-encoded string as a sequence of individual code units, so lengths and indices count code units rather than characters. Input and output streams, however, can read and write Unicode-encoded values when they are imbued with an appropriately configured locale.
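The gap between code units and code points can be shown with a rough sketch for UTF-16. Both functions are hypothetical helpers written in C++11-style recursive constexpr: code_unit_count mirrors what a strlen-like function measures, and code_point_count assumes well-formed input and performs no validation.

```cpp
#include <cstddef>

// Counts char16_t code units up to the terminating null. This is the same
// quantity std::char_traits<char16_t>::length returns.
constexpr std::size_t code_unit_count(const char16_t* s) {
    return *s == 0 ? 0 : 1 + code_unit_count(s + 1);
}

// True for the second half of a surrogate pair (0xDC00-0xDFFF).
constexpr bool is_low_surrogate(char16_t c) {
    return c >= 0xDC00 && c <= 0xDFFF;
}

// Counts code points by not counting low surrogates, so each pair is
// counted once. Assumes the input is well-formed UTF-16.
constexpr std::size_t code_point_count(const char16_t* s) {
    return *s == 0
        ? 0
        : (is_low_surrogate(*s) ? 0 : 1) + code_point_count(s + 1);
}

static_assert(code_unit_count(u"z\u00DF\u6C34\U0001F34C") == 5,
              "five UTF-16 code units");
static_assert(code_point_count(u"z\u00DF\u6C34\U0001F34C") == 4,
              "four code points");
```

Code that needs to count user-perceived characters requires more than this, since a single displayed character may consist of several code points.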