How Does Python's `unicodedata.normalize()` Function Simplify Unicode Representations?

DDD

Release: 2024-11-22 16:12:15

Normalizing Unicode in Python: Simplifying Unicode Representations

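In Python, the unicodedata module provides the normalize() function, which converts a Unicode string to one of the standard normalization forms: NFC, NFD, NFKC, or NFKD. Depending on the form requested, it either combines a base character and its combining marks into a single precomposed code point, or splits a precomposed character into its base character and combining marks.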

Consider the following example:

import unicodedata

char = "\u00e1"  # precomposed "á": a single code point
print(len(char))  # Output: 1

for c in char:
    print(unicodedata.name(c))  # Output: LATIN SMALL LETTER A WITH ACUTE

char = "a\u0301"  # "á" written as 'a' followed by a combining accent
print(len(char))  # Output: 2

for c in char:
    print(unicodedata.name(c))
# Output:
# LATIN SMALL LETTER A
# COMBINING ACUTE ACCENT


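The second "á" is made up of two code points: U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT). The first is the single precomposed code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE). Both render as "á" on screen, but Python treats them as different strings with different lengths.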

To normalize the decomposed string, we can call unicodedata.normalize() with the form 'NFC', which returns the composed form:

print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))  # Output: '\xe1'


Conversely, passing 'NFD' returns the decomposed form:

print(ascii(unicodedata.normalize('NFD', '\u00E1')))  # Output: 'a\u0301'
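Because the composed and decomposed spellings are different code point sequences, a plain == comparison between them fails. The sketch below shows one way to compare strings by their canonical form. The helper name nfc_equal is hypothetical and not part of unicodedata; NFC is used here, but NFD would serve equally well as long as both sides use the same form.

```python
import unicodedata

composed = "\u00e1"     # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT

# The two strings look identical but are not equal code point by code point.
print(composed == decomposed)  # False


def nfc_equal(left: str, right: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(nfc_equal(composed, decomposed))  # True
```

Normalizing text once at the point where it enters a program (user input, file reads) is a common alternative to normalizing at every comparison.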


Two additional normalization forms handle compatibility code points. NFKC and NFKD replace compatibility characters with their compatibility equivalents, and then compose (NFKC) or decompose (NFKD) the result. For example, U+2167 (ROMAN NUMERAL EIGHT) normalizes to the four ASCII letters "VIII" using NFKC:

print(unicodedata.normalize('NFKC', '\u2167'))  # Output: 'VIII'
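As a further illustration, the sketch below contrasts NFC with NFKC on a few other compatibility characters. The sample characters are chosen here for illustration and do not come from the original example; each has only a compatibility decomposition, so NFC leaves it untouched while NFKC replaces it.

```python
import unicodedata

# Illustrative compatibility characters (chosen for this sketch).
samples = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\uff21": "A",   # FULLWIDTH LATIN CAPITAL LETTER A
    "\u00b2": "2",   # SUPERSCRIPT TWO
}

for char, expected in samples.items():
    nfc = unicodedata.normalize("NFC", char)
    nfkc = unicodedata.normalize("NFKC", char)
    # NFC keeps the compatibility character; NFKC swaps in its equivalent.
    print(unicodedata.name(char), ascii(nfc), ascii(nfkc), nfkc == expected)
```

This makes NFKC useful for search and matching, where "ﬁ" and "fi" should be treated alike, but less suitable when the visual distinction carries meaning.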


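It's important to note that normalization is not always reversible. Several different strings can normalize to the same result, so the original spelling cannot be recovered afterwards. This applies most visibly to NFKC and NFKD, which discard compatibility distinctions: once U+2167 has become "VIII", nothing marks it as a former Roman numeral character.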
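A minimal sketch of this loss of information follows. It uses U+2167 from the example above and, as an additional case chosen for this sketch, U+212B (ANGSTROM SIGN), whose canonical normalization yields U+00C5 rather than the original code point.

```python
import unicodedata

# Compatibility normalization: the Roman numeral character becomes plain letters.
numeral = "\u2167"  # ROMAN NUMERAL EIGHT
flattened = unicodedata.normalize("NFKC", numeral)
print(ascii(flattened))  # 'VIII'

# Normalizing the plain letters again cannot bring the original character back.
print(unicodedata.normalize("NFC", flattened) == numeral)  # False

# Canonical normalization can also change the code point for good.
angstrom = "\u212b"  # ANGSTROM SIGN
round_trip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", angstrom))
print(ascii(round_trip))       # '\xc5'
print(round_trip == angstrom)  # False
```

If the exact original text matters, it is safer to store it unmodified and keep the normalized copy alongside it for comparison or indexing.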
