Normalizing Unicode in Python: Simplifying Unicode Representations
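In Python, the unicodedata module provides the normalize() function, which converts a Unicode string to one of the standard normalization forms (NFC, NFD, NFKC, or NFKD). Normalizing gives equivalent sequences of code points, such as a precomposed character and its decomposed counterpart, a single consistent representation.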
Consider the following example:
import unicodedata

# Precomposed form: a single code point
char = "\u00e1"
print(len(char))  # Output: 1
print([unicodedata.name(c) for c in char])  # Output: ['LATIN SMALL LETTER A WITH ACUTE']

# Decomposed form: a base letter followed by a combining accent
char = "a\u0301"
print(len(char))  # Output: 2
print([unicodedata.name(c) for c in char])  # Output: ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
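The first "á" is the single precomposed code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE). The second is made up of two code points: U+0061 (LATIN SMALL LETTER A) followed by U+0301 (COMBINING ACUTE ACCENT). Both render identically as "á", but they are different strings and do not compare equal.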
To normalize this string, we can use .normalize('NFC'), which returns the composed form:
print(ascii(unicodedata.normalize('NFC', '\u0061\u0301'))) # Output: '\xe1'
Conversely, .normalize('NFD') returns the decomposed form:
print(ascii(unicodedata.normalize('NFD', '\u00E1'))) # Output: 'a\u0301'
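One practical consequence is that the two representations compare unequal until both are brought to the same form. The following is a minimal sketch of such a comparison; nfc_equal is a hypothetical helper name chosen for illustration and is not part of the unicodedata module.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e1"     # single precomposed code point
decomposed = "a\u0301"  # base letter + combining acute accent

print(composed == decomposed)           # False
print(nfc_equal(composed, decomposed))  # True
```

NFC is used here only as an example; normalizing both sides to NFD would work equally well, as long as the same form is applied to both strings.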
Two further normalization forms handle compatibility code points. NFKD replaces compatibility characters with their compatibility equivalents, and NFKC does the same and then applies canonical composition. For example, U+2167 (ROMAN NUMERAL EIGHT) normalizes to the four letters "VIII" under NFKC:
print(unicodedata.normalize('NFKC', '\u2167')) # Output: VIII
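To see how the compatibility forms differ from the canonical ones, the same character can be run through all four forms. This is a small illustrative sketch; the variable names are arbitrary.

```python
import unicodedata

roman_eight = "\u2167"  # ROMAN NUMERAL EIGHT

# Normalize the same character under each of the four forms.
results = {
    form: unicodedata.normalize(form, roman_eight)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    print(form, ascii(text))
# NFC '\u2167'
# NFD '\u2167'
# NFKC 'VIII'
# NFKD 'VIII'
```

The canonical forms (NFC and NFD) leave the Roman numeral untouched, while the compatibility forms (NFKC and NFKD) replace it with ordinary Latin letters.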
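Note that normalization is not always reversible. Compatibility normalization in particular discards information: once U+2167 has become "VIII", no normalization form will turn it back into the Roman numeral character. Canonical normalization can also drop distinctions, since a few characters, such as U+212B (ANGSTROM SIGN), are always replaced by their canonical equivalent.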
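The loss of information can be observed with a short round-trip experiment. The sketch below is illustrative only; the choice of U+2167 and U+212B as test characters is an assumption made for the example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Compatibility normalization: the Roman numeral becomes plain letters.
original = "\u2167"  # ROMAN NUMERAL EIGHT
folded = unicodedata.normalize("NFKC", original)

# Re-normalizing the result in any form never yields U+2167 again.
round_trips = [unicodedata.normalize(form, folded) for form in FORMS]
print(original in round_trips)  # False

# Canonical normalization can also drop a distinction:
# U+212B (ANGSTROM SIGN) becomes U+00C5 under NFC.
angstrom = "\u212b"
composed = unicodedata.normalize("NFC", angstrom)
print(ascii(composed))       # '\xc5'
print(composed == angstrom)  # False
```

If the original code points matter, for instance for display or archival purposes, it is safer to keep the unnormalized string and normalize only a copy used for comparison or searching.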