
How Does Python's `unicodedata.normalize()` Function Simplify Unicode Representations?

DDD
Release: 2024-11-22 16:12:15

Normalizing Unicode in Python: Simplifying Unicode Representations

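In Python, the unicodedata module provides the normalize() function, which converts a Unicode string to one of the standard normalization forms. Text that looks identical on screen can be stored as different sequences of code points, and normalizing converts those sequences to one consistent representation, either composed (NFC) or decomposed (NFD).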

Consider the following example:

import unicodedata

# Composed form: a single code point, U+00E1
char = "\u00e1"
print(len(char))  # Output: 1
print([unicodedata.name(c) for c in char])  # Output: ['LATIN SMALL LETTER A WITH ACUTE']

# Decomposed form: U+0061 followed by U+0301
char = "a\u0301"
print(len(char))  # Output: 2
print([unicodedata.name(c) for c in char])  # Output: ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']

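The second "á" is made up of two code points: U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT). This decomposed sequence renders the same as the single code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE), but the two strings are not equal when compared.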
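
Because the two forms look the same when printed, it can help to list the code points explicitly. The following is a minimal sketch; code_points is a hypothetical helper name introduced here for illustration, not part of unicodedata.

```python
def code_points(text: str) -> list[str]:
    """Hypothetical helper: format each code point of text as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in text]


print(code_points("\u00e1"))   # composed form
print(code_points("a\u0301"))  # decomposed form
```

The first call lists one code point and the second lists two, which matches the lengths reported by len().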

To normalize this string, we can use .normalize('NFC'), which returns the composed form:

print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))  # Output: '\xe1'
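
A common reason to normalize is string comparison. The sketch below assumes that NFC is an acceptable target form for the application; same_text is a hypothetical helper name used only for this illustration.

```python
import unicodedata

composed = "\u00e1"     # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT

# The strings render identically but differ code point by code point.
print(composed == decomposed)  # False


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))  # True
```

Normalizing both sides to the same form (NFD would work equally well here) is what makes the comparison succeed.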

Conversely, .normalize('NFD') returns the decomposed form:

print(ascii(unicodedata.normalize('NFD', '\u00E1')))  # Output: 'a\u0301'
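
To check which form a string is already in without converting it, unicodedata.is_normalized() can be used. This is a small sketch that assumes Python 3.8 or later, where that function was added.

```python
import unicodedata

composed = "\u00e1"     # single code point
decomposed = "a\u0301"  # base letter plus combining accent

# is_normalized(form, text) reports whether text is already in that form.
print(unicodedata.is_normalized("NFC", composed))    # True
print(unicodedata.is_normalized("NFD", composed))    # False
print(unicodedata.is_normalized("NFD", decomposed))  # True
print(unicodedata.is_normalized("NFC", decomposed))  # False
```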

Two additional normalization forms handle compatibility code points. NFKC and NFKD also replace compatibility characters with their plain equivalents; NFKC then composes the result, while NFKD leaves it decomposed. For example, U+2167 (ROMAN NUMERAL EIGHT) normalizes to "VIII" using NFKC:

print(unicodedata.normalize('NFKC', '\u2167'))  # Output: 'VIII'
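
Roman numerals are only one kind of compatibility character. As a rough illustration, the sketch below runs a few other sample characters (chosen here, not taken from the article) through both NFC and NFKC to show that only the compatibility form rewrites them.

```python
import unicodedata

# Sample compatibility characters, chosen for illustration only.
samples = {
    "\ufb01": "LATIN SMALL LIGATURE FI",
    "\u2460": "CIRCLED DIGIT ONE",
    "\uff21": "FULLWIDTH LATIN CAPITAL LETTER A",
    "\u00b2": "SUPERSCRIPT TWO",
}

for ch, name in samples.items():
    nfc = unicodedata.normalize("NFC", ch)    # canonical form: unchanged here
    nfkc = unicodedata.normalize("NFKC", ch)  # compatibility form: plain characters
    print(f"{name}: NFC={ascii(nfc)} NFKC={ascii(nfkc)}")
```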

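It's important to note that normalization is not always reversible. Different inputs can normalize to the same result, so the original string cannot be recovered from the normalized one. This is most visible with NFKC and NFKD: once U+2167 has become "VIII", nothing marks it as having been a single Roman numeral character.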
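
The loss of information can be seen in a short round trip. This sketch uses U+2167 from the example above, plus U+212B (ANGSTROM SIGN), an extra character chosen here to show that canonical normalization can also replace a character permanently.

```python
import unicodedata

roman = "\u2167"  # ROMAN NUMERAL EIGHT
flattened = unicodedata.normalize("NFKC", roman)
print(ascii(flattened))  # 'VIII'

# No normalization form turns the four letters back into U+2167.
restored = unicodedata.normalize("NFD", flattened)
print(restored == roman)  # False

angstrom = "\u212b"  # ANGSTROM SIGN
round_trip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", angstrom))
print(ascii(round_trip))  # '\xc5'
```

If the original code points matter, for example for display or auditing, one option is to keep the raw string and store the normalized form separately for comparison and search.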
