如何從Python字串中刪除不可列印的字元？-Python教學-PHP中文網

如何從Python字串中刪除不可列印的字元？

Patricia Arquette

發布： 2024-10-22 06:57:02

原創

397 人瀏覽過

How to Remove Non-Printable Characters from Strings in Python?

從Python 中的字串中剝離不可列印的字元

與Perl 相比，Python 缺少POSIX 正規表示式類，因此很難檢測並使用正規表示式刪除不可列印的字元。

那麼，如何在 Python 中實現此目的？

一種方法是利用 unicodedata 模組。 unicodedata.category 函數將 Unicode 字元分為各種類別。例如，分類為 Cc（控制）的字元代表不可列印的字元。

利用這些知識，您可以建構一個符合所有控製字元的自訂字元類別：

<code class="python">import unicodedata
import re
import sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)</code>

登入後複製

此函數有效地從輸入字串中移除所有不可列印的 ASCII 字元。

或者，您可以使用 Python 內建的 string.printable 方法來過濾掉不可列印的字元。但是，此方法不包括 Unicode 字符，因此它可能不適合所有用例。

要處理Unicode 字符，您可以在正則表達式中擴展字符類，如下所示：

<code class="python">control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))</code>

登入後複製

此擴展字符類包含基本控製字符以及常見的不可打印Unicode 字符。

透過相應地修改remove_control_chars 函數，您可以成功處理 ASCII 和 Unicode 不可列印字元。

以上是如何從Python字串中刪除不可列印的字元？的詳細內容。更多資訊請關注PHP中文網其他相關文章！