NoisOCR: A Python Library for Simulating Post-OCR Noisy Texts

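NoisOCR is a Python library for simulating the noise found in text produced by optical character recognition (OCR). Such text may contain errors or annotations, reflecting the challenges of running OCR on low-quality documents or manuscripts. The library provides functions that simulate common post-OCR errors and split text into sliding windows, with or without hyphenation, which helps when training neural network models for spelling correction.

GitHub repository: NoisOCR

PyPI: NoisOCR on PyPI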

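Features

Sliding window: splits long text into smaller segments without breaking words.
Sliding window with hyphenation: hyphenates words so that they fit within the character limit.
Simulated text errors: adds random errors to imitate low-accuracy post-OCR text.
Simulated text annotations: inserts annotations like those in the BRESSAY dataset to mark words or phrases in the text.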

Installation

NoisOCR can be installed with pip:

pip install noisocr

Usage examples

1. Sliding window

This function splits text into segments of limited size while keeping words intact.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]

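2. Sliding window with hyphenation

With hyphenation, the function tries to fit words that would exceed a window's character limit by inserting hyphens where needed. This feature supports multiple languages through the PyHyphen package.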

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',        
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]

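3. Simulating text errors

The simulate_errors function lets users add random errors to a text, imitating problems common in post-OCR text. Errors such as swapped characters, missing spaces, and extra characters are generated with the typo library.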

import noisocr

text = "Hello world."
# Errors are chosen at random, so the outputs below are illustrative only.
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!

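4. Simulating text annotations

The annotation simulation lets users add custom markers to a text based on a set of annotations, including those from the BRESSAY dataset.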

import noisocr

text = "Hello world."
# The word and the annotation are chosen at random, so the outputs below are illustrative only.
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello world.

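Code overview

The core functionality of NoisOCR builds on other libraries: typo for simulating errors, and hyphen (PyHyphen) for hyphenating words across different languages. The key functions are explained below.

1. The simulate_annotation function

The simulate_annotation function picks a random word from the text and annotates it using one of a defined set of annotations.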

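import random

# Annotation templates; the substring 'text' is a placeholder for the selected word.
annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@', '##--xxx--##', 
    '$$--xxx--$$', '--xxx--', '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()

    # Texts with fewer than two words are returned unchanged.
    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text

    # Annotate with the given probability; otherwise return the text as is.
    if random.random() < probability:
        annotation = random.choice(annotations)
        if 'text' in annotation:
            # Wrap the selected word in the annotation markers.
            annotated_text = annotation.replace('text', target_word)
        else:
            # Templates without the placeholder replace the word entirely.
            annotated_text = annotation

        # Only the first occurrence of the selected word is replaced.
        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text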
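
One detail of the function above is worth noting: text.replace(target_word, ..., 1) matches substrings, so for "This is a test" with "is" as the selected word, the first match is inside "This". The following is a minimal sketch of an alternative that replaces the chosen token by its index. The name annotate_by_index, the shortened ANNOTATIONS list, and the rng parameter are assumptions made for illustration and are not part of NoisOCR.

```python
import random

# Shortened, illustrative subset of the annotation templates shown above.
ANNOTATIONS = ['##--xxx--##', '$$--xxx--$$', '--xxx--', '##text##', '$$text$$', '--text--']

def annotate_by_index(text, annotations=ANNOTATIONS, probability=0.01, rng=random):
    """Annotate one randomly chosen token, addressed by position rather than by value."""
    words = text.split()
    # Leave short texts alone, and annotate only with the given probability.
    if len(words) <= 1 or rng.random() >= probability:
        return text

    i = rng.randrange(len(words))      # position of the token to annotate
    template = rng.choice(annotations)
    if 'text' in template:
        words[i] = template.replace('text', words[i])   # wrap the token
    else:
        words[i] = template                             # replace the token
    return " ".join(words)


# A dedicated random.Random instance keeps the example repeatable.
print(annotate_by_index("This is a test", probability=1.0, rng=random.Random(0)))
```

Because the result is rebuilt with " ".join, runs of whitespace in the input collapse to single spaces, which may or may not be acceptable depending on the data.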

2. The simulate_errors function

The simulate_errors function applies various errors to the text, each chosen at random from the methods of the typo library.

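import random
import typo

def simulate_errors(text, interactions=3, seed=None):
    # Names of the typo.StrErrer methods that can be applied.
    methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char", "skipped_space", "random_space", "repeated_char", "unichar"]

    # Re-seed the global random generator on every call.
    if seed is not None:
        random.seed(seed)
    else:
        random.seed()

    # Apply one randomly chosen error method to the text.
    instance = typo.StrErrer(text)
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result

    # Recurse until the counter reaches zero. One method is applied before the
    # counter is checked, so interactions=n applies n + 1 methods in total.
    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)

    return text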
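
To make the idea concrete without depending on typo, here is a minimal standard-library sketch of the same kind of noise injection. It is not the typo library's implementation: the helper names (swap_chars, drop_char, repeat_char, drop_space, add_noise) are hypothetical and cover only a few of the error types listed above. It also differs from simulate_errors in two deliberate ways: it loops instead of recursing, so exactly n_errors operations are applied, and it draws from a local random.Random so the global generator is not re-seeded.

```python
import random

def swap_chars(s, rng):
    """Swap two adjacent characters at a random position."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]

def drop_char(s, rng):
    """Delete one randomly chosen character."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]

def repeat_char(s, rng):
    """Duplicate one randomly chosen character."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i + 1] + s[i] + s[i + 1:]

def drop_space(s, rng):
    """Remove one randomly chosen space, if the text has any."""
    spaces = [i for i, ch in enumerate(s) if ch == " "]
    if not spaces:
        return s
    i = rng.choice(spaces)
    return s[:i] + s[i + 1:]

OPERATIONS = [swap_chars, drop_char, repeat_char, drop_space]

def add_noise(text, n_errors=3, seed=None):
    """Apply n_errors randomly chosen operations, one after another."""
    rng = random.Random(seed)   # local generator; global random state is untouched
    for _ in range(n_errors):
        text = rng.choice(OPERATIONS)(text, rng)
    return text


# The same seed always yields the same noisy text.
print(add_noise("Hello world.", n_errors=2, seed=42))
```

Passing a fixed seed makes a noisy dataset reproducible, while seed=None gives a different result on each run.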

3. The sliding_window and sliding_window_with_hyphenation functions

These functions split text into sliding windows, with or without hyphenation.

from hyphen import Hyphenator

def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []
    remaining_word = ""

    for word in words:
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""

        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            current_window.append(word)
        else:
            syllables = hyphenator.syllables(word)
            temp_word = ""
            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        remaining_word = word + " "
                        break
            else:
                current_window.append(temp_word)
                remaining_word = ""

            windows.append(" ".join(current_window))
            current_window = []

    if remaining_word:
        current_window.append(remaining_word)
    if current_window:
        windows.append(" ".join(current_window))

    return windows
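
Only the hyphenating variant is listed above. For comparison, the version without hyphenation can be sketched as a greedy packing of whole words. This is an illustration under assumptions, not NoisOCR's actual sliding_window source: the name sliding_window_sketch is hypothetical, and giving an over-long word a window of its own is a choice made here.

```python
def sliding_window_sketch(text, window_size=80):
    """Greedily pack whole words into windows of at most window_size characters."""
    windows = []
    current = []      # words collected for the window being built
    current_len = 0   # length of " ".join(current)

    for word in text.split():
        # Window length if this word were appended (one extra joining space).
        needed = len(word) if not current else current_len + 1 + len(word)
        if needed <= window_size or not current:
            # The word fits, or it is longer than a whole window and
            # therefore gets a window to itself.
            current.append(word)
            current_len = needed
        else:
            # Close the current window and start a new one with this word.
            windows.append(" ".join(current))
            current = [word]
            current_len = len(word)

    if current:
        windows.append(" ".join(current))
    return windows


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
for window in sliding_window_sketch(text, 30):
    print(repr(window))
# prints:
# 'Lorem Ipsum is simply dummy'
# 'text of the printing and'
# 'typesetting industry.'
```

Joining the windows with single spaces gives back the original words in order, so no text is lost, only regrouped.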
