NoisOCR: A Python Library for Simulating Post-OCR Noisy Texts

Susan Sarandon

Published: 2024-10-13 06:16:30



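NoisOCR is a Python library designed to simulate the noise found in text produced by optical character recognition (OCR). Such text may contain errors and annotations, reflecting the challenges of running OCR on low-quality documents and manuscripts. The library provides functions for simulating common post-OCR errors and for splitting text into sliding windows, with or without hyphenation, which can help when training neural network models for spelling correction.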

GitHub repository: NoisOCR

PyPI: NoisOCR on PyPI

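Features

Sliding window: splits long texts into smaller segments without breaking words.
Sliding window with hyphenation: uses hyphenation to fit words within the character limit.
Text error simulation: adds random errors to simulate low-accuracy post-OCR text.
Text annotation simulation: inserts annotations like those in the BRESSAY dataset to mark words or phrases in the text.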

Installation

You can install NoisOCR with pip:

pip install noisocr


Usage examples

1. Sliding window

This function splits the text into segments of limited size while keeping words intact.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]


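2. Sliding window with hyphenation

With hyphenation enabled, the function tries to fit words that would exceed the per-window character limit by inserting hyphens where needed. This feature supports multiple languages through the PyHyphen package.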

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',        
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]


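3. Simulating text errors

The simulate_errors function lets you add random errors to text, emulating problems commonly seen in post-OCR output. It relies on the typo library to generate errors such as swapped characters, missing spaces, and extra characters.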

import noisocr

text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Example output (results are random): Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Example output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Example output: fllo,w0rlr!
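
Because the output is random, one natural use is to generate several noisy variants of each clean sentence when preparing data for a correction model. The sketch below shows one way to assemble such (noisy, clean) pairs. `make_training_pairs` and `drop_random_char` are hypothetical helpers written for this illustration; `drop_random_char` only stands in for `noisocr.simulate_errors` so that the example runs without the library installed.

```python
import random

def drop_random_char(text, rng):
    """Stand-in noise function (hypothetical): delete one random character."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def make_training_pairs(sentences, noise_fn, copies=2):
    """Return (noisy, clean) pairs, with `copies` noisy variants per sentence."""
    pairs = []
    for clean in sentences:
        for _ in range(copies):
            pairs.append((noise_fn(clean), clean))
    return pairs

rng = random.Random(0)  # fixed seed so the run is repeatable
pairs = make_training_pairs(
    ["Hello, world!", "Optical character recognition"],
    lambda s: drop_random_char(s, rng),
)
for noisy, clean in pairs:
    print(repr(noisy), "<-", repr(clean))
```

To use the library instead of the stand-in, something like `lambda s: noisocr.simulate_errors(s, 2)` could be passed as `noise_fn`.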


4. Simulating text annotations

The annotation simulation feature lets you add custom markings to text, drawn from a set of annotations that includes those used in the BRESSAY dataset.

import noisocr

text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Example output (results are random): Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Example output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Example output: Hello, world!


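Code overview

The core functionality of NoisOCR builds on libraries such as typo, which simulates errors, and hyphen, which handles word hyphenation across languages. The key functions are described below.

1. The simulate_annotation function

The simulate_annotation function picks a random word from the text and annotates it according to a predefined set of annotations.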

import random

annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@', '##--xxx--##', 
    '$$--xxx--$$', '--xxx--', '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

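def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()

    # Texts with fewer than two words are returned unchanged.
    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text

    # Annotate only with the given probability; otherwise return the text as-is.
    if random.random() < probability:
        annotation = random.choice(annotations)
        if 'text' in annotation:
            # Templates containing 'text' wrap the chosen word.
            annotated_text = annotation.replace('text', target_word)
        else:
            # Other templates replace the word outright.
            annotated_text = annotation

        # Replace the first occurrence of the chosen word in the text.
        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text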
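
One detail worth noting in the listing above: `str.replace` substitutes the first occurrence of the chosen word as a substring, so a short target such as `is` could end up replaced inside an earlier word such as `This`. The sketch below shows an index-based alternative that uses a few of the same annotation templates. `annotate_by_index` is a hypothetical name chosen for this illustration, and this is not the library's implementation.

```python
import random

# A subset of the annotation templates listed above.
ANNOTATIONS = ['##--xxx--##', '$$--text--$$', '##text##', '--text--']

def annotate_by_index(text, annotations=ANNOTATIONS, probability=0.01, rng=random):
    """Annotate one whitespace-separated token, chosen by position."""
    words = text.split()
    # Leave short texts alone, and annotate only with the given probability.
    if len(words) <= 1 or rng.random() >= probability:
        return text
    i = rng.randrange(len(words))
    template = rng.choice(annotations)
    # Templates containing 'text' wrap the word; the others replace it outright.
    words[i] = template.replace('text', words[i]) if 'text' in template else template
    # Rejoining with single spaces normalizes the original whitespace.
    return " ".join(words)

print(annotate_by_index("This is a test", probability=1.0, rng=random.Random(1)))
```

Passing a `random.Random` instance as `rng` keeps runs repeatable without touching the module-level generator.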


2. The simulate_errors function

The simulate_errors function applies a variety of errors to the text, each chosen at random from the typo library.

import random
import typo

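def simulate_errors(text, interactions=3, seed=None):
    # Names of the typo.StrErrer methods that can be applied.
    methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char", "skipped_space", "random_space", "repeated_char", "unichar"]

    # Seed the module-level generator; without a seed it is re-initialized.
    if seed is not None:
        random.seed(seed)
    else:
        random.seed()

    # Pick one error method at random and apply it to the text.
    instance = typo.StrErrer(text)
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result

    # Recurse while the counter is positive. One error has already been applied
    # above, so the text receives interactions + 1 errors in total.
    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)

    return text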
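
The control flow above has two consequences that are easy to miss: one error is applied before the counter is checked, so the text receives `interactions + 1` edits, and re-seeding on every recursive call means that a fixed seed selects the same method each time. A loop-based sketch that draws `n` edits from a single generator might look like the following. The three edit functions are simplified stand-ins written for this illustration, with hypothetical names; they are not the typo library's implementations.

```python
import random

def swap_adjacent(text, rng):
    """Swap two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def drop_char(text, rng):
    """Delete one character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def drop_space(text, rng):
    """Delete one space, if the text has any."""
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if not spaces:
        return text
    i = rng.choice(spaces)
    return text[:i] + text[i + 1:]

def apply_edits(text, n=3, seed=None):
    """Apply n randomly chosen edits using a single generator.

    An individual edit may leave the text unchanged (for example,
    drop_space on a text that contains no spaces).
    """
    rng = random.Random(seed)
    edits = [swap_adjacent, drop_char, drop_space]
    for _ in range(n):
        text = rng.choice(edits)(text, rng)
    return text

print(apply_edits("Hello, world!", n=2, seed=42))
```

Because the generator is created once, a fixed seed gives a repeatable sequence of different edits rather than the same edit repeated.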


3. The sliding_window and sliding_window_with_hyphenation functions

These functions split text into sliding windows, with or without hyphenation.

from hyphen import Hyphenator

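def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    # Split text into windows, hyphenating a word when it does not fit
    # in the current window.
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []
    remaining_word = ""

    for word in words:
        # Prepend whatever was carried over from the previous window.
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""

        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            # The word fits in the current window.
            current_window.append(word)
        else:
            # The word does not fit: try to place its leading syllables.
            syllables = hyphenator.syllables(word)
            temp_word = ""
            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        # Some syllables fit: end the window with a hyphenated
                        # fragment and carry the rest over.
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        # Not even the first syllable fits: carry the whole word over.
                        remaining_word = word + " "
                        break
            else:
                # No break occurred, so every syllable was consumed.
                current_window.append(temp_word)
                remaining_word = ""

            # Close the current window and start a new one.
            windows.append(" ".join(current_window))
            current_window = []

    # Flush any carried-over word and the final window.
    if remaining_word:
        current_window.append(remaining_word)
    if current_window:
        windows.append(" ".join(current_window))

    return windows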
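
The listing above covers only the hyphenating variant. For comparison, a word-preserving split can be sketched in a few lines by greedily packing whole words into each window. `split_into_windows` is a hypothetical name used for this illustration, and the code is a minimal sketch rather than the library's `sliding_window` implementation.

```python
def split_into_windows(text, window_size=50):
    """Greedily pack whole words into windows of at most window_size characters.

    A single word longer than window_size is kept whole in its own window,
    so that window exceeds the limit.
    """
    windows = []
    current = []
    current_len = 0
    for word in text.split():
        # Cost of adding this word: its length plus a joining space if needed.
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > window_size:
            # The word does not fit: close the window and start a new one.
            windows.append(" ".join(current))
            current = [word]
            current_len = len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        windows.append(" ".join(current))
    return windows

print(split_into_windows("aaa bbb ccc", window_size=7))
# → ['aaa bbb', 'ccc']
```

Tracking the running length avoids rejoining the window on every word, which keeps the split linear in the length of the text.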


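Conclusion

NoisOCR gives anyone working on post-OCR text correction a practical way to simulate real-world scenarios in which digitized text picks up errors and annotations. Whether for automated testing, developing text correction models, or analyzing datasets such as BRESSAY, the library is a versatile and easy-to-use option.

Check out the project on GitHub (NoisOCR) and contribute to its improvement!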
