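NoisOCR is a Python library for simulating noise in text produced by optical character recognition (OCR). Such text can contain errors or annotations, reflecting the challenges of running OCR on low-quality documents or manuscripts. The library provides functions that simulate common errors in post-OCR text and split text into sliding windows, with or without hyphenation. This is useful for training neural network models for spelling correction.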
GitHub repository: NoisOCR
PyPI: NoisOCR on PyPI
You can install NoisOCR with pip:
pip install noisocr
The sliding_window function splits text into segments of limited size while keeping words intact.
import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing',
#   ...
#   'type and scrambled it to make a type specimen',
#   'book.'
# ]
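The article does not show how sliding_window is implemented, so the following is only a minimal sketch of one way to pack whole words into size-limited windows. The name split_into_windows and the greedy strategy are assumptions made for illustration, not the library's code.

```python
def split_into_windows(text, max_window_size):
    """Greedily pack whole words into windows of at most max_window_size characters.

    Hypothetical helper for illustration; not part of NoisOCR.
    """
    windows = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_window_size or not current:
            # A single word longer than the limit is kept whole rather than cut.
            current = candidate
        else:
            # The word does not fit: close the window and start a new one.
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
for window in split_into_windows(sample, 50):
    print(repr(window))
```

With a limit of 50 characters, the first window ends after "printing" because adding "and" would bring it to 52 characters.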
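With hyphenation, the function tries to fit words that would exceed a window's character limit by inserting hyphens where needed. This feature supports multiple languages through the PyHyphen package.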
import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',
#   'typesetting industry. Lorem Ipsum has been the in-',
#   ...
#   'scrambled it to make a type specimen book.'
# ]
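When hyphenated windows are used as model input, it can help to be able to stitch them back together, for example to compare against the original text. The sketch below assumes that a window ending in "-" always marks a word split across windows. That assumption fails for text that genuinely ends a window with a hyphen. The helper join_windows is hypothetical and not part of NoisOCR.

```python
def join_windows(windows):
    """Rejoin windows, merging words that were split with a trailing hyphen.

    Hypothetical helper; assumes a trailing '-' always marks a split word.
    """
    parts = []
    for window in windows:
        window = window.strip()
        if not window:
            continue
        if window.endswith("-"):
            # The word continues in the next window: drop the hyphen, add no space.
            parts.append(window[:-1])
        else:
            parts.append(window + " ")
    return "".join(parts).strip()


print(join_windows(["Lorem Ipsum has been the in-", "dustry standard"]))
```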
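The simulate_errors function lets you add random errors to text, simulating problems common in post-OCR text. The errors, such as character swaps, missing spaces, and extra characters, are generated by the typo library.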
import noisocr

text = "Hello world."

text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!

text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!

text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!
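The annotation simulation feature lets you add custom markers to text, drawn from a set of annotations that includes those used in the BRESSAY dataset.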
import noisocr

text = "Hello world."

text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$

text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##

text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello world.
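The core functionality of NoisOCR builds on existing libraries: typo for simulating errors and PyHyphen (imported as hyphen) for hyphenating words across languages. The key functions are explained below.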
The simulate_annotation function picks a random word from the text and annotates it using one of a defined set of annotations.
import random

annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@',
    '##--xxx--##', '$$--xxx--$$', '--xxx--',
    '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()

    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text

    if random.random() < probability:
        annotation = random.choice(annotations)

        if 'text' in annotation:
            annotated_text = annotation.replace('text', target_word)
        else:
            annotated_text = annotation

        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text
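Because the listing above draws from the global random module, its output changes from run to run. For tests, one option is to pass in a dedicated random.Random instance. The sketch below is a simplified variant under that assumption. The names annotate_once and MARKERS are hypothetical, and only a subset of the annotation list is used.

```python
import random

# Subset of the annotation templates shown above; 'text' is replaced by the word.
MARKERS = ['##--xxx--##', '##--text--##', '$$text$$', '--text--']


def annotate_once(text, probability, rng, markers=MARKERS):
    """Annotate one random word with the given probability, using rng for all draws.

    Hypothetical variant for illustration; not the library's function.
    """
    words = text.split()
    # As in the listing above, single-word texts are returned unchanged.
    if len(words) <= 1 or rng.random() >= probability:
        return text
    target = rng.choice(words)
    marker = rng.choice(markers)
    replacement = marker.replace('text', target) if 'text' in marker else marker
    # Replaces the first occurrence of the chosen word as a substring.
    return text.replace(target, replacement, 1)


rng = random.Random(0)  # fixed seed, so repeated runs give the same result
print(annotate_once("Hello world.", 1.0, rng))
print(annotate_once("Hello world.", 0.0, rng))  # probability 0 leaves the text as is
```

Since the replacement relies on str.replace, the first occurrence of the chosen word is matched as a substring, so it can land inside an earlier, longer word (for example "is" inside "This").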
The simulate_errors function applies a series of errors to the text, each chosen at random from the methods offered by the typo library.
import random
import typo

def simulate_errors(text, interactions=3, seed=None):
    methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char",
               "skipped_space", "random_space", "repeated_char", "unichar"]

    if seed is not None:
        random.seed(seed)
    else:
        random.seed()

    instance = typo.StrErrer(text)
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result

    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)

    return text
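Read literally, the listing above applies one error before it checks the counter, so a call with interactions=n performs n + 1 edits. When a seed is given, it also reseeds the generator on every recursive call, so random.choice picks the same method at each step. The sketch below shows an iterative alternative that seeds one generator once and applies exactly count edits. To stay self-contained it uses three simple stand-in error functions instead of the typo library. All names here are hypothetical.

```python
import random


def swap_adjacent(text, rng):
    """Swap two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_char(text, rng):
    """Delete one character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def repeat_char(text, rng):
    """Duplicate one character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]


# Stand-ins for the typo library's methods (assumption for illustration).
ERROR_FUNCS = [swap_adjacent, drop_char, repeat_char]


def apply_errors(text, count=3, seed=None):
    """Apply exactly `count` random edits using a single seeded generator."""
    rng = random.Random(seed)  # seeded once, so successive draws differ
    for _ in range(count):
        text = rng.choice(ERROR_FUNCS)(text, rng)
    return text


print(apply_errors("Hello world.", count=2, seed=42))
```

A local random.Random instance also avoids changing the global generator's state, which matters when the caller relies on its own seeding.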
These functions split text into sliding windows, with or without hyphenation. The hyphenating variant is shown here.
from hyphen import Hyphenator

def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []
    remaining_word = ""

    for word in words:
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""

        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            current_window.append(word)
        else:
            syllables = hyphenator.syllables(word)
            temp_word = ""

            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        remaining_word = word + " "
                        break
            else:
                current_window.append(temp_word)
                remaining_word = ""

            windows.append(" ".join(current_window))
            current_window = []

    if remaining_word:
        current_window.append(remaining_word)

    if current_window:
        windows.append(" ".join(current_window))

    return windows
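NoisOCR gives people working on post-OCR text correction the tools to simulate real-world conditions in which digitized text is prone to errors and annotations. Whether for automated testing, developing text-correction models, or analyzing datasets such as BRESSAY, the library is a versatile and easy-to-use option.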
Check out the project on GitHub (NoisOCR) and contribute to its improvement!