How can we efficiently split text without spaces into a list of words using word frequency and dynamic programming?


DDD

Published: 2024-11-04 10:13:30



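Splitting Text Without Spaces into a List of Words

Overview

Given a string made of words concatenated without spaces, this article presents an efficient algorithm for splitting it back into individual words.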

Problem Statement

Input: "tableapplechairtablecupboard..."

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

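Algorithm Overview

Rather than taking a naive approach, the algorithm uses relative word frequency to improve accuracy. It assumes that words are independently distributed and follow Zipf's law, so the word of rank n in a list of N words has probability of roughly 1/(n log N) and therefore a cost of log(n log N). Dynamic programming then finds the sequence of words with the lowest total cost, which is the most probable one.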
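To make the cost model concrete, here is a minimal sketch under the Zipf assumption described above: each word's cost is log(n log N), and the cost of a candidate split is the sum of its word costs. The tiny `ranked_words` list and the `split_cost` helper are hypothetical, made up for illustration; they are not part of the article's code.

```python
from math import log

# Hypothetical frequency-ranked list (most frequent first), for illustration only.
ranked_words = ["table", "cup", "board", "apple", "chair", "cupboard"]

# Zipf's law: rank n out of N words has probability ~ 1 / (n * log N),
# so the cost (negative log probability) is log(n * log N).
N = len(ranked_words)
word_cost = {w: log((rank + 1) * log(N)) for rank, w in enumerate(ranked_words)}

def split_cost(pieces):
    """Total cost of a candidate split; words not in the list cost infinity."""
    return sum(word_cost.get(p, float("inf")) for p in pieces)

one_word = split_cost(["cupboard"])
two_words = split_cost(["cup", "board"])
print(one_word, two_words)
```

With this particular list the single word `cupboard` comes out cheaper than `cup` plus `board`. A different ranking could reverse that, which is why the quality of the frequency list matters.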

Code

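<code class="python">from math import log

# Build a cost dictionary, assuming Zipf's law: the word at index i (rank i+1)
# costs log((i+1) * log(N)), the negative log of its estimated probability.
# words-by-frequency.txt holds whitespace-separated words, most frequent first.
with open("words-by-frequency.txt") as f:
    words = f.read().split()
wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Use dynamic programming to infer where the spaces go in a string without spaces."""

    # Find the best match for the first i characters, assuming cost has
    # already been built for the first i-1 characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], float("inf")), k + 1) for k, c in candidates)

    # Build the cost array: cost[i] is the cheapest split of s[:i].
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack from the end to recover the minimal-cost split.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k

    return " ".join(reversed(out))

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))</code>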
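The listing above depends on an external `words-by-frequency.txt`, so the following sketch applies the same cost model and dynamic program to a small in-memory list that can be run directly. The `segment` function, its `ranked_words` parameter, and the sample word list are assumptions for illustration. Unlike the listing above, this variant stores the chosen word length at each position instead of recomputing the best match during backtracking.

```python
from math import log

def segment(s, ranked_words):
    """Split s into words; ranked_words is ordered most frequent first."""
    n_words = len(ranked_words)
    cost_of = {w: log((r + 1) * log(n_words)) for r, w in enumerate(ranked_words)}
    max_len = max(len(w) for w in ranked_words)

    # best[i] = (cost, length of last word) for the cheapest split of s[:i]
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        best.append(min(
            (best[j][0] + cost_of.get(s[j:i], float("inf")), i - j)
            for j in range(max(0, i - max_len), i)
        ))

    # Walk backwards through the stored word lengths to rebuild the split.
    out = []
    i = len(s)
    while i > 0:
        k = best[i][1]
        out.append(s[i - k:i])
        i -= k
    return out[::-1]

# Hypothetical frequency-ranked sample list, for illustration only.
sample_words = ["table", "apple", "chair", "cup", "board", "cupboard"]
result = segment("tableapplechairtablecupboard", sample_words)
print(result)
```

With this sample list the call returns `['table', 'apple', 'chair', 'table', 'cupboard']`, because `cupboard` costs less than `cup` and `board` combined. A real frequency-ranked dictionary may rank these words differently and so choose a different split.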
