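Python libraries such as Gensim and scikit-learn provide implementations of the TF-IDF transformation and of cosine similarity. With scikit-learn, the following snippet computes the pairwise cosine similarity between documents: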
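<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

# Extract documents from text files
# (text_files is assumed to be a list of file paths defined elsewhere)
documents = [open(f).read() for f in text_files]

# Create a TF-IDF vectorizer and transform the documents
tfidf = TfidfVectorizer().fit_transform(documents)

# Calculate pairwise cosine similarity.
# No extra normalization is needed: TfidfVectorizer L2-normalizes each row by default.
pairwise_similarity = tfidf * tfidf.T</code>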
Alternatively, if the documents are already plain strings:
<code class="python">corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]

# Create a TF-IDF vectorizer with a minimum document frequency of 1
# that excludes English stop words
vect = TfidfVectorizer(min_df=1, stop_words="english")

# Apply the TF-IDF transformation
tfidf = vect.fit_transform(corpus)

# Calculate pairwise cosine similarity
pairwise_similarity = tfidf * tfidf.T</code>
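pairwise_similarity is a sparse matrix in which each row and each column corresponds to one document in the corpus. Converting it to a NumPy array makes this easy to inspect: the cell at row i, column j holds the similarity between documents i and j.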
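The product `tfidf * tfidf.T` gives cosine similarity only because each TF-IDF row has unit length. The sketch below checks this on the example corpus by comparing the product with scikit-learn's `cosine_similarity`. The variable names `row_norms`, `product`, `reference`, and `shortcut_matches` are introduced here for illustration, and the check relies on the vectorizer's default `norm="l2"` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]

vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)

# With the default norm="l2", every non-empty row should have length 1.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Dot products of unit-length rows are cosine similarities.
product = (tfidf @ tfidf.T).toarray()

# Reference values computed by scikit-learn's own helper.
reference = cosine_similarity(tfidf)

shortcut_matches = np.allclose(product, reference)
print(np.round(product, 3))
print("matches cosine_similarity:", shortcut_matches)
```

If the vectorizer were created with `norm=None`, the plain product would no longer be a cosine similarity, and `cosine_similarity` would be the safer choice.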
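For example, to find the document most similar to "The scikit-learn docs are Orange and Blue", locate its index in the corpus, mask the diagonal (which holds each document's similarity to itself) with np.fill_diagonal(), and then apply np.nanargmax to the corresponding row: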
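<code class="python">import numpy as np

# Convert the sparse similarity matrix to a dense NumPy array
arr = pairwise_similarity.toarray()

# Mask self-similarity so a document is never matched with itself
np.fill_diagonal(arr, np.nan)

input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)

# Index of the highest similarity in that row, ignoring NaN
result_idx = np.nanargmax(arr[input_idx])
print(corpus[result_idx])</code>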
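For a large corpus and vocabulary, converting to a dense array gives up the memory savings of the sparse matrix. The same lookup can instead be done on the sparse matrix directly:
<code class="python">n, _ = pairwise_similarity.shape

# Overwrite the diagonal (self-similarity) with a value below any real similarity
pairwise_similarity[np.arange(n), np.arange(n)] = -1.0

# Column index of the most similar document in the input document's row
pairwise_similarity[input_idx].argmax()</code>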
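The same idea extends from a single best match to a ranked list. The following is a minimal sketch, not part of the original answer: `most_similar` is a hypothetical helper, and its parameters `query_idx` and `k` are illustrative names. It computes only the query row of the similarity matrix, masks the query itself, and sorts the remaining scores.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar(corpus, query_idx, k=2):
    """Hypothetical helper: return the k documents most similar to corpus[query_idx]."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

    # Similarity of the query row against every document (a 1 x n sparse result).
    scores = (tfidf[[query_idx]] @ tfidf.T).toarray().ravel()

    # Mask the query itself; real TF-IDF similarities are never negative.
    scores[query_idx] = -1.0

    # Indices of the k largest scores, best first.
    top = np.argsort(scores)[::-1][:k]
    return [(corpus[i], float(scores[i])) for i in top]


corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]

results = most_similar(corpus, corpus.index("The scikit-learn docs are Orange and Blue"))
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

Computing a single row avoids building the full n × n matrix when only one query document is of interest.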