How Can I Calculate the Similarity Between Different Text Documents?-Python Tutorial-php.cn

How Can I Calculate the Similarity Between Different Text Documents?

Patricia Arquette

Release: 2024-10-23 06:48:29


How to Determine the Similarity Between Text Documents

Problem: You want to measure how similar two text documents are. The approach below compares the terms the documents use, so it captures overlap in vocabulary rather than deeper semantic meaning.

Solution: A common approach is to convert the documents into TF-IDF (Term Frequency-Inverse Document Frequency) vectors. TF-IDF weights each term by how often it appears in a document and how rare it is across the corpus. The cosine similarity between two such vectors then quantifies how similar the documents are.
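To make the two steps concrete, here is a minimal standard-library sketch. It assumes the textbook weighting tf × log(N / df), whitespace tokenization, and three made-up sentences; the function names are illustrative only. Libraries such as scikit-learn use a smoothed IDF variant and a different tokenizer, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share several terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no terms
print(sim_01, sim_02)
```

The first two sentences share terms and get a score strictly between 0 and 1, while the first and third share none and score 0.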

Implementation: Both Gensim and scikit-learn provide TF-IDF implementations for Python. Using scikit-learn:

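<code class="python">from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files: a list of paths to the documents being compared
documents = [Path(f).read_text() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)

# Rows are L2-normalized by default, so this matrix product
# yields the cosine similarity of every document pair
pairwise_similarity = tfidf @ tfidf.T</code>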

The resulting pairwise_similarity is a sparse matrix in which entry (i, j) is the cosine similarity between documents i and j.
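As a quick sanity check, the sketch below runs the same scikit-learn pipeline on three in-memory strings (invented for illustration) instead of files, and converts the result to a dense array so it can be printed. It relies on TfidfVectorizer's default L2 normalization of each row, which is what makes the matrix product a cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(documents)  # rows are L2-normalized by default
pairwise_similarity = (tfidf @ tfidf.T).toarray()   # dense 3 x 3 array

print(pairwise_similarity.shape)
print(np.round(pairwise_similarity, 3))
```

The matrix should be symmetric with ones on the diagonal, since every document is identical to itself, and the entry for two documents with no words in common should be zero.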

Interpreting Results: For n documents, the matrix is n × n. To find the document most similar to a given input document, convert the matrix to a dense array, use NumPy's np.fill_diagonal() to mask each document's similarity to itself with NaN, and then apply np.nanargmax() to the input document's row. Here input_idx is the position of the input document in documents:

<code class="python">import numpy as np
arr = pairwise_similarity.toarray()  # dense copy of the sparse matrix
np.fill_diagonal(arr, np.nan)  # mask self-similarity
most_similar_doc = documents[np.nanargmax(arr[input_idx])]</code>

Note that the argmax is taken over the masked array. Without the mask it would return the trivial maximum of 1, which is each document's similarity to itself.
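The masking step can also be wrapped in a small helper that returns the top few matches rather than only the best one. This is a sketch under stated assumptions: most_similar and its top_k parameter are hypothetical names, and the similarity values are made up. It works on a copy, so the caller's matrix is left unchanged.

```python
import numpy as np


def most_similar(similarity, input_idx, top_k=1):
    """Return indices of the top_k documents most similar to input_idx.

    `similarity` is a dense square array of pairwise similarities.
    """
    scores = np.array(similarity, dtype=float)  # copy, so the input is not modified
    np.fill_diagonal(scores, np.nan)            # mask self-similarity
    row = scores[input_idx]
    candidates = np.flatnonzero(~np.isnan(row))  # every index except the masked one
    ranked = candidates[np.argsort(row[candidates])[::-1]]  # highest score first
    return ranked[:top_k].tolist()


# Made-up similarity matrix for three documents
similarity = np.array([
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.1],
    [0.0, 0.1, 1.0],
])
nearest = most_similar(similarity, input_idx=0)
ranking = most_similar(similarity, input_idx=2, top_k=2)
print(nearest, ranking)
```

Converting a large sparse similarity matrix to a dense array can use a lot of memory, so for big corpora it may be preferable to densify only the row of interest.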
