Problem: You want to compute the similarity between two text documents to assess how closely their content matches.
Solution: The most common approach to measuring document similarity is to convert the documents into TF-IDF (Term Frequency-Inverse Document Frequency) vectors. TF-IDF weights each term by its frequency within a document and its rarity across the corpus, so distinctive terms dominate the representation. The cosine similarity between these vectors then quantifies how similar the documents are.
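For intuition, here is a minimal sketch of both building blocks computed by hand on a toy three-document corpus. The helper names (tfidf_weight, cosine) and the unsmoothed IDF formula idf = log(N / df) are illustrative simplifications, not scikit-learn's exact (smoothed, L2-normalized) variant, so the numbers will differ slightly from the library's output:
<code class="python">import math

# Toy pre-tokenized corpus, purely for illustration
docs = [["the", "cat", "sat"],
        ["the", "cat", "ran"],
        ["a", "dog", "barked"]]
vocab = sorted({term for doc in docs for term in doc})

def tfidf_weight(term, doc, docs):
    tf = doc.count(term) / len(doc)         # frequency within the document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / df)          # rarity across the corpus
    return tf * idf

vectors = [[tfidf_weight(t, doc, docs) for t in vocab] for doc in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(vectors[0], vectors[1]))  # ~0.21: shared "the"/"cat" overlap
print(cosine(vectors[0], vectors[2]))  # 0.0: no terms in common</code>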
Implementation: Python's Gensim and scikit-learn provide robust implementations for TF-IDF transformations. Using scikit-learn:
<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]

# TfidfVectorizer L2-normalizes each row by default, so the product of
# the matrix with its transpose yields cosine similarities directly
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T</code>
The resulting pairwise_similarity is a sparse matrix in which entry (i, j) holds the cosine similarity between documents i and j.
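As a quick sanity check, the snippet below runs the same pipeline on a hypothetical in-memory corpus (standing in for files read from disk) and densifies the result with SciPy's standard toarray() conversion so the matrix can be printed:
<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, in place of reading text files from disk
documents = [
    "Sparse matrices store TF-IDF weights efficiently",
    "Cosine similarity compares TF-IDF vectors",
    "A completely unrelated sentence about cooking pasta",
    "Sparse TF-IDF vectors and cosine similarity",
]
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T

# Diagonal entries are each document's similarity to itself
# (1.0 up to floating-point rounding)
print(pairwise_similarity.toarray().round(2))</code>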
Interpreting Results: The matrix is square, with one row and one column per document in the corpus. To find the document most similar to a given input document, densify the matrix, mask self-similarity on the diagonal with np.fill_diagonal(), and pick the highest remaining score with np.nanargmax():
<code class="python">import numpy as np

# Densify the sparse matrix and mask the diagonal (self-similarity)
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)

# Index of the document most similar to documents[input_idx]
result_idx = np.nanargmax(arr[input_idx])
most_similar_doc = documents[result_idx]</code>
Note that the argmax is performed on the masked array to avoid the trivial maximum of 1 (each document's similarity to itself).
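Since the solution also mentions Gensim, here is a sketch of the equivalent pipeline there. The whitespace tokenization is a deliberate simplification, and MatrixSimilarity keeps the whole index in memory; for corpora that do not fit in RAM, Gensim's similarities.Similarity class is the disk-backed alternative:
<code class="python">from gensim import corpora, models, similarities

# Naive whitespace tokenization, purely for illustration
tokenized = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(doc) for doc in tokenized]

tfidf_model = models.TfidfModel(bow)
index = similarities.MatrixSimilarity(tfidf_model[bow],
                                      num_features=len(dictionary))

# Cosine similarities of one query document against the whole corpus
input_idx = 0
sims = index[tfidf_model[bow[input_idx]]]
print(list(enumerate(sims)))</code>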