A common way to measure the similarity between two text documents is to convert them into TF-IDF (term frequency-inverse document frequency) vectors and compare those vectors with cosine similarity. The approach is standard in information retrieval textbooks and is described in detail in "Introduction to Information Retrieval."
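As a minimal sketch of what cosine similarity measures, the function below computes it for two vectors with NumPy. The vectors doc_a, doc_b, and doc_c are made-up term weights chosen for illustration, not real TF-IDF output.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Hypothetical term-weight vectors over a shared three-term vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, different length
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because the measure depends only on direction, a long document and a short one with the same term proportions score as highly similar.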
Python libraries such as Gensim and scikit-learn implement both the TF-IDF transformation and cosine similarity. With scikit-learn, the following snippet computes pairwise cosine similarity for a set of text files:
<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

# Read the documents from text files (text_files is assumed to be a list of file paths)
documents = [open(f).read() for f in text_files]

# Fit the vectorizer and transform the documents into a TF-IDF matrix
tfidf = TfidfVectorizer().fit_transform(documents)

# Calculate pairwise cosine similarity (matrix product of the TF-IDF rows)
pairwise_similarity = tfidf @ tfidf.T</code>
Alternatively, for documents that are already in memory as strings:
<code class="python">corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# Create a TF-IDF vectorizer that drops English stop words (min_df=1 is the default)
vect = TfidfVectorizer(min_df=1, stop_words="english")

# Fit the vectorizer and transform the corpus into a TF-IDF matrix
tfidf = vect.fit_transform(corpus)

# Calculate pairwise cosine similarity
pairwise_similarity = tfidf @ tfidf.T</code>
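pairwise_similarity is a sparse matrix in which each row and each column corresponds to a document in the corpus. Because TfidfVectorizer scales every row to unit length by default (norm="l2"), the dot product of two rows is their cosine similarity. Converting the sparse matrix to a NumPy array with toarray() shows that each cell holds the similarity between the two corresponding documents.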
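One way to check that the matrix product really is cosine similarity is to compare it with the cosine_similarity function from sklearn.metrics.pairwise. The sketch below reuses the five-sentence corpus; the names row_norms, product, reference, and matches are illustrative and do not appear in the original snippet.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Euclidean length of each row; the default norm="l2" should make these all 1
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Matrix product of the rows versus scikit-learn's own cosine similarity
product = (tfidf @ tfidf.T).toarray()
reference = cosine_similarity(tfidf)

matches = np.allclose(product, reference)
print(np.allclose(row_norms, 1.0), matches)
```

If the vectorizer were created with norm=None, the rows would no longer have unit length and the plain product would not be a cosine similarity; cosine_similarity normalizes its input itself, so it would still be.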
For instance, to find the document most similar to "The scikit-learn docs are Orange and Blue," mask out the diagonal (each document's similarity to itself, which would otherwise always be the maximum) with np.fill_diagonal(), locate the document's index in the corpus, and apply np.nanargmax to the corresponding row:
<code class="python">import numpy as np

arr = pairwise_similarity.toarray()
# Mask each document's similarity to itself
np.fill_diagonal(arr, np.nan)

input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)

# Index of the most similar other document, ignoring the NaN diagonal
result_idx = np.nanargmax(arr[input_idx])
print(corpus[result_idx])</code>
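For large datasets, keeping the matrix sparse conserves memory. In that case, skip the conversion to a dense array: use pairwise_similarity.shape to get the number of documents, set the diagonal of the sparse matrix to -1.0 to mask self-similarity, and call argmax() on the row directly: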
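<code class="python">n, _ = pairwise_similarity.shape
# Set the diagonal to -1.0 so a document is never its own best match
pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
# Column index of the most similar document in the input row
result_idx = pairwise_similarity[input_idx].argmax()
print(corpus[result_idx])</code>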
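When only one document's neighbors are needed, the full pairwise matrix can be avoided altogether by multiplying the TF-IDF matrix by that single row. The helper below is a sketch of that idea, not part of scikit-learn; the name top_k_similar and its parameters are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_similar(corpus, query_idx, k=2):
    """Return (document, score) pairs for the k documents closest to corpus[query_idx]."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    # Similarity of every document to the query row only (n x 1 instead of n x n)
    scores = (tfidf @ tfidf[query_idx].T).toarray().ravel()
    scores[query_idx] = -np.inf  # exclude the query document itself
    best = np.argsort(scores)[::-1][:k]
    return [(corpus[i], float(scores[i])) for i in best]

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

results = top_k_similar(corpus, query_idx=4, k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

The first result should be "I prefer scikit-learn to Orange", the same document found by the argmax approaches above, since it shares the most terms with the query.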