Measuring Textual Similarity with TF-IDF and Cosine Similarity
Determining how similar two text documents are is a core task in text mining and information retrieval. One popular approach combines TF-IDF (Term Frequency-Inverse Document Frequency) weighting with cosine similarity.
TF-IDF assigns each word in a document a weight that grows with the word's frequency in that document and with its rarity across the document corpus. Documents that use similar vocabulary therefore end up with TF-IDF vectors that point in similar directions.
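To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand. It assumes one common textbook variant (raw term count times log(N / df)) and a naive whitespace tokenizer; the function name tf_idf and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the raw count of the term in the document; IDF is log(N / df),
    where N is the number of documents and df is the number of documents
    containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document it appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["apple pie recipe", "apple orchard tour", "city bus tour"]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "apple" (found in two of the three documents) receives a lower weight than "pie" (found in only one). scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.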
Cosine similarity is the cosine of the angle between two vectors. For TF-IDF vectors, whose components are never negative, it ranges from 0 (no terms in common) to 1 (the vectors point in the same direction). In our case, the two vectors are the TF-IDF vectors of the documents being compared.
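As a rough illustration of the definition, the following sketch computes cosine similarity for two plain Python sequences. The name cosine_sim and the choice to return 0.0 for a zero vector are assumptions made for this example, not part of any library.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_sim([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_sim([1, 2], [2, 4]))  # same direction, different length
```

Because only the direction matters, scaling a vector (for example, doubling the length of a document by repeating it) leaves the similarity unchanged.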
Python Implementation
In Python, computing pairwise similarities with scikit-learn is straightforward:
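<code class="python">from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [Path(f).read_text() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# Rows are L2-normalized by default, so this product is the cosine similarity
pairwise_similarity = tfidf @ tfidf.T</code>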
Alternatively, if the documents are already strings, use:
<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "..."]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf @ tfidf.T</code>
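Multiplying the TF-IDF matrix by its transpose yields cosine similarities only because TfidfVectorizer L2-normalizes each row by default (norm="l2"). The sketch below checks this on a small made-up corpus by comparing the product against scikit-learn's cosine_similarity function; the variable names and sample sentences are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sample = ["red apple pie", "green apple tart", "blue city bus"]
tfidf = TfidfVectorizer().fit_transform(sample)

# L2 norm of every row; expected to be 1 under the default settings.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

product = (tfidf @ tfidf.T).toarray()   # dot products of unit-length rows
reference = cosine_similarity(tfidf)    # dense array by default

print(np.allclose(row_norms, 1.0))
print(np.allclose(product, reference))
```

If the vectorizer is created with norm=None, the rows are no longer unit length and the plain product stops being a cosine similarity; in that case cosine_similarity is the safer choice.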
Interpreting Results
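pairwise_similarity is a sparse matrix whose entry (i, j) is the cosine similarity between documents i and j. This works because TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is already their cosine. Each non-empty document has similarity 1 with itself, so to find the document most similar to a specific document, mask out the diagonal (set it to NaN) and find the position of the maximum value in that document's row using np.nanargmax():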
<code class="python">import numpy as np

arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)  # ignore each document's similarity to itself

# input_doc must be an element of corpus; otherwise corpus.index raises ValueError
input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)
result_idx = np.nanargmax(arr[input_idx])
similar_doc = corpus[result_idx]</code>
Other Considerations
For large corpora and vocabularies, keeping the similarity matrix sparse is more memory-efficient than converting it to a dense NumPy array, which allocates an entry for every pair of documents.
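One possible way to avoid the dense conversion, sketched below, is to compute only the similarity row for the document of interest and densify just that row. The helper name most_similar_row and the sample corpus are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar_row(tfidf, query_idx):
    """Index of the document most similar to row query_idx of tfidf."""
    # 1 x n_docs product; only this single row is converted to dense form.
    row = (tfidf[query_idx] @ tfidf.T).toarray().ravel()
    row[query_idx] = -np.inf  # exclude the document itself
    return int(np.argmax(row))


corpus = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(corpus)
print(corpus[most_similar_row(tfidf, 0)])
```

This needs memory proportional to the number of documents rather than its square. When a document shares no terms with any other, every candidate scores 0 and argmax simply returns the first index, so callers may want to check the score as well.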
By adjusting TfidfVectorizer parameters such as min_df (the minimum number or proportion of documents a term must appear in to be kept), the TF-IDF computation can be customized to suit specific requirements.
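The effect of such parameters is easiest to see on the fitted vocabulary. This sketch uses a tiny made-up corpus and a hypothetical helper, vocab; besides min_df it also shows max_df, a related TfidfVectorizer parameter not discussed above, which drops terms that appear in too many documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "apple cherry", "apple banana date"]


def vocab(**params):
    """Fit a TfidfVectorizer with the given parameters; return its sorted terms."""
    return sorted(TfidfVectorizer(**params).fit(docs).vocabulary_)


print(vocab())             # ['apple', 'banana', 'cherry', 'date']
print(vocab(min_df=2))     # ['apple', 'banana']: terms in at least 2 documents
print(vocab(max_df=0.9))   # ['banana', 'cherry', 'date']: 'apple' is in every document
```

An integer value is read as an absolute document count and a float as a proportion of the corpus, so min_df=2 and max_df=0.9 above use different units.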