How Can You Determine the Similarity Between Text Documents in Python?

Patricia Arquette
Release: 2024-10-23 06:52:02

Determining Text Similarity

In natural language processing (NLP), measuring the similarity between two text documents is a common task. A widely used approach is to convert the documents into TF-IDF (term frequency-inverse document frequency) vectors and compute the cosine similarity between them.
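
To make the idea concrete, here is a minimal from-scratch sketch using only the standard library. The function names (`tfidf_vectors`, `cosine_similarity`), the whitespace tokenizer, and the sample sentences are illustrative assumptions, not part of any library; the IDF weighting is a simplified smoothed form, so the numbers will not necessarily match what a library produces.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so terms shared by every document keep a nonzero weight
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents share several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents share no terms
```

Documents with no terms in common score 0, and a document compared with itself scores 1; everything else falls in between.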

Implementing TF-IDF and Cosine Similarity

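In Python, both Gensim and scikit-learn provide implementations of TF-IDF and cosine similarity. The following code uses scikit-learn to transform documents into TF-IDF vectors and compute their pairwise similarity. Because TfidfVectorizer L2-normalizes each row by default, multiplying the matrix by its transpose yields the cosine similarities directly: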

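<code class="python">from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# Paths of the files to compare (placeholder names; substitute your own)
text_files = ["doc1.txt", "doc2.txt", "doc3.txt"]

# Load each document as a single string
documents = [Path(f).read_text(encoding="utf-8") for f in text_files]

# Fit the vectorizer and transform the documents into a sparse TF-IDF matrix
tfidf = TfidfVectorizer().fit_transform(documents)

# Compute pairwise similarity (matrix product of the L2-normalized rows)
pairwise_similarity = tfidf @ tfidf.T</code>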
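
For experimenting without files on disk, the same steps can be sketched with in-memory strings. The sample sentences below are hypothetical stand-ins for real document contents, and the result is converted to a dense array only to make it easy to inspect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical in-memory documents standing in for file contents
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

# Sparse matrix with one L2-normalized TF-IDF row per document
tfidf = TfidfVectorizer().fit_transform(documents)

# Product of normalized rows gives the pairwise cosine similarities;
# convert to a dense array for easy inspection (fine for small corpora)
pairwise_similarity = (tfidf @ tfidf.T).toarray()

print(pairwise_similarity.round(2))
```

The result is a square, symmetric matrix: entry (i, j) is the similarity between documents i and j, and the diagonal holds each document's similarity to itself.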

Interpreting the Results

pairwise_similarity is a sparse matrix of similarity scores, where entry (i, j) is the cosine similarity between documents i and j. Each document's similarity to itself is 1, so the diagonal must be masked out before searching for the best match. The code below finds the document most similar to a given input document:

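<code class="python">import numpy as np

# Index of the input document (input_doc must be one of the strings in documents)
input_idx = documents.index(input_doc)

# Convert to a dense array so the diagonal can be modified in place
similarity = pairwise_similarity.toarray()

# Mask out each document's similarity to itself
np.fill_diagonal(similarity, np.nan)

# Index of the document most similar to the input, ignoring NaN entries
result_idx = np.nanargmax(similarity[input_idx])

# Get the most similar document
similar_doc = documents[result_idx]</code>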
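
The lookup can also be wrapped in a small helper. This is a sketch under stated assumptions: `most_similar` is a hypothetical name, the sample sentences are made up, and the dense conversion is only reasonable for small corpora.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar(documents, input_idx):
    """Return the index of the document most similar to documents[input_idx]."""
    tfidf = TfidfVectorizer().fit_transform(documents)
    # Dense copy of the pairwise cosine similarity matrix
    similarity = (tfidf @ tfidf.T).toarray()
    # Exclude self-matches before taking the maximum
    np.fill_diagonal(similarity, np.nan)
    return int(np.nanargmax(similarity[input_idx]))


# Hypothetical sample documents
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
best = most_similar(documents, 0)
print(documents[best])
```

Filling the diagonal on a named array matters here: `toarray()` returns a new array each time, so masking an unnamed temporary would leave the values used for the lookup unchanged.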

Other Methods

Gensim offers additional options for text similarity tasks, including its own TF-IDF model and similarity indexes. Another resource to explore is [this Stack Overflow question](https://stackoverflow.com/questions/52757816/how-to-find-text-similarity-between-two-documents).
