Calculating Cosine Similarity of Sentence Strings without External Libraries
Cosine similarity between two text strings can be computed in Python using only the standard library. Each string is turned into a vector of word counts, and the standard cosine similarity formula is applied to the two vectors:
cos(θ) = (A · B) / (||A|| · ||B||)
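Where A and B are the word-count vectors of the two strings, A · B is their dot product, and ||A|| and ||B|| are their Euclidean norms (lengths).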
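As a quick check of the formula, the following minimal sketch applies it to two small count vectors written as plain lists. The vectors `a` and `b`, the three-word vocabulary, and the helper name `cosine` are illustrative assumptions and are not part of the implementation below.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))     # A · B
    norm_a = math.sqrt(sum(x * x for x in a))  # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))  # ||B||
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical word counts over the vocabulary ["foo", "bar", "baz"]
a = [1, 1, 0]
b = [1, 0, 1]
print(cosine(a, b))  # close to 0.5: one shared word, each norm is sqrt(2)
```

The two vectors share one word out of two each, so the dot product is 1 and the product of the norms is 2, giving a similarity of about 0.5.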
Implementation
The following Python code provides a practical implementation of this formula:
<code class="python">import math
import re
from collections import Counter

# Matches runs of word characters; punctuation is ignored.
WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    # Only words present in both vectors contribute to the dot product.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # Squared Euclidean norm of each vector.
    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # Avoid division by zero when either text has no words.
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    # Count how often each word occurs in the text.
    words = WORD.findall(text)
    return Counter(words)</code>
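To see what these vectors look like, the short sketch below repeats the tokenizer and inspects the counts for one made-up sentence. It illustrates two properties of this approach: punctuation is dropped because `\w+` matches only word characters, and matching is case-sensitive, so "This" and "this" are counted as different words.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    return Counter(WORD.findall(text))


# Made-up sentence, chosen to show repeated words, mixed case, and punctuation
vec = text_to_vector("This is this , a foo foo .")

print(vec["foo"])                # 2
print(vec["This"], vec["this"])  # 1 1
print("." in vec)                # False
```

If case should be ignored, lowercasing the text before tokenizing (for example with `text.lower()`) is one option.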
To use this code, convert the sentence strings into vectors using the text_to_vector function and then calculate the cosine similarity using the get_cosine function:
<code class="python">text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)
print("Cosine:", cosine)</code>
This prints the cosine similarity of the two sentences, approximately 0.8616 for this pair. Note that this implementation uses raw word counts: tf-idf weighting is not included, but can be added if a suitable corpus is available.
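As a rough illustration of the tf-idf idea mentioned above, the sketch below reweights the count vectors with inverse document frequencies computed from a tiny made-up corpus. The corpus, the helper names `build_idf` and `tfidf_vector`, and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for this example. The formula is one common variant among several, and none of this is part of the implementation above.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    return Counter(WORD.findall(text))


def build_idf(corpus):
    """Map each word in the corpus to a smoothed inverse document frequency."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each word at most once per document.
        doc_freq.update(set(WORD.findall(doc)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tfidf_vector(text, idf, default_idf):
    """Scale raw counts by idf; words not seen in the corpus get default_idf."""
    counts = text_to_vector(text)
    return {w: c * idf.get(w, default_idf) for w, c in counts.items()}


def get_cosine(vec1, vec2):
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


# Made-up corpus, for illustration only
corpus = [
    "This is a foo bar sentence .",
    "This sentence is similar to a foo bar sentence .",
    "This is an unrelated sentence about something else .",
]
idf = build_idf(corpus)
unseen = math.log(1 + len(corpus)) + 1  # idf for a word with document frequency 0

v1 = tfidf_vector(corpus[0], idf, unseen)
v2 = tfidf_vector(corpus[1], idf, unseen)
print("tf-idf cosine:", get_cosine(v1, v2))
```

With this weighting, words that appear in every document (such as "This") contribute less than rarer words (such as "similar"), so the score depends more on distinctive vocabulary. A realistic corpus would need to be much larger than three sentences for the weights to be meaningful.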