
Cosine Similarity

Submitted by Style Pass
2024-05-11 11:00:03

When dealing with a large amount of text, it is essential to have tools that can help computers recognize and evaluate the similarity between documents. One of the most effective methods in this field is cosine similarity.

Cosine similarity is a technique that measures the proximity between two documents by transforming words into vectors within a vector space. This approach allows the semantics of human language to be expressed in a format that machines can easily process.

The idea behind this concept is that words can be represented as vectors, where each dimension corresponds to a feature of the text, such as the frequency of a word or its contextual relevance. Calculating the cosine similarity between two vectors then becomes a way to measure how similar documents are in content and context, disregarding their length and focusing the analysis on the structure of the vectors representing them.

Although more complex methods exist for analyzing text similarity, such as neural networks or advanced clustering algorithms, cosine similarity offers an ideal balance between simplicity and effectiveness for analyzing moderate-sized documents.
It is particularly valuable in applications such as recommendation systems, automatic text classification, and semantic search, where quickly understanding the relationship between different documents is crucial. Below, with a simple example, we will see how it is possible to determine the similarity between various documents, starting from the definition of cosine similarity.

Formula

The cosine similarity between two vectors is calculated using the following formula:

\[ \text{cosine similarity} \ (V_x, V_y) = \frac{\sum_{i=1}^{n} V_{x_i} \cdot V_{y_i}}{\sqrt{\sum_{i=1}^{n} (V_{x_i})^2} \times \sqrt{\sum_{i=1}^{n} (V_{y_i})^2}} \]

or in compact form:

\[ \text{cosine similarity} \ (V_x, V_y) = \frac{V_x \cdot V_y}{||V_x|| \ ||V_y||} \]

where:

\( V_x \cdot V_y \) is the dot product of the vectors \( V_x \) and \( V_y \).
\( ||V_x|| \) and \( ||V_y|| \) are the norms (lengths) of the vectors \( V_x \) and \( V_y \).

In general, cosine similarity ranges from -1 to 1; for the non-negative word-frequency vectors used here, it ranges from 0 to 1. A value close to 1 means that the angle between the two vectors is minimal and therefore the vectors are very similar. Conversely, a value close to 0 means that the angle between the two vectors approaches \( \frac{\pi}{2} \), and therefore the two vectors have a low degree of similarity.
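The formula above translates directly into a few lines of code. Here is a minimal sketch in Python using only the standard library (the function name `cosine_similarity` is ours, not part of any package):

```python
import math

def cosine_similarity(vx, vy):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vx, vy))
    norm_x = math.sqrt(sum(a * a for a in vx))
    norm_y = math.sqrt(sum(b * b for b in vy))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention: similarity with a zero vector is 0
    return dot / (norm_x * norm_y)

# Parallel vectors point in the same direction, orthogonal ones share nothing.
print(round(cosine_similarity([1, 2], [2, 4]), 6))  # 1.0
print(cosine_similarity([1, 0], [0, 1]))            # 0.0
```

Note that the result depends only on the direction of the vectors, not on their magnitude: `[1, 2]` and `[2, 4]` yield a similarity of 1, which is exactly the length-independence property described above.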

Example

Let’s consider a scenario in which we aim to evaluate the similarity between various documents. For clarity, we will explore a straightforward example involving three brief sentences from which we seek to determine their respective degrees of similarity:

\(x\) = I am fond of reading thriller novels.
\(y\) = I prefer reading thriller novels.
\(z\) = Yesterday, I arrived late.

Upon a closer look, it’s clear that sentences \(x\) and \(y\) are similar, while sentence \(z\) is unrelated.

The initial step in our analysis involves transforming the sentences into vectors, extracting all the words and computing their frequencies within the sentences. Subsequently, we refine the data by eliminating words that contribute little to no meaningful information, such as the preposition of, the pronoun I, and the verb to be. This process is crucial, particularly in large corpora, to ensure that the dataset is qualitatively significant and focuses on the most impactful elements of our analysis. Here is the result:

Word        x   y   z
fond        1   0   0
reading     1   1   0
thriller    1   1   0
novels      1   1   0
prefer      0   1   0
yesterday   0   0   1
arrived     0   0   1
late        0   0   1
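The whole pipeline, tokenizing the sentences, dropping the uninformative words, building the frequency vectors, and comparing them, can be sketched as follows. This is an illustration under our own assumptions: the helper names and the tiny stop-word list (just the words singled out in the text) are not from any library.

```python
import math
import string

# Stop words: only the uninformative words named in the text above.
STOP_WORDS = {"i", "am", "of"}

def tokenize(sentence):
    """Lowercase, strip punctuation, split, and drop stop words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def to_vector(tokens, vocabulary):
    """Word-frequency vector of the tokens over a fixed vocabulary."""
    return [tokens.count(word) for word in vocabulary]

def cosine_similarity(vx, vy):
    dot = sum(a * b for a, b in zip(vx, vy))
    norm_x = math.sqrt(sum(a * a for a in vx))
    norm_y = math.sqrt(sum(b * b for b in vy))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

x = tokenize("I am fond of reading thriller novels.")
y = tokenize("I prefer reading thriller novels.")
z = tokenize("Yesterday, I arrived late.")

vocab = sorted(set(x) | set(y) | set(z))
vx, vy, vz = (to_vector(t, vocab) for t in (x, y, z))

print(cosine_similarity(vx, vy))  # x and y share 3 of their 4 words -> 0.75
print(cosine_similarity(vx, vz))  # no words in common -> 0.0
```

As expected, \(x\) and \(y\) come out strongly related (they share reading, thriller, and novels), while \(z\) is orthogonal to both.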
