Cosine Distance Function
A cosine distance function is a binary vector distance function between two vectors ([math]\displaystyle{ \mathbf{x}, \mathbf{y} }[/math]) based on the angle between them.
- AKA: Cosine Similarity, Normalized Dot Product.
- Context:
- It can be calculated with the use of the Dot Product Function and the Vector Magnitude Function.
- CosDist(x, y) = (x∙y) / (|x| × |y|)
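- It can range from (for vectors with non-negative components; if negative components are allowed, the lower bound is −1, for diametrically opposed vectors):
- 0 (perfectly unrelated: the vectors are orthogonal)
- 1 (perfectly related: the vectors have the same orientation)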
- Example(s):
- CosDist(<1,0,0>, <0,1,0>) = (1×0 + 0×1 + 0×0) / (|<1,0,0>| × |<0,1,0>|) = 0 / (1 × 1) = 0
- CosDist(<1,1,1>, <1,1,1>) = (1×1 + 1×1 + 1×1) / (|<1,1,1>| × |<1,1,1>|) = 3 / (√3 × √3) = 1
- TF-IDF Distance().
- …
- Counter-Example(s):
- a Euclidean Distance Function.
- any Multiset Distance Function, such as the TF-IDF Distance Function.
- See: Vector Space Model, Dot Product, Cosine Function, Inner Product Space.
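The formula in the Context section can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `cosine_similarity` and the decision to raise an error for zero-length vectors (where the ratio is undefined) are assumptions made for this example.

```python
import math

def cosine_similarity(x, y):
    """Return (x . y) / (|x| * |y|) for two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))      # dot product
    norm_x = math.sqrt(sum(a * a for a in x))   # vector magnitude |x|
    norm_y = math.sqrt(sum(b * b for b in y))   # vector magnitude |y|
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat the zero vector as an error, since it has no direction.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# The two worked examples from this page.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # orthogonal vectors
print(cosine_similarity([1, 1, 1], [1, 1, 1]))  # identical vectors
```

The first call returns exactly 0.0. The second returns a value that equals 1 only up to floating-point rounding, so comparisons against 1 are better made with a tolerance (for example `math.isclose`).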
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/cosine_similarity Retrieved:2015-1-9.
- Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].
Note that these bounds apply for any number of dimensions, and Cosine similarity is most commonly used in high-dimensional positive spaces. For example, in Information Retrieval and text mining, each term is notionally assigned a different dimension and a document is characterised by a vector where the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. [1] The technique is also used to measure cohesion within clusters in the field of data mining. [2]
Cosine distance is a term often used for the complement in positive space, that is: [math]\displaystyle{ D_C(A,B) = 1 - S_C(A,B) }[/math]. It is important to note, however, that this is not a proper distance metric as it does not have the triangle inequality property and it violates the coincidence axiom; to repair the triangle inequality property whilst maintaining the same ordering, it is necessary to convert to Angular distance (see below.)
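(Editorial illustration, not part of the quoted text.) The two failures mentioned above can be checked numerically. The sketch below assumes three hypothetical 2-D vectors chosen for this example, and it assumes the angular distance is normalized as arccos(similarity) / π; other normalizations exist.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

def cosine_distance(x, y):
    # Complement of the similarity: D_C = 1 - S_C.
    return 1.0 - cosine_similarity(x, y)

def angular_distance(x, y):
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    s = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return math.acos(s) / math.pi  # assumed normalization: angle / pi

a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

# Triangle inequality fails for cosine distance: going a -> b -> c is "shorter" than a -> c.
print(cosine_distance(a, c), cosine_distance(a, b) + cosine_distance(b, c))

# Angular distance satisfies it (here with equality, since b bisects the angle).
print(angular_distance(a, c), angular_distance(a, b) + angular_distance(b, c))

# Coincidence axiom fails: distinct vectors with the same direction are at distance ~0.
print(cosine_distance((1.0, 2.0), (2.0, 4.0)))
```

For these vectors the direct cosine distance from a to c is 1, while the two-step path sums to roughly 0.586, which is the violation described in the quote.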
One of the reasons for the popularity of Cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered.
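(Editorial illustration, not part of the quoted text.) The sparse-vector point can be sketched with term-count vectors stored as dictionaries, so that only terms present in both documents contribute to the dot product. The function name, the two toy sentences, and the convention of returning 0.0 for an empty document are all assumptions made for this example.

```python
import math
from collections import Counter

def sparse_cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} mappings."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller mapping
    # Only terms with a non-zero weight in both vectors add to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty document
    return dot / (norm_u * norm_v)

# Hypothetical toy documents, represented by raw term counts.
doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the dog sat on the log".split())

print(sparse_cosine_similarity(doc_a, doc_b))  # approximately 0.75
```

Here the shared terms are "the" (2×2), "sat" (1×1) and "on" (1×1), giving a dot product of 6, and each document has squared magnitude 8, so the similarity is 6/8 up to floating-point rounding.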
- ↑ Singhal, Amit (2001). “Modern Information Retrieval: A Brief Overview”. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 35–43.
- ↑ P.-N. Tan, M. Steinbach & V. Kumar, "Introduction to Data Mining", Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.
2006
- (Garcia, 2006) ⇒ E. Garcia. (2006). “Cosine Similarity and Term Weight Tutorial” http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html#Cosim
- QUOTE: Let's go back to the normalized DOT Product (cosine angle). This ratio is also used as a similarity measure between any two vectors representing documents, queries, snippets or combination of these. The expressions cosine similarity, Sim(A, B), or COSIM are commonly used. As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases.