How to compute similarity between vectors? #25

@koheiw

I noticed that the library computes the similarity between vectors in an unusual way. The scores are not really cosine similarities but scaled cross-products: the square root of the dot product divided by the vector size. @jwijffels, do you think this is intentional?

float ret = 0.0f;
for (uint16_t i = 0; i < m_vectorSize; ++i) {
    ret += _what[i] * _with[i];
}
if (ret > 0.0f) {
    return std::sqrt(ret / m_vectorSize);
}
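
For comparison, textbook cosine similarity divides the dot product by the product of the two vector norms, so it does not depend on vector length. The base-R sketch below contrasts the two. The function names `cosine_sim` and `library_score` are hypothetical, `library_score` only mirrors the C++ snippet above, and returning `NA` for non-positive sums is an assumption because the snippet does not show that branch.

```r
# Textbook cosine similarity: dot product divided by the product of the norms.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Mirrors the C++ snippet above: square root of the dot product divided by
# the vector size. Returning NA for non-positive sums is an assumption,
# because the snippet does not show what happens in that case.
library_score <- function(a, b) {
  ret <- sum(a * b)
  if (ret > 0) sqrt(ret / length(a)) else NA_real_
}

a <- c(1, 2, 3)
b <- c(2, 4, 6)  # same direction as a, twice the length

cosine_sim(a, b)     # 1, because cosine ignores vector length
library_score(a, b)  # sqrt(28 / 3), so the score depends on vector length
```

The two measures can only agree when the vectors are scaled in a particular way, which may be what the library relies on.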

library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.4.2
library(word2vec)

data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

model <- word2vec::word2vec(x = x, dim = 15, iter = 20)
emb <- as.matrix(model)

pred <- predict(model, emb["bus",], type = "nearest", top_n = 10)
pred
#>        term similarity rank
#> 1      tram  0.9866303    1
#> 2      voet  0.9863607    2
#> 3        10  0.9838170    3
#> 4        15  0.9835117    4
#> 5       min  0.9818841    5
#> 6     lopen  0.9809218    6
#> 7     vanaf  0.9808209    7
#> 8  parkeren  0.9791722    8
#> 9        20  0.9773416    9
#> 10     auto  0.9745490   10

# similarity in the library
cross <- rowSums(sqrt(crossprod(t(emb), emb["bus",]) / ncol(emb)))
#> Warning in sqrt(crossprod(t(emb), emb["bus", ])/ncol(emb)): NaNs produced
head(sort(cross, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000001 0.9866303 0.9863608 0.9838170 0.9835117 0.9818842

# cosine similarity 
cosine <- Matrix::rowSums(proxyC::simil(emb, emb["bus",,drop = FALSE]))
head(sort(cosine, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000000 0.9734393 0.9729074 0.9678959 0.9672952 0.9640965

# they are very similar but not the same
cor(cross, cosine, use = "pair")
#> [1] 0.9825781
cor(cross, cosine, use = "pair", method = "spearman")    
#> [1] 1
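
One possible reading of the rank correlation of 1 is that the library score is the square root of the cosine similarity. That would follow if every embedding row has norm `sqrt(ncol(emb))`, which the self-similarity of 1.0000001 for "bus" hints at. This is an assumption and has not been confirmed against the library source. A quick check on the values printed above, with hypothetical variable names:

```r
# Values copied from the printed output above (top five neighbours of "bus").
score_pkg <- c(tram = 0.9866303, voet = 0.9863608, `10` = 0.9838170,
               `15` = 0.9835117, min = 0.9818842)
score_cos <- c(tram = 0.9734393, voet = 0.9729074, `10` = 0.9678959,
               `15` = 0.9672952, min = 0.9640965)

# If the library score is sqrt(cosine), squaring it should recover the cosine
# up to the rounding of the printed digits.
max_diff <- max(abs(score_pkg^2 - score_cos))
max_diff < 1e-6
```

If this reading holds, it would also account for the `NaNs produced` warning: the square root is undefined wherever the dot product is negative.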

See how cosine similarity is computed: https://koheiw.github.io/proxyC/articles/measures.html#similarity-measures
