I noticed that the library computes the similarity between vectors in an unusual way. The scores are not cosine similarity: the code takes the dot product of the two vectors, divides it by the vector size, and returns the square root. @jwijffels, do you think this is intentional?
word2vec/src/word2vec/include/word2vec.hpp
Lines 201 to 207 in 96b0e04
    float ret = 0.0f;
    for (uint16_t i = 0; i < m_vectorSize; ++i) {
        ret += _what[i] * _with[i];
    }
    if (ret > 0.0f) {
        return std::sqrt(ret / m_vectorSize);
    }
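
To make the difference concrete, here is a minimal base R sketch that mirrors the quoted lines next to a textbook cosine similarity. The names `lib_similarity` and `cosine_similarity` are hypothetical helpers written for this illustration, not functions of the package, and the `NA` returned for non-positive dot products is an assumption, because the quoted snippet ends before that branch.

```r
# Hypothetical helper mirroring the quoted C++ lines; not part of the package API.
lib_similarity <- function(what, with) {
  stopifnot(length(what) == length(with))
  ret <- sum(what * with)  # plain dot product, no division by the vector norms
  if (ret > 0) {
    return(sqrt(ret / length(what)))
  }
  NA_real_  # assumption: the quoted snippet does not show this branch
}

# Textbook cosine similarity for comparison.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(1, 2, 3)
b <- c(2, 4, 6)  # same direction as a, twice the length

lib_similarity(a, b)     # sqrt(28 / 3): changes with the length of the vectors
cosine_similarity(a, b)  # equal to 1 up to floating point: depends on direction only
```

For arbitrary vectors the two measures can differ a lot; they only stay close when the vectors have been scaled beforehand.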
library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.4.2
library(word2vec)
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)
model <- word2vec::word2vec(x = x, dim = 15, iter = 20)
emb <- as.matrix(model)
pred <- predict(model, emb["bus",], type = "nearest", top_n = 10)
pred
#> term similarity rank
#> 1 tram 0.9866303 1
#> 2 voet 0.9863607 2
#> 3 10 0.9838170 3
#> 4 15 0.9835117 4
#> 5 min 0.9818841 5
#> 6 lopen 0.9809218 6
#> 7 vanaf 0.9808209 7
#> 8 parkeren 0.9791722 8
#> 9 20 0.9773416 9
#> 10 auto 0.9745490 10
# similarity in the library
cross <- rowSums(sqrt(crossprod(t(emb), emb["bus",]) / ncol(emb)))
#> Warning in sqrt(crossprod(t(emb), emb["bus", ])/ncol(emb)): NaNs produced
head(sort(cross, decreasing = TRUE))
#> bus tram voet 10 15 min
#> 1.0000001 0.9866303 0.9863608 0.9838170 0.9835117 0.9818842
# cosine similarity
cosine <- Matrix::rowSums(proxyC::simil(emb, emb["bus",,drop = FALSE]))
head(sort(cosine, decreasing = TRUE))
#> bus tram voet 10 15 min
#> 1.0000000 0.9734393 0.9729074 0.9678959 0.9672952 0.9640965
# they are very similar but not the same
cor(cross, cosine, use = "pair")
#> [1] 0.9825781
cor(cross, cosine, use = "pair", method = "spearman")
#> [1] 1

See how cosine similarity is computed: https://koheiw.github.io/proxyC/articles/measures.html#similarity-measures
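
One possible reading of the numbers above is that the library score is the square root of the cosine similarity. That would follow if the stored embeddings are scaled so that the sum of squares of each row equals the dimension, which the self-similarity of 1.0000001 for "bus" suggests, but this is an assumption that has not been checked against the source. The sketch below only reuses the values printed above.

```r
# Values copied from the printed output above (not recomputed from the model).
library_score <- c(tram = 0.9866303, voet = 0.9863608, "10" = 0.9838170,
                   "15" = 0.9835117, min = 0.9818842)
cosine_score  <- c(tram = 0.9734393, voet = 0.9729074, "10" = 0.9678959,
                   "15" = 0.9672952, min = 0.9640965)

# Largest gap between sqrt(cosine) and the library score for these five terms.
max_gap <- max(abs(sqrt(cosine_score) - library_score))
max_gap
```

If the gap is at the level of the printing precision, the library score is a monotone transformation of cosine for these terms, which would also explain a Spearman correlation of 1 together with a Pearson correlation below 1.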