How to compute similarity between vectors? #25

I noticed that the library computes the similarity between vectors in an unexpected way. The scores are not cosine similarities but square roots of dot products (cross-products) divided by the number of dimensions. @jwijffels, do you think this is intentional?

https://github.com/bnosac/word2vec/blob/96b0e04c743920c9366a302d0785f505ec5ca08c/src/word2vec/include/word2vec.hpp#L201-L207

``` r
library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.4.2
library(word2vec)

data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

model <- word2vec::word2vec(x = x, dim = 15, iter = 20)
emb <- as.matrix(model)

pred <- predict(model, emb["bus",], type = "nearest", top_n = 10)
pred
#>        term similarity rank
#> 1      tram  0.9866303    1
#> 2      voet  0.9863607    2
#> 3        10  0.9838170    3
#> 4        15  0.9835117    4
#> 5       min  0.9818841    5
#> 6     lopen  0.9809218    6
#> 7     vanaf  0.9808209    7
#> 8  parkeren  0.9791722    8
#> 9        20  0.9773416    9
#> 10     auto  0.9745490   10

# similarity as computed in the library: sqrt(dot product / number of dimensions)
cross <- rowSums(sqrt(crossprod(t(emb), emb["bus",]) / ncol(emb)))
#> Warning in sqrt(crossprod(t(emb), emb["bus", ])/ncol(emb)): NaNs produced
head(sort(cross, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000001 0.9866303 0.9863608 0.9838170 0.9835117 0.9818842

# cosine similarity 
cosine <- Matrix::rowSums(proxyC::simil(emb, emb["bus",,drop = FALSE]))
head(sort(cosine, decreasing = TRUE))
#>       bus      tram      voet        10        15       min 
#> 1.0000000 0.9734393 0.9729074 0.9678959 0.9672952 0.9640965

# the two scores rank terms identically but are not equal
cor(cross, cosine, use = "pair")
#> [1] 0.9825781
cor(cross, cosine, use = "pair", method = "spearman")    
#> [1] 1
```
For how proxyC computes cosine similarity, see: https://koheiw.github.io/proxyC/articles/measures.html#similarity-measures
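To make the difference concrete, the two quantities can be written side by side. This is a sketch based on the linked snippet, with $n$ standing for `m_vectorSize`; the symbol $s$ is only a label used here, not a name from the library.

$$
\cos(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\lVert a \rVert \, \lVert b \rVert},
\qquad
s(a, b) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i b_i} \quad \text{if } \sum_{i=1}^{n} a_i b_i > 0
$$

If every row of `emb` satisfies $\lVert a \rVert^2 = n$, then $s(a, b) = \sqrt{\cos(a, b)}$. The output above is consistent with that: $0.9866303^2 \approx 0.9734393$ for "tram", "bus" scores 1 against itself, and a monotone transform would explain the Spearman correlation of 1. Whether the embeddings are scaled this way on purpose is an assumption here, not something confirmed from the source.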


The code at the link above (`word2vec.hpp`, lines 201-207):

	float ret = 0.0f;
	for (uint16_t i = 0; i < m_vectorSize; ++i) {
		ret += _what[i] * _with[i];
	}
	if (ret > 0.0f) {
		return std::sqrt(ret / m_vectorSize);
	}
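A minimal standalone sketch of the two measures may help when comparing them outside the library. The function names (`dot_product`, `scaled_dot_score`, `cosine_similarity`, `scale_to_rms_one`) are hypothetical and not part of word2vec. The quoted snippet ends before its fallback branch, so returning `0.0f` for a non-positive sum is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain dot product of two equally sized vectors.
float dot_product(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Score in the style of the quoted snippet: sqrt(dot / n) when dot > 0.
// ASSUMPTION: 0.0f is returned otherwise; the snippet does not show that branch.
float scaled_dot_score(const std::vector<float>& a, const std::vector<float>& b) {
    const float ret = dot_product(a, b);
    if (ret > 0.0f) {
        return std::sqrt(ret / static_cast<float>(a.size()));
    }
    return 0.0f;
}

// Textbook cosine similarity: dot / (|a| * |b|). Yields NaN for zero vectors.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    return dot_product(a, b) /
           (std::sqrt(dot_product(a, a)) * std::sqrt(dot_product(b, b)));
}

// Rescale v so that its squared norm equals its length (root mean square of 1).
// ASSUMPTION: this mimics how the embeddings appear to be scaled in the output above.
std::vector<float> scale_to_rms_one(std::vector<float> v) {
    const float factor =
        std::sqrt(static_cast<float>(v.size()) / dot_product(v, v));
    for (float& x : v) {
        x *= factor;
    }
    return v;
}
```

For vectors passed through `scale_to_rms_one`, `scaled_dot_score(a, b)` should equal `std::sqrt(cosine_similarity(a, b))` up to floating-point error whenever the dot product is positive, so the ranking is preserved while the values differ.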
