Description
widyr::pairwise_count counts pairwise co-occurrences of items within features. For example:
library(dplyr)
library(widyr)
tbl <- tibble(item = c('a', 'b', 'a', 'c', 'a', 'c', 'b', 'd'),
              feature = rep(1:4, each = 2))
pairwise_count(tbl, item, feature)
#> # A tibble: 6 x 3
#>   item1 item2     n
#>   <chr> <chr> <dbl>
#> 1 b     a         1
#> 2 c     a         2
#> 3 a     b         1
#> 4 d     b         1
#> 5 a     c         2
#> 6 b     d         1

In many applications we want to normalise these counts to measure items' pairwise similarities (see van Eck and Waltman, 2009). Some examples of such normalisation include the Dice, Ochiai and overlap coefficients, and the Jaccard index. These normalisations can be computed "the long way" as follows:
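tbl %>%
  # diag = TRUE keeps the (item, item) rows, whose n is each item's own count
  pairwise_count(item, feature, diag = TRUE) %>%
  # n1, n2: number of features containing item1 and item2 respectively
  group_by(item1) %>%
  mutate(n1 = sum(n * (item1 == item2))) %>%
  group_by(item2) %>%
  mutate(n2 = sum(n * (item1 == item2))) %>%
  ungroup() %>%
  mutate(dice = 2 * n / (n1 + n2),
         ochiai = n / sqrt(n1 * n2),
         overlap = n / pmin(n1, n2),
         jaccard = n / (n1 + n2 - n)) %>%
  mutate_if(is.double, round, 2) %>%
  # drop the diagonal rows now that n1 and n2 have been extracted
  filter(item1 != item2)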
#> # A tibble: 6 x 9
#>   item1 item2     n    n1    n2  dice ochiai overlap jaccard
#>   <chr> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>   <dbl>
#> 1 b     a         1     2     3  0.4    0.41     0.5    0.25
#> 2 c     a         2     2     3  0.8    0.82     1      0.67
#> 3 a     b         1     3     2  0.4    0.41     0.5    0.25
#> 4 d     b         1     1     2  0.67   0.71     1      0.5
#> 5 a     c         2     3     2  0.8    0.82     1      0.67
#> 6 b     d         1     2     1  0.67   0.71     1      0.5

A shorter, cleaner approach would be to add a method argument to widyr::pairwise_similarity. That function currently computes cosine similarities (which are equivalent to Ochiai coefficients when value is a vector of ones), but it could be extended to compute other similarity measures such as those listed above.
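The equivalence between cosine similarity and the Ochiai coefficient can be checked directly. The notation here is introduced only for illustration: let $x_{if} \in \{0, 1\}$ indicate whether item $i$ occurs in feature $f$, let $n_{ij}$ be the number of features containing both $i$ and $j$, and let $n_i$ be the number of features containing $i$. Since $x_{if}^2 = x_{if}$ for binary values:

```latex
\cos(x_i, x_j)
  = \frac{\sum_f x_{if}\, x_{jf}}{\sqrt{\sum_f x_{if}^2}\,\sqrt{\sum_f x_{jf}^2}}
  = \frac{n_{ij}}{\sqrt{n_i\, n_j}}
  = \mathrm{Ochiai}(i, j)
```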
I think the only disruptive change would be making widyr::pairwise_similarity's value argument optional, or at least giving it a default equal to a vector of ones.
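To make the proposal concrete, here is a minimal sketch of what a method switch might look like. Both `similarity_from_counts` and `normalise_counts` are hypothetical names, not part of widyr, and the sketch does not touch `pairwise_similarity` itself. It assumes a data frame shaped like the output of `pairwise_count(..., diag = TRUE)`, with columns `item1`, `item2` and `n`. The example counts are copied from the tables above so that the block runs with base R alone.

```r
# Hypothetical helpers for illustration only; neither exists in widyr.

# Turn co-occurrence counts into a similarity coefficient.
#   n      : number of features containing both items
#   n1, n2 : number of features containing item1 and item2 respectively
similarity_from_counts <- function(n, n1, n2,
                                   method = c("ochiai", "dice", "overlap", "jaccard")) {
  method <- match.arg(method)
  switch(method,
    ochiai  = n / sqrt(n1 * n2),
    dice    = 2 * n / (n1 + n2),
    overlap = n / pmin(n1, n2),
    jaccard = n / (n1 + n2 - n)
  )
}

# Add a `similarity` column to pairwise counts that include the diagonal,
# then drop the diagonal rows.
normalise_counts <- function(counts, method = "ochiai") {
  on_diag <- counts$item1 == counts$item2
  # Per-item totals are read off the diagonal rows.
  totals <- setNames(counts$n[on_diag], counts$item1[on_diag])
  n1 <- unname(totals[as.character(counts$item1)])
  n2 <- unname(totals[as.character(counts$item2)])
  counts$similarity <- similarity_from_counts(counts$n, n1, n2, method)
  counts[!on_diag, , drop = FALSE]
}

# Counts from the example above: diagonal rows first (a = 3, b = 2, c = 2,
# d = 1), followed by the six off-diagonal pairs.
counts <- data.frame(
  item1 = c("a", "b", "c", "d", "b", "c", "a", "d", "a", "b"),
  item2 = c("a", "b", "c", "d", "a", "a", "b", "b", "c", "d"),
  n     = c(3, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  stringsAsFactors = FALSE
)

result <- normalise_counts(counts, method = "jaccard")
print(result)
```

With widyr loaded, the same helper could presumably be dropped into a pipeline as `tbl %>% pairwise_count(item, feature, diag = TRUE) %>% normalise_counts("jaccard")`. Keeping the formulas in a single function that switches on `method` means a further measure can be supported by adding one line.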