
Feature request: more pairwise similarity measures #31

@bldavies


widyr::pairwise_count counts pairwise co-occurrences of items within features. For example:

library(dplyr)
library(widyr)

tbl <- tibble(item = c('a', 'b', 'a', 'c', 'a', 'c', 'b', 'd'),
              feature = rep(1:4, each = 2))

pairwise_count(tbl, item, feature)
#> # A tibble: 6 x 3
#>   item1 item2     n
#>   <chr> <chr> <dbl>
#> 1 b     a         1
#> 2 c     a         2
#> 3 a     b         1
#> 4 d     b         1
#> 5 a     c         2
#> 6 b     d         1

In many applications we want to normalise these counts to measure items' pairwise similarities (see van Eck and Waltman, 2009). Some examples of such normalisation include the Dice, Ochiai and overlap coefficients, and the Jaccard index. These normalisations can be computed "the long way" as follows:

tbl %>%
  # diag = TRUE keeps each item's self-pair, whose n is the number of features containing that item
  pairwise_count(item, feature, diag = TRUE) %>%
  group_by(item1) %>%
  mutate(n1 = sum(n * (item1 == item2))) %>%  # feature count for item1
  group_by(item2) %>%
  mutate(n2 = sum(n * (item1 == item2))) %>%  # feature count for item2
  ungroup() %>%
  mutate(dice = 2 * n / (n1 + n2),
         ochiai = n / sqrt(n1 * n2),
         overlap = n / pmin(n1, n2),
         jaccard = n / (n1 + n2 - n)) %>%
  mutate_if(is.double, round, 2) %>%
  filter(item1 != item2)  # drop the self-pairs
#> # A tibble: 6 x 9
#>   item1 item2     n    n1    n2  dice ochiai overlap jaccard
#>   <chr> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>   <dbl>
#> 1 b     a         1     2     3  0.4    0.41     0.5    0.25
#> 2 c     a         2     2     3  0.8    0.82     1      0.67
#> 3 a     b         1     3     2  0.4    0.41     0.5    0.25
#> 4 d     b         1     1     2  0.67   0.71     1      0.5 
#> 5 a     c         2     3     2  0.8    0.82     1      0.67
#> 6 b     d         1     2     1  0.67   0.71     1      0.5

A shorter, cleaner way could be to add a method argument to widyr::pairwise_similarity. It currently computes cosine similarities, which are equivalent to Ochiai coefficients when value is a vector of ones, but it could be extended to compute other similarity measures such as those listed above.
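To illustrate the cosine/Ochiai equivalence, here is a minimal sketch that assumes each (item, feature) pair appears at most once and that a constant column of ones (named value here purely for illustration) is passed as the value argument. The resulting similarities should agree with the ochiai column above, before rounding.

```r
library(dplyr)
library(widyr)

tbl <- tibble(item = c('a', 'b', 'a', 'c', 'a', 'c', 'b', 'd'),
              feature = rep(1:4, each = 2))

# Give every (item, feature) pair a weight of one, then compute
# cosine similarities between items across features
cosine <- tbl %>%
  mutate(value = 1) %>%
  pairwise_similarity(item, feature, value)

cosine
```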

I think the only disruptive change would be making widyr::pairwise_similarity's value argument optional, or at least giving it a default equal to a vector of ones.
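As a rough sketch of what a method argument might look like from the user's side, the helper below post-processes the output of pairwise_count(..., diag = TRUE). The name normalise_counts and its arguments are hypothetical and not part of widyr; it only wraps the "long way" shown earlier and is not a proposal for the actual implementation.

```r
library(dplyr)
library(widyr)

# Hypothetical helper, NOT part of widyr: normalise the output of
# pairwise_count(..., diag = TRUE) using one of four similarity measures.
normalise_counts <- function(counts,
                             method = c("dice", "ochiai", "overlap", "jaccard")) {
  method <- match.arg(method)

  # The self-pairs (item1 == item2) hold each item's number of features
  totals <- counts %>%
    filter(item1 == item2) %>%
    select(item = item1, total = n)

  counts %>%
    filter(item1 != item2) %>%
    # Attach the feature counts of both items in each pair
    inner_join(rename(totals, item1 = item, n1 = total), by = "item1") %>%
    inner_join(rename(totals, item2 = item, n2 = total), by = "item2") %>%
    mutate(similarity = switch(method,
      dice    = 2 * n / (n1 + n2),
      ochiai  = n / sqrt(n1 * n2),
      overlap = n / pmin(n1, n2),
      jaccard = n / (n1 + n2 - n)
    )) %>%
    select(item1, item2, similarity)
}

tbl <- tibble(item = c('a', 'b', 'a', 'c', 'a', 'c', 'b', 'd'),
              feature = rep(1:4, each = 2))

tbl %>%
  pairwise_count(item, feature, diag = TRUE) %>%
  normalise_counts(method = "jaccard")
```

Using match.arg means an unrecognised method fails early with an informative error, and the first option ("dice") acts as the default.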
