Added multidimensional scaling to project distance separated items to a k dimensional space by kanishkamisra · Pull Request #20 · juliasilge/widyr

kanishkamisra · 2018-06-07T02:21:53Z

Since widyr has two distance based functions, namely pairwise_delta and pairwise_dist, I was wondering if we can take it a step further and add functionality for the data analyst/scientist to scale these pairwise distances to k dimensions. This is done in base R using cmdscale that takes a distance matrix and returns a matrix with each item as a row and each of the k dimensions as columns. The points are calculated such that the distance is maintained.

Example:

euro_mat <- as.matrix(eurodist)[1:5, 1:5]

euro_mat
#>           Athens Barcelona Brussels Calais Cherbourg
#> Athens         0      3313     2963   3175      3339
#> Barcelona   3313         0     1318   1326      1294
#> Brussels    2963      1318        0    204       583
#> Calais      3175      1326      204      0       460
#> Cherbourg   3339      1294      583    460         0

cmdscale(euro_mat)
#>                [,1]       [,2]
#> Athens    2515.1944   14.68566
#> Barcelona -659.3761  961.30504
#> Brussels  -422.4595 -328.35478
#> Calais    -633.6209 -359.57342
#> Cherbourg -799.7379 -288.06249

In the tidy, widy way, I figured the following way to do this when we have a distance based result from a widyr method:

library(tidytext)
library(janeaustenr)
library(widyr)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

austen_dist <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  pairwise_dist(book, word, n)

## From utils.R in widyr package (private method), didnt want to use :::
col_name <- function(x, default = stop("Please supply column name", call. = FALSE))
{
  if (is.character(x))
    return(x)
  if (identical(x, quote(expr = )))
    return(default)
  if (is.name(x))
    return(as.character(x))
  if (is.null(x))
    return(x)
  stop("Invalid column specification", call. = FALSE)
}

multi_scale <- function(tbl, item1, item2, value, k = 2, ...) {
  multi_scale_(tbl,
               col_name(substitute(item1)),
               col_name(substitute(item2)),
               col_name(substitute(value)),
               k = 2, ...)
}


multi_scale_ <- function(tbl, item1, item2, value, k = 2) {
  tbl_matrix <- tbl %>%
    spread(item2, col_name(value), fill = 0) %>%
    as.data.frame() %>%
    remove_rownames() %>%
    column_to_rownames("item1") %>%
    as.matrix()

  cmdscale(tbl_matrix, k = k) %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    as.tibble()
}

austen_dist %>%
  multi_scale(item1, item2, distance)
#> # A tibble: 6 x 3
#>   item                     V1      V2
#>   <chr>                 <dbl>   <dbl>
#> 1 Sense & Sensibility   211.    302. 
#> 2 Pride & Prejudice      36.7   -64.8
#> 3 Mansfield Park      -3942.   1325. 
#> 4 Emma                -3513.  -1525. 
#> 5 Northanger Abbey     3924.   -115. 
#> 6 Persuasion           3284.     77.5

This can be used for something as shown below, where we compare the distances between the austen books separated by euclidean distance:

austen_dist %>%
  multi_scale(item1, item2, distance) %>%
  ggplot(aes(V1, V2, color = item)) +
  geom_point(size = 3, alpha = 0.8) + 
  scale_y_continuous(limits = c(-4000, 4000)) + 
  theme_bw()

I've also solved #13 by adding a fill_value argument in widely (previously used fill = 0 for all pairwise functions making correlations take 0 values for missing data instead of NAs) and in case of pairwise_cor, it'll be NA and 0 otherwise.

…ces to a k dimensional space

kanishkamisra · 2018-06-08T23:41:24Z

Learning a lot about Travis CI from this commit!

codecov-io · 2018-07-31T00:21:34Z

Codecov Report

Merging #20 into master will decrease coverage by 5.29%.
The diff coverage is 37.5%.

@@           Coverage Diff            @@
##           master     #20     +/-   ##
========================================
- Coverage   65.99%   60.7%   -5.3%     
========================================
  Files          11      12      +1     
  Lines         247     257     +10     
========================================
- Hits          163     156      -7     
- Misses         84     101     +17

Impacted Files	Coverage Δ
R/multi_scale.R	`0% <0%> (ø)`
R/pairwise_cor.R	`100% <100%> (ø)`	⬆️
R/widely.R	`98.18% <100%> (-0.04%)`	⬇️
R/utils.R	`66.66% <0%> (-22.23%)`	⬇️
R/pairwise_count.R	`100% <0%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2a659e...ca14818. Read the comment docs.

kanishkamisra added 6 commits March 22, 2018 11:14

Added multi_scale.R to project documents separated by pairwise distan…

d849ac6

…ces to a k dimensional space

Added multidimensional scaling and solved issue juliasilge#13

5a7cb5b

Fixed tibble call to make Travic CI happy

88818cf

Fixed documentation and widely inconsistencies in new fill_value

6955c29

Fixed documentation again!

b265dbc

Fixed inconsistencies in Examples for multi_scale()

ed2ed0a

fixed :: warning from tibble

ca14818

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added multidimensional scaling to project distance separated items to a k dimensional space#20

Added multidimensional scaling to project distance separated items to a k dimensional space#20
kanishkamisra wants to merge 7 commits intojuliasilge:mainfrom
kanishkamisra:master

kanishkamisra commented Jun 7, 2018 •

edited

Loading

Uh oh!

kanishkamisra commented Jun 8, 2018

Uh oh!

codecov-io commented Jul 31, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kanishkamisra commented Jun 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kanishkamisra commented Jun 8, 2018

Uh oh!

codecov-io commented Jul 31, 2018

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kanishkamisra commented Jun 7, 2018 •

edited

Loading