Skip to content

Added multidimensional scaling to project distance separated items to a k dimensional space#20

Open
kanishkamisra wants to merge 7 commits intojuliasilge:mainfrom
kanishkamisra:master
Open

Added multidimensional scaling to project distance separated items to a k dimensional space#20
kanishkamisra wants to merge 7 commits intojuliasilge:mainfrom
kanishkamisra:master

Conversation

@kanishkamisra
Copy link
Contributor

@kanishkamisra kanishkamisra commented Jun 7, 2018

Since widyr has two distance based functions, namely pairwise_delta and pairwise_dist, I was wondering if we can take it a step further and add functionality for the data analyst/scientist to scale these pairwise distances to k dimensions. This is done in base R using cmdscale that takes a distance matrix and returns a matrix with each item as a row and each of the k dimensions as columns. The points are calculated such that the distance is maintained.

Example:

euro_mat <- as.matrix(eurodist)[1:5, 1:5]

euro_mat
#>           Athens Barcelona Brussels Calais Cherbourg
#> Athens         0      3313     2963   3175      3339
#> Barcelona   3313         0     1318   1326      1294
#> Brussels    2963      1318        0    204       583
#> Calais      3175      1326      204      0       460
#> Cherbourg   3339      1294      583    460         0

cmdscale(euro_mat)
#>                [,1]       [,2]
#> Athens    2515.1944   14.68566
#> Barcelona -659.3761  961.30504
#> Brussels  -422.4595 -328.35478
#> Calais    -633.6209 -359.57342
#> Cherbourg -799.7379 -288.06249

In the tidy, widy way, I figured the following way to do this when we have a distance based result from a widyr method:

library(tidytext)
library(janeaustenr)
library(widyr)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

austen_dist <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  pairwise_dist(book, word, n)

## From utils.R in widyr package (private method), didnt want to use :::
col_name <- function(x, default = stop("Please supply column name", call. = FALSE))
{
  if (is.character(x))
    return(x)
  if (identical(x, quote(expr = )))
    return(default)
  if (is.name(x))
    return(as.character(x))
  if (is.null(x))
    return(x)
  stop("Invalid column specification", call. = FALSE)
}

multi_scale <- function(tbl, item1, item2, value, k = 2, ...) {
  multi_scale_(tbl,
               col_name(substitute(item1)),
               col_name(substitute(item2)),
               col_name(substitute(value)),
               k = 2, ...)
}


multi_scale_ <- function(tbl, item1, item2, value, k = 2) {
  tbl_matrix <- tbl %>%
    spread(item2, col_name(value), fill = 0) %>%
    as.data.frame() %>%
    remove_rownames() %>%
    column_to_rownames("item1") %>%
    as.matrix()

  cmdscale(tbl_matrix, k = k) %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    as.tibble()
}

austen_dist %>%
  multi_scale(item1, item2, distance)
#> # A tibble: 6 x 3
#>   item                     V1      V2
#>   <chr>                 <dbl>   <dbl>
#> 1 Sense & Sensibility   211.    302. 
#> 2 Pride & Prejudice      36.7   -64.8
#> 3 Mansfield Park      -3942.   1325. 
#> 4 Emma                -3513.  -1525. 
#> 5 Northanger Abbey     3924.   -115. 
#> 6 Persuasion           3284.     77.5

This can be used for something as shown below, where we compare the distances between the austen books separated by euclidean distance:

austen_dist %>%
  multi_scale(item1, item2, distance) %>%
  ggplot(aes(V1, V2, color = item)) +
  geom_point(size = 3, alpha = 0.8) + 
  scale_y_continuous(limits = c(-4000, 4000)) + 
  theme_bw()

multi_scale

I've also solved #13 by adding a fill_value argument in widely (previously used fill = 0 for all pairwise functions making correlations take 0 values for missing data instead of NAs) and in case of pairwise_cor, it'll be NA and 0 otherwise.

@kanishkamisra
Copy link
Contributor Author

Learning a lot about Travis CI from this commit!

@codecov-io
Copy link

Codecov Report

Merging #20 into master will decrease coverage by 5.29%.
The diff coverage is 37.5%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master     #20     +/-   ##
========================================
- Coverage   65.99%   60.7%   -5.3%     
========================================
  Files          11      12      +1     
  Lines         247     257     +10     
========================================
- Hits          163     156      -7     
- Misses         84     101     +17
Impacted Files Coverage Δ
R/multi_scale.R 0% <0%> (ø)
R/pairwise_cor.R 100% <100%> (ø) ⬆️
R/widely.R 98.18% <100%> (-0.04%) ⬇️
R/utils.R 66.66% <0%> (-22.23%) ⬇️
R/pairwise_count.R 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2a659e...ca14818. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants