-
Notifications
You must be signed in to change notification settings - Fork 29
Description
Since widyr has two distance based functions, namely pairwise_delta and pairwise_dist, I was wondering if we can take it a step further and add functionality for the data analyst/scientist to scale these pairwise distances to k dimensions. This is done in base R using cmdscale that takes a distance matrix and returns a matrix with each item as a row and each of the k dimensions as columns. The points are calculated such that the distance is maintained.
Example:
euro_mat <- as.matrix(eurodist)[1:5, 1:5]
euro_mat
#> Athens Barcelona Brussels Calais Cherbourg
#> Athens 0 3313 2963 3175 3339
#> Barcelona 3313 0 1318 1326 1294
#> Brussels 2963 1318 0 204 583
#> Calais 3175 1326 204 0 460
#> Cherbourg 3339 1294 583 460 0
cmdscale(euro_mat)
#> [,1] [,2]
#> Athens 2515.1944 14.68566
#> Barcelona -659.3761 961.30504
#> Brussels -422.4595 -328.35478
#> Calais -633.6209 -359.57342
#> Cherbourg -799.7379 -288.06249
In the tidy, widy way, I figured the following way to do this when we have a distance based result from a widyr method:
library(tidytext)
library(janeaustenr)
library(widyr)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
austen_dist <- austen_books() %>%
unnest_tokens(word, text) %>%
count(book, word) %>%
pairwise_dist(book, word, n)
## From utils.R in widyr package (private method), didnt want to use :::
col_name <- function(x, default = stop("Please supply column name", call. = FALSE))
{
if (is.character(x))
return(x)
if (identical(x, quote(expr = )))
return(default)
if (is.name(x))
return(as.character(x))
if (is.null(x))
return(x)
stop("Invalid column specification", call. = FALSE)
}
multi_scale <- function(tbl, item1, item2, value, k = 2, ...) {
multi_scale_(tbl,
col_name(substitute(item1)),
col_name(substitute(item2)),
col_name(substitute(value)),
k = 2, ...)
}
multi_scale_ <- function(tbl, item1, item2, value, k = 2, ...) {
tbl_matrix <- tbl %>%
spread(item2, col_name(value), fill = 0) %>%
remove_rownames() %>%
column_to_rownames("item1") %>%
as.matrix()
cmdscale(tbl_matrix, k = k) %>%
as.data.frame() %>%
rownames_to_column("item") %>%
as.tibble()
}
austen_dist %>%
multi_scale(item1, item2, distance)
#> Warning: Setting row names on a tibble is deprecated.
#> # A tibble: 6 x 3
#> item V1 V2
#> <chr> <dbl> <dbl>
#> 1 Sense & Sensibility 211. 302.
#> 2 Pride & Prejudice 36.7 -64.8
#> 3 Mansfield Park -3942. 1325.
#> 4 Emma -3513. -1525.
#> 5 Northanger Abbey 3924. -115.
#> 6 Persuasion 3284. 77.5
This can be used for something like this:
austen_dist %>%
multi_scale(item1, item2, distance) %>%
ggplot(aes(V1, V2, color = item)) +
geom_point(size = 3, alpha = 0.8) +
scale_y_continuous(limits = c(-4000, 4000)) +
theme_bw()
Now, this method returns a warning because of column_to_rownames() that I would like to get rid of. I'm searching for one but wanted to know your thoughts before I submit a PR.
