different: Features and API design

- Current State of the `diff_tbl` class
  - print
  - summary
  - plot
  - Helpers
- Desired API
  - Missing Pieces
  - Behind the Scenes
  - Resolving Column Matches
  - Resolving Row Alignment
  - tidyselect column name semantics
  - Value Normalization
- Overall Workflow

## Current State

There are two main features of `different` that have been implemented to date: `diff_compare(x, y)` generates a `diff_tbl` object that contains a tidy summary of the differences between `x` and `y`, and `diff_report(x, y)` creats a self contained HTML report.

Behind the scenes, a major component of the differencing process is the alignment of the dataframe rows, which is handled by `different:::align_data_frames()`.  Column alignment is exclusively by matching column names.

###  `diff_tbl` class

The use of the `diff_tbl` class is limited primarily to implementing methods for the following generics.

#### print

`print.diff_tbl` gives simple output like

```r
> diff_obj
<diff_tbl: x vs y>
❯ There were 156 differences across 8 cols and 86 rows
```

or

```r
> diff_compare(x, y2)
<diff_tbl: x vs y2>
✔ There were no differences found between `x` and `y2`
```

or

```r
> diff_compare(x, y2)
<diff_tbl: x vs y2>
✔ There were no differences in overlapping columns between `x` and `y2`
```

#### summary

`summary.diff_tbl` gives a more detailed view of the differences between `x` and `y`.

```r
> diff_compare(x, y) %>% summary()
```

<details><summary>Output</summary>

```
── different: Comparison Summary ─────────────────────────────────────────
# Dimensions
    set    rows  cols
    ----- ----- -----
    x       100    11
    y       100    10

# Columns
● x has 1 unique column:
    `colname_02`
● There are 10 columns that appear in both
  ✔ 2 cols are identical: 
    `id_01`, `id_02`
  ✖ 8 cols have differences: 
    `colname_01`, `colname_03`, 
    `colname_04`, `colname_05`, `colname_06`, 
    `colname_07`, `colname_09`, 
    `colname_10`

# Differences
❯ There were 156 differences across 8 cols and 86 rows
    variable   type.x    type.y    state    miss_count diff             
    ----      -----     -----     -----         ----- ------           
    colname_01 character character diff             20 <tibble [20 × 5]>
    colname_03 integer   character diff             20 <tibble [20 × 5]>
    colname_04 integer   integer   diff             20 <tibble [20 × 5]>
    colname_05 character character diff             19 <tibble [19 × 5]>
    colname_06 numeric   numeric   diff             20 <tibble [20 × 5]>
    colname_07 character character diff             20 <tibble [20 × 5]>
    colname_09 factor    factor    diff             17 <tibble [17 × 5]>
    colname_10 character character diff             20 <tibble [20 × 5]>
    id_01      integer   integer   same              0 <NULL>           
    id_02      character character same              0 <NULL>           
    colname_02 character <NA>      unique_x         NA <NULL>  
```
</details>

or

```r
> diff_compare(x, y2) %>% summary()
```

<details><summary>Output</summary>

```
── different: Comparison Summary ─────────────────────────────────────────
# Dimensions
    set    rows  cols
    ----- ----- -----
    x       100    11
    y2      100     9

# Columns
● x has 2 unique columns:
    `colname_04`, `colname_05`
● There are 9 columns that appear in both
  ✔ 9 cols are identical: 
    `colname_01`, `colname_02`, 
    `colname_03`, `colname_06`, `colname_07`, 
    `colname_09`, `colname_10`, `id_01`, 
    `id_02`

# Differences
✔ There were no differences in overlapping columns between `x` and `y2`
    variable   type.x    type.y    state    miss_count diff  
    ----      -----     -----     -----         ----- ------
    colname_01 character character same              0 <NULL>
    colname_02 character character same              0 <NULL>
    colname_03 integer   integer   same              0 <NULL>
    colname_06 numeric   numeric   same              0 <NULL>
    colname_07 character character same              0 <NULL>
    colname_09 factor    factor    same              0 <NULL>
    colname_10 character character same              0 <NULL>
    id_01      integer   integer   same              0 <NULL>
    id_02      character character same              0 <NULL>
    colname_04 integer   <NA>      unique_x         NA <NULL>
    colname_05 character <NA>      unique_x         NA <NULL>
```
</details>

#### plot

`plot.diff_tbl` gives a visual summary of differences by row, where differences (misses) are encoded in red, and values that exist only in the `x` or `y` data frames are colored blue or green.

#### Helpers

There are also helper methods for `as_tibble()` (convert to standard tibble) and `metadata()` (extract `diff_meta` from the attributes of the `diff_tbl` object).

## Desired API

### Missing Pieces

The current workflow requires that rows and columns of the two data frames can be matched easily by matching keys and column names.

Depending on the "distance" between the two data sets, getting both sides to the point where they "snap together" can be a significant data wrangling burden. It would be nice to develop a workflow that assists in this process in a more explicit manner and with the following tasks:

1. Matching columns
    - Fuzzy string matching of names may help
    - Matching by distribution of values in the columns

2. Row alignment
    - Allow user to specify keys on both sides
    - or provide guess for keys?

3. Value normalization
    - Identify recoding of values in matched columns
    - Identify dropped values or levels

Each of the above steps can be handled with standard `dplyr` processing, but there are two key points of friction that the `different` API can help alleviate.

1. Consolidate the exploratory code into helper functions that run repeated tests across columns, for example in identifying the equivalence maps between values in an `x` column and a `y` column. This process will still require manual decision making but `different` can streamline the exploration. This is also, in part, where the embedded Shiny app comes into play.

2. Track *all of the dimensions of difference* between the two data frames. The pre-processing steps required to reconcile the two data frames are, themselves, part of the differences between the two data frames. But `diff_compare()` doesn't know about these steps and thus can only report on the differences identified between the processed `x` and `y` data.

   Ideally, by involving `different` in the reconciliation process, we can track and report these steps in addition to the final differences between the two objects.

### Behind the Scenes

**Link the data early in a new `diff_pair` class.** Ideally, the two data frames can be linked together immediately after import. By relying on R's [copy-on-modify semantics](https://adv-r.hadley.nz/names-values.html#copy-on-modify), we can reference the original data in the new class.

```r
x <- read_csv("data_set_1.csv")
y <- read_tsv("data_set_2.tsv")
z <- diff_pair(x, y)
```

Internally, this class can record the transformations that take place instead of actually performing them, and the transformations can be applied to both (or one) of the original data sets. For printing, the recorded actions can be replayed on the first 10-25 rows rather than the entire data set.

This also means we can consistently expect to be able to reference the original values in reporting and other outputs.

**Dispatch on the method of the first argument.** Wherever possible, functions in `different` should be able to be applied to a `diff_pair` or `diff_tbl` class _OR_ to an `x`, `y` pair of data frames. (If `...` is used in a function designed primarily for `diff_pair`, then the `y` argument is named `.y` and needs to be named explicitly.)

```r
diff_pair(x, y) %>% diff_compare()
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows

diff_compare(x, y)
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows
```

### Resolving Column Matches

Functions with the `diff_cols_` prefix help resolve column matching issues. Column name mismatches are resolved using the function `diff_cols_match(z, ..., .y = NULL)` to register the final meshed column names.

```r
z <- diff_pair(x, y)

diff_cols_common(z)
#> `colname_01`, `colname_02`, ...

diff_cols_unique(z)
#> x has 1 unique column:
#>   `colname_03`
#> y has 1 unique column:
#>   `colname_04`

diff_cols_fuzzy(z)
#> Fuzzy string matching suggests x -> y column matches:
#>  "colname_05" = "colname 05",
#>  "column 6" = c("column_06", "cols 6")

# ...above output could also be a tibble

diff_cols_guess(z)
#> ...returns a tibble with columns:
#> - x: column name from x
#> - best_match: column name from y
#> - match_reason:
#>     - exact name
#>     - value test
#>     - fuzzy name match

z <- diff_cols_match(z,
  colname_05 = c("column 05", "colname 05"),         # rename mismatched columns
  colname_05 = c(x = "column 05", y = "colname 05"), # ...same as above
  c(y = "cols 6", x = "column 6"),                   # let order give precedence
  "cols 6" = c("column 6", "cols 6"),                # ...same as above
  colname_new_04 = "colname_04"         # rename columns from x,y simultaneously
)

# If no parameters given to diff_cols_match(), 
# use .method = c("strict", "fuzzy", "guess")
z <- diff_cols_match(z, .method = "guess")
#> ...print messages about how matches were guessed...
```

Transitioning into modifying the values (or mutating the input data), we could also have a function `diff_cols_type()` that coerces (matched) columns to a particular type. In general, this would be useful only for data types that can be whole-sale coerced, e.g. `integer` to `character`.

```r
z <- diff_cols_type(z, "colname_05" = "double", "colname_06" = "character")
```

This process could be implicitly incorporated into `diff_cols_match()` via a `.coerce_type = TRUE` argument that would coerce matched columns into the same type, using the type of the column with precedence or the column from `x`.

Finally, the user can also specifiy that certain columns should be dropped altogether.

```r
# diff_cols_select <- function(z, ..., .which = c("both", "x", "y"))
z <- diff_cols_select(z, starts_with("colname"))
```

### Resolving Row Alignment

Row alignment is required for unordered data frames or data frames with differeing numbers of rows.

Key variables can be specified in several places:

```r
z <- diff_pair(x, y, keys = c("id_01", "id_02"))

z <- diff_cols_keys(z, id_01, id_02)

# Or use left = right like in `full_join()`, left takes precedence
z <- diff_cols_keys(z, id_11 = id_01, id_02)

diff_compare(z, keys = c("id_01"))
```

Note that the keys reference the _meshed_ column names:

```r
diff_pair(x, y, keys = "id_01") %>%
  diff_cols_match(id_99 = "id_01")
#> Error: key `id_01` not found in `x` or `y`. Did you mean `id_99`?

diff_pair(x, y, keys = "id_99") %>%
  diff_cols_match(id_99 = "id_01")
#> will work!
```

### tidyselect column name semantics

Column names should match `dplyr` + `tidyselect` semantics/feel:

- `select()` style:
  - `diff_cols_select()`
  - `diff_cols_keys()`
- `rename()` style (ish):
  - `diff_cols_match()`

### Value Normalization

This portion of the API should help with two things: 

- surfacing equivalence maps between two columns in the presence of 1:1 or minor modifications
- modifying the values in columns of one or both data sets according to the discovered mappings.

This functionality is differentiated from `diff_cols_guess()` in the sense that these functions help to resolve encoding issues in columns that are known (or have been confirmed) to be matched.

```r
diff_values(z)
#> $colname_01
#> # A tibble: 6 x 5
#>   variable   value.x value.y id_01 id_02 
#>   <chr>      <chr>   <chr>   <int> <chr> 
#> 1 colname_01 xikarq  xikarq     77 ldhuft
#> 2 colname_01 qdoyuv  qdoyuv      2 gkmols
#> 3 colname_01 zapizm  zapizm     53 izygxg
#> 4 colname_01 aknhkr  aknhkr     24 pmxpow
#> 5 colname_01 zamsxb  zamsxb      9 hpwqhh
#> 6 colname_01 hhgfxp  hhgfxp     15 dtjden
#> 
#> $colname_03
#> # A tibble: 6 x 5
#>   variable   value.x value.y id_01 id_02 
#>   <chr>      <chr>   <chr>   <int> <chr> 
#> 1 colname_03 -24     -8         77 ldhuft
#> 2 colname_03 -8      -8          2 gkmols
#> 3 colname_03 -19     6          53 izygxg
#> 4 colname_03 26      26         24 pmxpow
#> 5 colname_03 -46     19          9 hpwqhh
#> 6 colname_03 14      14         15 dtjden

# or
diff_values(z, colname_03)
#> # A tibble: 6 x 5
#>   variable   value.x value.y id_01 id_02 
#>   <chr>      <chr>   <chr>   <int> <chr> 
#> 1 colname_03 -24     -8         77 ldhuft
#> 2 colname_03 -8      -8          2 gkmols
#> ...3
```

This provides a list of tibbles with `value.x` and `value.y` for matching rows for each column. The user can explore these tibbles, or they can use additional helper functions like `diff_values_plot()` and `diff_values_count()`.

```r
diff_values_plot(z)

#...same as
diff_values(z) %>% 
  imap(~ ggplot(.x) + aes(value.x, value.y) + geom_point() + ggtitle(.y)) %>% 
  cowplot::plot_grid(plotlist = .)

diff_values_count(z, "colname_13")
#> $colname_13
#> # A tibble: 4 x 3
#>   value.x value.y     n
#>     <int> <chr>   <int>
#> 1       1 a          26
#> 2       2 b          31
#> 3       3 c          16
#> 4       4 d          27
```

The value combination counts hint at the equivalence map, which we can get by observing any 1:1, 1:n or n:1 mappings.

```r
diff_values_equivalent(z, "colname_13")
#> $colname_13
#> A tibble: 4 x 3
#>   set   value equivalent
#>   <chr> <int> <chr>     
#> 1 x         1 a         
#> 2 x         2 b         
#> 3 x         3 c         
#> 4 x         4 d
```

This function returns any 1:1 mappings in a 1:1 or (1 of n):1 format, where `set` tracks the source data set of `value` and its `equivalent` value in the other data set. A potential extension would be to include a `.fuzzy` argument that could be used for approximate string matching to catch values that are equivalent up to a small typo.

These mappings, once verified, are then applied to the `diff_pair`:

```r
# ---- Columns from one side, pre column matching ----
# re-map values for the `x` set
diff_values_mutate(z, colname_13 = letters[colname_13], .which = "x")

# or for y
diff_values_mutate(z, colname_13 = which(letters == colname_13), .which = "y")

# ---- Both columns, post column matching ----
diff_values_mutate(z, vital_status = as.integer(vital_status == "ALIVE"))

# ---- Automatic mapping ----
diff_values_mutate(z, .map = diff_values_equivalent(z, "colname_13))
```

## Overall Workflow

To summarize what an idealized workflow might look like, here's a purely made-up example where some work is required to match columns and values.

```r
x <- read_csv("data_set_1.csv")
y <- read_tsv("data_set_2.tsv")
z <- diff_pair(x, y)

# Check columns
diff_cols_unique(z)
diff_cols_guess(z)

z <- diff_cols_match(z
  colname_05 = c("Column 5", "colname 05"),
  colname_06 = c(y = "cols 6", x = "column 6"),
  colname_new_04 = "colname_04"
) %>%
  diff_cols_select(contains("id|colname")) %>%
  diff_cols_keys(id_01, id_02)

# Check values
diff_values_plot(z)
diff_values_equivalent(z, colname_13)

z <- z %>%
  diff_values_mutate(colname_04 = recode(colname_04, "old_x" = "new"), .which = "x") %>%
  diff_values_mutate(colname_04 = recode(colname_04, "old_y" = "new"), .which = "y") %>%
  diff_values_mutate(colname_06 = colname_06 == "YES")

# printing a diff_pair object runs a truncated version of diff_compare
z
#> <diff_tbl - preview: x vs y>
#> ❯ There were 73 differences across 8 cols and 21 rows in the first 100 rows

diff_x_y <- diff_compare(z)
diff_x_y
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows
```

I'm not sure about how `diff_pair` and `diff_tbl` should interact. Should `diff_compare()` "freeze" the comparison? Or should `diff_pair` have a `diff_tbl` slot that it updates on `diff_compare()`?

One argument for the need for `diff_compare()` is that it "finalizes" the row-wise alignment. Whereas the `diff_pair` object doesn't _need_ to have two data frames aligned by rows, the comparison _is meaningless_ under certain conditions if the alignment isn't completed. The user might need to do some wrangling to get the alignment to work -- maybe a `diff_check_alignment()` is needed? -- so `diff_pair` lets the difference comparison be suspended in state until we're ready to run the comparison (like a Schrodinger tibble).

Thoughts and suggestions welcome cc @tgerke 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

different: Features and API design #7

Current State

`diff_tbl` class

print

summary

plot

Helpers

Desired API

Missing Pieces

Behind the Scenes

Resolving Column Matches

Resolving Row Alignment

tidyselect column name semantics

Value Normalization

Overall Workflow

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

different: Features and API design #7

Description

Current State

diff_tbl class

print

summary

plot

Helpers

Desired API

Missing Pieces

Behind the Scenes

Resolving Column Matches

Resolving Row Alignment

tidyselect column name semantics

Value Normalization

Overall Workflow

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`diff_tbl` class