-
Notifications
You must be signed in to change notification settings - Fork 0
Description
- Current State of the
diff_tblclass- summary
- plot
- Helpers
- Desired API
- Missing Pieces
- Behind the Scenes
- Resolving Column Matches
- Resolving Row Alignment
- tidyselect column name semantics
- Value Normalization
- Overall Workflow
Current State
There are two main features of different that have been implemented to date: diff_compare(x, y) generates a diff_tbl object that contains a tidy summary of the differences between x and y, and diff_report(x, y) creats a self contained HTML report.
Behind the scenes, a major component of the differencing process is the alignment of the dataframe rows, which is handled by different:::align_data_frames(). Column alignment is exclusively by matching column names.
diff_tbl class
The use of the diff_tbl class is limited primarily to implementing methods for the following generics.
print.diff_tbl gives simple output like
> diff_obj
<diff_tbl: x vs y>
❯ There were 156 differences across 8 cols and 86 rowsor
> diff_compare(x, y2)
<diff_tbl: x vs y2>
✔ There were no differences found between `x` and `y2`or
> diff_compare(x, y2)
<diff_tbl: x vs y2>
✔ There were no differences in overlapping columns between `x` and `y2`summary
summary.diff_tbl gives a more detailed view of the differences between x and y.
> diff_compare(x, y) %>% summary()Output
── different: Comparison Summary ─────────────────────────────────────────
# Dimensions
set rows cols
----- ----- -----
x 100 11
y 100 10
# Columns
● x has 1 unique column:
`colname_02`
● There are 10 columns that appear in both
✔ 2 cols are identical:
`id_01`, `id_02`
✖ 8 cols have differences:
`colname_01`, `colname_03`,
`colname_04`, `colname_05`, `colname_06`,
`colname_07`, `colname_09`,
`colname_10`
# Differences
❯ There were 156 differences across 8 cols and 86 rows
variable type.x type.y state miss_count diff
---- ----- ----- ----- ----- ------
colname_01 character character diff 20 <tibble [20 × 5]>
colname_03 integer character diff 20 <tibble [20 × 5]>
colname_04 integer integer diff 20 <tibble [20 × 5]>
colname_05 character character diff 19 <tibble [19 × 5]>
colname_06 numeric numeric diff 20 <tibble [20 × 5]>
colname_07 character character diff 20 <tibble [20 × 5]>
colname_09 factor factor diff 17 <tibble [17 × 5]>
colname_10 character character diff 20 <tibble [20 × 5]>
id_01 integer integer same 0 <NULL>
id_02 character character same 0 <NULL>
colname_02 character <NA> unique_x NA <NULL>
or
> diff_compare(x, y2) %>% summary()Output
── different: Comparison Summary ─────────────────────────────────────────
# Dimensions
set rows cols
----- ----- -----
x 100 11
y2 100 9
# Columns
● x has 2 unique columns:
`colname_04`, `colname_05`
● There are 9 columns that appear in both
✔ 9 cols are identical:
`colname_01`, `colname_02`,
`colname_03`, `colname_06`, `colname_07`,
`colname_09`, `colname_10`, `id_01`,
`id_02`
# Differences
✔ There were no differences in overlapping columns between `x` and `y2`
variable type.x type.y state miss_count diff
---- ----- ----- ----- ----- ------
colname_01 character character same 0 <NULL>
colname_02 character character same 0 <NULL>
colname_03 integer integer same 0 <NULL>
colname_06 numeric numeric same 0 <NULL>
colname_07 character character same 0 <NULL>
colname_09 factor factor same 0 <NULL>
colname_10 character character same 0 <NULL>
id_01 integer integer same 0 <NULL>
id_02 character character same 0 <NULL>
colname_04 integer <NA> unique_x NA <NULL>
colname_05 character <NA> unique_x NA <NULL>
plot
plot.diff_tbl gives a visual summary of differences by row, where differences (misses) are encoded in red, and values that exist only in the x or y data frames are colored blue or green.
Helpers
There are also helper methods for as_tibble() (convert to standard tibble) and metadata() (extract diff_meta from the attributes of the diff_tbl object).
Desired API
Missing Pieces
The current workflow requires that rows and columns of the two data frames can be matched easily by matching keys and column names.
Depending on the "distance" between the two data sets, getting both sides to the point where they "snap together" can be a significant data wrangling burden. It would be nice to develop a workflow that assists in this process in a more explicit manner and with the following tasks:
-
Matching columns
- Fuzzy string matching of names may help
- Matching by distribution of values in the columns
-
Row alignment
- Allow user to specify keys on both sides
- or provide guess for keys?
-
Value normalization
- Identify recoding of values in matched columns
- Identify dropped values or levels
Each of the above steps can be handled with standard dplyr processing, but there are two key points of friction that the different API can help alleviate.
-
Consolidate the exploratory code into helper functions that run repeated tests across columns, for example in identifying the equivalence maps between values in an
xcolumn and aycolumn. This process will still require manual decision making butdifferentcan streamline the exploration. This is also, in part, where the embedded Shiny app comes into play. -
Track all of the dimensions of difference between the two data frames. The pre-processing steps required to reconcile the two data frames are, themselves, part of the differences between the two data frames. But
diff_compare()doesn't know about these steps and thus can only report on the differences identified between the processedxandydata.Ideally, by involving
differentin the reconciliation process, we can track and report these steps in addition to the final differences between the two objects.
Behind the Scenes
Link the data early in a new diff_pair class. Ideally, the two data frames can be linked together immediately after import. By relying on R's copy-on-modify semantics, we can reference the original data in the new class.
x <- read_csv("data_set_1.csv")
y <- read_tsv("data_set_2.tsv")
z <- diff_pair(x, y)Internally, this class can record the transformations that take place instead of actually performing them, and the transformations can be applied to both (or one) of the original data sets. For printing, the recorded actions can be replayed on the first 10-25 rows rather than the entire data set.
This also means we can consistently expect to be able to reference the original values in reporting and other outputs.
Dispatch on the method of the first argument. Wherever possible, functions in different should be able to be applied to a diff_pair or diff_tbl class OR to an x, y pair of data frames. (If ... is used in a function designed primarily for diff_pair, then the y argument is named .y and needs to be named explicitly.)
diff_pair(x, y) %>% diff_compare()
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rows
diff_compare(x, y)
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rowsResolving Column Matches
Functions with the diff_cols_ prefix help resolve column matching issues. Column name mismatches are resolved using the function diff_cols_match(z, ..., .y = NULL) to register the final meshed column names.
z <- diff_pair(x, y)
diff_cols_common(z)
#> `colname_01`, `colname_02`, ...
diff_cols_unique(z)
#> x has 1 unique column:
#> `colname_03`
#> y has 1 unique column:
#> `colname_04`
diff_cols_fuzzy(z)
#> Fuzzy string matching suggests x -> y column matches:
#> "colname_05" = "colname 05",
#> "column 6" = c("column_06", "cols 6")
# ...above output could also be a tibble
diff_cols_guess(z)
#> ...returns a tibble with columns:
#> - x: column name from x
#> - best_match: column name from y
#> - match_reason:
#> - exact name
#> - value test
#> - fuzzy name match
z <- diff_cols_match(z,
colname_05 = c("column 05", "colname 05"), # rename mismatched columns
colname_05 = c(x = "column 05", y = "colname 05"), # ...same as above
c(y = "cols 6", x = "column 6"), # let order give precedence
"cols 6" = c("column 6", "cols 6"), # ...same as above
colname_new_04 = "colname_04" # rename columns from x,y simultaneously
)
# If no parameters given to diff_cols_match(),
# use .method = c("strict", "fuzzy", "guess")
z <- diff_cols_match(z, .method = "guess")
#> ...print messages about how matches were guessed...Transitioning into modifying the values (or mutating the input data), we could also have a function diff_cols_type() that coerces (matched) columns to a particular type. In general, this would be useful only for data types that can be whole-sale coerced, e.g. integer to character.
z <- diff_cols_type(z, "colname_05" = "double", "colname_06" = "character")This process could be implicitly incorporated into diff_cols_match() via a .coerce_type = TRUE argument that would coerce matched columns into the same type, using the type of the column with precedence or the column from x.
Finally, the user can also specifiy that certain columns should be dropped altogether.
# diff_cols_select <- function(z, ..., .which = c("both", "x", "y"))
z <- diff_cols_select(z, starts_with("colname"))Resolving Row Alignment
Row alignment is required for unordered data frames or data frames with differeing numbers of rows.
Key variables can be specified in several places:
z <- diff_pair(x, y, keys = c("id_01", "id_02"))
z <- diff_cols_keys(z, id_01, id_02)
# Or use left = right like in `full_join()`, left takes precedence
z <- diff_cols_keys(z, id_11 = id_01, id_02)
diff_compare(z, keys = c("id_01"))Note that the keys reference the meshed column names:
diff_pair(x, y, keys = "id_01") %>%
diff_cols_match(id_99 = "id_01")
#> Error: key `id_01` not found in `x` or `y`. Did you mean `id_99`?
diff_pair(x, y, keys = "id_99") %>%
diff_cols_match(id_99 = "id_01")
#> will work!tidyselect column name semantics
Column names should match dplyr + tidyselect semantics/feel:
select()style:diff_cols_select()diff_cols_keys()
rename()style (ish):diff_cols_match()
Value Normalization
This portion of the API should help with two things:
- surfacing equivalence maps between two columns in the presence of 1:1 or minor modifications
- modifying the values in columns of one or both data sets according to the discovered mappings.
This functionality is differentiated from diff_cols_guess() in the sense that these functions help to resolve encoding issues in columns that are known (or have been confirmed) to be matched.
diff_values(z)
#> $colname_01
#> # A tibble: 6 x 5
#> variable value.x value.y id_01 id_02
#> <chr> <chr> <chr> <int> <chr>
#> 1 colname_01 xikarq xikarq 77 ldhuft
#> 2 colname_01 qdoyuv qdoyuv 2 gkmols
#> 3 colname_01 zapizm zapizm 53 izygxg
#> 4 colname_01 aknhkr aknhkr 24 pmxpow
#> 5 colname_01 zamsxb zamsxb 9 hpwqhh
#> 6 colname_01 hhgfxp hhgfxp 15 dtjden
#>
#> $colname_03
#> # A tibble: 6 x 5
#> variable value.x value.y id_01 id_02
#> <chr> <chr> <chr> <int> <chr>
#> 1 colname_03 -24 -8 77 ldhuft
#> 2 colname_03 -8 -8 2 gkmols
#> 3 colname_03 -19 6 53 izygxg
#> 4 colname_03 26 26 24 pmxpow
#> 5 colname_03 -46 19 9 hpwqhh
#> 6 colname_03 14 14 15 dtjden
# or
diff_values(z, colname_03)
#> # A tibble: 6 x 5
#> variable value.x value.y id_01 id_02
#> <chr> <chr> <chr> <int> <chr>
#> 1 colname_03 -24 -8 77 ldhuft
#> 2 colname_03 -8 -8 2 gkmols
#> ...3This provides a list of tibbles with value.x and value.y for matching rows for each column. The user can explore these tibbles, or they can use additional helper functions like diff_values_plot() and diff_values_count().
diff_values_plot(z)
#...same as
diff_values(z) %>%
imap(~ ggplot(.x) + aes(value.x, value.y) + geom_point() + ggtitle(.y)) %>%
cowplot::plot_grid(plotlist = .)
diff_values_count(z, "colname_13")
#> $colname_13
#> # A tibble: 4 x 3
#> value.x value.y n
#> <int> <chr> <int>
#> 1 1 a 26
#> 2 2 b 31
#> 3 3 c 16
#> 4 4 d 27The value combination counts hint at the equivalence map, which we can get by observing any 1:1, 1:n or n:1 mappings.
diff_values_equivalent(z, "colname_13")
#> $colname_13
#> A tibble: 4 x 3
#> set value equivalent
#> <chr> <int> <chr>
#> 1 x 1 a
#> 2 x 2 b
#> 3 x 3 c
#> 4 x 4 dThis function returns any 1:1 mappings in a 1:1 or (1 of n):1 format, where set tracks the source data set of value and its equivalent value in the other data set. A potential extension would be to include a .fuzzy argument that could be used for approximate string matching to catch values that are equivalent up to a small typo.
These mappings, once verified, are then applied to the diff_pair:
# ---- Columns from one side, pre column matching ----
# re-map values for the `x` set
diff_values_mutate(z, colname_13 = letters[colname_13], .which = "x")
# or for y
diff_values_mutate(z, colname_13 = which(letters == colname_13), .which = "y")
# ---- Both columns, post column matching ----
diff_values_mutate(z, vital_status = as.integer(vital_status == "ALIVE"))
# ---- Automatic mapping ----
diff_values_mutate(z, .map = diff_values_equivalent(z, "colname_13))Overall Workflow
To summarize what an idealized workflow might look like, here's a purely made-up example where some work is required to match columns and values.
x <- read_csv("data_set_1.csv")
y <- read_tsv("data_set_2.tsv")
z <- diff_pair(x, y)
# Check columns
diff_cols_unique(z)
diff_cols_guess(z)
z <- diff_cols_match(z
colname_05 = c("Column 5", "colname 05"),
colname_06 = c(y = "cols 6", x = "column 6"),
colname_new_04 = "colname_04"
) %>%
diff_cols_select(contains("id|colname")) %>%
diff_cols_keys(id_01, id_02)
# Check values
diff_values_plot(z)
diff_values_equivalent(z, colname_13)
z <- z %>%
diff_values_mutate(colname_04 = recode(colname_04, "old_x" = "new"), .which = "x") %>%
diff_values_mutate(colname_04 = recode(colname_04, "old_y" = "new"), .which = "y") %>%
diff_values_mutate(colname_06 = colname_06 == "YES")
# printing a diff_pair object runs a truncated version of diff_compare
z
#> <diff_tbl - preview: x vs y>
#> ❯ There were 73 differences across 8 cols and 21 rows in the first 100 rows
diff_x_y <- diff_compare(z)
diff_x_y
#> <diff_tbl: x vs y>
#> ❯ There were 156 differences across 8 cols and 86 rowsI'm not sure about how diff_pair and diff_tbl should interact. Should diff_compare() "freeze" the comparison? Or should diff_pair have a diff_tbl slot that it updates on diff_compare()?
One argument for the need for diff_compare() is that it "finalizes" the row-wise alignment. Whereas the diff_pair object doesn't need to have two data frames aligned by rows, the comparison is meaningless under certain conditions if the alignment isn't completed. The user might need to do some wrangling to get the alignment to work -- maybe a diff_check_alignment() is needed? -- so diff_pair lets the difference comparison be suspended in state until we're ready to run the comparison (like a Schrodinger tibble).
Thoughts and suggestions welcome cc @tgerke