How cSplit() treats multiple splitCols when they contain different numbers of fields #46

@aushev

Here is an example table:

dt1 <- fread("V1 V2       V3
              x  xA;xB;xC x1;x2;x3
              y  yD       y1
              z  zF;zG    z1")

I want to split it on both the V2 and V3 columns. Note that the last record is "wrong": V2 has two values while V3 has only one. Here is how cSplit() treats such cases:

# with default arguments:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long')
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  y NA NA
#6:  y NA NA
#7:  z zF z1
#8:  z zG NA
#9:  z NA NA

# with `makeEqual = TRUE`:
cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = T)
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  y NA NA
#6:  y NA NA
#7:  z zF z1
#8:  z zG NA
#9:  z NA NA

So the default behaves the same as makeEqual = TRUE, although the help page says it defaults to FALSE. Then I tried it explicitly with FALSE:

cSplit(dt1, splitCols = c('V2', 'V3'), sep=';', direction = 'long', makeEqual = F)
# Warning in `[.data.table`(indt, , `:=`(eval(splitCols), lapply(X, function(x) { :
#     Supplied 5 items to be assigned to 6 items of column 'V3' (recycled leaving remainder of 1 items).
#      V1 V2 V3
#   1:  x xA x1
#   2:  x xB x2
#   3:  x xC x3
#   4:  y yD y1
#   5:  z zF z1
#   6:  z zG x1

It recycles V3 elements, but it takes them from another group (row 6 for z gets x1, which belongs to x), which is rather unexpected. I think it would be more logical to produce one of the following outputs:

# without recycling, fill with NA:
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  z zF z1
#6:  z zG NA

# with recycling:
#    V1 V2 V3
#1:  x xA x1
#2:  x xB x2
#3:  x xC x3
#4:  y yD y1
#5:  z zF z1
#6:  z zG z1
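
For reference, both expected outputs can be reproduced with plain data.table, outside of cSplit(). This is only a rough sketch of the per-row behavior I have in mind, not how cSplit() is implemented: split_long() and its recycle argument are hypothetical names, and the code assumes that V1 uniquely identifies each row.

```r
library(data.table)

dt1 <- data.table(
  V1 = c("x", "y", "z"),
  V2 = c("xA;xB;xC", "yD", "zF;zG"),
  V3 = c("x1;x2;x3", "y1", "z1")
)

# Hypothetical helper, not part of splitstackshape.
# Splits V2 and V3 row by row, so values never leak between groups.
# Assumes V1 is unique per row.
split_long <- function(dt, recycle = FALSE) {
  dt[, {
    a <- strsplit(V2, ";", fixed = TRUE)[[1]]
    b <- strsplit(V3, ";", fixed = TRUE)[[1]]
    n <- max(length(a), length(b))
    if (recycle) {
      # repeat the shorter vector within the same row
      list(V2 = rep_len(a, n), V3 = rep_len(b, n))
    } else {
      # indexing past the end pads the shorter vector with NA
      list(V2 = a[seq_len(n)], V3 = b[seq_len(n)])
    }
  }, by = V1]
}

split_long(dt1)                  # fill with NA
split_long(dt1, recycle = TRUE)  # recycle within each row
```

The point is that the padding or recycling is decided within each original row, so the last z row ends up with either NA or z1, never with a value from x.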
