Sketch fst driver by wlandau · Pull Request #109 · richfitz/storr

wlandau · 2019-06-18T02:30:38Z

This PR is a sketch of the fst driver I suggested in #108. It completely based on the RDS driver. Instead of saving the serialized raw vector with writeBin(), we wrap it in a data frame and save it with write_fst(). Optional compression is powered by compress_fst(). If this works out, I plan to add functionality to set the serialization method based on the class/type of the object, re #77 (comment) (thinking of Keras models).

A quick glance at speed is encouraging (with 4 OpenMP threads on a mid-tier Kubuntu machine). cc @MarcusKlik.

library(microbenchmark)
library(storr)
data <- runif(2.5e7)
pryr::object_size(data)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 200 MB
s1 <- storr_fst(tempfile())
s2 <- storr_rds(tempfile())
microbenchmark(
  fst = s1$set("x", data),
  rds = s2$set("x", data),
  times = 1
)
#> Unit: seconds
#>  expr       min        lq      mean    median        uq       max neval
#>   fst  1.568269  1.568269  1.568269  1.568269  1.568269  1.568269     1
#>   rds 22.373394 22.373394 22.373394 22.373394 22.373394 22.373394     1

^{Created on 2019-06-17 by the reprex package (v0.3.0)}

Issues and questions

fst for arbitrary data storage? fstpackage/fst#201
Tests are failing.
To try to avoid copying too much code, R6_driver_fst currently inherits from R6_driver_rds. Should it? A lot of code needs to be copied anyway because of hard-coded references to RDS, and I am not sure I feel about the current one-foot-in-and-one-foot-out approach.

MarcusKlik · 2019-06-18T10:55:34Z

Hi @wlandau, if I understand correctly, you will always be serializing the complete raw vector to disk as a whole (just like saveRDS().

Because you do not need random access to the stored vector, the meta-data stored in the fst file is a bit of an overkill. You might also test a setup where you compress the raw vector using compress_fst() and then store the result using saveRDS(raw_vec, compress = FALSE).

The advantage is that the compressor can use bigger chunks for compression, increasing the compression ratio (can be a big effect).
The disadvantage is that you are not compressing during the write to disk as with write_fst(), so it might be a little slower.

perhaps you can add this option to the benchmark and see what happens!

best

wlandau · 2019-06-18T11:27:54Z

Thanks for the useful advice, @MarcusKlik! Looks like we might be able to use fst compression in the existing RDS storr and reap most/all of the benefits. Really exciting!

library(fst)
library(magrittr)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
data <- runif(2.5e8)
object_size(data)
#> 2 GB
serialize(data, NULL, ascii = FALSE) %>%
  compress_fst() %>%
  object_size()
#> 1.28 GB
s1 <- storr_fst(tempfile())
s2 <- storr_fst(tempfile(), compress = TRUE)
s3 <- storr_rds(tempfile())
f <- function(data) {
  raw <- serialize(data, NULL, ascii = FALSE)
  cmp <- compress_fst(raw)
  saveRDS(cmp, tempfile(), compress = FALSE)
}
microbenchmark(
  fst = s1$set("x", data),
  fst_cmp = s2$set("x", data),
  rds = s3$set("x", data),
  f = f(data),
  times = 1
)
#> Unit: seconds
#>     expr        min         lq       mean     median         uq        max
#>      fst  12.043528  12.043528  12.043528  12.043528  12.043528  12.043528
#>  fst_cmp  12.056381  12.056381  12.056381  12.056381  12.056381  12.056381
#>      rds 206.944071 206.944071 206.944071 206.944071 206.944071 206.944071
#>        f   8.960977   8.960977   8.960977   8.960977   8.960977   8.960977
#>  neval
#>      1
#>      1
#>      1
#>      1

^{Created on 2019-06-18 by the reprex package (v0.3.0)}

wlandau · 2019-06-18T16:52:45Z

@MarcusKlik, I think #109 (comment) is a the best answer to https://stackoverflow.com/questions/56614592/faster-way-to-slice-a-raw-vector. If you post, I will accept and grant you the reputation points.

wlandau · 2019-06-19T02:47:05Z

After fstpackage/fst@c8bfde9, 519c38d, and d06e6f7, storr_fst() can handle data over 2 GB.

library(digest)
library(fst)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
x <- runif(3e8)
object_size(x)
#> 2.4 GB
s <- storr_fst(tempfile())
s$set(key = "x", value = x, use_cache = FALSE)
y <- s$get("x", use_cache = FALSE)
digest(x) == digest(y)
#> [1] TRUE

^{Created on 2019-06-18 by the reprex package (v0.3.0)}

wlandau · 2019-06-19T03:12:34Z

After looking at fstpackage/fst@c8bfde9, I merged #109 (storr_fst(), powered by write_fst() and read_fst()) and #111 (compress_fst() within RDS storrs) into https://github.com/wlandau/storr/tree/fst-compare for the sake of head-to-head benchmarking. Below, the most useful comparison is fst_none (storr_fst(), powered by write_fst() and read_fst()) versus rds_cmpr (compress_fst() withinin RDS storrs). It looks like we can save large-ish data faster with #109, but we can read small data faster with #111. Overall, I would say #109 wins in terms of speed, but not by much, at least compared to the current default behavior of storr.

So do we go with #109 or #111? Maybe both? #109 is a bit faster for large data, but the implementation is more complicated, and it requires development fst at the moment. #111 keeps the storr internals simpler, and it opens up the compress argument to allow more types of compression options in future development.

+2 GB data benchmarks:

library(digest)
library(fst)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
x <- runif(3e8)
object_size(x)
#> 2.4 GB
fst_none <- storr_fst(tempfile(), compress = "none")
fst_cmpr <- storr_fst(tempfile(), compress = "fst")
rds_none <- storr_rds(tempfile(), compress = "none")
rds_cmpr <- storr_rds(tempfile(), compress = "fst")
rds_default <- storr_rds(tempfile(), compress = "gzfile")
microbenchmark::microbenchmark(
  fst_none = fst_none$set(key = "x", value = x, use_cache = FALSE),
  fst_cmpr = fst_cmpr$set(key = "x", value = x, use_cache = FALSE),
  rds_none = rds_none$set(key = "x", value = x, use_cache = FALSE),
  rds_cmpr = rds_cmpr$set(key = "x", value = x, use_cache = FALSE),
  rds_default = rds_default$set(key = "x", value = x, use_cache = FALSE),
  times = 1
)
#> Repacking large object
#> Repacking large object
#> Unit: seconds
#>         expr        min         lq       mean     median         uq
#>     fst_none   9.943442   9.943442   9.943442   9.943442   9.943442
#>     fst_cmpr  16.299916  16.299916  16.299916  16.299916  16.299916
#>     rds_none  13.164031  13.164031  13.164031  13.164031  13.164031
#>     rds_cmpr  16.212105  16.212105  16.212105  16.212105  16.212105
#>  rds_default 263.741792 263.741792 263.741792 263.741792 263.741792
#>         max neval
#>    9.943442     1
#>   16.299916     1
#>   13.164031     1
#>   16.212105     1
#>  263.741792     1
microbenchmark::microbenchmark(
  fst_none = fst_none$get(key = "x", use_cache = FALSE),
  fst_cmpr = fst_cmpr$get(key = "x", use_cache = FALSE),
  rds_none = rds_none$get(key = "x", use_cache = FALSE),
  rds_cmpr = rds_cmpr$get(key = "x", use_cache = FALSE),
  rds_default = rds_default$get(key = "x", use_cache = FALSE),
  times = 1
)
#> Unit: seconds
#>         expr       min        lq      mean    median        uq       max
#>     fst_none  3.997610  3.997610  3.997610  3.997610  3.997610  3.997610
#>     fst_cmpr  9.683951  9.683951  9.683951  9.683951  9.683951  9.683951
#>     rds_none  2.891874  2.891874  2.891874  2.891874  2.891874  2.891874
#>     rds_cmpr  4.645483  4.645483  4.645483  4.645483  4.645483  4.645483
#>  rds_default 12.585657 12.585657 12.585657 12.585657 12.585657 12.585657
#>  neval
#>      1
#>      1
#>      1
#>      1
#>      1

^{Created on 2019-06-18 by the reprex package (v0.3.0)}

Small data benchmarks:

library(digest)
library(fst)
library(microbenchmark)
library(pryr)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
library(storr)
fst_none <- storr_fst(tempfile(), compress = "none")
fst_cmpr <- storr_fst(tempfile(), compress = "fst")
rds_none <- storr_rds(tempfile(), compress = "none")
rds_cmpr <- storr_rds(tempfile(), compress = "fst")
rds_default <- storr_rds(tempfile(), compress = "gzfile")
microbenchmark::microbenchmark(
  fst_none = fst_none$set(key = "x", value = runif(1), use_cache = FALSE),
  fst_cmpr = fst_cmpr$set(key = "x", value = runif(1), use_cache = FALSE),
  rds_none = rds_none$set(key = "x", value = runif(1), use_cache = FALSE),
  rds_cmpr = rds_cmpr$set(key = "x", value = runif(1), use_cache = FALSE),
  rds_default = rds_default$set(key = "x", value = runif(1), use_cache = FALSE)
)
#> Unit: microseconds
#>         expr     min       lq     mean   median       uq      max neval
#>     fst_none 264.506 273.3300 291.2796 287.7015 304.0750  488.368   100
#>     fst_cmpr 284.696 292.3405 333.9626 312.5445 326.6185 1240.974   100
#>     rds_none 231.126 240.5605 256.2129 252.6815 266.9450  438.347   100
#>     rds_cmpr 249.724 257.7000 301.7618 267.2055 285.0685 2301.713   100
#>  rds_default 239.317 253.3230 295.7192 300.4380 319.8265  815.004   100
microbenchmark::microbenchmark(
  fst_none = fst_none$get(key = "x", use_cache = FALSE),
  fst_cmpr = fst_cmpr$get(key = "x", use_cache = FALSE),
  rds_none = rds_none$get(key = "x", use_cache = FALSE),
  rds_cmpr = rds_cmpr$get(key = "x", use_cache = FALSE),
  rds_default = rds_default$get(key = "x", use_cache = FALSE)
)
#> Unit: microseconds
#>         expr     min       lq      mean   median       uq       max neval
#>     fst_none 119.854 127.4390 484.00473 131.7765 144.9625 34477.090   100
#>     fst_cmpr 125.987 131.3730 144.91864 135.6095 143.7470   237.292   100
#>     rds_none  76.280  81.6835 115.04263  85.8315  93.7595  2406.733   100
#>     rds_cmpr  91.058  95.7800 111.86227  99.9195 106.4455   513.389   100
#>  rds_default  76.906  81.3260  90.50064  86.5245  93.3555   156.133   100

^{Created on 2019-06-18 by the reprex package (v0.3.0)}

Sketch fst driver

b5e5699

This was referenced Jun 18, 2019

Use compress_fst() in the RDS driver? #110

Open

Multiformat driver #103

Open

wlandau mentioned this pull request Jun 19, 2019

fst for arbitrary data storage? fstpackage/fst#201

Closed

wlandau added 2 commits June 18, 2019 22:40

Use class<- workaround

519c38d

Unserialize at the right time (fst)

d06e6f7

This was referenced Jun 19, 2019

lz4 and zstd compression via fst #111

Open

Speed up the cache ropensci/drake#907

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sketch fst driver#109

Sketch fst driver#109
wlandau wants to merge 3 commits intorichfitz:masterfrom
wlandau:108

wlandau commented Jun 18, 2019

Uh oh!

MarcusKlik commented Jun 18, 2019

Uh oh!

wlandau commented Jun 18, 2019 •

edited

Loading

Uh oh!

wlandau commented Jun 18, 2019

Uh oh!

wlandau commented Jun 19, 2019

Uh oh!

wlandau commented Jun 19, 2019 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wlandau commented Jun 18, 2019

Issues and questions

Uh oh!

MarcusKlik commented Jun 18, 2019

Uh oh!

wlandau commented Jun 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wlandau commented Jun 18, 2019

Uh oh!

wlandau commented Jun 19, 2019

Uh oh!

wlandau commented Jun 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wlandau commented Jun 18, 2019 •

edited

Loading

wlandau commented Jun 19, 2019 •

edited

Loading