
plan with one target runs out of memory (only inside drake and not manually) #930

@cimentadaj

Description

Processing a big file fails with 'cannot allocate vector of size X' or 'cannot allocate buffer', but ONLY inside drake. I can read and process the same file outside of drake without any error.

Reproducible example

I had some trouble making this a reprex because I'm very unfamiliar with drake. In fact, this is my first drake project, but since I found this weird error I thought it would be useful to report it here. Instead of a reprex, I have a minimal working repository on GitHub that contains the workflow. I explain it below.

  1. Clone the repo:
git clone https://github.com/cimentadaj/spain_census.git
  2. Run renv (next iteration of packrat) for package management
devtools::install_github("rstudio/renv")
renv::restore() # should only take 1-2 mins
  3. Load drake and run r_make()
library(drake)
r_make()
# This will take a few mins because it downloads the data which is about 4M rows

There are four files (the same as in drake's documentation):

  • code/01-packages.R loads packages
  • code/02-reading_data.R has one function which downloads, reads and saves the data in output/
  • code/plan.R outlines the plan.
  • _drake.R

I use r_make() because my workflow is very interactive. When I run it, everything runs OK (although slowly, because the data is very heavy) up to the plan in code/plan.R. That is, line 13 reads the heavy data, but when the plan executes the target process_data (which only selects a few columns), drake crashes with memory-related problems. The specific error is Error : cannot allocate vector of size 7.9GB or 'cannot allocate buffer'.

However, if I run all the scripts inside the folder code/, manually run everything up to line 13 in code/plan.R, and then just do select(read_data, CPRO), it works. The error only happens inside drake.

Session info

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] drake_7.4.0     workflowr_1.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       ps_1.3.0         crayon_1.3.4     assertthat_0.2.1
 [5] digest_0.6.19    R6_2.4.0         backports_1.1.4  storr_1.2.1     
 [9] magrittr_1.5     evaluate_0.14    cli_1.1.0        rlang_0.4.0     
[13] renv_0.5.0-66    callr_3.2.0      rmarkdown_1.13   tools_3.6.0     
[17] igraph_1.2.4.1   processx_3.3.1   xfun_0.7         compiler_3.6.0  
[21] pkgconfig_2.0.2  base64url_1.4    htmltools_0.3.6  knitr_1.23      

Expected output

What output would the correct behavior have produced?
No error, and readd(process_data) would then return the correct data frame.
