Description
readtext leaves temporary files in /tmp on Linux each time it reads a .docx file. If you read many large .docx files, the space in /tmp can be used up quickly.
I have been using Natural Language Processing on a corpus of 1000 books, where each book is a few hundred pages. Each time I open a book, about 10 MB of space is consumed in /tmp. The files are not cleaned up by readtext, so if I open and process all 1000 books (one at a time, in a loop), about 10 GB of disk space is consumed in /tmp!
It would be really nice if readtext cleaned up the temporary files it creates in /tmp before returning its result, rather than leaving potentially large files behind.
A reprex would be to create a .docx file (foo.docx) consisting of about 300 pages of text, then run the following loop in R:
for (i in 1:1000) {
dummy <- readtext::readtext("/path/to/foo.docx")
}
This will open foo.docx 1000 times, creating 1000 temporary folders on /tmp which will consume 5-10 GB of disk space.
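Until this is fixed in readtext itself, one possible workaround is to snapshot the session temporary directory before each call and delete anything new afterwards. This is only a sketch: it assumes readtext's scratch folders are created under `tempdir()` (the usual location for R packages), and the wrapper function name `read_docx_clean` is hypothetical. If the folders land elsewhere under /tmp, the paths would need adjusting.

```r
# Hypothetical workaround: delete temp entries created by each readtext() call.
# Assumes readtext writes its scratch folders under tempdir().
read_docx_clean <- function(path) {
  before <- list.files(tempdir(), full.names = TRUE)  # snapshot before the call
  result <- readtext::readtext(path)
  after  <- list.files(tempdir(), full.names = TRUE)  # snapshot after the call
  unlink(setdiff(after, before), recursive = TRUE)    # remove only the new entries
  result
}

for (i in 1:1000) {
  dummy <- read_docx_clean("/path/to/foo.docx")
}
```

With this wrapper, the loop above should keep /tmp usage roughly constant instead of growing by ~10 MB per iteration, at the cost of an extra directory listing per call.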