Fix file storage redundancy and non-re-use. Server files WILL build up. #9

@bdklahn

UPDATE: scroll to last paragraph, for the latest proposed solution.

Whenever a new file is uploaded, it looks like it will always create a new file, even if that exact file has already been uploaded.
If a file with that name is already there, the save function stores the upload under a new, unique name.

from django.core.files.storage import FileSystemStorage

savefile = FileSystemStorage()
csvname = savefile.save(uploaded_file.name, uploaded_file)  # returns the name actually used

There appears to be no means to clean up files. It doesn't look like there is even a mechanism to reuse any uploaded file, after the first upload is triggered.
Their only use seems to be an immediate call to a function which loads the file into a DataFrame, stored in a global input_ts variable (See: #8).
So it looks like calls to upload files will always create new files (which will never be re-used), with no mechanism to ever remove any.

If we do want to try and re-use any redundant files, between users/sessions, we might try to see if they are identical to something previously uploaded, by hash or something. But something, somewhere, would need to store hashes.
I guess another way to try and keep things clean might be to run a cleanup function, on every upload, which maybe deletes anything older than n days.
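
A minimal sketch of that cleanup idea, assuming the uploads sit directly in one directory and that modification time is a good enough proxy for "age". The function name remove_old_uploads and its parameters are made up for illustration; nothing like this exists in the project yet.

```python
import time
from pathlib import Path


def remove_old_uploads(media_dir, max_age_days, now=None):
    """Delete regular files in media_dir last modified more than max_age_days ago.

    Hypothetical helper. Returns the names of the deleted files, sorted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in Path(media_dir).iterdir():
        # Skip subdirectories; only plain files are candidates.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

It could be called at the top of the upload view, e.g. remove_old_uploads(settings.MEDIA_ROOT, 7), though that would also delete a file some other session is still pointing at.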

But I wonder if it might be easier to handle file data (also) per session, and not use file storage.
Thoughts, @ajs997 @srinivvenkat ?

It seems to me that file storage under media only makes sense for things we might want to re-use/share between users/sessions.

Other thoughts: We could store an actual global hash-to-file-name dictionary. Each upload attempt could hash the file and check whether the hash is already in the dict. If so, it could just create a symlink, with the name the user wants, to the duplicate file.
I guess we might then need to deal with files/links of the same name, but different content. Seems fiddly. But maybe it could work if we stick with Django's file name collision renaming. Maybe start with that, then remove and switch to pointing to a previous name, found to hash to the same value.
Fiddly.
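
For what it's worth, the hash lookup itself might look something like the sketch below. It leaves out the symlink step and just returns the previously stored name. The names file_digest, stored_name_for and hash_to_name are invented here, and save is assumed to be anything with the shape of FileSystemStorage().save(name, content). A plain in-memory dict would not survive a restart or be shared between worker processes, so this only shows the logic.

```python
import hashlib


def file_digest(fileobj, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)  # rewind so the caller can still save the upload
    return digest.hexdigest()


def stored_name_for(fileobj, requested_name, hash_to_name, save):
    """Reuse the stored name of identical content, else save and record it.

    hash_to_name: dict mapping digest -> stored file name (hypothetical global).
    save: callable(name, fileobj) -> stored name.
    """
    digest = file_digest(fileobj)
    if digest in hash_to_name:
        return hash_to_name[digest]  # duplicate content: nothing new is written
    stored = save(requested_name, fileobj)
    hash_to_name[digest] = stored
    return stored
```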

Maybe this sort of thing is supposed to be handled via request.FILES.
https://docs.djangoproject.com/en/4.2/topics/http/file-uploads/

But it looks like we still need to handle where/how things get saved, re-used, and cleaned up.

Ah, I see . . .

uploaded_file = request.FILES['document']

So maybe it's fine to just use the file from there. If it is a large file, it looks like the actual file already ends up with a unique name, under /tmp.
So maybe it is fine to just access it from there? It looks like you can interface with it as a File object (via that 'document' key). Maybe a request.session variable could even be set to refer to that file, for non-upload form requests?
Maybe readfile() could always read from that original storage, and return a df (which could be utilized from a local variable, where it's called). Maybe if that File interface could return a truly unique ID for the file, that could be used in the readfile arguments, to allow @cache memoization to keep it in memory (similar to storing it in a global input_ts variable, but safe for sessions). We could set a cache size limit (e.g. functools.lru_cache with a maxsize), to avoid a memory leak.
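
Roughly what I mean by the memoization, as a sketch only. It assumes the unique ID is simply the file path, and it uses csv.DictReader as a stand-in for the pandas load so that it runs with just the standard library. The maxsize of 8 is an arbitrary number.

```python
import csv
from functools import lru_cache


@lru_cache(maxsize=8)  # bounded: at most 8 parsed files are kept in memory
def readfile(filepath):
    """Parse the CSV at filepath; repeat calls with the same path hit the cache."""
    with open(filepath, newline="") as handle:
        # A tuple of row dicts stands in for the DataFrame here.
        return tuple(csv.DictReader(handle))
```

One catch: every caller gets the same cached object back, so anything that modifies the data would need to work on a copy first.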

Ok. Here's what it looks like we could do: Continue to use the FileSystemStorage, and use the actual generated (to be unique) file name. Save the actual local storage path (a string) in a session variable, with something like request.session['input_ts_csv'] = fs_store.path(name), using the storage's path() method to get it.
https://docs.djangoproject.com/en/4.2/ref/files/storage/#django.core.files.storage.Storage.path
Do local_df = readfile(that_filepath) (with the path gotten from the session dict), whenever something needs the data.
readfile() could use some functools @cache memoization (on that uniqueified file path), to save the data in memory (for a while, or for some number of dfs), similar to having it in that input_ts variable.
Then, I guess, we have a function which gets called to clean up some number or age of the oldest files.
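
The session part of that could be sketched as below. storage.save() and storage.path() are the real Storage methods; the helper names remember_upload and current_upload_path are made up, and the 'document' and 'input_ts_csv' keys are just the ones mentioned above.

```python
def remember_upload(request, storage, field="document", session_key="input_ts_csv"):
    """Save the uploaded file and record its on-disk path in the session.

    Hypothetical helper; storage is expected to be a FileSystemStorage().
    """
    uploaded_file = request.FILES[field]
    # save() may return a different name than requested if there is a collision.
    name = storage.save(uploaded_file.name, uploaded_file)
    request.session[session_key] = storage.path(name)
    return request.session[session_key]


def current_upload_path(request, session_key="input_ts_csv"):
    """Return the path saved for this session, or None if nothing was uploaded."""
    return request.session.get(session_key)
```

Views that need the data would then call readfile(current_upload_path(request)) instead of touching a global.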

Labels

bug: Something isn't working
