update get_result() route #510

Open
dale-wahl wants to merge 17 commits into master from dataset-key-for-result

Conversation

@dale-wahl
Member

This route now requires a dataset key (or, if provided, can use the dataset object directly) to ensure the user has rights to the files.

One annoying caveat is that PixPlot HTML files have the old route hardcoded. I need to either find a way to work around that or write a migration script; new plots will be produced correctly.

To reduce db requests, we could memcache the dataset data and re-instantiate the dataset object from that data. `dataset.refresh_owners()` would still run, which makes its own db requests. This only really seems relevant for something like PixPlot, which makes many requests for various thumbs/images.
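The caching idea could be sketched roughly like this. This is a minimal illustration only: `get_cached_dataset_row`, `_dataset_cache`, and the TTL are hypothetical names, not 4CAT's actual API, and a real implementation would use memcache rather than a process-local dict.

```python
import time

# Process-local stand-in for memcache: map dataset key -> (timestamp, row).
_dataset_cache = {}
CACHE_TTL = 60  # seconds; hypothetical value

def get_cached_dataset_row(key, fetch_from_db):
    """Return the cached dataset row for `key`, calling `fetch_from_db(key)`
    only on a cache miss or after the TTL expires."""
    now = time.time()
    cached = _dataset_cache.get(key)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]
    row = fetch_from_db(key)
    _dataset_cache[key] = (now, row)
    return row
```

With this, repeated requests for thumbnails from the same dataset would hit the db once per TTL window instead of once per request (though, as noted, `refresh_owners()` would still issue its own queries).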

@dale-wahl dale-wahl requested a review from stijn-uva July 4, 2025 12:07
@stijn-uva
Member

The implementation makes sense, but one concern I have is that this breaks existing URLs (i.e. the key is now required in the URL and old URLs won't work at all anymore). I have used direct URLs in the past to share files and now these would no longer work. So I would prefer if the key-less URLs would still work, perhaps optionally - there could be a global option "allow file downloads without login" or something.

@dale-wahl
Member Author

@stijn-uva This looks good now. You can enable the legacy links via a general global setting.

For the legacy path, we could attempt to extract the dataset key. E.g.:

    # Attempt to extract the dataset key from query_file and validate it;
    # the key of the dataset the files belong to can be extracted from the
    # file name in a predictable way.
    possible_keys = re.findall(r"[abcdef0-9]{32}", query_file)
    if possible_keys:
        # if for whatever reason there are multiple hashes in the filename,
        # the key would always be the last one
        key = possible_keys.pop()
        try:
            dataset = DataSet(key=key, db=g.db, modules=g.modules)
            # If dataset is found, delegate to the main get_result function for proper access control and file serving
            return get_result(query_file=query_file, dataset_key=key, dataset=dataset)
        except DataSetException:
            # If dataset is not found, fall back to legacy behavior
            pass

We could do that before checking the setting, and only fall back to the setting if no dataset is found. The snippet is lifted from the cleanup worker.
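The proposed ordering could be sketched as follows. Everything here is hypothetical (`resolve_legacy_request`, the `lookup_dataset` callable, and the `allow_anonymous` flag stand in for the actual route, `DataSet` lookup, and the "allow file downloads without login" setting):

```python
import re

def resolve_legacy_request(query_file, lookup_dataset, allow_anonymous):
    """Sketch of the fallback order for legacy URLs: try to recover the
    dataset key from the filename first; consult the global setting only
    if no dataset can be found."""
    possible_keys = re.findall(r"[abcdef0-9]{32}", query_file)
    if possible_keys:
        # if there are multiple hashes in the filename, the key is the last one
        dataset = lookup_dataset(possible_keys[-1])
        if dataset is not None:
            # delegate to the keyed route, with full access control
            return ("dataset", dataset)
    if allow_anonymous:
        # legacy behaviour, gated behind the global setting
        return ("legacy", query_file)
    return ("denied", None)
```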

@dale-wahl dale-wahl marked this pull request as ready for review March 4, 2026 14:45
@dale-wahl
Member Author

@stijn-uva, alright, merged the archive and get_result routes... mostly. @sal-uva does not have the dataset's actual result file (just the key and the filename of the specific archived file). We need both the result zip filename and the file path inside the zip to extract it via the get_result route (I did not like the idea of parsing paths like /dataset_results.zip/some/zipped/file, though it is a possible alternative). So I kept the archive route to instantiate the DataSet and get the file. We could refactor the new get_media_from_children function to also provide the dataset's result archive filename (and use extract_file_from_archive to avoid iterating through the dataset files until .metadata.json), and then remove the archive route if really desired.
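For reference, pulling a single named member out of a result zip (the job `extract_file_from_archive` is described as doing) can be done directly with the stdlib, without iterating over the other files. This is a generic sketch, not the actual 4CAT helper:

```python
import zipfile

def read_member_from_archive(archive_path, member_name):
    """Read one named member from a zip archive without scanning the
    rest of its contents. Raises KeyError if the member is absent."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member_name) as member:
            return member.read()
```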

I also extract the zipped file and stream it rather than reading it into memory. This might not matter for most social media data, but it could cause issues with larger videos.
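The streaming approach can be illustrated with a chunked generator over a zip member; the web framework would then iterate over it as the response body. A sketch with an assumed chunk size, not the PR's exact code:

```python
import zipfile

def stream_archive_member(archive_path, member_name, chunk_size=8192):
    """Yield a zipped file's contents in chunks so a large member (e.g. a
    video) is never held in memory all at once."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member_name) as member:
            while True:
                chunk = member.read(chunk_size)
                if not chunk:
                    return
                yield chunk
```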
