fix(api): implement robust zip extraction and path sanitization#419
Open
PratyushNayak99 wants to merge 1 commit intoEAPD-DRB:mainfrom
Open
fix(api): implement robust zip extraction and path sanitization#419PratyushNayak99 wants to merge 1 commit intoEAPD-DRB:mainfrom
PratyushNayak99 wants to merge 1 commit intoEAPD-DRB:mainfrom
Conversation
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Linked issue
Existing related work reviewed
Overlap assessment
Why this PR should proceed
Engineering Report: ZIP Extraction & Metadata Synchronization
Overview
This report documents the architectural corrections implemented in
UploadRoute.pyto fix a critical disconnect between the frontend model pathing variables and the backend physical extraction protocols. The changes were applied systematically across both the modernhandle_full_zip()and legacyuploadCaseUnchunked_old()routing endpoints to ensure robust, zero-collision data persistence.Code Overview: Implemented Defenses
The following structural improvements and safety checks were integrated inline into the zip parsing logic:
casename_raw = case): Safe-guards the root storage path if an uploaded ZIP does not containerize its internal configuration files natively within a parent directory.[:-4] if ...endswith('.zip')): Mathematical validation to strip literal.zipsuffixes internally encoded by irregular localized user compression utilities.os.path.exists()): Overridesextractall()execution if the post-sanitized directory name is already occupied by moving the evaluation boundary upstream.os.rename()): Mutates the raw OS directory outputted byextractall()to strictly align with the internal trackingcasenameparameter sequentially, completely avoidingFileExistsErrorapp crashes.🛠️ Click to expand: Technical Q&A / Architectural Justifications
1. The Root Cause (The Metadata Trap)
Question: In the original codebase, why was relying solely on
zippedfilepath.parent.namedangerous?Response: Focusing solely on
zippedfilepath.parent.nameblindly trusted the internal metadata structure of the uploaded ZIP. Depending on the user's OS or local ZIP utility, the internal wrapper directory was occasionally generated with a literal.zipsuffix (e.g., a folder physically nameddummy_case.zip). Python'szf.extractall()inherently clones this exact directory string directly onto the hard drive. However,resDataJSON paths were constructed expecting standardized, suffix-free folders. This generated a massive desync between UI tracking logic looking for standard folders, and literal backend filesystem locations terminating in.zipformats, resulting in dead frontend routing loops.2. The "Naked" Archive Defense
Question: Explain the edge case where a user uploads an archive without a parent folder.
Response: If a user compresses root files directly instead of zipping a wrapping folder,
zippedfilepath.parent.nameexecutes against a flat path structure and resolves to a blank string"". Executingextractall()on an un-restricted metadata variable triggers the backend to dump those loose configuration components indiscriminately into the root environment (DataStorage/), utterly corrupting the workspace. Thecasename_raw = casesafety mechanism intercepts blank parent resolutions and builds an immediate virtual fence identical to the ZIP's uploaded filename to safely containerize the execution.3. The Extraction Limitation
Question: Why did we have to use
os.rename(old_dir, new_dir)after the extraction?Response: Targeting
zf.extractall(target_path)combines the target location with the precise internal tree still locked within the ZIP format. Appending a clean target directory directly to the extraction string results in deep nested configurations (e.g.,DataStorage/dummy_case/dummy_case.zip/genData.json). The only mathematically absolute way to maintain single-directory structural integrity was to let Python resolve the native extraction sequence unaltered to the host level, let the underlying OS directory manifest, and structurally realign it backward manually via an absoluteos.rename()pass on the root folder.4. Preemptive Collision Resistance
Question: By moving the
if not os.path.exists(...)check to evaluate the sanitizednew_casenamebefore extraction, what specific server crash or filesystem error did you prevent?Response: By evaluating the sanitized target before unzipping, we generated an idempotent flow. Previously, we verified existence against corrupted internal metadata. That vulnerability sequence allowed
zf.extractall()to spawn.zipfolders, at which point the system would desperately try to execute anos.rename(old, new). If the sanitized/new/version was already occupied inDataStorage, the server crashed instantly with aFileExistsError. Pushing the evaluation upstream inherently hijacked the system's pre-existing handling to naturally report a standard "Model Already Exists!" UI warning on the frontend and simultaneously kill the crash-prone extraction pipeline completely.5. Architectural Consistency
Question: You applied this fix to both
handle_full_zip()anduploadCaseUnchunked_old(). Why is it critical to patch the legacy routes?Response: In comprehensive open-source maintenance, older API routes function as silent threat vectors. They are routinely engaged by un-updated automated staging scripts, backward-compatible API gateways, or alternate UI workflows. Permitting a data-layer fragmentation bug to survive inside older, yet actively resolvable endpoints compromises the structural integrity of the application tier entirely. Patching the backend horizontally guarantees dataset stability entirely independent of which legacy or modern gateway vector initiated the routing request.