fix(api): implement robust zip extraction and path sanitization#419

Open
PratyushNayak99 wants to merge 1 commit into EAPD-DRB:main from PratyushNayak99:fix-zip-bug-final
Conversation

@PratyushNayak99
Linked issue

Existing related work reviewed

  • Issues/PRs reviewed: None found after search

Overlap assessment

  • Classification: Core Bug Fix / Path Routing
  • Overlapping items: None
  • Why this is not duplicate/superseded: This addresses a specific backend disk reconciliation bug not previously documented.

Why this PR should proceed

Engineering Report: ZIP Extraction & Metadata Synchronization

Overview
This report documents the corrections made in UploadRoute.py to fix a mismatch between the model paths the frontend expects and the directories the backend actually creates during extraction. The changes were applied to both the current handle_full_zip() endpoint and the legacy uploadCaseUnchunked_old() endpoint, so that extracted data lands in a predictable location without name collisions.

Code Overview: Implemented Defenses
The following safety checks were added inline to the ZIP parsing logic:

  • Naked Archive Resolution (casename_raw = case): Falls back to the uploaded case name as the storage directory when the ZIP does not wrap its configuration files in a parent directory.
  • Metadata Sanitization ([:-4] if ...endswith('.zip')): Strips a literal .zip suffix from the internal directory name, which some compression utilities add.
  • Preemptive Collision Routing (os.path.exists()): Checks whether the sanitized directory name already exists before extractall() runs, and skips extraction if it does.
  • Physical Disk Reconciliation (os.rename()): Renames the directory produced by extractall() so that the on-disk name matches the sanitized casename. Because the collision check runs first, the rename no longer fails with FileExistsError.
Technical Q&A / Architectural Justifications

1. The Root Cause (The Metadata Trap)
Question: In the original codebase, why was relying solely on zippedfilepath.parent.name dangerous?
Response: Relying solely on zippedfilepath.parent.name trusted the internal directory structure of the uploaded ZIP. Depending on the user's OS or ZIP utility, the internal wrapper directory was occasionally created with a literal .zip suffix (e.g., a folder named dummy_case.zip). Python's zf.extractall() reproduces that directory name verbatim on disk. The resData JSON paths, however, were built for suffix-free folder names. The frontend therefore looked for a standard folder while the data lived in a folder ending in .zip, and the frontend routes never resolved.
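
The mismatch and the suffix check can be reproduced in a few lines. This is a minimal sketch, not the code in UploadRoute.py: the helper name strip_zip_suffix and the example member path are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def strip_zip_suffix(casename_raw: str) -> str:
    """Drop a trailing literal '.zip' from a directory name, if present."""
    # Hypothetical helper wrapping the `[:-4] if ...endswith('.zip')` check.
    return casename_raw[:-4] if casename_raw.endswith(".zip") else casename_raw


# Assumed example: an archive whose wrapper folder is named "dummy_case.zip".
zippedfilepath = PurePosixPath("dummy_case.zip/genData.json")

casename_raw = zippedfilepath.parent.name      # name taken from ZIP metadata
new_casename = strip_zip_suffix(casename_raw)  # name the JSON paths expect
```

Here casename_raw keeps the suffix that extractall() would write to disk, while new_casename is the suffix-free name the rest of the application works with.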

2. The "Naked" Archive Defense
Question: Explain the edge case where a user uploads an archive without a parent folder.
Response: If a user compresses the files directly instead of zipping a wrapping folder, zippedfilepath.parent.name is evaluated against a flat path and resolves to an empty string "". Running extractall() with that empty name causes the backend to write the loose configuration files directly into the storage root (DataStorage/), polluting the workspace. The casename_raw = case fallback catches the empty parent name and uses a directory named after the uploaded ZIP file instead, so the extracted files stay contained.
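
A small sketch of the fallback, assuming that case holds the name derived from the uploaded ZIP file; the function name resolve_casename_raw and its signature are hypothetical and not taken from the PR.

```python
from pathlib import PurePosixPath


def resolve_casename_raw(zippedfilepath: str, case: str) -> str:
    """Return the wrapper directory name, or fall back to `case` for a flat archive."""
    parent_name = PurePosixPath(zippedfilepath).parent.name
    # A member at the archive root has parent "." whose name is the empty string.
    return parent_name if parent_name else case


wrapped = resolve_casename_raw("dummy_case/genData.json", "upload_name")
naked = resolve_casename_raw("genData.json", "upload_name")
```

With a wrapped archive the internal folder name is used as before; with a naked archive the empty result is replaced by the uploaded name, so extraction has a directory to target.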

3. The Extraction Limitation
Question: Why did we have to use os.rename(old_dir, new_dir) after the extraction?
Response: zf.extractall(target_path) joins the target location with the directory tree stored inside the ZIP. Appending a clean target directory to the extraction path therefore produces a nested layout (e.g., DataStorage/dummy_case/dummy_case.zip/genData.json). To keep a single-level directory, we let extractall() write the archive's own tree into the storage root unmodified, then rename the resulting top-level folder to the sanitized name with os.rename().
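
The extract-then-rename step can be sketched as follows. The function extract_and_realign, the demo helper, and the temporary directory layout are assumptions for illustration; only zipfile.ZipFile.extractall() and os.rename() come from the description above.

```python
import os
import tempfile
import zipfile


def extract_and_realign(zip_path, storage_dir, casename_raw, new_casename):
    """Extract into storage_dir, then rename the top-level folder if needed."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(storage_dir)  # writes the archive's own tree verbatim
    old_dir = os.path.join(storage_dir, casename_raw)
    new_dir = os.path.join(storage_dir, new_casename)
    if old_dir != new_dir:
        os.rename(old_dir, new_dir)  # align the on-disk name with the casename
    return new_dir


def demo():
    """Build a ZIP whose wrapper folder ends in '.zip' and extract it."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "upload.zip")
        storage = os.path.join(tmp, "DataStorage")
        os.mkdir(storage)
        with zipfile.ZipFile(zip_path, "w") as zf:
            zf.writestr("dummy_case.zip/genData.json", "{}")
        new_dir = extract_and_realign(
            zip_path, storage, "dummy_case.zip", "dummy_case"
        )
        has_file = os.path.isfile(os.path.join(new_dir, "genData.json"))
        return os.listdir(storage), has_file
```

After demo() runs, the storage directory contains a single dummy_case folder with genData.json directly inside it, rather than a nested dummy_case/dummy_case.zip/ layout.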

4. Preemptive Collision Resistance
Question: By moving the if not os.path.exists(...) check to evaluate the sanitized new_casename before extraction, what specific server crash or filesystem error did you prevent?
Response: Evaluating the sanitized target before unzipping makes the flow idempotent: uploading the same model twice no longer changes anything on disk. Previously, existence was checked against the unsanitized internal name. That allowed zf.extractall() to create a .zip-suffixed folder, after which the code attempted os.rename(old, new). If the sanitized new name was already present in DataStorage, the server crashed with a FileExistsError. Moving the check upstream reuses the existing handling to return the standard "Model Already Exists!" warning to the frontend and skips the extraction entirely.
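
The ordering described above, sanitize, then check, then extract, might look like the sketch below. The function upload_case, the demo helper, and the shape of the returned dictionaries are hypothetical; only the "Model Already Exists!" message and the os.path.exists() check are taken from the PR description.

```python
import os
import tempfile
import zipfile


def upload_case(zip_path, storage_dir, casename_raw):
    """Check the sanitized name first; extract and rename only if it is free."""
    new_casename = casename_raw[:-4] if casename_raw.endswith(".zip") else casename_raw
    new_dir = os.path.join(storage_dir, new_casename)
    if os.path.exists(new_dir):
        # Nothing has been written yet, so no stray folder is left behind.
        return {"status": "exists", "message": "Model Already Exists!"}
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(storage_dir)
    old_dir = os.path.join(storage_dir, casename_raw)
    if old_dir != new_dir:
        os.rename(old_dir, new_dir)
    return {"status": "created", "casename": new_casename}


def demo():
    """Upload the same archive twice and report both results."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "upload.zip")
        storage = os.path.join(tmp, "DataStorage")
        os.mkdir(storage)
        with zipfile.ZipFile(zip_path, "w") as zf:
            zf.writestr("dummy_case.zip/genData.json", "{}")
        first = upload_case(zip_path, storage, "dummy_case.zip")
        second = upload_case(zip_path, storage, "dummy_case.zip")
        return first, second, sorted(os.listdir(storage))
```

In this sketch the second upload returns the warning without touching the disk, so DataStorage still contains only dummy_case and no leftover dummy_case.zip folder.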

5. Architectural Consistency
Question: You applied this fix to both handle_full_zip() and uploadCaseUnchunked_old(). Why is it critical to patch the legacy routes?
Response: Legacy API routes stay reachable, so any bug left in them stays reachable too. They can still be called by automated staging scripts that have not been updated, by backward-compatible API gateways, or by alternate UI workflows. Leaving the path bug in an older but still resolvable endpoint would let inconsistent directories be created through that route. Patching both endpoints keeps stored data consistent regardless of which route handled the upload.

@github-actions github-actions bot added the needs-intake-fix PR intake structure needs maintainer follow-up label Apr 6, 2026
Labels

needs-intake-fix PR intake structure needs maintainer follow-up

Development

Successfully merging this pull request may close these issues.

[Bug] handle_full_zip() uses wrong variable for resData and viewDefinitions paths