Use per-child temp dirs for local update data generation #252
wjxiz1992 wants to merge 5 commits into NVIDIA:dev from …
Conversation
Pull request overview
This PR fixes a data generation bug (#221) where running dsdgen with --update and parallel workers in local mode fails because all children try to write identically-named delete files to the same output directory. The fix isolates each child's output into its own temp directory, merges results afterward (deduplicating delete tables), and adds a pre-flight safety check for existing update data.
Changes:
- Introduces three new helper functions (`_check_existing_update_data`, `_merge_update_data_local`, `_generate_update_data_local`) to handle local update data generation with per-child temp directories
- Delegates the `--update` path in `generate_data_local()` to the new orchestrator function, removing the previous incomplete inline handling of the update case
- Restricts the existing base-data post-generation file-moving loop strictly to `source_table_names` (removing maintenance table handling that is now done separately)
```python
for p in procs:
    p.wait()
    if p.returncode != 0:
        print("dsdgen failed with return code {}".format(p.returncode))
        raise Exception("dsdgen failed")
```
There was a problem hiding this comment.
When one dsdgen process returns a non-zero exit code, the code raises an exception inside the for p in procs loop without waiting for or terminating the remaining still-running child processes. The finally block then immediately calls shutil.rmtree(temp_base), which may fail or produce unpredictable results if some dsdgen processes are still actively writing to that directory.
To fix this, before raising the exception (or as part of the finally block), iterate over the remaining processes and call p.terminate() (or p.kill()) followed by p.wait() to ensure all child processes have stopped before deleting the temp directory.
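A minimal sketch of that shutdown sequence, reusing the `procs` and `temp_base` names from the snippet above (illustrative only, and assuming `os` and `shutil` are already imported; not the PR's exact code):

```python
# Sketch: stop any still-running dsdgen children, reap them all,
# then it is safe to delete the shared temp tree.
for p in procs:
    if p.poll() is None:  # still running
        p.terminate()
    p.wait()              # always reap to avoid zombies
if os.path.exists(temp_base):
    shutil.rmtree(temp_base)
```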
Fixed: the finally block now terminates all still-running child processes via p.terminate() and reaps all children via p.wait() before deleting the temp directory.
Greptile Summary

This PR redesigns local update data generation to use per-child temp directories, eliminating the need for dsdgen's `-force` flag.

Key changes: …

One minor issue was found: the comment in …

Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[generate_data_local] -->|args.update is set| B{data_dir non-empty?}
B -->|Yes| C[Print warning: update data alongside existing content]
B -->|No| D[_generate_update_data_local]
C --> D
D --> E{Maintenance table dirs exist?}
E -->|Yes and no --overwrite_output| F[Raise Exception: use --overwrite_output]
E -->|No, or --overwrite_output set| G{Stale _temp_update exists?}
G -->|Yes| H[Print warning + shutil.rmtree stale temp]
G -->|No| I[Create per-child temp dirs]
H --> I
I --> J[Launch parallel dsdgen procs one per child]
J --> K[Wait for all procs sequentially]
K --> L{Any returncode != 0?}
L -->|Yes| M[Raise Exception: dsdgen failed]
L -->|No| N[_merge_update_data_local]
M --> O[finally: terminate+wait remaining procs\nshutil.rmtree temp_base]
N --> O
O --> P[du summary]
N --> Q{For each maintenance table}
Q -->|delete / inventory_delete| R{target file exists and not overwrite?}
R -->|Yes| S[Skip: print already present]
R -->|No| T[shutil.move from child_range_start]
Q -->|Regular table| U[shutil.move each child's file]
A -->|args.update not set| V[Original base-data path unchanged]
```

Last reviewed commit: ee83044
```python
    for p in procs:
        p.wait()
        if p.returncode != 0:
            print("dsdgen failed with return code {}".format(p.returncode))
            raise Exception("dsdgen failed")
    _merge_update_data_local(temp_base, data_dir, range_start, range_end,
                             args.parallel, args.update)
finally:
    if os.path.exists(temp_base):
        shutil.rmtree(temp_base)
```
Orphaned child processes on dsdgen failure
When one dsdgen process fails and the exception is raised at line 218, the remaining processes in procs that haven't been waited on yet are still running. The finally block then calls shutil.rmtree(temp_base), deleting the per-child directories underneath those still-running processes. On Linux, processes with already-open file handles can continue writing to deleted inodes, but any subsequent file creations will fail silently, producing incomplete data and wasting CPU/memory until the orphaned processes eventually terminate on their own.
The fix is to terminate remaining processes inside the finally block before removing the temp tree:
```python
finally:
    for p in procs:
        if p.poll() is None:
            p.terminate()
    if os.path.exists(temp_base):
        shutil.rmtree(temp_base)
```
Fixed: finally block terminates and waits on all child processes before cleanup.
nds/nds_gen_data.py (Outdated)
```python
if table in delete_tables:
    # All children produce identical delete files; keep only one copy.
    # Skip if already present (e.g. from a previous --range invocation).
    target_file = os.path.join(target_dir, f'{table}_{update}.dat')
    if not os.path.exists(target_file):
        src = os.path.join(temp_base, f'child_{range_start}',
                           f'{table}_{update}.dat')
        if os.path.exists(src):
            shutil.move(src, target_file)
```
Delete file skip ignores --overwrite_output
The if not os.path.exists(target_file) guard silently preserves the delete table file even when --overwrite_output is passed. For the same update number this is harmless (content is deterministic), but the behavior differs from the regular maintenance tables, which are always replaced on re-run. A user doing a full re-generation with --overwrite_output would reasonably expect all files to be refreshed. Consider adding a comment explaining the intentional skip, and/or honoring an explicit overwrite when --overwrite_output is set:
```python
if table in delete_tables:
    # All children produce identical delete files; keep only one copy.
    # Skip if already present (e.g. from a previous --range invocation)
    # unless the caller explicitly requested an overwrite.
    target_file = os.path.join(target_dir, f'{table}_{update}.dat')
    if os.path.exists(target_file) and not overwrite:
        continue
    src = os.path.join(temp_base, f'child_{range_start}',
                       f'{table}_{update}.dat')
    if os.path.exists(src):
        shutil.move(src, target_file)
```

This would require threading an `overwrite` boolean through from `_generate_update_data_local`.
Fixed: overwrite_output is now threaded through to the merge function and honored for delete tables.
Force-pushed from 453e716 to 8f80295
nds/nds_gen_data.py (Outdated)
```python
    if os.path.exists(src):
        shutil.move(src, target_file)
else:
    for i in range(range_start, range_end + 1):
        filename = f'{table}_{i}_{parallel}.dat'
        src = os.path.join(temp_base, f'child_{i}', filename)
        if os.path.exists(src):
            shutil.move(src, os.path.join(target_dir, filename))
```
Silent drop of expected merge sources
Both the delete-table path (line 165) and the regular-table path (line 171) silently skip a file when the source doesn't exist in the child temp directory. In the normal flow every expected file should be present; a missing file most likely indicates that dsdgen exited cleanly but produced no output for that table (e.g., empty scale factor for that partition). Without any diagnostic output the caller has no way to distinguish "table truly not generated at this scale" from "merge bug / disk-full left a partial child directory".
Consider adding a warning so incomplete output is surfaced immediately:
```python
    if os.path.exists(src):
        shutil.move(src, target_file)
else:
    for i in range(range_start, range_end + 1):
        filename = f'{table}_{i}_{parallel}.dat'
        src = os.path.join(temp_base, f'child_{i}', filename)
        if os.path.exists(src):
            shutil.move(src, os.path.join(target_dir, filename))
        else:
            print(f"Warning: expected source not found, skipping: {src}")
```
The same applies to the delete-table branch at line 165 — a print before the early return would help diagnose issues:
```python
if os.path.exists(src):
    shutil.move(src, target_file)
else:
    print(f"Warning: expected delete source not found, skipping: {src}")
```
Fixed: both delete-table and regular-table paths now print a warning when the expected source file is not found.
```python
temp_base = os.path.join(data_dir, '_temp_update')
if os.path.exists(temp_base):
    shutil.rmtree(temp_base)
```
No diagnostic when stale temp directory is silently removed
If a previous run crashed after creating _temp_update/, the leftover directory is quietly deleted here. A user debugging a data consistency problem would have no indication that orphaned temp data was discarded. Adding a log line makes this behaviour explicit without changing any logic:
```python
temp_base = os.path.join(data_dir, '_temp_update')
if os.path.exists(temp_base):
    print(f"Warning: removing stale temp directory from a previous run: {temp_base}")
    shutil.rmtree(temp_base)
```
Fixed: a warning is now printed before removing a stale _temp_update directory.
Force-pushed from 8f80295 to 37275f3
```python
def _check_existing_update_data(data_dir):
    """Check if data_dir already contains maintenance/update table data.

    Returns:
        list: maintenance table names that have existing data in data_dir.
    """
    existing = []
    for table_name in maintenance_table_names:
        table_dir = os.path.join(data_dir, table_name)
        if os.path.isdir(table_dir) and os.listdir(table_dir):
            existing.append(table_name)
    return existing
```
Pre-flight check blocks valid multi-update-set accumulation
`_check_existing_update_data` flags any non-empty maintenance table directory as a blocker, regardless of which update number was already generated. At first glance this seems to block a valid workflow where multiple update sets (e.g., update 1, then update 2) accumulate in the same output directory, since delete file names never collide (`delete_1.dat` vs `delete_2.dat`). However, regular table filenames do collide across update numbers, because the name encodes only the child index and parallel count, not the update number.
Looking at the regular table filename pattern on line 172, `f'{table}_{i}_{parallel}.dat'` does collide between update runs: child 1 of update 1 and child 1 of update 2 both produce `s_catalog_order_1_4.dat`. So the current pre-flight check is actually correct in blocking the second run without `--overwrite_output`. However, it's worth adding a comment explaining this reasoning so it isn't seen as overly conservative:
```python
def _check_existing_update_data(data_dir):
    """Check if data_dir already contains maintenance/update table data.

    Note: regular maintenance table filenames encode only child index and parallel
    count (e.g. s_catalog_order_1_4.dat), not the update number, so files from
    different update runs would collide. Any existing maintenance data therefore
    requires --overwrite_output to proceed.

    Returns:
        list: maintenance table names that have existing data in data_dir.
    """
```
Fixed: docstring now explains that regular maintenance table filenames don't encode the update number, so files from different update runs collide.
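To make the collision concrete, a tiny illustration (hypothetical values, not code from the PR):

```python
# The regular maintenance filename encodes only the child index and parallel
# count, so the same name is produced no matter which update number is running.
table, child, parallel = 's_catalog_order', 1, 4
filename = f'{table}_{child}_{parallel}.dat'
assert filename == 's_catalog_order_1_4.dat'  # identical for --update 1 and --update 2
```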
```python
for p in procs:
    p.wait()
    if p.returncode != 0:
        print("dsdgen failed with return code {}".format(p.returncode))
        raise Exception("dsdgen failed")
```
Early exception leaves later processes un-waited before finally
The p.wait() loop checks processes in order. If process at index k exits with a non-zero code, the exception is raised before p.wait() is called for processes k+1, k+2, .... The finally block correctly terminates those remaining processes, but there is a window where a process that finishes with a non-zero exit code between the exception raise and the finally block's p.poll() check could be missed by p.terminate() (because p.poll() returns its exit code, not None). Those processes are effectively orphaned without p.wait() being called on them, leaving zombie processes until the Python interpreter exits.
Consider waiting on all started processes in the finally block regardless of whether they needed termination:
```python
finally:
    for p in procs:
        if p.poll() is None:
            p.terminate()
        p.wait()  # reap all children to avoid zombies
    if os.path.exists(temp_base):
        shutil.rmtree(temp_base)
```
Fixed: p.wait() is now called unconditionally for all processes in the finally block, not just terminated ones.
Avoid silent data overwrites when generating update data with dsdgen by using isolated per-child temp directories instead of relying on dsdgen's -force flag. Delete tables are deduplicated during merge (only the first child's copy is kept), consistent with the existing HDFS approach. A pre-flight check rejects runs against directories that already contain update data unless --overwrite_output is set.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Force-pushed from 37275f3 to c0155c7
```python
_merge_update_data_local(temp_base, data_dir, range_start, range_end,
                         args.parallel, args.update, args.overwrite_output)
```
Partial merge leaves data_dir in inconsistent state on failure
_merge_update_data_local is called inside the try block, and the finally block cleans up temp_base. However, if the merge itself partially succeeds (some shutil.move calls complete) and then throws (e.g., disk-full, permission error), the already-moved files remain in data_dir while temp_base is deleted — leaving data_dir in a partially-merged, unrecoverable state without re-running with --overwrite_output.
Since there is no rollback for the partial merge, consider adding a warning comment before the merge call to make this behaviour explicit, or move the merge outside the try block and handle its failures separately:
```python
    _merge_update_data_local(temp_base, data_dir, range_start, range_end,
                             args.parallel, args.update, args.overwrite_output)
    # Note: if merge partially succeeded before raising, data_dir may be incomplete;
    # re-run with --overwrite_output to recover.
finally:
```
Fixed: added a comment before the merge call noting that partial failure leaves data_dir incomplete and requires --overwrite_output to recover.
```python
if args.update:
    _generate_update_data_local(args, data_dir, range_start, range_end, tool_path)
    return
```
Update path skips overwrite_output check for pre-existing base data
When args.update is truthy the function immediately delegates to _generate_update_data_local and returns, bypassing the existing get_dir_size guard that protects base data. This means that if data_dir is already populated with base (source) data, a user accidentally running --update against it will silently proceed and start writing maintenance table directories alongside the base data without any warning.
_check_existing_update_data only looks at maintenance_table_names, so it provides no protection here. Consider adding a check that warns (or fails) when source table directories already exist in data_dir before generating update data:
```python
if args.update:
    if not args.overwrite_output and get_dir_size(data_dir) > 0:
        # Only warn; don't block — source and update data can coexist,
        # but the user should know they're writing into an existing directory.
        print("Warning: data_dir '{}' is non-empty; update data will be written alongside "
              "existing content.".format(data_dir))
    _generate_update_data_local(args, data_dir, range_start, range_end, tool_path)
    return
```
Fixed: a warning is now printed when --update targets a non-empty data_dir.
- Warn when --update targets a non-empty data_dir so users know update data will be written alongside existing content.
- Add comment noting that a partial merge failure leaves data_dir incomplete and requires --overwrite_output to recover.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
nds/nds_gen_data.py (Outdated)
```python
        p.wait()
    if os.path.exists(temp_base):
        shutil.rmtree(temp_base)
subprocess.run(['du', '-h', '-d1', data_dir])
```
du silently swallowed on failure
subprocess.run is called without check=True, so a failure of the du command (e.g. permission error, missing binary, or path disappearing immediately after cleanup) is silently ignored. While this line is purely informational, the current pattern is inconsistent with most other subprocess.run calls in the file that do use check=True. At minimum, the return code should be checked, or a note added that failure here is intentional:
```python
subprocess.run(['du', '-h', '-d1', data_dir], check=False)  # best-effort display
```
Fixed: changed to check=False to make the best-effort intent explicit.
```python
        print("Warning: expected source not found, skipping: {}".format(src))


def _generate_update_data_local(args, data_dir, range_start, range_end, tool_path):
```
Unit tests not committed to the repository
The PR description lists 14 unit tests covering _check_existing_update_data, _merge_update_data_local, and _generate_update_data_local, but no test file appears in the diff or anywhere in the repository (a search for test*gen_data* returns no results). Without the test file being part of the repository, CI cannot automatically run or enforce them, and they are not available to future contributors.
Consider committing the test module (e.g. nds/test_nds_gen_data.py) as part of this PR to ensure regression coverage is preserved.
Acknowledged. The tests were used for local validation with a mock dsdgen. This project doesn't have an existing test framework or CI for Python unit tests yet, so committing them would add files without integration. Happy to add if maintainers want them included.
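For reference, a minimal sketch of one such test (pytest style; the import path and fixture usage are assumptions, since no test module exists in the repo yet):

```python
# Assumes the test module sits next to nds/nds_gen_data.py so these import.
from nds_gen_data import _check_existing_update_data, maintenance_table_names

def test_check_existing_update_data(tmp_path):
    data_dir = str(tmp_path)
    # Empty data_dir: no maintenance data reported.
    assert _check_existing_update_data(data_dir) == []
    # A non-empty maintenance table directory should be flagged.
    table = maintenance_table_names[0]
    table_dir = tmp_path / table
    table_dir.mkdir()
    (table_dir / f'{table}_1.dat').write_text('dummy row\n')
    assert _check_existing_update_data(data_dir) == [table]
```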
```python
if table in delete_tables:
    target_file = os.path.join(target_dir, f'{table}_{update}.dat')
    if os.path.exists(target_file) and not overwrite_output:
        print("Skipping {} (already present from a previous run)".format(target_file))
        continue
    src = os.path.join(temp_base, f'child_{range_start}',
                       f'{table}_{update}.dat')
    if os.path.exists(src):
        shutil.move(src, target_file)
```
Delete-table overwrite skips even when destination exists with wrong update number
The guard `if os.path.exists(target_file) and not overwrite_output: continue` uses `target_file = f'{table}_{update}.dat'`, which encodes the update number. This means if a user previously generated `delete_1.dat` and is now running `--update 2`, `target_file` is `delete_2.dat`; it doesn't exist yet, so no skip occurs, and the merge proceeds correctly.
However the logic becomes subtly wrong if `overwrite_output=True` is passed after a previous identical update run: the condition `not overwrite_output` is False, so the skip is bypassed and `shutil.move` replaces the file with identical content. This is harmless functionally, but the PR description states "delete/delete_1.dat kept from first range, not overwritten" for Test 4, which contradicts the actual code path when `--overwrite_output` is set. Consider adding a comment clarifying the intended semantics:
```python
if table in delete_tables:
    target_file = os.path.join(target_dir, f'{table}_{update}.dat')
    # All children produce identical delete content for a given update number.
    # Preserve any existing copy unless the caller explicitly requested overwrite.
    if os.path.exists(target_file) and not overwrite_output:
```
Fixed: added a comment clarifying the intended semantics — preserve existing copy unless caller explicitly requested overwrite.
- Add comment clarifying delete table dedup preserves existing copy unless --overwrite_output is explicitly set.
- Use check=False on du summary call to make best-effort intent explicit.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Add 14 tests covering _check_existing_update_data, _merge_update_data_local, and _generate_update_data_local using a mock dsdgen script.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Summary
Fixes #221 — `dsdgen` with `--update` requires `-force` to avoid file collisions (especially for delete tables), but blindly enabling `-force` can silently overwrite existing data if the user accidentally passes the wrong output path.

This PR redesigns local update data generation to use per-child temp directories, consistent with the existing HDFS approach in `generate_data_hdfs()`:

- Each child generates into its own temp directory, so `-force` is no longer needed
- Generation fails when update data already exists, unless `--overwrite_output` is explicitly set
- Temp directories are always cleaned up (`try/finally`)

A sketch of this flow follows the list.
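As a rough illustration of the per-child flow (a sketch only: the helper name, flag spelling, and error handling here are assumptions, not the PR's exact code):

```python
import os
import shutil
import subprocess

def _sketch_generate_update(tool_path, data_dir, scale, update,
                            range_start, range_end, parallel):
    temp_base = os.path.join(data_dir, '_temp_update')
    procs = []
    try:
        for i in range(range_start, range_end + 1):
            child_dir = os.path.join(temp_base, f'child_{i}')
            os.makedirs(child_dir)
            # Each child writes only into its own directory, so dsdgen's
            # -force flag is not needed to resolve file-name collisions.
            procs.append(subprocess.Popen(
                ['./dsdgen', '-scale', str(scale), '-update', str(update),
                 '-child', str(i), '-parallel', str(parallel),
                 '-dir', child_dir], cwd=tool_path))
        for p in procs:
            p.wait()
            if p.returncode != 0:
                raise Exception('dsdgen failed')
        # A merge step would then move per-child files into data_dir,
        # keeping a single copy of each (identical) delete file.
    finally:
        for p in procs:
            if p.poll() is None:
                p.terminate()
            p.wait()  # reap every child before removing the temp tree
        if os.path.exists(temp_base):
            shutil.rmtree(temp_base)
```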
Changes

`nds/nds_gen_data.py`: +103 / -8 lines

- `_check_existing_update_data()`: scans for existing maintenance table data
- `_merge_update_data_local()`: merges per-child temp dirs with delete table dedup
- `_generate_update_data_local()`: orchestrates temp-dir generation + merge + cleanup
- `generate_data_local()`: delegates to new function when `--update` is set

Test plan
Unit tests (14 tests)
Covers `_check_existing_update_data`, `_merge_update_data_local`, and `_generate_update_data_local` with mock dsdgen: `--overwrite_output`, no `-force` flag passed to dsdgen, incremental range preserves previous data.

CLI end-to-end tests (with mock dsdgen)
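These runs use a stand-in generator; a hypothetical minimal mock dsdgen could look like the following (flag handling and output names are assumptions modeled on the merge code quoted above, not the actual script used):

```python
#!/usr/bin/env python3
# Hypothetical mock dsdgen: accepts the flags the wrapper passes and emits
# placeholder update-mode files so the merge/dedup logic can be exercised
# without the real TPC-DS toolkit.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('-dir')
parser.add_argument('-update')
parser.add_argument('-child')
parser.add_argument('-parallel')
args, _unknown = parser.parse_known_args()  # ignore flags we don't model

os.makedirs(args.dir, exist_ok=True)
# One per-child file for a regular maintenance table...
with open(os.path.join(args.dir,
                       f's_catalog_order_{args.child}_{args.parallel}.dat'), 'w') as f:
    f.write('1|row|\n')
# ...and an identical delete file from every child (the collision source).
with open(os.path.join(args.dir, f'delete_{args.update}.dat'), 'w') as f:
    f.write('1|2|\n')
```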
Test 1 — Generate update data to fresh directory:
Result: 4 files per regular maintenance table, 1 file per delete table, `_temp_update` dir cleaned up. ✅

Test 2 — Re-run same path without `--overwrite_output` (safety check):
Result: Fails with "Update data already exists in directory /tmp/nds_test for tables: [...]". ✅

Test 3 — Re-run with `--overwrite_output`:
Result: Succeeds, data regenerated. ✅

Test 4 — Incremental `--range` (two invocations):
Result: `s_catalog_order/` contains files from both ranges (`_1_4`, `_2_4`, `_3_4`, `_4_4`). `delete/delete_1.dat` kept from first range, not overwritten. ✅

Test 5 — Base data generation (no `--update`):
Result: Original behavior unchanged, all 25 source tables generated correctly. ✅
Made with Cursor