Skip to content

Bug: Incomplete computed_pair_distances when multiple chunks processed per node #3

@rukubrakov

Description

@rukubrakov

Description

In simba/mces/mces_computation.py (lines 342-434), the compute_all_mces_results_exhaustive function has a bug where computed_pair_distances is overwritten for each processed chunk instead of being accumulated. This results in the returned MolecularPairsSet containing only the last processed chunk when a node processes multiple chunks.

Location

File: simba/mces/mces_computation.py
Function: compute_all_mces_results_exhaustive
Lines: 342-434

Current Behavior

computed_pair_distances = np.empty((0, 3))  # Line 342

for chunk_idx, chunk in enumerate(chunks):  # Line 344
    if ((chunk_idx % config.PREPROCESSING_NUM_NODES) == config.PREPROCESSING_CURRENT_NODE) or (config.PREPROCESSING_CURRENT_NODE is None):
        # ... chunk processing ...
        
        # Get results from async objects
        computed_pair_distances = np.concatenate(  # Line 421 - OVERWRITES!
            [result.get() for result in results], axis=0
        )
        # save distances per chunk
        np.save(arr=computed_pair_distances, file=filename)  # Line 426

logger.info(f"Computed distances for {computed_pair_distances.shape[0]} pairs.")  # Line 429 - WRONG COUNT!

molecular_pair_set = MolecularPairsSet(
    spectra=all_spectra, pair_distances=computed_pair_distances  # Line 433 - INCOMPLETE!
)

Problem

  1. Overwriting instead of accumulating: Each iteration overwrites computed_pair_distances instead of appending to it
  2. Incorrect logging: The final logger message (line 429) reports only the size of the last chunk
  3. Incomplete return value: The returned MolecularPairsSet (line 433) contains only the last chunk's data
  4. Used return value: The return value is used in line 72 and passed to line 81, making this a real bug

When Does This Occur?

  • When PREPROCESSING_NUM_NODES > 1 and multiple chunks map to the same node
  • When PREPROCESSING_CURRENT_NODE is None (processes all chunks)

Impact

  • ⚠️ Low impact if: Each node typically processes only one chunk and callers always load data via LoadMCES.load_raw_data
  • 🔴 High impact if: The returned MolecularPairsSet is used directly without reloading from files

Note: All chunk data is correctly saved to separate files (line 426), so data is not lost - but the return value and logging are incorrect.

Expected Behavior

Accumulate all processed chunks:

computed_pair_distances = np.empty((0, 3))
all_computed = []  # NEW: accumulator

for chunk_idx, chunk in enumerate(chunks):
    if ((chunk_idx % config.PREPROCESSING_NUM_NODES) == config.PREPROCESSING_CURRENT_NODE) or (config.PREPROCESSING_CURRENT_NODE is None):
        # ... chunk processing ...
        
        # Get results from async objects
        chunk_distances = np.concatenate(
            [result.get() for result in results], axis=0
        )
        all_computed.append(chunk_distances)  # NEW: accumulate
        
        # save distances per chunk (unchanged)
        np.save(arr=chunk_distances, file=filename)

# NEW: Concatenate all chunks after the loop
if all_computed:
    computed_pair_distances = np.concatenate(all_computed, axis=0)

logger.info(f"Computed distances for {computed_pair_distances.shape[0]} pairs.")  # NOW CORRECT!

molecular_pair_set = MolecularPairsSet(
    spectra=all_spectra, pair_distances=computed_pair_distances  # NOW COMPLETE!
)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions