-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Description
In simba/mces/mces_computation.py (lines 342-434), the compute_all_mces_results_exhaustive function has a bug where computed_pair_distances is overwritten for each processed chunk instead of being accumulated. This results in the returned MolecularPairsSet containing only the last processed chunk when a node processes multiple chunks.
Location
File: simba/mces/mces_computation.py
Function: compute_all_mces_results_exhaustive
Lines: 342-434
Current Behavior
computed_pair_distances = np.empty((0, 3)) # Line 342
for chunk_idx, chunk in enumerate(chunks): # Line 344
if ((chunk_idx % config.PREPROCESSING_NUM_NODES) == config.PREPROCESSING_CURRENT_NODE) or (config.PREPROCESSING_CURRENT_NODE is None):
# ... chunk processing ...
# Get results from async objects
computed_pair_distances = np.concatenate( # Line 421 - OVERWRITES!
[result.get() for result in results], axis=0
)
# save distances per chunk
np.save(arr=computed_pair_distances, file=filename) # Line 426
logger.info(f"Computed distances for {computed_pair_distances.shape[0]} pairs.") # Line 429 - WRONG COUNT!
molecular_pair_set = MolecularPairsSet(
spectra=all_spectra, pair_distances=computed_pair_distances # Line 433 - INCOMPLETE!
)Problem
- Overwriting instead of accumulating: Each iteration overwrites
computed_pair_distancesinstead of appending to it - Incorrect logging: The final logger message (line 429) reports only the size of the last chunk
- Incomplete return value: The returned
MolecularPairsSet(line 433) contains only the last chunk's data - Used return value: The return value is used in line 72 and passed to line 81, making this a real bug
When Does This Occur?
- When
PREPROCESSING_NUM_NODES > 1and multiple chunks map to the same node - When
PREPROCESSING_CURRENT_NODE is None(processes all chunks)
Impact
⚠️ Low impact if: Each node typically processes only one chunk and callers always load data viaLoadMCES.load_raw_data- 🔴 High impact if: The returned
MolecularPairsSetis used directly without reloading from files
Note: All chunk data is correctly saved to separate files (line 426), so data is not lost - but the return value and logging are incorrect.
Expected Behavior
Accumulate all processed chunks:
computed_pair_distances = np.empty((0, 3))
all_computed = [] # NEW: accumulator
for chunk_idx, chunk in enumerate(chunks):
if ((chunk_idx % config.PREPROCESSING_NUM_NODES) == config.PREPROCESSING_CURRENT_NODE) or (config.PREPROCESSING_CURRENT_NODE is None):
# ... chunk processing ...
# Get results from async objects
chunk_distances = np.concatenate(
[result.get() for result in results], axis=0
)
all_computed.append(chunk_distances) # NEW: accumulate
# save distances per chunk (unchanged)
np.save(arr=chunk_distances, file=filename)
# NEW: Concatenate all chunks after the loop
if all_computed:
computed_pair_distances = np.concatenate(all_computed, axis=0)
logger.info(f"Computed distances for {computed_pair_distances.shape[0]} pairs.") # NOW CORRECT!
molecular_pair_set = MolecularPairsSet(
spectra=all_spectra, pair_distances=computed_pair_distances # NOW COMPLETE!
)Metadata
Metadata
Assignees
Labels
No labels