[REFACTOR] Rename and refactor DataHandler #201

@jeipollack

Description of Refactoring/Improvement
Split DataHandler into two focused classes: SimulationDataLoader (for loading .npy files) and TensorFlowDatasetConverter (for SED processing and TensorFlow conversion). Keep DataHandler as a deprecated wrapper for backward compatibility.

Goals and Objectives

  • Separate file loading logic from data conversion logic
  • Create a reusable converter that works with any PSF dataset format (not just simulations)
  • Enable the upcoming adapter pattern for supporting both simulation and real observational data
  • Maintain backward compatibility with existing code

Current Code Behaviour
DataHandler currently mixes three responsibilities:

  • Loading simulation .npy files from disk
  • Converting NumPy arrays to TensorFlow tensors
  • Processing SEDs with simPSF

This tight coupling makes it difficult to:

  • Use PSF dataset dataclasses (like SHEPSFDataset) without converting to dicts
  • Reuse conversion logic for different data sources
  • Test loading and conversion independently

Proposed Changes
Create simulation_data_loader.py:

class SimulationDataLoader:
    """Loads .npy simulation files and validates structure."""
    def load(self):
        """Load from disk, validate, and return dataset dict."""

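A minimal sketch of how such a loader might look, assuming each simulation file is a dictionary written with np.save. The constructor arguments (data_dir, file_name), the required key names, and the exception types are illustrative assumptions, not the project's actual API.

```python
from pathlib import Path

import numpy as np


class SimulationDataLoader:
    """Loads a .npy simulation file and validates its structure (sketch)."""

    # Hypothetical required keys; the real simulation schema may differ.
    REQUIRED_KEYS = ("positions", "stars")

    def __init__(self, data_dir, file_name):
        self.path = Path(data_dir) / file_name

    def load(self):
        """Load from disk, validate, and return the dataset dict."""
        if not self.path.is_file():
            raise FileNotFoundError(f"Simulation file not found: {self.path}")
        # A dict written with np.save is stored as a 0-d object array;
        # allow_pickle=True is needed to read it and .item() unwraps it.
        dataset = np.load(self.path, allow_pickle=True).item()
        if not isinstance(dataset, dict):
            raise ValueError(
                f"Expected a dict in {self.path}, got {type(dataset).__name__}"
            )
        missing = [key for key in self.REQUIRED_KEYS if key not in dataset]
        if missing:
            raise ValueError(f"Missing keys in {self.path}: {missing}")
        return dataset
```

Keeping validation inside load() means a missing or malformed file fails at load time with a specific error, which is what the error-handling tests in the plan below would exercise.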
Create tensorflow_converter.py:

class TensorFlowDatasetConverter:
    """Converts PSF datasets to TensorFlow tensors."""
    def convert_psf_dataset(self, dataset, target_field='images'):
        """Convert PSF dataclass to TF dict."""
    
    def convert_dict(self, dataset_dict, dataset_type='train'):
        """Convert legacy dict to TF dict."""
    
    def _process_seds(self, sed_data):
        """Process SEDs using simPSF."""

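One possible shape for the converter, sketched under several assumptions: the tensor conversion and the SED step are injected as callables (so tests can replace them with stubs), the SED field is named "SEDs", and target_field is only checked for presence. None of these details are confirmed by the proposal; the real simPSF-based processing is represented by the hypothetical sed_processor argument.

```python
from dataclasses import fields, is_dataclass


def _tf_float32(array):
    """Default conversion; TensorFlow is imported lazily on first use."""
    import tensorflow as tf

    return tf.convert_to_tensor(array, dtype=tf.float32)


class TensorFlowDatasetConverter:
    """Converts PSF datasets to TensorFlow tensors (sketch)."""

    def __init__(self, to_tensor=_tf_float32, sed_processor=None):
        # Both callables are injectable so tests can substitute stubs.
        self._to_tensor = to_tensor
        self._sed_processor = sed_processor  # stand-in for the simPSF step

    def convert_psf_dataset(self, dataset, target_field="images"):
        """Convert a PSF dataclass instance to a dict of tensors."""
        if not is_dataclass(dataset) or isinstance(dataset, type):
            raise TypeError("dataset must be a dataclass instance")
        raw = {f.name: getattr(dataset, f.name) for f in fields(dataset)}
        if target_field not in raw:
            raise KeyError(f"dataset has no field {target_field!r}")
        return self._convert(raw)

    def convert_dict(self, dataset_dict, dataset_type="train"):
        """Convert a legacy dict to a dict of tensors."""
        # dataset_type is kept for signature parity; this sketch ignores it.
        return self._convert(dict(dataset_dict))

    def _convert(self, raw):
        out = {}
        for name, value in raw.items():
            if value is None:
                continue  # skip optional fields that were not provided
            if name == "SEDs" and self._sed_processor is not None:
                value = self._process_seds(value)
            out[name] = self._to_tensor(value)
        return out

    def _process_seds(self, sed_data):
        """Delegate SED processing to the injected callable."""
        return self._sed_processor(sed_data)
```

Because both entry points funnel into the same _convert helper, dataclass-based and dict-based datasets go through identical conversion logic, which is the reuse the proposal is aiming for.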
Update data_handler.py:

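import warnings
from simulation_data_loader import SimulationDataLoader  # path depends on package layout

class DataHandler:
    """DEPRECATED: Thin wrapper delegating to SimulationDataLoader."""
    def __init__(self, *args, **kwargs):
        # stacklevel=2 attributes the warning to the caller, not to this line
        warnings.warn("DataHandler is deprecated...", DeprecationWarning, stacklevel=2)
        self._loader = SimulationDataLoader(*args, **kwargs)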

Expected Benefits

  • Reusability: TensorFlowDatasetConverter is designed to work with any PSF dataset (e.g. Euclid, Roman, JWST)
  • Testability: Can test loading and conversion independently
  • Maintainability: Single Responsibility Principle - each class has one clear purpose
  • Extensibility: Enables adapter pattern for unified training interface
  • Backward compatibility: Existing code continues working with deprecation warning

Dependencies

  • No breaking changes - existing code using DataHandler continues to work
  • Enables follow-up PRs for adapter pattern implementation
  • Future external API (TrainWaveDiffPSF) will use TensorFlowDatasetConverter directly

Testing Plan

  1. Unit tests for SimulationDataLoader:

    • Test loading .npy files
    • Test validation of simulation-specific structure
    • Test error handling for missing/invalid files
  2. Unit tests for TensorFlowDatasetConverter:

    • Test convert_psf_dataset() with mock PSF dataclass
    • Test convert_dict() with simulation dict
    • Test SED processing pipeline
    • Verify correct tensor shapes and dtypes
  3. Integration tests:

    • Test that the deprecated DataHandler produces results identical to the new classes
    • Test with real simulation data end-to-end
  4. Regression tests:

    • Ensure existing training scripts work unchanged
    • Verify deprecation warning fires correctly
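
The deprecation and equivalence checks in items 3 and 4 could be sketched as follows. Both classes here are simplified stand-ins so the example runs on its own: the dataset argument to SimulationDataLoader and the load() method on DataHandler are assumptions made for illustration, not the planned signatures.

```python
import unittest
import warnings


class SimulationDataLoader:
    """Stand-in for the proposed loader; returns a fixed dataset."""

    def __init__(self, dataset):
        self._dataset = dataset

    def load(self):
        return dict(self._dataset)


class DataHandler:
    """Stand-in for the deprecated wrapper."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "DataHandler is deprecated...", DeprecationWarning, stacklevel=2
        )
        self._loader = SimulationDataLoader(*args, **kwargs)

    def load(self):
        return self._loader.load()


class DeprecatedWrapperTest(unittest.TestCase):
    DATASET = {"positions": [[0.0, 1.0]], "stars": [[1.0, 2.0]]}

    def test_warning_fires(self):
        # assertWarns fails the test if no DeprecationWarning is raised.
        with self.assertWarns(DeprecationWarning):
            DataHandler(self.DATASET)

    def test_results_match_new_loader(self):
        # Silence the expected warning, then compare old and new paths.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            legacy = DataHandler(self.DATASET).load()
        self.assertEqual(legacy, SimulationDataLoader(self.DATASET).load())
```

Saved as a test module, this can be run with python -m unittest. In the real test suite the stand-ins would be replaced by imports of the actual classes.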

Additional Context
This refactoring is a prerequisite for supporting Euclid SHEPSFDataset (real observational data from Euclid SHE) alongside existing simulation workflows. The converter's generic entry point (convert_psf_dataset()) is intended to support future missions (Roman, JWST) without changes to the converter itself.
Related: Upcoming PRs will introduce a TrainingDataAdapter pattern that builds on these refactored components.

Impact Assessment
Low risk, high value foundation work:

  • No breaking changes - 100% backward compatible via deprecated wrapper
  • Enables future work - Required for adapter pattern and real data support
  • Small scope - Code reorganization without algorithmic changes
  • Well-isolated - Changes contained to data loading/conversion layer
  • Migration path - Deprecation warning guides users to new classes

Estimated files changed: 3 new, 1 modified
Estimated LOC: ~400 (mostly moved, not new logic)

Next Steps

  1. Implement SimulationDataLoader and TensorFlowDatasetConverter
  2. Add deprecation wrapper to DataHandler
  3. Write comprehensive unit and integration tests
  4. Update internal documentation with migration examples
  5. Merge this PR before opening follow-up adapter pattern PRs:

PR #199 : Add TrainingDataAdapter for simulation data
PR (TBD) : Add PSFDatasetAdapter for real data
PR #200 : Migrate training code to use adapters
