New component: PanelDataset (optional preprocessing & IO) #25

@fedemolina

Relation

This proposal complements Issue #24 (“Input Matrix Validation & Data Diagnostics”) by providing a data container class (PanelDataset) that integrates validation, imputation, and scaling before calling CausalTensor estimators.

Goal

Provide an optional wrapper to prepare O and Z (and optional covariates X) safely and consistently before passing them to CausalTensor estimators. This preserves the current simple API (DID(O, Z), DC_PR_*, MC_NNM_*) while offering a standard place for validation, imputation, scaling, and panel alignment.

Why?

CausalTensor expects matrices O and Z as inputs and exposes estimator functions directly (no dataset class today). A thin PanelDataset aligns with this design and avoids coupling preprocessing to any specific estimator.

  • Docs: Usage requires O (N×T) and Z (N×T); API shows direct calls to DID, DC_PR_auto_rank, MC_NNM_with_cross_validation, etc. (no preprocessing class).

Scope

  • Construction from long form (unit, time, value) or wide matrices.
  • Validation: calls ct.validate_panel(...) (from the diagnostics module proposed in Issue #24) to compute missingness, scale heterogeneity, rank/κ(X), pretrend checks, etc.
  • Transformers (optional):
    • .impute(strategy=...) with choices like ffill/bfill/median/low_rank/kNN.
    • .scale(strategy=...) with standard/minmax/robust/log and ability to persist/restore params.
    • .balance(min_pre_period=...) to enforce pre-period sufficiency / drop under-informed units.
    • .align(units=..., times=...) to ensure consistent N×T layout across O, Z, and X.
  • Outputs:
    • .to_matrices() → (O, Z) (and optionally X) ready for estimators.
    • .report → attach the ValidationReport from validate_panel.
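
A minimal sketch of how the long-form construction and the .balance(min_pre_period=...) step could work with NumPy alone. The helper names long_to_wide and balance are hypothetical, and so are two policy choices: cells with no treatment record are filled with 0, and "pre-period" means observed periods before a unit's first treated period. None of this is existing CausalTensor API.

```python
import numpy as np


def long_to_wide(units, times, values, fill=np.nan):
    """Pivot long-form records into an N x T matrix (hypothetical helper).

    Rows follow the sorted unique units, columns the sorted unique times.
    Cells without a record keep `fill` (NaN by default).
    """
    unit_ids, row = np.unique(np.asarray(units), return_inverse=True)
    time_ids, col = np.unique(np.asarray(times), return_inverse=True)
    wide = np.full((unit_ids.size, time_ids.size), fill, dtype=float)
    wide[row, col] = np.asarray(values, dtype=float)
    return wide, unit_ids, time_ids


def balance(O, Z, min_pre_period):
    """Drop units with fewer than `min_pre_period` observed pre-treatment periods.

    Assumption: never-treated units count all of their observed periods.
    """
    n_units, n_times = Z.shape
    treated = Z > 0
    # Index of the first treated period; n_times if the unit is never treated.
    first_treated = np.where(treated.any(axis=1), treated.argmax(axis=1), n_times)
    is_pre = np.arange(n_times)[None, :] < first_treated[:, None]
    n_pre = (is_pre & ~np.isnan(O)).sum(axis=1)
    keep = n_pre >= min_pre_period
    return O[keep], Z[keep], keep


# Toy long-form panel: unit "b" has no record at time 2.
units = ["a", "a", "a", "b", "b", "c", "c", "c"]
times = [1, 2, 3, 1, 3, 1, 2, 3]
y = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
z = [0, 0, 1, 0, 0, 0, 1, 1]

O, unit_ids, time_ids = long_to_wide(units, times, y)
Z, _, _ = long_to_wide(units, times, z, fill=0.0)  # assumed: no record = untreated
O_bal, Z_bal, keep = balance(O, Z, min_pre_period=2)
```

Sharing unit_ids and time_ids between O and Z is what keeps both matrices on the same N×T layout, which is the job .align(...) is meant to do.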

Minimal API sketch

import causaltensor as ct
from causaltensor.cauest.DebiasConvex import DCPanelSolver


ds = ct.PanelDataset.from_long(
    data=df,              # long-form frame with columns: unit_id, time_id, y, [z], [covariates...]
    unit_col="unit_id",
    time_col="time_id",
    outcome_col="y",
    treat_col="z",        # optional; can also be built with design helpers
    covar_cols=None
)

# Validate (validate_panel and ValidationOptions are the diagnostics API proposed in Issue #24)
vrep = ct.validate_panel(ds.O, X=ds.X, W=ds.Z,
                         options=ct.ValidationOptions(
                             compute_condition_number=True,
                             imputation_advice=True,
                             stationarity_checks=True))

# Optional transforms (returning self or new ds)
ds = ds.impute(strategy="low_rank", rank=3).scale(strategy="standard")

# Estimation uses matrices, unchanged API
O, Z = ds.to_matrices()
solver = DCPanelSolver(Z=Z, O=O)
result = solver.fit() 
M, tau, std = result.baseline, result.tau, result.std

Design principles

  • Optional: Estimators remain usable with raw matrices.
  • Pure: Transformers either return a new dataset or update the existing one with a .fitted_ state; they are deterministic and persist their parameters.
  • Reproducible: ValidationReport and transform params are serializable to JSON for CI logs.
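
To make the parameter-persistence idea concrete, here is a rough sketch of a "standard" scaler whose fitted parameters round-trip through JSON. The function names and the choice of per-unit (row-wise) scaling are assumptions for illustration; the proposal does not fix either.

```python
import json

import numpy as np


def fit_standard_scaler(O):
    """Fit per-unit mean and std on observed (non-NaN) entries."""
    mean = np.nanmean(O, axis=1)
    std = np.nanstd(O, axis=1)
    std = np.where(std > 0, std, 1.0)  # constant rows: avoid division by zero
    # Plain lists so the params can be dumped to JSON as-is.
    return {"strategy": "standard", "mean": mean.tolist(), "std": std.tolist()}


def apply_scaler(O, params):
    """Scale each row with the stored mean and std; NaNs stay NaN."""
    mean = np.asarray(params["mean"])[:, None]
    std = np.asarray(params["std"])[:, None]
    return (O - mean) / std


def invert_scaler(O_scaled, params):
    """Map scaled values back to the original units."""
    mean = np.asarray(params["mean"])[:, None]
    std = np.asarray(params["std"])[:, None]
    return O_scaled * std + mean


O = np.array([[1.0, 2.0, 3.0],
              [10.0, np.nan, 30.0],
              [5.0, 5.0, 5.0]])

params = fit_standard_scaler(O)
payload = json.dumps(params)      # e.g. written to a CI log or artifact
restored = json.loads(payload)    # later: restore without refitting

O_scaled = apply_scaler(O, restored)
O_back = invert_scaler(O_scaled, restored)
```

Keeping the fitted state as a plain dict is one way to satisfy both "pure" and "reproducible": the transform is a function of the data and the params, and the params are trivially serializable.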

Open questions

  • Naming: PanelDataset vs PanelData.
  • Dependencies: core (NumPy/SciPy) vs optional extras for plotting/stat tests.
  • Imputation policies: include low-rank imputation internally or keep it as a suggestion only?
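
For the low-rank imputation question, one candidate is iterative truncated-SVD imputation (often called "hard impute"). The sketch below is an assumption about what strategy="low_rank" might do, not a committed design: missing entries are initialised with the global observed mean, then repeatedly replaced by a rank-r SVD reconstruction while observed entries are held fixed. It needs only NumPy, which bears on the dependency question above.

```python
import numpy as np


def low_rank_impute(O, rank, n_iter=500, tol=1e-8):
    """Fill NaNs in O by iterating a rank-`rank` SVD reconstruction."""
    O = np.asarray(O, dtype=float)
    missing = np.isnan(O)
    # Initial guess: global mean of the observed entries.
    filled = np.where(missing, np.nanmean(O), O)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Only missing cells are overwritten; observed cells keep their values.
        updated = np.where(missing, approx, O)
        change = np.linalg.norm(updated - filled)
        filled = updated
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break
    return filled


# Exact rank-1 panel with one hidden cell (true value 2 * 3 = 6).
M = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
O = M.copy()
O[1, 2] = np.nan

filled = low_rank_impute(O, rank=1)
```

On real panels the rank has to be chosen and convergence is not guaranteed to reach a good fill, which is one argument for keeping this as an opt-in strategy rather than a default.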
