Relation
This proposal complements Issue #24 (“Input Matrix Validation & Data Diagnostics”) by providing a data container class (PanelDataset) that integrates validation, imputation, and scaling before calling CausalTensor estimators.
New component: PanelDataset (optional preprocessing & IO)
Goal
Provide an optional wrapper to prepare O and Z (and optional covariates X) safely and consistently before passing them to CausalTensor estimators. This preserves the current simple API (DID(O, Z), DC_PR_*, MC_NNM_*) while offering a standard place for validation, imputation, scaling, and panel alignment.
Why?
CausalTensor expects matrices O and Z as inputs and exposes estimator functions directly (no dataset class today). A thin PanelDataset aligns with this design and avoids coupling preprocessing to any specific estimator.
- Docs: usage requires O (N×T) and Z (N×T); the API shows direct calls to DID, DC_PR_auto_rank, MC_NNM_with_cross_validation, etc. (no preprocessing class).
Scope
- Construction from long form (unit, time, value) or wide matrices.
- Validation: calls ct.validate_panel(...) (the proposed diagnostics module) to compute missingness, scale heterogeneity, rank/κ(X), pretrend checks, etc.
- Transformers (optional):
  - .impute(strategy=...) with choices like ffill/bfill/median/low_rank/kNN.
  - .scale(strategy=...) with standard/minmax/robust/log and the ability to persist/restore params.
  - .balance(min_pre_period=...) to enforce pre-period sufficiency / drop under-informed units.
  - .align(units=..., times=...) to ensure a consistent N×T layout across O, Z, and X.
- Outputs:
  - .to_matrices() → (O, Z) (and optionally X) ready for estimators.
  - .report → attach the ValidationReport from validate_panel.
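To make the long-form construction step concrete, here is a minimal sketch of the pivot that a from_long constructor might perform internally. It uses NumPy only; the helper name long_to_wide is hypothetical and not part of CausalTensor. It assumes each (unit, time) pair appears at most once, and that cells with no treatment record are untreated (0), which is a policy choice rather than a given.

```python
import numpy as np

def long_to_wide(units, times, values, fill=np.nan):
    """Pivot long-form records into an N x T matrix plus its axis labels.

    Hypothetical helper; assumes each (unit, time) pair appears at most once.
    """
    units = np.asarray(units)
    times = np.asarray(times)
    # np.unique returns the sorted labels and, for each record, its index
    # into those labels, which serves as the row/column position.
    unit_labels, row = np.unique(units, return_inverse=True)
    time_labels, col = np.unique(times, return_inverse=True)
    wide = np.full((unit_labels.size, time_labels.size), fill, dtype=float)
    wide[row, col] = np.asarray(values, dtype=float)
    return wide, unit_labels, time_labels

# Toy panel: unit "b" has no record at time 2, so that cell stays NaN in O.
units = ["a", "a", "a", "b", "b"]
times = [1, 2, 3, 1, 3]
y = [1.0, 2.0, 3.0, 4.0, 6.0]
z = [0, 0, 1, 0, 0]

O, unit_labels, time_labels = long_to_wide(units, times, y)
# Assumption: a missing treatment record means "untreated".
Z, _, _ = long_to_wide(units, times, z, fill=0.0)
```

Because O and Z are built from the same unit and time labels, they share one N×T layout, which is the property .align(...) is meant to guarantee.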
Minimal API sketch
import causaltensor as ct
from causaltensor.cauest.DebiasConvex import DCPanelSolver
ds = ct.PanelDataset.from_long(
    data=df,  # columns: unit_id, time_id, outcome, [treatment], [covariates...]
    unit_col="unit_id",
    time_col="time_id",
    outcome_col="y",
    treat_col="z",  # optional; can also be built with design helpers
    covar_cols=None,
)
# Validate
vrep = ct.validate_panel(
    ds.O, X=ds.X, W=ds.Z,
    options=ct.ValidationOptions(
        compute_condition_number=True,
        imputation_advice=True,
        stationarity_checks=True,
    ),
)
# Optional transforms (returning self or new ds)
ds = ds.impute(strategy="low_rank", rank=3).scale(strategy="standard")
# Estimation uses matrices, unchanged API
O, Z = ds.to_matrices()
solver = DCPanelSolver(Z=Z, O=O)
result = solver.fit()
M, tau, std = result.baseline, result.tau, result.std
Design principles
- Optional: Estimators remain usable with raw matrices.
- Pure: transformers are deterministic and either return a new dataset or record their fitted parameters in a .fitted_ state, so parameters can be persisted and restored.
- Reproducible: ValidationReport and transform params are serializable to JSON for CI logs.
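As an illustration of the "pure" and "reproducible" principles, the sketch below shows one way a standard-scaling transform could keep its fitted parameters in a JSON-serializable object and restore them later. The names StandardScaleParams, fit_standard, apply_scale, and invert_scale are hypothetical, and scaling with a single global mean and standard deviation is an assumption (per-unit scaling would be an equally valid choice).

```python
import json
from dataclasses import dataclass, asdict

import numpy as np

@dataclass(frozen=True)
class StandardScaleParams:
    """Hypothetical container for fitted scaling parameters."""
    mean: float
    std: float

def fit_standard(O):
    """Compute a global mean/std over the observed (non-NaN) entries of O."""
    std = float(np.nanstd(O))
    # Guard against a constant panel, where std would be zero.
    return StandardScaleParams(mean=float(np.nanmean(O)), std=std if std > 0 else 1.0)

def apply_scale(O, params):
    """Return a new scaled array; the input is left untouched."""
    return (O - params.mean) / params.std

def invert_scale(O_scaled, params):
    """Map scaled values back to the original units."""
    return O_scaled * params.std + params.mean

O = np.array([[1.0, 2.0, np.nan], [4.0, 5.0, 6.0]])
params = fit_standard(O)
O_scaled = apply_scale(O, params)

# Persist the parameters (e.g. in a CI log) and restore them later.
payload = json.dumps(asdict(params))
restored = StandardScaleParams(**json.loads(payload))
O_back = invert_scale(O_scaled, restored)
```

Under an additive model, an effect estimated on the scaled outcome is in scaled units, so it would need to be multiplied by the stored std to be read in the original units; this is one reason to persist the parameters alongside the results.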
Open questions
- Naming: PanelDataset vs PanelData.
- Dependencies: core (NumPy/SciPy) vs optional extras for plotting/stat tests.
- Imputation policies: include low-rank imputation internally or keep it as a suggestion only?
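To help size the low-rank imputation question, here is a rough sketch of what an internal implementation could look like: an iterative truncated-SVD fill (in the style of hard-impute) using NumPy only. The function name low_rank_impute and its parameters are hypothetical, not an existing CausalTensor API, and the sketch assumes every column has at least one observed entry.

```python
import numpy as np

def low_rank_impute(O, rank, n_iter=100, tol=1e-8):
    """Fill NaN cells by iterating a rank-`rank` truncated SVD.

    Hypothetical sketch; observed entries are never modified.
    Assumes every column has at least one observed entry.
    """
    O = np.asarray(O, dtype=float)
    missing = np.isnan(O)
    # Initialize missing cells with the column means of observed entries.
    col_means = np.nanmean(O, axis=0, keepdims=True)
    filled = np.where(missing, col_means, O)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed values; replace only the missing cells.
        updated = np.where(missing, approx, O)
        change = np.linalg.norm(updated - filled)
        filled = updated
        if change < tol:
            break
    return filled

# Rank-1 toy panel with one missing cell whose true value is 1.0.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
O = truth.copy()
O[0, 0] = np.nan
O_filled = low_rank_impute(O, rank=1)
```

Even this small version needs a rank choice and a stopping rule, and it overlaps with what the matrix-completion estimators already do, which may argue for keeping it as advice in the ValidationReport rather than a built-in transform.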