New component: PanelDataset (optional preprocessing & IO)

## Relation
This proposal complements Issue #24 (“Input Matrix Validation & Data Diagnostics”) by providing a data container class (`PanelDataset`) that integrates validation, imputation, and scaling before the data reaches CausalTensor estimators.

## New component: PanelDataset (optional preprocessing & IO)

### Goal
Provide an optional wrapper to prepare `O` and `Z` (and optional covariates `X`) safely and consistently before passing them to CausalTensor estimators. This preserves the current simple API (`DID(O, Z)`, `DC_PR_*`, `MC_NNM_*`) while offering a standard place for validation, imputation, scaling, and panel alignment.

### Why?
CausalTensor expects matrices `O` and `Z` as inputs and exposes estimator functions directly (no dataset class today). A thin `PanelDataset` aligns with this design and avoids coupling preprocessing to any specific estimator.
- Docs: the Usage section requires `O` (N×T) and `Z` (N×T); the API reference shows direct calls to `DID`, `DC_PR_auto_rank`, `MC_NNM_with_cross_validation`, etc., with no preprocessing class.

### Scope
- **Construction** from long form (unit, time, value) or wide matrices.
- **Validation**: calls `ct.validate_panel(...)` (from the diagnostics module proposed in Issue #24) to compute missingness, scale heterogeneity, rank/κ(X), pre-trend checks, etc.
- **Transformers** (optional):
  - `.impute(strategy=...)` with choices like `ffill/bfill/median/low_rank/kNN`.
  - `.scale(strategy=...)` with `standard/minmax/robust/log` and ability to persist/restore params.
  - `.balance(min_pre_period=...)` to enforce pre-period sufficiency / drop under-informed units.
  - `.align(units=..., times=...)` to ensure consistent `N×T` layout across `O`, `Z`, and `X`.
- **Outputs**:
  - `.to_matrices()` → `(O, Z)` (and optionally `X`) ready for estimators.
  - `.report` → the attached `ValidationReport` returned by `validate_panel`.
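
To make the construction and alignment steps above concrete, here is a minimal NumPy-only sketch of how long-form records could be pivoted into aligned `N×T` matrices. The helper name `long_to_wide` and its signature are hypothetical and not part of CausalTensor; the sketch assumes missing outcome cells become `NaN` and missing treatment cells become `0`.

```python
import numpy as np

def long_to_wide(units, times, values, fill=np.nan):
    """Pivot long-form (unit, time, value) records into an N×T matrix.

    Hypothetical helper for illustration only. Rows follow the sorted unique
    unit ids and columns the sorted unique time ids, so O and Z built from
    the same (units, times) share one layout.
    """
    unit_ids, row = np.unique(np.asarray(units), return_inverse=True)
    time_ids, col = np.unique(np.asarray(times), return_inverse=True)
    wide = np.full((unit_ids.size, time_ids.size), fill, dtype=float)
    wide[row, col] = np.asarray(values, dtype=float)
    return wide, unit_ids, time_ids

# Unbalanced toy panel: unit "c" has no observation at time 2.
units = ["a", "a", "b", "b", "c"]
times = [1, 2, 1, 2, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [0, 1, 0, 0, 0]

O, unit_ids, time_ids = long_to_wide(units, times, y)   # missing outcome -> NaN
Z, _, _ = long_to_wide(units, times, z, fill=0.0)       # missing treatment -> 0
```

Keeping `unit_ids` and `time_ids` alongside the matrices is what would let an `.align(...)` step reindex `O`, `Z`, and `X` to a common layout.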

### Minimal API sketch

```python
import causaltensor as ct
from causaltensor.cauest.DebiasConvex import DCPanelSolver


# `df` is a long-form DataFrame with columns:
# unit_id, time_id, y, [z], [covariates...]
ds = ct.PanelDataset.from_long(
    data=df,
    unit_col="unit_id",
    time_col="time_id",
    outcome_col="y",
    treat_col="z",        # optional; can also be built with design helpers
    covar_cols=None
)

# Validate
vrep = ct.validate_panel(ds.O, X=ds.X, W=ds.Z, 
                         options=ct.ValidationOptions(
                             compute_condition_number=True, 
                             imputation_advice=True,
                             stationarity_checks=True))

# Optional transforms (returning self or new ds)
ds = ds.impute(strategy="low_rank", rank=3).scale(strategy="standard")

# Estimation uses matrices, unchanged API
O, Z = ds.to_matrices()
solver = DCPanelSolver(Z=Z, O=O)
result = solver.fit() 
M, tau, std = result.baseline, result.tau, result.std
```

### Design principles
- Optional: Estimators remain usable with raw matrices.
- Pure: Transformers return a new dataset (or record their parameters in a `.fitted_` state); they are deterministic and persist their parameters.
- Reproducible: `ValidationReport` and transform parameters are serializable to JSON for CI logs.
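
As an illustration of the "pure" and "reproducible" principles, the sketch below shows one way a standard scaler could keep its parameters in a plain dictionary that survives a JSON round trip. The function names are hypothetical, and the choice of a single global mean and standard deviation (rather than per-unit statistics) is an assumption made for brevity.

```python
import json
import numpy as np

def fit_standard_scaler(O):
    """Compute global scaling parameters, ignoring NaNs (hypothetical helper)."""
    mean = float(np.nanmean(O))
    std = float(np.nanstd(O))
    # Guard against a constant matrix, where std would be zero.
    return {"strategy": "standard", "mean": mean, "std": std if std > 0 else 1.0}

def apply_scale(O, params):
    """Return a new scaled matrix; the input is left untouched."""
    return (O - params["mean"]) / params["std"]

def invert_scale(O_scaled, params):
    """Map scaled values back to the original units."""
    return O_scaled * params["std"] + params["mean"]

O = np.array([[1.0, 2.0, np.nan], [3.0, 4.0, 5.0]])
params = fit_standard_scaler(O)

# Parameters are plain floats and strings, so they can be logged in CI.
restored_params = json.loads(json.dumps(params))

O_scaled = apply_scale(O, restored_params)
O_back = invert_scale(O_scaled, restored_params)
```

Because the parameters are restorable, estimates produced on the scaled outcomes could be mapped back to the original units after estimation.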

### Open questions
- Naming: PanelDataset vs PanelData.
- Dependencies: core (NumPy/SciPy) vs optional extras for plotting/stat tests.
- Imputation policies: include low-rank imputation internally or keep it as a suggestion only?
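
To help weigh the imputation question, the following is a rough sketch of what an internal low-rank imputer might look like: an iterative truncated-SVD fill that keeps observed entries fixed. This is one common approach, not a method the library currently provides; the name `low_rank_impute`, its parameters, and the stopping rule are all assumptions.

```python
import numpy as np

def low_rank_impute(O, rank, n_iter=100, tol=1e-8):
    """Fill NaNs in O with a rank-`rank` approximation (hypothetical helper).

    Observed entries are never modified; only missing cells are updated.
    """
    O = np.asarray(O, dtype=float)
    mask = ~np.isnan(O)
    # Start from the mean of the observed entries.
    filled = np.where(mask, O, np.nanmean(O))
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        updated = np.where(mask, O, approx)
        change = np.linalg.norm(updated - filled)
        filled = updated
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break
    return filled

# Rank-1 toy matrix with a single missing cell.
truth = np.outer([1.0, 2.0, 3.0], [1.0, 0.5, 2.0])
O = truth.copy()
O[1, 2] = np.nan

O_filled = low_rank_impute(O, rank=1)
```

The routine needs only NumPy, which is relevant to the dependency question above, but choosing `rank` well is a modeling decision that the dataset class cannot make on the user's behalf.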
