|
| 1 | +# Add `Digest` dataclass for machine-readable comparison output |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +Currently, `DataFrameComparison.summary()` returns a `Summary` object that renders rich-formatted console output. There's no way to get structured, machine-readable data from a comparison (e.g., for LLM consumption, CI pipelines, or programmatic analysis). This adds a `digest()` method returning a plain dataclass hierarchy that can be serialized via `dataclasses.asdict()` and `json.dumps()`. |
| 6 | + |
| 7 | +## Dataclass Structure |
| 8 | + |
| 9 | +All dataclasses live in a new file, `diffly/digest.py`:
| 10 | + |
| 11 | +```python |
| 12 | +@dataclass
| 13 | +class Digest:
| 14 | +    equal: bool
| 15 | +    left_name: str
| 16 | +    right_name: str
| 17 | +    primary_key: list[str] | None
| 18 | +    schemas: DigestSchemas | None  # None when equal, or slim + schemas match
| 19 | +    rows: DigestRows | None  # None when equal, or slim + rows match
| 20 | +    columns: list[DigestColumn] | None  # None when equal, no PK, no joined rows, or slim + all match
| 21 | +    sample_rows_left_only: list[tuple[Any, ...]] | None  # None when no PK or sample_k_rows_only == 0
| 22 | +    sample_rows_right_only: list[tuple[Any, ...]] | None
| 23 | +
| 24 | +@dataclass
| 25 | +class DigestSchemas:
| 26 | +    left_only: list[tuple[str, str]]  # (col_name, dtype_str)
| 27 | +    in_common: list[tuple[str, str, str]]  # (col_name, left_dtype_str, right_dtype_str)
| 28 | +    right_only: list[tuple[str, str]]
| 29 | +
| 30 | +@dataclass
| 31 | +class DigestRows:
| 32 | +    n_left: int
| 33 | +    n_right: int
| 34 | +    n_left_only: int | None  # None when no primary key
| 35 | +    n_joined_equal: int | None
| 36 | +    n_joined_unequal: int | None
| 37 | +    n_right_only: int | None
| 38 | +
| 39 | +@dataclass
| 40 | +class DigestColumn:
| 41 | +    name: str
| 42 | +    match_rate: float
| 43 | +    changes: list[DigestColumnChange] | None  # None when top_k_column_changes == 0 or column is hidden
| 44 | +
| 45 | +@dataclass
| 46 | +class DigestColumnChange:
| 47 | +    old: Any
| 48 | +    new: Any
| 49 | +    count: int
| 50 | +    sample_pk: Any | None  # None when show_sample_primary_key_per_change=False
| 51 | +``` |
| 52 | + |
| 53 | +**Design notes:** |
| 54 | + |
| 55 | +- `primary_key` is a top-level field so consumers know what the sample row tuples represent. |
| 56 | +- `sample_rows_left_only` / `sample_rows_right_only` are lists of tuples whose elements follow the column order of `primary_key`.
| 57 | +- `in_common` uses 3-tuples `(name, left_dtype, right_dtype)` to capture dtype changes (when they match, `left_dtype == right_dtype`). |
| 58 | +- `schemas` mirrors the `Summary` logic: it is `None` when the frames are equal, or when `slim=True` and the schemas match. In every other case it is populated, even if the schemas match, so the caller can confirm they are identical.
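To make the design notes above concrete, here is a minimal sketch of how such a hierarchy might serialize. It uses trimmed stand-ins for `Digest` and `DigestSchemas` (only a few fields, with made-up column names and dtype strings), so it illustrates the mechanism rather than the planned implementation. The point to notice is that `dataclasses.asdict()` keeps tuples as tuples, while a JSON round trip turns them into lists and `None` into `null`.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class DigestSchemas:
    left_only: list[tuple[str, str]]
    in_common: list[tuple[str, str, str]]
    right_only: list[tuple[str, str]]


@dataclass
class Digest:
    """Trimmed stand-in for illustration; not the full class from the plan."""

    equal: bool
    primary_key: list[str] | None
    schemas: DigestSchemas | None
    columns: list[Any] | None
    sample_rows_left_only: list[tuple[Any, ...]] | None


# Hypothetical values, chosen only to exercise tuples and None.
digest = Digest(
    equal=False,
    primary_key=["id"],
    schemas=DigestSchemas(
        left_only=[("a", "Int64")],
        in_common=[("id", "Int64", "Int64"), ("x", "Float64", "String")],
        right_only=[],
    ),
    columns=None,
    sample_rows_left_only=[(1,), (2,)],
)

as_dict = asdict(digest)  # nested dataclasses become dicts, tuples stay tuples
round_tripped = json.loads(json.dumps(as_dict))  # tuples come back as lists
```

Consumers that compare a parsed JSON payload against `to_dict()` output would therefore need to normalize tuples to lists first.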
| 59 | + |
| 60 | +## Files to modify |
| 61 | + |
| 62 | +### 1. New: [diffly/digest.py](diffly/digest.py) |
| 63 | + |
| 64 | +- All dataclass definitions above |
| 65 | +- `_to_python(value)` helper to convert Polars values (date, datetime, timedelta, Decimal) to JSON-safe types |
| 66 | +- Builder function `_build_digest(comparison, **params) -> Digest` containing the logic to extract data from `DataFrameComparison`, mirroring the control flow of `Summary._print_to_console` / `_print_diff` |
| 67 | +- `to_dict()` method on `Digest` via `dataclasses.asdict()` |
| 68 | +- `to_json()` convenience method |
| 69 | + |
| 70 | +### 2. [diffly/comparison.py](diffly/comparison.py) (~line 976) |
| 71 | + |
| 72 | +- Add `digest()` method on `DataFrameComparison` with same signature as `summary()` |
| 73 | +- Lazy import `from .digest import Digest` (same pattern as summary) |
| 74 | + |
| 75 | +### 3. [diffly/cli.py](diffly/cli.py) |
| 76 | + |
| 77 | +- Add `--json` flag (bool, default False) |
| 78 | +- When True, call `comparison.digest(...).to_json()` instead of `comparison.summary(...).format()` |
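The flag and the branch it controls could look roughly like the sketch below. It assumes `argparse` purely for illustration; the framework actually used in `diffly/cli.py` may differ, and `build_parser` and `render` are hypothetical names. The two callables stand in for `comparison.summary(...).format()` and `comparison.digest(...).to_json()`.

```python
from __future__ import annotations

import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --json flag comes from the plan.
    parser = argparse.ArgumentParser(prog="diffly")
    parser.add_argument(
        "--json",
        action="store_true",
        default=False,
        help="emit the machine-readable digest instead of the rich summary",
    )
    return parser


def render(
    args: argparse.Namespace,
    as_text: Callable[[], str],
    as_json: Callable[[], str],
) -> str:
    """Pick the output format; only the selected callable is evaluated."""
    return as_json() if args.json else as_text()


# Stand-in callables replace the real summary/digest calls.
args = build_parser().parse_args(["--json"])
output = render(args, lambda: "rich summary text", lambda: '{"equal": true}')
```

Passing callables keeps the unused representation from being computed, which matters if building the summary or the digest is expensive.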
| 79 | + |
| 80 | +### 4. [diffly/__init__.py](diffly/__init__.py)
| 81 | + |
| 82 | +- No changes needed -- `Digest` is accessed via `comparison.digest()`, not imported directly. Can revisit later. |
| 83 | + |
| 84 | +### 5. **No changes to** [diffly/testing.py](diffly/testing.py) |
| 85 | + |
| 86 | +- `testing.py` uses `summary()` for human-readable assertion error messages. `digest()` is a data output format, not relevant to assertions. |
| 87 | + |
| 88 | +### 6. New: [tests/test_digest.py](tests/test_digest.py) |
| 89 | + |
| 90 | +- Equal frames -> `equal=True`, all sections `None` |
| 91 | +- Schema differences (left-only, right-only, dtype mismatches in in_common) |
| 92 | +- Row counts with and without primary key |
| 93 | +- Column match rates with `show_perfect_column_matches=True/False` |
| 94 | +- `top_k_column_changes` + `show_sample_primary_key_per_change` |
| 95 | +- `sample_k_rows_only` for `sample_rows_left_only` / `sample_rows_right_only` |
| 96 | +- `slim=True` suppresses matching sections |
| 97 | +- `hidden_columns` hides column changes |
| 98 | +- Validation errors (same as Summary: hidden PK columns, sample PK without top-k) |
| 99 | +- JSON serialization roundtrip: `json.loads(digest.to_json())` is valid |
| 100 | + |
| 101 | +## Verification |
| 102 | + |
| 103 | +```bash |
| 104 | +pixi run pytest tests/test_digest.py -v |
| 105 | +pixi run test |
| 106 | +pixi run pre-commit-run |
| 107 | +``` |