Commit 4cf7a5f

feat: Add machine-readable digest of comparison
1 parent c85d340 commit 4cf7a5f

1 file changed

Lines changed: 107 additions & 0 deletions

lexical-sprouting-scroll.md

@@ -0,0 +1,107 @@
# Add `Digest` dataclass for machine-readable comparison output

## Context

Currently, `DataFrameComparison.summary()` returns a `Summary` object that renders rich-formatted console output. There is no way to get structured, machine-readable data from a comparison (e.g., for LLM consumption, CI pipelines, or programmatic analysis). This adds a `digest()` method returning a plain dataclass hierarchy that can be serialized via `dataclasses.asdict()` and `json.dumps()`.

## Dataclass Structure

All dataclasses live in a new file `diffly/digest.py`:

```python
# Postponed annotation evaluation lets Digest reference classes defined below it.
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Digest:
    equal: bool
    left_name: str
    right_name: str
    primary_key: list[str] | None
    schemas: DigestSchemas | None  # None when equal, or slim + schemas match
    rows: DigestRows | None  # None when equal, or slim + rows match
    columns: list[DigestColumn] | None  # None when equal, no PK, no joined rows, or slim + all match
    sample_rows_left_only: list[tuple[Any, ...]] | None  # None when no PK or sample_k==0
    sample_rows_right_only: list[tuple[Any, ...]] | None

@dataclass
class DigestSchemas:
    left_only: list[tuple[str, str]]  # (col_name, dtype_str)
    in_common: list[tuple[str, str, str]]  # (col_name, left_dtype_str, right_dtype_str)
    right_only: list[tuple[str, str]]

@dataclass
class DigestRows:
    n_left: int
    n_right: int
    n_left_only: int | None  # None when no primary key
    n_joined_equal: int | None
    n_joined_unequal: int | None
    n_right_only: int | None

@dataclass
class DigestColumn:
    name: str
    match_rate: float
    changes: list[DigestColumnChange] | None  # None when top_k==0 or column is hidden

@dataclass
class DigestColumnChange:
    old: Any
    new: Any
    count: int
    sample_pk: Any | None  # None when show_sample_primary_key_per_change=False
```

**Design notes:**

- `primary_key` is a top-level field so consumers know what the sample row tuples represent.
- `sample_rows_left_only` / `sample_rows_right_only` are lists of tuples whose elements follow the primary key column order.
- `in_common` uses 3-tuples `(name, left_dtype, right_dtype)` to capture dtype changes (when the dtypes match, `left_dtype == right_dtype`).
- `schemas` mirrors the `Summary` logic: it is `None` when the frames are equal, or when `slim=True` and the schemas match. Otherwise it is populated, even if the schemas are identical, so the caller can confirm that they match.

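The `schemas` rule and the `in_common` 3-tuples can be illustrated with a small, self-contained sketch. The helper names `schemas_match`, `dtype_changes`, and `schemas_field` are hypothetical and not part of the plan, and the definition of "schemas match" used here (no one-sided columns and identical dtypes) is an assumption about what `Summary` checks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DigestSchemas:
    left_only: list[tuple[str, str]]
    in_common: list[tuple[str, str, str]]
    right_only: list[tuple[str, str]]


def schemas_match(schemas: DigestSchemas) -> bool:
    # Assumed definition: no one-sided columns and every shared column keeps its dtype.
    return (
        not schemas.left_only
        and not schemas.right_only
        and all(left == right for _, left, right in schemas.in_common)
    )


def dtype_changes(schemas: DigestSchemas) -> list[tuple[str, str, str]]:
    # Shared columns whose dtype differs between the left and right frame.
    return [(name, left, right) for name, left, right in schemas.in_common if left != right]


def schemas_field(equal: bool, slim: bool, schemas: DigestSchemas) -> DigestSchemas | None:
    # Hypothetical helper: None when the frames are equal, or when slim and schemas match.
    if equal or (slim and schemas_match(schemas)):
        return None
    return schemas
```

With `slim=False`, a consumer still receives a populated `schemas` section for unequal frames and can call something like `dtype_changes` to list only the columns whose dtype differs.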
## Files to modify

### 1. New: [diffly/digest.py](diffly/digest.py)

- All dataclass definitions above
- `_to_python(value)` helper to convert Polars values (date, datetime, timedelta, Decimal) to JSON-safe types
- Builder function `_build_digest(comparison, **params) -> Digest` containing the logic to extract data from `DataFrameComparison`, mirroring the control flow of `Summary._print_to_console` / `_print_diff`
- `to_dict()` method on `Digest` via `dataclasses.asdict()`
- `to_json()` convenience method

### 2. [diffly/comparison.py](diffly/comparison.py) (~line 976)

- Add `digest()` method on `DataFrameComparison` with the same signature as `summary()`
- Lazy import `from .digest import Digest` (same pattern as summary)

### 3. [diffly/cli.py](diffly/cli.py)

- Add `--json` flag (bool, default False)
- When True, call `comparison.digest(...).to_json()` instead of `comparison.summary(...).format()`

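The flag's behavior could look roughly like the sketch below. It uses `argparse` purely for illustration, since the plan does not say which CLI framework `diffly/cli.py` uses. The positional arguments `left` and `right` and the functions `build_parser` and `render` are hypothetical, and the two string parameters stand in for the outputs of `summary(...).format()` and `digest(...).to_json()`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --json flag comes from the plan.
    parser = argparse.ArgumentParser(prog="diffly")
    parser.add_argument("left")
    parser.add_argument("right")
    parser.add_argument(
        "--json",
        action="store_true",
        default=False,
        help="Emit a machine-readable JSON digest instead of the rich summary.",
    )
    return parser


def render(args: argparse.Namespace, summary_text: str, digest_json: str) -> str:
    # Choose the output format based on the flag.
    return digest_json if args.json else summary_text
```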
### 4. [diffly/\_\_init\_\_.py](diffly/__init__.py)

- No changes needed -- `Digest` is accessed via `comparison.digest()`, not imported directly. Can revisit later.

### 5. **No changes to** [diffly/testing.py](diffly/testing.py)

- `testing.py` uses `summary()` for human-readable assertion error messages. `digest()` is a data output format and is not relevant to assertions.

### 6. New: [tests/test_digest.py](tests/test_digest.py)

- Equal frames -> `equal=True`, all sections `None`
- Schema differences (left-only, right-only, dtype mismatches in `in_common`)
- Row counts with and without primary key
- Column match rates with `show_perfect_column_matches=True/False`
- `top_k_column_changes` + `show_sample_primary_key_per_change`
- `sample_k_rows_only` for `sample_rows_left_only` / `sample_rows_right_only`
- `slim=True` suppresses matching sections
- `hidden_columns` hides column changes
- Validation errors (same as `Summary`: hidden PK columns, sample PK without top-k)
- JSON serialization roundtrip: `json.loads(digest.to_json())` succeeds and returns valid data

## Verification

```bash
pixi run pytest tests/test_digest.py -v
pixi run test
pixi run pre-commit-run
```
