@@ -8,7 +8,7 @@ The package follows Dagster best practices with utilities organized in subpackag
 
 - `src/prepare_annotations/`: Main package
   - `definitions.py`: **Main Dagster definitions** (assets, jobs, resources)
-  - `pipelines.py`: **Standalone API** (PreparationPipelines class)
+  - `pipelines.py`: **Standalone API** for ClinVar, dbSNP, gnomAD (non-Dagster sources)
   - `cli.py`: Main Typer CLI entrypoint
 
   - `core/`: Core utilities
@@ -88,7 +88,7 @@ These pipelines are Dagster-first. Follow these rules to avoid the issues we alr
 
 - Do not use `dagster job execute` or other deprecated CLI for orchestration.
 - Prefer Python API: `materialize()` for assets, `execute_job()` for non-partitioned jobs.
-- If CLI is needed, use Dagster dev server only (`uv run dagster dev -m prepare_annotations`).
+- If CLI is needed, use Dagster dev server only (`uv run dagster dev -m prepare_annotations.definitions`).
 
 ### 2) Dynamic partitions must be explicit
 
@@ -140,14 +140,15 @@ Do not upload temp files. Filter out:
 ### Primary Dagster Pipelines (Recommended)
 
 - `uv run dagster-ensembl`: Run the full Ensembl pipeline (download, convert, upload).
-- `uv run dagster-ensembl ui`: Launch Dagster UI for Ensembl pipelines.
-- `uv run dagster-ui`: General Dagster development server entrypoint.
+- `uv run prepare longevitymap`: Run the LongevityMap pipeline (convert, join with Ensembl, upload).
+- `uv run dagster-ui`: Launch Dagster UI for monitoring and lineage visualization.
 
 ### OakVar Module Management
 
 - `uv run modules data --repo dna-seq/just_longevitymap`: Download module data files.
 - `uv run modules clone --repo dna-seq/just_longevitymap`: Clone full module repository.
-- `uv run modules convert-longevitymap`: Convert LongevityMap to unified annotation schema.
+- `uv run prepare longevitymap`: Run full Dagster pipeline (convert + Ensembl join + upload).
+- `uv run prepare longevitymap --convert-only`: Convert only (no Ensembl join, no upload).
 
 ### Unified Annotation Schema
 
@@ -192,6 +193,67 @@ Datasets are typically uploaded to the `just-dna-seq` organization on Hugging Fa
 - **Auto-download**: Tests automatically download required data from GitHub
 - **Validation**: Comprehensive checks ensuring data integrity during conversion
 
+### Test Generation Guidelines (Universal)
+
+- **Real data + ground truth**: Use actual source data, auto-download if needed, and compute expected values at runtime.
+- **Deterministic coverage**: Use fixed seeds or explicit filters; include representative and edge cases.
+- **Meaningful assertions**: Prefer relationships and aggregates over existence-only checks.
+
+#### What to Validate
+
+- **Counts & aggregates**: Row counts, sums/min/max/means, distinct counts, and distributions.
+- **Joins**: Pre/post counts, key coverage, cardinality expectations, nulls introduced by outer joins, and a few spot-checks.
+- **Transformations**: Round-trip survival, subset/superset semantics, value mapping, key preservation.
+- **Data quality**: Format/range checks, outliers, malformed entries, duplicates, referential integrity.
+
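The join checks above can be sketched in plain Python; the record shapes (`rsid`, `weight`, `pmid`) are hypothetical stand-ins, not the project's actual schemas:

```python
# Hypothetical left-join validation over small illustrative "tables".
variants = [
    {"rsid": "rs1", "weight": 0.5},
    {"rsid": "rs2", "weight": -1.0},
    {"rsid": "rs3", "weight": 1.0},
]
studies = {"rs1": "PMID:111", "rs2": "PMID:222"}  # rs3 has no study

# Left join: every variant row survives; unmatched keys get None.
joined = [{**v, "pmid": studies.get(v["rsid"])} for v in variants]

# Pre/post counts: a left join must neither drop nor duplicate rows.
assert len(joined) == len(variants)

# Key coverage: matched rows equal the key intersection.
matched = {row["rsid"] for row in joined if row["pmid"] is not None}
assert matched == set(studies) & {v["rsid"] for v in variants}

# Nulls introduced by the outer join are exactly the unmatched keys.
assert {row["rsid"] for row in joined if row["pmid"] is None} == {"rs3"}
```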
+#### Avoiding LLM "Reward Hacking" in Tests
+
+- **Runtime ground truth**: Query source data at test time instead of hardcoding expectations.
+- **Seeded sampling**: Validate random records with a fixed seed, not just known examples.
+- **Negative & boundary tests**: Ensure invalid inputs fail; probe min/max, empty, unicode.
+- **Derived assertions**: Test relationships (e.g., input vs. output counts), not magic numbers.
+- **Allow expected failures**: Use `pytest.mark.xfail` for known data quality issues with a clear reason.
+
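A minimal sketch of runtime ground truth plus seeded sampling; the dict-based "tables" are stand-ins for real SQLite/parquet reads:

```python
import random

# Stand-ins for data read at test time: source (e.g. SQLite) vs. converted output.
source = {f"rs{i}": round(i * 0.1, 3) for i in range(100)}
output = dict(source)  # in a real test, read back from the converted file

# Derived assertion: output keys are exactly the source keys (no magic numbers).
assert set(output) == set(source)

# Seeded sampling: spot-check 10 reproducibly chosen records, not cherry-picked ones.
rng = random.Random(42)
for rsid in rng.sample(sorted(source), k=10):
    assert abs(output[rsid] - source[rsid]) < 1e-9, rsid
```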
+#### Test Structure Best Practices
+
+- **Parameterize over duplicate**: If testing the same logic on multiple outputs, use `@pytest.mark.parametrize` instead of copy-pasting tests.
+- **Set equality over counts**: Prefer `assert set_a == set_b` over `assert len(set_a) == 270`; set comparison catches both missing and extra values.
+- **Delete redundant tests**: If test A (e.g., set equality) fully covers test B (e.g., count check), keep only test A.
+- **Domain constants are OK**: Hardcoding expected enum values or well-known constants from specs is fine; hardcoding row counts or unique counts derived from data inspection is not.
+
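The parameterize-over-duplicate and set-equality points can be sketched together; `load_columns` and the table names are hypothetical stand-ins for reading real parquet schemas:

```python
import pytest

# Expected schemas per output table (illustrative names, not the real outputs).
EXPECTED = {
    "weights": {"rsid", "weight", "state", "module"},
    "studies": {"rsid", "pmid", "population"},
}

def load_columns(table: str) -> set[str]:
    # Stand-in for reading the real parquet file's schema at test time.
    fake_store = {
        "weights": {"rsid", "weight", "state", "module"},
        "studies": {"rsid", "pmid", "population"},
    }
    return fake_store[table]

# One parameterized test instead of one copy-pasted test per table.
@pytest.mark.parametrize("table", sorted(EXPECTED))
def test_schema(table: str) -> None:
    # Set equality catches both missing and extra columns, unlike a count check.
    assert load_columns(table) == EXPECTED[table]
```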
+#### Anti-Patterns to Avoid
+
+- Testing only "happy path" with trivial data
+- Hardcoding expected values that drift from source (use derived ground truth)
+- Mocking data transformations instead of running real pipelines
+- Ignoring edge cases (nulls, empty strings, boundary values, unicode, malformed data)
+
+**Meaningless Tests to Avoid** (common AI-generated anti-patterns):
+
+```python
+# BAD: Existence-only checks as the sole validation
+assert "name" in df.columns
+assert len(df) > 0
+
+# BAD: Hardcoded counts derived from data inspection
+assert len(source_ids) == 270  # will break when source changes
+
+# BAD: Redundant with set equality test
+assert len(output_cats) == 12  # already covered by subset check
+
+# ACCEPTABLE: Required columns as prerequisites
+required_cols = {"id", "name", "value"}
+assert required_cols.issubset(df.columns)
+
+# GOOD: Set equality from source data
+source_ids = set(source_df["id"].unique().drop_nulls().to_list())
+output_ids = set(output_df["id"].unique().drop_nulls().to_list())
+assert source_ids == output_ids
+
+# GOOD: Domain knowledge constants (from spec, not data inspection)
+assert valid_states == {"active", "inactive", "pending"}  # from API spec
+```
+
 ### Running Tests
 
 ```bash
@@ -216,29 +278,29 @@ These fixtures are automatically used by test modules to ensure data availabilit
 
 ### Example: LongevityMap Validation
 
-The `test_longevitymap_module.py` includes 47 tests validating:
+The `test_longevitymap_module.py` validates conversion integrity:
 
-1. **Weights Table** (1043 rows, 528 variants)
-   - Row counts match between SQLite and Parquet
-   - Weight values preserved (sum, min, max, mean)
-   - Per-rsid weight sums match
-   - Negative (risk) weights correctly identified
+1. **Weights Table**
+   - Row counts: parquet ≥ sqlite (due to genotype expansion)
+   - Unique rsid counts match between formats
+   - Weight values preserved (min/max match, all unique values present)
+   - Per-rsid weight sets match (not sums, due to expansion)
 
 2. **APOE Variants** (Critical longevity markers)
-   - rs7412 (APOE e2): protective weights
-   - rs429358 (APOE e4): risk weights
+   - rs7412 (APOE e2): expected protective weight set `{0.5, 1.0}`
+   - rs429358 (APOE e4): expected risk weight set `{-0.5, -1.0}`
 
 3. **Schema Transformations**
-   - Genotype format (het → CT, hom → TT)
-   - State values (protective/risk)
-   - Module column correctness
+   - Genotype format: list of 2 alleles, alphabetically sorted
+   - State values: valid enum (`protective`, `risk`, `alt`, `ref`)
+   - Module column: all rows have `"longevitymap"`
 
 4. **Studies & Annotations**
-   - All PMIDs preserved (270 unique)
-   - Categories preserved (12 categories)
-   - Populations preserved (81 populations)
+   - All PMIDs: set equality between sqlite and parquet
+   - Categories: parquet subset of sqlite categories
+   - Populations: parquet subset of sqlite populations
 
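The PMID check can be sketched with an in-memory SQLite stand-in (a real test would open the downloaded LongevityMap database and the converted parquet; the table name here is illustrative):

```python
import sqlite3

# In-memory stand-in for the downloaded LongevityMap SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE study (pmid TEXT)")
con.executemany("INSERT INTO study VALUES (?)", [("111",), ("222",), ("222",)])

# Ground truth queried at test time, never hardcoded (no "270 unique" literal).
source_pmids = {row[0] for row in con.execute("SELECT DISTINCT pmid FROM study")}

# Stand-in for PMIDs read back from the converted parquet output.
output_pmids = {"111", "222"}

# Set equality catches both dropped and invented PMIDs.
assert source_pmids == output_pmids
```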
 Tests automatically:
 1. Download SQLite from `dna-seq/just_longevitymap` if missing
 2. Convert to parquet if needed
-3. Validate data integrity
+3. Validate data integrity via source comparison (not hardcoded counts)