Skip to content

Commit 045b9bb

Browse files
committed
refactor and fi test
1 parent b2ca8ce commit 045b9bb

39 files changed

Lines changed: 6051 additions & 1852 deletions

.github/workflows/tests.yml

Lines changed: 3 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -20,23 +20,20 @@ jobs:
2020
uses: actions/checkout@v4
2121

2222
- name: Install uv
23-
uses: astral-sh/setup-uv@v4
23+
uses: astral-sh/setup-uv@v5
2424
with:
25-
version: "latest"
2625
enable-cache: true
2726

2827
- name: Set up Python
29-
run: uv python install 3.13
28+
run: uv python install
3029

3130
- name: Install dependencies
3231
run: uv sync --dev
3332

3433
- name: Run tests
35-
env:
36-
PREPARE_ANNOTATIONS_CI: "true"
3734
run: |
3835
# Create data directory tree to avoid warnings
3936
mkdir -p data/{input,interim,output} logs
4037
# Run all tests, including integration tests
41-
uv run python -m pytest tests/ -vvv
38+
uv run pytest -vvv
4239

AGENTS.md

Lines changed: 82 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ The package follows Dagster best practices with utilities organized in subpackag
88

99
- `src/prepare_annotations/`: Main package
1010
- `definitions.py`: **Main Dagster definitions** (assets, jobs, resources)
11-
- `pipelines.py`: **Standalone API** (PreparationPipelines class)
11+
- `pipelines.py`: **Standalone API** for ClinVar, dbSNP, gnomAD (non-Dagster sources)
1212
- `cli.py`: Main Typer CLI entrypoint
1313

1414
- `core/`: Core utilities
@@ -88,7 +88,7 @@ These pipelines are Dagster-first. Follow these rules to avoid the issues we alr
8888

8989
- Do not use `dagster job execute` or other deprecated CLI for orchestration.
9090
- Prefer Python API: `materialize()` for assets, `execute_job()` for non-partitioned jobs.
91-
- If CLI is needed, use Dagster dev server only (`uv run dagster dev -m prepare_annotations`).
91+
- If CLI is needed, use Dagster dev server only (`uv run dagster dev -m prepare_annotations.definitions`).
9292

9393
### 2) Dynamic partitions must be explicit
9494

@@ -140,14 +140,15 @@ Do not upload temp files. Filter out:
140140
### Primary Dagster Pipelines (Recommended)
141141

142142
- `uv run dagster-ensembl`: Run the full Ensembl pipeline (download, convert, upload).
143-
- `uv run dagster-ensembl ui`: Launch Dagster UI for Ensembl pipelines.
144-
- `uv run dagster-ui`: General Dagster development server entrypoint.
143+
- `uv run prepare longevitymap`: Run the LongevityMap pipeline (convert, join with Ensembl, upload).
144+
- `uv run dagster-ui`: Launch Dagster UI for monitoring and lineage visualization.
145145

146146
### OakVar Module Management
147147

148148
- `uv run modules data --repo dna-seq/just_longevitymap`: Download module data files.
149149
- `uv run modules clone --repo dna-seq/just_longevitymap`: Clone full module repository.
150-
- `uv run modules convert-longevitymap`: Convert LongevityMap to unified annotation schema.
150+
- `uv run prepare longevitymap`: Run full Dagster pipeline (convert + Ensembl join + upload).
151+
- `uv run prepare longevitymap --convert-only`: Convert only (no Ensembl join, no upload).
151152

152153
### Unified Annotation Schema
153154

@@ -192,6 +193,67 @@ Datasets are typically uploaded to the `just-dna-seq` organization on Hugging Fa
192193
- **Auto-download**: Tests automatically download required data from GitHub
193194
- **Validation**: Comprehensive checks ensuring data integrity during conversion
194195

196+
### Test Generation Guidelines (Universal)
197+
198+
- **Real data + ground truth**: Use actual source data, auto-download if needed, and compute expected values at runtime.
199+
- **Deterministic coverage**: Use fixed seeds or explicit filters; include representative and edge cases.
200+
- **Meaningful assertions**: Prefer relationships and aggregates over existence-only checks.
201+
202+
#### What to Validate
203+
204+
- **Counts & aggregates**: Row counts, sums/min/max/means, distinct counts, and distributions.
205+
- **Joins**: Pre/post counts, key coverage, cardinality expectations, nulls introduced by outer joins, and a few spot-checks.
206+
- **Transformations**: Round-trip survival, subset/superset semantics, value mapping, key preservation.
207+
- **Data quality**: Format/range checks, outliers, malformed entries, duplicates, referential integrity.
208+
209+
#### Avoiding LLM "Reward Hacking" in Tests
210+
211+
- **Runtime ground truth**: Query source data at test time instead of hardcoding expectations.
212+
- **Seeded sampling**: Validate random records with a fixed seed, not just known examples.
213+
- **Negative & boundary tests**: Ensure invalid inputs fail; probe min/max, empty, unicode.
214+
- **Derived assertions**: Test relationships (e.g., input vs output counts), not magic numbers.
215+
- **Allow expected failures**: Use `pytest.mark.xfail` for known data quality issues with a clear reason.
216+
217+
#### Test Structure Best Practices
218+
219+
- **Parameterize over duplicate**: If testing the same logic on multiple outputs, use `@pytest.mark.parametrize` instead of copy-pasting tests.
220+
- **Set equality over counts**: Prefer `assert set_a == set_b` over `assert len(set_a) == 270` - set comparison catches both missing and extra values.
221+
- **Delete redundant tests**: If test A (e.g., set equality) fully covers test B (e.g., count check), keep only test A.
222+
- **Domain constants are OK**: Hardcoding expected enum values or well-known constants from specs is fine; hardcoding row counts or unique counts derived from data inspection is not.
223+
224+
#### Anti-Patterns to Avoid
225+
226+
- Testing only "happy path" with trivial data
227+
- Hardcoding expected values that drift from source (use derived ground truth)
228+
- Mocking data transformations instead of running real pipelines
229+
- Ignoring edge cases (nulls, empty strings, boundary values, unicode, malformed data)
230+
231+
**Meaningless Tests to Avoid** (common AI-generated anti-patterns):
232+
233+
```python
234+
# BAD: Existence-only checks as the sole validation
235+
assert "name" in df.columns
236+
assert len(df) > 0
237+
238+
# BAD: Hardcoded counts derived from data inspection
239+
assert len(source_ids) == 270 # will break when source changes
240+
241+
# BAD: Redundant with set equality test
242+
assert len(output_cats) == 12 # already covered by subset check
243+
244+
# ACCEPTABLE: Required columns as prerequisites
245+
required_cols = {"id", "name", "value"}
246+
assert required_cols.issubset(df.columns)
247+
248+
# GOOD: Set equality from source data
249+
source_ids = set(source_df["id"].unique().drop_nulls().to_list())
250+
output_ids = set(output_df["id"].unique().drop_nulls().to_list())
251+
assert source_ids == output_ids
252+
253+
# GOOD: Domain knowledge constants (from spec, not data inspection)
254+
assert valid_states == {"active", "inactive", "pending"} # from API spec
255+
```
256+
195257
### Running Tests
196258

197259
```bash
@@ -216,29 +278,29 @@ These fixtures are automatically used by test modules to ensure data availabilit
216278

217279
### Example: LongevityMap Validation
218280

219-
The `test_longevitymap_module.py` includes 47 tests validating:
281+
The `test_longevitymap_module.py` validates conversion integrity:
220282

221-
1. **Weights Table** (1043 rows, 528 variants)
222-
- Row counts match between SQLite and Parquet
223-
- Weight values preserved (sum, min, max, mean)
224-
- Per-rsid weight sums match
225-
- Negative (risk) weights correctly identified
283+
1. **Weights Table**
284+
- Row counts: parquet ≥ sqlite (due to genotype expansion)
285+
- Unique rsid counts match between formats
286+
- Weight values preserved (min/max match, all unique values present)
287+
- Per-rsid weight sets match (not sums, due to expansion)
226288

227289
2. **APOE Variants** (Critical longevity markers)
228-
- rs7412 (APOE e2): protective weights
229-
- rs429358 (APOE e4): risk weights
290+
- rs7412 (APOE e2): expected protective weight set `{0.5, 1.0}`
291+
- rs429358 (APOE e4): expected risk weight set `{-0.5, -1.0}`
230292

231293
3. **Schema Transformations**
232-
- Genotype format (het → CT, hom → TT)
233-
- State values (protective/risk)
234-
- Module column correctness
294+
- Genotype format: list of 2 alleles, alphabetically sorted
295+
- State values: valid enum (`protective`, `risk`, `alt`, `ref`)
296+
- Module column: all rows have `"longevitymap"`
235297

236298
4. **Studies & Annotations**
237-
- All PMIDs preserved (270 unique)
238-
- Categories preserved (12 categories)
239-
- Populations preserved (81 populations)
299+
- All PMIDs: set equality between sqlite and parquet
300+
- Categories: parquet subset of sqlite categories
301+
- Populations: parquet subset of sqlite populations
240302

241303
Tests automatically:
242304
1. Download SQLite from `dna-seq/just_longevitymap` if missing
243305
2. Convert to parquet if needed
244-
3. Validate data integrity
306+
3. Validate data integrity via source comparison (not hardcoded counts)

README.md

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -34,15 +34,14 @@ The primary entry points are `dagster-ensembl` for running jobs and `dagster-ui`
3434
# Run the full Ensembl pipeline (download → convert → upload)
3535
uv run dagster-ensembl
3636

37+
# Run the LongevityMap pipeline (convert → join with Ensembl → upload)
38+
uv run prepare longevitymap
39+
3740
# Start the Dagster UI for monitoring and lineage visualization
3841
uv run dagster-ui
3942

4043
# Run for a specific species
4144
uv run dagster-ensembl --species mus_musculus
42-
43-
# Run specific jobs (prepare, download, convert, upload, longevitymap)
44-
uv run prepare job download
45-
uv run prepare job convert
4645
```
4746

4847
### Advanced Operations
@@ -78,7 +77,7 @@ The package follows Dagster best practices with utilities organized in subpackag
7877
```
7978
src/prepare_annotations/
8079
├── definitions.py # Main Dagster definitions (assets, jobs, resources)
81-
├── pipelines.py # Standalone API (PreparationPipelines)
80+
├── pipelines.py # Standalone API for ClinVar, dbSNP, gnomAD (non-Dagster)
8281
├── cli.py # Typer CLI entrypoint
8382
8483
├── core/ # Core utilities
88 KB
Binary file not shown.

0 commit comments

Comments
 (0)