Changes from all commits (20 commits)
87d688f
Add index creation script for onsides PostgreSQL schema
wolfderby Nov 6, 2025
894ac12
feat: add entries to .gitignore for new data and VSCode settings
wolfderby Nov 6, 2025
563498f
Add archive + delete orphan rows script for cem_development_2025 (dry…
wolfderby Nov 7, 2025
fce5d1c
feat: add qa_faers_wc_import_log script for logging file processing m…
wolfderby Nov 7, 2025
df1d4c5
database/qa: update README to reflect current z_qa_faers_wc_import_lo…
wolfderby Nov 7, 2025
2d669d7
database/qa: rename laers_or_faers -> onsides_release_version in DDL,…
wolfderby Nov 7, 2025
18badc6
qa: skip DDL step for non-superusers; only create table if current us…
wolfderby Nov 7, 2025
30d10e7
database/qa: add idempotent migration, ensure script, and bulk-run he…
wolfderby Nov 7, 2025
0cce8aa
Add scripts for loading missing rows, patching vocab placeholders, an…
wolfderby Nov 7, 2025
6b0ad5c
Add ETL best practices and command formatting guidelines to database …
wolfderby Nov 7, 2025
5fe68b3
Update QA script to exclude placeholder vocab entries from domain counts
wolfderby Nov 7, 2025
6c7f68b
Add remaining files: QA bulk runner, GitHub workflows, and constraint…
wolfderby Nov 7, 2025
a00f5e9
Document the difference between wc -l and logical CSV record counts i…
wolfderby Nov 7, 2025
d99d99e
Add QA summary.md explaining import validation metrics
wolfderby Nov 7, 2025
9d69254
Add onsides.about table to schema and populate script
wolfderby Nov 10, 2025
d479e8e
Update run_populate_about.sh to prompt for env vars if not set
wolfderby Nov 10, 2025
9da86de
Add script to populate _about.about table with OnSIDES metadata
wolfderby Nov 10, 2025
636b43b
derived tables
wolfderby Feb 9, 2026
038bd45
derived tables in snake make
wolfderby Feb 9, 2026
0637b34
instructions
wolfderby Feb 9, 2026
109 changes: 109 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,109 @@
# AI Coding Agent Guidelines for OnSIDES

## Overview
OnSIDES is a comprehensive database of drugs and their adverse events, extracted from drug product labels using fine-tuned natural language processing models. The project is structured to support database generation, querying, and analysis workflows.

### Key Components:
1. **Data**: Raw data files and annotations are stored in the `data/` directory.
2. **Database**: Schema files and helper scripts for MySQL, PostgreSQL, and SQLite are in `database/schema/`.
3. **Source Code**: Python scripts for data processing, database interactions, and predictions are in `src/onsides/`.
4. **Snakemake Pipelines**: Workflow automation scripts for data parsing and evaluation are in `snakemake/`.
5. **Documentation**: Example queries and developer notes are in `docs/`.

---

## Developer Workflows

### Setting Up the Database
- Use the schema files in `database/schema/` to create the database.
- Example scripts for loading data:
- `database_scripts/mysql.sh`
- `database_scripts/postgres.sh`
- `database_scripts/sqlite.sh`
- For PostgreSQL, use `data_our_improvements/schema/postgres_v3.1.0_fixed.sql` for the latest schema (see the sketch below).
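
A minimal setup sketch; the database name `onsides` and the use of `createdb` are assumptions, not the project's documented procedure:

```bash
# Create a database and apply the latest PostgreSQL schema
createdb onsides
psql -d onsides -v ON_ERROR_STOP=1 \
  -f data_our_improvements/schema/postgres_v3.1.0_fixed.sql
```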

### Running Snakemake Pipelines
- Navigate to the appropriate subdirectory in `snakemake/` (e.g., `snakemake/eu/parse/`).
- Execute the pipeline:
```bash
snakemake --snakefile Snakefile
```
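
To preview the scheduled jobs without executing them, Snakemake's standard dry-run flag can be added (the core count here is arbitrary):

```bash
# List the jobs the pipeline would run, without running them
snakemake --snakefile Snakefile --cores 4 --dry-run
```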

### Querying the Database
- Example SQL queries are provided in `docs/README.md`.
- Use `summarize.sql` and `test.sql` in `database/` for testing and summarization (see the sketch below).
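
A sketch of running those scripts through `psql`; the database name `onsides` is an assumption:

```bash
psql -d onsides -v ON_ERROR_STOP=1 -f database/summarize.sql
psql -d onsides -v ON_ERROR_STOP=1 -f database/test.sql
```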

---

## Project-Specific Conventions

### Code Organization
- Python modules are in `src/onsides/`.
- `db.py`: Database connection utilities.
- `predict.py`: Prediction logic using ClinicalBERT.
- `stringsearch.py`: String matching utilities.

### Data Loading
- CSV files are in `data/csv/`.
- Use `load_remaining_onsides.sh` to load additional data.

### Testing
- Unit tests are in `src/onsides/test_stringsearch.py`.
- Run tests with:
```bash
pytest src/onsides/
```

---

## Integration Points

### External Dependencies
- **Podman**: Used for containerized database setups.
- **pgloader**: For importing SQLite databases into PostgreSQL (see the sketch after this list).
- **Snakemake**: Workflow management.
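
A sketch of how these tools can fit together; the container name, image tag, port, and credentials are all assumptions:

```bash
# Start a disposable PostgreSQL instance in a container
podman run -d --name onsides-pg \
  -e POSTGRES_PASSWORD=changeme -p 5432:5432 docker.io/library/postgres:16

# Import an existing SQLite build into PostgreSQL with pgloader
pgloader database/onsides.db postgresql://postgres:changeme@localhost:5432/postgres
```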

### Cross-Component Communication
- Data flows from `data/` to `database/` via schema scripts and loading utilities.
- Snakemake pipelines automate parsing and evaluation workflows.

---

## Examples

### Example Query: Find Ingredients for Renal Injury
```sql
SELECT DISTINCT i.*
FROM product_label
INNER JOIN product_to_rxnorm USING (label_id)
INNER JOIN vocab_rxnorm_ingredient_to_product ON rxnorm_product_id = product_id
INNER JOIN vocab_rxnorm_ingredient i ON ingredient_id = rxnorm_id
INNER JOIN product_adverse_effect ON label_id = product_label_id
INNER JOIN vocab_meddra_adverse_effect ON effect_meddra_id = meddra_id
WHERE meddra_name = 'Renal injury';
```

### Example Query: Fraction of US Products with Headache Label
```sql
WITH n_headache AS (
SELECT COUNT(DISTINCT label_id) AS n
FROM product_label
INNER JOIN product_adverse_effect ON label_id = product_label_id
INNER JOIN vocab_meddra_adverse_effect ON effect_meddra_id = meddra_id
WHERE source = 'US' AND meddra_name = 'Headache'
),
n_overall AS (
SELECT COUNT(DISTINCT label_id) AS n
FROM product_label
INNER JOIN product_adverse_effect ON label_id = product_label_id
INNER JOIN vocab_meddra_adverse_effect ON effect_meddra_id = meddra_id
WHERE source = 'US'
)
SELECT CAST(n_headache.n AS real) / n_overall.n AS frac_with_headache
FROM n_headache, n_overall;
```

---

## Contact
For questions or issues, refer to the [README.md](../README.md) or contact the maintainers via GitHub Issues.
3 changes: 3 additions & 0 deletions .gitignore
@@ -23,3 +23,6 @@ log
database/onsides.db
db.zip
summary.md
data_our_improvements/yay
data_our_improvements/yay.pub
.vscode/settings.json
19 changes: 19 additions & 0 deletions database/README.md
@@ -7,3 +7,22 @@ Please visit the [Releases](https://github.com/tatonetti-lab/onsides/releases).

This directory contains database schemas and helper scripts for SQLite, MySQL, and PostgreSQL.
In addition, there are two scripts for testing (`test.sql`) and summarization (`summarize.sql`).

## ETL Best Practices and Command Formatting

When running ETL processes or database commands:

- **Command Formatting**: Put each command on its own line. For commands with multiple options, use backslash-continued, indented line breaks for clarity.
Example:
```
psql -d mydb -U user \
  -c "SELECT * FROM table;" \
  -f script.sql
```

- **Thoughtful ETL Execution**: Avoid running PSQL scripts haphazardly. Always:
- Verify prerequisites (e.g., environment variables, permissions).
- Use dry-run modes if available.
- Log outputs for auditing (see the sketch after this list).
- Test on a subset first for large operations.
- Ensure idempotency where possible to allow safe reruns.
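
A sketch that combines these points into a fail-fast, logged, rerunnable invocation; the script and log-file names are placeholders:

```
psql -v ON_ERROR_STOP=1 -f script.sql 2>&1 | tee -a "etl_$(date +%Y%m%d).log"
```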
81 changes: 81 additions & 0 deletions database/qa/README.md
@@ -0,0 +1,81 @@
# QA workflow: FAERS/LAERS import log

This workflow logs basic QA metrics comparing a source data file's line count to the row count in a target database table. Results are inserted into `onsides.z_qa_faers_wc_import_log`.

## What it records
- `log_filename` (varchar(255), NOT NULL): basename of the source file (placeholder; can be customized)
- `filename` (varchar(255), NOT NULL): full path to the source file at logging time
- `onsides_release_version` (varchar(32), NOT NULL): OnSIDES release tag or short dataset tag, e.g. `v3.1.0`, `FAERS`, or `LAERS`
- `yr` (int, NOT NULL): dataset year (YY or YYYY)
- `qtr` (int, NOT NULL): dataset quarter (1–4)
- `wc_l_count` (int, NOT NULL): raw `wc -l` physical line count of the source file (includes header)
- `loaded_at` (timestamp, DEFAULT CURRENT_TIMESTAMP): when the log row was inserted
- `select_count_on_domain` (int, NULL): `SELECT COUNT(*)` on the specified domain table at logging time
- `select_count_diff` (int, NULL): `select_count_on_domain - wc_l_count`
- `select_count_diff_pct` (float, NULL): `select_count_diff / NULLIF(wc_l_count,0)` as a float
- `execution_id` (int, NULL): optional execution identifier for grouping runs
- `csv_record_count` (int, NULL): CSV-aware logical record count (header skipped; embedded newlines handled)
- `csv_count_diff` (int, NULL): `select_count_on_domain - csv_record_count`
- `csv_count_diff_pct` (float, NULL): `csv_count_diff / NULLIF(csv_record_count,0)` as a float
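
A sketch of inspecting the most recent log row's derived metrics (the column names are those recorded above; the query itself is illustrative):

```
psql -At -c "SELECT log_filename,
       select_count_on_domain - wc_l_count AS select_count_diff,
       select_count_on_domain - csv_record_count AS csv_count_diff
FROM onsides.z_qa_faers_wc_import_log
ORDER BY loaded_at DESC LIMIT 1"
```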

### Notes about schema changes
- The DDL adds the CSV-aware columns (`csv_record_count`, `csv_count_diff`, `csv_count_diff_pct`) even when the table already exists, and they are part of the current table definition.
- A previously-used `awk_nl_count` column (awk-based physical line count) was removed from the DDL; the workflow now uses `wc -l` for physical counts and a CSV-aware parser for logical counts.

## Usage

The script uses the standard PostgreSQL environment variables (`PGHOST`, `PGPORT`, `PGUSER`, `PGDATABASE`, and so on).

```
# Ensure the log table exists (one-time or on demand)
psql -v ON_ERROR_STOP=1 -f database/qa/z_qa_faers_wc_import_log.sql

# Log a dataset (example with release tag v3.1.0, YY=25, Q2)
bash database/qa/qa_faers_wc_import_log.sh \
  --file /path/to/source_file.txt \
  --source v3.1.0 \
  --year 25 \
  --quarter 2 \
  --domain-table product_adverse_effect \
  --domain-schema onsides \
  --execution-id 123
```

Notes:
- If `--domain-schema` is omitted, it defaults to `onsides`.
- `--log-filename` is optional; it defaults to the basename of the `--file` path.
- Year supports YY or YYYY (e.g., 25 or 2025).
- `select_count_on_domain` is computed from `SELECT COUNT(*) FROM <schema>.<table>`.

### Convenience scripts
- `database/qa/ensure_qa_table_and_grants.sh` — idempotently creates/updates the QA log table (runs DDL and migration) and grants `rw_grp` the required privileges. Run as a superuser or pass PGUSER=postgres.
- `database/qa/run_qa_bulk.sh` — run the QA logger across all CSVs in a directory (defaults to `data/csv/`). Uses `database/qa/qa_faers_wc_import_log.sh` for each file.

Example commands to prepare the database and run QA over the CSVs:

```
# Ensure table and grants (run as postgres or a superuser)
PGUSER=postgres PGDATABASE=cem_development_2025 PGPORT=5433 bash database/qa/ensure_qa_table_and_grants.sh

# Run QA logger for all CSVs (as rw_grp)
PGUSER=rw_grp PGDATABASE=cem_development_2025 PGPORT=5433 bash database/qa/run_qa_bulk.sh data/csv v3.1.0 2025 1 onsides
```

## Troubleshooting
- Permission denied: ensure your DB user can `SELECT` from the domain table and `INSERT` into `onsides.z_qa_faers_wc_import_log`.
- Table not found: pass `--domain-schema` if your table is not in `onsides`.
- Line counts include headers: this workflow uses raw `wc -l` as requested.

### About physical vs logical row counts
- `wc -l` counts physical newline characters; it includes the header row and is unaware of CSV quoting.
- `awk 'END{print NR}'` also counts physical lines and usually equals `wc -l` (they only differ by one if a file is missing a trailing newline).
- `csv_record_count` is computed with a CSV parser: it skips the header and treats embedded newlines inside quoted fields as part of the same record. This is the correct basis for comparing against database row counts after a CSV import (a counting sketch follows this list).
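
A sketch of a CSV-aware logical record count; the actual workflow may compute this differently:

```
python3 - data/csv/product_label.csv <<'PY'
import csv, sys
with open(sys.argv[1], newline="") as f:
    print(sum(1 for _ in csv.reader(f)) - 1)  # logical records minus the header row
PY
```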

In particular for `product_label.csv`:
- `wc -l` and `awk` report 63,282 (physical lines).
- `csv_record_count` is 59,515 (logical records, excluding the header).
- `SELECT COUNT(*)` on `onsides.product_label` is 59,515, so `csv_count_diff = 0`. Any non-zero `select_count_diff` here reflects the use of physical lines instead of logical records: the 3,767-line gap (63,282 - 59,515) is the header row plus newlines embedded inside quoted fields.

## 11/04/2025 load

Today, we focused on logging QA metrics for the OnSIDES v3.1.0 release. This included running the QA logger for additional input files and ensuring that the logical CSV record counts match the row counts in the database tables. Discrepancies were investigated (for example, `product_label.csv` had embedded newlines in quoted fields which made `wc -l` differ from logical CSV record counts). The README and DDL were updated to reflect the CSV-aware columns and the removal of the old `awk_nl_count` field.
24 changes: 24 additions & 0 deletions database/qa/ensure_qa_table_and_grants.sh
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
set -euo pipefail

# ensure_qa_table_and_grants.sh
# Idempotently ensure the QA log table exists and grant rw_grp the necessary privileges.
# Usage: [PGDATABASE=...] [PGPORT=...] [PGUSER=...] ./ensure_qa_table_and_grants.sh

PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}
PGUSER=${PGUSER:-postgres}

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

echo "Ensuring onsides.z_qa_faers_wc_import_log exists (running DDL as $PGUSER on $PGDATABASE:$PGPORT)"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -f "$SCRIPT_DIR/z_qa_faers_wc_import_log.sql"

echo "Applying migration (rename old column if needed)"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -f "$SCRIPT_DIR/migrate_laers_to_onsides_release_version.sql"

echo "Granting schema usage and table privileges to rw_grp"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "GRANT USAGE ON SCHEMA onsides TO rw_grp;"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "GRANT SELECT, INSERT ON TABLE onsides.z_qa_faers_wc_import_log TO rw_grp;"

echo "Done."
29 changes: 29 additions & 0 deletions database/qa/integrity_check_and_cleanup.sh
@@ -0,0 +1,29 @@
#!/usr/bin/env bash
set -euo pipefail

# Integrity check: verify no staging rows remain unmatched, then drop staging tables.
# Run after patching to confirm all rows inserted and clean up.

PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}

echo "Checking integrity and cleaning up staging tables"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;

-- Drop the staging tables only if both report 0 unmatched rows; abort otherwise
DO $$
DECLARE
  n_rx bigint;
  n_ae bigint;
BEGIN
  SELECT COUNT(*) INTO n_rx FROM product_to_rxnorm_staging s
    LEFT JOIN product_to_rxnorm p ON p.label_id = (s.label_id)::int
      AND p.rxnorm_product_id = s.rxnorm_product_id
    WHERE p.label_id IS NULL;
  SELECT COUNT(*) INTO n_ae FROM product_adverse_effect_staging s
    LEFT JOIN product_adverse_effect p ON p.effect_id = (s.effect_id)::int
    WHERE p.effect_id IS NULL;
  IF n_rx > 0 OR n_ae > 0 THEN
    RAISE EXCEPTION 'unmatched staging rows remain (product_to_rxnorm: %, product_adverse_effect: %)', n_rx, n_ae;
  END IF;
  DROP TABLE IF EXISTS product_to_rxnorm_staging;
  DROP TABLE IF EXISTS product_adverse_effect_staging;
END
$$;

SELECT 'staging_tables_dropped', 'success';
PSQL

echo "Integrity check complete; staging tables dropped."
58 changes: 58 additions & 0 deletions database/qa/load_missing_product_adverse_effect.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail

# Load missing rows from product_adverse_effect.csv into onsides.product_adverse_effect
# Usage: ./load_missing_product_adverse_effect.sh /path/to/product_adverse_effect.csv

CSV_FILE=${1:-data/csv/product_adverse_effect.csv}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}

echo "Staging CSV: $CSV_FILE"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
DROP TABLE IF EXISTS product_adverse_effect_staging;
CREATE TABLE product_adverse_effect_staging (
product_label_id text,
effect_id text,
label_section text,
effect_meddra_id text,
match_method text,
pred0 text,
pred1 text
);
PSQL

# copy using psql \copy for client-side file access
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "\copy onsides.product_adverse_effect_staging FROM '$CSV_FILE' WITH (FORMAT csv, HEADER true)"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
-- Count staged rows
SELECT 'staged_count', COUNT(*) FROM product_adverse_effect_staging;

-- Insert only rows whose effect_id is missing and whose parents exist
INSERT INTO product_adverse_effect (product_label_id, effect_id, label_section, effect_meddra_id, match_method, pred0, pred1)
SELECT
    -- NULLIF(trim(...), '') turns blank CSV fields into NULL so the casts cannot fail on empty strings
    NULLIF(trim(s.product_label_id), '')::int,
    (s.effect_id)::int,
    s.label_section,
    NULLIF(trim(s.effect_meddra_id), '')::int,
    s.match_method,
    NULLIF(trim(s.pred0), '')::float,
    NULLIF(trim(s.pred1), '')::float
FROM product_adverse_effect_staging s
WHERE NOT EXISTS (SELECT 1 FROM product_adverse_effect p WHERE p.effect_id = (s.effect_id)::int)
  AND (NULLIF(trim(s.product_label_id), '') IS NULL
       OR EXISTS (SELECT 1 FROM product_label pl WHERE pl.label_id = NULLIF(trim(s.product_label_id), '')::int))
  AND (NULLIF(trim(s.effect_meddra_id), '') IS NULL
       OR EXISTS (SELECT 1 FROM vocab_meddra_adverse_effect v WHERE v.meddra_id = NULLIF(trim(s.effect_meddra_id), '')::int));

-- Report how many now in target and how many staged remain unmatched
SELECT 'target_count', COUNT(*) FROM product_adverse_effect;
SELECT 'staged_unmatched_count', COUNT(*) FROM product_adverse_effect_staging s LEFT JOIN product_adverse_effect p ON p.effect_id = (s.effect_id)::int WHERE p.effect_id IS NULL;
-- Update sequence for effect_id
SELECT setval(pg_get_serial_sequence('product_adverse_effect','effect_id'), COALESCE((SELECT max(effect_id) FROM product_adverse_effect),0));
PSQL

echo "Done."
49 changes: 49 additions & 0 deletions database/qa/load_missing_product_to_rxnorm.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
#!/usr/bin/env bash
set -euo pipefail

# Load missing rows from product_to_rxnorm.csv into onsides.product_to_rxnorm
# Usage: ./load_missing_product_to_rxnorm.sh /path/to/product_to_rxnorm.csv

CSV_FILE=${1:-data/csv/product_to_rxnorm.csv}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}

echo "Staging CSV: $CSV_FILE"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
DROP TABLE IF EXISTS product_to_rxnorm_staging;
CREATE TABLE product_to_rxnorm_staging (
label_id text,
rxnorm_product_id text
);
PSQL

PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "\copy onsides.product_to_rxnorm_staging FROM '$CSV_FILE' WITH (FORMAT csv, HEADER true)"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
SELECT 'staged_count', COUNT(*) FROM product_to_rxnorm_staging;

INSERT INTO product_to_rxnorm (label_id, rxnorm_product_id)
SELECT
    -- NULLIF(trim(...), '') turns blank CSV fields into NULL so the cast cannot fail on empty strings
    NULLIF(trim(s.label_id), '')::int,
    s.rxnorm_product_id
FROM product_to_rxnorm_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM product_to_rxnorm p
    WHERE p.label_id = NULLIF(trim(s.label_id), '')::int
      AND p.rxnorm_product_id = s.rxnorm_product_id
)
AND (
    NULLIF(trim(s.label_id), '') IS NULL
    OR EXISTS (SELECT 1 FROM product_label pl WHERE pl.label_id = NULLIF(trim(s.label_id), '')::int)
)
AND (
    NULLIF(s.rxnorm_product_id, '') IS NULL
    OR EXISTS (SELECT 1 FROM vocab_rxnorm_product v WHERE v.rxnorm_id = s.rxnorm_product_id)
);

SELECT 'target_count', COUNT(*) FROM product_to_rxnorm;
SELECT 'staged_unmatched_count', COUNT(*) FROM product_to_rxnorm_staging s LEFT JOIN product_to_rxnorm p ON p.label_id = (s.label_id)::int AND p.rxnorm_product_id = s.rxnorm_product_id WHERE p.label_id IS NULL;
PSQL

echo "Done."