Changes from all commits (20 commits)
87d688f
Add index creation script for onsides PostgreSQL schema
wolfderby Nov 6, 2025
894ac12
feat: add entries to .gitignore for new data and VSCode settings
wolfderby Nov 6, 2025
563498f
Add archive + delete orphan rows script for cem_development_2025 (dry…
wolfderby Nov 7, 2025
fce5d1c
feat: add qa_faers_wc_import_log script for logging file processing m…
wolfderby Nov 7, 2025
df1d4c5
database/qa: update README to reflect current z_qa_faers_wc_import_lo…
wolfderby Nov 7, 2025
2d669d7
database/qa: rename laers_or_faers -> onsides_release_version in DDL,…
wolfderby Nov 7, 2025
18badc6
qa: skip DDL step for non-superusers; only create table if current us…
wolfderby Nov 7, 2025
30d10e7
database/qa: add idempotent migration, ensure script, and bulk-run he…
wolfderby Nov 7, 2025
0cce8aa
Add scripts for loading missing rows, patching vocab placeholders, an…
wolfderby Nov 7, 2025
6b0ad5c
Add ETL best practices and command formatting guidelines to database …
wolfderby Nov 7, 2025
5fe68b3
Update QA script to exclude placeholder vocab entries from domain counts
wolfderby Nov 7, 2025
6c7f68b
Add remaining files: QA bulk runner, GitHub workflows, and constraint…
wolfderby Nov 7, 2025
a00f5e9
Document the difference between wc -l and logical CSV record counts i…
wolfderby Nov 7, 2025
d99d99e
Add QA summary.md explaining import validation metrics
wolfderby Nov 7, 2025
9d69254
Add onsides.about table to schema and populate script
wolfderby Nov 10, 2025
d479e8e
Update run_populate_about.sh to prompt for env vars if not set
wolfderby Nov 10, 2025
9da86de
Add script to populate _about.about table with OnSIDES metadata
wolfderby Nov 10, 2025
636b43b
derived tables
wolfderby Feb 9, 2026
038bd45
derived tables in snake make
wolfderby Feb 9, 2026
0637b34
instructions
wolfderby Feb 9, 2026
109 changes: 109 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,109 @@
# AI Coding Agent Guidelines for OnSIDES

## Overview
OnSIDES is a comprehensive database of drugs and their adverse events, extracted from drug product labels using fine-tuned natural language processing models. The project is structured to support database generation, querying, and analysis workflows.

### Key Components:
1. **Data**: Raw data files and annotations are stored in the `data/` directory.
2. **Database**: Schema files and helper scripts for MySQL, PostgreSQL, and SQLite are in `database/schema/`.
3. **Source Code**: Python scripts for data processing, database interactions, and predictions are in `src/onsides/`.
4. **Snakemake Pipelines**: Workflow automation scripts for data parsing and evaluation are in `snakemake/`.
5. **Documentation**: Example queries and developer notes are in `docs/`.

---

## Developer Workflows

### Setting Up the Database
- Use the schema files in `database/schema/` to create the database.
- Example scripts for loading data:
- `database_scripts/mysql.sh`
- `database_scripts/postgres.sh`
- `database_scripts/sqlite.sh`
- For PostgreSQL, use `data_our_improvements/schema/postgres_v3.1.0_fixed.sql` for the latest schema (see the sketch below).
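
A minimal setup sketch; the database name `onsides` and the use of `createdb` are assumptions, not the project's documented procedure:

```bash
# Create a database and apply the latest PostgreSQL schema
createdb onsides
psql -d onsides -v ON_ERROR_STOP=1 \
  -f data_our_improvements/schema/postgres_v3.1.0_fixed.sql
```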

### Running Snakemake Pipelines
- Navigate to the appropriate subdirectory in `snakemake/` (e.g., `snakemake/eu/parse/`).
- Execute the pipeline:
```bash
snakemake --snakefile Snakefile
```
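
To preview the scheduled jobs without executing them, Snakemake's standard dry-run flag can be added (the core count here is arbitrary):

```bash
# List the jobs the pipeline would run, without running them
snakemake --snakefile Snakefile --cores 4 --dry-run
```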

### Querying the Database
- Example SQL queries are provided in `docs/README.md`.
- Use `summarize.sql` and `test.sql` in `database/` for testing and summarization (see the sketch below).
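
A sketch of running those scripts through `psql`; the database name `onsides` is an assumption:

```bash
psql -d onsides -v ON_ERROR_STOP=1 -f database/summarize.sql
psql -d onsides -v ON_ERROR_STOP=1 -f database/test.sql
```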

---

## Project-Specific Conventions

### Code Organization
- Python modules are in `src/onsides/`.
- `db.py`: Database connection utilities.
- `predict.py`: Prediction logic using ClinicalBERT.
- `stringsearch.py`: String matching utilities.

### Data Loading
- CSV files are in `data/csv/`.
- Use `load_remaining_onsides.sh` to load additional data.

### Testing
- Unit tests are in `src/onsides/test_stringsearch.py`.
- Run tests with:
```bash
pytest src/onsides/
```

---

## Integration Points

### External Dependencies
- **Podman**: Used for containerized database setups.
- **pgloader**: For importing SQLite databases into PostgreSQL (see the sketch after this list).
- **Snakemake**: Workflow management.
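
A sketch of how these tools can fit together; the container name, image tag, port, and credentials are all assumptions:

```bash
# Start a disposable PostgreSQL instance in a container
podman run -d --name onsides-pg \
  -e POSTGRES_PASSWORD=changeme -p 5432:5432 docker.io/library/postgres:16

# Import an existing SQLite build into PostgreSQL with pgloader
pgloader database/onsides.db postgresql://postgres:changeme@localhost:5432/postgres
```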

### Cross-Component Communication
- Data flows from `data/` to `database/` via schema scripts and loading utilities.
- Snakemake pipelines automate parsing and evaluation workflows.

---

## Examples

### Example Query: Find Ingredients for Renal Injury
```sql
SELECT DISTINCT i.*
FROM product_label
INNER JOIN product_to_rxnorm USING (label_id)
INNER JOIN vocab_rxnorm_ingredient_to_product ON rxnorm_product_id = product_id
INNER JOIN vocab_rxnorm_ingredient i ON ingredient_id = rxnorm_id
INNER JOIN product_adverse_effect ON label_id = product_label_id
INNER JOIN vocab_meddra_adverse_effect ON effect_meddra_id = meddra_id
WHERE meddra_name = 'Renal injury';
```

### Example Query: Fraction of US Products with Headache Label
```sql
WITH n_headache AS (
SELECT COUNT(DISTINCT label_id) AS n
FROM product_label
INNER JOIN product_adverse_effect ON label_id = product_label_id
INNER JOIN vocab_meddra_adverse_effect ON effect_meddra_id = meddra_id
WHERE source = 'US' AND meddra_name = 'Headache'
),
n_overall AS (
SELECT COUNT(DISTINCT label_id) AS n
FROM product_label
INNER JOIN product_adverse_effect ON label_id = product_label_id
INNER JOIN vocab_meddra_adverse_effect ON effect_meddra_id = meddra_id
WHERE source = 'US'
)
SELECT CAST(n_headache.n AS real) / n_overall.n AS frac_with_headache
FROM n_headache, n_overall;
```

---

## Contact
For questions or issues, refer to the [README.md](../README.md) or contact the maintainers via GitHub Issues.
3 changes: 3 additions & 0 deletions .gitignore
@@ -23,3 +23,6 @@ log
database/onsides.db
db.zip
summary.md
data_our_improvements/yay
data_our_improvements/yay.pub
.vscode/settings.json
19 changes: 19 additions & 0 deletions database/README.md
@@ -7,3 +7,22 @@ Please visit the [Releases](https://github.com/tatonetti-lab/onsides/releases).

This directory contains database schemas and helper scripts for SQLite, MySQL, and PostgreSQL.
In addition, there are two scripts for testing (`test.sql`) and summarization (`summarize.sql`).

## ETL Best Practices and Command Formatting

When running ETL processes or database commands:

- **Command Formatting**: Put each command on its own line. For commands with multiple options, use backslash-continued, indented line breaks for clarity.
Example:
```
psql -d mydb -U user \
  -c "SELECT * FROM table;" \
  -f script.sql
```

- **Thoughtful ETL Execution**: Avoid running PSQL scripts haphazardly. Always:
- Verify prerequisites (e.g., environment variables, permissions).
- Use dry-run modes if available.
- Log outputs for auditing (see the sketch after this list).
- Test on a subset first for large operations.
- Ensure idempotency where possible to allow safe reruns.
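
A sketch that combines these points into a fail-fast, logged, rerunnable invocation; the script and log-file names are placeholders:

```
psql -v ON_ERROR_STOP=1 -f script.sql 2>&1 | tee -a "etl_$(date +%Y%m%d).log"
```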
81 changes: 81 additions & 0 deletions database/qa/README.md
@@ -0,0 +1,81 @@
# QA workflow: FAERS/LAERS import log

This workflow logs basic QA metrics comparing a source data file's line count to the row count in a target database table. Results are inserted into `onsides.z_qa_faers_wc_import_log`.

## What it records
- `log_filename` (varchar(255), NOT NULL): basename of the source file (placeholder; can be customized)
- `filename` (varchar(255), NOT NULL): full path to the source file at logging time
- `onsides_release_version` (varchar(32), NOT NULL): OnSIDES release tag or short dataset tag, e.g. `v3.1.0`, `FAERS`, or `LAERS`
- `yr` (int, NOT NULL): dataset year (YY or YYYY)
- `qtr` (int, NOT NULL): dataset quarter (1–4)
- `wc_l_count` (int, NOT NULL): raw `wc -l` physical line count of the source file (includes header)
- `loaded_at` (timestamp, DEFAULT CURRENT_TIMESTAMP): when the log row was inserted
- `select_count_on_domain` (int, NULL): `SELECT COUNT(*)` on the specified domain table at logging time
- `select_count_diff` (int, NULL): `select_count_on_domain - wc_l_count`
- `select_count_diff_pct` (float, NULL): `select_count_diff / NULLIF(wc_l_count,0)` as a float
- `execution_id` (int, NULL): optional execution identifier for grouping runs
- `csv_record_count` (int, NULL): CSV-aware logical record count (header skipped; embedded newlines handled)
- `csv_count_diff` (int, NULL): `select_count_on_domain - csv_record_count`
- `csv_count_diff_pct` (float, NULL): `csv_count_diff / NULLIF(csv_record_count,0)` as a float
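
A sketch of inspecting the most recent log row's derived metrics (the column names are those recorded above; the query itself is illustrative):

```
psql -At -c "SELECT log_filename,
       select_count_on_domain - wc_l_count AS select_count_diff,
       select_count_on_domain - csv_record_count AS csv_count_diff
FROM onsides.z_qa_faers_wc_import_log
ORDER BY loaded_at DESC LIMIT 1"
```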

### Notes about schema changes
- The DDL adds the CSV-aware columns (`csv_record_count`, `csv_count_diff`, `csv_count_diff_pct`) even when the table already exists, and they are part of the current table definition.
- A previously-used `awk_nl_count` column (awk-based physical line count) was removed from the DDL; the workflow now uses `wc -l` for physical counts and a CSV-aware parser for logical counts.

## Usage

The script uses the standard PostgreSQL environment variables (`PGHOST`, `PGPORT`, `PGUSER`, `PGDATABASE`, and so on).

```
# Ensure the log table exists (one-time or on demand)
psql -v ON_ERROR_STOP=1 -f database/qa/z_qa_faers_wc_import_log.sql

# Log a dataset (example with release tag v3.1.0, YY=25, Q2)
bash database/qa/qa_faers_wc_import_log.sh \
  --file /path/to/source_file.txt \
  --source v3.1.0 \
  --year 25 \
  --quarter 2 \
  --domain-table product_adverse_effect \
  --domain-schema onsides \
  --execution-id 123
```

Notes:
- If `--domain-schema` is omitted, it defaults to `onsides`.
- `--log-filename` is optional; it defaults to the basename of the `--file` path.
- Year supports YY or YYYY (e.g., 25 or 2025).
- `select_count_on_domain` is computed from `SELECT COUNT(*) FROM <schema>.<table>`.

### Convenience scripts
- `database/qa/ensure_qa_table_and_grants.sh` — idempotently creates/updates the QA log table (runs DDL and migration) and grants `rw_grp` the required privileges. Run as a superuser or pass PGUSER=postgres.
- `database/qa/run_qa_bulk.sh` — run the QA logger across all CSVs in a directory (defaults to `data/csv/`). Uses `database/qa/qa_faers_wc_import_log.sh` for each file.

Example commands to prepare the database and run QA over the CSVs:

```
# Ensure table and grants (run as postgres or a superuser)
PGUSER=postgres PGDATABASE=cem_development_2025 PGPORT=5433 bash database/qa/ensure_qa_table_and_grants.sh

# Run QA logger for all CSVs (as rw_grp)
PGUSER=rw_grp PGDATABASE=cem_development_2025 PGPORT=5433 bash database/qa/run_qa_bulk.sh data/csv v3.1.0 2025 1 onsides
```

## Troubleshooting
- Permission denied: ensure your DB user can `SELECT` from the domain table and `INSERT` into `onsides.z_qa_faers_wc_import_log`.
- Table not found: pass `--domain-schema` if your table is not in `onsides`.
- Line counts include headers: this workflow uses raw `wc -l` as requested.

### About physical vs logical row counts
- `wc -l` counts physical newline characters; it includes the header row and is unaware of CSV quoting.
- `awk 'END{print NR}'` also counts physical lines and usually equals `wc -l` (they only differ by one if a file is missing a trailing newline).
- `csv_record_count` is computed with a CSV parser: it skips the header and treats embedded newlines inside quoted fields as part of the same record. This is the correct basis for comparing against database row counts after a CSV import (a counting sketch follows this list).
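
A sketch of a CSV-aware logical record count; the actual workflow may compute this differently:

```
python3 - data/csv/product_label.csv <<'PY'
import csv, sys
with open(sys.argv[1], newline="") as f:
    print(sum(1 for _ in csv.reader(f)) - 1)  # logical records minus the header row
PY
```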

In particular for `product_label.csv`:
- `wc -l` and `awk` report 63,282 (physical lines).
- `csv_record_count` is 59,515 (logical records, excluding the header).
- `SELECT COUNT(*)` on `onsides.product_label` is 59,515, so `csv_count_diff = 0`. Any non-zero `select_count_diff` here reflects the use of physical lines instead of logical records: the 3,767-line gap (63,282 - 59,515) is the header row plus newlines embedded inside quoted fields.

## 11/04/2025 load

Today, we focused on logging QA metrics for the OnSIDES v3.1.0 release. This included running the QA logger for additional input files and ensuring that the logical CSV record counts match the row counts in the database tables. Discrepancies were investigated (for example, `product_label.csv` had embedded newlines in quoted fields which made `wc -l` differ from logical CSV record counts). The README and DDL were updated to reflect the CSV-aware columns and the removal of the old `awk_nl_count` field.
24 changes: 24 additions & 0 deletions database/qa/ensure_qa_table_and_grants.sh
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
set -euo pipefail

# ensure_qa_table_and_grants.sh
# Idempotently ensure the QA log table exists and grant rw_grp the necessary privileges.
# Usage: [PGDATABASE=...] [PGPORT=...] [PGUSER=...] ./ensure_qa_table_and_grants.sh

PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}
PGUSER=${PGUSER:-postgres}

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

echo "Ensuring onsides.z_qa_faers_wc_import_log exists (running DDL as $PGUSER on $PGDATABASE:$PGPORT)"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -f "$SCRIPT_DIR/z_qa_faers_wc_import_log.sql"

echo "Applying migration (rename old column if needed)"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -f "$SCRIPT_DIR/migrate_laers_to_onsides_release_version.sql"

echo "Granting schema usage and table privileges to rw_grp"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "GRANT USAGE ON SCHEMA onsides TO rw_grp;"
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "GRANT SELECT, INSERT ON TABLE onsides.z_qa_faers_wc_import_log TO rw_grp;"

echo "Done."
29 changes: 29 additions & 0 deletions database/qa/integrity_check_and_cleanup.sh
@@ -0,0 +1,29 @@
#!/usr/bin/env bash
set -euo pipefail

# Integrity check: verify no staging rows remain unmatched, then drop staging tables.
# Run after patching to confirm all rows inserted and clean up.

PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}

echo "Checking integrity and cleaning up staging tables"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;

-- Drop the staging tables only if both report 0 unmatched rows; abort otherwise
DO $$
DECLARE
  n_rx bigint;
  n_ae bigint;
BEGIN
  SELECT COUNT(*) INTO n_rx FROM product_to_rxnorm_staging s
    LEFT JOIN product_to_rxnorm p ON p.label_id = (s.label_id)::int
      AND p.rxnorm_product_id = s.rxnorm_product_id
    WHERE p.label_id IS NULL;
  SELECT COUNT(*) INTO n_ae FROM product_adverse_effect_staging s
    LEFT JOIN product_adverse_effect p ON p.effect_id = (s.effect_id)::int
    WHERE p.effect_id IS NULL;
  IF n_rx > 0 OR n_ae > 0 THEN
    RAISE EXCEPTION 'unmatched staging rows remain (product_to_rxnorm: %, product_adverse_effect: %)', n_rx, n_ae;
  END IF;
  DROP TABLE IF EXISTS product_to_rxnorm_staging;
  DROP TABLE IF EXISTS product_adverse_effect_staging;
END
$$;

SELECT 'staging_tables_dropped', 'success';
PSQL

echo "Integrity check complete; staging tables dropped."
58 changes: 58 additions & 0 deletions database/qa/load_missing_product_adverse_effect.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
#!/usr/bin/env bash
set -euo pipefail

# Load missing rows from product_adverse_effect.csv into onsides.product_adverse_effect
# Usage: ./load_missing_product_adverse_effect.sh /path/to/product_adverse_effect.csv

CSV_FILE=${1:-data/csv/product_adverse_effect.csv}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}

echo "Staging CSV: $CSV_FILE"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
DROP TABLE IF EXISTS product_adverse_effect_staging;
CREATE TABLE product_adverse_effect_staging (
product_label_id text,
effect_id text,
label_section text,
effect_meddra_id text,
match_method text,
pred0 text,
pred1 text
);
PSQL

# copy using psql \copy for client-side file access
PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "\copy onsides.product_adverse_effect_staging FROM '$CSV_FILE' WITH (FORMAT csv, HEADER true)"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
-- Count staged rows
SELECT 'staged_count', COUNT(*) FROM product_adverse_effect_staging;

-- Insert only rows whose effect_id is missing and whose parents exist
INSERT INTO product_adverse_effect (product_label_id, effect_id, label_section, effect_meddra_id, match_method, pred0, pred1)
SELECT
    -- NULLIF(trim(...), '') turns blank CSV fields into NULL so the casts cannot fail on empty strings
    NULLIF(trim(s.product_label_id), '')::int,
    (s.effect_id)::int,
    s.label_section,
    NULLIF(trim(s.effect_meddra_id), '')::int,
    s.match_method,
    NULLIF(trim(s.pred0), '')::float,
    NULLIF(trim(s.pred1), '')::float
FROM product_adverse_effect_staging s
WHERE NOT EXISTS (SELECT 1 FROM product_adverse_effect p WHERE p.effect_id = (s.effect_id)::int)
  AND (NULLIF(trim(s.product_label_id), '') IS NULL
       OR EXISTS (SELECT 1 FROM product_label pl WHERE pl.label_id = NULLIF(trim(s.product_label_id), '')::int))
  AND (NULLIF(trim(s.effect_meddra_id), '') IS NULL
       OR EXISTS (SELECT 1 FROM vocab_meddra_adverse_effect v WHERE v.meddra_id = NULLIF(trim(s.effect_meddra_id), '')::int));

-- Report how many now in target and how many staged remain unmatched
SELECT 'target_count', COUNT(*) FROM product_adverse_effect;
SELECT 'staged_unmatched_count', COUNT(*) FROM product_adverse_effect_staging s LEFT JOIN product_adverse_effect p ON p.effect_id = (s.effect_id)::int WHERE p.effect_id IS NULL;
-- Update sequence for effect_id
SELECT setval(pg_get_serial_sequence('product_adverse_effect','effect_id'), COALESCE((SELECT max(effect_id) FROM product_adverse_effect),0));
PSQL

echo "Done."
49 changes: 49 additions & 0 deletions database/qa/load_missing_product_to_rxnorm.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
#!/usr/bin/env bash
set -euo pipefail

# Load missing rows from product_to_rxnorm.csv into onsides.product_to_rxnorm
# Usage: ./load_missing_product_to_rxnorm.sh /path/to/product_to_rxnorm.csv

CSV_FILE=${1:-data/csv/product_to_rxnorm.csv}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-cem_development_2025}
PGPORT=${PGPORT:-5433}

echo "Staging CSV: $CSV_FILE"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
DROP TABLE IF EXISTS product_to_rxnorm_staging;
CREATE TABLE product_to_rxnorm_staging (
label_id text,
rxnorm_product_id text
);
PSQL

PGUSER=$PGUSER PGDATABASE=$PGDATABASE PGPORT=$PGPORT psql -v ON_ERROR_STOP=1 -c "\copy onsides.product_to_rxnorm_staging FROM '$CSV_FILE' WITH (FORMAT csv, HEADER true)"

psql -v ON_ERROR_STOP=1 -d "$PGDATABASE" -p "$PGPORT" -U "$PGUSER" <<'PSQL'
SET search_path TO onsides, public;
SELECT 'staged_count', COUNT(*) FROM product_to_rxnorm_staging;

INSERT INTO product_to_rxnorm (label_id, rxnorm_product_id)
SELECT
    -- NULLIF(trim(...), '') turns blank CSV fields into NULL so the cast cannot fail on empty strings
    NULLIF(trim(s.label_id), '')::int,
    s.rxnorm_product_id
FROM product_to_rxnorm_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM product_to_rxnorm p
    WHERE p.label_id = NULLIF(trim(s.label_id), '')::int
      AND p.rxnorm_product_id = s.rxnorm_product_id
)
AND (
    NULLIF(trim(s.label_id), '') IS NULL
    OR EXISTS (SELECT 1 FROM product_label pl WHERE pl.label_id = NULLIF(trim(s.label_id), '')::int)
)
AND (
    NULLIF(s.rxnorm_product_id, '') IS NULL
    OR EXISTS (SELECT 1 FROM vocab_rxnorm_product v WHERE v.rxnorm_id = s.rxnorm_product_id)
);

SELECT 'target_count', COUNT(*) FROM product_to_rxnorm;
SELECT 'staged_unmatched_count', COUNT(*) FROM product_to_rxnorm_staging s LEFT JOIN product_to_rxnorm p ON p.label_id = (s.label_id)::int AND p.rxnorm_product_id = s.rxnorm_product_id WHERE p.label_id IS NULL;
PSQL

echo "Done."