Skip to content

Feature/vj-notebook#28

Draft
VarenyaJ wants to merge 166 commits intomainfrom
feature/vj-notebook
Draft

Feature/vj-notebook#28
VarenyaJ wants to merge 166 commits intomainfrom
feature/vj-notebook

Conversation

@VarenyaJ
Copy link
Owner

Refactor project; start notebook from scratch

SmartMonkey-git and others added 30 commits June 19, 2025 10:49
…experiment-notebook

Starting new work on the Trello ticket, continuing work started in "rr/model-notebook"
…'P5/scripts/data/tmp/phenopacket_dataset.csv' from 'P5/scripts/readme.md'
…in each PMID PDF. Adjust some variable names and comments
VarenyaJ and others added 30 commits August 18, 2025 12:52
…y __main__ and expose pdf-parse as its own entry point

- pyproject.toml:
  - set `p5` entry point to `P5.scripts.__main__:cli` (the full scripts CLI)
  - added `p5-pdf-parse` console script alias entry for users to run, giving direct access to the pdf-parse helper
  - kept optional sub-tool entry points (pull-git-files, create-pmid-pkl, etc.)
  - adjusted keywords and comments for clarity

- src/P5/__main__.py:
  - replaced the old standalone click group with a deprecation shim
  - running `python -m P5` now prints a friendly message directing users to `p5`
… the event of an HTPP 500 error:

- file_to_phenopacket:
  • Always request JSON from Ollama (`format="json"`)
  • Fallback: if model output is not valid JSON, write a minimal valid
    Phenopacket scaffold so one JSON file is produced per input
  • Ensure output directory exists before writing
  • Added RFC3339 timestamp helper for `metaData.created`

- pmid_downloader:
  • Hardened _get_pmcid() with retry/backoff for transient NCBI 5xx errors
  • Return None (skip PMID) instead of crashing on Entrez/HTTP errors
  • Added Entrez.email env override + graceful handling of failures

These changes ensure:
- `test_file_to_phenopacket[.pdf]` produces expected JSON outputs
- `test_pmid_downloader` no longer fails on NCBI server errors
- CLI commands always exit 0 with sensible output, even in edge cases
…d upgrade model while still revising

- Added stable `id` fields for all notebook cells to improve reproducibility
- Step 4: added `list_pmids_loaded` to track actually loaded PMIDs, updated patient ID fallback
- Step 5: improved minimal Phenopacket builder
  - skip invalid HPOs
  - ensure non-null labels
  - use UTC ISO8601 timestamps in `metaData.created`
- Temporarily upgraded model from `llama3.2:latest` → `gpt-oss:latest`
- Removed deprecated `--hidethinking` option; added `num_ctx` for larger context
- Step 6 & 7: use `list_pmids_loaded` instead of indexing into dataframe
- Step 8: updated evaluator `model` field to `ollama:gpt-oss:latest`
…d HPO extraction

- Added debug print of input char counts per PMID.
- Relaxed HPO JSON schema (require only `hpo_id`).
- Rewrote HPO prompt instructions for onset/severity/frequency + excerpts.
- Added fallback when schema-based JSON parse fails.
- Updated execution counts & timestamps.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants