Make Vaxrank's epitope/vaccine peptide scoring/filtering/ranking more configurable via YAML #200

iskandr · 2024-04-29T16:51:45Z

Vaxrank is currently configured via a mix of command-line arguments, options only accessible within the Python API, and some hardcoded assumptions.

The purpose of this PR is to reify all options related to scoring/filtering/ranking of epitope predictions and vaccine peptides in a single YAML file, which can be thought of as the vaccine selection logic for an experiment or trial.
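
As a rough illustration of what "options in a single config object" could look like, here is a minimal sketch. It uses a stdlib dataclass rather than the msgspec.Struct objects the PR works with, and a plain mapping stands in for the decoded YAML document. Apart from `EpitopeConfig` and `min_epitope_score`, the field names, the default values, and the helper `epitope_config_from_mapping` are assumptions made for this example.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class EpitopeConfig:
    # Hypothetical field names and placeholder defaults, for illustration only.
    logistic_epitope_score_midpoint: float = 350.0
    logistic_epitope_score_width: float = 150.0
    min_epitope_score: float = 0.0


def epitope_config_from_mapping(raw: Optional[Mapping[str, Any]]) -> EpitopeConfig:
    """Build a config from an already-parsed YAML document.

    An empty YAML file decodes to None, so that case falls back to defaults.
    """
    if raw is None:
        return EpitopeConfig()
    known = {f.name for f in fields(EpitopeConfig)}
    unknown = set(raw) - known
    if unknown:
        # Reject misspelled keys instead of silently ignoring them.
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return EpitopeConfig(**raw)
```

Rejecting unknown keys is a design choice of this sketch, not something the PR text states; it keeps a typo in the YAML from silently leaving a default in place.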

coveralls · 2024-04-29T17:02:00Z

coverage: 89.647% (-6.6%) from 96.289%
when pulling 4b59be9 on yaml-config
into a95e4e1 on main.

… index

YAML Config fixes:
- Fix syntax error in epitope_logic.py (missing colon after function signature)
- Fix imports in __init__.py to import from correct config modules
- Fix epitope_config_args.py to import DEFAULT_MIN_EPITOPE_SCORE and use args.config
- Rewrite broken vaccine_config_from_args() function with correct variable names
- Fix entry_point.py imports and logging.conf path for cli subpackage
- Add exports to cli/__init__.py for make_vaxrank_arg_parser, run_vaxrank_from_parsed_args, main
- Fix core_logic.py function signatures for vaccine_peptides_for_variant and vaccine_peptides_from_epitopes

Shellinford replacement:
- Remove shellinford dependency (had build issues and slow index time)
- Replace with simple Python set-based kmer index for O(1) membership testing
- Use pickle for fast serialization instead of CSV
- Cache index to disk for reuse across runs

Test fix:
- Pass epitope_config with min_epitope_score=0 to test_reference_peptide_logic to prevent random IC50 values from filtering out expected epitopes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
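
The set-based kmer index described in this commit might look roughly like the sketch below. The function names and the cache-path handling are assumptions for illustration and are not taken from the Vaxrank source.

```python
import pickle
from pathlib import Path
from typing import Iterable, Set


def build_kmer_set(proteins: Iterable[str], min_len: int, max_len: int) -> Set[str]:
    """Collect every substring of length min_len..max_len from each protein."""
    return {
        seq[i:i + k]
        for seq in proteins
        for k in range(min_len, max_len + 1)
        for i in range(len(seq) - k + 1)  # empty range when seq is shorter than k
    }


def load_or_build_kmer_set(cache_path, proteins, min_len, max_len) -> Set[str]:
    """Reuse a pickled kmer set if one exists on disk, otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    kmers = build_kmer_set(proteins, min_len, max_len)
    with cache_path.open("wb") as f:
        pickle.dump(kmers, f)
    return kmers
```

Membership testing is then an ordinary `peptide in kmers` check, which is a hash lookup and so O(1) on average.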

New test files:
- test_config.py: 34 tests for EpitopeConfig, VaccineConfig, config_from_args functions, CLI argument parsing, and error handling
- test_reference_proteome.py: 27 tests for ReferenceProteome set-based kmer index including building, caching, loading, and membership testing
- test_core_logic_config.py: 14 tests for config integration with core logic

Bug fixes:
- Handle empty YAML config files in epitope_config_from_args and vaccine_config_from_args (msgspec returns null for empty files)

Total: 75 new tests, all passing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused imports (ok_, almost_eq_, gte_, pytest)
- Remove unused result variable assignments
- Fix f-string without placeholders in epitope_config_args

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The docm.info API has been discontinued and now redirects to GitHub, returning HTML instead of JSON. Added error handling for JSONDecodeError so the report generation gracefully handles this case by skipping the WUSTL database link.

Also updated logger.warn to logger.warning to fix deprecation warning.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
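
A minimal sketch of the kind of guard this commit describes, assuming the response body is already available as a string. The function name `parse_api_response` is hypothetical and the actual report code may be structured differently.

```python
import json
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def parse_api_response(body: str) -> Optional[Any]:
    """Return decoded JSON, or None when the body is not JSON (e.g. an HTML page)."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # logger.warning is the supported spelling; logger.warn is deprecated.
        logger.warning("Response was not valid JSON; skipping database link")
        return None
```

A caller can then treat `None` as "no link available" and continue generating the report.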

The DoCM API at docm.info has been retired. This replaces the online API lookup with a bundled cancer hotspots dataset from Chang et al. 2016/2017 (via maftools).

Changes:
- Add vaxrank/cancer_hotspots.py module for hotspot lookup
- Bundle cancer_hotspots_v2.tsv data file (~2,739 unique hotspots)
- Update report.py to use local hotspots instead of WUSTL API
- Add comprehensive tests for cancer hotspots module (32 tests)

The hotspot lookup is O(1) using a dict keyed by (gene, protein_change) and uses LRU caching to load the data file only once.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
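
The lookup strategy in the last paragraph (a dict keyed by `(gene, protein_change)`, loaded once via LRU caching) can be sketched as follows. The column names, the two inline rows, and the function names are assumptions for illustration; the real module reads the bundled cancer_hotspots_v2.tsv, whose layout is not shown in this PR.

```python
import csv
import io
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical stand-in for the bundled TSV file; column names are assumed.
_EXAMPLE_TSV = "gene\tprotein_change\nBRAF\tV600E\nKRAS\tG12D\n"

HotspotKey = Tuple[str, str]


@lru_cache(maxsize=1)
def load_hotspots() -> Dict[HotspotKey, Dict[str, str]]:
    """Parse the table once; later calls return the same cached dict."""
    reader = csv.DictReader(io.StringIO(_EXAMPLE_TSV), delimiter="\t")
    return {(row["gene"], row["protein_change"]): row for row in reader}


def is_hotspot(gene: str, protein_change: str) -> bool:
    """Average O(1) membership test against the cached dict."""
    return (gene, protein_change) in load_hotspots()
```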

Updates:
- README.md: Added YAML config docs, features section, architecture section, updated Python version requirement (3.9+), fixed outdated FM index reference
- docs/conf.py: Added autodoc, napoleon, viewcode extensions; sphinx-rtd-theme
- docs/index.rst: Complete rewrite with installation, usage, config file docs
- docs/getting_started.rst: New quick start guide
- docs/configuration.rst: New detailed config documentation with API docs
- docs/api.rst: New API reference with autodoc for all modules
- .github/workflows/docs.yml: New GitHub Action for building docs
- vaxrank/epitope_config.py: Added comprehensive docstrings with examples
- vaxrank/vaccine_config.py: Added comprehensive docstrings with examples

The documentation now covers:
- YAML configuration file format and options
- EpitopeConfig and VaccineConfig parameter descriptions
- Reference proteome filtering (set-based kmer index)
- Cancer hotspot annotation feature
- CLI usage examples
- API reference with autodoc

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add tests/conftest.py with pytest configuration for custom markers
- Add @pytest.mark.slow marker for tests that use real NetMHCpan
- Add @requires_netmhcpan skip decorator for tests needing NetMHCpan
- Update test_ascii_report_real_netmhc_predictor to be less brittle:
  - Removed assertion that "No variants" must not appear in output
  - NetMHCpan may legitimately find no epitopes depending on alleles
  - Test now verifies report generation succeeds with non-empty output
- Tests can be run with -m "not slow" to skip slow NetMHCpan tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
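
A conftest.py along these lines could register the custom marker and detect whether NetMHCpan is installed. This is only a sketch: the executable name `netMHCpan`, the marker description, and the helper `netmhcpan_available` are assumptions, and the PR's actual conftest.py is not shown here.

```python
import shutil


def netmhcpan_available() -> bool:
    """Return True if a NetMHCpan executable is on PATH.

    Assumption: the executable is named "netMHCpan".
    """
    return shutil.which("netMHCpan") is not None


def pytest_configure(config):
    """pytest hook: register the 'slow' marker so `-m "not slow"` runs without warnings."""
    config.addinivalue_line(
        "markers", "slow: tests that run the real NetMHCpan predictor"
    )
```

A skip decorator can then be defined in a test module as `requires_netmhcpan = pytest.mark.skipif(not netmhcpan_available(), reason="NetMHCpan not installed")`.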

Support both underscore (--cosmic_vcf_filename) and dash (--cosmic-vcf-filename) versions of the argument for backwards compatibility with existing scripts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Standardize on dashes for CLI options. The underscore version (--cosmic_vcf_filename) is kept as a legacy fallback for backwards compatibility with existing scripts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
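
With argparse, one way to accept both spellings is to pass them as two option strings for the same argument. The sketch below assumes a standalone parser; the `dest` and help text are illustrative and may differ from Vaxrank's actual parser.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cosmic-vcf-filename",   # primary, dash-separated spelling
        "--cosmic_vcf_filename",   # legacy spelling kept for existing scripts
        dest="cosmic_vcf_filename",
        default=None,
        help="Path to a COSMIC VCF file (help text is illustrative)",
    )
    return parser
```

Both spellings populate the same attribute, so downstream code reads `args.cosmic_vcf_filename` regardless of which form the caller used.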

- Fix TypeError when max_mutations_in_report is None (add None check)
- Fix potential AttributeError in cached mode for num_epitopes_per_vaccine_peptide
- Use VaccineConfig defaults for CLI args instead of hardcoded values:
  - --padding-around-mutation now uses default_vaccine_config.padding_around_mutation
  - --max-vaccine-peptides-per-mutation uses default_vaccine_config.max_vaccine_peptides_per_variant
  - --num-epitopes-per-vaccine-peptide now has default from config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
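
Deriving CLI defaults from a config object, rather than repeating literal values in the parser, might be sketched like this. The dataclass below is a stand-in for the PR's VaccineConfig, and the default values shown are placeholders, not Vaxrank's real defaults.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VaccineConfig:
    # Placeholder defaults for illustration only.
    padding_around_mutation: int = 5
    max_vaccine_peptides_per_variant: int = 1
    num_epitopes_per_vaccine_peptide: Optional[int] = None


default_vaccine_config = VaccineConfig()


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Each default is read from the config object, so there is one source of truth.
    parser.add_argument(
        "--padding-around-mutation", type=int,
        default=default_vaccine_config.padding_around_mutation)
    parser.add_argument(
        "--max-vaccine-peptides-per-mutation", type=int,
        default=default_vaccine_config.max_vaccine_peptides_per_variant)
    parser.add_argument(
        "--num-epitopes-per-vaccine-peptide", type=int,
        default=default_vaccine_config.num_epitopes_per_vaccine_peptide)
    return parser
```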

Use find_packages() instead of hardcoded package list to automatically discover all subpackages. This fixes a critical bug where pip install would not include the vaxrank.cli subpackage, causing the vaxrank command to fail.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- README.md: Fix test data path from test/ to tests/
- epitope_config.py: Fix scoring formula to match actual implementation (was showing wrong formula with power function, now shows correct logistic formula with normalization)
- docs/configuration.rst: Update scoring formula to match code

The documented formula was:
    score = 1 / (1 + (ic50 / midpoint) ^ (4 * ln(3) / width))

But the actual code uses:
    rescaled = (ic50 - midpoint) / width
    score = (1 / (1 + exp(rescaled))) / normalizer

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
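
The logistic formula quoted above can be written out as a small function. The commit message does not define `normalizer`; this sketch assumes it is the un-normalized logistic value at `ic50 = 0`, so that an IC50 of 0 scores exactly 1. The default `midpoint` and `width` values and the overflow guard are also assumptions of this example.

```python
import math


def logistic_epitope_score(ic50: float, midpoint: float = 350.0, width: float = 150.0) -> float:
    """Map an IC50 (lower is better) to a score that decreases from 1 toward 0."""
    rescaled = (ic50 - midpoint) / width
    if rescaled > 700:
        # Guard added for this sketch: math.exp overflows for very large arguments.
        return 0.0
    logistic = 1.0 / (1.0 + math.exp(rescaled))
    # Assumption: normalize by the logistic value at ic50 = 0 so that score(0) == 1.
    normalizer = 1.0 / (1.0 + math.exp(-midpoint / width))
    return logistic / normalizer
```

Under this assumption the score at `ic50 == midpoint` is `0.5 / normalizer`, slightly above 0.5, and the function is strictly decreasing in `ic50`.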

- Delete docs/ directory with Sphinx RST files
- Delete .github/workflows/docs.yml
- README.md already contains comprehensive documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add mkdocs.yml config with Material theme
- Add GitHub Actions workflow to build and deploy to GitHub Pages
- Docs are auto-generated from README.md (no separate doc files to maintain)

To enable GitHub Pages deployment:
1. Go to repo Settings > Pages
2. Set Source to "GitHub Actions"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The 1.2GB reference proteome kmer set pickle file was being loaded from disk for each variant processed during epitope prediction. This caused ~6GB of I/O overhead when processing 5 variants, making tests slow.

Added module-level _kmer_set_cache dictionary that caches loaded kmer sets in memory. Cache key is (species, release, min_len, max_len).

This reduces run_vaxrank time from ~85s to ~30s for the test case.

Updated test to clear in-memory cache when specifically testing disk cache loading behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
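
The in-memory cache described here could be sketched as below. `_kmer_set_cache` and its key layout come from the commit message; the function names and the injected `load_from_disk` callable are assumptions made so the example is self-contained.

```python
from typing import Callable, Dict, Set, Tuple

CacheKey = Tuple[str, int, int, int]  # (species, release, min_len, max_len)

# Module-level cache: each kmer set is read from disk at most once per process.
_kmer_set_cache: Dict[CacheKey, Set[str]] = {}


def get_kmer_set(
    species: str,
    release: int,
    min_len: int,
    max_len: int,
    load_from_disk: Callable[[str, int, int, int], Set[str]],
) -> Set[str]:
    """Return the cached kmer set, calling the (slow) loader only on a miss."""
    key = (species, release, min_len, max_len)
    if key not in _kmer_set_cache:
        _kmer_set_cache[key] = load_from_disk(*key)
    return _kmer_set_cache[key]


def clear_kmer_set_cache() -> None:
    """Drop cached sets, e.g. in tests that exercise the disk-loading path."""
    _kmer_set_cache.clear()
```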

- Add tqdm progress bars for both building and loading kmer index
- Compress cache files with gzip (531MB → 294MB, 45% smaller)
- Use pickle.HIGHEST_PROTOCOL for faster serialization
- Track compressed bytes read for accurate progress during load
- Auto-convert legacy uncompressed .pkl files to .pkl.gz
- Load time improved from ~69s to ~12s (6x faster)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
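
A minimal sketch of gzip-compressed pickling with a fallback for legacy uncompressed files. The function names are hypothetical, progress bars are omitted, and whether the original code removes the legacy file after conversion is not stated in the commit, so this sketch leaves it in place.

```python
import gzip
import pickle
from pathlib import Path
from typing import Set


def save_kmer_set(kmers: Set[str], path) -> None:
    """Write the set as a gzip-compressed pickle using the newest protocol."""
    with gzip.open(path, "wb") as f:
        pickle.dump(kmers, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_kmer_set(path) -> Set[str]:
    """Load a .pkl.gz file, converting a legacy uncompressed .pkl if that is all that exists."""
    path = Path(path)
    legacy_path = path.with_suffix("")  # "x.pkl.gz" -> "x.pkl"
    if not path.exists() and legacy_path.exists():
        with legacy_path.open("rb") as f:
            kmers = pickle.load(f)
        save_kmer_set(kmers, path)  # write the compressed copy for later runs
        return kmers
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```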

- Change bare `except:` to `except Exception:` in epitope_logic.py
- Remove unused `requests` import from report.py
- Use isinstance() instead of type() == for bool check in report.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Many transcripts share the same protein sequence (250k transcripts -> ~20k unique proteins). By collecting unique proteins first, we avoid redundant kmer extraction.

Also switched from individual set.add() calls to batch set.update() with a generator for better performance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
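
The deduplication step can be illustrated with the sketch below, which assumes the input is a mapping from transcript ID to protein sequence. That input shape and the function name are assumptions for this example; the real code may obtain its sequences differently.

```python
from typing import Mapping, Set


def kmers_from_transcripts(
    transcript_to_protein: Mapping[str, str], min_len: int, max_len: int
) -> Set[str]:
    """Extract kmers once per distinct protein sequence rather than once per transcript."""
    unique_proteins = set(transcript_to_protein.values())
    kmers: Set[str] = set()
    for seq in unique_proteins:
        for k in range(min_len, max_len + 1):
            # One batched update per (sequence, k) instead of many set.add() calls.
            kmers.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return kmers
```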

- Fix critical bug: check for Asp-Pro (DP) bonds instead of Asn-Pro (NP)
  Asp-Pro bonds are unstable and susceptible to hydrolysis under acidic conditions, while Asn-Pro bonds are relatively stable. The code was checking for the wrong dipeptide.
- Rename asparagine_proline_bond_count to aspartate_proline_bond_count
- Add edge case handling for empty sequences in gravy_score()
- Add edge case handling for short sequences in max_kmer_gravy_score()
- Remove duplicate overlaps_mutation assignment in EpitopePrediction
- Remove dead code (unused OrderedDict initialization) in report.py
- Fix typos: "determing" -> "determining", "manufacturer" -> "manufacturability"
- Fix double space in docstring
- Remove trailing whitespace from imports

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
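
For illustration, the corrected dipeptide check and an empty-safe GRAVY score might look like the sketch below. The function bodies are not taken from the Vaxrank source, and returning 0.0 for an empty sequence is an assumption, since the commit does not say what value is used. The hydropathy values are the standard Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aspartate_proline_bond_count(sequence: str) -> int:
    """Count Asp-Pro ("DP") dipeptides; Asn-Pro ("NP") is deliberately not counted."""
    # "DP" cannot overlap itself, so str.count gives the exact number of bonds.
    return sequence.count("DP")


def gravy_score(sequence: str) -> float:
    """Mean hydropathy of the sequence.

    Assumption: an empty sequence scores 0.0 instead of dividing by zero.
    """
    if not sequence:
        return 0.0
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)
```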

iskandr and others added 23 commits January 18, 2026 10:51

read install reqs from requirements.txt

d7a87d1

making logistic scoring of epitopes modified by Config

06a9b08

added TODO list and wiring up use of Config across function calls

2d946d9

moved epitope_logic

f5ee0b8

working on fixing XLSX writing

a5dd7a6

found last error in test using xlrd

8c2b730

test excel has two variants

b4e5ab0

try parallel coveralls

65c6cbb

adding git tags to deploy

db21939

add CLI to config

2522bbe

YAML config seems to work, still need tests

7ef1b8f

added vaccine config

737199e

making a CLI sub-module

df48f1a

working on centralizing config information in msgspec.Struct objects

2fa6fc3

working on YAML config

e5922bd

broken

20a276e

Fix linting issues in new test files

cbd9f78

iskandr force-pushed the yaml-config branch from 3cd422a to 2fd73bb Compare January 18, 2026 15:51

iskandr and others added 5 commits January 19, 2026 09:27

Add backwards compatibility for --cosmic_vcf_filename

65cfb74

Make --cosmic-vcf-filename primary, underscore version legacy

bc84af6

iskandr and others added 8 commits January 19, 2026 15:17

Remove RST documentation in favor of README.md

69dfd56

Remove xlsx files and add to gitignore

e71508d
