From 65e54f5c37ecc2bdedc04f23260ed97461276d4d Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Fri, 13 Mar 2026 03:51:34 +0000 Subject: [PATCH 1/2] =?UTF-8?q?chore:=20=F0=9F=A4=96=20sync=20copilot=20in?= =?UTF-8?q?structions=20-=202026-03-13?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .github/copilot-instructions.md | 164 ++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) create mode 100644 .github/copilot-instructions.md diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..ab79ac9 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,164 @@ +# Copilot Instructions for CCBR Repositories + +## Reviewer guidance (what to look for in PRs) + +- Reviewers must validate enforcement rules: no secrets, container specified, and reproducibility pins. +- If code is AI-generated, reviewers must ensure the author documents what was changed and why, and that the PR is labeled `generated-by-AI`. +- Reviewers should verify license headers and ownership metadata (for example, `CODEOWNERS`) are present. +- Reviewers must read the code and verify that it adheres to the project's coding standards, guidelines, and software engineering best practices. + +## CI & enforcement suggestions (automatable) + +1. **PR template**: include optional AI-assistance disclosure fields (model used, high-level prompt intent, manual review confirmation). +2. **Pre-merge check (GitHub Action)**: verify `.github/copilot-instructions.md` is present in the repository and that new pipeline files include a `# CRAFT:` header. +3. **Lint jobs**: `ruff` for Python, `shellcheck` for shell, `lintr` for R, and `nf-core lint` or Snakemake lint checks where applicable. +4. **Secrets scan**: run `TruffleHog` or `Gitleaks` on PRs to detect accidental credentials. +5. 
**AI usage label**: if AI usage is declared, an Action should add the `generated-by-AI` label (create this label if it does not exist); the PR body should end with the italicized Markdown line: _Generated using AI_, and any associated commit messages should end with the plain footer line: `Generated using AI`. + +_Sample GH Action check (concept): if AI usage is declared, require an AI-assistance disclosure field in the PR body._ + +## Security & compliance (mandatory) + +- Developers must not send PHI or sensitive NIH internal identifiers to unapproved external AI services; use synthetic examples. +- Repository content must only be sent to model providers approved by NCI/NIH policy (for example, Copilot for Business or approved internal proxies). +- For AI-assisted actions, teams must keep an auditable record including: user, repository, action, timestamp, model name, and endpoint. +- If using a server wrapper (Option C), logs must include the minimum metadata above and follow institutional retention policy. +- If policy forbids external model use for internal code, teams must use approved local/internal LLM workflows. + +## Operational notes (practical) + +- `copilot-instructions.md` should remain concise and prescriptive; keep only high-value rules and edge-case examples. + +- Developers should include the CRAFT block in edited files when requesting substantial generated code to improve context quality. +- Copilot must ask the user for permission before deleting any file unless the file was created by Copilot for a temporary run or test. +- Copilot must not edit any files outside of the current open workspace. + +## Code authoring guidance + +- Code must not include hard-coded secrets, credentials, or sensitive absolute paths on disk. +- Code should be designed for modularity, reusability, and maintainability. It should ideally be platform-agnostic, with special support for running on the Biowulf HPC. 
+- Use pre-commit to enforce code style and linting during the commit process. + +### Pipelines + +- Authors must review existing CCBR pipelines first: . +- New pipelines should follow established CCBR conventions for folder layout, rule/process naming, config structure, and test patterns. +- Pipelines must define container images and pin tool/image versions for reproducibility. +- Contributions should include a test dataset and a documented example command. + +#### Snakemake + +- In general, new pipelines should be created with Nextflow rather than Snakemake, unless there is a compelling reason to use Snakemake. +- Generate new pipelines from the CCBR_SnakemakeTemplate repo: +- For Snakemake, run `snakemake --lint` and a dry-run before PR submission. + +#### Nextflow + +- Generate new pipelines from the CCBR_NextflowTemplate repo: +- For Nextflow pipelines, authors must follow nf-core patterns and references: . +- Nextflow code must use DSL2 only (DSL1 is not allowed). +- For Nextflow, run `nf-core lint` (or equivalent checks) before PR submission. +- Where possible, reuse modules and subworkflows from CCBR/nf-modules or nf-core/modules. +- New modules and subworkflows should be tested with `nf-test`. + +### Python scripts and packages + +- Python scripts must include module and function/class docstrings. +- Where a standard CLI framework is adopted, Python CLIs should use `click` or `typer` for consistency with existing components. +- Scripts must support `--help` and document required/optional arguments. +- Python code must follow [PEP 8](https://peps.python.org/pep-0008/), use `snake_case`, and include type hints for public functions. +- Scripts must raise descriptive errors on failure and emit warnings when applicable. Prefer raising an exception to printing an error message or returning an error code. +- Python code should pass `ruff` checks. +- Each script must include a documented example usage in comments or README. 
+- Tests should be written with `pytest`. Other testing frameworks may be used if justified. +- Do not catch bare exceptions. The exception type must always be specified. +- Only include one return statement at the end of a function. + +### R scripts and packages + +- R scripts must include function and class documentation via roxygen2. +- CLIs must be defined using the `argparse` package. +- CLIs must support `--help` and document required/optional arguments. +- R code should pass `lintr` and `air`. +- Tests should be written with `testthat`. +- Packages should pass `devtools::check()`. +- R code should adhere to the [tidyverse style guide](https://style.tidyverse.org/). +- Only include one return statement at the end of a function, if a return statement is used at all. Explicit returns are preferred but not required for R functions. + +## AI-generated commit messages (Conventional Commits) + +- Commit messages must follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) (as enforced in `CONTRIBUTING.md`). +- Generate messages from staged changes only (`git diff --staged`); do not include unrelated work. +- Commits should be atomic: one logical change per commit. +- If mixed changes are present, split into multiple logical commits; the number of commits does not need to equal the number of files changed. +- Subject format must be: `type(optional-scope): short imperative summary` (<=72 chars), e.g., `fix(profile): update release table parser`. +- Add a body only when needed to explain **why** and notable impact; never include secrets, tokens, PHI, or large diffs. +- For AI-assisted commits, add this final italicized footer line in the commit message body: _commit message is ai-generated_ + +Suggested prompt for AI tools: + +```text +Create a Conventional Commit message from this staged diff. +Rules: +1) Use one of: feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert. +2) Keep subject <= 72 chars, imperative mood, no trailing period. 
+3) Include optional scope when clear. +4) Add a short body only if needed (why/impact), wrapped at ~72 chars. +5) Output only the final commit message. +``` + +## Pull Requests + +When opening a pull request, use the repository's pull request template (usually `.github/PULL_REQUEST_TEMPLATE.md`). +Different repos have different PR templates depending on their needs. +Ensure that the pull request follows the repository's PR template and includes all required information. +Do not allow the developer to proceed with opening a PR until all sections of the template are filled out. +Before a PR can be moved from draft to "ready for review", all of the relevant checklist items must be checked, and any +irrelevant checklist items should be crossed out. + +When new features, bug fixes, or other behavioral changes are introduced to the code, +unit tests must be added or updated to cover the new or changed functionality. + +If there are any API or other user-facing changes, the documentation must be updated both inline via docstrings and in the long-form docs in the `docs/` or `vignettes/` directory. + +When a repo contains a build workflow (i.e. a workflow file in `.github/workflows` starting with `build` or named `R-CMD-check`), +the build workflow must pass before the PR can be approved. + +### Changelog + +The changelog for the repository should be maintained in a `CHANGELOG.md` file +(or `NEWS.md` for R packages) at the root of the repository. Each pull request +that introduces user-facing changes must include a concise entry with the PR +number and the author's username tagged. Developer-only changes (i.e. updates to CI +workflows, development notes, etc.) should never be included in the changelog. +Example: + +``` +## development version + +- Fix bug in `detect_absolute_paths()` to ignore comments. (#123, @username) +``` + +## Onboarding checklist for new developers + +- [ ] Read `.github/CONTRIBUTING.md` and `.github/copilot-instructions.md`. 
+- [ ] Configure VSCode workspace to open `copilot-instructions.md` by default (so Copilot Chat sees it). +- [ ] Install pre-commit and run `pre-commit install`. + +## Appendix: VSCode snippet (drop into `.vscode/snippets/craft.code-snippets`) + +```json +{ + "Insert CRAFT prompt": { + "prefix": "craft", + "body": [ + "/* C: Context: Repo=${workspaceFolderBasename}; bioinformatics pipelines; NIH HPC (Biowulf/Helix); containers: quay.io/ccbr */", + "/* R: Rules: no PHI, no secrets, containerize, pin versions, follow style */", + "/* F: Flow: inputs/ -> results/, conf/, tests/ */", + "/* T: Tests: provide a one-line TEST_CMD and expected output */", + "", + "A: $1" + ], + "description": "Insert CRAFT prompt and place cursor at Actions" + } +} +``` From 9fc008657a5e34cb57e30e06aa5b8b1d1e9b1bb3 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 13 Mar 2026 03:53:34 +0000 Subject: [PATCH 2/2] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- .tests/lint_workdir/ref/dummy | 1 - config/samples.tsv.fulltest | 2 +- docker/bowtie1/environment.txt | 2 +- docker/circRNA_finder/environment.txt | 2 +- docker/cutadapt_fqfilter/environment.yml | 8 +- docker/dcc/environment.yml | 6 +- docker/star_ucsc_cufflinks/environment.yml | 42 +- docs/dryrun_example.txt | 2 +- resources/NCLscan.config.template | 5 +- ...ruSeq_and_nextera_adapters.consolidated.fa | 2 +- resources/argparse.bash | 1 - resources/cluster.json.highmem | 2 +- resources/collapse_bed_by_names.py | 44 +- resources/dockers/ccbr_clear/Dockerfile | 4 +- resources/merge_dataframes.R | 22 +- workflow/envs/clear.yaml | 8 +- workflow/rules/preprocessing.smk | 4 +- .../Create_circExplorer_BSJ_count_matrix.py | 75 +- .../Create_circExplorer_count_matrix.py | 152 +-- workflow/scripts/Create_ciri_count_matrix.py | 106 +- workflow/scripts/_add_geneid2genepred.py | 57 +- .../_append_splice_site_flanks_to_BSJs.py 
| 106 +- .../scripts/_bam_filter_BSJ_for_HQonly.py | 190 ++-- workflow/scripts/_bam_get_alignment_stats.py | 73 +- workflow/scripts/_bamtobed2readendsbed.py | 58 +- workflow/scripts/_bedintersect_to_rid2jid.py | 67 +- workflow/scripts/_bedpe2bed.py | 30 +- .../scripts/_circExplorer_BSJ_get_strand.py | 53 +- workflow/scripts/_collapse_find_circ.py | 37 +- workflow/scripts/_compare_lists.py | 81 +- .../_create_circExplorer_BSJ_bam_pe.py | 804 +++++++++------ .../_create_circExplorer_BSJ_bam_se.py | 743 ++++++++------ .../_create_circExplorer_BSJ_hqonly_pe.py | 825 ++++++++++------ .../_extract_circExplorer_linear_reads.py | 681 ++++++++----- ...filter_linear_spliced_readids_w_rid2jid.py | 160 +-- workflow/scripts/_make_master_counts_table.py | 85 +- .../_merge_circExplorer_found_counts.py | 65 +- .../scripts/_merge_per_sample_counts_table.py | 922 +++++++++++------- .../scripts/_multifasta2separatefastas.sh | 2 +- workflow/scripts/_process_bamtobed.py | 109 ++- workflow/scripts/annotate_clear_quant.py | 50 +- workflow/scripts/apply_junction_filters.py | 135 ++- workflow/scripts/bam_get_max_readlen.py | 12 +- workflow/scripts/bam_split_by_regions.py | 185 ++-- workflow/scripts/bam_to_bigwig.sh | 2 +- ...xplorer_get_annotated_counts_per_sample.py | 324 ++++-- .../scripts/create_circExplorer_linear_bam.py | 804 ++++++++------- ...te_circExplorer_per_sample_counts_table.py | 63 +- .../create_dcc_per_sample_counts_table.py | 126 ++- ...reate_mapsplice_per_sample_counts_table.py | 324 ++++-- .../create_nclscan_per_sample_counts_table.py | 213 +++- workflow/scripts/filter_bam.py | 40 +- workflow/scripts/filter_bam_by_readids.py | 66 +- workflow/scripts/filter_bam_for_BSJs.py | 323 +++--- .../scripts/filter_bam_for_linear_reads.py | 81 +- .../scripts/filter_bam_for_splice_reads.py | 140 +-- workflow/scripts/filter_ciriout.py | 223 +++-- workflow/scripts/filter_dcc.py | 228 +++-- workflow/scripts/filter_junction.py | 8 +- workflow/scripts/filter_junction_human.py | 8 +- 
workflow/scripts/fix_gtfs.py | 156 +-- workflow/scripts/fix_refseq_gtf.py | 319 +++--- workflow/scripts/gather_cluster_stats.sh | 2 +- workflow/scripts/get_index_rl.py | 15 +- workflow/scripts/junctions2readids.py | 58 +- workflow/scripts/make_star_index.sh | 2 +- workflow/scripts/merge_ReadsPerGene_counts.R | 8 +- .../merge_counts_tables_2_counts_matrix.py | 269 +++-- workflow/scripts/reformat_hg38_2_hg19.py | 102 +- workflow/scripts/transcript2gene.py | 39 +- ...e_BSJ_reads_and_split_BSJ_bam_by_strand.py | 688 ++++++------- 71 files changed, 6578 insertions(+), 4073 deletions(-) diff --git a/.tests/lint_workdir/ref/dummy b/.tests/lint_workdir/ref/dummy index 8b13789..e69de29 100644 --- a/.tests/lint_workdir/ref/dummy +++ b/.tests/lint_workdir/ref/dummy @@ -1 +0,0 @@ - diff --git a/config/samples.tsv.fulltest b/config/samples.tsv.fulltest index 1a883ec..8f5913f 100644 --- a/config/samples.tsv.fulltest +++ b/config/samples.tsv.fulltest @@ -1,3 +1,3 @@ sampleName path_to_R1_fastq path_to_R2_fastq GI1_N /data/Ziegelbauer_lab/circRNADetection/rawdata/ccbr983/fastq2/5_GI112118_norm_S4_R1_001.fastq.gz /data/Ziegelbauer_lab/circRNADetection/rawdata/ccbr983/fastq2/5_GI112118_norm_S4_R2_001.fastq.gz -GI1_T /data/Ziegelbauer_lab/circRNADetection/rawdata/ccbr983/fastq2/6_GI112118_tum_S5_R1_001.fastq.gz \ No newline at end of file +GI1_T /data/Ziegelbauer_lab/circRNADetection/rawdata/ccbr983/fastq2/6_GI112118_tum_S5_R1_001.fastq.gz diff --git a/docker/bowtie1/environment.txt b/docker/bowtie1/environment.txt index 14ff580..1edfbc3 100644 --- a/docker/bowtie1/environment.txt +++ b/docker/bowtie1/environment.txt @@ -1 +1 @@ -bowtie=1.3.1 \ No newline at end of file +bowtie=1.3.1 diff --git a/docker/circRNA_finder/environment.txt b/docker/circRNA_finder/environment.txt index fd233e3..2514f91 100644 --- a/docker/circRNA_finder/environment.txt +++ b/docker/circRNA_finder/environment.txt @@ -1,2 +1,2 @@ samtools -STAR \ No newline at end of file +STAR diff --git 
a/docker/cutadapt_fqfilter/environment.yml b/docker/cutadapt_fqfilter/environment.yml index 4bc48f7..c73a55f 100644 --- a/docker/cutadapt_fqfilter/environment.yml +++ b/docker/cutadapt_fqfilter/environment.yml @@ -1,6 +1,6 @@ channels: - - conda-forge - - bioconda + - conda-forge + - bioconda dependencies: - - cutadapt - - fastq-filter \ No newline at end of file + - cutadapt + - fastq-filter diff --git a/docker/dcc/environment.yml b/docker/dcc/environment.yml index 18e2f93..5b0daba 100644 --- a/docker/dcc/environment.yml +++ b/docker/dcc/environment.yml @@ -1,5 +1,5 @@ channels: - - conda-forge - - bioconda + - conda-forge + - bioconda dependencies: - - bioconda::dcc=0.5.0 \ No newline at end of file + - bioconda::dcc=0.5.0 diff --git a/docker/star_ucsc_cufflinks/environment.yml b/docker/star_ucsc_cufflinks/environment.yml index 07edb22..23006e4 100644 --- a/docker/star_ucsc_cufflinks/environment.yml +++ b/docker/star_ucsc_cufflinks/environment.yml @@ -1,23 +1,23 @@ channels: - - conda-forge - - bioconda + - conda-forge + - bioconda dependencies: - - argparse - - bedtools=2.29.0 - - blat=35 - - bowtie2=2.5.1 - - bwa=0.7.17 - - cufflinks=2.2.1 - - gffread - - HTSeq - - novoalign=3.07.00 - - numpy - - pandas - - pysam - - python=3.6 - - sambamba=0.8.2 - - samtools=1.16.1 - - star=2.7.6a - - ucsc-bedgraphtobigwig - - ucsc-bedsort - - ucsc-gtftogenepred \ No newline at end of file + - argparse + - bedtools=2.29.0 + - blat=35 + - bowtie2=2.5.1 + - bwa=0.7.17 + - cufflinks=2.2.1 + - gffread + - HTSeq + - novoalign=3.07.00 + - numpy + - pandas + - pysam + - python=3.6 + - sambamba=0.8.2 + - samtools=1.16.1 + - star=2.7.6a + - ucsc-bedgraphtobigwig + - ucsc-bedsort + - ucsc-gtftogenepred diff --git a/docs/dryrun_example.txt b/docs/dryrun_example.txt index 7f14f1b..6a56ed5 100644 --- a/docs/dryrun_example.txt +++ b/docs/dryrun_example.txt @@ -502,4 +502,4 @@ Job counts: 2 star1p 2 star2p 20 -This was a dry-run (flag -n). 
The order of jobs does not reflect the order of execution. \ No newline at end of file +This was a dry-run (flag -n). The order of jobs does not reflect the order of execution. diff --git a/resources/NCLscan.config.template b/resources/NCLscan.config.template index 0f53d48..179682f 100644 --- a/resources/NCLscan.config.template +++ b/resources/NCLscan.config.template @@ -68,7 +68,7 @@ SeqOut_bin = {NCLscan_bin}/SeqOut ### Advanced parameters ### ########################### -## The following two parameters indicate the maximal read length (L) and fragment size of the used paired-end RNA-seq data (FASTQ files), where fragment size = 2L + insert size. +## The following two parameters indicate the maximal read length (L) and fragment size of the used paired-end RNA-seq data (FASTQ files), where fragment size = 2L + insert size. ## If L > 151, the users should change these two parameters to (L, 2L + insert size). max_read_len = 151 max_fragment_size = 500 @@ -96,6 +96,3 @@ bwa-mem-t = 56 ## NOTE: The memory usage of each blat process would be up to 4 GB! 
## mp_blat_process = 56 - - - diff --git a/resources/TruSeq_and_nextera_adapters.consolidated.fa b/resources/TruSeq_and_nextera_adapters.consolidated.fa index de67830..8fb4b76 100755 --- a/resources/TruSeq_and_nextera_adapters.consolidated.fa +++ b/resources/TruSeq_and_nextera_adapters.consolidated.fa @@ -91,4 +91,4 @@ TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT >Barcode_Index25_F ACTGAT >Barcode_Index25_R -ATCAGT \ No newline at end of file +ATCAGT diff --git a/resources/argparse.bash b/resources/argparse.bash index ed1029b..25f935e 100755 --- a/resources/argparse.bash +++ b/resources/argparse.bash @@ -79,4 +79,3 @@ echo "INFILE: \${INFILE}" echo "OUTFILE: \${OUTFILE}" FOO fi - diff --git a/resources/cluster.json.highmem b/resources/cluster.json.highmem index b6d24f2..9a88f74 100644 --- a/resources/cluster.json.highmem +++ b/resources/cluster.json.highmem @@ -36,5 +36,5 @@ "threads": "56", "time": "48:00:00", "partition": "largemem" - } + } } diff --git a/resources/collapse_bed_by_names.py b/resources/collapse_bed_by_names.py index 5c23e3b..716098f 100644 --- a/resources/collapse_bed_by_names.py +++ b/resources/collapse_bed_by_names.py @@ -2,9 +2,10 @@ import sys import textwrap -usage_txt=textwrap.dedent("""\ +usage_txt = textwrap.dedent( + """\ Description: - The script collapses bed entries, ie, if the bed file has repeated + The script collapses bed entries, ie, if the bed file has repeated regions but with different names, then they are all collaped into a single bed entry and the names are reported as a comma separated list in the 4th column @@ -13,29 +14,32 @@ @Parameters: 1. : BED6 file that needs to be collapsed by name 2. 
: BED6 collaped output file -""".format(__file__)) +""".format( + __file__ + ) +) -if len(sys.argv)!=3: - exit(usage_txt) +if len(sys.argv) != 3: + exit(usage_txt) with open(sys.argv[1]) as f: - inputBedLines=f.readlines() + inputBedLines = f.readlines() -names=dict() +names = dict() for l in inputBedLines: - l=l.strip().split("\t") - tmp=[l[0],l[1],l[2],l[5]] - region_id="##".join(tmp) - if not region_id in names: - names[region_id]=list() - names[region_id].append(l[3]) + l = l.strip().split("\t") + tmp = [l[0], l[1], l[2], l[5]] + region_id = "##".join(tmp) + if not region_id in names: + names[region_id] = list() + names[region_id].append(l[3]) -outbed = open(sys.argv[2],'w') -for region_id,name in names.items(): - tmp=region_id.split("##") - namelist=",".join(name) - tmp.insert(3,namelist) - tmp.insert(4,"0") - outbed.write("\t".join(tmp)+"\n") +outbed = open(sys.argv[2], "w") +for region_id, name in names.items(): + tmp = region_id.split("##") + namelist = ",".join(name) + tmp.insert(3, namelist) + tmp.insert(4, "0") + outbed.write("\t".join(tmp) + "\n") outbed.close() diff --git a/resources/dockers/ccbr_clear/Dockerfile b/resources/dockers/ccbr_clear/Dockerfile index 809e588..975e122 100755 --- a/resources/dockers/ccbr_clear/Dockerfile +++ b/resources/dockers/ccbr_clear/Dockerfile @@ -18,9 +18,9 @@ ENV PATH="/opt2:$PATH" # Circexplorer2 --> bowtie ADD bowtie-1.1.2.tar.gz /opt2 ENV PATH="/opt2/bowtie-1.1.2:$PATH" -# Circexplorer2 --> UCSC bedtools tophat +# Circexplorer2 --> UCSC bedtools tophat RUN apt-get install -y bedtools -# Circexplorer2 --> UCSC tophat +# Circexplorer2 --> UCSC tophat ADD tophat-2.1.0.Linux_x86_64.tar.gz /opt2 ENV PATH="/opt2/tophat-2.1.0.Linux_x86_64:$PATH" # Circexplorer2 --> UCSC boostlibraries cufflinks diff --git a/resources/merge_dataframes.R b/resources/merge_dataframes.R index 105658c..512adff 100644 --- a/resources/merge_dataframes.R +++ b/resources/merge_dataframes.R @@ -1,27 +1,27 @@ #!/usr/bin/env Rscript --vanilla # 
suppressPackageStartupMessages(library("argparse")) -# +# # # create parser object # parser <- ArgumentParser() -# -# # specify our desired options -# # by default ArgumentParser will add an help option +# +# # specify our desired options +# # by default ArgumentParser will add an help option # parser$add_argument("-v", "--verbose", action="store_true", default=TRUE, # help="Print extra output [default]") -# parser$add_argument("--df1", +# parser$add_argument("--df1", # dest="df1", help="dataframe1") -# parser$add_argument("--df1_colname", +# parser$add_argument("--df1_colname", # help="dataframe1 columnname to merge by") -# parser$add_argument("--df2", +# parser$add_argument("--df2", # dest="df2", help="dataframe2") -# parser$add_argument("--df2_colname", +# parser$add_argument("--df2_colname", # help="dataframe2 columnname to merge by") -# parser$add_argument("--out", +# parser$add_argument("--out", # dest="out", help="out filename") -# +# # # get command line options, if help option encountered print help and exit, -# # otherwise if options not found on command line then set defaults, +# # otherwise if options not found on command line then set defaults, # args <- parser$parse_args() setwd("~/Ziegelbauer_lab/circRNADetection/scripts/circRNA/resources") diff --git a/workflow/envs/clear.yaml b/workflow/envs/clear.yaml index a41d6ae..cd0f57b 100644 --- a/workflow/envs/clear.yaml +++ b/workflow/envs/clear.yaml @@ -25,7 +25,7 @@ dependencies: - wheel=0.36.2=pyhd3deb0d_0 - zlib=1.2.11=h516909a_1010 - pip: - - clear==1.0.1 - - pybedtools==0.8.1 - - pysam==0.16.0.1 - - six==1.15.0 \ No newline at end of file + - clear==1.0.1 + - pybedtools==0.8.1 + - pysam==0.16.0.1 + - six==1.15.0 diff --git a/workflow/rules/preprocessing.smk b/workflow/rules/preprocessing.smk index 1abeaed..8be3335 100644 --- a/workflow/rules/preprocessing.smk +++ b/workflow/rules/preprocessing.smk @@ -54,7 +54,7 @@ rule cutadapt: -j {threads} \\ -o {params.tmpdir}/${{of1bn}} -p 
{params.tmpdir}/${{of2bn}} \\ {input.R1} {input.R2} - + # filter for average read quality fastq-filter \\ -q {params.cutadapt_q} \\ @@ -73,7 +73,7 @@ rule cutadapt: -j {threads} \\ -o {params.tmpdir}/${{of1bn}} \\ {input.R1} - + touch {output.of2} # filter for average read quality diff --git a/workflow/scripts/Create_circExplorer_BSJ_count_matrix.py b/workflow/scripts/Create_circExplorer_BSJ_count_matrix.py index ddd6895..1b96e70 100755 --- a/workflow/scripts/Create_circExplorer_BSJ_count_matrix.py +++ b/workflow/scripts/Create_circExplorer_BSJ_count_matrix.py @@ -11,21 +11,21 @@ import os import matplotlib.pyplot as plt import sys -lookupfile=sys.argv[1] -hostID=sys.argv[2] + +lookupfile = sys.argv[1] +hostID = sys.argv[2] # In[27]: def readthefile(f): - sampleName=f.name.replace(".back_spliced_junction.bed","") - x=pandas.read_csv(f,sep="\t",header=None) - x.columns=["chr","start","end","name_count","score","strand"] - x['id'] = x["chr"]+":"+x["start"].map(str)+"-"+x["end"].map(str) - x[['name',sampleName]] = x.name_count.str.split("/",expand=True) - x=x.loc[:,["id",sampleName]] - x.set_index(["id"],inplace=True) - return(x) - + sampleName = f.name.replace(".back_spliced_junction.bed", "") + x = pandas.read_csv(f, sep="\t", header=None) + x.columns = ["chr", "start", "end", "name_count", "score", "strand"] + x["id"] = x["chr"] + ":" + x["start"].map(str) + "-" + x["end"].map(str) + x[["name", sampleName]] = x.name_count.str.split("/", expand=True) + x = x.loc[:, ["id", sampleName]] + x.set_index(["id"], inplace=True) + return x # In[2]: @@ -38,26 +38,31 @@ def atof(text): retval = text return retval + def natural_keys(text): - ''' + """ alist.sort(key=natural_keys) sorts in human order http://nedbatchelder.com/blog/200712/human_sorting.html (See Toothy's implementation in the comments) float regex comes from https://stackoverflow.com/a/12643073/190597 - ''' - return [ atof(c) for c in re.split(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)', str(text)) ] + """ + return 
[ + atof(c) for c in re.split(r"[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)", str(text)) + ] # In[3]: -outfilename1="circExplorer_BSJ_count_matrix.txt" -outfilename="circExplorer_BSJ_count_matrix_with_annotations.txt" - -files_circExplorer=list(Path(os.getcwd()).rglob("*.back_spliced_junction.bed")) -files_circExplorer=list(filter(lambda x:"_only.back" not in str(x),files_circExplorer)) -files_circExplorer=list(filter(lambda x: os.stat(x).st_size !=0, files_circExplorer)) +outfilename1 = "circExplorer_BSJ_count_matrix.txt" +outfilename = "circExplorer_BSJ_count_matrix_with_annotations.txt" + +files_circExplorer = list(Path(os.getcwd()).rglob("*.back_spliced_junction.bed")) +files_circExplorer = list( + filter(lambda x: "_only.back" not in str(x), files_circExplorer) +) +files_circExplorer = list(filter(lambda x: os.stat(x).st_size != 0, files_circExplorer)) files_circExplorer.sort(key=natural_keys) -if len(files_circExplorer)==0: - for f in [outfilename1,outfilename]: +if len(files_circExplorer) == 0: + for f in [outfilename1, outfilename]: if os.path.exists(f): os.remove(f) os.mknod(f) @@ -67,33 +72,35 @@ def natural_keys(text): # In[35]: -circE_count_matrix=readthefile(files_circExplorer[0]) +circE_count_matrix = readthefile(files_circExplorer[0]) print(circE_count_matrix.head()) # In[36]: -for j in range(1,len(files_circExplorer)): - x=readthefile(files_circExplorer[j]) - circE_count_matrix=pandas.concat([circE_count_matrix,x],axis=1,join="outer",sort=False) -circE_count_matrix=circE_count_matrix.sort_index() +for j in range(1, len(files_circExplorer)): + x = readthefile(files_circExplorer[j]) + circE_count_matrix = pandas.concat( + [circE_count_matrix, x], axis=1, join="outer", sort=False + ) +circE_count_matrix = circE_count_matrix.sort_index() print(circE_count_matrix.head()) # In[37]: -circE_count_matrix.fillna(0,inplace=True) +circE_count_matrix.fillna(0, inplace=True) circE_count_matrix.head() -circE_count_matrix.to_csv(outfilename1,sep="\t",header=True) 
+circE_count_matrix.to_csv(outfilename1, sep="\t", header=True) # In[38]: -annotations=pandas.read_csv(lookupfile,sep="\t",header=0) -annotations.set_index([hostID],inplace=True) +annotations = pandas.read_csv(lookupfile, sep="\t", header=0) +annotations.set_index([hostID], inplace=True) annotations.head() @@ -106,8 +113,8 @@ def natural_keys(text): # In[39]: -x=circE_count_matrix.join(annotations) -x.to_csv(outfilename,sep="\t",header=True) +x = circE_count_matrix.join(annotations) +x.to_csv(outfilename, sep="\t", header=True) # In[14]: @@ -115,5 +122,3 @@ def natural_keys(text): print(circE_count_matrix.shape) print(x.shape) - - diff --git a/workflow/scripts/Create_circExplorer_count_matrix.py b/workflow/scripts/Create_circExplorer_count_matrix.py index 62fa2f7..1116823 100755 --- a/workflow/scripts/Create_circExplorer_count_matrix.py +++ b/workflow/scripts/Create_circExplorer_count_matrix.py @@ -11,114 +11,123 @@ import os import matplotlib.pyplot as plt import sys -#get_ipython().run_line_magic('matplotlib', 'inline') -lookupfile=sys.argv[1] -hostID=sys.argv[2] +# get_ipython().run_line_magic('matplotlib', 'inline') + +lookupfile = sys.argv[1] +hostID = sys.argv[2] # In[2]: def atof(text): - try: - retval = float(text) - except ValueError: - retval = text - return retval + try: + retval = float(text) + except ValueError: + retval = text + return retval + def natural_keys(text): - ''' - alist.sort(key=natural_keys) sorts in human order - http://nedbatchelder.com/blog/200712/human_sorting.html - (See Toothy's implementation in the comments) - float regex comes from https://stackoverflow.com/a/12643073/190597 - ''' - return [ atof(c) for c in re.split(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)', str(text)) ] + """ + alist.sort(key=natural_keys) sorts in human order + http://nedbatchelder.com/blog/200712/human_sorting.html + (See Toothy's implementation in the comments) + float regex comes from https://stackoverflow.com/a/12643073/190597 + """ + return [ + atof(c) for 
c in re.split(r"[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)", str(text)) + ] # In[3]: -outfilename1="circExplorer_count_matrix.txt" -outfilename="circExplorer_count_matrix_with_annotations.txt" +outfilename1 = "circExplorer_count_matrix.txt" +outfilename = "circExplorer_count_matrix_with_annotations.txt" -files_circExplorer=list(Path(os.getcwd()).rglob("*.circularRNA_known.txt")) -files_circExplorer=list(filter(lambda x: False if str(x).find("low_conf")!=-1 else True, files_circExplorer)) -files_circExplorer=list(filter(lambda x: os.stat(x).st_size !=0, files_circExplorer)) +files_circExplorer = list(Path(os.getcwd()).rglob("*.circularRNA_known.txt")) +files_circExplorer = list( + filter( + lambda x: False if str(x).find("low_conf") != -1 else True, files_circExplorer + ) +) +files_circExplorer = list(filter(lambda x: os.stat(x).st_size != 0, files_circExplorer)) files_circExplorer.sort(key=natural_keys) print(files_circExplorer) -if len(files_circExplorer)==0: - for f in [outfilename1,outfilename]: - if os.path.exists(f): - os.remove(f) - os.mknod(f) - exit() +if len(files_circExplorer) == 0: + for f in [outfilename1, outfilename]: + if os.path.exists(f): + os.remove(f) + os.mknod(f) + exit() # In[12]: -f=files_circExplorer[0] -sampleName=f.name.replace(".circularRNA_known.txt","") -print("Reading file:",f) -print("Sample Name:",sampleName) -x=pandas.read_csv(f,sep="\t",header=None,usecols=[0,1,2,12]) -x[hostID]=x[0].astype(str)+":"+x[1].astype(str)+"-"+x[2].astype(str) -x[sampleName+"_circE"]=x[12].astype(str) -x.drop([0,1,2,12],inplace=True,axis=1) -x.set_index([hostID],inplace=True) -circE_count_matrix=x +f = files_circExplorer[0] +sampleName = f.name.replace(".circularRNA_known.txt", "") +print("Reading file:", f) +print("Sample Name:", sampleName) +x = pandas.read_csv(f, sep="\t", header=None, usecols=[0, 1, 2, 12]) +x[hostID] = x[0].astype(str) + ":" + x[1].astype(str) + "-" + x[2].astype(str) +x[sampleName + "_circE"] = x[12].astype(str) +x.drop([0, 1, 2, 12], 
inplace=True, axis=1) +x.set_index([hostID], inplace=True) +circE_count_matrix = x # In[8]: - -print(circE_count_matrix.head(),circE_count_matrix.tail()) +print(circE_count_matrix.head(), circE_count_matrix.tail()) print(circE_count_matrix.shape) # In[13]: -for i in range(1,len(files_circExplorer)): - f=files_circExplorer[i] - print("Currently reading file:"+str(f)) - x=pandas.read_csv(f,sep="\t",header=None,usecols=[0,1,2,12]) - print("Head of this file looks like this:") - print(x.head()) - sampleName=f.name.replace(".circularRNA_known.txt","") - # x=pandas.read_csv(f,sep="\t",header=None,usecols=[0,1,2,12]) - print("SampleName is:"+sampleName) - x[hostID]=x[0].astype(str)+":"+x[1].astype(str)+"-"+x[2].astype(str) - x[sampleName+"_circE"]=x[12].astype(str) - print(x.head()) - x.drop([0,1,2,12],inplace=True,axis=1) - x.set_index([hostID],inplace=True) - print(x.head()) - print("Before concat") - print(circE_count_matrix.head()) - - -# In[14]: - - - circE_count_matrix = circE_count_matrix.loc[~circE_count_matrix.index.duplicated(keep='first')] - x = x.loc[~x.index.duplicated(keep='first')] - circE_count_matrix=pandas.concat([circE_count_matrix,x],axis=1,join="outer",sort=False) - print("After concat") - print(circE_count_matrix.head()) +for i in range(1, len(files_circExplorer)): + f = files_circExplorer[i] + print("Currently reading file:" + str(f)) + x = pandas.read_csv(f, sep="\t", header=None, usecols=[0, 1, 2, 12]) + print("Head of this file looks like this:") + print(x.head()) + sampleName = f.name.replace(".circularRNA_known.txt", "") + # x=pandas.read_csv(f,sep="\t",header=None,usecols=[0,1,2,12]) + print("SampleName is:" + sampleName) + x[hostID] = x[0].astype(str) + ":" + x[1].astype(str) + "-" + x[2].astype(str) + x[sampleName + "_circE"] = x[12].astype(str) + print(x.head()) + x.drop([0, 1, 2, 12], inplace=True, axis=1) + x.set_index([hostID], inplace=True) + print(x.head()) + print("Before concat") + print(circE_count_matrix.head()) + + # In[14]: + + 
circE_count_matrix = circE_count_matrix.loc[ + ~circE_count_matrix.index.duplicated(keep="first") + ] + x = x.loc[~x.index.duplicated(keep="first")] + circE_count_matrix = pandas.concat( + [circE_count_matrix, x], axis=1, join="outer", sort=False + ) + print("After concat") + print(circE_count_matrix.head()) # In[9]: -circE_count_matrix.fillna(0,inplace=True) +circE_count_matrix.fillna(0, inplace=True) print(circE_count_matrix.head()) -circE_count_matrix.to_csv(outfilename1,sep="\t",header=True) +circE_count_matrix.to_csv(outfilename1, sep="\t", header=True) # In[10]: -annotations=pandas.read_csv(lookupfile,sep="\t",header=0) -annotations.set_index([hostID],inplace=True) +annotations = pandas.read_csv(lookupfile, sep="\t", header=0) +annotations.set_index([hostID], inplace=True) annotations.head() @@ -131,8 +140,8 @@ def natural_keys(text): # In[12]: -x=circE_count_matrix.join(annotations) -x.to_csv(outfilename,sep="\t",header=True) +x = circE_count_matrix.join(annotations) +x.to_csv(outfilename, sep="\t", header=True) # In[14]: @@ -140,4 +149,3 @@ def natural_keys(text): print(circE_count_matrix.shape) print(x.shape) - diff --git a/workflow/scripts/Create_ciri_count_matrix.py b/workflow/scripts/Create_ciri_count_matrix.py index d38d21e..ebebc06 100755 --- a/workflow/scripts/Create_ciri_count_matrix.py +++ b/workflow/scripts/Create_ciri_count_matrix.py @@ -11,10 +11,11 @@ import os import matplotlib.pyplot as plt import sys -#get_ipython().run_line_magic('matplotlib', 'inline') -lookupfile=sys.argv[1] -hostID=sys.argv[2] +# get_ipython().run_line_magic('matplotlib', 'inline') + +lookupfile = sys.argv[1] +hostID = sys.argv[2] # In[2]: @@ -25,41 +26,55 @@ def atof(text): retval = text return retval + def natural_keys(text): - ''' + """ alist.sort(key=natural_keys) sorts in human order http://nedbatchelder.com/blog/200712/human_sorting.html (See Toothy's implementation in the comments) float regex comes from https://stackoverflow.com/a/12643073/190597 - ''' - return [ 
atof(c) for c in re.split(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)', str(text)) ] + """ + return [ + atof(c) for c in re.split(r"[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)", str(text)) + ] # In[3]: -#files_circExplorer=list(Path(os.getcwd()).rglob("*_human_only.circularRNA_known.txt")) -files_ciri=list(Path(os.getcwd()).rglob("*.ciri.out")) -#filter out files in the "old" folder -#files_circExplorer=list(filter(lambda x: not re.search('/old/', str(x)), files_circExplorer)) -files_ciri=list(filter(lambda x: not re.search('/old/', str(x)), files_ciri)) -#files_circExplorer.sort(key=natural_keys) +# files_circExplorer=list(Path(os.getcwd()).rglob("*_human_only.circularRNA_known.txt")) +files_ciri = list(Path(os.getcwd()).rglob("*.ciri.out")) +# filter out files in the "old" folder +# files_circExplorer=list(filter(lambda x: not re.search('/old/', str(x)), files_circExplorer)) +files_ciri = list(filter(lambda x: not re.search("/old/", str(x)), files_ciri)) +# files_circExplorer.sort(key=natural_keys) files_ciri.sort(key=natural_keys) # In[4]: -f=files_ciri[0] -sampleName=f.name.replace(".ciri.out","") -x=pandas.read_csv(f,sep="\t",header=0,usecols=["chr","circRNA_start","circRNA_end","#junction_reads"]) +f = files_ciri[0] +sampleName = f.name.replace(".ciri.out", "") +x = pandas.read_csv( + f, + sep="\t", + header=0, + usecols=["chr", "circRNA_start", "circRNA_end", "#junction_reads"], +) print(x.head()) -x["circRNA_start"]=x["circRNA_start"].astype(int)-1 -x[hostID]=x["chr"].astype(str)+":"+x["circRNA_start"].astype(str)+"-"+x["circRNA_end"].astype(str) -x[sampleName+"_ciri"]=x["#junction_reads"].astype(str) -x.drop(["chr","circRNA_start","circRNA_end","#junction_reads"],inplace=True,axis=1) -x.set_index([hostID],inplace=True) -ciri_count_matrix=x +x["circRNA_start"] = x["circRNA_start"].astype(int) - 1 +x[hostID] = ( + x["chr"].astype(str) + + ":" + + x["circRNA_start"].astype(str) + + "-" + + x["circRNA_end"].astype(str) +) +x[sampleName + "_ciri"] = 
x["#junction_reads"].astype(str) +x.drop(["chr", "circRNA_start", "circRNA_end", "#junction_reads"], inplace=True, axis=1) +x.set_index([hostID], inplace=True) +ciri_count_matrix = x print(ciri_count_matrix.head()) @@ -67,43 +82,54 @@ def natural_keys(text): for f in files_ciri[1:]: - sampleName=f.name.replace(".ciri.out","") - print(f,sampleName) - x=pandas.read_csv(f,sep="\t",header=0,usecols=["chr","circRNA_start","circRNA_end","#junction_reads"]) - x["circRNA_start"]=x["circRNA_start"].astype(int)-1 - x[hostID]=x["chr"].astype(str)+":"+x["circRNA_start"].astype(str)+"-"+x["circRNA_end"].astype(str) - x[sampleName+"_ciri"]=x["#junction_reads"].astype(str) - x.drop(["chr","circRNA_start","circRNA_end","#junction_reads"],inplace=True,axis=1) - x.set_index([hostID],inplace=True) - ciri_count_matrix=pandas.concat([ciri_count_matrix,x],axis=1,join="outer",sort=False) + sampleName = f.name.replace(".ciri.out", "") + print(f, sampleName) + x = pandas.read_csv( + f, + sep="\t", + header=0, + usecols=["chr", "circRNA_start", "circRNA_end", "#junction_reads"], + ) + x["circRNA_start"] = x["circRNA_start"].astype(int) - 1 + x[hostID] = ( + x["chr"].astype(str) + + ":" + + x["circRNA_start"].astype(str) + + "-" + + x["circRNA_end"].astype(str) + ) + x[sampleName + "_ciri"] = x["#junction_reads"].astype(str) + x.drop( + ["chr", "circRNA_start", "circRNA_end", "#junction_reads"], inplace=True, axis=1 + ) + x.set_index([hostID], inplace=True) + ciri_count_matrix = pandas.concat( + [ciri_count_matrix, x], axis=1, join="outer", sort=False + ) ciri_count_matrix.head() # In[6]: -ciri_count_matrix.fillna(0,inplace=True) +ciri_count_matrix.fillna(0, inplace=True) ciri_count_matrix.head() -ciri_count_matrix.to_csv("ciri_count_matrix.txt",sep="\t",header=True) +ciri_count_matrix.to_csv("ciri_count_matrix.txt", sep="\t", header=True) # In[7]: -annotations=pandas.read_csv(lookupfile,sep="\t",header=0) -annotations.set_index([hostID],inplace=True) +annotations = 
pandas.read_csv(lookupfile, sep="\t", header=0) +annotations.set_index([hostID], inplace=True) annotations.head() # In[8]: -x=ciri_count_matrix.join(annotations) -x.to_csv("ciri_count_matrix_with_annotations.txt",sep="\t",header=True) +x = ciri_count_matrix.join(annotations) +x.to_csv("ciri_count_matrix_with_annotations.txt", sep="\t", header=True) # In[ ]: - - - - diff --git a/workflow/scripts/_add_geneid2genepred.py b/workflow/scripts/_add_geneid2genepred.py index 6f21698..6f7bf6d 100755 --- a/workflow/scripts/_add_geneid2genepred.py +++ b/workflow/scripts/_add_geneid2genepred.py @@ -1,34 +1,35 @@ import sys -def get_id(s,whatid): - s=s.split() - for i,j in enumerate(s): - if j==whatid: - r=s[i+1] - r=r.replace('"','') - r=r.replace(';','') - return r - -gtffile=sys.argv[1] -transcript2gene=dict() +def get_id(s, whatid): + s = s.split() + for i, j in enumerate(s): + if j == whatid: + r = s[i + 1] + r = r.replace('"', "") + r = r.replace(";", "") + return r + + +gtffile = sys.argv[1] +transcript2gene = dict() for i in open(gtffile).readlines(): - if i.startswith("#"): - continue - i=i.strip().split("\t") - if i[2]!="transcript": - continue - gid=get_id(i[8],"gene_id") - tid=get_id(i[8],"transcript_id") -# print("%s\t%s"%(tid,gid)) - transcript2gene[tid]=gid + if i.startswith("#"): + continue + i = i.strip().split("\t") + if i[2] != "transcript": + continue + gid = get_id(i[8], "gene_id") + tid = get_id(i[8], "transcript_id") + # print("%s\t%s"%(tid,gid)) + transcript2gene[tid] = gid for i in open(sys.argv[2]).readlines(): - j=i.strip().split("\t") - x=[] - tid=j.pop(0) - gid=transcript2gene[tid] - x.append(gid) - x.append(tid) - x.extend(j) - print("\t".join(x)) + j = i.strip().split("\t") + x = [] + tid = j.pop(0) + gid = transcript2gene[tid] + x.append(gid) + x.append(tid) + x.extend(j) + print("\t".join(x)) diff --git a/workflow/scripts/_append_splice_site_flanks_to_BSJs.py b/workflow/scripts/_append_splice_site_flanks_to_BSJs.py index dd077ba..7dab224 100755 
--- a/workflow/scripts/_append_splice_site_flanks_to_BSJs.py +++ b/workflow/scripts/_append_splice_site_flanks_to_BSJs.py @@ -6,41 +6,45 @@ class BSJ: - def __init__(self,linestr): - l=linestr.strip().split("\t") - self.chrom=l[0] - self.start=l[1] - self.end=l[2] - self.name=l[3] - self.score=l[4] - self.strand=l[5] - self.bitids=l[6] - self.rids=l[7] - self.splice_site_flank_5="" #donor - self.splice_site_flank_3="" #acceptor - + def __init__(self, linestr): + l = linestr.strip().split("\t") + self.chrom = l[0] + self.start = l[1] + self.end = l[2] + self.name = l[3] + self.score = l[4] + self.strand = l[5] + self.bitids = l[6] + self.rids = l[7] + self.splice_site_flank_5 = "" # donor + self.splice_site_flank_3 = "" # acceptor + def get_jid(self): - jid=self.chrom+"##"+str(self.start)+"##"+str(self.end) + jid = self.chrom + "##" + str(self.start) + "##" + str(self.end) return jid - - def add_flanks(self,sequences): - if self.strand == '+': + + def add_flanks(self, sequences): + if self.strand == "+": coord = int(self.end) - self.splice_site_flank_5 = sequences[self.chrom][coord:coord+2] + self.splice_site_flank_5 = sequences[self.chrom][coord : coord + 2] coord = int(self.start) - self.splice_site_flank_3 = sequences[self.chrom][coord-2:coord] - elif self.strand == '-': + self.splice_site_flank_3 = sequences[self.chrom][coord - 2 : coord] + elif self.strand == "-": coord = int(self.end) - myseq = HTSeq.Sequence(bytes(sequences[self.chrom][coord:coord+2],'utf-8'),"myseq") - revcomp = myseq.get_reverse_complement().seq.decode('utf-8') + myseq = HTSeq.Sequence( + bytes(sequences[self.chrom][coord : coord + 2], "utf-8"), "myseq" + ) + revcomp = myseq.get_reverse_complement().seq.decode("utf-8") self.splice_site_flank_3 = revcomp coord = int(self.start) - myseq = HTSeq.Sequence(bytes(sequences[self.chrom][coord-2:coord],'utf-8'),"myseq") - revcomp = myseq.get_reverse_complement().seq.decode('utf-8') + myseq = HTSeq.Sequence( + bytes(sequences[self.chrom][coord - 2 : 
coord], "utf-8"), "myseq" + ) + revcomp = myseq.get_reverse_complement().seq.decode("utf-8") self.splice_site_flank_5 = revcomp - def write_out_BSJ(self,outbed): - t=[] + def write_out_BSJ(self, outbed): + t = [] t.append(self.chrom) t.append(str(self.start)) t.append(str(self.end)) @@ -49,8 +53,9 @@ def write_out_BSJ(self,outbed): t.append(self.strand) t.append(self.bitids) t.append(self.rids) - t.append("##".join([self.splice_site_flank_5,self.splice_site_flank_3])) - outbed.write("\t".join(t)+"\n") + t.append("##".join([self.splice_site_flank_5, self.splice_site_flank_3])) + outbed.write("\t".join(t) + "\n") + def main(): # debug = True @@ -58,32 +63,51 @@ def main(): parser = argparse.ArgumentParser( description="Append the BSJ Donor##Acceptor column to BSJ bed file. Input BSJ bed file is output from _create_circExplorer_BSJ_bam_pe or _create_circExplorer_BSJ_bam_se scripts." ) - parser.add_argument("--reffa",dest="reffa",required=True,type=argparse.FileType('r'),default=sys.stdin, - help="reference fasta file") - parser.add_argument("--inbsjbedgz",dest="inbsjbedgz",required=True,type=str, - help="BSJ BED in gzip format") - parser.add_argument("--outbsjbedgz",dest="outbsjbedgz",required=True,type=str, - help="BSJ BED in gzip format") + parser.add_argument( + "--reffa", + dest="reffa", + required=True, + type=argparse.FileType("r"), + default=sys.stdin, + help="reference fasta file", + ) + parser.add_argument( + "--inbsjbedgz", + dest="inbsjbedgz", + required=True, + type=str, + help="BSJ BED in gzip format", + ) + parser.add_argument( + "--outbsjbedgz", + dest="outbsjbedgz", + required=True, + type=str, + help="BSJ BED in gzip format", + ) args = parser.parse_args() print("Reading...reference sequences...") - sequences = dict((s[1], s[0]) for s in HTSeq.FastaReader(args.reffa, raw_iterator=True)) - print("Done reading...%d sequences!"%(len(sequences))) + sequences = dict( + (s[1], s[0]) for s in HTSeq.FastaReader(args.reffa, raw_iterator=True) + ) + print("Done 
reading...%d sequences!" % (len(sequences))) print("Reading/Writing...BSJs...") bsjs = dict() - with gzip.open(args.outbsjbedgz,'wt') as bsjfile: - with gzip.open(args.inbsjbedgz,'rt') as tfile: + with gzip.open(args.outbsjbedgz, "wt") as bsjfile: + with gzip.open(args.inbsjbedgz, "rt") as tfile: for l in tfile: bsj = BSJ(l) bsj.add_flanks(sequences) bsj.write_out_BSJ(bsjfile) - bsjs[bsj.get_jid()]=1 + bsjs[bsj.get_jid()] = 1 tfile.close() bsjfile.close() - print("Done reading/writing...%d BSJs!"%(len(bsjs))) + print("Done reading/writing...%d BSJs!" % (len(bsjs))) print("Finished!") + if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_bam_filter_BSJ_for_HQonly.py b/workflow/scripts/_bam_filter_BSJ_for_HQonly.py index 8315b0b..7a7ae6c 100755 --- a/workflow/scripts/_bam_filter_BSJ_for_HQonly.py +++ b/workflow/scripts/_bam_filter_BSJ_for_HQonly.py @@ -3,43 +3,47 @@ import pysam import os -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() + +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = 
"virus" else: exit("%s has unknown region. Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 + regions[region_name]["sequences"][s] = 1 return regions -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] + +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) -def _get_regionname_from_seqname(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: + +def _get_regionname_from_seqname(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: return k else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) + def main(): # debug = True @@ -49,76 +53,128 @@ def main(): This RG is used to extract reads from inbam and save them. """ ) - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="BSJ bam with RG set") - parser.add_argument('-t','--sample_counts_table', dest='countstable', type=str, required=True, - help='final all sample counts matrix') # get coordinates of the circRNA + parser.add_argument( + "-i", + "--inbam", + dest="inbam", + required=True, + type=str, + help="BSJ bam with RG set", + ) + parser.add_argument( + "-t", + "--sample_counts_table", + dest="countstable", + type=str, + required=True, + help="final all sample counts matrix", + ) # get coordinates of the circRNA # parser.add_argument("-o","--outbam",dest="outbam",required=True,type=argparse.FileType('w'), # help="Output bam file ... 
both strands") - parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1', - help='Sample Name') - parser.add_argument("-o","--outbam",dest="outbam",required=True,type=str, - help="Output bam file ... both strands") - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') - parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value') - parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') - parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list') - args = parser.parse_args() - - indf = pd.read_csv(args.countstable,sep="\t",header=0,compression='gzip') - indf = indf.loc[indf['HQ']=="Y"] + parser.add_argument( + "-s", + "--sample_name", + dest="samplename", + type=str, + required=False, + default="sample1", + help="Sample Name", + ) + parser.add_argument( + "-o", + "--outbam", + dest="outbam", + required=True, + type=str, + help="Output bam file ... both strands", + ) + parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", + ) + parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value", + ) + parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", + ) + parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... 
comma-separated list", + ) + args = parser.parse_args() + + indf = pd.read_csv(args.countstable, sep="\t", header=0, compression="gzip") + indf = indf.loc[indf["HQ"] == "Y"] RGlist = dict() - for index,row in indf.iterrows(): - jid = row['chrom']+"##"+str(row['start'])+"##"+str(row['end']) - RGlist[jid]=1 - print("Number of RGs: ",len(RGlist)) + for index, row in indf.iterrows(): + jid = row["chrom"] + "##" + str(row["start"]) + "##" + str(row["end"]) + RGlist[jid] = 1 + print("Number of RGs: ", len(RGlist)) samfile = pysam.AlignmentFile(args.inbam, "rb") samheader = samfile.header.to_dict() sequences = list() - for v in samheader['SQ']: - sequences.append(v['SN']) - seqname2regionname=dict() - hosts=set() - viruses=set() - regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + for v in samheader["SQ"]: + sequences.append(v["SN"]) + seqname2regionname = dict() + hosts = set() + viruses = set() + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) + hav = _get_host_additive_virus(regions, s) if hav == "host": - hostname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=hostname + hostname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = hostname hosts.add(hostname) if hav == "virus": - virusname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=virusname + virusname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = virusname viruses.add(virusname) - outbam = pysam.AlignmentFile(args.outbam, "wb", template=samfile) outputbams = dict() outdir = os.path.dirname(args.outbam) for h in hosts: - outbamname = os.path.join(outdir,args.samplename+"."+h+".HQ_only.BSJ.bam") - outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + outdir, args.samplename + "." 
+ h + ".HQ_only.BSJ.bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) for v in viruses: - outbamname = os.path.join(outdir,args.samplename+"."+v+".HQ_only.BSJ.bam") - outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header = samheader) - + outbamname = os.path.join( + outdir, args.samplename + "." + v + ".HQ_only.BSJ.bam" + ) + outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header=samheader) for read in samfile.fetch(): rg = read.get_tag("RG") rg = rg.split("##") - rg = rg[:len(rg)-1] + rg = rg[: len(rg) - 1] rg = "##".join(rg) if rg in RGlist: - regionname=_get_regionname_from_seqname(regions,read.reference_name) + regionname = _get_regionname_from_seqname(regions, read.reference_name) if regionname in hosts: outputbams[regionname].write(read) if regionname in viruses: @@ -126,7 +182,7 @@ def main(): outbam.write(read) samfile.close() outbam.close() - for k,v in outputbams.items(): + for k, v in outputbams.items(): v.close() diff --git a/workflow/scripts/_bam_get_alignment_stats.py b/workflow/scripts/_bam_get_alignment_stats.py index 61d0720..edd9ae6 100755 --- a/workflow/scripts/_bam_get_alignment_stats.py +++ b/workflow/scripts/_bam_get_alignment_stats.py @@ -1,52 +1,71 @@ #!/usr/bin/env python3 import argparse import pysam - + + def read_regions(regionsfile): - infile=open(regionsfile,'r') - regions=dict() + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() - sequence_names=l[1].split() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=dict() - return regions + regions[region_name]["sequences"][s] = dict() + return regions def main(): - parser = argparse.ArgumentParser(description='Find BAM alignment stats for each region.') - 
parser.add_argument('--inbam', dest='inbam', type=str, required=True, - help='Input BAM file') - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') + parser = argparse.ArgumentParser( + description="Find BAM alignment stats for each region." + ) + parser.add_argument( + "--inbam", dest="inbam", type=str, required=True, help="Input BAM file" + ) + parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", + ) # parser.add_argument("--out",dest="outjson",required=True,type=str, # help="Output stats in JSON format") - parser.add_argument('-p',"--pe",dest="pe",required=False,action='store_true', default=False, - help="set this if BAM is paired end") - args = parser.parse_args() + parser.add_argument( + "-p", + "--pe", + dest="pe", + required=False, + action="store_true", + default=False, + help="set this if BAM is paired end", + ) + args = parser.parse_args() samfile = pysam.AlignmentFile(args.inbam, "rb") regions = read_regions(regionsfile=args.regions) region_names = regions.keys() for read in samfile.fetch(): - if args.pe and ( read.reference_id != read.next_reference_id ): continue # only works for PE ... for SE read.next_reference_id is -1 - if args.pe and ( not read.is_proper_pair ): continue - if read.is_secondary or read.is_supplementary or read.is_unmapped : continue + if args.pe and (read.reference_id != read.next_reference_id): + continue # only works for PE ... 
for SE read.next_reference_id is -1 + if args.pe and (not read.is_proper_pair): + continue + if read.is_secondary or read.is_supplementary or read.is_unmapped: + continue rid = read.query_name refname = samfile.get_reference_name(read.reference_id) for region in region_names: - if refname in regions[region]['sequences']: - regions[region]['sequences'][refname][rid]=1 + if refname in regions[region]["sequences"]: + regions[region]["sequences"][refname][rid] = 1 break samfile.close() for region in regions: - counts=0 - for refname in regions[region]['sequences'].keys(): - counts += len(regions[region]['sequences'][refname]) - print("%d\t%s"%(counts,region)) + counts = 0 + for refname in regions[region]["sequences"].keys(): + counts += len(regions[region]["sequences"][refname]) + print("%d\t%s" % (counts, region)) if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_bamtobed2readendsbed.py b/workflow/scripts/_bamtobed2readendsbed.py index 4113f3d..28e5b87 100755 --- a/workflow/scripts/_bamtobed2readendsbed.py +++ b/workflow/scripts/_bamtobed2readendsbed.py @@ -2,23 +2,30 @@ import argparse + def main(): # debug = True debug = False - parser = argparse.ArgumentParser( - ) + parser = argparse.ArgumentParser() # INPUTs - parser.add_argument("-i","--inbed",dest="inbed",required=True,type=str, - help="Input bamtobed bed file") - parser.add_argument('-o',"--outbed",dest="outbed",required=True,type=str, - help="Output bed file") + parser.add_argument( + "-i", + "--inbed", + dest="inbed", + required=True, + type=str, + help="Input bamtobed bed file", + ) + parser.add_argument( + "-o", "--outbed", dest="outbed", required=True, type=str, help="Output bed file" + ) args = parser.parse_args() - outbed = open(args.outbed,'w') - with open(args.inbed,'r') as inbed: + outbed = open(args.outbed, "w") + with open(args.inbed, "r") as inbed: for l in inbed: - l=l.strip().split("\t") - l1=[] - l2=[] + l = l.strip().split("\t") + l1 = [] + 
l2 = [] l1.append(l[0]) l2.append(l[0]) l1.append(l[1]) @@ -26,31 +33,32 @@ def main(): l2.append(l[2]) l2.append(l[2]) if "/" in l[3]: - x=l[3].split("/") - readname=x[0] - if x[1]=="1": - strand=l[5] + x = l[3].split("/") + readname = x[0] + if x[1] == "1": + strand = l[5] else: - if l[5]=="-": - strand="+" - elif l[5]=="+": - strand="-" + if l[5] == "-": + strand = "+" + elif l[5] == "+": + strand = "-" else: - strand=l[5] + strand = l[5] else: - strand=l[5] - readname=l[3] - readname+="##"+strand + strand = l[5] + readname = l[3] + readname += "##" + strand l1.append(readname) l2.append(readname) l1.append(".") l2.append(".") l1.append(strand) l2.append(strand) - outbed.write("\t".join(l1)+"\n") - outbed.write("\t".join(l2)+"\n") + outbed.write("\t".join(l1) + "\n") + outbed.write("\t".join(l2) + "\n") inbed.close() outbed.close() + if __name__ == "__main__": main() diff --git a/workflow/scripts/_bedintersect_to_rid2jid.py b/workflow/scripts/_bedintersect_to_rid2jid.py index 231073c..38d5830 100755 --- a/workflow/scripts/_bedintersect_to_rid2jid.py +++ b/workflow/scripts/_bedintersect_to_rid2jid.py @@ -2,38 +2,73 @@ import sys import gzip + def main(): # debug = True debug = False - parser = argparse.ArgumentParser( + parser = argparse.ArgumentParser() + parser.add_argument( + "-i", + "--bedinteresection", + dest="bedint", + required=True, + type=argparse.FileType("r"), + default=sys.stdin, + help="Input BED intersection file", + ) + parser.add_argument( + "-o", + "--rid2jid", + dest="outtsv", + required=True, + type=str, + help="Output tsv... gziped", + ) + parser.add_argument( + "-m", + "--maxdist", + dest="maxdist", + required=True, + type=int, + help="Max dist from BSJ coordinate", ) - parser.add_argument("-i","--bedinteresection",dest="bedint",required=True,type=argparse.FileType('r'),default=sys.stdin, - help="Input BED intersection file") - parser.add_argument("-o","--rid2jid",dest="outtsv",required=True,type=str, - help="Output tsv... 
gziped") - parser.add_argument("-m","--maxdist",dest="maxdist",required=True,type=int, - help="Max dist from BSJ coordinate") args = parser.parse_args() # outfile = open(args.outtsv,'w') # for l in args.bedint.readlines(): - with gzip.open(args.outtsv,'wt') as outfile: + with gzip.open(args.outtsv, "wt") as outfile: for l in args.bedint: - l=l.strip().split("\t") + l = l.strip().split("\t") # print(l) # print(" abs(int(l[2])-int(l[10])) <= args.maxdist :", abs(int(l[2])-int(l[10])),(abs(int(l[2])-int(l[10])) <= args.maxdist )) # print(" abs(int(l[1])-int(l[9])) <= args.maxdist :", abs(int(l[1])-int(l[9])),(abs(int(l[1])-int(l[9])) <= args.maxdist)) # print(" abs(int(l[2])-int(l[9])) <= args.maxdist : ", abs(int(l[2])-int(l[9])),(abs(int(l[2])-int(l[9])) <= args.maxdist)) # print(" abs(int(l[1])-int(l[10])) <= args.maxdist :", abs(int(l[1])-int(l[10])),(abs(int(l[1])-int(l[10])) <= args.maxdist)) - if ( abs(int(l[2])-int(l[11])) <= args.maxdist ) or ( abs(int(l[1])-int(l[10])) <= args.maxdist ) or ( abs(int(l[2])-int(l[10])) <= args.maxdist ) or ( abs(int(l[1])-int(l[11])) <= args.maxdist ): - jid=l[0]+"##"+l[1]+"##"+str(int(l[2])-1)+"##"+l[5]+"##"+l[-1] # jid format is chrom##start##end##strand##read_strand + if ( + (abs(int(l[2]) - int(l[11])) <= args.maxdist) + or (abs(int(l[1]) - int(l[10])) <= args.maxdist) + or (abs(int(l[2]) - int(l[10])) <= args.maxdist) + or (abs(int(l[1]) - int(l[11])) <= args.maxdist) + ): + jid = ( + l[0] + + "##" + + l[1] + + "##" + + str(int(l[2]) - 1) + + "##" + + l[5] + + "##" + + l[-1] + ) # jid format is chrom##start##end##strand##read_strand # outl=l[3:] - rid=l[12] - outl=[rid] + rid = l[12] + outl = [rid] outl.append(jid) - outstr="\t".join(outl) - outfile.write("%s\n"%(outstr)) + outstr = "\t".join(outl) + outfile.write("%s\n" % (outstr)) args.bedint.close() outfile.close() + if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_bedpe2bed.py b/workflow/scripts/_bedpe2bed.py 
index aa62b04..7d607b6 100755 --- a/workflow/scripts/_bedpe2bed.py +++ b/workflow/scripts/_bedpe2bed.py @@ -4,21 +4,23 @@ import gzip import pprint + def main(): # debug = True debug = False - parser = argparse.ArgumentParser( + parser = argparse.ArgumentParser() + parser.add_argument( + "-i", "--bedpe", dest="bedpe", required=True, type=str, help="Input BEDPE file" + ) + parser.add_argument( + "-o", "--bed", dest="bed", required=True, type=str, help="Output BED file" ) - parser.add_argument("-i","--bedpe",dest="bedpe",required=True,type=str, - help="Input BEDPE file") - parser.add_argument("-o","--bed",dest="bed",required=True,type=str, - help="Output BED file") args = parser.parse_args() - infile = open(args.bedpe,'r') - outfile = open(args.bed,'w') + infile = open(args.bedpe, "r") + outfile = open(args.bed, "w") for x in infile.readlines(): - x=x.strip().split("\t") - chrom=x[0] + x = x.strip().split("\t") + chrom = x[0] if int(x[1]) < int(x[4]): left = x[1] else: @@ -29,12 +31,14 @@ def main(): right = x[5] rid = x[6] score = x[7] - strand = x[8] # read1 strand - outfile.write("%s\t%s\t%s\t%s\t%s\t%s\n"%(chrom,left,right,rid,score,strand)) - + strand = x[8] # read1 strand + outfile.write( + "%s\t%s\t%s\t%s\t%s\t%s\n" % (chrom, left, right, rid, score, strand) + ) + infile.close() outfile.close() if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_circExplorer_BSJ_get_strand.py b/workflow/scripts/_circExplorer_BSJ_get_strand.py index a7d074f..6a21ee3 100755 --- a/workflow/scripts/_circExplorer_BSJ_get_strand.py +++ b/workflow/scripts/_circExplorer_BSJ_get_strand.py @@ -1,13 +1,17 @@ import sys -stats=dict() -mreads=int(sys.argv[3]) # minreads -#read junction.filter1 + +stats = dict() +mreads = int(sys.argv[3]) # minreads +# read junction.filter1 with open(sys.argv[1]) as junction: for l in junction.readlines(): - l=l.strip().split("\t") - if l[0]!=l[3]:continue - if l[2]!=l[5]:continue - if l[1]==l[4]:continue 
+        l = l.strip().split("\t")
+        if l[0] != l[3]:
+            continue
+        if l[2] != l[5]:
+            continue
+        if l[1] == l[4]:
+            continue
         if int(l[1]) > int(l[4]):
             end = l[1]
             start = l[4]
@@ -16,28 +20,29 @@
             start = l[1]
         jid = l[0] + "##" + start + "##" + end
         if not jid in stats:
-            stats[jid]=dict()
-            stats[jid]["+"]=0
-            stats[jid]["-"]=0
-        stats[jid][l[2]]+=1
-#read back_spliced_junction.filter2.bed
+            stats[jid] = dict()
+            stats[jid]["+"] = 0
+            stats[jid]["-"] = 0
+        stats[jid][l[2]] += 1
+# read back_spliced_junction.filter2.bed
 with open(sys.argv[2]) as bsjbed:
     for l in bsjbed.readlines():
-        l=l.strip().split("\t")
-        if l[1]==l[2]:continue
-        jname,count=l[3].split("/")
-        if int(count) stats[bsjid]["-"]:
-            strand="+"
+            strand = "+"
         else:
-            strand="-"
-        l[5]=strand
-        print("\t".join(l))
\ No newline at end of file
+            strand = "-"
+        l[5] = strand
+        print("\t".join(l))
diff --git a/workflow/scripts/_collapse_find_circ.py b/workflow/scripts/_collapse_find_circ.py
index 209aea4..fa406a8 100755
--- a/workflow/scripts/_collapse_find_circ.py
+++ b/workflow/scripts/_collapse_find_circ.py
@@ -1,20 +1,21 @@
 import sys
-collection=dict()
+
+collection = dict()
 for f in sys.stdin:
-    f=f.strip().split("\t")
-    circid="##".join([f[0],f[1],f[2],f[5]])
-    if not circid in collection:
-        collection[circid]=dict()
-        collection[circid]['fullline']=f
-        collection[circid]['count']=int(f[4])
-    else:
-        collection[circid]['count']+=int(f[4])
-#header=["chrom","start","end","name","n_reads","strand","n_uniq","uniq_bridges","best_qual_left","best_qual_right","tissues","tiss_counts","edits","anchor_overlap","breakpoints","signal","strandmatch","category"]
-#print("\t".join(header))
-count=0
-for k,v in collection.items():
-    count+=1
-    x=v['fullline']
-    x[3]=str(count)
-    x[4]=str(v['count'])
-    print("\t".join(x))
\ No newline at end of file
+    f = f.strip().split("\t")
+    circid = "##".join([f[0], f[1], f[2], f[5]])
+    if not circid in collection:
+        collection[circid] = dict()
+        collection[circid]["fullline"] = f
+        collection[circid]["count"] = int(f[4])
+    else:
+        collection[circid]["count"] += int(f[4])
+# header=["chrom","start","end","name","n_reads","strand","n_uniq","uniq_bridges","best_qual_left","best_qual_right","tissues","tiss_counts","edits","anchor_overlap","breakpoints","signal","strandmatch","category"]
+# print("\t".join(header))
+count = 0
+for k, v in collection.items():
+    count += 1
+    x = v["fullline"]
+    x[3] = str(count)
+    x[4] = str(v["count"])
+    print("\t".join(x))
diff --git a/workflow/scripts/_compare_lists.py b/workflow/scripts/_compare_lists.py
index 6a8f8ff..855e82e 100755
--- a/workflow/scripts/_compare_lists.py
+++ b/workflow/scripts/_compare_lists.py
@@ -2,34 +2,55 @@
 import matplotlib
 import numpy
 import scipy
-#from matplotlib_venn import venn2
-#import matplotlib.pyplot as plt
-if len(sys.argv)<3:
-    print("python %s a_list b_list"%(sys.argv[0]))
-    exit()
-a_set=set(list(filter(lambda x:x!="",list(map(lambda x:x.strip().split("\t")[0],open(sys.argv[1]).readlines())))))
-b_set=set(list(filter(lambda x:x!="",list(map(lambda x:x.strip().split("\t")[0],open(sys.argv[2]).readlines())))))
-a_intersect_b=a_set.intersection(b_set)
-a_union_b=a_set.union(b_set)
-a_only=a_set-b_set
-b_only=b_set-a_set
-print("Size of a_list=%d"%(len(a_set)))
-print("Size of b_list=%d"%(len(b_set)))
-print("a interset b=%d"%(len(a_intersect_b)))
-print("a union b=%d"%(len(a_union_b)))
-print("only a=%d"%(len(a_only)))
-print("only b=%d"%(len(b_only)))
-if len(sys.argv)==4:
-    def write_list_to_file(a_set,filename):
-        o=open(filename,'w')
-        for g in a_set:
-            o.write("%s\n"%(g))
-        o.close()
-    write_list_to_file(a_intersect_b,"a_intersect_b.lst")
-    write_list_to_file(a_union_b,"a_union_b.lst")
-    write_list_to_file(a_only,"a_only.lst")
-    write_list_to_file(b_only,"b_only.lst")
-#venn2(subsets = (len(a_only), len(b_only), len(a_intersect_b)))
-#plt.savefig("ab_venn.png")
-exit()
\ No newline at end of file
+# from matplotlib_venn import venn2
+# import matplotlib.pyplot as plt
+
+if len(sys.argv) < 3:
+    print("python %s a_list b_list" % (sys.argv[0]))
+    exit()
+a_set = set(
+    list(
+        filter(
+            lambda x: x != "",
+            list(
+                map(lambda x: x.strip().split("\t")[0], open(sys.argv[1]).readlines())
+            ),
+        )
+    )
+)
+b_set = set(
+    list(
+        filter(
+            lambda x: x != "",
+            list(
+                map(lambda x: x.strip().split("\t")[0], open(sys.argv[2]).readlines())
+            ),
+        )
+    )
+)
+a_intersect_b = a_set.intersection(b_set)
+a_union_b = a_set.union(b_set)
+a_only = a_set - b_set
+b_only = b_set - a_set
+print("Size of a_list=%d" % (len(a_set)))
+print("Size of b_list=%d" % (len(b_set)))
+print("a intersect b=%d" % (len(a_intersect_b)))
+print("a union b=%d" % (len(a_union_b)))
+print("only a=%d" % (len(a_only)))
+print("only b=%d" % (len(b_only)))
+if len(sys.argv) == 4:
+
+    def write_list_to_file(a_set, filename):
+        o = open(filename, "w")
+        for g in a_set:
+            o.write("%s\n" % (g))
+        o.close()
+
+    write_list_to_file(a_intersect_b, "a_intersect_b.lst")
+    write_list_to_file(a_union_b, "a_union_b.lst")
+    write_list_to_file(a_only, "a_only.lst")
+    write_list_to_file(b_only, "b_only.lst")
+# venn2(subsets = (len(a_only), len(b_only), len(a_intersect_b)))
+# plt.savefig("ab_venn.png")
+exit()
diff --git a/workflow/scripts/_create_circExplorer_BSJ_bam_pe.py b/workflow/scripts/_create_circExplorer_BSJ_bam_pe.py
index c014324..618f06e 100755
--- a/workflow/scripts/_create_circExplorer_BSJ_bam_pe.py
+++ b/workflow/scripts/_create_circExplorer_BSJ_bam_pe.py
@@ -5,9 +5,11 @@
 import os
 import time
 
+
 def get_ctime():
     return time.ctime(time.time())
 
+
 """
 This script first validates each read to be "valid" BSJ read and then splits
 a BSJ bam file by strand into:
@@ -16,10 +18,10 @@ def get_ctime():
 3. BSJ bed file with score(number of reads supporting the BSJ) and strand information
 Logic (for PE reads):
 Each BSJ is represented by a 3 alignments in the output BAM file.
-Alignment 1 is complete alignment of one of the reads in pair and
-Alignments 2 and 3 are split alignment of the mate at two distinct loci on the same reference
+Alignment 1 is complete alignment of one of the reads in pair and
+Alignments 2 and 3 are split alignment of the mate at two distinct loci on the same reference
 chromosome.
-These alignments are grouped together by the "HI" tags in SAM file. For example, all 3
+These alignments are grouped together by the "HI" tags in SAM file. For example, all 3
 alignments for the same BSJ will have the same "HI" value... something like "HI:i:1".
 BSJ alignment sam bitflag combinations can have 8 different possibilities, 4 from
 sense strand and 4 from anti-sense strand:
@@ -35,12 +37,12 @@ def get_ctime():
 # |<------------------BSJ----------------->| 3. 83,163,2209 4. 339,419,2465
-# R1
-# <------
+# R1
+# <------
 # 5'--|------------------------------------------|---3'
 # 3'--|------------------------------------------|---5'
 #     |------>                           ------>|
-# |  R2.2                                  R2.1 |
+# |  R2.2                                  R2.1 |
 # |                                             |
 # |<-----------------BSJ-------------------->| 5. 99,147,2193
@@ -55,12 +57,12 @@ def get_ctime():
 # |<------------------BSJ----------------->| 7. 99,147,2145 8. 355, 403, 2401
-# R2
-# <------
+# R2
+# <------
 # 5'--|------------------------------------------|---3'
 # 3'--|------------------------------------------|---5'
 #     |------>                           ------>|
-# |  R1.2                                  R1.1 |
+# |  R1.2                                  R1.1 |
 # |                                             |
 # |<-----------------BSJ-------------------->|
 """
@@ -68,38 +70,38 @@
 class BSJ:
     def __init__(self):
-        self.chrom=""
-        self.start=""
-        self.end=""
-        self.score=0
-        self.name="."
-        self.strand="U"
-        self.bitids=set()
-        self.rids=set()
-
+        self.chrom = ""
+        self.start = ""
+        self.end = ""
+        self.score = 0
+        self.name = "."
+        self.strand = "U"
+        self.bitids = set()
+        self.rids = set()
+
     def plusone(self):
-        self.score+=1
-
-    def set_strand(self,strand):
-        self.strand=strand
-
-    def set_chrom(self,chrom):
-        self.chrom=chrom
-
-    def set_start(self,start):
-        self.start=start
-
-    def set_end(self,end):
-        self.end=end
-
-    def append_bitid(self,bitid):
+        self.score += 1
+
+    def set_strand(self, strand):
+        self.strand = strand
+
+    def set_chrom(self, chrom):
+        self.chrom = chrom
+
+    def set_start(self, start):
+        self.start = start
+
+    def set_end(self, end):
+        self.end = end
+
+    def append_bitid(self, bitid):
         self.bitids.add(bitid)
 
-    def append_rid(self,rid):
+    def append_rid(self, rid):
         self.rids.add(rid)
-
-    def write_out_BSJ(self,outbed):
-        t=[]
+
+    def write_out_BSJ(self, outbed):
+        t = []
         t.append(self.chrom)
         t.append(str(self.start))
         t.append(str(self.end))
@@ -108,149 +110,164 @@ def write_out_BSJ(self,outbed):
         t.append(self.strand)
         t.append(",".join(self.bitids))
         t.append(",".join(self.rids))
-        outbed.write("\t".join(t)+"\n")
+        outbed.write("\t".join(t) + "\n")
 
-    def update_score_and_found_count(self,junctions_found):
+    def update_score_and_found_count(self, junctions_found):
         self.score = len(self.rids)
-        jid = self.chrom + "##" + str(self.start) + "##" + str(int(self.end)-1) + "##" + self.strand
-        junctions_found[jid]+=self.score
+        jid = (
+            self.chrom
+            + "##"
+            + str(self.start)
+            + "##"
+            + str(int(self.end) - 1)
+            + "##"
+            + self.strand
+        )
+        junctions_found[jid] += self.score
+
-
 class Readinfo:
-    def __init__(self,readid,rname):
-        self.readid=readid
-        self.refname=rname
-        self.bitflags=list()
-        self.bitid=""
-        self.strand="."
-        self.start=-1
-        self.end=-1
-        self.refcoordinates=dict()
-        self.isread1=dict()
-        self.isreverse=dict()
-        self.issecondary=dict()
-        self.issupplementary=dict()
-
+    def __init__(self, readid, rname):
+        self.readid = readid
+        self.refname = rname
+        self.bitflags = list()
+        self.bitid = ""
+        self.strand = "."
+        self.start = -1
+        self.end = -1
+        self.refcoordinates = dict()
+        self.isread1 = dict()
+        self.isreverse = dict()
+        self.issecondary = dict()
+        self.issupplementary = dict()
+
     def __str__(self):
-        s = "readid: %s"%(self.readid)
-        s = "%s\tbitflags: %s"%(s,self.bitflags)
-        s = "%s\tbitid: %s"%(s,self.bitid)
+        s = "readid: %s" % (self.readid)
+        s = "%s\tbitflags: %s" % (s, self.bitflags)
+        s = "%s\tbitid: %s" % (s, self.bitid)
         for bf in self.bitflags:
-            s = "%s\t%s\trefcoordinates: %s"%(s,bf,", ".join(list(map(lambda x:str(x),self.refcoordinates[bf]))))
+            s = "%s\t%s\trefcoordinates: %s" % (
+                s,
+                bf,
+                ", ".join(list(map(lambda x: str(x), self.refcoordinates[bf]))),
+            )
         return s
 
-    def set_refcoordinates(self,bitflag,refpos):
-        self.refcoordinates[bitflag]=refpos
-
-    def set_read1_reverse_secondary_supplementary(self,bitflag,read):
+    def set_refcoordinates(self, bitflag, refpos):
+        self.refcoordinates[bitflag] = refpos
+
+    def set_read1_reverse_secondary_supplementary(self, bitflag, read):
         if read.is_read1:
-            self.isread1[bitflag]="Y"
+            self.isread1[bitflag] = "Y"
         else:
-            self.isread1[bitflag]="N"
+            self.isread1[bitflag] = "N"
         if read.is_reverse:
-            self.isreverse[bitflag]="Y"
+            self.isreverse[bitflag] = "Y"
         else:
-            self.isreverse[bitflag]="N"
+            self.isreverse[bitflag] = "N"
         if read.is_secondary:
-            self.issecondary[bitflag]="Y"
+            self.issecondary[bitflag] = "Y"
         else:
-            self.issecondary[bitflag]="N"
+            self.issecondary[bitflag] = "N"
         if read.is_supplementary:
-            self.issupplementary[bitflag]="Y"
+            self.issupplementary[bitflag] = "Y"
         else:
-            self.issupplementary[bitflag]="N"
-
-    def append_alignment(self,read):
+            self.issupplementary[bitflag] = "N"
+
+    def append_alignment(self, read):
         self.alignments.append(read)
-
-    def append_bitflag(self,bf):
+
+    def append_bitflag(self, bf):
         self.bitflags.append(bf)
-
+
     # def extend_ref_positions(self,refcoords):
     #     self.refcoordinates.extend(refcoords)
-
+
     def generate_bitid(self):
-        bitlist=sorted(self.bitflags)
-        self.bitid="##".join(list(map(lambda x:str(x),bitlist)))
-#        self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2])
-
+        bitlist = sorted(self.bitflags)
+        self.bitid = "##".join(list(map(lambda x: str(x), bitlist)))
+
+    # self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2])
+
     def get_strand(self):
-        if self.bitid=="83##163##2129":
-            self.strand="+"
-        elif self.bitid=="339##419##2385":
-            self.strand="+"
-        elif self.bitid=="83##163##2209":
-            self.strand="+"
-        elif self.bitid=="339##419##2465":
-            self.strand="+"
-        elif self.bitid=="99##147##2193":
-            self.strand="-"
-        elif self.bitid=="355##403##2449":
-            self.strand="-"
-        elif self.bitid=="99##147##2145":
-            self.strand="-"
-        elif self.bitid=="355##403##2401":
-            self.strand="-"
-        elif self.bitid=="16##2064":
-            self.strand="+"
-        elif self.bitid=="272##2320":
-            self.strand="+"
-        elif self.bitid=="0##2048":
-            self.strand="-"
-        elif self.bitid=="256##2304":
-            self.strand="-"
-        elif self.bitid=="153##2201":
-            self.strand="-"
+        if self.bitid == "83##163##2129":
+            self.strand = "+"
+        elif self.bitid == "339##419##2385":
+            self.strand = "+"
+        elif self.bitid == "83##163##2209":
+            self.strand = "+"
+        elif self.bitid == "339##419##2465":
+            self.strand = "+"
+        elif self.bitid == "99##147##2193":
+            self.strand = "-"
+        elif self.bitid == "355##403##2449":
+            self.strand = "-"
+        elif self.bitid == "99##147##2145":
+            self.strand = "-"
+        elif self.bitid == "355##403##2401":
+            self.strand = "-"
+        elif self.bitid == "16##2064":
+            self.strand = "+"
+        elif self.bitid == "272##2320":
+            self.strand = "+"
+        elif self.bitid == "0##2048":
+            self.strand = "-"
+        elif self.bitid == "256##2304":
+            self.strand = "-"
+        elif self.bitid == "153##2201":
+            self.strand = "-"
         else:
-            self.strand="."
+            self.strand = "."
+
     def flip_strand(self):
-        if self.strand=="+":self.strand="-"
-        if self.strand=="-":self.strand="+"
+        if self.strand == "+":
+            self.strand = "-"
+        elif self.strand == "-":  # elif: a second independent if would flip back
+            self.strand = "+"
 
-    def validate_BSJ_read(self,junctions):
+    def validate_BSJ_read(self, junctions):
         """
         Checks if read is truly a BSJ originitor.
         * Defines left, right and middle alignments
         * Left and right alignments should not overlap
         * Middle alignment should be between left and right alignments
         """
-        if len(self.bitid.split("##"))==3:
-            left=-1
-            right=-1
-            middle=-1
-            if self.bitid=="83##163##2129":
-                left=2129
-                right=83
-                middle=163
-            if self.bitid=="339##419##2385":
-                left=2385
-                right=339
-                middle=419
-            if self.bitid=="83##163##2209":
-                left=163
-                right=2209
-                middle=83
-            if self.bitid=="339##419##2465":
-                left=419
-                right=2465
-                middle=339
-            if self.bitid=="99##147##2145":
-                left=99
-                right=2145
-                middle=147
-            if self.bitid=="355##403##2401":
-                left=355
-                right=2401
-                middle=403
-            if self.bitid=="99##147##2193":
-                left=2193
-                right=147
-                middle=99
-            if self.bitid=="355##403##2449":
-                left=2449
-                right=403
-                middle=355
+        if len(self.bitid.split("##")) == 3:
+            left = -1
+            right = -1
+            middle = -1
+            if self.bitid == "83##163##2129":
+                left = 2129
+                right = 83
+                middle = 163
+            if self.bitid == "339##419##2385":
+                left = 2385
+                right = 339
+                middle = 419
+            if self.bitid == "83##163##2209":
+                left = 163
+                right = 2209
+                middle = 83
+            if self.bitid == "339##419##2465":
+                left = 419
+                right = 2465
+                middle = 339
+            if self.bitid == "99##147##2145":
+                left = 99
+                right = 2145
+                middle = 147
+            if self.bitid == "355##403##2401":
+                left = 355
+                right = 2401
+                middle = 403
+            if self.bitid == "99##147##2193":
+                left = 2193
+                right = 147
+                middle = 99
+            if self.bitid == "355##403##2449":
+                left = 2449
+                right = 403
+                middle = 355
             # print(left,right,middle)
             if left == -1 or right == -1 or middle == -1:
                 return False
@@ -261,89 +278,95 @@ def validate_BSJ_read(self,junctions):
         # print("validate_BSJ_read",self.readid,self.refcoordinates[middle][0],self.refcoordinates[middle][-1])
         leftmost = str(self.refcoordinates[left][0])
         rightmost = str(self.refcoordinates[right][-1])
-        possiblejid = chrom+"##"+leftmost+"##"+rightmost+"##"+self.strand
+        possiblejid = (
+            chrom + "##" + leftmost + "##" + rightmost + "##" + self.strand
+        )
         # print("validate_BSJ_read",self.readid,possiblejid)
         if possiblejid in junctions:
             self.start = leftmost
-            self.end = str(int(rightmost) + 1) # this will be added to the BED file
+            self.end = str(int(rightmost) + 1)  # this will be added to the BED file
             return True
         else:
             return False
-
-
-
+
     def get_bsjid(self):
-        t=[]
+        t = []
         t.append(self.refname)
         t.append(self.start)
         t.append(self.end)
         t.append(self.strand)
         return "##".join(t)
-
-    def write_out_reads(self,outbam):
+
+    def write_out_reads(self, outbam):
         for r in self.alignments:
             outbam.write(r)
-
-
+
+
 def get_uniq_readid(r):
-    rname=r.query_name
-    hi=r.get_tag("HI")
-    rid=rname+"##"+str(hi)
+    rname = r.query_name
+    hi = r.get_tag("HI")
+    rid = rname + "##" + str(hi)
     return rid
 
+
 def get_bitflag(r):
-    bitflag=str(r).split("\t")[1]
+    bitflag = str(r).split("\t")[1]
     return int(bitflag)
 
+
 def _bsjid2chrom(bsjid):
-    x=bsjid.split("##")
+    x = bsjid.split("##")
    return x[0]
 
+
 def _bsjid2jid(bsjid):
-    x=bsjid.split("##")
-    chrom=x[0]
-    start=x[1]
-    end=str(int(x[2])-1)
-    jid="##".join([chrom,start,end])
-    return jid,chrom
-
-def read_regions(regionsfile,host,additives,viruses):
-    host=host.split(",")
-    additives=additives.split(",")
-    viruses=viruses.split(",")
-    infile=open(regionsfile,'r')
-    regions=dict()
+    x = bsjid.split("##")
+    chrom = x[0]
+    start = x[1]
+    end = str(int(x[2]) - 1)
+    jid = "##".join([chrom, start, end])
+    return jid, chrom
+
+
+def read_regions(regionsfile, host, additives, viruses):
+    host = host.split(",")
+    additives = additives.split(",")
+    viruses = viruses.split(",")
+    infile = open(regionsfile, "r")
+    regions = dict()
     for l in infile.readlines():
         l = l.strip().split("\t")
-        region_name=l[0]
-        regions[region_name]=dict()
-        regions[region_name]['sequences']=dict()
+        region_name = l[0]
+        regions[region_name] = dict()
+        regions[region_name]["sequences"] = dict()
         if region_name in host:
-            regions[region_name]['host_additive_virus']="host"
+            regions[region_name]["host_additive_virus"] = "host"
         elif region_name in additives:
-            regions[region_name]['host_additive_virus']="additive"
+            regions[region_name]["host_additive_virus"] = "additive"
         elif region_name in viruses:
-            regions[region_name]['host_additive_virus']="virus"
+            regions[region_name]["host_additive_virus"] = "virus"
         else:
             exit("%s has unknown region. Its not a host or a additive or a virus!!")
-        sequence_names=l[1].split()
+        sequence_names = l[1].split()
         for s in sequence_names:
-            regions[region_name]['sequences'][s]=1
-    return regions
+            regions[region_name]["sequences"][s] = 1
+    return regions
+
-def _get_host_additive_virus(regions,seqname):
-    for k,v in regions.items():
-        if seqname in v['sequences']:
-            return v['host_additive_virus']
+def _get_host_additive_virus(regions, seqname):
+    for k, v in regions.items():
+        if seqname in v["sequences"]:
+            return v["host_additive_virus"]
     else:
-        exit("Sequence: %s does not have a region."%(seqname))
+        exit("Sequence: %s does not have a region." % (seqname))
 
-def _get_regionname_from_seqname(regions,seqname):
-    for k,v in regions.items():
-        if seqname in v['sequences']:
+
+def _get_regionname_from_seqname(regions, seqname):
+    for k, v in regions.items():
+        if seqname in v["sequences"]:
             return k
     else:
-        exit("Sequence: %s does not have a region."%(seqname))
+        exit("Sequence: %s does not have a region." % (seqname))
 
 
 def main():
@@ -355,193 +378,346 @@ def main():
         where the chrom, start and end represent the BSJ the read is depicting.
         """
     )
-    parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str,
-        help="Input Chimeric-only STAR2p BAM file")
-    parser.add_argument('-t','--sample_counts_table', dest='countstable', type=str, required=True,
-        help='circExplore per-sample counts table') # get coordinates of the circRNA
-    parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1',
-        help='Sample Name: SM for RG')
-    parser.add_argument("-l",'--library', dest='library', type=str, required=False, default = 'lib1',
-        help='Sample Name: LB for RG')
-    parser.add_argument("-f",'--platform', dest='platform', type=str, required=False, default = 'illumina',
-        help='Sample Name: PL for RG')
-    parser.add_argument("-u",'--unit', dest='unit', type=str, required=False, default = 'unit1',
-        help='Sample Name: PU for RG')
-    parser.add_argument("-o","--outbam",dest="outbam",required=True,type=argparse.FileType('w'),
-        help="Output bam file ... both strands")
-    parser.add_argument("-p","--plusbam",dest="plusbam",required=True,type=argparse.FileType('w'),
-        help="Output plus strand bam file")
-    parser.add_argument("-m","--minusbam",dest="minusbam",required=True,type=argparse.FileType('w'),
-        help="Output plus strand bam file")
-    parser.add_argument("--outputhostbams",dest="outputhostbams",required=False,action='store_true', default=False,
-        help="Output individual host BAM files")
-    parser.add_argument("--outputvirusbams",dest="outputvirusbams",required=False,action='store_true', default=False,
-        help="Output individual virus BAM files")
-    parser.add_argument("--outdir",dest="outdir",required=False,type=str,
-        help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).")
-    parser.add_argument("-b","--bed",dest="bed",required=True,type=str,
-        help="Output BSJ bed.gz file (with strand info)")
-    parser.add_argument("-j","--junctionsfound",dest="junctionsfound",required=True,type=argparse.FileType('w', encoding='UTF-8'),
-        help="Output TSV file with counts of junctions expected vs found")
-    parser.add_argument('--regions', dest='regions', type=str, required=True,
-        help='regions file eg. ref.fa.regions')
-    parser.add_argument('--host', dest='host', type=str, required=True,
-        help='host name eg.hg38... single value')
-    parser.add_argument('--additives', dest='additives', type=str, required=True,
-        help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out')
-    parser.add_argument('--viruses', dest='viruses', type=str, required=True,
-        help='virus name(s) eg.NC_009333.1... comma-separated list')
-    args = parser.parse_args()
+    parser.add_argument(
+        "-i",
+        "--inbam",
+        dest="inbam",
+        required=True,
+        type=str,
+        help="Input Chimeric-only STAR2p BAM file",
+    )
+    parser.add_argument(
+        "-t",
+        "--sample_counts_table",
+        dest="countstable",
+        type=str,
+        required=True,
+        help="circExplore per-sample counts table",
+    )  # get coordinates of the circRNA
+    parser.add_argument(
+        "-s",
+        "--sample_name",
+        dest="samplename",
+        type=str,
+        required=False,
+        default="sample1",
+        help="Sample Name: SM for RG",
+    )
+    parser.add_argument(
+        "-l",
+        "--library",
+        dest="library",
+        type=str,
+        required=False,
+        default="lib1",
+        help="Sample Name: LB for RG",
+    )
+    parser.add_argument(
+        "-f",
+        "--platform",
+        dest="platform",
+        type=str,
+        required=False,
+        default="illumina",
+        help="Sample Name: PL for RG",
+    )
+    parser.add_argument(
+        "-u",
+        "--unit",
+        dest="unit",
+        type=str,
+        required=False,
+        default="unit1",
+        help="Sample Name: PU for RG",
+    )
+    parser.add_argument(
+        "-o",
+        "--outbam",
+        dest="outbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output bam file ... both strands",
+    )
+    parser.add_argument(
+        "-p",
+        "--plusbam",
+        dest="plusbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output plus strand bam file",
+    )
+    parser.add_argument(
+        "-m",
+        "--minusbam",
+        dest="minusbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output minus strand bam file",
+    )
+    parser.add_argument(
+        "--outputhostbams",
+        dest="outputhostbams",
+        required=False,
+        action="store_true",
+        default=False,
+        help="Output individual host BAM files",
+    )
+    parser.add_argument(
+        "--outputvirusbams",
+        dest="outputvirusbams",
+        required=False,
+        action="store_true",
+        default=False,
+        help="Output individual virus BAM files",
+    )
+    parser.add_argument(
+        "--outdir",
+        dest="outdir",
+        required=False,
+        type=str,
+        help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).",
+    )
+    parser.add_argument(
+        "-b",
+        "--bed",
+        dest="bed",
+        required=True,
+        type=str,
+        help="Output BSJ bed.gz file (with strand info)",
+    )
+    parser.add_argument(
+        "-j",
+        "--junctionsfound",
+        dest="junctionsfound",
+        required=True,
+        type=argparse.FileType("w", encoding="UTF-8"),
+        help="Output TSV file with counts of junctions expected vs found",
+    )
+    parser.add_argument(
+        "--regions",
+        dest="regions",
+        type=str,
+        required=True,
+        help="regions file eg. ref.fa.regions",
+    )
+    parser.add_argument(
+        "--host",
+        dest="host",
+        type=str,
+        required=True,
+        help="host name eg.hg38... single value",
+    )
+    parser.add_argument(
+        "--additives",
+        dest="additives",
+        type=str,
+        required=True,
+        help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out",
+    )
+    parser.add_argument(
+        "--viruses",
+        dest="viruses",
+        type=str,
+        required=True,
+        help="virus name(s) eg.NC_009333.1... comma-separated list",
+    )
+    args = parser.parse_args()
     samfile = pysam.AlignmentFile(args.inbam, "rb")
     samheader = samfile.header.to_dict()
-    samheader['RG']=list()
-    junctionsfile = open(args.countstable,'r')
-    junctions=dict()
-    junctions_found=dict()
-    print("%s | Reading...junctions!..."%(get_ctime()))
+    samheader["RG"] = list()
+    junctionsfile = open(args.countstable, "r")
+    junctions = dict()
+    junctions_found = dict()
+    print("%s | Reading...junctions!..." % (get_ctime()))
     for l in junctionsfile.readlines():
-        if "read_count" in l: continue
+        if "read_count" in l:
+            continue
         l = l.strip().split("\t")
         chrom = l[0]
         start = l[1]
-        end = str(int(l[2])-1)
+        end = str(int(l[2]) - 1)
         strand = l[3]
-        jid = chrom+"##"+start+"##"+end+"##"+strand # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching!
-        samheader['RG'].append({'ID':jid, 'LB':args.library, 'PL':args.platform, 'PU':args.unit,'SM':args.samplename})
+        jid = (
+            chrom + "##" + start + "##" + end + "##" + strand
+        )  # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching!
+ samheader["RG"].append( + { + "ID": jid, + "LB": args.library, + "PL": args.platform, + "PU": args.unit, + "SM": args.samplename, + } + ) junctions[jid] = int(l[4]) junctions_found[jid] = 0 junctionsfile.close() sequences = list() - for v in samheader['SQ']: - sequences.append(v['SN']) - seqname2regionname=dict() - hosts=set() - viruses=set() - regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + for v in samheader["SQ"]: + sequences.append(v["SN"]) + seqname2regionname = dict() + hosts = set() + viruses = set() + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) + hav = _get_host_additive_virus(regions, s) if hav == "host": - hostname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=hostname + hostname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = hostname hosts.add(hostname) if hav == "virus": - virusname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=virusname + virusname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = virusname viruses.add(virusname) - print("%s | Done reading %d junctions."%(get_ctime(),len(junctions))) + print("%s | Done reading %d junctions." % (get_ctime(), len(junctions))) - bigdict=dict() + bigdict = dict() # print("Opening...") # print(args.inbam) - print("%s | Reading...alignments!..."%(get_ctime())) - count=0 - count2=0 + print("%s | Reading...alignments!..." % (get_ctime())) + count = 0 + count2 = 0 for read in samfile.fetch(): - count+=1 - if debug: print(read,read.reference_id,read.next_reference_id) - if read.reference_id != read.next_reference_id: continue # only works for PE ... 
for SE read.next_reference_id is -1 - count2+=1 - rid=get_uniq_readid(read) # add the HI number to the readid - if debug:print(rid) + count += 1 + if debug: + print(read, read.reference_id, read.next_reference_id) + if read.reference_id != read.next_reference_id: + continue # only works for PE ... for SE read.next_reference_id is -1 + count2 += 1 + rid = get_uniq_readid(read) # add the HI number to the readid + if debug: + print(rid) if not rid in bigdict: - bigdict[rid]=Readinfo(rid,read.reference_name) + bigdict[rid] = Readinfo(rid, read.reference_name) # bigdict[rid].append_alignment(read) # since rid has HI number included ... this separates alignment by HI - bitflag=get_bitflag(read) - if debug:print(bitflag) - bigdict[rid].append_bitflag(bitflag) # each rid can have upto 3 lines in the BAM with each having its own bitflag ... collect all bigflags in a list here - refpos=list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True))) - bigdict[rid].set_refcoordinates(bitflag,refpos) # maintain a list of reference coordinated that are "aligned" for each bitflag in each rid alignment + bitflag = get_bitflag(read) + if debug: + print(bitflag) + bigdict[rid].append_bitflag( + bitflag + ) # each rid can have upto 3 lines in the BAM with each having its own bitflag ... collect all bigflags in a list here + refpos = list( + filter(lambda x: x != None, read.get_reference_positions(full_length=True)) + ) + bigdict[rid].set_refcoordinates( + bitflag, refpos + ) # maintain a list of reference coordinated that are "aligned" for each bitflag in each rid alignment # bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag,read) - if debug:print(bigdict[rid]) - print("%s | Done reading %d chimeric alignments. [%d same chrom chimeras]"%(get_ctime(),count,count2)) + if debug: + print(bigdict[rid]) + print( + "%s | Done reading %d chimeric alignments. 
[%d same chrom chimeras]" + % (get_ctime(), count, count2) + ) samfile.reset() - print("%s | Writing BAMs"%(get_ctime())) - print("%s | Re-Reading...alignments!..."%(get_ctime())) - plusfile = pysam.AlignmentFile(args.plusbam, "wb", header = samheader) - minusfile = pysam.AlignmentFile(args.minusbam, "wb", header = samheader) - outfile = pysam.AlignmentFile(args.outbam, "wb", header = samheader) + print("%s | Writing BAMs" % (get_ctime())) + print("%s | Re-Reading...alignments!..." % (get_ctime())) + plusfile = pysam.AlignmentFile(args.plusbam, "wb", header=samheader) + minusfile = pysam.AlignmentFile(args.minusbam, "wb", header=samheader) + outfile = pysam.AlignmentFile(args.outbam, "wb", header=samheader) outputbams = dict() if args.outputhostbams: for h in hosts: - outbamname = os.path.join(args.outdir,args.samplename+"."+h+".BSJ.bam") - outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + args.outdir, args.samplename + "." + h + ".BSJ.bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) if args.outputvirusbams: for v in viruses: - outbamname = os.path.join(args.outdir,args.samplename+"."+v+".BSJ.bam") - outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header = samheader) - bsjdict=dict() - bitid_counts=dict() + outbamname = os.path.join( + args.outdir, args.samplename + "." + v + ".BSJ.bam" + ) + outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header=samheader) + bsjdict = dict() + bitid_counts = dict() lenoutputbams = len(outputbams) for read in samfile.fetch(): - if read.reference_id != read.next_reference_id: continue - rid=get_uniq_readid(read) + if read.reference_id != read.next_reference_id: + continue + rid = get_uniq_readid(read) if rid in bigdict: - bigdict[rid].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... 
bitflags are pre-sorted - if debug:print(bigdict[rid]) - bigdict[rid].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered - bigdict[rid].flip_strand() # strands are flipped than those reported in the counts table .. hence flipping! - if not bigdict[rid].validate_BSJ_read(junctions=junctions): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. Also add start and end to the BSJ object + bigdict[ + rid + ].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... bitflags are pre-sorted + if debug: + print(bigdict[rid]) + bigdict[ + rid + ].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered + bigdict[ + rid + ].flip_strand() # strands are flipped than those reported in the counts table .. hence flipping! + if not bigdict[rid].validate_BSJ_read( + junctions=junctions + ): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. 
Also add start and end to the BSJ object
                 continue
             # bigdict[rid].get_start_end()
             # print(bigdict[rid])
-            bsjid=bigdict[rid].get_bsjid()
-            chrom=_bsjid2chrom(bsjid)
+            bsjid = bigdict[rid].get_bsjid()
+            chrom = _bsjid2chrom(bsjid)
             # jid,chrom=_bsjid2jid(bsjid)
             read.set_tag("RG", bsjid, value_type="Z")
-            if bigdict[rid].strand=="+":
+            if bigdict[rid].strand == "+":
                 plusfile.write(read)
-            if bigdict[rid].strand=="-":
+            if bigdict[rid].strand == "-":
                 minusfile.write(read)
             outfile.write(read)
             if lenoutputbams != 0:
-                regionname=_get_regionname_from_seqname(regions,chrom)
+                regionname = _get_regionname_from_seqname(regions, chrom)
                 if regionname in hosts and args.outputhostbams:
                     outputbams[regionname].write(read)
                 if regionname in viruses and args.outputvirusbams:
                     outputbams[regionname].write(read)
             if not bsjid in bsjdict:
-                bsjdict[bsjid]=BSJ()
+                bsjdict[bsjid] = BSJ()
                 bsjdict[bsjid].set_chrom(bigdict[rid].refname)
                 bsjdict[bsjid].set_start(bigdict[rid].start)
                 bsjdict[bsjid].set_end(bigdict[rid].end)
                 bsjdict[bsjid].set_strand(bigdict[rid].strand)
             bsjdict[bsjid].append_bitid(bigdict[rid].bitid)
             if not bigdict[rid].bitid in bitid_counts:
-                bitid_counts[bigdict[rid].bitid]=0
-            bitid_counts[bigdict[rid].bitid]+=1
+                bitid_counts[bigdict[rid].bitid] = 0
+            bitid_counts[bigdict[rid].bitid] += 1
             bsjdict[bsjid].append_rid(rid)
     plusfile.close()
     minusfile.close()
     samfile.close()
     outfile.close()
     if lenoutputbams != 0:
-        for k,v in outputbams.items():
+        for k, v in outputbams.items():
             v.close()
-    print("%s | Done!"%(get_ctime()))
+    print("%s | Done!"
% (get_ctime()))
     for b in bitid_counts.keys():
-        print(b,bitid_counts[b])
-    print("%s | Writing BED"%(get_ctime()))
+        print(b, bitid_counts[b])
+    print("%s | Writing BED" % (get_ctime()))
-    with gzip.open(args.bed,'wt') as bsjfile:
+    with gzip.open(args.bed, "wt") as bsjfile:
         for bsjid in bsjdict.keys():
             bsjdict[bsjid].update_score_and_found_count(junctions_found)
             bsjdict[bsjid].write_out_BSJ(bsjfile)
     bsjfile.close()
-
-    args.junctionsfound.write("#chrom\tstart\tend\tstrand\texpected_BSJ_reads\tfound_BSJ_reads\n")
+    args.junctionsfound.write(
+        "#chrom\tstart\tend\tstrand\texpected_BSJ_reads\tfound_BSJ_reads\n"
+    )
     for jid in junctions.keys():
-        x=jid.split("##")
-        chrom=x[0]
-        start=int(x[1])
-        end=int(x[2])+1
-        strand=x[3]
-        args.junctionsfound.write("%s\t%d\t%d\t%s\t%d\t%d\n"%(chrom,start,end,strand,junctions[jid],junctions_found[jid]))
+        x = jid.split("##")
+        chrom = x[0]
+        start = int(x[1])
+        end = int(x[2]) + 1
+        strand = x[3]
+        args.junctionsfound.write(
+            "%s\t%d\t%d\t%s\t%d\t%d\n"
+            % (chrom, start, end, strand, junctions[jid], junctions_found[jid])
+        )
     args.junctionsfound.close()
-    print("%s | ALL Done!"%(get_ctime()))
-
+    print("%s | ALL Done!" % (get_ctime()))
+

 if __name__ == "__main__":
     main()
-
-
diff --git a/workflow/scripts/_create_circExplorer_BSJ_bam_se.py b/workflow/scripts/_create_circExplorer_BSJ_bam_se.py
index 8cf5454..fc0a7c1 100755
--- a/workflow/scripts/_create_circExplorer_BSJ_bam_se.py
+++ b/workflow/scripts/_create_circExplorer_BSJ_bam_se.py
@@ -5,9 +5,11 @@
 import os
 import time
+
 def get_ctime():
     return time.ctime(time.time())
+
 """
 This script first validates each read to be "valid" BSJ read and then splits
 a BSJ bam file by strand into:
@@ -16,9 +18,9 @@ def get_ctime():
 3. BSJ bed file with score(number of reads supporting the BSJ) and strand information
 Logic (for SE reads): Each BSJ is represented by a 2 alignments in the output BAM file.
-Alignments 1 and 2 are split alignment of read1 at two distinct loci on the same reference +Alignments 1 and 2 are split alignment of read1 at two distinct loci on the same reference chromosome. -These alignments are grouped together by the "HI" tags in SAM file. For example, all 2 +These alignments are grouped together by the "HI" tags in SAM file. For example, all 2 alignments for the same BSJ will have the same "HI" value... something like "HI:i:1". BSJ alignment sam bitflag combinations can have 4 different possibilities, 2 from sense strand and 2 from anti-sense strand: @@ -31,38 +33,38 @@ def get_ctime(): class BSJ: def __init__(self): - self.chrom="" - self.start="" - self.end="" - self.score=0 - self.name="." - self.strand="U" - self.bitids=set() - self.rids=set() - + self.chrom = "" + self.start = "" + self.end = "" + self.score = 0 + self.name = "." + self.strand = "U" + self.bitids = set() + self.rids = set() + def plusone(self): - self.score+=1 - - def set_strand(self,strand): - self.strand=strand - - def set_chrom(self,chrom): - self.chrom=chrom - - def set_start(self,start): - self.start=start - - def set_end(self,end): - self.end=end - - def append_bitid(self,bitid): + self.score += 1 + + def set_strand(self, strand): + self.strand = strand + + def set_chrom(self, chrom): + self.chrom = chrom + + def set_start(self, start): + self.start = start + + def set_end(self, end): + self.end = end + + def append_bitid(self, bitid): self.bitids.add(bitid) - def append_rid(self,rid): + def append_rid(self, rid): self.rids.add(rid) - - def write_out_BSJ(self,outbed): - t=[] + + def write_out_BSJ(self, outbed): + t = [] t.append(self.chrom) t.append(str(self.start)) t.append(str(self.end)) @@ -71,192 +73,212 @@ def write_out_BSJ(self,outbed): t.append(self.strand) t.append(",".join(self.bitids)) t.append(",".join(self.rids)) - outbed.write("\t".join(t)+"\n") + outbed.write("\t".join(t) + "\n") - def update_score_and_found_count(self,junctions_found): + def 
update_score_and_found_count(self, junctions_found): self.score = len(self.rids) - jid = self.chrom + "##" + str(self.start) + "##" + str(int(self.end)-1) + "##" + self.strand - junctions_found[jid]+=self.score - + jid = ( + self.chrom + + "##" + + str(self.start) + + "##" + + str(int(self.end) - 1) + + "##" + + self.strand + ) + junctions_found[jid] += self.score + + class Readinfo: - def __init__(self,readid,rname): - self.readid=readid - self.refname=rname + def __init__(self, readid, rname): + self.readid = readid + self.refname = rname # self.alignments=list() - self.bitflags=list() - self.bitid="" - self.strand="." - self.start=-1 - self.end=-1 - self.refcoordinates=dict() - self.isread1=dict() - self.isreverse=dict() - self.issecondary=dict() - self.cigarstrs=dict() - self.issupplementary=dict() - + self.bitflags = list() + self.bitid = "" + self.strand = "." + self.start = -1 + self.end = -1 + self.refcoordinates = dict() + self.isread1 = dict() + self.isreverse = dict() + self.issecondary = dict() + self.cigarstrs = dict() + self.issupplementary = dict() + def __str__(self): - s = "readid: %s"%(self.readid) - s = "%s\tbitflags: %s"%(s,self.bitflags) - s = "%s\tisreverse: %s"%(s,self.isreverse) - s = "%s\tbitid: %s"%(s,self.bitid) + s = "readid: %s" % (self.readid) + s = "%s\tbitflags: %s" % (s, self.bitflags) + s = "%s\tisreverse: %s" % (s, self.isreverse) + s = "%s\tbitid: %s" % (s, self.bitid) return s - def set_refcoordinates(self,bitflag,refpos): - self.refcoordinates[bitflag]=refpos - - def set_cigarstr(self,bitflag,cigarstr): - self.cigarstrs[bitflag]=cigarstr - - def set_read1_reverse_secondary_supplementary(self,bitflag,read): + def set_refcoordinates(self, bitflag, refpos): + self.refcoordinates[bitflag] = refpos + + def set_cigarstr(self, bitflag, cigarstr): + self.cigarstrs[bitflag] = cigarstr + + def set_read1_reverse_secondary_supplementary(self, bitflag, read): if read.is_read1: - self.isread1[bitflag]="Y" + self.isread1[bitflag] = "Y" else: 
- self.isread1[bitflag]="N" + self.isread1[bitflag] = "N" if read.is_reverse: - self.isreverse[bitflag]="Y" + self.isreverse[bitflag] = "Y" else: - self.isreverse[bitflag]="N" + self.isreverse[bitflag] = "N" if read.is_secondary: - self.issecondary[bitflag]="Y" + self.issecondary[bitflag] = "Y" else: - self.issecondary[bitflag]="N" + self.issecondary[bitflag] = "N" if read.is_supplementary: - self.issupplementary[bitflag]="Y" + self.issupplementary[bitflag] = "Y" else: - self.issupplementary[bitflag]="N" - + self.issupplementary[bitflag] = "N" + # def append_alignment(self,read): # self.alignments.append(read) - - def append_bitflag(self,bf): + + def append_bitflag(self, bf): self.bitflags.append(bf) - + # def extend_ref_positions(self,refcoords): # self.refcoordinates.extend(refcoords) - + def generate_bitid(self): - bitlist=sorted(self.bitflags) - self.bitid="##".join(list(map(lambda x:str(x),bitlist))) -# self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) - + bitlist = sorted(self.bitflags) + self.bitid = "##".join(list(map(lambda x: str(x), bitlist))) + + # self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) + def get_strand(self): - if self.bitid=="0##2048": - self.strand="-" - elif self.bitid=="256##2304": - self.strand="-" - elif self.bitid=="16##2064": - self.strand="+" - elif self.bitid=="272##2320": - self.strand="+" + if self.bitid == "0##2048": + self.strand = "-" + elif self.bitid == "256##2304": + self.strand = "-" + elif self.bitid == "16##2064": + self.strand = "+" + elif self.bitid == "272##2320": + self.strand = "+" else: - self.strand="U" + self.strand = "U" - def validate_BSJ_read(self,junctions): + def validate_BSJ_read(self, junctions): """ Checks if read is truly a BSJ originitor. 
""" - if len(self.bitid.split("##"))==2: + if len(self.bitid.split("##")) == 2: if not self.bitid in ["0##2048", "16##2064", "256##2304", "272##2320"]: return False - count=0 - refcoords=self.refcoordinates - for k,v in refcoords.items(): - count+=1 - refcoords[k]=sorted(v) - if count==1: - astart=refcoords[k][0] - aend=refcoords[k][-1] - if count==2: - bstart=refcoords[k][0] - bend=refcoords[k][-1] + count = 0 + refcoords = self.refcoordinates + for k, v in refcoords.items(): + count += 1 + refcoords[k] = sorted(v) + if count == 1: + astart = refcoords[k][0] + aend = refcoords[k][-1] + if count == 2: + bstart = refcoords[k][0] + bend = refcoords[k][-1] chrom = self.refname - possiblejid=chrom+"##"+str(astart)+"##"+str(bend)+"##"+self.strand - possiblejid2=chrom+"##"+str(bstart)+"##"+str(aend)+"##"+self.strand + possiblejid = ( + chrom + "##" + str(astart) + "##" + str(bend) + "##" + self.strand + ) + possiblejid2 = ( + chrom + "##" + str(bstart) + "##" + str(aend) + "##" + self.strand + ) # exit() if possiblejid in junctions: self.start = astart - self.end = str(int(bend) + 1) # this will be added to the BED file + self.end = str(int(bend) + 1) # this will be added to the BED file return True if possiblejid2 in junctions: self.start = bstart - self.end = str(int(aend) + 1) # this will be added to the BED file - return True + self.end = str(int(aend) + 1) # this will be added to the BED file + return True else: return False - + def get_bsjid(self): - t=[] + t = [] t.append(self.refname) t.append(str(self.start)) t.append(str(self.end)) t.append(self.strand) return "##".join(t) - + # def write_out_reads(self,outbam): # for r in self.alignments: # outbam.write(r) - - + + def get_uniq_readid(r): - rname=r.query_name - hi=r.get_tag("HI") - rid=rname+"##"+str(hi) + rname = r.query_name + hi = r.get_tag("HI") + rid = rname + "##" + str(hi) return rid + def get_bitflag(r): - bitflag=str(r).split("\t")[1] + bitflag = str(r).split("\t")[1] return int(bitflag) + def 
_bsjid2chrom(bsjid): - x=bsjid.split("##") + x = bsjid.split("##") return x[0] + def _bsjid2jid(bsjid): - x=bsjid.split("##") - chrom=x[0] - start=x[1] - end=str(int(x[2])-1) - jid="##".join([chrom,start,end]) - return jid,chrom - -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() + x = bsjid.split("##") + chrom = x[0] + start = x[1] + end = str(int(x[2]) - 1) + jid = "##".join([chrom, start, end]) + return jid, chrom + + +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. 
Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] + +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) + -def _get_regionname_from_seqname(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: +def _get_regionname_from_seqname(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: return k else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) def main(): @@ -268,160 +290,335 @@ def main(): where the chrom, start and end represent the BSJ the read is depicting. 
""" ) - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="Input Chimeric-only STAR2p BAM file") - parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1', - help='Sample Name: SM for RG') - parser.add_argument("-l",'--library', dest='library', type=str, required=False, default = 'lib1', - help='Sample Name: LB for RG') - parser.add_argument("-f",'--platform', dest='platform', type=str, required=False, default = 'illumina', - help='Sample Name: PL for RG') - parser.add_argument("-u",'--unit', dest='unit', type=str, required=False, default = 'unit1', - help='Sample Name: PU for RG') - parser.add_argument('-t','--sample_counts_table', dest='countstable', type=str, required=True, - help='circExplore per-sample counts table') # get coordinates of the circRNA - parser.add_argument("-p","--plusbam",dest="plusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("-m","--minusbam",dest="minusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("-o","--outbam",dest="outbam",required=True,type=argparse.FileType('w'), - help="Output bam file ... 
both strands") - parser.add_argument("--outputhostbams",dest="outputhostbams",required=False,action='store_true', default=False, - help="Output individual host BAM files") - parser.add_argument("--outputvirusbams",dest="outputvirusbams",required=False,action='store_true', default=False, - help="Output individual virus BAM files") - parser.add_argument("--outdir",dest="outdir",required=False,type=str, - help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).") - parser.add_argument("-b","--bed",dest="bed",required=True,type=str, - help="Output BSJ bed.gz file (with strand info)") - parser.add_argument("-j","--junctionsfound",dest="junctionsfound",required=True,type=argparse.FileType('w', encoding='UTF-8'), - help="Output TSV file with counts of junctions expected vs found") - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') - parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value') - parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') - parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... 
comma-separated list') - args = parser.parse_args() + parser.add_argument( + "-i", + "--inbam", + dest="inbam", + required=True, + type=str, + help="Input Chimeric-only STAR2p BAM file", + ) + parser.add_argument( + "-s", + "--sample_name", + dest="samplename", + type=str, + required=False, + default="sample1", + help="Sample Name: SM for RG", + ) + parser.add_argument( + "-l", + "--library", + dest="library", + type=str, + required=False, + default="lib1", + help="Sample Name: LB for RG", + ) + parser.add_argument( + "-f", + "--platform", + dest="platform", + type=str, + required=False, + default="illumina", + help="Sample Name: PL for RG", + ) + parser.add_argument( + "-u", + "--unit", + dest="unit", + type=str, + required=False, + default="unit1", + help="Sample Name: PU for RG", + ) + parser.add_argument( + "-t", + "--sample_counts_table", + dest="countstable", + type=str, + required=True, + help="circExplore per-sample counts table", + ) # get coordinates of the circRNA + parser.add_argument( + "-p", + "--plusbam", + dest="plusbam", + required=True, + type=argparse.FileType("w"), + help="Output plus strand bam file", + ) + parser.add_argument( + "-m", + "--minusbam", + dest="minusbam", + required=True, + type=argparse.FileType("w"), + help="Output plus strand bam file", + ) + parser.add_argument( + "-o", + "--outbam", + dest="outbam", + required=True, + type=argparse.FileType("w"), + help="Output bam file ... 
both strands", + ) + parser.add_argument( + "--outputhostbams", + dest="outputhostbams", + required=False, + action="store_true", + default=False, + help="Output individual host BAM files", + ) + parser.add_argument( + "--outputvirusbams", + dest="outputvirusbams", + required=False, + action="store_true", + default=False, + help="Output individual virus BAM files", + ) + parser.add_argument( + "--outdir", + dest="outdir", + required=False, + type=str, + help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).", + ) + parser.add_argument( + "-b", + "--bed", + dest="bed", + required=True, + type=str, + help="Output BSJ bed.gz file (with strand info)", + ) + parser.add_argument( + "-j", + "--junctionsfound", + dest="junctionsfound", + required=True, + type=argparse.FileType("w", encoding="UTF-8"), + help="Output TSV file with counts of junctions expected vs found", + ) + parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", + ) + parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value", + ) + parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", + ) + parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... 
comma-separated list", + ) + args = parser.parse_args() samfile = pysam.AlignmentFile(args.inbam, "rb") samheader = samfile.header.to_dict() - samheader['RG']=list() -# bsjfile = open(args.bed,"w") - junctionsfile = open(args.countstable,'r') - junctions=dict() - junctions_found=dict() - print("%s | Reading...junctions!..."%(get_ctime())) + samheader["RG"] = list() + # bsjfile = open(args.bed,"w") + junctionsfile = open(args.countstable, "r") + junctions = dict() + junctions_found = dict() + print("%s | Reading...junctions!..." % (get_ctime())) for l in junctionsfile.readlines(): - if "read_count" in l: continue + if "read_count" in l: + continue l = l.strip().split("\t") chrom = l[0] start = l[1] - end = str(int(l[2])-1) + end = str(int(l[2]) - 1) strand = l[3] - jid = chrom+"##"+start+"##"+end+"##"+strand # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching! - samheader['RG'].append({'ID':jid, 'LB':args.library, 'PL':args.platform, 'PU':args.unit,'SM':args.samplename}) + jid = ( + chrom + "##" + start + "##" + end + "##" + strand + ) # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching! 
+ samheader["RG"].append( + { + "ID": jid, + "LB": args.library, + "PL": args.platform, + "PU": args.unit, + "SM": args.samplename, + } + ) junctions[jid] = int(l[4]) junctions_found[jid] = 0 junctionsfile.close() # print(junctions) sequences = list() - for v in samheader['SQ']: - sequences.append(v['SN']) - seqname2regionname=dict() - hosts=set() - viruses=set() - regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + for v in samheader["SQ"]: + sequences.append(v["SN"]) + seqname2regionname = dict() + hosts = set() + viruses = set() + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) + hav = _get_host_additive_virus(regions, s) if hav == "host": - hostname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=hostname + hostname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = hostname hosts.add(hostname) if hav == "virus": - virusname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=virusname + virusname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = virusname viruses.add(virusname) - print("%s | Done reading %d junctions."%(get_ctime(),len(junctions))) - + print("%s | Done reading %d junctions." % (get_ctime(), len(junctions))) - bigdict=dict() - print("%s | Reading...alignments!..."%(get_ctime())) - count=0 - count2=0 + bigdict = dict() + print("%s | Reading...alignments!..." 
% (get_ctime())) + count = 0 + count2 = 0 for read in samfile.fetch(): - count+=1 - satag=read.get_tag("SA") - satagchrids=list(map(lambda x:samfile.get_tid(x),list(filter(lambda x:x!='',list(map(lambda x:x.split(",")[0],satag.split(";"))))))) - if not read.reference_id in satagchrids: continue # specific for SE as read.next_reference_id is -1 for SE - count2+=1 - rid=get_uniq_readid(read) # add the HI number to the readid - if debug:print(rid) + count += 1 + satag = read.get_tag("SA") + satagchrids = list( + map( + lambda x: samfile.get_tid(x), + list( + filter( + lambda x: x != "", + list(map(lambda x: x.split(",")[0], satag.split(";"))), + ) + ), + ) + ) + if not read.reference_id in satagchrids: + continue # specific for SE as read.next_reference_id is -1 for SE + count2 += 1 + rid = get_uniq_readid(read) # add the HI number to the readid + if debug: + print(rid) if not rid in bigdict: - bigdict[rid]=Readinfo(rid,read.reference_name) + bigdict[rid] = Readinfo(rid, read.reference_name) # bigdict[rid].append_alignment(read) # since rid has HI number included ... this separates alignment by HI - bitflag=get_bitflag(read) - if debug:print(bitflag) - bigdict[rid].append_bitflag(bitflag) # each rid can have upto 3 lines in the BAM with each having its own bitflag ... collect all bitflags in a list here - refpos=list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True))) + bitflag = get_bitflag(read) + if debug: + print(bitflag) + bigdict[rid].append_bitflag( + bitflag + ) # each rid can have upto 3 lines in the BAM with each having its own bitflag ... 
collect all bitflags in a list here + refpos = list( + filter(lambda x: x != None, read.get_reference_positions(full_length=True)) + ) # if debug:print(refpos) - bigdict[rid].set_refcoordinates(bitflag,refpos) # maintain a list of reference coordinated that are "aligned" for each bitflag in each rid alignment - bigdict[rid].set_cigarstr(bitflag,read.cigarstring) - bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag,read) - if debug:print(bigdict[rid]) - print("%s | Done reading %d chimeric alignments. [%d same chrom chimeras]"%(get_ctime(),count,count2)) + bigdict[rid].set_refcoordinates( + bitflag, refpos + ) # maintain a list of reference coordinated that are "aligned" for each bitflag in each rid alignment + bigdict[rid].set_cigarstr(bitflag, read.cigarstring) + bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag, read) + if debug: + print(bigdict[rid]) + print( + "%s | Done reading %d chimeric alignments. [%d same chrom chimeras]" + % (get_ctime(), count, count2) + ) if debug: for rid in bigdict.keys(): - print(">>>%s\t%s\t%s\t%s"%(rid,bigdict[rid].isreverse,bigdict[rid].cigarstrs,bigdict[rid].refcoordinates)) + print( + ">>>%s\t%s\t%s\t%s" + % ( + rid, + bigdict[rid].isreverse, + bigdict[rid].cigarstrs, + bigdict[rid].refcoordinates, + ) + ) samfile.reset() - print("%s | Writing BAMs"%(get_ctime())) - plusfile = pysam.AlignmentFile(args.plusbam, "wb", header = samheader) - minusfile = pysam.AlignmentFile(args.minusbam, "wb", header = samheader) - outfile = pysam.AlignmentFile(args.outbam, "wb", header = samheader) + print("%s | Writing BAMs" % (get_ctime())) + plusfile = pysam.AlignmentFile(args.plusbam, "wb", header=samheader) + minusfile = pysam.AlignmentFile(args.minusbam, "wb", header=samheader) + outfile = pysam.AlignmentFile(args.outbam, "wb", header=samheader) outputbams = dict() if args.outputhostbams: for h in hosts: - outbamname = os.path.join(args.outdir,args.samplename+"."+h+".BSJ.bam") - outputbams[h] = 
pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + args.outdir, args.samplename + "." + h + ".BSJ.bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) if args.outputvirusbams: for v in viruses: - outbamname = os.path.join(args.outdir,args.samplename+"."+v+".BSJ.bam") - outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header = samheader) - bsjdict=dict() - bitid_counts=dict() + outbamname = os.path.join( + args.outdir, args.samplename + "." + v + ".BSJ.bam" + ) + outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header=samheader) + bsjdict = dict() + bitid_counts = dict() lenoutputbams = len(outputbams) for read in samfile.fetch(): - satag=read.get_tag("SA") - satagchrids=list(map(lambda x:samfile.get_tid(x),list(filter(lambda x:x!='',list(map(lambda x:x.split(",")[0],satag.split(";"))))))) - if not read.reference_id in satagchrids: continue # specific for SE as read.next_reference_id is -1 for SE - rid=get_uniq_readid(read) - if rid in bigdict: - bigdict[rid].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... bitflags are pre-sorted - if debug:print(bigdict[rid]) - bigdict[rid].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered - if not bigdict[rid].validate_BSJ_read(junctions=junctions): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. 
Also add start and end to the BSJ object + satag = read.get_tag("SA") + satagchrids = list( + map( + lambda x: samfile.get_tid(x), + list( + filter( + lambda x: x != "", + list(map(lambda x: x.split(",")[0], satag.split(";"))), + ) + ), + ) + ) + if not read.reference_id in satagchrids: + continue # specific for SE as read.next_reference_id is -1 for SE + rid = get_uniq_readid(read) + if rid in bigdict: + bigdict[ + rid + ].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... bitflags are pre-sorted + if debug: + print(bigdict[rid]) + bigdict[ + rid + ].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered + if not bigdict[rid].validate_BSJ_read( + junctions=junctions + ): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. Also add start and end to the BSJ object continue # bigdict[rid].get_start_end() # print(bigdict[rid]) - bsjid=bigdict[rid].get_bsjid() - chrom=_bsjid2chrom(bsjid) + bsjid = bigdict[rid].get_bsjid() + chrom = _bsjid2chrom(bsjid) # jid,chrom=_bsjid2jid(bsjid) read.set_tag("RG", bsjid, value_type="Z") - if bigdict[rid].strand=="+": + if bigdict[rid].strand == "+": plusfile.write(read) - if bigdict[rid].strand=="-": + if bigdict[rid].strand == "-": minusfile.write(read) outfile.write(read) if lenoutputbams != 0: - regionname=_get_regionname_from_seqname(regions,chrom) + regionname = _get_regionname_from_seqname(regions, chrom) if regionname in hosts and args.outputhostbams: outputbams[regionname].write(read) if regionname in viruses and args.outputvirusbams: outputbams[regionname].write(read) if not bsjid in bsjdict: - bsjdict[bsjid]=BSJ() + bsjdict[bsjid] = BSJ() bsjdict[bsjid].set_chrom(bigdict[rid].refname) bsjdict[bsjid].set_start(bigdict[rid].start) bsjdict[bsjid].set_end(bigdict[rid].end) @@ -429,42 +626,42 @@ def main(): # 
bsjdict[bsjid].plusone() bsjdict[bsjid].append_bitid(bigdict[rid].bitid) if not bigdict[rid].bitid in bitid_counts: - bitid_counts[bigdict[rid].bitid]=0 - bitid_counts[bigdict[rid].bitid]+=1 + bitid_counts[bigdict[rid].bitid] = 0 + bitid_counts[bigdict[rid].bitid] += 1 bsjdict[bsjid].append_rid(rid) plusfile.close() minusfile.close() samfile.close() outfile.close() if lenoutputbams != 0: - for k,v in outputbams.items(): + for k, v in outputbams.items(): v.close() - print("%s | Done!"%(get_ctime())) + print("%s | Done!" % (get_ctime())) for b in bitid_counts.keys(): - print(b,bitid_counts[b]) - print("%s | Writing BED"%(get_ctime())) - with gzip.open(args.bed,'wt') as bsjfile: + print(b, bitid_counts[b]) + print("%s | Writing BED" % (get_ctime())) + with gzip.open(args.bed, "wt") as bsjfile: for bsjid in bsjdict.keys(): bsjdict[bsjid].update_score_and_found_count(junctions_found) bsjdict[bsjid].write_out_BSJ(bsjfile) bsjfile.close() - args.junctionsfound.write("#chrom\tstart\tend\tstrand\texpected_BSJ_reads\tfound_BSJ_reads\n") + args.junctionsfound.write( + "#chrom\tstart\tend\tstrand\texpected_BSJ_reads\tfound_BSJ_reads\n" + ) for jid in junctions.keys(): - x=jid.split("##") - chrom=x[0] - start=int(x[1]) - end=int(x[2])+1 - strand=x[3] - args.junctionsfound.write("%s\t%d\t%d\t%s\t%d\t%d\n"%(chrom,start,end,strand,junctions[jid],junctions_found[jid])) + x = jid.split("##") + chrom = x[0] + start = int(x[1]) + end = int(x[2]) + 1 + strand = x[3] + args.junctionsfound.write( + "%s\t%d\t%d\t%s\t%d\t%d\n" + % (chrom, start, end, strand, junctions[jid], junctions_found[jid]) + ) args.junctionsfound.close() - print("%s | ALL Done!"%(get_ctime())) - - - + print("%s | ALL Done!" 
% (get_ctime())) if __name__ == "__main__": main() - - diff --git a/workflow/scripts/_create_circExplorer_BSJ_hqonly_pe.py b/workflow/scripts/_create_circExplorer_BSJ_hqonly_pe.py index 768bc1c..fb4e2dd 100755 --- a/workflow/scripts/_create_circExplorer_BSJ_hqonly_pe.py +++ b/workflow/scripts/_create_circExplorer_BSJ_hqonly_pe.py @@ -6,9 +6,11 @@ import time import pandas as pd + def get_ctime(): return time.ctime(time.time()) + """ This script first validates each read to be "valid" BSJ read and then splits a BSJ bam file by strand into: @@ -17,10 +19,10 @@ def get_ctime(): 3. BSJ bed file with score(number of reads supporting the BSJ) and strand information Logic (for PE reads): Each BSJ is represented by a 3 alignments in the output BAM file. -Alignment 1 is complete alignment of one of the reads in pair and -Alignments 2 and 3 are split alignment of the mate at two distinct loci on the same reference +Alignment 1 is complete alignment of one of the reads in pair and +Alignments 2 and 3 are split alignment of the mate at two distinct loci on the same reference chromosome. -These alignments are grouped together by the "HI" tags in SAM file. For example, all 3 +These alignments are grouped together by the "HI" tags in SAM file. For example, all 3 alignments for the same BSJ will have the same "HI" value... something like "HI:i:1". BSJ alignment sam bitflag combinations can have 8 different possibilities, 4 from sense strand and 4 from anti-sense strand: @@ -36,12 +38,12 @@ def get_ctime(): # |<------------------BSJ----------------->| 3. 83,163,2209 4. 339,419,2465 -# R1 -# <------ +# R1 +# <------ # 5'--|------------------------------------------|---3' # 3'--|------------------------------------------|---5' # |------> ------>| -# | R2.2 R2.1 | +# | R2.2 R2.1 | # | | # |<-----------------BSJ-------------------->| 5. 99,147,2193 @@ -56,12 +58,12 @@ def get_ctime(): # |<------------------BSJ----------------->| 7. 99,147,2145 8. 
355, 403, 2401 -# R2 -# <------ +# R2 +# <------ # 5'--|------------------------------------------|---3' # 3'--|------------------------------------------|---5' # |------> ------>| -# | R1.2 R1.1 | +# | R1.2 R1.1 | # | | # |<-----------------BSJ-------------------->| """ @@ -69,38 +71,38 @@ def get_ctime(): class BSJ: def __init__(self): - self.chrom="" - self.start="" - self.end="" - self.score=0 - self.name="." - self.strand="U" - self.bitids=set() - self.rids=set() - + self.chrom = "" + self.start = "" + self.end = "" + self.score = 0 + self.name = "." + self.strand = "U" + self.bitids = set() + self.rids = set() + def plusone(self): - self.score+=1 - - def set_strand(self,strand): - self.strand=strand - - def set_chrom(self,chrom): - self.chrom=chrom - - def set_start(self,start): - self.start=start - - def set_end(self,end): - self.end=end - - def append_bitid(self,bitid): + self.score += 1 + + def set_strand(self, strand): + self.strand = strand + + def set_chrom(self, chrom): + self.chrom = chrom + + def set_start(self, start): + self.start = start + + def set_end(self, end): + self.end = end + + def append_bitid(self, bitid): self.bitids.add(bitid) - def append_rid(self,rid): + def append_rid(self, rid): self.rids.add(rid) - - def write_out_BSJ(self,outbed): - t=[] + + def write_out_BSJ(self, outbed): + t = [] t.append(self.chrom) t.append(str(self.start)) t.append(str(self.end)) @@ -109,149 +111,164 @@ def write_out_BSJ(self,outbed): t.append(self.strand) t.append(",".join(self.bitids)) t.append(",".join(self.rids)) - outbed.write("\t".join(t)+"\n") + outbed.write("\t".join(t) + "\n") - def update_score_and_found_count(self,junctions_found): + def update_score_and_found_count(self, junctions_found): self.score = len(self.rids) - jid = self.chrom + "##" + str(self.start) + "##" + str(int(self.end)-1) + "##" + self.strand - junctions_found[jid]+=self.score + jid = ( + self.chrom + + "##" + + str(self.start) + + "##" + + str(int(self.end) - 1) + + "##" + + 
self.strand + ) + junctions_found[jid] += self.score + - class Readinfo: - def __init__(self,readid,rname): - self.readid=readid - self.refname=rname - self.bitflags=list() - self.bitid="" - self.strand="." - self.start=-1 - self.end=-1 - self.refcoordinates=dict() - self.isread1=dict() - self.isreverse=dict() - self.issecondary=dict() - self.issupplementary=dict() - + def __init__(self, readid, rname): + self.readid = readid + self.refname = rname + self.bitflags = list() + self.bitid = "" + self.strand = "." + self.start = -1 + self.end = -1 + self.refcoordinates = dict() + self.isread1 = dict() + self.isreverse = dict() + self.issecondary = dict() + self.issupplementary = dict() + def __str__(self): - s = "readid: %s"%(self.readid) - s = "%s\tbitflags: %s"%(s,self.bitflags) - s = "%s\tbitid: %s"%(s,self.bitid) + s = "readid: %s" % (self.readid) + s = "%s\tbitflags: %s" % (s, self.bitflags) + s = "%s\tbitid: %s" % (s, self.bitid) for bf in self.bitflags: - s = "%s\t%s\trefcoordinates: %s"%(s,bf,", ".join(list(map(lambda x:str(x),self.refcoordinates[bf])))) + s = "%s\t%s\trefcoordinates: %s" % ( + s, + bf, + ", ".join(list(map(lambda x: str(x), self.refcoordinates[bf]))), + ) return s - def set_refcoordinates(self,bitflag,refpos): - self.refcoordinates[bitflag]=refpos - - def set_read1_reverse_secondary_supplementary(self,bitflag,read): + def set_refcoordinates(self, bitflag, refpos): + self.refcoordinates[bitflag] = refpos + + def set_read1_reverse_secondary_supplementary(self, bitflag, read): if read.is_read1: - self.isread1[bitflag]="Y" + self.isread1[bitflag] = "Y" else: - self.isread1[bitflag]="N" + self.isread1[bitflag] = "N" if read.is_reverse: - self.isreverse[bitflag]="Y" + self.isreverse[bitflag] = "Y" else: - self.isreverse[bitflag]="N" + self.isreverse[bitflag] = "N" if read.is_secondary: - self.issecondary[bitflag]="Y" + self.issecondary[bitflag] = "Y" else: - self.issecondary[bitflag]="N" + self.issecondary[bitflag] = "N" if read.is_supplementary: - 
self.issupplementary[bitflag]="Y" + self.issupplementary[bitflag] = "Y" else: - self.issupplementary[bitflag]="N" - - def append_alignment(self,read): + self.issupplementary[bitflag] = "N" + + def append_alignment(self, read): self.alignments.append(read) - - def append_bitflag(self,bf): + + def append_bitflag(self, bf): self.bitflags.append(bf) - + # def extend_ref_positions(self,refcoords): # self.refcoordinates.extend(refcoords) - + def generate_bitid(self): - bitlist=sorted(self.bitflags) - self.bitid="##".join(list(map(lambda x:str(x),bitlist))) -# self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) - + bitlist = sorted(self.bitflags) + self.bitid = "##".join(list(map(lambda x: str(x), bitlist))) + + # self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) + def get_strand(self): - if self.bitid=="83##163##2129": - self.strand="+" - elif self.bitid=="339##419##2385": - self.strand="+" - elif self.bitid=="83##163##2209": - self.strand="+" - elif self.bitid=="339##419##2465": - self.strand="+" - elif self.bitid=="99##147##2193": - self.strand="-" - elif self.bitid=="355##403##2449": - self.strand="-" - elif self.bitid=="99##147##2145": - self.strand="-" - elif self.bitid=="355##403##2401": - self.strand="-" - elif self.bitid=="16##2064": - self.strand="+" - elif self.bitid=="272##2320": - self.strand="+" - elif self.bitid=="0##2048": - self.strand="-" - elif self.bitid=="256##2304": - self.strand="-" - elif self.bitid=="153##2201": - self.strand="-" + if self.bitid == "83##163##2129": + self.strand = "+" + elif self.bitid == "339##419##2385": + self.strand = "+" + elif self.bitid == "83##163##2209": + self.strand = "+" + elif self.bitid == "339##419##2465": + self.strand = "+" + elif self.bitid == "99##147##2193": + self.strand = "-" + elif self.bitid == "355##403##2449": + self.strand = "-" + elif self.bitid == "99##147##2145": + self.strand = "-" + elif self.bitid == "355##403##2401": + self.strand = "-" + elif self.bitid == 
"16##2064": + self.strand = "+" + elif self.bitid == "272##2320": + self.strand = "+" + elif self.bitid == "0##2048": + self.strand = "-" + elif self.bitid == "256##2304": + self.strand = "-" + elif self.bitid == "153##2201": + self.strand = "-" else: - self.strand="." - + self.strand = "." + def flip_strand(self): - if self.strand=="+":self.strand="-" - if self.strand=="-":self.strand="+" + if self.strand == "+": + self.strand = "-" + if self.strand == "-": + self.strand = "+" - def validate_BSJ_read(self,junctions): + def validate_BSJ_read(self, junctions): """ Checks if read is truly a BSJ originitor. * Defines left, right and middle alignments * Left and right alignments should not overlap * Middle alignment should be between left and right alignments """ - if len(self.bitid.split("##"))==3: - left=-1 - right=-1 - middle=-1 - if self.bitid=="83##163##2129": - left=2129 - right=83 - middle=163 - if self.bitid=="339##419##2385": - left=2385 - right=339 - middle=419 - if self.bitid=="83##163##2209": - left=163 - right=2209 - middle=83 - if self.bitid=="339##419##2465": - left=419 - right=2465 - middle=339 - if self.bitid=="99##147##2145": - left=99 - right=2145 - middle=147 - if self.bitid=="355##403##2401": - left=355 - right=2401 - middle=403 - if self.bitid=="99##147##2193": - left=2193 - right=147 - middle=99 - if self.bitid=="355##403##2449": - left=2449 - right=403 - middle=355 + if len(self.bitid.split("##")) == 3: + left = -1 + right = -1 + middle = -1 + if self.bitid == "83##163##2129": + left = 2129 + right = 83 + middle = 163 + if self.bitid == "339##419##2385": + left = 2385 + right = 339 + middle = 419 + if self.bitid == "83##163##2209": + left = 163 + right = 2209 + middle = 83 + if self.bitid == "339##419##2465": + left = 419 + right = 2465 + middle = 339 + if self.bitid == "99##147##2145": + left = 99 + right = 2145 + middle = 147 + if self.bitid == "355##403##2401": + left = 355 + right = 2401 + middle = 403 + if self.bitid == "99##147##2193": + 
left = 2193 + right = 147 + middle = 99 + if self.bitid == "355##403##2449": + left = 2449 + right = 403 + middle = 355 # print(left,right,middle) if left == -1 or right == -1 or middle == -1: return False @@ -262,89 +279,95 @@ def validate_BSJ_read(self,junctions): # print("validate_BSJ_read",self.readid,self.refcoordinates[middle][0],self.refcoordinates[middle][-1]) leftmost = str(self.refcoordinates[left][0]) rightmost = str(self.refcoordinates[right][-1]) - possiblejid = chrom+"##"+leftmost+"##"+rightmost+"##"+self.strand + possiblejid = ( + chrom + "##" + leftmost + "##" + rightmost + "##" + self.strand + ) # print("validate_BSJ_read",self.readid,possiblejid) if possiblejid in junctions: self.start = leftmost - self.end = str(int(rightmost) + 1) # this will be added to the BED file + self.end = str(int(rightmost) + 1) # this will be added to the BED file return True else: return False - - - + def get_bsjid(self): - t=[] + t = [] t.append(self.refname) t.append(self.start) t.append(self.end) t.append(self.strand) return "##".join(t) - - def write_out_reads(self,outbam): + + def write_out_reads(self, outbam): for r in self.alignments: outbam.write(r) - - + + def get_uniq_readid(r): - rname=r.query_name - hi=r.get_tag("HI") - rid=rname+"##"+str(hi) + rname = r.query_name + hi = r.get_tag("HI") + rid = rname + "##" + str(hi) return rid + def get_bitflag(r): - bitflag=str(r).split("\t")[1] + bitflag = str(r).split("\t")[1] return int(bitflag) + def _bsjid2chrom(bsjid): - x=bsjid.split("##") + x = bsjid.split("##") return x[0] + def _bsjid2jid(bsjid): - x=bsjid.split("##") - chrom=x[0] - start=x[1] - end=str(int(x[2])-1) - jid="##".join([chrom,start,end]) - return jid,chrom - -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() + x = bsjid.split("##") + chrom = x[0] + start = x[1] + end = str(int(x[2]) - 1) + jid = "##".join([chrom, 
start, end]) + return jid, chrom + + +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions + -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) -def _get_regionname_from_seqname(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: + +def _get_regionname_from_seqname(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: return k else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." 
% (seqname)) def main(): @@ -356,64 +379,190 @@ def main(): where the chrom, start and end represent the BSJ the read is depicting. """ ) - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="Input Chimeric-only STAR2p BAM file") - parser.add_argument('-t','--sample_counts_table', dest='countstable', type=str, required=True, - help='final all sample counts matrix') # get coordinates of the circRNA - parser.add_argument('--hqonly', dest='hqonly', action='store_true', - help='filter out non HQ calls') - parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1', - help='Sample Name: SM for RG') - parser.add_argument("-l",'--library', dest='library', type=str, required=False, default = 'lib1', - help='Sample Name: LB for RG') - parser.add_argument("-f",'--platform', dest='platform', type=str, required=False, default = 'illumina', - help='Sample Name: PL for RG') - parser.add_argument("-u",'--unit', dest='unit', type=str, required=False, default = 'unit1', - help='Sample Name: PU for RG') - parser.add_argument("-o","--outbam",dest="outbam",required=True,type=argparse.FileType('w'), - help="Output bam file ... 
both strands") - parser.add_argument("-p","--plusbam",dest="plusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("-m","--minusbam",dest="minusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("--outputhostbams",dest="outputhostbams",required=False,action='store_true', default=False, - help="Output individual host BAM files") - parser.add_argument("--outputvirusbams",dest="outputvirusbams",required=False,action='store_true', default=False, - help="Output individual virus BAM files") - parser.add_argument("--outdir",dest="outdir",required=False,type=str, - help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).") - parser.add_argument("-b","--bed",dest="bed",required=True,type=str, - help="Output BSJ bed.gz file (with strand info)") - parser.add_argument("-j","--junctionsfound",dest="junctionsfound",required=True,type=argparse.FileType('w', encoding='UTF-8'), - help="Output TSV file with counts of junctions expected vs found") - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') - parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value') - parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') - parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... 
comma-separated list') - args = parser.parse_args() + parser.add_argument( + "-i", + "--inbam", + dest="inbam", + required=True, + type=str, + help="Input Chimeric-only STAR2p BAM file", + ) + parser.add_argument( + "-t", + "--sample_counts_table", + dest="countstable", + type=str, + required=True, + help="final all sample counts matrix", + ) # get coordinates of the circRNA + parser.add_argument( + "--hqonly", dest="hqonly", action="store_true", help="filter out non HQ calls" + ) + parser.add_argument( + "-s", + "--sample_name", + dest="samplename", + type=str, + required=False, + default="sample1", + help="Sample Name: SM for RG", + ) + parser.add_argument( + "-l", + "--library", + dest="library", + type=str, + required=False, + default="lib1", + help="Sample Name: LB for RG", + ) + parser.add_argument( + "-f", + "--platform", + dest="platform", + type=str, + required=False, + default="illumina", + help="Sample Name: PL for RG", + ) + parser.add_argument( + "-u", + "--unit", + dest="unit", + type=str, + required=False, + default="unit1", + help="Sample Name: PU for RG", + ) + parser.add_argument( + "-o", + "--outbam", + dest="outbam", + required=True, + type=argparse.FileType("w"), + help="Output bam file ... 
both strands", + ) + parser.add_argument( + "-p", + "--plusbam", + dest="plusbam", + required=True, + type=argparse.FileType("w"), + help="Output plus strand bam file", + ) + parser.add_argument( + "-m", + "--minusbam", + dest="minusbam", + required=True, + type=argparse.FileType("w"), + help="Output plus strand bam file", + ) + parser.add_argument( + "--outputhostbams", + dest="outputhostbams", + required=False, + action="store_true", + default=False, + help="Output individual host BAM files", + ) + parser.add_argument( + "--outputvirusbams", + dest="outputvirusbams", + required=False, + action="store_true", + default=False, + help="Output individual virus BAM files", + ) + parser.add_argument( + "--outdir", + dest="outdir", + required=False, + type=str, + help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).", + ) + parser.add_argument( + "-b", + "--bed", + dest="bed", + required=True, + type=str, + help="Output BSJ bed.gz file (with strand info)", + ) + parser.add_argument( + "-j", + "--junctionsfound", + dest="junctionsfound", + required=True, + type=argparse.FileType("w", encoding="UTF-8"), + help="Output TSV file with counts of junctions expected vs found", + ) + parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", + ) + parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value", + ) + parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", + ) + parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... 
comma-separated list", + ) + args = parser.parse_args() samfile = pysam.AlignmentFile(args.inbam, "rb") samheader = samfile.header.to_dict() - samheader['RG']=list() + samheader["RG"] = list() - print("%s | Reading...junctions!..."%(get_ctime())) - indf = pd.read_csv(args.countstable,sep="\t",header=0,compression='gzip') + print("%s | Reading...junctions!..." % (get_ctime())) + indf = pd.read_csv(args.countstable, sep="\t", header=0, compression="gzip") # filter by samplename - indf = indf.loc[indf['sample_name']==args.samplename] + indf = indf.loc[indf["sample_name"] == args.samplename] # filter for hq if args.hqonly: - indf = indf.loc[indf['HQ']=="Y"] - - junctions=dict() - junctions_found=dict() - - for index,row in indf.iterrows(): - jid = row['chrom']+"##"+str(row['start'])+"##"+str(row['end'])+"##"+row['strand'] - samheader['RG'].append({'ID':jid, 'LB':args.library, 'PL':args.platform, 'PU':args.unit,'SM':args.samplename}) - junctions[jid] = max([row['circExplorer_read_count'],row['circExplorer_bwa_read_count']]) # large read count support from the "required" tools + indf = indf.loc[indf["HQ"] == "Y"] + + junctions = dict() + junctions_found = dict() + + for index, row in indf.iterrows(): + jid = ( + row["chrom"] + + "##" + + str(row["start"]) + + "##" + + str(row["end"]) + + "##" + + row["strand"] + ) + samheader["RG"].append( + { + "ID": jid, + "LB": args.library, + "PL": args.platform, + "PU": args.unit, + "SM": args.samplename, + } + ) + junctions[jid] = max( + [row["circExplorer_read_count"], row["circExplorer_bwa_read_count"]] + ) # large read count support from the "required" tools junctions_found[jid] = 0 # junctionsfile = open(args.countstable,'r') @@ -430,137 +579,171 @@ def main(): # junctions_found[jid] = 0 # junctionsfile.close() sequences = list() - for v in samheader['SQ']: - sequences.append(v['SN']) - seqname2regionname=dict() - hosts=set() - viruses=set() - regions = 
read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + for v in samheader["SQ"]: + sequences.append(v["SN"]) + seqname2regionname = dict() + hosts = set() + viruses = set() + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) + hav = _get_host_additive_virus(regions, s) if hav == "host": - hostname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=hostname + hostname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = hostname hosts.add(hostname) if hav == "virus": - virusname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=virusname + virusname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = virusname viruses.add(virusname) - print("%s | Done reading %d junctions."%(get_ctime(),len(junctions))) + print("%s | Done reading %d junctions." % (get_ctime(), len(junctions))) - bigdict=dict() + bigdict = dict() # print("Opening...") # print(args.inbam) - print("%s | Reading...alignments!..."%(get_ctime())) - count=0 - count2=0 + print("%s | Reading...alignments!..." % (get_ctime())) + count = 0 + count2 = 0 for read in samfile.fetch(): - count+=1 - if debug: print(read,read.reference_id,read.next_reference_id) - if read.reference_id != read.next_reference_id: continue # only works for PE ... for SE read.next_reference_id is -1 - count2+=1 - rid=get_uniq_readid(read) # add the HI number to the readid - if debug:print(rid) + count += 1 + if debug: + print(read, read.reference_id, read.next_reference_id) + if read.reference_id != read.next_reference_id: + continue # only works for PE ... 
for SE read.next_reference_id is -1 + count2 += 1 + rid = get_uniq_readid(read) # add the HI number to the readid + if debug: + print(rid) if not rid in bigdict: - bigdict[rid]=Readinfo(rid,read.reference_name) + bigdict[rid] = Readinfo(rid, read.reference_name) # bigdict[rid].append_alignment(read) # since rid has HI number included ... this separates alignment by HI - bitflag=get_bitflag(read) - if debug:print(bitflag) - bigdict[rid].append_bitflag(bitflag) # each rid can have upto 3 lines in the BAM with each having its own bitflag ... collect all bigflags in a list here - refpos=list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True))) - bigdict[rid].set_refcoordinates(bitflag,refpos) # maintain a list of reference coordinated that are "aligned" for each bitflag in each rid alignment + bitflag = get_bitflag(read) + if debug: + print(bitflag) + bigdict[rid].append_bitflag( + bitflag + ) # each rid can have up to 3 lines in the BAM, each with its own bitflag ... collect all bitflags in a list here + refpos = list( + filter(lambda x: x is not None, read.get_reference_positions(full_length=True)) + ) + bigdict[rid].set_refcoordinates( + bitflag, refpos + ) # maintain a list of reference coordinates that are "aligned" for each bitflag in each rid alignment # bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag,read) - if debug:print(bigdict[rid]) - print("%s | Done reading %d chimeric alignments. [%d same chrom chimeras]"%(get_ctime(),count,count2)) + if debug: + print(bigdict[rid]) + print( + "%s | Done reading %d chimeric alignments. 
[%d same chrom chimeras]" + % (get_ctime(), count, count2) + ) samfile.reset() - print("%s | Writing BAMs"%(get_ctime())) - print("%s | Re-Reading...alignments!..."%(get_ctime())) - plusfile = pysam.AlignmentFile(args.plusbam, "wb", header = samheader) - minusfile = pysam.AlignmentFile(args.minusbam, "wb", header = samheader) - outfile = pysam.AlignmentFile(args.outbam, "wb", header = samheader) + print("%s | Writing BAMs" % (get_ctime())) + print("%s | Re-Reading...alignments!..." % (get_ctime())) + plusfile = pysam.AlignmentFile(args.plusbam, "wb", header=samheader) + minusfile = pysam.AlignmentFile(args.minusbam, "wb", header=samheader) + outfile = pysam.AlignmentFile(args.outbam, "wb", header=samheader) outputbams = dict() if args.outputhostbams: for h in hosts: - outbamname = os.path.join(args.outdir,args.samplename+"."+h+".BSJ.HQonly.bam") - outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + args.outdir, args.samplename + "." + h + ".BSJ.HQonly.bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) if args.outputvirusbams: for v in viruses: - outbamname = os.path.join(args.outdir,args.samplename+"."+v+".BSJ.HQonly.bam") - outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header = samheader) - bsjdict=dict() - bitid_counts=dict() + outbamname = os.path.join( + args.outdir, args.samplename + "." + v + ".BSJ.HQonly.bam" + ) + outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header=samheader) + bsjdict = dict() + bitid_counts = dict() lenoutputbams = len(outputbams) for read in samfile.fetch(): - if read.reference_id != read.next_reference_id: continue - rid=get_uniq_readid(read) + if read.reference_id != read.next_reference_id: + continue + rid = get_uniq_readid(read) if rid in bigdict: - bigdict[rid].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... 
bitflags are pre-sorted - if debug:print(bigdict[rid]) - bigdict[rid].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered - bigdict[rid].flip_strand() # strands are flipped than those reported in the counts table .. hence flipping! - if not bigdict[rid].validate_BSJ_read(junctions=junctions): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. Also add start and end to the BSJ object + bigdict[ + rid + ].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... bitflags are pre-sorted + if debug: + print(bigdict[rid]) + bigdict[ + rid + ].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered + bigdict[ + rid + ].flip_strand() # strands are flipped than those reported in the counts table .. hence flipping! + if not bigdict[rid].validate_BSJ_read( + junctions=junctions + ): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. 
Also add start and end to the BSJ object continue # bigdict[rid].get_start_end() # print(bigdict[rid]) - bsjid=bigdict[rid].get_bsjid() - chrom=_bsjid2chrom(bsjid) + bsjid = bigdict[rid].get_bsjid() + chrom = _bsjid2chrom(bsjid) # jid,chrom=_bsjid2jid(bsjid) read.set_tag("RG", bsjid, value_type="Z") - if bigdict[rid].strand=="+": + if bigdict[rid].strand == "+": plusfile.write(read) - if bigdict[rid].strand=="-": + if bigdict[rid].strand == "-": minusfile.write(read) outfile.write(read) if lenoutputbams != 0: - regionname=_get_regionname_from_seqname(regions,chrom) + regionname = _get_regionname_from_seqname(regions, chrom) if regionname in hosts and args.outputhostbams: outputbams[regionname].write(read) if regionname in viruses and args.outputvirusbams: outputbams[regionname].write(read) if not bsjid in bsjdict: - bsjdict[bsjid]=BSJ() + bsjdict[bsjid] = BSJ() bsjdict[bsjid].set_chrom(bigdict[rid].refname) bsjdict[bsjid].set_start(bigdict[rid].start) bsjdict[bsjid].set_end(bigdict[rid].end) bsjdict[bsjid].set_strand(bigdict[rid].strand) bsjdict[bsjid].append_bitid(bigdict[rid].bitid) if not bigdict[rid].bitid in bitid_counts: - bitid_counts[bigdict[rid].bitid]=0 - bitid_counts[bigdict[rid].bitid]+=1 + bitid_counts[bigdict[rid].bitid] = 0 + bitid_counts[bigdict[rid].bitid] += 1 bsjdict[bsjid].append_rid(rid) plusfile.close() minusfile.close() samfile.close() outfile.close() if lenoutputbams != 0: - for k,v in outputbams.items(): + for k, v in outputbams.items(): v.close() - print("%s | Done!"%(get_ctime())) + print("%s | Done!" 
% (get_ctime())) for b in bitid_counts.keys(): - print(b,bitid_counts[b]) - print("%s | Writing BED"%(get_ctime())) + print(b, bitid_counts[b]) + print("%s | Writing BED" % (get_ctime())) - with gzip.open(args.bed,'wt') as bsjfile: + with gzip.open(args.bed, "wt") as bsjfile: for bsjid in bsjdict.keys(): bsjdict[bsjid].update_score_and_found_count(junctions_found) bsjdict[bsjid].write_out_BSJ(bsjfile) bsjfile.close() - - args.junctionsfound.write("#chrom\tstart\tend\tstrand\texpected_BSJ_reads\tfound_BSJ_reads\n") + args.junctionsfound.write( + "#chrom\tstart\tend\tstrand\texpected_BSJ_reads\tfound_BSJ_reads\n" + ) for jid in junctions.keys(): - x=jid.split("##") - chrom=x[0] - start=int(x[1]) - end=int(x[2])+1 - strand=x[3] - args.junctionsfound.write("%s\t%d\t%d\t%s\t%d\t%d\n"%(chrom,start,end,strand,junctions[jid],junctions_found[jid])) + x = jid.split("##") + chrom = x[0] + start = int(x[1]) + end = int(x[2]) + 1 + strand = x[3] + args.junctionsfound.write( + "%s\t%d\t%d\t%s\t%d\t%d\n" + % (chrom, start, end, strand, junctions[jid], junctions_found[jid]) + ) args.junctionsfound.close() - print("%s | ALL Done!"%(get_ctime())) - + print("%s | ALL Done!" 
% (get_ctime())) + if __name__ == "__main__": main() - - - diff --git a/workflow/scripts/_extract_circExplorer_linear_reads.py b/workflow/scripts/_extract_circExplorer_linear_reads.py index 0a61854..a9ec842 100755 --- a/workflow/scripts/_extract_circExplorer_linear_reads.py +++ b/workflow/scripts/_extract_circExplorer_linear_reads.py @@ -5,49 +5,54 @@ import pprint import time + def get_ctime(): return time.ctime(time.time()) + pp = pprint.PrettyPrinter(indent=4) -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. 
Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions + -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) -def _get_regionname_from_seqname(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: + +def _get_regionname_from_seqname(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: return k else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) + def _convertjid(jid): jid = jid.split("##") @@ -57,9 +62,12 @@ def _convertjid(jid): strand = jid[3] read_strand = jid[4] strand_info = "." - if strand==read_strand: strand_info="SS" - if (strand=="+" and read_strand=="-") or (strand=="-" and read_strand=="+"): strand_info="OS" - return "##".join([chrom,start,end,strand,strand_info]) + if strand == read_strand: + strand_info = "SS" + if (strand == "+" and read_strand == "-") or (strand == "-" and read_strand == "+"): + strand_info = "OS" + return "##".join([chrom, start, end, strand, strand_info]) + def _get_shortjid(jid): jid = jid.split("##") @@ -69,7 +77,8 @@ def _get_shortjid(jid): strand = jid[3] read_strand = jid[4] strand_info = "." - return "##".join([chrom,start,end,strand]) + return "##".join([chrom, start, end, strand]) + def _get_jinfo(jid): jid = jid.split("##") @@ -79,270 +88,482 @@ def _get_jinfo(jid): strand = jid[3] read_strand = jid[4] strand_info = "." 
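The SS/OS classification used by `_convertjid` and `_get_jinfo` reduces to two comparisons: a read aligned to the same strand as the BSJ is same-strand ("SS"), a read aligned to the opposite strand is opposite-strand ("OS"), and any other combination stays ".". A standalone sketch of that rule (the function name is illustrative, not from the script):

```python
def classify_strand_info(strand: str, read_strand: str) -> str:
    """Classify a linear read relative to a BSJ strand.

    SS = read strand matches the BSJ strand; OS = the two strands are
    opposite (+/- or -/+); anything else (e.g. an unstranded ".") is ".".
    """
    if strand == read_strand:
        return "SS"
    if (strand == "+" and read_strand == "-") or (strand == "-" and read_strand == "+"):
        return "OS"
    return "."
```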
- if strand==read_strand: strand_info="SS" - if (strand=="+" and read_strand=="-") or (strand=="-" and read_strand=="+"): strand_info="OS" - short_jid = "##".join([chrom,start,end,strand]) - converted_jid = "##".join([chrom,start,end,strand,strand_info]) - return chrom,start,end,strand_info,short_jid,converted_jid,read_strand + if strand == read_strand: + strand_info = "SS" + if (strand == "+" and read_strand == "-") or (strand == "-" and read_strand == "+"): + strand_info = "OS" + short_jid = "##".join([chrom, start, end, strand]) + converted_jid = "##".join([chrom, start, end, strand, strand_info]) + return chrom, start, end, strand_info, short_jid, converted_jid, read_strand + class JID: - def __init__(self,chrom,start,end,strand): - self.chrom=chrom - self.start=start - self.end=end - self.strand=strand - self.ss_linear_count=0 - self.os_linear_count=0 - self.ss_linear_spliced_count=0 - self.os_linear_spliced_count=0 + def __init__(self, chrom, start, end, strand): + self.chrom = chrom + self.start = start + self.end = end + self.strand = strand + self.ss_linear_count = 0 + self.os_linear_count = 0 + self.ss_linear_spliced_count = 0 + self.os_linear_spliced_count = 0 - def increment_linear(self,strand_info): - if strand_info=="SS": self.ss_linear_count+=1 - if strand_info=="OS": self.os_linear_count+=1 + def increment_linear(self, strand_info): + if strand_info == "SS": + self.ss_linear_count += 1 + if strand_info == "OS": + self.os_linear_count += 1 - def increment_linear_spliced(self,strand_info): - if strand_info=="SS": self.ss_linear_spliced_count+=1 - if strand_info=="OS": self.os_linear_spliced_count+=1 + def increment_linear_spliced(self, strand_info): + if strand_info == "SS": + self.ss_linear_spliced_count += 1 + if strand_info == "OS": + self.os_linear_spliced_count += 1 def main(): # debug = True debug = False - parser = argparse.ArgumentParser( - ) + parser = argparse.ArgumentParser() # INPUTs - 
parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="Input BAM file") - parser.add_argument('-r',"--rid2jid",dest="rid2jid",required=True,type=str, - help="readID to junctionID lookup") - parser.add_argument('-t','--sample_counts_table', dest='countstable', type=str, required=True, - help='circExplore per-sample counts table') # get coordinates of the circRNA - parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1', - help='Sample Name: SM for RG') - parser.add_argument('-p',"--pe",dest="pe",required=False,action='store_true', default=False, - help="set this if BAM is paired end") - parser.add_argument("-l",'--library', dest='library', type=str, required=False, default = 'lib1', - help='Sample Name: LB for RG') - parser.add_argument("-f",'--platform', dest='platform', type=str, required=False, default = 'illumina', - help='Sample Name: PL for RG') - parser.add_argument("-u",'--unit', dest='unit', type=str, required=False, default = 'unit1', - help='Sample Name: PU for RG') - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') - parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value') - parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') - parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... 
comma-separated list') + parser.add_argument( + "-i", "--inbam", dest="inbam", required=True, type=str, help="Input BAM file" + ) + parser.add_argument( + "-r", + "--rid2jid", + dest="rid2jid", + required=True, + type=str, + help="readID to junctionID lookup", + ) + parser.add_argument( + "-t", + "--sample_counts_table", + dest="countstable", + type=str, + required=True, + help="circExplore per-sample counts table", + ) # get coordinates of the circRNA + parser.add_argument( + "-s", + "--sample_name", + dest="samplename", + type=str, + required=False, + default="sample1", + help="Sample Name: SM for RG", + ) + parser.add_argument( + "-p", + "--pe", + dest="pe", + required=False, + action="store_true", + default=False, + help="set this if BAM is paired end", + ) + parser.add_argument( + "-l", + "--library", + dest="library", + type=str, + required=False, + default="lib1", + help="Sample Name: LB for RG", + ) + parser.add_argument( + "-f", + "--platform", + dest="platform", + type=str, + required=False, + default="illumina", + help="Sample Name: PL for RG", + ) + parser.add_argument( + "-u", + "--unit", + dest="unit", + type=str, + required=False, + default="unit1", + help="Sample Name: PU for RG", + ) + parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", + ) + parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value", + ) + parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", + ) + parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... 
comma-separated list", + ) # OUTPUTs - parser.add_argument("-o","--outbam",dest="outbam",required=True,type=str, - help="Output \"primary alignment near BSJ\" only BAM file") - parser.add_argument("--outplusbam",dest="outplusbam",required=True,type=str, - help="Output \"primary alignment near BSJ\" only plus strand BAM file") - parser.add_argument("--outminusbam",dest="outminusbam",required=True,type=str, - help="Output \"primary alignment near BSJ\" only minus strand BAM file") - parser.add_argument("--splicedbam",dest="splicedbam",required=True,type=str, - help="Output \"primary spliced alignment\" only BAM file") - parser.add_argument("--splicedbsjbam",dest="splicedbsjbam",required=True,type=str, - help="Output \"primary spliced alignment near BSJ\" only BAM file") - parser.add_argument("--splicedbsjplusbam",dest="splicedbsjplusbam",required=True,type=str, - help="Output \"primary spliced alignment near BSJ\" only plus strand BAM file") - parser.add_argument("--splicedbsjminusbam",dest="splicedbsjminusbam",required=True,type=str, - help="Output \"primary spliced alignment near BSJ\" only minus strand BAM file") - parser.add_argument("--outputhostbams",dest="outputhostbams",required=False,action='store_true', default=False, - help="Output individual host BAM files") - parser.add_argument("--outputvirusbams",dest="outputvirusbams",required=False,action='store_true', default=False, - help="Output individual virus BAM files") - parser.add_argument("--outdir",dest="outdir",required=False,type=str, - help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).") - parser.add_argument("-c","--countsfound",dest="countsfound",required=True,type=argparse.FileType('w', encoding='UTF-8'), - help="Output TSV file with counts of junctions found") - + parser.add_argument( + "-o", + "--outbam", + dest="outbam", + required=True, + type=str, + help='Output "primary alignment near BSJ" only BAM file', + ) + parser.add_argument( 
+ "--outplusbam", + dest="outplusbam", + required=True, + type=str, + help='Output "primary alignment near BSJ" only plus strand BAM file', + ) + parser.add_argument( + "--outminusbam", + dest="outminusbam", + required=True, + type=str, + help='Output "primary alignment near BSJ" only minus strand BAM file', + ) + parser.add_argument( + "--splicedbam", + dest="splicedbam", + required=True, + type=str, + help='Output "primary spliced alignment" only BAM file', + ) + parser.add_argument( + "--splicedbsjbam", + dest="splicedbsjbam", + required=True, + type=str, + help='Output "primary spliced alignment near BSJ" only BAM file', + ) + parser.add_argument( + "--splicedbsjplusbam", + dest="splicedbsjplusbam", + required=True, + type=str, + help='Output "primary spliced alignment near BSJ" only plus strand BAM file', + ) + parser.add_argument( + "--splicedbsjminusbam", + dest="splicedbsjminusbam", + required=True, + type=str, + help='Output "primary spliced alignment near BSJ" only minus strand BAM file', + ) + parser.add_argument( + "--outputhostbams", + dest="outputhostbams", + required=False, + action="store_true", + default=False, + help="Output individual host BAM files", + ) + parser.add_argument( + "--outputvirusbams", + dest="outputvirusbams", + required=False, + action="store_true", + default=False, + help="Output individual virus BAM files", + ) + parser.add_argument( + "--outdir", + dest="outdir", + required=False, + type=str, + help="Output folder for the individual BAM files (required only if --outputhostbams or --outputvirusbams is used).", + ) + parser.add_argument( + "-c", + "--countsfound", + dest="countsfound", + required=True, + type=argparse.FileType("w", encoding="UTF-8"), + help="Output TSV file with counts of junctions found", + ) args = parser.parse_args() - print("%s | Reading...rid2jid!..."%(get_ctime())) + print("%s | Reading...rid2jid!..." 
% (get_ctime())) rid2jid = dict() - with gzip.open(args.rid2jid,'rt') as tfile: + with gzip.open(args.rid2jid, "rt") as tfile: for l in tfile: - l=l.strip().split("\t") - rid2jid[l[0]]=l[1] + l = l.strip().split("\t") + rid2jid[l[0]] = l[1] tfile.close() - print("%s | Done reading...%d rid2jid's!"%(get_ctime(),len(rid2jid))) + print("%s | Done reading...%d rid2jid's!" % (get_ctime(), len(rid2jid))) samfile = pysam.AlignmentFile(args.inbam, "rb") samheader = samfile.header.to_dict() - samheader['RG']=list() - junctionsfile = open(args.countstable,'r') - print("%s | Reading...junctions!..."%(get_ctime())) - count=0 - junction_counts=dict() + samheader["RG"] = list() + junctionsfile = open(args.countstable, "r") + print("%s | Reading...junctions!..." % (get_ctime())) + count = 0 + junction_counts = dict() # splicedbsjjid=dict() for l in junctionsfile.readlines(): - count+=1 - if "read_count" in l: continue + count += 1 + if "read_count" in l: + continue l = l.strip().split("\t") chrom = l[0] start = l[1] - end = str(int(l[2])-1) + end = str(int(l[2]) - 1) strand = l[3] - short_jid = chrom+"##"+start+"##"+end+"##"+strand # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching! - jid1 = short_jid+"##SS" # SS=sample strand ... called BSJ and read are on the same strand - jid2 = short_jid+"##OS" # OS=opposite strand ... called BSJ and read are on opposite strands - samheader['RG'].append({'ID':jid1 , 'LB':args.library, 'PL':args.platform, 'PU':args.unit,'SM':args.samplename}) - samheader['RG'].append({'ID':jid2 , 'LB':args.library, 'PL':args.platform, 'PU':args.unit,'SM':args.samplename}) + short_jid = ( + chrom + "##" + start + "##" + end + "##" + strand + ) # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching! + jid1 = ( + short_jid + "##SS" + ) # SS=sample strand ... 
called BSJ and read are on the same strand + jid2 = ( + short_jid + "##OS" + ) # OS=opposite strand ... called BSJ and read are on opposite strands + samheader["RG"].append( + { + "ID": jid1, + "LB": args.library, + "PL": args.platform, + "PU": args.unit, + "SM": args.samplename, + } + ) + samheader["RG"].append( + { + "ID": jid2, + "LB": args.library, + "PL": args.platform, + "PU": args.unit, + "SM": args.samplename, + } + ) # print(short_jid) - junction_counts[short_jid] = JID(chrom,start,end,strand) + junction_counts[short_jid] = JID(chrom, start, end, strand) # splicedbsjjid[jid] = dict() junctionsfile.close() # exit() sequences = list() - for v in samheader['SQ']: - sequences.append(v['SN']) - seqname2regionname=dict() - hosts=set() - viruses=set() - regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + for v in samheader["SQ"]: + sequences.append(v["SN"]) + seqname2regionname = dict() + hosts = set() + viruses = set() + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) + hav = _get_host_additive_virus(regions, s) if hav == "host": - hostname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=hostname + hostname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = hostname hosts.add(hostname) if hav == "virus": - virusname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=virusname + virusname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = virusname viruses.add(virusname) - print("%s | Done reading %d junctions."%(get_ctime(),count)) - + print("%s | Done reading %d junctions." 
% (get_ctime(), count)) + outbam = pysam.AlignmentFile(args.outbam, "wb", header=samheader) outplusbam = pysam.AlignmentFile(args.outplusbam, "wb", header=samheader) outminusbam = pysam.AlignmentFile(args.outminusbam, "wb", header=samheader) splicedbam = pysam.AlignmentFile(args.splicedbam, "wb", header=samheader) splicedbsjbam = pysam.AlignmentFile(args.splicedbsjbam, "wb", header=samheader) - splicedbsjplusbam = pysam.AlignmentFile(args.splicedbsjplusbam, "wb", header=samheader) - splicedbsjminusbam = pysam.AlignmentFile(args.splicedbsjminusbam, "wb", header=samheader) + splicedbsjplusbam = pysam.AlignmentFile( + args.splicedbsjplusbam, "wb", header=samheader + ) + splicedbsjminusbam = pysam.AlignmentFile( + args.splicedbsjminusbam, "wb", header=samheader + ) outputbams = dict() if args.outputhostbams: for h in hosts: - outbamname = os.path.join(args.outdir,args.samplename+"."+h+".BSJ.bam") - outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + args.outdir, args.samplename + "." + h + ".BSJ.bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) if args.outputvirusbams: for v in viruses: - outbamname = os.path.join(args.outdir,args.samplename+"."+v+".BSJ.bam") - outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + args.outdir, args.samplename + "." + v + ".BSJ.bam" + ) + outputbams[v] = pysam.AlignmentFile(outbamname, "wb", header=samheader) lenoutputbams = len(outputbams) # pp.pprint(rid2jid) - print("%s | Opened output BAMs for writing..."%(get_ctime())) - spliced=dict() # 1=spliced - splicedbsj=dict() - count1=0 # total reads - count2=0 # total reads near BSJ - count3=0 # total spliced reads - count4=0 # total spliced reads near BSJ + print("%s | Opened output BAMs for writing..." 
% (get_ctime())) + spliced = dict() # 1=spliced + splicedbsj = dict() + count1 = 0 # total reads + count2 = 0 # total reads near BSJ + count3 = 0 # total spliced reads + count4 = 0 # total spliced reads near BSJ print("Reading alignments...") - mate_already_counted1=dict() - mate_already_counted2=dict() + mate_already_counted1 = dict() + mate_already_counted2 = dict() # mate_already_counted3=dict() # not needed as similar to the "spliced" dict # mate_already_counted4=dict() # not needed as similar to "spliced" dict have value 2 - last_printed=-1 + last_printed = -1 for read in samfile.fetch(): - if args.pe and ( read.reference_id != read.next_reference_id ): continue # only works for PE ... for SE read.next_reference_id is -1 - if args.pe and ( not read.is_proper_pair ): continue - if read.is_secondary or read.is_supplementary or read.is_unmapped : continue - rid=read.query_name -# count read if it has not been counted yet + if args.pe and (read.reference_id != read.next_reference_id): + continue # only works for PE ... 
for SE read.next_reference_id is -1 + if args.pe and (not read.is_proper_pair): + continue + if read.is_secondary or read.is_supplementary or read.is_unmapped: + continue + rid = read.query_name + # count read if it has not been counted yet if not rid in mate_already_counted1: - mate_already_counted1[rid]=1 - count1+=1 -# find cigar tuple, cigar string and generate a cigar string order -# if 0 is followed by 3 in the "cigarstringorder" value (can happen more than once in multi-spliced reads) -# then the read is spliced - cigar=read.cigarstring - cigart=read.cigartuples - cigart=cigart[list(map(lambda z:z[0],cigart)).index(0):] - cigarstringorder="" + mate_already_counted1[rid] = 1 + count1 += 1 + # find cigar tuple, cigar string and generate a cigar string order + # if 0 is followed by 3 in the "cigarstringorder" value (can happen more than once in multi-spliced reads) + # then the read is spliced + cigar = read.cigarstring + cigart = read.cigartuples + cigart = cigart[list(map(lambda z: z[0], cigart)).index(0) :] + cigarstringorder = "" for j in range(len(cigart)): - cigarstringorder+=str(cigart[j][0]) -# cigarstringorder can be like 034 or 03034 or 03 or 0303 -# check if the rid is already found to be spliced ... if not then check if it is + cigarstringorder += str(cigart[j][0]) + # cigarstringorder can be like 034 or 03034 or 03 or 0303 + # check if the rid is already found to be spliced ... if not then check if it is if not rid in spliced: - if "03" in cigarstringorder: # aka read is spliced - count3+=1 - spliced[rid]=1 -# check if the rid exists in the rid2jid lookup table + if "03" in cigarstringorder: # aka read is spliced + count3 += 1 + spliced[rid] = 1 + # check if the rid exists in the rid2jid lookup table if rid in rid2jid: # does this rid have a corresponding BSJ?? 
-# if rid is in rid2jid lookuptable and it is not previously counted then count it as "linear" read for that BSJ + # if rid is in rid2jid lookuptable and it is not previously counted then count it as "linear" read for that BSJ if not rid in mate_already_counted2: - mate_already_counted2[rid]=1 - count2+=1 + mate_already_counted2[rid] = 1 + count2 += 1 jid = rid2jid[rid] - chrom, jstart, jend, strand_info, short_jid, converted_jid, read_strand = _get_jinfo(jid) + ( + chrom, + jstart, + jend, + strand_info, + short_jid, + converted_jid, + read_strand, + ) = _get_jinfo(jid) junction_counts[short_jid].increment_linear(strand_info) - read.set_tag("RG", converted_jid , value_type="Z") + read.set_tag("RG", converted_jid, value_type="Z") outbam.write(read) - if read_strand=="+": outplusbam.write(read) - if read_strand=="-": outminusbam.write(read) + if read_strand == "+": + outplusbam.write(read) + if read_strand == "-": + outminusbam.write(read) if lenoutputbams != 0: - regionname=_get_regionname_from_seqname(regions,chrom) + regionname = _get_regionname_from_seqname(regions, chrom) if regionname in hosts and args.outputhostbams: outputbams[regionname].write(read) if regionname in viruses and args.outputvirusbams: outputbams[regionname].write(read) -# check if this rid's .. this alignment is spliced! -# rid could be in spliced but this may be an unspliced mate + # check if this rid's .. this alignment is spliced! + # rid could be in spliced but this may be an unspliced mate if rid in spliced and "03" in cigarstringorder: if not rid in splicedbsj: -# CIGAR has match ... followed by skip ... aka spliced read -# find number of splices -# nsplices is the number of times "03" is found in cigarstringorder -# if nsplices is gt than 1 then we have to get the coordinates of all the matches and -# try to compare each one with the BSJ coordinates + # CIGAR has match ... followed by skip ... 
aka spliced read + # find number of splices + # nsplices is the number of times "03" is found in cigarstringorder + # if nsplices is gt than 1 then we have to get the coordinates of all the matches and + # try to compare each one with the BSJ coordinates nsplices = cigarstringorder.count("03") if nsplices == 1: - start=int(read.reference_start)+int(cigart[0][1])+1 - end=int(start)+int(cigart[1][1])-1 + start = int(read.reference_start) + int(cigart[0][1]) + 1 + end = int(start) + int(cigart[1][1]) - 1 # print(start,end,jstart,jend) - if abs(int(start)-int(jstart))<3 or abs(int(end)-int(jend))<3: # include 2,1,0,-1,-2 - junction_counts[short_jid].increment_linear_spliced(strand_info) - splicedbsj[rid]=1 # aka read is spliced and is spliced at BSJ - count4+=1 + if ( + abs(int(start) - int(jstart)) < 3 + or abs(int(end) - int(jend)) < 3 + ): # include 2,1,0,-1,-2 + junction_counts[short_jid].increment_linear_spliced( + strand_info + ) + splicedbsj[ + rid + ] = 1 # aka read is spliced and is spliced at BSJ + count4 += 1 # splicedbsjjid[jid][rid]=1 - else: # read has multiple splicing events - for j in range(len(cigart)-1): - if cigart[j][0]==0 and cigart[j+1][0]==3: + else: # read has multiple splicing events + for j in range(len(cigart) - 1): + if cigart[j][0] == 0 and cigart[j + 1][0] == 3: add_coords = 0 - for k in range(j+1): - add_coords+=int(cigart[k][1]) - start=int(read.reference_start)+add_coords+1 - end=int(start)+int(cigart[j+1][1])-1 - if abs(int(start)-int(jstart))<3 or abs(int(end)-int(jend))<3: # include 2,1,0,-1,-2 - junction_counts[short_jid].increment_linear_spliced(strand_info) - splicedbsj[rid]=1 # aka read is spliced and is spliced at BSJ - count4+=1 + for k in range(j + 1): + add_coords += int(cigart[k][1]) + start = int(read.reference_start) + add_coords + 1 + end = int(start) + int(cigart[j + 1][1]) - 1 + if ( + abs(int(start) - int(jstart)) < 3 + or abs(int(end) - int(jend)) < 3 + ): # include 2,1,0,-1,-2 + 
junction_counts[short_jid].increment_linear_spliced( + strand_info + ) + splicedbsj[ + rid + ] = 1 # aka read is spliced and is spliced at BSJ + count4 += 1 # splicedbsjjid[jid][rid]=1 break - if (count1%100000==0) and (last_printed!=count1): - last_printed=count1 - print("%s | ...Processed %d reads/readpairs (%d were spliced! %d linear around BSJ! %d spliced at BSJ)"%(get_ctime(),count1,len(spliced),count2,len(splicedbsj))) - print("%s | Done processing alignments: %d reads/readpairs (%d were spliced! %d linear around BSJ! %d spliced at BSJ)"%(get_ctime(),count1,len(spliced),count2,len(splicedbsj))) + if (count1 % 100000 == 0) and (last_printed != count1): + last_printed = count1 + print( + "%s | ...Processed %d reads/readpairs (%d were spliced! %d linear around BSJ! %d spliced at BSJ)" + % (get_ctime(), count1, len(spliced), count2, len(splicedbsj)) + ) + print( + "%s | Done processing alignments: %d reads/readpairs (%d were spliced! %d linear around BSJ! %d spliced at BSJ)" + % (get_ctime(), count1, len(spliced), count2, len(splicedbsj)) + ) if lenoutputbams != 0: - for k,v in outputbams.items(): + for k, v in outputbams.items(): v.close() samfile.reset() - print("%s | Writing spliced BAMs ..."%(get_ctime())) + print("%s | Writing spliced BAMs ..." 
% (get_ctime())) for read in samfile.fetch(): rid = read.query_name - if rid in spliced : splicedbam.write(read) - if rid in splicedbsj : + if rid in spliced: + splicedbam.write(read) + if rid in splicedbsj: jid = rid2jid[rid] # converted_jid = _convertjid(jid) - chrom, jstart, jend, strand_info, short_jid, converted_jid, read_strand = _get_jinfo(jid) - read.set_tag("RG", converted_jid , value_type="Z") + ( + chrom, + jstart, + jend, + strand_info, + short_jid, + converted_jid, + read_strand, + ) = _get_jinfo(jid) + read.set_tag("RG", converted_jid, value_type="Z") splicedbsjbam.write(read) - if read_strand=="+": splicedbsjplusbam.write(read) - if read_strand=="-": splicedbsjminusbam.write(read) + if read_strand == "+": + splicedbsjplusbam.write(read) + if read_strand == "-": + splicedbsjminusbam.write(read) samfile.close() outbam.close() @@ -352,21 +573,35 @@ def main(): splicedbsjbam.close() splicedbsjplusbam.close() splicedbsjminusbam.close() - print("%s | Closing all BAMs"%(get_ctime())) - args.countsfound.write("#chrom\tstart\tend\tstrand\tlinear_BSJ_reads_same_strand\tlinear_spliced_BSJ_reads_same_strand\tlinear_BSJ_reads_opposite_strand\tlinear_spliced_BSJ_reads_opposite_strand\n") + print("%s | Closing all BAMs" % (get_ctime())) + args.countsfound.write( + "#chrom\tstart\tend\tstrand\tlinear_BSJ_reads_same_strand\tlinear_spliced_BSJ_reads_same_strand\tlinear_BSJ_reads_opposite_strand\tlinear_spliced_BSJ_reads_opposite_strand\n" + ) for short_jid in junction_counts.keys(): - chrom=junction_counts[short_jid].chrom - start=junction_counts[short_jid].start - end=int(junction_counts[short_jid].end)+1 - strand=junction_counts[short_jid].strand - ss_linear_count=junction_counts[short_jid].ss_linear_count - ss_linear_spliced_count=junction_counts[short_jid].ss_linear_spliced_count - os_linear_count=junction_counts[short_jid].os_linear_count - os_linear_spliced_count=junction_counts[short_jid].os_linear_spliced_count - 
args.countsfound.write("%s\t%s\t%s\t%s\t%d\t%d\t%d\t%d\n"%(chrom,str(start),str(end),strand,ss_linear_count,ss_linear_spliced_count,os_linear_count,os_linear_spliced_count)) + chrom = junction_counts[short_jid].chrom + start = junction_counts[short_jid].start + end = int(junction_counts[short_jid].end) + 1 + strand = junction_counts[short_jid].strand + ss_linear_count = junction_counts[short_jid].ss_linear_count + ss_linear_spliced_count = junction_counts[short_jid].ss_linear_spliced_count + os_linear_count = junction_counts[short_jid].os_linear_count + os_linear_spliced_count = junction_counts[short_jid].os_linear_spliced_count + args.countsfound.write( + "%s\t%s\t%s\t%s\t%d\t%d\t%d\t%d\n" + % ( + chrom, + str(start), + str(end), + strand, + ss_linear_count, + ss_linear_spliced_count, + os_linear_count, + os_linear_spliced_count, + ) + ) args.countsfound.close() - print("%s | DONE!!"%(get_ctime())) + print("%s | DONE!!" % (get_ctime())) if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_filter_linear_spliced_readids_w_rid2jid.py b/workflow/scripts/_filter_linear_spliced_readids_w_rid2jid.py index 7bbace0..f314ce7 100755 --- a/workflow/scripts/_filter_linear_spliced_readids_w_rid2jid.py +++ b/workflow/scripts/_filter_linear_spliced_readids_w_rid2jid.py @@ -3,97 +3,135 @@ import argparse import gzip + def main(): # debug = True debug = False parser = argparse.ArgumentParser( - description=""" Filter read list to only include those that are part of the rid2jid lookup! - """ + description=""" Filter read list to only include those that are part of the rid2jid lookup! 
+ """ ) # INPUTs - parser.add_argument("--linearin",dest="linearin",required=True,type=str, - help="gzip-ed input linear readid list") - parser.add_argument("--splicedin",dest="splicedin",required=True,type=str, - help="gzip-ed input splicedin readid list") - parser.add_argument('-r',"--rid2jid",dest="rid2jid",required=True,type=str, - help="gzip-ed rid2jid lookup") + parser.add_argument( + "--linearin", + dest="linearin", + required=True, + type=str, + help="gzip-ed input linear readid list", + ) + parser.add_argument( + "--splicedin", + dest="splicedin", + required=True, + type=str, + help="gzip-ed input splicedin readid list", + ) + parser.add_argument( + "-r", + "--rid2jid", + dest="rid2jid", + required=True, + type=str, + help="gzip-ed rid2jid lookup", + ) # OUTPUTs - parser.add_argument("--linearout",dest="linearout",required=True,type=str, - help="gzip-ed output linear readid list") - parser.add_argument("--splicedout",dest="splicedout",required=True,type=str, - help="gzip-ed output linear readid list") - parser.add_argument("--jidcounts",dest="jidcounts",required=True,type=str, - help="gzip-ed output linear readid list") + parser.add_argument( + "--linearout", + dest="linearout", + required=True, + type=str, + help="gzip-ed output linear readid list", + ) + parser.add_argument( + "--splicedout", + dest="splicedout", + required=True, + type=str, + help="gzip-ed output linear readid list", + ) + parser.add_argument( + "--jidcounts", + dest="jidcounts", + required=True, + type=str, + help="gzip-ed output linear readid list", + ) args = parser.parse_args() -# SRR5762377.10004802##- NC_001806.2##88486##88645##+ -# SRR5762377.10008194##+ chrM##1031##1445##+ -# SRR5762377.10010198##+ chr45S##8599##9010##+ + # SRR5762377.10004802##- NC_001806.2##88486##88645##+ + # SRR5762377.10008194##+ chrM##1031##1445##+ + # SRR5762377.10010198##+ chr45S##8599##9010##+ linridlist = dict() sinridlist = dict() - with gzip.open(args.linearin,'rt') as inrl: + with 
gzip.open(args.linearin, "rt") as inrl: for r in inrl: - r=r.strip() - linridlist[r]=1 + r = r.strip() + linridlist[r] = 1 inrl.close() - with gzip.open(args.splicedin,'rt') as inrl: + with gzip.open(args.splicedin, "rt") as inrl: for r in inrl: - r=r.strip() - sinridlist[r]=1 + r = r.strip() + sinridlist[r] = 1 inrl.close() - scount=dict() - lcount=dict() - with gzip.open(args.rid2jid,'rt') as rid2jid: + scount = dict() + lcount = dict() + with gzip.open(args.rid2jid, "rt") as rid2jid: for l in rid2jid: - l=l.strip().split("\t") - jid=l[1] - if jid==".": - print(">>>>>>>>jid is dot:",l) + l = l.strip().split("\t") + jid = l[1] + if jid == ".": + print(">>>>>>>>jid is dot:", l) # jchr,jstart,jend,jstrand=jid.split("##") # jid2="##".join([jchr,jstart,jend]) - jid2=jid + jid2 = jid if not jid2 in scount: - scount[jid2]=dict() - lcount[jid2]=dict() - scount[jid2]["+"]=0 - scount[jid2]["-"]=0 - scount[jid2]["."]=0 - lcount[jid2]["+"]=0 - lcount[jid2]["-"]=0 - lcount[jid2]["."]=0 + scount[jid2] = dict() + lcount[jid2] = dict() + scount[jid2]["+"] = 0 + scount[jid2]["-"] = 0 + scount[jid2]["."] = 0 + lcount[jid2]["+"] = 0 + lcount[jid2]["-"] = 0 + lcount[jid2]["."] = 0 if "##" in l[0]: - rid,rstrand=l[0].split("##") + rid, rstrand = l[0].split("##") else: - rid=l[0] - rstrand="." + rid = l[0] + rstrand = "." 
if rid in linridlist: - linridlist[rid]+=1 - lcount[jid][rstrand]+=1 + linridlist[rid] += 1 + lcount[jid][rstrand] += 1 if rid in sinridlist: - sinridlist[rid]+=1 - scount[jid][rstrand]+=1 + sinridlist[rid] += 1 + scount[jid][rstrand] += 1 rid2jid.close() - with gzip.open(args.linearout,'wt') as outrl: - for k,v in linridlist.items(): - if v!=1: - outrl.write("%s\n"%k) + with gzip.open(args.linearout, "wt") as outrl: + for k, v in linridlist.items(): + if v != 1: + outrl.write("%s\n" % k) outrl.close() - with gzip.open(args.splicedout,'wt') as outrl: - for k,v in sinridlist.items(): - if v!=1: - outrl.write("%s\n"%k) + with gzip.open(args.splicedout, "wt") as outrl: + for k, v in sinridlist.items(): + if v != 1: + outrl.write("%s\n" % k) outrl.close() - countout=open(args.jidcounts,'w') + countout = open(args.jidcounts, "w") # countout.write("#chrom\tstart\tend\tlinear_+\tspliced_+\tlinear_-\tspliced_-\tlinear_.\tspliced_.\n") - countout.write("#chrom\tstart\tend\tstrand\tlinear_+\tspliced_+\tlinear_-\tspliced_-\tlinear_.\tspliced_.\n") + countout.write( + "#chrom\tstart\tend\tstrand\tlinear_+\tspliced_+\tlinear_-\tspliced_-\tlinear_.\tspliced_.\n" + ) for k in lcount.keys(): - v1=lcount[k] - v2=scount[k] - kstr=k.split("##") - k="\t".join(kstr) - countout.write("%s\t%d\t%d\t%d\t%d\t%d\t%d\n"%(k,v1["+"],v2["+"],v1["-"],v2["-"],v1["."],v2["."])) + v1 = lcount[k] + v2 = scount[k] + kstr = k.split("##") + k = "\t".join(kstr) + countout.write( + "%s\t%d\t%d\t%d\t%d\t%d\t%d\n" + % (k, v1["+"], v2["+"], v1["-"], v2["-"], v1["."], v2["."]) + ) countout.close() + if __name__ == "__main__": main() diff --git a/workflow/scripts/_make_master_counts_table.py b/workflow/scripts/_make_master_counts_table.py index 1a81dbf..75328a9 100755 --- a/workflow/scripts/_make_master_counts_table.py +++ b/workflow/scripts/_make_master_counts_table.py @@ -1,28 +1,45 @@ import pandas as pd import argparse -def _df_setcol_as_int(df,collist): + +def _df_setcol_as_int(df, collist): for c in 
collist: - df[[c]]=df[[c]].astype(int) + df[[c]] = df[[c]].astype(int) return df -def _df_setcol_as_str(df,collist): + +def _df_setcol_as_str(df, collist): for c in collist: - df[[c]]=df[[c]].astype(str) + df[[c]] = df[[c]].astype(str) return df -def _df_setcol_as_float(df,collist): + +def _df_setcol_as_float(df, collist): for c in collist: - df[[c]]=df[[c]].astype(float) + df[[c]] = df[[c]].astype(float) return df -def main() : - parser = argparse.ArgumentParser(description='Make Master Counts Table with circExplorer_BWA fixes') - parser.add_argument('--counttablelist', dest='counttablelist', type=str, required=True, - help='comma separted list of per sample counts tables to merge') - parser.add_argument('--minreads', dest='minreads', type=int, required=False, default=3, - help='min read filter') - parser.add_argument('-o',dest='outfile',required=True,help='master counts table') + +def main(): + parser = argparse.ArgumentParser( + description="Make Master Counts Table with circExplorer_BWA fixes" + ) + parser.add_argument( + "--counttablelist", + dest="counttablelist", + type=str, + required=True, + help="comma separted list of per sample counts tables to merge", + ) + parser.add_argument( + "--minreads", + dest="minreads", + type=int, + required=False, + default=3, + help="min read filter", + ) + parser.add_argument("-o", dest="outfile", required=True, help="master counts table") args = parser.parse_args() infiles = args.counttablelist @@ -30,37 +47,37 @@ def main() : count = 0 for f in infiles: count += 1 - if count==1: - outdf = pd.read_csv(f,sep="\t",header=0,compression='gzip') - outdf.set_index(['chrom', 'start', 'end', 'sample_name']) + if count == 1: + outdf = pd.read_csv(f, sep="\t", header=0, compression="gzip") + outdf.set_index(["chrom", "start", "end", "sample_name"]) else: - tmpdf = pd.read_csv(f,sep="\t",header=0,compression='gzip') - tmpdf.set_index(['chrom', 'start', 'end', 'sample_name']) - outdf = pd.concat([outdf , 
tmpdf],axis=0,join="outer",sort=False) - outdf.reset_index(drop=True,inplace=True) - outdf.fillna(-1,inplace=True) + tmpdf = pd.read_csv(f, sep="\t", header=0, compression="gzip") + tmpdf.set_index(["chrom", "start", "end", "sample_name"]) + outdf = pd.concat([outdf, tmpdf], axis=0, join="outer", sort=False) + outdf.reset_index(drop=True, inplace=True) + outdf.fillna(-1, inplace=True) # print(outdf.columns) - intcols=['start','end','ntools'] + intcols = ["start", "end", "ntools"] for c in outdf.columns: if "count" in c: intcols.append(c) # print(intcols) - strcols=list(set(outdf.columns)-set(intcols)) + strcols = list(set(outdf.columns) - set(intcols)) # print(strcols) - outdf = _df_setcol_as_int(outdf,intcols) - outdf = _df_setcol_as_str(outdf,strcols) - outdf = outdf.sort_values(by=['chrom','start','end', 'sample_name']) - + outdf = _df_setcol_as_int(outdf, intcols) + outdf = _df_setcol_as_str(outdf, strcols) + outdf = outdf.sort_values(by=["chrom", "start", "end", "sample_name"]) - intcols=['start','end','ntools'] + intcols = ["start", "end", "ntools"] for c in outdf.columns: if "count" in c: intcols.append(c) - strcols=list(set(outdf.columns)-set(intcols)) - outdf = _df_setcol_as_int(outdf,intcols) - outdf = _df_setcol_as_str(outdf,strcols) - outdf = outdf.sort_values(by=['chrom','start','end','sample_name']) - outdf.to_csv(args.outfile,sep="\t",header=True,index=False,compression='gzip') + strcols = list(set(outdf.columns) - set(intcols)) + outdf = _df_setcol_as_int(outdf, intcols) + outdf = _df_setcol_as_str(outdf, strcols) + outdf = outdf.sort_values(by=["chrom", "start", "end", "sample_name"]) + outdf.to_csv(args.outfile, sep="\t", header=True, index=False, compression="gzip") + if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_merge_circExplorer_found_counts.py b/workflow/scripts/_merge_circExplorer_found_counts.py index 245c260..fbac83e 100755 --- a/workflow/scripts/_merge_circExplorer_found_counts.py 
+++ b/workflow/scripts/_merge_circExplorer_found_counts.py @@ -2,40 +2,63 @@ import sys import pandas -def _df_setcol_as_int(df,collist): + +def _df_setcol_as_int(df, collist): for c in collist: - df[[c]]=df[[c]].astype(int) + df[[c]] = df[[c]].astype(int) return df -def _df_setcol_as_str(df,collist): + +def _df_setcol_as_str(df, collist): for c in collist: - df[[c]]=df[[c]].astype(str) + df[[c]] = df[[c]].astype(str) return df + def main(): # debug = True debug = False - parser = argparse.ArgumentParser( + parser = argparse.ArgumentParser() + parser.add_argument( + "-b", + "--bsjcounts", + dest="bsjcounts", + required=True, + type=str, + help="BSJ counts file", + ) + parser.add_argument( + "-l", + "--linearcounts", + dest="linearcounts", + required=True, + type=str, + help="Linear counts file", + ) + parser.add_argument( + "-o", + "--mergedcounts", + dest="mergedcounts", + required=True, + type=str, + help="merged counts file", ) - parser.add_argument("-b","--bsjcounts",dest="bsjcounts",required=True,type=str, - help="BSJ counts file") - parser.add_argument("-l","--linearcounts",dest="linearcounts",required=True,type=str, - help="Linear counts file") - parser.add_argument("-o","--mergedcounts",dest="mergedcounts",required=True,type=str, - help="merged counts file") args = parser.parse_args() - bcounts = pandas.read_csv(args.bsjcounts,header=0,sep="\t") - lcounts = pandas.read_csv(args.linearcounts,header=0,sep="\t") + bcounts = pandas.read_csv(args.bsjcounts, header=0, sep="\t") + lcounts = pandas.read_csv(args.linearcounts, header=0, sep="\t") print(bcounts.head()) print(lcounts.head()) - mcounts = bcounts.merge(lcounts,how='outer',on=["#chrom","start","end","strand"]) - strcols = [ '#chrom', 'strand' ] - intcols = list ( set(mcounts.columns) - set(strcols) ) - mcounts.fillna(value=0,inplace=True) - mcounts = _df_setcol_as_str(mcounts,strcols) - mcounts = _df_setcol_as_int(mcounts,intcols) - mcounts.to_csv(args.mergedcounts,index=False,doublequote=False,sep="\t") 
+ mcounts = bcounts.merge( + lcounts, how="outer", on=["#chrom", "start", "end", "strand"] + ) + strcols = ["#chrom", "strand"] + intcols = list(set(mcounts.columns) - set(strcols)) + mcounts.fillna(value=0, inplace=True) + mcounts = _df_setcol_as_str(mcounts, strcols) + mcounts = _df_setcol_as_int(mcounts, intcols) + mcounts.to_csv(args.mergedcounts, index=False, doublequote=False, sep="\t") + if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/_merge_per_sample_counts_table.py b/workflow/scripts/_merge_per_sample_counts_table.py index 21b5d0f..2c75842 100755 --- a/workflow/scripts/_merge_per_sample_counts_table.py +++ b/workflow/scripts/_merge_per_sample_counts_table.py @@ -4,90 +4,163 @@ import sys import gzip -def _df_setcol_as_int(df,collist): + +def _df_setcol_as_int(df, collist): for c in collist: - df[[c]]=df[[c]].astype(int) + df[[c]] = df[[c]].astype(int) return df -def _df_setcol_as_str(df,collist): + +def _df_setcol_as_str(df, collist): for c in collist: - df[[c]]=df[[c]].astype(str) + df[[c]] = df[[c]].astype(str) return df -def _df_setcol_as_float(df,collist): + +def _df_setcol_as_float(df, collist): for c in collist: - df[[c]]=df[[c]].astype(float) + df[[c]] = df[[c]].astype(float) return df + def _rev_comp(seq): seq = seq.upper() seq = seq.replace("A", "t").replace("C", "g").replace("T", "a").replace("G", "c") seq = seq.upper()[::-1] return seq + class BSJ: - def __init__(self,chrom,start,end,strand="+"): - self.chrom=chrom - self.start=int(start) - self.end=int(end) - self.strand=strand - self.splice_site_flank_5="" #donor - self.splice_site_flank_3="" #acceptor - - def add_flanks(self,sequences): # adds flanking assuming + strand + def __init__(self, chrom, start, end, strand="+"): + self.chrom = chrom + self.start = int(start) + self.end = int(end) + self.strand = strand + self.splice_site_flank_5 = "" # donor + self.splice_site_flank_3 = "" # acceptor + + def add_flanks(self, sequences): # adds 
flanking assuming + strand coord = int(self.end) - seq = sequences[self.chrom][coord:coord+2] + seq = sequences[self.chrom][coord : coord + 2] self.splice_site_flank_5 = seq.upper() coord = int(self.start) - seq = sequences[self.chrom][coord-2:coord] + seq = sequences[self.chrom][coord - 2 : coord] self.splice_site_flank_3 = seq.upper() - - def get_flanks(self): # returns + and - strand flanks - plus_strand = self.splice_site_flank_5+"##"+self.splice_site_flank_3 - minus_strand = _rev_comp(self.splice_site_flank_5)+"##"+_rev_comp(self.splice_site_flank_3) - return plus_strand,minus_strand - -def main() : - parser = argparse.ArgumentParser(description='Merge per sample Counts from different circRNA detection tools') - parser.add_argument('--circExplorer', dest='circE', type=str, required=True, - help='circExplorer2 per-sample counts table') - parser.add_argument('--circExplorerbwa', dest='circEbwa', type=str, required=True, - help='circExplorer2_bwa per-sample counts table') - parser.add_argument('--ciri', dest='ciri', type=str, required=True, - help='ciri2 per-sample output') - parser.add_argument('--findcirc', dest='findcirc', type=str, required=False, - help='findcirc per-sample counts table') - parser.add_argument('--dcc', dest='dcc', type=str, required=False, - help='dcc per-sample counts table') - parser.add_argument('--mapsplice', dest='mapsplice', type=str, required=False, - help='mapsplice per-sample counts table') - parser.add_argument('--nclscan', dest='nclscan', type=str, required=False, - help='nclscan per-sample counts table') - parser.add_argument('--circrnafinder', dest='circrnafinder', type=str, required=False, - help='circrnafinder per-sample counts table') - parser.add_argument('--samplename', dest='samplename', type=str, required=True, - help='Sample Name') - parser.add_argument('--min_read_count_reqd', dest='minreads', type=int, required=False, default=2, - help='Read count threshold..circRNA with lower than this number of read support are 
excluded! (default=2)') - parser.add_argument("--reffa",dest="reffa",required=True,type=argparse.FileType('r'),default=sys.stdin, - help="reference fasta file") - parser.add_argument('--hqcc', dest='hqcc', type=str, required=False, default="circExplorer,circExplorer_bwa", - help='Comma separated list of high confidence core callers (default="circExplorer,circExplorer_bwa")') - parser.add_argument('--hqccpn', dest='hqccpn', type=int, required=False, default=1, - help='Define n:high confidence core callers plus n callers are required to call this circRNA HQ (default 1)') - parser.add_argument('-o',dest='outfile',required=True,help='merged table') + + def get_flanks(self): # returns + and - strand flanks + plus_strand = self.splice_site_flank_5 + "##" + self.splice_site_flank_3 + minus_strand = ( + _rev_comp(self.splice_site_flank_5) + + "##" + + _rev_comp(self.splice_site_flank_3) + ) + return plus_strand, minus_strand + + +def main(): + parser = argparse.ArgumentParser( + description="Merge per sample Counts from different circRNA detection tools" + ) + parser.add_argument( + "--circExplorer", + dest="circE", + type=str, + required=True, + help="circExplorer2 per-sample counts table", + ) + parser.add_argument( + "--circExplorerbwa", + dest="circEbwa", + type=str, + required=True, + help="circExplorer2_bwa per-sample counts table", + ) + parser.add_argument( + "--ciri", dest="ciri", type=str, required=True, help="ciri2 per-sample output" + ) + parser.add_argument( + "--findcirc", + dest="findcirc", + type=str, + required=False, + help="findcirc per-sample counts table", + ) + parser.add_argument( + "--dcc", + dest="dcc", + type=str, + required=False, + help="dcc per-sample counts table", + ) + parser.add_argument( + "--mapsplice", + dest="mapsplice", + type=str, + required=False, + help="mapsplice per-sample counts table", + ) + parser.add_argument( + "--nclscan", + dest="nclscan", + type=str, + required=False, + help="nclscan per-sample counts table", + ) + 
parser.add_argument( + "--circrnafinder", + dest="circrnafinder", + type=str, + required=False, + help="circrnafinder per-sample counts table", + ) + parser.add_argument( + "--samplename", dest="samplename", type=str, required=True, help="Sample Name" + ) + parser.add_argument( + "--min_read_count_reqd", + dest="minreads", + type=int, + required=False, + default=2, + help="Read count threshold: circRNAs with fewer than this number of supporting reads are excluded (default=2)", + ) + parser.add_argument( + "--reffa", + dest="reffa", + required=True, + type=argparse.FileType("r"), + default=sys.stdin, + help="reference fasta file", + ) + parser.add_argument( + "--hqcc", + dest="hqcc", + type=str, + required=False, + default="circExplorer,circExplorer_bwa", + help='Comma-separated list of high-confidence core callers (default="circExplorer,circExplorer_bwa")', + ) + parser.add_argument( + "--hqccpn", + dest="hqccpn", + type=int, + required=False, + default=1, + help="Define n: high-confidence core callers plus n additional callers are required to call this circRNA HQ (default=1)", + ) + parser.add_argument("-o", dest="outfile", required=True, help="merged table") args = parser.parse_args() - sn=args.samplename - hqcc=args.hqcc - hqcc=hqcc.strip().lower().split(",") - hqcclen=len(hqcc) - required_hqcols=[] - not_required_hqcols=[] - dfs=[] + sn = args.samplename + hqcc = args.hqcc + hqcc = hqcc.strip().lower().split(",") + hqcclen = len(hqcc) + required_hqcols = [] + not_required_hqcols = [] + dfs = [] # load circExplorer - circE=pandas.read_csv(args.circE,sep="\t",header=0) + circE = pandas.read_csv(args.circE, sep="\t", header=0) print(circE.columns) # columns are: # | # | Column | @@ -105,67 +178,106 @@ def main() : # | 11 | spliced_- | # | 12 | linear_. | # | 13 | spliced_. 
| - circE['circRNA_id']=circE['#chrom'].astype(str)+"##"+circE['start'].astype(str)+"##"+circE['end'].astype(str) - circE.rename({'strand' : 'circExplorer_strand', - 'known_novel' : 'circExplorer_annotation', - 'expected_BSJ_reads' : 'circExplorer_read_count', - 'found_BSJ_reads' : 'circExplorer_found_BSJcounts', - 'linear_+' : 'circExplorer_found_linear_BSJ_+_counts', - 'spliced_+' : 'circExplorer_found_linear_spliced_BSJ_+_counts', - 'linear_-' : 'circExplorer_found_linear_BSJ_-_counts', - 'spliced_-' : 'circExplorer_found_linear_spliced_BSJ_-_counts', - 'linear_.' : 'circExplorer_found_linear_BSJ_._counts', - 'spliced_.' : 'circExplorer_found_linear_spliced_BSJ_._counts'}, axis=1, inplace=True) - circE.drop(['#chrom','start', 'end'], axis = 1,inplace=True) - circE.set_index(['circRNA_id'],inplace=True,drop=True) - - circE.fillna(value=-1,inplace=True) + circE["circRNA_id"] = ( + circE["#chrom"].astype(str) + + "##" + + circE["start"].astype(str) + + "##" + + circE["end"].astype(str) + ) + circE.rename( + { + "strand": "circExplorer_strand", + "known_novel": "circExplorer_annotation", + "expected_BSJ_reads": "circExplorer_read_count", + "found_BSJ_reads": "circExplorer_found_BSJcounts", + "linear_+": "circExplorer_found_linear_BSJ_+_counts", + "spliced_+": "circExplorer_found_linear_spliced_BSJ_+_counts", + "linear_-": "circExplorer_found_linear_BSJ_-_counts", + "spliced_-": "circExplorer_found_linear_spliced_BSJ_-_counts", + "linear_.": "circExplorer_found_linear_BSJ_._counts", + "spliced_.": "circExplorer_found_linear_spliced_BSJ_._counts", + }, + axis=1, + inplace=True, + ) + circE.drop(["#chrom", "start", "end"], axis=1, inplace=True) + circE.set_index(["circRNA_id"], inplace=True, drop=True) + + circE.fillna(value=-1, inplace=True) print(circE.columns) - intcols = [ 'circExplorer_read_count', - 'circExplorer_found_BSJcounts', - 'circExplorer_found_linear_BSJ_+_counts', - 'circExplorer_found_linear_spliced_BSJ_+_counts', - 
'circExplorer_found_linear_BSJ_-_counts', - 'circExplorer_found_linear_spliced_BSJ_-_counts', - 'circExplorer_found_linear_BSJ_._counts', - 'circExplorer_found_linear_spliced_BSJ_._counts' ] - strcols = list ( set(circE.columns) - set(intcols) ) - circE = _df_setcol_as_int(circE,intcols) - circE = _df_setcol_as_str(circE,strcols) + intcols = [ + "circExplorer_read_count", + "circExplorer_found_BSJcounts", + "circExplorer_found_linear_BSJ_+_counts", + "circExplorer_found_linear_spliced_BSJ_+_counts", + "circExplorer_found_linear_BSJ_-_counts", + "circExplorer_found_linear_spliced_BSJ_-_counts", + "circExplorer_found_linear_BSJ_._counts", + "circExplorer_found_linear_spliced_BSJ_._counts", + ] + strcols = list(set(circE.columns) - set(intcols)) + circE = _df_setcol_as_int(circE, intcols) + circE = _df_setcol_as_str(circE, strcols) dfs.append(circE) - if "circExplorer".lower() in hqcc: + if "circExplorer".lower() in hqcc: required_hqcols.append("circExplorer_read_count") else: not_required_hqcols.append("circExplorer_read_count") - #chrom start end strand read_count known_novel + # chrom start end strand read_count known_novel # circExplorer2 with BWA - circEbwa=pandas.read_csv(args.circEbwa,sep="\t",header=0) - circEbwa['circRNA_id']=circEbwa['#chrom'].astype(str)+"##"+circEbwa['start'].astype(str)+"##"+circEbwa['end'].astype(str) - circEbwa.rename({'strand' : 'circExplorer_bwa_strand', - 'known_novel' : 'circExplorer_bwa_annotation', - 'read_count' : 'circExplorer_bwa_read_count'}, axis=1, inplace=True) - circEbwa.drop(['#chrom','start', 'end'], axis = 1,inplace=True) - circEbwa.set_index(['circRNA_id'],inplace=True,drop=True) - - circEbwa.fillna(value=-1,inplace=True) - - intcols = [ 'circExplorer_bwa_read_count' ] - strcols = list ( set(circEbwa.columns) - set(intcols) ) - circEbwa = _df_setcol_as_int(circEbwa,intcols) - circEbwa = _df_setcol_as_str(circEbwa,strcols) + circEbwa = pandas.read_csv(args.circEbwa, sep="\t", header=0) + circEbwa["circRNA_id"] = ( + 
circEbwa["#chrom"].astype(str) + + "##" + + circEbwa["start"].astype(str) + + "##" + + circEbwa["end"].astype(str) + ) + circEbwa.rename( + { + "strand": "circExplorer_bwa_strand", + "known_novel": "circExplorer_bwa_annotation", + "read_count": "circExplorer_bwa_read_count", + }, + axis=1, + inplace=True, + ) + circEbwa.drop(["#chrom", "start", "end"], axis=1, inplace=True) + circEbwa.set_index(["circRNA_id"], inplace=True, drop=True) + + circEbwa.fillna(value=-1, inplace=True) + + intcols = ["circExplorer_bwa_read_count"] + strcols = list(set(circEbwa.columns) - set(intcols)) + circEbwa = _df_setcol_as_int(circEbwa, intcols) + circEbwa = _df_setcol_as_str(circEbwa, strcols) dfs.append(circEbwa) - if "circExplorer_bwa".lower() in hqcc: + if "circExplorer_bwa".lower() in hqcc: required_hqcols.append("circExplorer_bwa_read_count") else: not_required_hqcols.append("circExplorer_bwa_read_count") # load ciri - ciri=pandas.read_csv(args.ciri,sep="\t",header=0,usecols=['chr', 'circRNA_start', 'circRNA_end', '#junction_reads', '#non_junction_reads', 'circRNA_type', 'strand']) + ciri = pandas.read_csv( + args.ciri, + sep="\t", + header=0, + usecols=[ + "chr", + "circRNA_start", + "circRNA_end", + "#junction_reads", + "#non_junction_reads", + "circRNA_type", + "strand", + ], + ) # columns are: # circRNA_ID chr circRNA_start circRNA_end #junction_reads SM_MS_SMS #non_junction_reads junction_reads_ratio circRNA_type gene_id strand junction_reads_ID # | # | colName | Description | @@ -181,75 +293,97 @@ def main() : # | 9 | circRNA_type | type of a circRNA according to positions of its two ends on chromosome (exon, intron or intergenic_region; only available when annotation file is provided) | # | 10 | gene_id | ID of the gene(s) where an exonic or intronic circRNA locates | # | 11 | strand | strand info of a predicted circRNAs (new in CIRI2) | - # | 12 | junction_reads_ID | all of the circular junction read IDs (split by ",") - 
ciri["circRNA_start"]=ciri["circRNA_start"].astype(int)-1 - ciri['circRNA_id']=ciri['chr'].astype(str)+"##"+ciri['circRNA_start'].astype(str)+"##"+ciri['circRNA_end'].astype(str) - ciri.rename({ 'strand' : 'ciri_strand', - '#junction_reads' : 'ciri_read_count', - '#non_junction_reads' : 'ciri_linear_read_count', - 'circRNA_type' : 'ciri_annotation'}, axis=1, inplace=True) - ciri.drop(['chr','circRNA_start', 'circRNA_end'], axis = 1,inplace=True) - ciri.set_index(['circRNA_id'],inplace=True,drop=True) - - ciri.fillna(value=-1,inplace=True) - - intcols = [ 'ciri_read_count', - 'ciri_linear_read_count' ] - strcols = list ( set(ciri.columns) - set(intcols) ) - ciri = _df_setcol_as_int(ciri,intcols) - if len(strcols) > 0: ciri = _df_setcol_as_str(ciri,strcols) - + # | 12 | junction_reads_ID | all of the circular junction read IDs (split by ",") + ciri["circRNA_start"] = ciri["circRNA_start"].astype(int) - 1 + ciri["circRNA_id"] = ( + ciri["chr"].astype(str) + + "##" + + ciri["circRNA_start"].astype(str) + + "##" + + ciri["circRNA_end"].astype(str) + ) + ciri.rename( + { + "strand": "ciri_strand", + "#junction_reads": "ciri_read_count", + "#non_junction_reads": "ciri_linear_read_count", + "circRNA_type": "ciri_annotation", + }, + axis=1, + inplace=True, + ) + ciri.drop(["chr", "circRNA_start", "circRNA_end"], axis=1, inplace=True) + ciri.set_index(["circRNA_id"], inplace=True, drop=True) + + ciri.fillna(value=-1, inplace=True) + + intcols = ["ciri_read_count", "ciri_linear_read_count"] + strcols = list(set(ciri.columns) - set(intcols)) + ciri = _df_setcol_as_int(ciri, intcols) + if len(strcols) > 0: + ciri = _df_setcol_as_str(ciri, strcols) + dfs.append(ciri) - if "ciri".lower() in hqcc: + if "ciri".lower() in hqcc: required_hqcols.append("ciri_read_count") else: not_required_hqcols.append("ciri_read_count") if args.findcirc: - findcirc=pandas.read_csv(args.findcirc,sep="\t",header=0) + findcirc = pandas.read_csv(args.findcirc, sep="\t", header=0) print(findcirc.columns) 
-# add find_circ -# | # | short_name | description -# | -- | --------------- | ---------------------------------------------------------------------------------------------------------------- | -# | 1 | chrom | chromosome/contig name | -# | 2 | start | left splice site (zero-based) | -# | 3 | end | right splice site (zero-based). (Always: end > start. 5' 3' depends on strand) | -# | 4 | name | (provisional) running number/name assigned to junction | -# | 5 | n_reads | number of reads supporting the junction (BED 'score') | -# | 6 | strand | genomic strand (+ or -) | -# | 7 | n_uniq | number of distinct read sequences supporting the junction | -# | 8 | uniq_bridges | number of reads with both anchors aligning uniquely | -# | 9 | best_qual_left | alignment score margin of the best anchor alignment supporting the left splice junction (max=2 \* anchor_length) | -# | 10 | best_qual_right | same for the right splice site | -# | 11 | tissues | comma-separated, alphabetically sorted list of tissues/samples with this junction | -# | 12 | tiss_counts | comma-separated list of corresponding read-counts | -# | 13 | edits | number of mismatches in the anchor extension process | -# | 14 | anchor_overlap | number of nucleotides the breakpoint resides within one anchor | -# | 15 | breakpoints | number of alternative ways to break the read with flanking GT/AG | -# | 16 | signal | flanking dinucleotide splice signal (normally GT/AG) | -# | 17 | strandmatch | 'MATCH', 'MISMATCH' or 'NA' for non-stranded analysis | -# | 18 | category | list of keywords describing the junction. 
Useful for quick grep filtering | - findcirc['circRNA_id']=findcirc['chrom'].astype(str)+"##"+findcirc['start'].astype(str)+"##"+findcirc['end'].astype(str) - findcirc = findcirc.loc[:, ['circRNA_id', 'n_reads', 'strand']] - findcirc.rename({ 'strand' : 'findcirc_strand', - 'n_reads' : 'findcirc_read_count'}, axis=1, inplace=True) - findcirc.set_index(['circRNA_id'],inplace=True,drop=True) - - findcirc.fillna(value=-1,inplace=True) - - intcols = [ 'findcirc_read_count' ] - strcols = list ( set(findcirc.columns) - set(intcols) ) - findcirc = _df_setcol_as_int(findcirc,intcols) - if len(strcols) > 0: findcirc = _df_setcol_as_str(findcirc,strcols) + # add find_circ + # | # | short_name | description + # | -- | --------------- | ---------------------------------------------------------------------------------------------------------------- | + # | 1 | chrom | chromosome/contig name | + # | 2 | start | left splice site (zero-based) | + # | 3 | end | right splice site (zero-based). (Always: end > start. 
5' 3' depends on strand) | + # | 4 | name | (provisional) running number/name assigned to junction | + # | 5 | n_reads | number of reads supporting the junction (BED 'score') | + # | 6 | strand | genomic strand (+ or -) | + # | 7 | n_uniq | number of distinct read sequences supporting the junction | + # | 8 | uniq_bridges | number of reads with both anchors aligning uniquely | + # | 9 | best_qual_left | alignment score margin of the best anchor alignment supporting the left splice junction (max=2 \* anchor_length) | + # | 10 | best_qual_right | same for the right splice site | + # | 11 | tissues | comma-separated, alphabetically sorted list of tissues/samples with this junction | + # | 12 | tiss_counts | comma-separated list of corresponding read-counts | + # | 13 | edits | number of mismatches in the anchor extension process | + # | 14 | anchor_overlap | number of nucleotides the breakpoint resides within one anchor | + # | 15 | breakpoints | number of alternative ways to break the read with flanking GT/AG | + # | 16 | signal | flanking dinucleotide splice signal (normally GT/AG) | + # | 17 | strandmatch | 'MATCH', 'MISMATCH' or 'NA' for non-stranded analysis | + # | 18 | category | list of keywords describing the junction. 
Useful for quick grep filtering | + findcirc["circRNA_id"] = ( + findcirc["chrom"].astype(str) + + "##" + + findcirc["start"].astype(str) + + "##" + + findcirc["end"].astype(str) + ) + findcirc = findcirc.loc[:, ["circRNA_id", "n_reads", "strand"]] + findcirc.rename( + {"strand": "findcirc_strand", "n_reads": "findcirc_read_count"}, + axis=1, + inplace=True, + ) + findcirc.set_index(["circRNA_id"], inplace=True, drop=True) + + findcirc.fillna(value=-1, inplace=True) + + intcols = ["findcirc_read_count"] + strcols = list(set(findcirc.columns) - set(intcols)) + findcirc = _df_setcol_as_int(findcirc, intcols) + if len(strcols) > 0: + findcirc = _df_setcol_as_str(findcirc, strcols) dfs.append(findcirc) - if "findcirc".lower() in hqcc: + if "findcirc".lower() in hqcc: required_hqcols.append("findcirc_read_count") else: not_required_hqcols.append("findcirc_read_count") # load dcc if args.dcc: - dcc=pandas.read_csv(args.dcc,sep="\t",header=0) + dcc = pandas.read_csv(args.dcc, sep="\t", header=0) # output dcc.counts_table.tsv has the following columns: # | # | ColName | # |---|----------------| @@ -260,33 +394,41 @@ def main() : # | 5 | read_count | # | 6 | linear_read_count| # | 7 | dcc_annotation | --> this is gene##JunctionType##Start-End Region from CircCoordinates file - dcc["start"]=dcc["start"].astype(int)-1 - dcc['circRNA_id']=dcc['chr'].astype(str)+"##"+dcc['start'].astype(str)+"##"+dcc['end'].astype(str) - dcc.rename({'strand': 'dcc_strand'}, axis=1, inplace=True) - dcc.rename({'read_count': 'dcc_read_count'}, axis=1, inplace=True) - dcc.rename({'linear_read_count': 'dcc_linear_read_count'}, axis=1, inplace=True) - dcc[['dcc_gene', 'dcc_junction_type', 'dcc_annotation2']] = dcc['dcc_annotation'].apply(lambda x: pandas.Series(str(x).split("##"))) - dcc.drop(['chr','start', 'end','dcc_annotation'], axis = 1,inplace=True) - dcc.rename({'dcc_annotation2': 'dcc_annotation'}, axis=1, inplace=True) - dcc.set_index(['circRNA_id'],inplace=True,drop=True) - - 
dcc.fillna(value=-1,inplace=True) - - intcols = [ 'dcc_read_count', - 'dcc_linear_read_count' ] - strcols = list ( set(dcc.columns) - set(intcols) ) - dcc = _df_setcol_as_int(dcc,intcols) - if len(strcols) > 0: dcc = _df_setcol_as_str(dcc,strcols) + dcc["start"] = dcc["start"].astype(int) - 1 + dcc["circRNA_id"] = ( + dcc["chr"].astype(str) + + "##" + + dcc["start"].astype(str) + + "##" + + dcc["end"].astype(str) + ) + dcc.rename({"strand": "dcc_strand"}, axis=1, inplace=True) + dcc.rename({"read_count": "dcc_read_count"}, axis=1, inplace=True) + dcc.rename({"linear_read_count": "dcc_linear_read_count"}, axis=1, inplace=True) + dcc[["dcc_gene", "dcc_junction_type", "dcc_annotation2"]] = dcc[ + "dcc_annotation" + ].apply(lambda x: pandas.Series(str(x).split("##"))) + dcc.drop(["chr", "start", "end", "dcc_annotation"], axis=1, inplace=True) + dcc.rename({"dcc_annotation2": "dcc_annotation"}, axis=1, inplace=True) + dcc.set_index(["circRNA_id"], inplace=True, drop=True) + + dcc.fillna(value=-1, inplace=True) + + intcols = ["dcc_read_count", "dcc_linear_read_count"] + strcols = list(set(dcc.columns) - set(intcols)) + dcc = _df_setcol_as_int(dcc, intcols) + if len(strcols) > 0: + dcc = _df_setcol_as_str(dcc, strcols) dfs.append(dcc) - if "DCC".lower() in hqcc: + if "DCC".lower() in hqcc: required_hqcols.append("dcc_read_count") else: not_required_hqcols.append("dcc_read_count") # load mapsplice if args.mapsplice: - mapsplice=pandas.read_csv(args.mapsplice,sep="\t",header=0) + mapsplice = pandas.read_csv(args.mapsplice, sep="\t", header=0) # output .mapslice.counts_table.tsv has the following columns: # | # | ColName | Eg. 
| # |---|----------------------|------------------| @@ -295,35 +437,48 @@ def main() : # | 3 | end | 1223968 | # | 4 | strand | - | # | 5 | read_count | 26 | - # | 6 | mapsplice_annotation | normal##2.811419 | <--"fusion_type"##"entropy" + # | 6 | mapsplice_annotation | normal##2.811419 | <--"fusion_type"##"entropy" # "fusion_type" is either "normal" or "overlapping" ... higher "entropy" values are better! - mapsplice["start"]=mapsplice["start"].astype(int)-1 - mapsplice['circRNA_id']=mapsplice['chrom'].astype(str)+"##"+mapsplice['start'].astype(str)+"##"+mapsplice['end'].astype(str) - mapsplice.rename({'strand': 'mapsplice_strand'}, axis=1, inplace=True) - mapsplice.rename({'read_count': 'mapsplice_read_count'}, axis=1, inplace=True) - mapsplice[['mapsplice_annotation2', 'mapsplice_entropy']] = mapsplice['mapsplice_annotation'].apply(lambda x: pandas.Series(str(x).split("##"))) - mapsplice.drop(['chrom','start', 'end','mapsplice_annotation'], axis = 1,inplace=True) - mapsplice.rename({'mapsplice_annotation2': 'mapsplice_annotation'}, axis=1, inplace=True) - mapsplice.set_index(['circRNA_id'],inplace=True,drop=True) - - mapsplice.fillna(value=-1,inplace=True) - - intcols = [ 'mapsplice_read_count' ] - mapsplice = _df_setcol_as_int(mapsplice,intcols) - floatcols = [ 'mapsplice_entropy' ] - mapsplice = _df_setcol_as_float(mapsplice,floatcols) - strcols = list ( ( set(mapsplice.columns) - set(intcols) ) - set(floatcols) ) - if len(strcols) > 0: mapsplice = _df_setcol_as_str(mapsplice,strcols) + mapsplice["start"] = mapsplice["start"].astype(int) - 1 + mapsplice["circRNA_id"] = ( + mapsplice["chrom"].astype(str) + + "##" + + mapsplice["start"].astype(str) + + "##" + + mapsplice["end"].astype(str) + ) + mapsplice.rename({"strand": "mapsplice_strand"}, axis=1, inplace=True) + mapsplice.rename({"read_count": "mapsplice_read_count"}, axis=1, inplace=True) + mapsplice[["mapsplice_annotation2", "mapsplice_entropy"]] = mapsplice[ + "mapsplice_annotation" + ].apply(lambda x: 
pandas.Series(str(x).split("##"))) + mapsplice.drop( + ["chrom", "start", "end", "mapsplice_annotation"], axis=1, inplace=True + ) + mapsplice.rename( + {"mapsplice_annotation2": "mapsplice_annotation"}, axis=1, inplace=True + ) + mapsplice.set_index(["circRNA_id"], inplace=True, drop=True) + + mapsplice.fillna(value=-1, inplace=True) + + intcols = ["mapsplice_read_count"] + mapsplice = _df_setcol_as_int(mapsplice, intcols) + floatcols = ["mapsplice_entropy"] + mapsplice = _df_setcol_as_float(mapsplice, floatcols) + strcols = list((set(mapsplice.columns) - set(intcols)) - set(floatcols)) + if len(strcols) > 0: + mapsplice = _df_setcol_as_str(mapsplice, strcols) dfs.append(mapsplice) - if "MapSplice".lower() in hqcc: + if "MapSplice".lower() in hqcc: required_hqcols.append("mapsplice_read_count") else: not_required_hqcols.append("mapsplice_read_count") # load nclscan if args.nclscan: - nclscan=pandas.read_csv(args.nclscan,sep="\t",header=0) + nclscan = pandas.read_csv(args.nclscan, sep="\t", header=0) # output nslscan table has the following columns: # | # | ColName | Eg. 
| # |---|----------------------|------------------| @@ -333,35 +488,47 @@ def main() : # | 4 | strand | - | # | 5 | read_count | 26 | # | 6 | nclscan_annotation | 1 | <--1 for intragenic 0 for intergenic - includenclscan=True - if nclscan.shape[0]==0: includenclscan=False + includenclscan = True + if nclscan.shape[0] == 0: + includenclscan = False if includenclscan: - nclscan["start"]=nclscan["start"].astype(int)-1 - nclscan['circRNA_id']=nclscan['chrom'].astype(str)+"##"+nclscan['start'].astype(str)+"##"+nclscan['end'].astype(str) - nclscan.rename({'strand': 'nclscan_strand'}, axis=1, inplace=True) - nclscan.rename({'read_count': 'nclscan_read_count'}, axis=1, inplace=True) - nclscan.drop(['chrom','start', 'end'], axis = 1,inplace=True) - nclscan = _df_setcol_as_str(nclscan,['nclscan_annotation']) - nclscan.loc[nclscan['nclscan_annotation']=="1", 'nclscan_annotation'] = "Intragenic" - nclscan.loc[nclscan['nclscan_annotation']=="0", 'nclscan_annotation'] = "Intergenic" + nclscan["start"] = nclscan["start"].astype(int) - 1 + nclscan["circRNA_id"] = ( + nclscan["chrom"].astype(str) + + "##" + + nclscan["start"].astype(str) + + "##" + + nclscan["end"].astype(str) + ) + nclscan.rename({"strand": "nclscan_strand"}, axis=1, inplace=True) + nclscan.rename({"read_count": "nclscan_read_count"}, axis=1, inplace=True) + nclscan.drop(["chrom", "start", "end"], axis=1, inplace=True) + nclscan = _df_setcol_as_str(nclscan, ["nclscan_annotation"]) + nclscan.loc[ + nclscan["nclscan_annotation"] == "1", "nclscan_annotation" + ] = "Intragenic" + nclscan.loc[ + nclscan["nclscan_annotation"] == "0", "nclscan_annotation" + ] = "Intergenic" # nclscan.loc[nclscan['nclscan_annotation']!="0" and nclscan['nclscan_annotation']!="1" , 'nclscan_annotation'] = "Unknown" - nclscan.set_index(['circRNA_id'],inplace=True,drop=True) + nclscan.set_index(["circRNA_id"], inplace=True, drop=True) - nclscan.fillna(value=-1,inplace=True) + nclscan.fillna(value=-1, inplace=True) - intcols = [ 
'nclscan_read_count' ] - strcols = list ( set(nclscan.columns) - set(intcols) ) - nclscan = _df_setcol_as_int(nclscan,intcols) - if len(strcols) > 0: nclscan = _df_setcol_as_str(nclscan,strcols) + intcols = ["nclscan_read_count"] + strcols = list(set(nclscan.columns) - set(intcols)) + nclscan = _df_setcol_as_int(nclscan, intcols) + if len(strcols) > 0: + nclscan = _df_setcol_as_str(nclscan, strcols) dfs.append(nclscan) - if "NCLscan".lower() in hqcc: + if "NCLscan".lower() in hqcc: required_hqcols.append("nclscan_read_count") else: not_required_hqcols.append("nclscan_read_count") if args.circrnafinder: - circrnafinder=pandas.read_csv(args.circrnafinder,sep="\t",header=0) + circrnafinder = pandas.read_csv(args.circrnafinder, sep="\t", header=0) # output circrnafinder table has the following columns: # | # | ColName | Eg. | # |---|----------------------|------------------| @@ -370,21 +537,30 @@ def main() : # | 3 | end | 1223968 | # | 4 | strand | - | # | 5 | read_count | 26 | - circrnafinder['circRNA_id']=circrnafinder['chr'].astype(str)+"##"+circrnafinder['start'].astype(str)+"##"+circrnafinder['end'].astype(str) - circrnafinder.rename({'strand': 'circrnafinder_strand'}, axis=1, inplace=True) - circrnafinder.rename({'read_count': 'circrnafinder_read_count'}, axis=1, inplace=True) - circrnafinder.drop(['chr','start', 'end'], axis = 1,inplace=True) - circrnafinder.set_index(['circRNA_id'],inplace=True,drop=True) - - circrnafinder.fillna(value=-1,inplace=True) - - intcols = [ 'circrnafinder_read_count' ] - strcols = list ( set(circrnafinder.columns) - set(intcols) ) - circrnafinder = _df_setcol_as_int(circrnafinder,intcols) - if len(strcols) > 0: circrnafinder = _df_setcol_as_str(circrnafinder,strcols) + circrnafinder["circRNA_id"] = ( + circrnafinder["chr"].astype(str) + + "##" + + circrnafinder["start"].astype(str) + + "##" + + circrnafinder["end"].astype(str) + ) + circrnafinder.rename({"strand": "circrnafinder_strand"}, axis=1, inplace=True) + 
circrnafinder.rename( + {"read_count": "circrnafinder_read_count"}, axis=1, inplace=True + ) + circrnafinder.drop(["chr", "start", "end"], axis=1, inplace=True) + circrnafinder.set_index(["circRNA_id"], inplace=True, drop=True) + + circrnafinder.fillna(value=-1, inplace=True) + + intcols = ["circrnafinder_read_count"] + strcols = list(set(circrnafinder.columns) - set(intcols)) + circrnafinder = _df_setcol_as_int(circrnafinder, intcols) + if len(strcols) > 0: + circrnafinder = _df_setcol_as_str(circrnafinder, strcols) dfs.append(circrnafinder) - if "circRNAFinder".lower() in hqcc: + if "circRNAFinder".lower() in hqcc: required_hqcols.append("circrnafinder_read_count") else: not_required_hqcols.append("circrnafinder_read_count") @@ -392,178 +568,214 @@ def main() : # for df in dfs: # print(df.columns) - # merged_counts=pandas.concat(dfs,axis=1,join="outer",sort=False) # merged_counts['circRNA_id']=merged_counts.index -# above pandas.concat not working as expected -# giving error -# File "/vf/users/Ziegelbauer_lab/Pipelines/circRNA/230406_activeDev_20284a3/workflow/scripts/_merge_per_sample_counts_table.py", line 396, in -# main() -# File "/vf/users/Ziegelbauer_lab/Pipelines/circRNA/230406_activeDev_20284a3/workflow/scripts/_merge_per_sample_counts_table.py", line 289, in main -# merged_counts=pandas.concat(dfs,axis=1,join="outer",sort=False) -# File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper -# return func(*args, **kwargs) -# File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 307, in concat -# return op.get_result() -# File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 528, in get_result -# indexers[ax] = obj_labels.get_indexer(new_labels) -# File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3442, in get_indexer -# raise 
InvalidIndexError(self._requires_unique_msg) -# pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects -# HENCE, replacing concat with this: - - for i,df in enumerate(dfs): - if i==0: - merged_counts=df - merged_counts['circRNA_id']=merged_counts.index - merged_counts.reset_index(inplace=True,drop=True) + # above pandas.concat not working as expected + # giving error + # File "/vf/users/Ziegelbauer_lab/Pipelines/circRNA/230406_activeDev_20284a3/workflow/scripts/_merge_per_sample_counts_table.py", line 396, in + # main() + # File "/vf/users/Ziegelbauer_lab/Pipelines/circRNA/230406_activeDev_20284a3/workflow/scripts/_merge_per_sample_counts_table.py", line 289, in main + # merged_counts=pandas.concat(dfs,axis=1,join="outer",sort=False) + # File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper + # return func(*args, **kwargs) + # File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 307, in concat + # return op.get_result() + # File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 528, in get_result + # indexers[ax] = obj_labels.get_indexer(new_labels) + # File "/usr/local/Anaconda/envs/py3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3442, in get_indexer + # raise InvalidIndexError(self._requires_unique_msg) + # pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects + # HENCE, replacing concat with this: + + for i, df in enumerate(dfs): + if i == 0: + merged_counts = df + merged_counts["circRNA_id"] = merged_counts.index + merged_counts.reset_index(inplace=True, drop=True) else: - df['circRNA_id']=df.index - df.reset_index(inplace=True,drop=True) - merged_counts=pandas.merge(merged_counts,df,how='outer',on=['circRNA_id']) - + df["circRNA_id"] = df.index + df.reset_index(inplace=True, drop=True) + merged_counts = 
pandas.merge( + merged_counts, df, how="outer", on=["circRNA_id"] + ) + print(merged_counts.columns) - + # merged_counts.set_index(['circRNA_id'],inplace=True,drop=True) - merged_counts.fillna(-1,inplace=True) - merged_counts[ 'ntools'] = 0 - merged_counts[ 'HQ' ] = "N" - merged_counts[ 'hqcounts' ] = 0 - merged_counts[ 'nonhqcounts' ] = 0 + merged_counts.fillna(-1, inplace=True) + merged_counts["ntools"] = 0 + merged_counts["HQ"] = "N" + merged_counts["hqcounts"] = 0 + merged_counts["nonhqcounts"] = 0 - annotation_cols=['circExplorer_annotation','ciri_annotation'] + annotation_cols = ["circExplorer_annotation", "ciri_annotation"] floatcols = [] - strand_cols = ['circExplorer_strand','circExplorer_bwa_strand','ciri_strand'] - - intcols = [ 'circExplorer_read_count', - 'circExplorer_found_BSJcounts', - 'circExplorer_found_linear_BSJ_+_counts', - 'circExplorer_found_linear_spliced_BSJ_+_counts', - 'circExplorer_found_linear_BSJ_-_counts', - 'circExplorer_found_linear_spliced_BSJ_-_counts', - 'circExplorer_found_linear_BSJ_._counts', - 'circExplorer_found_linear_spliced_BSJ_._counts' ] - - intcols.extend([ 'ciri_read_count', - 'ciri_linear_read_count' ]) - - intcols.extend(['circExplorer_bwa_read_count']) - annotation_cols.extend(['circExplorer_bwa_annotation']) + strand_cols = ["circExplorer_strand", "circExplorer_bwa_strand", "ciri_strand"] + + intcols = [ + "circExplorer_read_count", + "circExplorer_found_BSJcounts", + "circExplorer_found_linear_BSJ_+_counts", + "circExplorer_found_linear_spliced_BSJ_+_counts", + "circExplorer_found_linear_BSJ_-_counts", + "circExplorer_found_linear_spliced_BSJ_-_counts", + "circExplorer_found_linear_BSJ_._counts", + "circExplorer_found_linear_spliced_BSJ_._counts", + ] + + intcols.extend(["ciri_read_count", "ciri_linear_read_count"]) + + intcols.extend(["circExplorer_bwa_read_count"]) + annotation_cols.extend(["circExplorer_bwa_annotation"]) if args.findcirc: - intcols.extend(['findcirc_read_count']) - 
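The commented-out traceback above records why `pandas.concat(dfs, axis=1)` was replaced: column-wise concat has to reindex every frame against the union of all indexes, and that reindexing raises `InvalidIndexError` as soon as any tool reports the same `circRNA_id` label twice. A minimal sketch of the failure and of the merge-based workaround the script now uses (toy frames and values, not real pipeline output):

```python
import pandas as pd

# Two toy per-tool count tables indexed by circRNA_id; the second one
# contains a duplicated index label, which is what breaks concat(axis=1).
a = pd.DataFrame({"ciri_read_count": [5, 7]},
                 index=["chr1##10##99", "chr2##5##50"])
b = pd.DataFrame({"dcc_read_count": [3, 4, 6]},
                 index=["chr1##10##99", "chr1##10##99", "chr3##1##20"])

try:
    pd.concat([a, b], axis=1, join="outer", sort=False)
    concat_failed = False
except Exception:  # pandas.errors.InvalidIndexError in the logged run
    concat_failed = True

# Workaround used in the script: promote the index to a column and
# chain outer merges on circRNA_id instead of concatenating.
merged = None
for df in (a, b):
    df = df.copy()
    df["circRNA_id"] = df.index
    df.reset_index(drop=True, inplace=True)
    merged = df if merged is None else pd.merge(
        merged, df, how="outer", on=["circRNA_id"]
    )

print(concat_failed)
print(merged["circRNA_id"].nunique())
```

Note the trade-off visible even in this toy: a duplicated `circRNA_id` key multiplies rows through the merge (four rows here for three unique IDs), which is why the duplicate-index condition matters in the first place.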
strand_cols.append('findcirc_strand') - + intcols.extend(["findcirc_read_count"]) + strand_cols.append("findcirc_strand") + if args.dcc: - intcols.extend([ 'dcc_read_count', - 'dcc_linear_read_count' ]) - annotation_cols.extend(['dcc_gene','dcc_junction_type','dcc_annotation']) - strand_cols.append('dcc_strand') - + intcols.extend(["dcc_read_count", "dcc_linear_read_count"]) + annotation_cols.extend(["dcc_gene", "dcc_junction_type", "dcc_annotation"]) + strand_cols.append("dcc_strand") + if args.mapsplice: - intcols.extend([ 'mapsplice_read_count' ]) - floatcols.extend([ 'mapsplice_entropy' ]) - annotation_cols.extend(['mapsplice_annotation']) - strand_cols.append('mapsplice_strand') + intcols.extend(["mapsplice_read_count"]) + floatcols.extend(["mapsplice_entropy"]) + annotation_cols.extend(["mapsplice_annotation"]) + strand_cols.append("mapsplice_strand") if args.nclscan and includenclscan: - intcols.extend([ 'nclscan_read_count' ]) - annotation_cols.extend(['nclscan_annotation']) - strand_cols.append('nclscan_strand') - - if args.circrnafinder: - intcols.extend(['circrnafinder_read_count']) - strand_cols.append('circrnafinder_strand') + intcols.extend(["nclscan_read_count"]) + annotation_cols.extend(["nclscan_annotation"]) + strand_cols.append("nclscan_strand") - intcols.extend(['ntools']) - intcols.extend(['hqcounts','nonhqcounts']) - strcols = list ( ( set(merged_counts.columns) - set(intcols) ) - set(floatcols) ) - strcols.append('HQ') - merged_counts = _df_setcol_as_int(merged_counts,intcols) - if len(floatcols)>0: merged_counts = _df_setcol_as_float(merged_counts,floatcols) - merged_counts = _df_setcol_as_str(merged_counts,strcols) + if args.circrnafinder: + intcols.extend(["circrnafinder_read_count"]) + strand_cols.append("circrnafinder_strand") + + intcols.extend(["ntools"]) + intcols.extend(["hqcounts", "nonhqcounts"]) + strcols = list((set(merged_counts.columns) - set(intcols)) - set(floatcols)) + strcols.append("HQ") + merged_counts = 
_df_setcol_as_int(merged_counts, intcols) + if len(floatcols) > 0: + merged_counts = _df_setcol_as_float(merged_counts, floatcols) + merged_counts = _df_setcol_as_str(merged_counts, strcols) # fix annotations == -1 for c in annotation_cols: - merged_counts.loc[merged_counts[c]=="-1" , c] = "Unknown" - + merged_counts.loc[merged_counts[c] == "-1", c] = "Unknown" + for c in required_hqcols: - merged_counts.loc[merged_counts[c] >= args.minreads, 'hqcounts'] += 1 + merged_counts.loc[merged_counts[c] >= args.minreads, "hqcounts"] += 1 for c in not_required_hqcols: - merged_counts.loc[merged_counts[c] >= args.minreads, 'nonhqcounts'] += 1 - - merged_counts.loc[merged_counts['hqcounts'] == hqcclen, 'HQ'] = "Y" - merged_counts.loc[merged_counts['nonhqcounts'] < args.hqccpn, 'HQ'] = "N" - - merged_counts.loc[merged_counts['circExplorer_read_count'] >= args.minreads, 'ntools'] += 1 - merged_counts.loc[merged_counts['ciri_read_count'] >= args.minreads, 'ntools'] += 1 - merged_counts.loc[merged_counts['circExplorer_bwa_read_count'] >= args.minreads, 'ntools'] += 1 - if args.findcirc: merged_counts.loc[merged_counts['findcirc_read_count'] >= args.minreads, 'ntools'] += 1 - if args.dcc: merged_counts.loc[merged_counts['dcc_read_count'] >= args.minreads, 'ntools'] += 1 - if args.mapsplice: merged_counts.loc[merged_counts['mapsplice_read_count'] >= args.minreads, 'ntools'] += 1 - if args.nclscan and includenclscan: merged_counts.loc[merged_counts['nclscan_read_count'] >= args.minreads, 'ntools'] += 1 - if args.circrnafinder: merged_counts.loc[merged_counts['circrnafinder_read_count'] >= args.minreads, 'ntools'] += 1 - merged_counts[['chrom', 'start', 'end']] = merged_counts['circRNA_id'].str.split('##', expand=True) - - merged_counts=_df_setcol_as_int(merged_counts,['start','end','ntools']) - merged_counts=_df_setcol_as_str(merged_counts,['chrom']) + merged_counts.loc[merged_counts[c] >= args.minreads, "nonhqcounts"] += 1 + + merged_counts.loc[merged_counts["hqcounts"] == hqcclen, 
"HQ"] = "Y" + merged_counts.loc[merged_counts["nonhqcounts"] < args.hqccpn, "HQ"] = "N" + + merged_counts.loc[ + merged_counts["circExplorer_read_count"] >= args.minreads, "ntools" + ] += 1 + merged_counts.loc[merged_counts["ciri_read_count"] >= args.minreads, "ntools"] += 1 + merged_counts.loc[ + merged_counts["circExplorer_bwa_read_count"] >= args.minreads, "ntools" + ] += 1 + if args.findcirc: + merged_counts.loc[ + merged_counts["findcirc_read_count"] >= args.minreads, "ntools" + ] += 1 + if args.dcc: + merged_counts.loc[ + merged_counts["dcc_read_count"] >= args.minreads, "ntools" + ] += 1 + if args.mapsplice: + merged_counts.loc[ + merged_counts["mapsplice_read_count"] >= args.minreads, "ntools" + ] += 1 + if args.nclscan and includenclscan: + merged_counts.loc[ + merged_counts["nclscan_read_count"] >= args.minreads, "ntools" + ] += 1 + if args.circrnafinder: + merged_counts.loc[ + merged_counts["circrnafinder_read_count"] >= args.minreads, "ntools" + ] += 1 + merged_counts[["chrom", "start", "end"]] = merged_counts["circRNA_id"].str.split( + "##", expand=True + ) + + merged_counts = _df_setcol_as_int(merged_counts, ["start", "end", "ntools"]) + merged_counts = _df_setcol_as_str(merged_counts, ["chrom"]) # adding flanking sites - merged_counts['flanking_sites_+']="-1" - merged_counts['flanking_sites_-']="-1" + merged_counts["flanking_sites_+"] = "-1" + merged_counts["flanking_sites_-"] = "-1" - sequences = dict((s[1], s[0]) for s in HTSeq.FastaReader(args.reffa, raw_iterator=True)) + sequences = dict( + (s[1], s[0]) for s in HTSeq.FastaReader(args.reffa, raw_iterator=True) + ) for index, row in merged_counts.iterrows(): - bsj = BSJ(chrom=row['chrom'],start=row['start'],end=row['end']) + bsj = BSJ(chrom=row["chrom"], start=row["start"], end=row["end"]) bsj.add_flanks(sequences) plus_flank, minus_flank = bsj.get_flanks() - merged_counts.loc[index, 'flanking_sites_+'] = plus_flank - merged_counts.loc[index, 'flanking_sites_-'] = minus_flank + 
merged_counts.loc[index, "flanking_sites_+"] = plus_flank + merged_counts.loc[index, "flanking_sites_-"] = minus_flank # add samplename - merged_counts['sample_name'] = args.samplename - merged_counts=_df_setcol_as_str(merged_counts,['sample_name','flanking_sites_+','flanking_sites_-']) + merged_counts["sample_name"] = args.samplename + merged_counts = _df_setcol_as_str( + merged_counts, ["sample_name", "flanking_sites_+", "flanking_sites_-"] + ) print(merged_counts.columns) # prepare output ... reorder columns - outcols=['chrom', 'start', 'end'] + outcols = ["chrom", "start", "end"] outcols.extend(strand_cols) - outcols.extend(['flanking_sites_+','flanking_sites_-', 'sample_name', 'ntools', 'HQ']) + outcols.extend( + ["flanking_sites_+", "flanking_sites_-", "sample_name", "ntools", "HQ"] + ) # add circExplorer columns - outcols.extend(['circExplorer_read_count', - 'circExplorer_found_BSJcounts', - 'circExplorer_found_linear_BSJ_+_counts', - 'circExplorer_found_linear_spliced_BSJ_+_counts', - 'circExplorer_found_linear_BSJ_-_counts', - 'circExplorer_found_linear_spliced_BSJ_-_counts', - 'circExplorer_found_linear_BSJ_._counts', - 'circExplorer_found_linear_spliced_BSJ_._counts']) + outcols.extend( + [ + "circExplorer_read_count", + "circExplorer_found_BSJcounts", + "circExplorer_found_linear_BSJ_+_counts", + "circExplorer_found_linear_spliced_BSJ_+_counts", + "circExplorer_found_linear_BSJ_-_counts", + "circExplorer_found_linear_spliced_BSJ_-_counts", + "circExplorer_found_linear_BSJ_._counts", + "circExplorer_found_linear_spliced_BSJ_._counts", + ] + ) # add ciri columns - outcols.extend(['ciri_read_count', - 'ciri_linear_read_count']) + outcols.extend(["ciri_read_count", "ciri_linear_read_count"]) # add circExplorer_BWA columns - outcols.extend(['circExplorer_bwa_read_count']) + outcols.extend(["circExplorer_bwa_read_count"]) # add find_circ columns if args.findcirc: - outcols.extend(['findcirc_read_count']) + outcols.extend(["findcirc_read_count"]) # add DCC 
columns - if args.dcc: - outcols.extend(['dcc_read_count', - 'dcc_linear_read_count']) + if args.dcc: + outcols.extend(["dcc_read_count", "dcc_linear_read_count"]) # add MapSplice columns - if args.mapsplice: outcols.append('mapsplice_read_count') + if args.mapsplice: + outcols.append("mapsplice_read_count") # add NCLscan columns - if args.nclscan and includenclscan: outcols.append('nclscan_read_count') + if args.nclscan and includenclscan: + outcols.append("nclscan_read_count") # add circRNAfinder columns - if args.circrnafinder: outcols.append('circrnafinder_read_count') + if args.circrnafinder: + outcols.append("circrnafinder_read_count") - outcols.extend(['hqcounts','nonhqcounts']) + outcols.extend(["hqcounts", "nonhqcounts"]) # add annotation columns outcols.extend(annotation_cols) merged_counts = merged_counts[outcols] - merged_counts.to_csv(args.outfile,sep="\t",header=True,index=False,compression='gzip') + merged_counts.to_csv( + args.outfile, sep="\t", header=True, index=False, compression="gzip" + ) if __name__ == "__main__": diff --git a/workflow/scripts/_multifasta2separatefastas.sh b/workflow/scripts/_multifasta2separatefastas.sh index e14b198..4f26015 100755 --- a/workflow/scripts/_multifasta2separatefastas.sh +++ b/workflow/scripts/_multifasta2separatefastas.sh @@ -9,4 +9,4 @@ cat $fasta | awk '{ if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fa")} print $0 >> filename close(filename) - }' \ No newline at end of file + }' diff --git a/workflow/scripts/_process_bamtobed.py b/workflow/scripts/_process_bamtobed.py index 222b7ae..a096148 100755 --- a/workflow/scripts/_process_bamtobed.py +++ b/workflow/scripts/_process_bamtobed.py @@ -3,83 +3,104 @@ import argparse import gzip + def main(): # debug = True debug = False - parser = argparse.ArgumentParser( - ) + parser = argparse.ArgumentParser() # INPUTs - parser.add_argument("-i","--inbed",dest="inbed",required=True,type=str, - help="Input bamtobed bed file") + parser.add_argument( + "-i", + 
"--inbed", + dest="inbed", + required=True, + type=str, + help="Input bamtobed bed file", + ) # OUTPUTs - parser.add_argument('-o',"--outbed",dest="outbed",required=True,type=str, - help="Output bed file") - parser.add_argument('-l',"--linear",dest="linear",required=True,type=str, - help="gzip-ed list of linear readids") - parser.add_argument('-s',"--spliced",dest="spliced",required=True,type=str, - help="gzip-ed list of spliced readids") + parser.add_argument( + "-o", "--outbed", dest="outbed", required=True, type=str, help="Output bed file" + ) + parser.add_argument( + "-l", + "--linear", + dest="linear", + required=True, + type=str, + help="gzip-ed list of linear readids", + ) + parser.add_argument( + "-s", + "--spliced", + dest="spliced", + required=True, + type=str, + help="gzip-ed list of spliced readids", + ) args = parser.parse_args() - outbed = open(args.outbed,'w') + outbed = open(args.outbed, "w") pairtest = 0 paired = 0 readname_counts = dict() - with open(args.inbed,'r') as inbed: + with open(args.inbed, "r") as inbed: for l in inbed: - l=l.strip().split("\t") - l1=[] - l2=[] + l = l.strip().split("\t") + l1 = [] + l2 = [] l1.append(l[0]) l2.append(l[0]) l1.append(l[1]) l1.append(l[1]) l2.append(l[2]) l2.append(l[2]) - if "/" in l[3]: # paired end + if "/" in l[3]: # paired end if pairtest == 0: - pairtest=1 - paired=1 - x=l[3].split("/") - readname=x[0] - if x[1]=="1": # pick the strand of mate1 as the read strand - strand=l[5] - else: # if it is mate2 then reverse the strand - if l[5]=="-": - strand="+" - elif l[5]=="+": - strand="-" - else: # if neither + or - is provided the use whatever is provided - strand=l[5] - else: # single end - readname=l[3] - strand=l[5] + pairtest = 1 + paired = 1 + x = l[3].split("/") + readname = x[0] + if x[1] == "1": # pick the strand of mate1 as the read strand + strand = l[5] + else: # if it is mate2 then reverse the strand + if l[5] == "-": + strand = "+" + elif l[5] == "+": + strand = "-" + else: # if neither + or 
- is provided the use whatever is provided + strand = l[5] + else: # single end + readname = l[3] + strand = l[5] if readname in readname_counts: - readname_counts[readname]+=1 + readname_counts[readname] += 1 else: - readname_counts[readname]=1 - readname+="##"+strand + readname_counts[readname] = 1 + readname += "##" + strand l1.append(readname) l2.append(readname) l1.append(".") l2.append(".") l1.append(strand) l2.append(strand) - outbed.write("\t".join(l1)+"\n") - outbed.write("\t".join(l2)+"\n") + outbed.write("\t".join(l1) + "\n") + outbed.write("\t".join(l2) + "\n") inbed.close() outbed.close() # linear = open(args.linear,'w') # spliced = open(args.spliced,'w') limit = 1 - if paired==1: limit=2 - with gzip.open(args.spliced,'wt') as spliced: - with gzip.open(args.linear,'wt') as linear: - for rid,count in readname_counts.items(): - if count>limit: - spliced.write("%s\n"%rid) + if paired == 1: + limit = 2 + with gzip.open(args.spliced, "wt") as spliced: + with gzip.open(args.linear, "wt") as linear: + for rid, count in readname_counts.items(): + if count > limit: + spliced.write("%s\n" % rid) else: - linear.write("%s\n"%rid) + linear.write("%s\n" % rid) spliced.close() linear.close() + if __name__ == "__main__": main() diff --git a/workflow/scripts/annotate_clear_quant.py b/workflow/scripts/annotate_clear_quant.py index 39e3df9..f2238b5 100755 --- a/workflow/scripts/annotate_clear_quant.py +++ b/workflow/scripts/annotate_clear_quant.py @@ -4,16 +4,46 @@ import pandas import sys -indexcol=sys.argv[3] # hg38 or mm39 +indexcol = sys.argv[3] # hg38 or mm39 -lookupfile=sys.argv[1] -annotations=pandas.read_csv(lookupfile,sep="\t",header=0) -annotations.set_index([indexcol],inplace=True) +lookupfile = sys.argv[1] +annotations = pandas.read_csv(lookupfile, sep="\t", header=0) +annotations.set_index([indexcol], inplace=True) -quantfile=sys.argv[2] 
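`_process_bamtobed.py` above encodes two small rules: for paired-end reads named `READID/1` or `READID/2`, mate 1's strand is taken as the fragment strand and mate 2's strand is flipped; and a read is classified as spliced when it produces more BED intervals than one alignment per mate can explain (more than 1 for single-end, more than 2 for paired-end). A compact sketch of both rules, with hypothetical read names:

```python
def fragment_strand(name, strand):
    """Resolve a bamtobed read name to (read_id, fragment_strand).
    Mate 2 aligns opposite the fragment, so its strand is flipped;
    anything other than +/- is passed through unchanged."""
    if "/" in name:  # paired-end naming, e.g. "read7/2"
        rid, mate = name.split("/")
        if mate != "1":
            strand = {"+": "-", "-": "+"}.get(strand, strand)
        return rid, strand
    return name, strand  # single-end: the name is already the read id


def split_spliced_linear(counts, paired):
    """Reads spanning more intervals than expected (count > limit)
    are classified as spliced; the rest are linear."""
    limit = 2 if paired else 1
    spliced = sorted(r for r, c in counts.items() if c > limit)
    linear = sorted(r for r, c in counts.items() if c <= limit)
    return spliced, linear


print(fragment_strand("read7/2", "+"))
print(split_spliced_linear({"read7": 3, "read8": 2}, paired=True))
```

The dict-based strand flip is behaviorally the same as the script's if/elif chain, just condensed.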
-quant=pandas.read_csv(quantfile,sep="\t",header=None,names=["quant_chrom","quant_start","quant_end","quant_name","quant_score","quant_quant_strand","quant_thickStart","quant_thickEnd","quant_itemRgb","quant_exonCount","quant_exonSizes","quant_exonOffsets","quant_readNumber","quant_circType","quant_geneName","quant_isoformName","quant_index","quant_flankIntron","quant_FPBcirc","quant_FPBlinear","quant_CIRCscore"]) -quant[indexcol]=quant.apply(lambda row: row.quant_chrom+":"+str(row.quant_start)+"-"+str(row.quant_end),axis=1) -quant.set_index([indexcol],inplace=True) +quantfile = sys.argv[2] +quant = pandas.read_csv( + quantfile, + sep="\t", + header=None, + names=[ + "quant_chrom", + "quant_start", + "quant_end", + "quant_name", + "quant_score", + "quant_quant_strand", + "quant_thickStart", + "quant_thickEnd", + "quant_itemRgb", + "quant_exonCount", + "quant_exonSizes", + "quant_exonOffsets", + "quant_readNumber", + "quant_circType", + "quant_geneName", + "quant_isoformName", + "quant_index", + "quant_flankIntron", + "quant_FPBcirc", + "quant_FPBlinear", + "quant_CIRCscore", + ], +) +quant[indexcol] = quant.apply( + lambda row: row.quant_chrom + ":" + str(row.quant_start) + "-" + str(row.quant_end), + axis=1, +) +quant.set_index([indexcol], inplace=True) -x=quant.join(annotations) -x.to_csv(quantfile+'.annotated',sep="\t",header=True) +x = quant.join(annotations) +x.to_csv(quantfile + ".annotated", sep="\t", header=True) diff --git a/workflow/scripts/apply_junction_filters.py b/workflow/scripts/apply_junction_filters.py index f9173ea..47e29f9 100755 --- a/workflow/scripts/apply_junction_filters.py +++ b/workflow/scripts/apply_junction_filters.py @@ -2,63 +2,106 @@ import argparse import os + def my_bool(s): - return s != 'False' + return s != "False" + -parser = argparse.ArgumentParser(description='apply junction filters, input stdin and output stdout') -parser.add_argument('--regions', dest='regions', type=str, required=True, metavar="absolute path to regions 
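`annotate_clear_quant.py` above builds a `chrom:start-end` key on the CLEAR quant table and joins it against an annotation lookup indexed on the same key. A toy version of that join (coordinates and the `gene` column are made up for illustration):

```python
import pandas as pd

indexcol = "hg38"  # sys.argv[3] in the script

quant = pd.DataFrame({
    "quant_chrom": ["chr1", "chr2"],
    "quant_start": [100, 500],
    "quant_end": [200, 900],
})
# Same key construction as the script: "chrom:start-end".
quant[indexcol] = quant.apply(
    lambda r: r.quant_chrom + ":" + str(r.quant_start) + "-" + str(r.quant_end),
    axis=1,
)
quant.set_index([indexcol], inplace=True)

annotations = pd.DataFrame(
    {indexcol: ["chr1:100-200"], "gene": ["GENE_A"]}
).set_index(indexcol)

# DataFrame.join matches on index; unmatched quant rows keep NaN,
# i.e. this is a left join, as in the script.
annotated = quant.join(annotations)
print(annotated["gene"].tolist())
```

Because `join` is a left join on the index, every quant row survives annotation; rows without a lookup hit simply carry NaN in the annotation columns.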
file", - help='regions file') -parser.add_argument('--filter1regions', dest='filter1regions', type=str, required=True, metavar="eg. \"hg38,ERCC,rRNA\"", - help='comma separated list of regions to apply filter1 on ... filter2 is applied to all other regions') -parser.add_argument('--filter1_noncanonical', dest='filter1_noncanonical', default=True, type=my_bool, required=True, metavar="\"True/False\"", - help='apply canonical filter on filter1') -parser.add_argument('--filter1_unannotated', dest='filter1_unannotated', default=True, type=my_bool, required=True, metavar="\"True/False\"", - help='apply unannotated filter on filter1') -parser.add_argument('--filter2_noncanonical', dest='filter2_noncanonical', default=False, type=my_bool, required=True, metavar="\"True/False\"", - help='apply canonical filter on filter2') -parser.add_argument('--filter2_unannotated', dest='filter2_unannotated', default=False, type=my_bool, required=True, metavar="\"True/False\"", - help='apply unannotated filter on filter2') +parser = argparse.ArgumentParser( + description="apply junction filters, input stdin and output stdout" +) +parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + metavar="absolute path to regions file", + help="regions file", +) +parser.add_argument( + "--filter1regions", + dest="filter1regions", + type=str, + required=True, + metavar='eg. "hg38,ERCC,rRNA"', + help="comma separated list of regions to apply filter1 on ... 
filter2 is applied to all other regions", +) +parser.add_argument( + "--filter1_noncanonical", + dest="filter1_noncanonical", + default=True, + type=my_bool, + required=True, + metavar='"True/False"', + help="apply canonical filter on filter1", +) +parser.add_argument( + "--filter1_unannotated", + dest="filter1_unannotated", + default=True, + type=my_bool, + required=True, + metavar='"True/False"', + help="apply unannotated filter on filter1", +) +parser.add_argument( + "--filter2_noncanonical", + dest="filter2_noncanonical", + default=False, + type=my_bool, + required=True, + metavar='"True/False"', + help="apply canonical filter on filter2", +) +parser.add_argument( + "--filter2_unannotated", + dest="filter2_unannotated", + default=False, + type=my_bool, + required=True, + metavar='"True/False"', + help="apply unannotated filter on filter2", +) args = parser.parse_args() -chr2region=dict() -regions=list() +chr2region = dict() +regions = list() x = open(args.regions) for r in x.readlines(): - r = r.strip().split("\t") - regions.append(r[0]) - for c in r[1].split(): - chr2region[c]=r[0] + r = r.strip().split("\t") + regions.append(r[0]) + for c in r[1].split(): + chr2region[c] = r[0] x.close() -region2filter=dict() +region2filter = dict() for x in regions: - region2filter[x]=2 # apply filter2 to everything + region2filter[x] = 2 # apply filter2 to everything -filter1regions=args.filter1regions +filter1regions = args.filter1regions for f in filter1regions.split(","): - f = f.strip() - if not f in region2filter: - exit("Region "+f+" not defined!") - region2filter[f]=1 # change filter from filter2 to filter1 + f = f.strip() + if not f in region2filter: + exit("Region " + f + " not defined!") + region2filter[f] = 1 # change filter from filter2 to filter1 # cat {input} |sort|uniq|awk -F \"\\t\" '{{if ($5>0 && $6==1) {{print}}}}'|cut -f1-4|sort -k1,1 -k2,2n|uniq > {output.pass1sjtab} for line in sys.stdin: - l=line.split("\t") - f=region2filter[chr2region[l[0]]] - if 
f==1: - if args.filter1_noncanonical: - if not int(l[4])>0: - continue - if args.filter1_unannotated: - if not int(l[5])==1: - continue - elif f==2: - if args.filter2_noncanonical: - if not int(l[4])>0: - continue - if args.filter2_unannotated: - if not int(l[5])==1: - continue - sys.stdout.write(line) - # exit() - + l = line.split("\t") + f = region2filter[chr2region[l[0]]] + if f == 1: + if args.filter1_noncanonical: + if not int(l[4]) > 0: + continue + if args.filter1_unannotated: + if not int(l[5]) == 1: + continue + elif f == 2: + if args.filter2_noncanonical: + if not int(l[4]) > 0: + continue + if args.filter2_unannotated: + if not int(l[5]) == 1: + continue + sys.stdout.write(line) + # exit() diff --git a/workflow/scripts/bam_get_max_readlen.py b/workflow/scripts/bam_get_max_readlen.py index d16edef..25e3001 100755 --- a/workflow/scripts/bam_get_max_readlen.py +++ b/workflow/scripts/bam_get_max_readlen.py @@ -8,17 +8,19 @@ def main(): parser = argparse.ArgumentParser( description="Print out the maximum aligned read length in the input BAM" ) - parser.add_argument("-i","--bam",dest="inbam",required=True,type=str, - help="Input BAM file") + parser.add_argument( + "-i", "--bam", dest="inbam", required=True, type=str, help="Input BAM file" + ) args = parser.parse_args() samfile = pysam.AlignmentFile(args.inbam, "rb") - maxrl=0 + maxrl = 0 for read in samfile.fetch(): rl = int(read.query_length) - if rl > maxrl: maxrl=rl + if rl > maxrl: + maxrl = rl samfile.close() print(maxrl) if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/bam_split_by_regions.py b/workflow/scripts/bam_split_by_regions.py index 9da14e2..abd1ba5 100755 --- a/workflow/scripts/bam_split_by_regions.py +++ b/workflow/scripts/bam_split_by_regions.py @@ -3,46 +3,51 @@ import os import time + def get_ctime(): return time.ctime(time.time()) -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - 
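The stdin loop in `apply_junction_filters.py` maps each chromosome to a region, picks that region's filter, and then tests two fields of the STAR `SJ.out.tab` record: column 5 (`l[4]`, the intron motif, where 0 means non-canonical) and column 6 (`l[5]`, the annotation flag, where 1 means annotated). The per-line decision can be sketched as a standalone function (field values below are illustrative):

```python
def junction_passes(fields, require_canonical, require_annotated):
    """One STAR SJ.out.tab record as a list of strings: fields[4] is the
    intron motif (0 = non-canonical) and fields[5] the annotation flag
    (1 = annotated). Mirrors the script's filter1/filter2 checks."""
    if require_canonical and not int(fields[4]) > 0:
        return False
    if require_annotated and not int(fields[5]) == 1:
        return False
    return True


# filter1 defaults are strict (both checks on); filter2 defaults are lenient.
sj_canonical_annotated = ["chr1", "100", "200", "1", "2", "1", "5", "0", "20"]
sj_noncanonical = ["chr1", "300", "400", "0", "0", "0", "2", "0", "15"]

print(junction_passes(sj_canonical_annotated, True, True))
print(junction_passes(sj_noncanonical, True, True))
print(junction_passes(sj_noncanonical, False, False))
```

This matches the awk one-liner quoted in the script's comment (`$5>0 && $6==1`) when both checks are enabled.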
additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() + +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions + -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." 
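`read_regions` above parses a two-column regions file (region name, then whitespace-separated sequence names) and labels each region as host, additive, or virus according to the comma-separated CLI lists. A miniature in-memory version of the same classification; the region names (`hg38`, `ERCC`, `KSHV`) and sequence names are illustrative, with `KSHV` as a hypothetical label for the `NC_009333.1` viral sequence mentioned in the script's help text:

```python
def classify_regions(lines, host, additives, viruses):
    """lines: iterable of 'region<TAB>seq1 seq2 ...' records, as in
    ref.fa.regions. Returns the same nested dict shape as read_regions."""
    host = host.split(",")
    additives = additives.split(",")
    viruses = viruses.split(",")
    regions = {}
    for l in lines:
        name, seqs = l.strip().split("\t")
        if name in host:
            kind = "host"
        elif name in additives:
            kind = "additive"
        elif name in viruses:
            kind = "virus"
        else:
            raise SystemExit("%s has unknown region" % name)
        regions[name] = {
            "host_additive_virus": kind,
            "sequences": {s: 1 for s in seqs.split()},
        }
    return regions


regions = classify_regions(
    ["hg38\tchr1 chr2", "ERCC\tERCC-1", "KSHV\tNC_009333.1"],
    host="hg38", additives="ERCC", viruses="KSHV",
)
print(regions["KSHV"]["host_additive_virus"])
```

The per-sequence dict (`sequences[s] = 1`) is what the downstream `_get_host_additive_virus` / `_get_regionname_from_seqname` helpers scan to map a BAM sequence name back to its region.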
% (seqname)) -def _get_regionname_from_seqname(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: + +def _get_regionname_from_seqname(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: return k else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) + def main(): # debug = True @@ -51,66 +56,113 @@ def main(): description="""Extracts PE BSJs from STAR2p output Chimeric BAM file. It also adds unique read group IDs to each read. This RID is of the format #### where the chrom, start and end represent the BSJ the read is depicting. - ## UPDATE: works for all BAM files ... not just BSJ only + ## UPDATE: works for all BAM files ... not just BSJ only """ ) - #INPUTs - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="Input BAM file") - parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1', - help='Sample Name: SM for RG') - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') - parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value') - parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') - parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list') - parser.add_argument('--prefix', dest='prefix', type=str, required=True, - help='outfile prefix ... 
like "linear" or "linear_spliced" etc.') - #OUTPUTs - parser.add_argument("--outdir",dest="outdir",required=False,type=str, - help="Output folder for the individual BAM files.") + # INPUTs + parser.add_argument( + "-i", "--inbam", dest="inbam", required=True, type=str, help="Input BAM file" + ) + parser.add_argument( + "-s", + "--sample_name", + dest="samplename", + type=str, + required=False, + default="sample1", + help="Sample Name: SM for RG", + ) + parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", + ) + parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value", + ) + parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", + ) + parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... comma-separated list", + ) + parser.add_argument( + "--prefix", + dest="prefix", + type=str, + required=True, + help='outfile prefix ... 
like "linear" or "linear_spliced" etc.', + ) + # OUTPUTs + parser.add_argument( + "--outdir", + dest="outdir", + required=False, + type=str, + help="Output folder for the individual BAM files.", + ) args = parser.parse_args() samfile = pysam.AlignmentFile(args.inbam, "rb") sequences = list() samheader = samfile.header.to_dict() - for v in samheader['SQ']: - sequences.append(v['SN']) - - seqname2regionname=dict() - hosts=set() - viruses=set() - - regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + for v in samheader["SQ"]: + sequences.append(v["SN"]) + + seqname2regionname = dict() + hosts = set() + viruses = set() + + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) + hav = _get_host_additive_virus(regions, s) if hav == "host": - hostname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=hostname + hostname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = hostname hosts.add(hostname) if hav == "virus": - virusname = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=virusname + virusname = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = virusname viruses.add(virusname) if hav == "additive": - additive = _get_regionname_from_seqname(regions,s) - seqname2regionname[s]=additive - + additive = _get_regionname_from_seqname(regions, s) + seqname2regionname[s] = additive + outputbams = dict() for h in hosts: - outbamname = os.path.join(args.outdir,args.samplename+"."+args.prefix+"."+h+".bam") - outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header = samheader) + outbamname = os.path.join( + args.outdir, args.samplename + "." + args.prefix + "." 
+ h + ".bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) for h in viruses: - outbamname = os.path.join(args.outdir,args.samplename+"."+args.prefix+"."+h+".bam") - outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header = samheader) - + outbamname = os.path.join( + args.outdir, args.samplename + "." + args.prefix + "." + h + ".bam" + ) + outputbams[h] = pysam.AlignmentFile(outbamname, "wb", header=samheader) + for read in samfile.fetch(): - chrom=read.reference_name - regionname=seqname2regionname[chrom] + chrom = read.reference_name + regionname = seqname2regionname[chrom] if regionname in hosts or regionname in viruses: outputbams[regionname].write(read) samfile.close() @@ -118,6 +170,5 @@ def main(): o.close() - if __name__ == "__main__": main() diff --git a/workflow/scripts/bam_to_bigwig.sh b/workflow/scripts/bam_to_bigwig.sh index 31c8b2f..9446f16 100755 --- a/workflow/scripts/bam_to_bigwig.sh +++ b/workflow/scripts/bam_to_bigwig.sh @@ -25,4 +25,4 @@ if [ "$(wc -l ${tmpdir}/${bdg}|awk '{print $1}')" != "0" ];then samtools view -H $bam | grep ^@SQ | cut -f2,3 | sed "s/SN://g" | sed "s/LN://g" > ${tmpdir}/${sizes} bedGraphToBigWig ${tmpdir}/${bdg} ${tmpdir}/${sizes} $bw fi -rm -f ${tmpdir}/${bdg} ${tmpdir}/${sizes} \ No newline at end of file +rm -f ${tmpdir}/${bdg} ${tmpdir}/${sizes} diff --git a/workflow/scripts/circExplorer_get_annotated_counts_per_sample.py b/workflow/scripts/circExplorer_get_annotated_counts_per_sample.py index c5ccb8c..60cb734 100755 --- a/workflow/scripts/circExplorer_get_annotated_counts_per_sample.py +++ b/workflow/scripts/circExplorer_get_annotated_counts_per_sample.py @@ -1,134 +1,274 @@ import argparse + class BSJ: - def __init__(self,chrom="",start=-1,end=-1,strand=".",known_novel="novel",read_count=-1,counted=-1): - self.chrom=chrom - self.start=start - self.end=end - self.strand=strand - self.known_novel=known_novel - self.read_count=read_count - self.counted=counted + def __init__( + self, + 
chrom="", + start=-1, + end=-1, + strand=".", + known_novel="novel", + read_count=-1, + counted=-1, + ): + self.chrom = chrom + self.start = start + self.end = end + self.strand = strand + self.known_novel = known_novel + self.read_count = read_count + self.counted = counted + def __str__(self): # id="##".join([self.chrom,str(self.start),str(self.end),self.strand]) - return "%s\t%d\t%d\t%s\t%d\t%s\n"%(self.chrom,self.start,self.end,self.strand,self.read_count,self.known_novel) - -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() + return "%s\t%d\t%d\t%s\t%d\t%s\n" % ( + self.chrom, + self.start, + self.end, + self.strand, + self.read_count, + self.known_novel, + ) + + +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. 
Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] + +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) + -def read_BSJs(filename,regions,host_min,host_max,virus_min,virus_max,known_novel="novel",counted=-1,threshold=0): - infile=open(filename,'r') - BSJdict=dict() +def read_BSJs( + filename, + regions, + host_min, + host_max, + virus_min, + virus_max, + known_novel="novel", + counted=-1, + threshold=0, +): + infile = open(filename, "r") + BSJdict = dict() for l in infile.readlines(): - l=l.strip().split("\t") - chrom=l[0] - start=int(l[1]) - end=int(l[2]) - strand=l[5] - circid="##".join([chrom,str(start),str(end)]) + l = l.strip().split("\t") + chrom = l[0] + start = int(l[1]) + end = int(l[2]) + strand = l[5] + circid = "##".join([chrom, str(start), str(end)]) # count=int(l[3].split("/")[1]) - count=int(l[3]) + count = int(l[3]) if count < threshold: continue - host_additive_virus=_get_host_additive_virus(regions=regions,seqname=chrom) + host_additive_virus = _get_host_additive_virus(regions=regions, seqname=chrom) # if host_additive_virus == "additive": continue - size = end-start + size = end - start if host_additive_virus == "host" or host_additive_virus == "additive": - if size < host_min: continue - if size > host_max: continue + if size < host_min: + continue + if size > host_max: + continue if host_additive_virus == "virus": - if size < virus_min : continue - if size > virus_max : continue - 
BSJdict[circid]=BSJ(chrom=chrom,start=start,end=end,strand=strand,known_novel=known_novel,read_count=count,counted=counted) - return(BSJdict) + if size < virus_min: + continue + if size > virus_max: + continue + BSJdict[circid] = BSJ( + chrom=chrom, + start=start, + end=end, + strand=strand, + known_novel=known_novel, + read_count=count, + counted=counted, + ) + return BSJdict + -parser = argparse.ArgumentParser(description='Create CircExplorer2 Per Sample Counts Table') +parser = argparse.ArgumentParser( + description="Create CircExplorer2 Per Sample Counts Table" +) # INPUTS -parser.add_argument('--back_spliced_bed', dest='bsb', type=str, required=True, - help='back_spliced.bed') -parser.add_argument('--back_spliced_min_reads', dest='back_spliced_min_reads', type=int, required=True, - help='back_spliced minimum read threshold') # in addition to "known" and "low-conf" circRNAs identified by circexplorer, we also include those found in back_spliced.bed file but not classified as known/low-conf only if the number of reads supporting the BSJ call is greater than this number -parser.add_argument('--circularRNA_known', dest='ck', type=str, required=True, - help='circularRNA_known.txt') -parser.add_argument('--low_conf', dest='lc', type=str, required=False, - help='low_conf.circularRNA_known.txt') -parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only') -parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') -parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... 
comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only') -parser.add_argument('--host_filter_min', dest='host_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for host') -parser.add_argument('--virus_filter_min', dest='virus_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for virus') -parser.add_argument('--host_filter_max', dest='host_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for host') -parser.add_argument('--virus_filter_max', dest='virus_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for virus') -parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') +parser.add_argument( + "--back_spliced_bed", dest="bsb", type=str, required=True, help="back_spliced.bed" +) +parser.add_argument( + "--back_spliced_min_reads", + dest="back_spliced_min_reads", + type=int, + required=True, + help="back_spliced minimum read threshold", +) # in addition to "known" and "low-conf" circRNAs identified by circexplorer, we also include those found in back_spliced.bed file but not classified as known/low-conf only if the number of reads supporting the BSJ call is greater than this number +parser.add_argument( + "--circularRNA_known", + dest="ck", + type=str, + required=True, + help="circularRNA_known.txt", +) +parser.add_argument( + "--low_conf", + dest="lc", + type=str, + required=False, + help="low_conf.circularRNA_known.txt", +) +parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only", +) +parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... 
all BSJs in this region are filtered out", +) +parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only", +) +parser.add_argument( + "--host_filter_min", + dest="host_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_min", + dest="virus_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for virus", +) +parser.add_argument( + "--host_filter_max", + dest="host_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_max", + dest="virus_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for virus", +) +parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. 
ref.fa.regions", +) # OUTPUTS -parser.add_argument('-o',dest='outfile',required=True,help='counts TSV table') +parser.add_argument("-o", dest="outfile", required=True, help="counts TSV table") args = parser.parse_args() -regions=read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) -o=open(args.outfile,'w') +regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, +) +o = open(args.outfile, "w") o.write("#chrom\tstart\tend\tstrand\tread_count\tknown_novel\n") -all_BSJs=read_BSJs(args.bsb,counted=0,threshold=args.back_spliced_min_reads,regions=regions,host_min=args.host_filter_min,host_max=args.host_filter_max,virus_min=args.virus_filter_min,virus_max=args.virus_filter_max) +all_BSJs = read_BSJs( + args.bsb, + counted=0, + threshold=args.back_spliced_min_reads, + regions=regions, + host_min=args.host_filter_min, + host_max=args.host_filter_max, + virus_min=args.virus_filter_min, + virus_max=args.virus_filter_max, +) -known_BSJs=read_BSJs(args.ck,known_novel="known",counted=0,threshold=args.back_spliced_min_reads,regions=regions,host_min=args.host_filter_min,host_max=args.host_filter_max,virus_min=args.virus_filter_min,virus_max=args.virus_filter_max) +known_BSJs = read_BSJs( + args.ck, + known_novel="known", + counted=0, + threshold=args.back_spliced_min_reads, + regions=regions, + host_min=args.host_filter_min, + host_max=args.host_filter_max, + virus_min=args.virus_filter_min, + virus_max=args.virus_filter_max, +) if args.lc: - low_conf_BSJs=read_BSJs(args.lc,known_novel="known",counted=0,threshold=args.back_spliced_min_reads,regions=regions,host_min=args.host_filter_min,host_max=args.host_filter_max,virus_min=args.virus_filter_min,virus_max=args.virus_filter_max) - for k,v in all_BSJs.items(): + low_conf_BSJs = read_BSJs( + args.lc, + known_novel="known", + counted=0, + threshold=args.back_spliced_min_reads, + regions=regions, + 
host_min=args.host_filter_min, + host_max=args.host_filter_max, + virus_min=args.virus_filter_min, + virus_max=args.virus_filter_max, + ) + for k, v in all_BSJs.items(): if k in low_conf_BSJs: - all_BSJs[k].known_novel="low_conf" - all_BSJs[k].strand=v.strand - all_BSJs[k].counted=1 - low_conf_BSJs[k].counted=1 + all_BSJs[k].known_novel = "low_conf" + all_BSJs[k].strand = v.strand + all_BSJs[k].counted = 1 + low_conf_BSJs[k].counted = 1 -for k,v in all_BSJs.items(): +for k, v in all_BSJs.items(): if k in known_BSJs: - all_BSJs[k].known_novel="known" - all_BSJs[k].strand=known_BSJs[k].strand - all_BSJs[k].counted=1 - known_BSJs[k].counted=1 + all_BSJs[k].known_novel = "known" + all_BSJs[k].strand = known_BSJs[k].strand + all_BSJs[k].counted = 1 + known_BSJs[k].counted = 1 o.write(str(all_BSJs[k])) -lst=[known_BSJs] +lst = [known_BSJs] if args.lc: lst.append(low_conf_BSJs) for l in lst: - for k,v in l.items(): - if l[k].counted!=1: + for k, v in l.items(): + if l[k].counted != 1: o.write(str(v)) o.close() diff --git a/workflow/scripts/create_circExplorer_linear_bam.py b/workflow/scripts/create_circExplorer_linear_bam.py index e3fb5f7..eaf956f 100755 --- a/workflow/scripts/create_circExplorer_linear_bam.py +++ b/workflow/scripts/create_circExplorer_linear_bam.py @@ -6,96 +6,102 @@ pp = pprint.PrettyPrinter(indent=4) -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() + +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - 
regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions + -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." 
% (seqname)) + class JUNCTION: - def __init__(self,jid,chrom="",start=-1,end=-1): - self.jid=jid - self.chrom=chrom - self.start=int(start) - self.end=int(end) - self.score=0 - self.rids=set() - self.refcoords=dict() - self.keeprids=set() - - def append_rid_refcoords(self,rid,coords): - if not rid in self.rids: self.refcoords[rid]=dict() + def __init__(self, jid, chrom="", start=-1, end=-1): + self.jid = jid + self.chrom = chrom + self.start = int(start) + self.end = int(end) + self.score = 0 + self.rids = set() + self.refcoords = dict() + self.keeprids = set() + + def append_rid_refcoords(self, rid, coords): + if not rid in self.rids: + self.refcoords[rid] = dict() self.rids.add(rid) for c in coords: - if not c in self.refcoords[rid]: self.refcoords[rid][c]=1 + if not c in self.refcoords[rid]: + self.refcoords[rid][c] = 1 - def append_keeprid(self,rid): + def append_keeprid(self, rid): self.keeprids.add(rid) - - def set_chrom_start_end(self,chrom,start,end): - self.chrom=chrom - self.start=int(start) - self.end=int(end) + + def set_chrom_start_end(self, chrom, start, end): + self.chrom = chrom + self.start = int(start) + self.end = int(end) + class BSJ: def __init__(self): - self.chrom="" - self.start="" - self.end="" - self.score=0 - self.name="." - self.strand="U" - self.bitids=set() - self.rids=set() - + self.chrom = "" + self.start = "" + self.end = "" + self.score = 0 + self.name = "." 
+ self.strand = "U" + self.bitids = set() + self.rids = set() + def plusone(self): - self.score+=1 - - def set_strand(self,strand): - self.strand=strand - - def set_chrom(self,chrom): - self.chrom=chrom - - def set_start(self,start): - self.start=start - - def set_end(self,end): - self.end=end - - def append_bitid(self,bitid): + self.score += 1 + + def set_strand(self, strand): + self.strand = strand + + def set_chrom(self, chrom): + self.chrom = chrom + + def set_start(self, start): + self.start = start + + def set_end(self, end): + self.end = end + + def append_bitid(self, bitid): self.bitids.add(bitid) - def append_rid(self,rid): + def append_rid(self, rid): self.rids.add(rid) - - def write_out_BSJ(self,outbed): - t=[] + + def write_out_BSJ(self, outbed): + t = [] t.append(self.chrom) t.append(str(self.start)) t.append(str(self.end)) @@ -104,145 +110,150 @@ def write_out_BSJ(self,outbed): t.append(self.strand) t.append(",".join(self.bitids)) t.append(",".join(self.rids)) - outbed.write("\t".join(t)+"\n") + outbed.write("\t".join(t) + "\n") - def update_score_and_found_count(self,junctions_found): + def update_score_and_found_count(self, junctions_found): self.score = len(self.rids) - jid = self.chrom + "##" + str(self.start) + "##" + str(int(self.end)-1) - junctions_found[jid]+=self.score + jid = self.chrom + "##" + str(self.start) + "##" + str(int(self.end) - 1) + junctions_found[jid] += self.score + - class Readinfo: - def __init__(self,readid,rname): - self.readid=readid - self.refname=rname - self.bitflags=list() - self.bitid="" - self.strand="." - self.start=-1 - self.end=-1 - self.refcoordinates=dict() - self.isread1=dict() - self.isreverse=dict() - self.issecondary=dict() - self.issupplementary=dict() - + def __init__(self, readid, rname): + self.readid = readid + self.refname = rname + self.bitflags = list() + self.bitid = "" + self.strand = "." 
+ self.start = -1 + self.end = -1 + self.refcoordinates = dict() + self.isread1 = dict() + self.isreverse = dict() + self.issecondary = dict() + self.issupplementary = dict() + def __str__(self): - s = "readid: %s"%(self.readid) - s = "%s\tbitflags: %s"%(s,self.bitflags) - s = "%s\tbitid: %s"%(s,self.bitid) + s = "readid: %s" % (self.readid) + s = "%s\tbitflags: %s" % (s, self.bitflags) + s = "%s\tbitid: %s" % (s, self.bitid) for bf in self.bitflags: - s = "%s\t%s\trefcoordinates: %s"%(s,bf,", ".join(list(map(lambda x:str(x),self.refcoordinates[bf])))) + s = "%s\t%s\trefcoordinates: %s" % ( + s, + bf, + ", ".join(list(map(lambda x: str(x), self.refcoordinates[bf]))), + ) return s - def set_refcoordinates(self,bitflag,refpos): - self.refcoordinates[bitflag]=refpos - - def set_read1_reverse_secondary_supplementary(self,bitflag,read): + def set_refcoordinates(self, bitflag, refpos): + self.refcoordinates[bitflag] = refpos + + def set_read1_reverse_secondary_supplementary(self, bitflag, read): if read.is_read1: - self.isread1[bitflag]="Y" + self.isread1[bitflag] = "Y" else: - self.isread1[bitflag]="N" + self.isread1[bitflag] = "N" if read.is_reverse: - self.isreverse[bitflag]="Y" + self.isreverse[bitflag] = "Y" else: - self.isreverse[bitflag]="N" + self.isreverse[bitflag] = "N" if read.is_secondary: - self.issecondary[bitflag]="Y" + self.issecondary[bitflag] = "Y" else: - self.issecondary[bitflag]="N" + self.issecondary[bitflag] = "N" if read.is_supplementary: - self.issupplementary[bitflag]="Y" + self.issupplementary[bitflag] = "Y" else: - self.issupplementary[bitflag]="N" - - def append_alignment(self,read): + self.issupplementary[bitflag] = "N" + + def append_alignment(self, read): self.alignments.append(read) - - def append_bitflag(self,bf): + + def append_bitflag(self, bf): self.bitflags.append(bf) - + # def extend_ref_positions(self,refcoords): # self.refcoordinates.extend(refcoords) - + def generate_bitid(self): - bitlist=sorted(self.bitflags) - 
self.bitid="##".join(list(map(lambda x:str(x),bitlist))) -# self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) - + bitlist = sorted(self.bitflags) + self.bitid = "##".join(list(map(lambda x: str(x), bitlist))) + + # self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) + def get_strand(self): - if self.bitid=="83##163##2129": - self.strand="+" - elif self.bitid=="339##419##2385": - self.strand="+" - elif self.bitid=="83##163##2209": - self.strand="+" - elif self.bitid=="339##419##2465": - self.strand="+" - elif self.bitid=="99##147##2193": - self.strand="-" - elif self.bitid=="355##403##2449": - self.strand="-" - elif self.bitid=="99##147##2145": - self.strand="-" - elif self.bitid=="355##403##2401": - self.strand="-" - elif self.bitid=="16##2064": - self.strand="+" - elif self.bitid=="272##2320": - self.strand="+" - elif self.bitid=="0##2048": - self.strand="-" - elif self.bitid=="256##2304": - self.strand="-" - elif self.bitid=="153##2201": - self.strand="-" + if self.bitid == "83##163##2129": + self.strand = "+" + elif self.bitid == "339##419##2385": + self.strand = "+" + elif self.bitid == "83##163##2209": + self.strand = "+" + elif self.bitid == "339##419##2465": + self.strand = "+" + elif self.bitid == "99##147##2193": + self.strand = "-" + elif self.bitid == "355##403##2449": + self.strand = "-" + elif self.bitid == "99##147##2145": + self.strand = "-" + elif self.bitid == "355##403##2401": + self.strand = "-" + elif self.bitid == "16##2064": + self.strand = "+" + elif self.bitid == "272##2320": + self.strand = "+" + elif self.bitid == "0##2048": + self.strand = "-" + elif self.bitid == "256##2304": + self.strand = "-" + elif self.bitid == "153##2201": + self.strand = "-" else: - self.strand="U" + self.strand = "U" - def validate_BSJ_read(self,junctions): + def validate_BSJ_read(self, junctions): """ Checks if read is truly a BSJ originitor. 
* Defines left, right and middle alignments * Left and right alignments should not overlap * Middle alignment should be between left and right alignments """ - if len(self.bitid.split("##"))==3: - left=-1 - right=-1 - middle=-1 - if self.bitid=="83##163##2129": - left=2129 - right=83 - middle=163 - if self.bitid=="339##419##2385": - left=2385 - right=339 - middle=419 - if self.bitid=="83##163##2209": - left=163 - right=2209 - middle=83 - if self.bitid=="339##419##2465": - left=419 - right=2465 - middle=339 - if self.bitid=="99##147##2145": - left=99 - right=2145 - middle=147 - if self.bitid=="355##403##2401": - left=355 - right=2401 - middle=403 - if self.bitid=="99##147##2193": - left=2193 - right=147 - middle=99 - if self.bitid=="355##403##2449": - left=2449 - right=403 - middle=355 + if len(self.bitid.split("##")) == 3: + left = -1 + right = -1 + middle = -1 + if self.bitid == "83##163##2129": + left = 2129 + right = 83 + middle = 163 + if self.bitid == "339##419##2385": + left = 2385 + right = 339 + middle = 419 + if self.bitid == "83##163##2209": + left = 163 + right = 2209 + middle = 83 + if self.bitid == "339##419##2465": + left = 419 + right = 2465 + middle = 339 + if self.bitid == "99##147##2145": + left = 99 + right = 2145 + middle = 147 + if self.bitid == "355##403##2401": + left = 355 + right = 2401 + middle = 403 + if self.bitid == "99##147##2193": + left = 2193 + right = 147 + middle = 99 + if self.bitid == "355##403##2449": + left = 2449 + right = 403 + middle = 355 # print(left,right,middle) if left == -1 or right == -1 or middle == -1: return False @@ -253,46 +264,46 @@ def validate_BSJ_read(self,junctions): # print("validate_BSJ_read",self.readid,self.refcoordinates[middle][0],self.refcoordinates[middle][-1]) leftmost = str(self.refcoordinates[left][0]) rightmost = str(self.refcoordinates[right][-1]) - possiblejid = chrom+"##"+leftmost+"##"+rightmost + possiblejid = chrom + "##" + leftmost + "##" + rightmost # 
print("validate_BSJ_read",self.readid,possiblejid) if possiblejid in junctions: self.start = leftmost - self.end = str(int(rightmost) + 1) # this will be added to the BED file + self.end = str(int(rightmost) + 1) # this will be added to the BED file return True else: return False - - - + def get_bsjid(self): - t=[] + t = [] t.append(self.refname) t.append(self.start) t.append(self.end) t.append(self.strand) return "##".join(t) - - def write_out_reads(self,outbam): + + def write_out_reads(self, outbam): for r in self.alignments: outbam.write(r) - - + + def get_uniq_readid(r): - rname=r.query_name - hi=r.get_tag("HI") - rid=rname+"##"+str(hi) + rname = r.query_name + hi = r.get_tag("HI") + rid = rname + "##" + str(hi) return rid + def get_bitflag(r): - bitflag=str(r).split("\t")[1] + bitflag = str(r).split("\t")[1] return int(bitflag) + def _bsjid2jid(bsjid): - x=bsjid.split("##") - chrom=x[0] - start=x[1] - end=str(int(x[2])-1) - return "##".join([chrom,start,end]) + x = bsjid.split("##") + chrom = x[0] + start = x[1] + end = str(int(x[2]) - 1) + return "##".join([chrom, start, end]) def main(): @@ -304,153 +315,279 @@ def main(): where the chrom, start and end represent the BSJ the read is depicting. 
""" ) - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="Input NON-Chimeric-only STAR2p BAM file") - parser.add_argument('-t','--sample_counts_table', dest='countstable', type=str, required=True, - help='circExplore per-sample counts table') # get coordinates of the circRNA - parser.add_argument("-s",'--sample_name', dest='samplename', type=str, required=False, default = 'sample1', - help='Sample Name: SM for RG') - parser.add_argument("-l",'--library', dest='library', type=str, required=False, default = 'lib1', - help='Sample Name: LB for RG') - parser.add_argument("-f",'--platform', dest='platform', type=str, required=False, default = 'illumina', - help='Sample Name: PL for RG') - parser.add_argument("-u",'--unit', dest='unit', type=str, required=False, default = 'unit1', - help='Sample Name: PU for RG') - parser.add_argument("-o","--outbam",dest="outbam",required=True,type=argparse.FileType('w'), - help="Output bam file ... both strands") - parser.add_argument("-p","--plusbam",dest="plusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("-m","--minusbam",dest="minusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("-b","--bed",dest="bed",required=True,type=argparse.FileType('w', encoding='UTF-8'), - help="Output BSJ bed file (with strand info)") - parser.add_argument("-j","--junctionsfound",dest="junctionsfound",required=True,type=argparse.FileType('w', encoding='UTF-8'), - help="Output TSV file with counts of junctions expected vs found") - parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') - parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... 
single value...host_filter_min/host_filter_max filters are applied to this region only') - parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') - parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only') - args = parser.parse_args() + parser.add_argument( + "-i", + "--inbam", + dest="inbam", + required=True, + type=str, + help="Input NON-Chimeric-only STAR2p BAM file", + ) + parser.add_argument( + "-t", + "--sample_counts_table", + dest="countstable", + type=str, + required=True, + help="circExplore per-sample counts table", + ) # get coordinates of the circRNA + parser.add_argument( + "-s", + "--sample_name", + dest="samplename", + type=str, + required=False, + default="sample1", + help="Sample Name: SM for RG", + ) + parser.add_argument( + "-l", + "--library", + dest="library", + type=str, + required=False, + default="lib1", + help="Sample Name: LB for RG", + ) + parser.add_argument( + "-f", + "--platform", + dest="platform", + type=str, + required=False, + default="illumina", + help="Sample Name: PL for RG", + ) + parser.add_argument( + "-u", + "--unit", + dest="unit", + type=str, + required=False, + default="unit1", + help="Sample Name: PU for RG", + ) + parser.add_argument( + "-o", + "--outbam", + dest="outbam", + required=True, + type=argparse.FileType("w"), + help="Output bam file ... 
both strands",
+    )
+    parser.add_argument(
+        "-p",
+        "--plusbam",
+        dest="plusbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output plus strand bam file",
+    )
+    parser.add_argument(
+        "-m",
+        "--minusbam",
+        dest="minusbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output minus strand bam file",
+    )
+    parser.add_argument(
+        "-b",
+        "--bed",
+        dest="bed",
+        required=True,
+        type=argparse.FileType("w", encoding="UTF-8"),
+        help="Output BSJ bed file (with strand info)",
+    )
+    parser.add_argument(
+        "-j",
+        "--junctionsfound",
+        dest="junctionsfound",
+        required=True,
+        type=argparse.FileType("w", encoding="UTF-8"),
+        help="Output TSV file with counts of junctions expected vs found",
+    )
+    parser.add_argument(
+        "--regions",
+        dest="regions",
+        type=str,
+        required=True,
+        help="regions file eg. ref.fa.regions",
+    )
+    parser.add_argument(
+        "--host",
+        dest="host",
+        type=str,
+        required=True,
+        help="host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only",
+    )
+    parser.add_argument(
+        "--additives",
+        dest="additives",
+        type=str,
+        required=True,
+        help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out",
+    )
+    parser.add_argument(
+        "--viruses",
+        dest="viruses",
+        type=str,
+        required=True,
+        help="virus name(s) eg.NC_009333.1... 
comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only",
+    )
+    args = parser.parse_args()
     samfile = pysam.AlignmentFile(args.inbam, "rb")
     samheader = samfile.header.to_dict()
-    samheader['RG']=list()
-#    bsjfile = open(args.bed,"w")
-    junctionsfile = open(args.countstable,'r')
-    junctions=dict()
-    junction_chroms=set()
+    samheader["RG"] = list()
+    #    bsjfile = open(args.bed,"w")
+    junctionsfile = open(args.countstable, "r")
+    junctions = dict()
+    junction_chroms = set()
     print("Reading...junctions!...")
     for l in junctionsfile.readlines():
-        if "read_count" in l: continue
+        if "read_count" in l:
+            continue
         l = l.strip().split("\t")
         chrom = l[0]
         junction_chroms.add(chrom)
         start = l[1]
-        end = str(int(l[2])-1)
-        jid = chrom+"##"+start+"##"+end # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching!
-        samheader['RG'].append({'ID':jid, 'LB':args.library, 'PL':args.platform, 'PU':args.unit,'SM':args.samplename})
-        junctions[jid] = JUNCTION(jid,chrom=chrom,start=start,end=end)
+        end = str(int(l[2]) - 1)
+        jid = (
+            chrom + "##" + start + "##" + end
+        )  # create a unique junction ID for each line in the BSJ junction file and make it the dict key ... easy for searching!
+        samheader["RG"].append(
+            {
+                "ID": jid,
+                "LB": args.library,
+                "PL": args.platform,
+                "PU": args.unit,
+                "SM": args.samplename,
+            }
+        )
+        junctions[jid] = JUNCTION(jid, chrom=chrom, start=start, end=end)
     junctionsfile.close()
     sequences = set()
-    for v in samheader['SQ']:
-        sequences.add(v['SN'])
+    for v in samheader["SQ"]:
+        sequences.add(v["SN"])
     # pp.pprint(junctions)
     # print(sequences)
     if not junction_chroms.issubset(sequences):
-        print("Junction file has junction on chromosome which are NOT part of the supplied BAM file!!!")
+        print(
+            "Junction file has junctions on chromosomes which are NOT part of the supplied BAM file!!!"
+ ) exit() - print("Done reading %d junctions."%(len(junctions))) + print("Done reading %d junctions." % (len(junctions))) print("Reading...regions file!...") host_virus_sequences = set() - regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) + regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, + ) for s in sequences: - hav = _get_host_additive_virus(regions,s) - if hav == "host": host_virus_sequences.add(s) - if hav == "virus": host_virus_sequences.add(s) + hav = _get_host_additive_virus(regions, s) + if hav == "host": + host_virus_sequences.add(s) + if hav == "virus": + host_virus_sequences.add(s) # print(host_virus_sequences) host_virus_sequences = host_virus_sequences.intersection(junction_chroms) # print(host_virus_sequences) - rid2jid=dict() - jid2rid=dict() - for jid,junc in junctions.items(): + rid2jid = dict() + jid2rid = dict() + for jid, junc in junctions.items(): # print(jid) - for read in samfile.fetch(junc.chrom,junc.start-2,junc.end+2): - if read.reference_id != read.next_reference_id: continue # only works for PE ... for SE read.next_reference_id is -1 - if ( not read.is_proper_pair ) or read.is_secondary or read.is_supplementary or read.is_unmapped : continue - rid=get_uniq_readid(read) - rid2jid[rid]=jid - if not jid in jid2rid: jid2rid[jid]=set() + for read in samfile.fetch(junc.chrom, junc.start - 2, junc.end + 2): + if read.reference_id != read.next_reference_id: + continue # only works for PE ... 
for SE read.next_reference_id is -1 + if ( + (not read.is_proper_pair) + or read.is_secondary + or read.is_supplementary + or read.is_unmapped + ): + continue + rid = get_uniq_readid(read) + rid2jid[rid] = jid + if not jid in jid2rid: + jid2rid[jid] = set() jid2rid[jid].add(rid) samfile.reset() - - outfile = pysam.AlignmentFile(args.outbam, "wb", header = samheader) + + outfile = pysam.AlignmentFile(args.outbam, "wb", header=samheader) for read in samfile.fetch(): - rid=get_uniq_readid(read) + rid = get_uniq_readid(read) if rid in rid2jid: read.set_tag("RG", jid, value_type="Z") outbam.write(read) outbam.close() samfile.close() args.junctionsfound.write("#chrom\tstart\tend\tfound_linear_reads\n") - for jid,junc in junctions.items(): - args.junctionsfound.write("%s\t%d\t%d\t%d\n"%(junc.chrom,junc.start,junc.end,len(jid2rid[jid]))) + for jid, junc in junctions.items(): + args.junctionsfound.write( + "%s\t%d\t%d\t%d\n" % (junc.chrom, junc.start, junc.end, len(jid2rid[jid])) + ) args.junctionsfound.close() exit() + # # print("rid",rid) + # # print("junctions[jid].rids",junctions[jid].rids) + # # print("junctions[jid].refcoords:") + # # pp.pprint(junctions[jid].refcoords) + # junctions[jid].append_rid_refcoords(rid,read.get_reference_positions()) + # print("junctions[jid].rids",junctions[jid].rids) + # print("junctions[jid].refcoords:") + # pp.pprint(junctions[jid].refcoords) + # for rid in junctions[jid].rids: + # print(rid) + # # if junc.start in junctions[jid].refcoords[rid] and junc.end in junctions[jid].refcoords[rid]: + # if junc.start in junctions[jid].refcoords[rid] or junc.end in junctions[jid].refcoords[rid]: + # junctions[jid].append_keeprid(rid) + # print(len(junctions[jid].rids)) + # print(len(junctions[jid].keeprids)) + # print(junctions[jid].keeprids) + # exit() - - # # print("rid",rid) - # # print("junctions[jid].rids",junctions[jid].rids) - # # print("junctions[jid].refcoords:") - # # pp.pprint(junctions[jid].refcoords) - # 
junctions[jid].append_rid_refcoords(rid,read.get_reference_positions()) - # print("junctions[jid].rids",junctions[jid].rids) - # print("junctions[jid].refcoords:") - # pp.pprint(junctions[jid].refcoords) - # for rid in junctions[jid].rids: - # print(rid) - # # if junc.start in junctions[jid].refcoords[rid] and junc.end in junctions[jid].refcoords[rid]: - # if junc.start in junctions[jid].refcoords[rid] or junc.end in junctions[jid].refcoords[rid]: - # junctions[jid].append_keeprid(rid) - # print(len(junctions[jid].rids)) - # print(len(junctions[jid].keeprids)) - # print(junctions[jid].keeprids) - # exit() - - - - - - - - bigdict=dict() + bigdict = dict() # print("Opening...") # print(args.inbam) print("Reading...alignments!...") - count=0 - count2=0 + count = 0 + count2 = 0 for read in samfile.fetch(): - count+=1 - if debug: print(read,read.reference_id,read.next_reference_id) - if read.reference_id != read.next_reference_id: continue # only works for PE ... for SE read.next_reference_id is -1 - count2+=1 - rid=get_uniq_readid(read) # add the HI number to the readid - if debug:print(rid) + count += 1 + if debug: + print(read, read.reference_id, read.next_reference_id) + if read.reference_id != read.next_reference_id: + continue # only works for PE ... for SE read.next_reference_id is -1 + count2 += 1 + rid = get_uniq_readid(read) # add the HI number to the readid + if debug: + print(rid) if not rid in bigdict: - bigdict[rid]=Readinfo(rid,read.reference_name) + bigdict[rid] = Readinfo(rid, read.reference_name) # bigdict[rid].append_alignment(read) # since rid has HI number included ... this separates alignment by HI - bitflag=get_bitflag(read) - if debug:print(bitflag) - bigdict[rid].append_bitflag(bitflag) # each rid can have upto 3 lines in the BAM with each having its own bitflag ... 
collect all bigflags in a list here
-        refpos=list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True)))
-        bigdict[rid].set_refcoordinates(bitflag,refpos) # maintain a list of reference coordinated that are "aligned" for each bitflag in each rid alignment
+        bitflag = get_bitflag(read)
+        if debug:
+            print(bitflag)
+        bigdict[rid].append_bitflag(
+            bitflag
+        )  # each rid can have up to 3 lines in the BAM with each having its own bitflag ... collect all bitflags in a list here
+        refpos = list(
+            filter(lambda x: x is not None, read.get_reference_positions(full_length=True))
+        )
+        bigdict[rid].set_refcoordinates(
+            bitflag, refpos
+        )  # maintain a list of reference coordinates that are "aligned" for each bitflag in each rid alignment
         # bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag,read)
-        if debug:print(bigdict[rid])
-        print("Done reading %d chimeric alignments. [%d same chrom chimeras]"%(count,count2))
+        if debug:
+            print(bigdict[rid])
+        print(
+            "Done reading %d chimeric alignments. 
[%d same chrom chimeras]" + % (count, count2) + ) # samfile.close() # print("Closed") # print("Reopening") @@ -460,49 +597,57 @@ def main(): samfile.reset() print("Writing BAMs") print("Re-Reading...alignments!...") - plusfile = pysam.AlignmentFile(args.plusbam, "wb", header = samheader) - minusfile = pysam.AlignmentFile(args.minusbam, "wb", header = samheader) - outfile = pysam.AlignmentFile(args.outbam, "wb", header = samheader) - bsjdict=dict() - bitid_counts=dict() + plusfile = pysam.AlignmentFile(args.plusbam, "wb", header=samheader) + minusfile = pysam.AlignmentFile(args.minusbam, "wb", header=samheader) + outfile = pysam.AlignmentFile(args.outbam, "wb", header=samheader) + bsjdict = dict() + bitid_counts = dict() for read in samfile.fetch(): - if read.reference_id != read.next_reference_id: continue - rid=get_uniq_readid(read) + if read.reference_id != read.next_reference_id: + continue + rid = get_uniq_readid(read) if rid in bigdict: - bigdict[rid].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... bitflags are pre-sorted - if debug:print(bigdict[rid]) - bigdict[rid].get_strand() # use the unique aggregated bitid to extract the strand information ... all possible cases are explicitly covered - if not bigdict[rid].validate_BSJ_read(junctions=junctions): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. Also add start and end to the BSJ object + bigdict[ + rid + ].generate_bitid() # separate all bitflags for the same rid with ## and create a unique single bitflag ... bitflags are pre-sorted + if debug: + print(bigdict[rid]) + bigdict[ + rid + ].get_strand() # use the unique aggregated bitid to extract the strand information ... 
all possible cases are explicitly covered + if not bigdict[rid].validate_BSJ_read( + junctions=junctions + ): # ensure that the read alignments leftmost and rightmost coordinates match with one of the BSJ junctions... if yes then that rid represents a BSJ. Also add start and end to the BSJ object continue # bigdict[rid].get_start_end() # print(bigdict[rid]) - bsjid=bigdict[rid].get_bsjid() - jid=_bsjid2jid(bsjid) + bsjid = bigdict[rid].get_bsjid() + jid = _bsjid2jid(bsjid) read.set_tag("RG", jid, value_type="Z") - if bigdict[rid].strand=="+": + if bigdict[rid].strand == "+": plusfile.write(read) - if bigdict[rid].strand=="-": + if bigdict[rid].strand == "-": minusfile.write(read) outfile.write(read) if not bsjid in bsjdict: - bsjdict[bsjid]=BSJ() + bsjdict[bsjid] = BSJ() bsjdict[bsjid].set_chrom(bigdict[rid].refname) bsjdict[bsjid].set_start(bigdict[rid].start) bsjdict[bsjid].set_end(bigdict[rid].end) bsjdict[bsjid].set_strand(bigdict[rid].strand) bsjdict[bsjid].append_bitid(bigdict[rid].bitid) if not bigdict[rid].bitid in bitid_counts: - bitid_counts[bigdict[rid].bitid]=0 - bitid_counts[bigdict[rid].bitid]+=1 + bitid_counts[bigdict[rid].bitid] = 0 + bitid_counts[bigdict[rid].bitid] += 1 bsjdict[bsjid].append_rid(rid) - print("Done!") + print("Done!") for b in bitid_counts.keys(): - print(b,bitid_counts[b]) + print(b, bitid_counts[b]) print("Writing BED") for bsjid in bsjdict.keys(): bsjdict[bsjid].update_score_and_found_count(junctions_found) bsjdict[bsjid].write_out_BSJ(args.bed) - + plusfile.close() minusfile.close() samfile.close() @@ -510,16 +655,17 @@ def main(): args.bed.close() args.junctionsfound.write("#chrom\tstart\tend\texpected_counts\tfound_counts\n") for jid in junctions.keys(): - x=jid.split("##") - chrom=x[0] - start=int(x[1]) - end=int(x[2])+1 - args.junctionsfound.write("%s\t%d\t%d\t%d\t%d\n"%(chrom,start,end,junctions[jid],junctions_found[jid])) + x = jid.split("##") + chrom = x[0] + start = int(x[1]) + end = int(x[2]) + 1 + 
args.junctionsfound.write( + "%s\t%d\t%d\t%d\t%d\n" + % (chrom, start, end, junctions[jid], junctions_found[jid]) + ) args.junctionsfound.close() print("ALL Done!") - + if __name__ == "__main__": main() - - diff --git a/workflow/scripts/create_circExplorer_per_sample_counts_table.py b/workflow/scripts/create_circExplorer_per_sample_counts_table.py index a60e321..50c04e9 100755 --- a/workflow/scripts/create_circExplorer_per_sample_counts_table.py +++ b/workflow/scripts/create_circExplorer_per_sample_counts_table.py @@ -16,41 +16,60 @@ # 11 linear_spliced_BSJ_reads_opposite_strand -def _df_setcol_as_int(df,collist): +def _df_setcol_as_int(df, collist): for c in collist: - df[[c]]=df[[c]].astype(int) + df[[c]] = df[[c]].astype(int) return df -def _df_setcol_as_str(df,collist): + +def _df_setcol_as_str(df, collist): for c in collist: - df[[c]]=df[[c]].astype(str) + df[[c]] = df[[c]].astype(str) return df + def main(): # debug = True debug = False - parser = argparse.ArgumentParser( + parser = argparse.ArgumentParser() + parser.add_argument( + "--annotationcounts", + dest="annotationcounts", + required=True, + type=str, + help="annotated_counts.tsv counts file", + ) + parser.add_argument( + "--allfoundcounts", + dest="allfoundcounts", + required=True, + type=str, + help="readcounts.tsv", + ) + parser.add_argument( + "--countstable", + dest="mergedcounts", + required=True, + type=str, + help="merged counts_table.tsv file", ) - parser.add_argument("--annotationcounts",dest="annotationcounts",required=True,type=str, - help="annotated_counts.tsv counts file") - parser.add_argument("--allfoundcounts",dest="allfoundcounts",required=True,type=str, - help="readcounts.tsv") - parser.add_argument("--countstable",dest="mergedcounts",required=True,type=str, - help="merged counts_table.tsv file") args = parser.parse_args() - bcounts = pandas.read_csv(args.annotationcounts,header=0,sep="\t") - lcounts = pandas.read_csv(args.allfoundcounts,header=0,sep="\t") + bcounts = 
pandas.read_csv(args.annotationcounts, header=0, sep="\t") + lcounts = pandas.read_csv(args.allfoundcounts, header=0, sep="\t") # print(bcounts.head()) # print(lcounts.head()) - mcounts = bcounts.merge(lcounts,how='outer',on=["#chrom","start","end","strand"]) - mcounts.fillna(value=0,inplace=True) - strcols = [ '#chrom', 'strand', 'known_novel' ] - intcols = list ( set(mcounts.columns) - set(strcols) ) - mcounts = _df_setcol_as_str(mcounts,strcols) - mcounts = _df_setcol_as_int(mcounts,intcols) - mcounts.drop(["read_count"],axis=1,inplace=True) - mcounts.to_csv(args.mergedcounts,index=False,doublequote=False,sep="\t") + mcounts = bcounts.merge( + lcounts, how="outer", on=["#chrom", "start", "end", "strand"] + ) + mcounts.fillna(value=0, inplace=True) + strcols = ["#chrom", "strand", "known_novel"] + intcols = list(set(mcounts.columns) - set(strcols)) + mcounts = _df_setcol_as_str(mcounts, strcols) + mcounts = _df_setcol_as_int(mcounts, intcols) + mcounts.drop(["read_count"], axis=1, inplace=True) + mcounts.to_csv(args.mergedcounts, index=False, doublequote=False, sep="\t") + if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/create_dcc_per_sample_counts_table.py b/workflow/scripts/create_dcc_per_sample_counts_table.py index b6af479..e791a33 100755 --- a/workflow/scripts/create_dcc_per_sample_counts_table.py +++ b/workflow/scripts/create_dcc_per_sample_counts_table.py @@ -1,21 +1,33 @@ import argparse import pandas -parser = argparse.ArgumentParser(description='Merge information from CircCoordinates and CircRNACount files generated by DCC') -parser.add_argument('--CircCoordinates', dest='CircCoordinates', type=str, required=True, - help='CircCoordinates file from DCC') -parser.add_argument('--CircRNALinearCount', dest='CircRNACount', type=str, required=True, - help='CircRNACount + LinearCount output file from DCC') +parser = argparse.ArgumentParser( + description="Merge information from CircCoordinates and 
CircRNACount files generated by DCC" +) +parser.add_argument( + "--CircCoordinates", + dest="CircCoordinates", + type=str, + required=True, + help="CircCoordinates file from DCC", +) +parser.add_argument( + "--CircRNALinearCount", + dest="CircRNACount", + type=str, + required=True, + help="CircRNACount + LinearCount output file from DCC", +) # parser.add_argument('--samplename', dest='samplename', type=str, required=True, # help='Sample Name') -parser.add_argument('-o',dest='outfile',required=True,help='merged table') +parser.add_argument("-o", dest="outfile", required=True, help="merged table") args = parser.parse_args() # sn=args.samplename # load files -CircCoordinates=pandas.read_csv(args.CircCoordinates,sep="\t",header=0) -CircRNACount=pandas.read_csv(args.CircRNACount,sep="\t",header=0) +CircCoordinates = pandas.read_csv(args.CircCoordinates, sep="\t", header=0) +CircRNACount = pandas.read_csv(args.CircRNACount, sep="\t", header=0) # CircRNACount columns are: # | # | ColName | @@ -53,37 +65,85 @@ # | 7 | Start-End | # | 8 | OverallRegion | -old_names = CircCoordinates.columns -new_names = ['chr', 'start', 'end', 'gene', 'junction_type', 'strand2', 'start_end_region', 'overall_region'] +old_names = CircCoordinates.columns +new_names = [ + "chr", + "start", + "end", + "gene", + "junction_type", + "strand2", + "start_end_region", + "overall_region", +] CircCoordinates.rename(columns=dict(zip(old_names, new_names)), inplace=True) -CircCoordinates[['junction_type']]=CircCoordinates[['junction_type']].astype(str) -CircCoordinates.loc[CircCoordinates['junction_type']=="0",'junction_type']="Non-canonical" -CircCoordinates.loc[CircCoordinates['junction_type']=="1",'junction_type']="GT/AG" -CircCoordinates.loc[CircCoordinates['junction_type']=="2",'junction_type']="CT/AC" -CircCoordinates.loc[CircCoordinates['junction_type']=="3",'junction_type']="GC/AG" -CircCoordinates.loc[CircCoordinates['junction_type']=="4",'junction_type']="CT/GC" 
-CircCoordinates.loc[CircCoordinates['junction_type']=="5",'junction_type']="AT/AC" -CircCoordinates.loc[CircCoordinates['junction_type']=="6",'junction_type']="GT/AT" +CircCoordinates[["junction_type"]] = CircCoordinates[["junction_type"]].astype(str) +CircCoordinates.loc[ + CircCoordinates["junction_type"] == "0", "junction_type" +] = "Non-canonical" +CircCoordinates.loc[CircCoordinates["junction_type"] == "1", "junction_type"] = "GT/AG" +CircCoordinates.loc[CircCoordinates["junction_type"] == "2", "junction_type"] = "CT/AC" +CircCoordinates.loc[CircCoordinates["junction_type"] == "3", "junction_type"] = "GC/AG" +CircCoordinates.loc[CircCoordinates["junction_type"] == "4", "junction_type"] = "CT/GC" +CircCoordinates.loc[CircCoordinates["junction_type"] == "5", "junction_type"] = "AT/AC" +CircCoordinates.loc[CircCoordinates["junction_type"] == "6", "junction_type"] = "GT/AT" # strand is flipped in CircCoordinates file ... flipping it back -CircCoordinates['strand']="." -CircCoordinates.loc[CircCoordinates['strand2']=="-",'strand']="+" -CircCoordinates.loc[CircCoordinates['strand2']=="+",'strand']="-" +CircCoordinates["strand"] = "." 
+CircCoordinates.loc[CircCoordinates["strand2"] == "-", "strand"] = "+" +CircCoordinates.loc[CircCoordinates["strand2"] == "+", "strand"] = "-" -CircCoordinates['dcc_annotation']=CircCoordinates['gene'].astype(str)+"##"+CircCoordinates['junction_type'].astype(str)+"##"+CircCoordinates['start_end_region'].astype(str) +CircCoordinates["dcc_annotation"] = ( + CircCoordinates["gene"].astype(str) + + "##" + + CircCoordinates["junction_type"].astype(str) + + "##" + + CircCoordinates["start_end_region"].astype(str) +) -CircCoordinates['circRNA_id']=CircCoordinates['chr'].astype(str)+"##"+CircCoordinates['start'].astype(str)+"##"+CircCoordinates['end'].astype(str)+"##"+CircCoordinates['strand'].astype(str) -CircCoordinates.drop(['chr', 'start', 'end', 'strand', 'strand2', 'gene','junction_type','start_end_region','overall_region'],axis=1,inplace=True) -CircCoordinates.set_index(['circRNA_id'],inplace=True) +CircCoordinates["circRNA_id"] = ( + CircCoordinates["chr"].astype(str) + + "##" + + CircCoordinates["start"].astype(str) + + "##" + + CircCoordinates["end"].astype(str) + + "##" + + CircCoordinates["strand"].astype(str) +) +CircCoordinates.drop( + [ + "chr", + "start", + "end", + "strand", + "strand2", + "gene", + "junction_type", + "start_end_region", + "overall_region", + ], + axis=1, + inplace=True, +) +CircCoordinates.set_index(["circRNA_id"], inplace=True) # CircCoordinates.to_csv("tmp",sep="\t",header=True,index=True) -old_names = CircRNACount.columns -new_names = ['chr', 'start', 'end', 'strand', 'read_count', 'linear_read_count'] +old_names = CircRNACount.columns +new_names = ["chr", "start", "end", "strand", "read_count", "linear_read_count"] CircRNACount.rename(columns=dict(zip(old_names, new_names)), inplace=True) -CircRNACount['circRNA_id']=CircRNACount['chr'].astype(str)+"##"+CircRNACount['start'].astype(str)+"##"+CircRNACount['end'].astype(str)+"##"+CircRNACount['strand'].astype(str) -CircRNACount.set_index(['circRNA_id'],inplace=True) 
+CircRNACount["circRNA_id"] = ( + CircRNACount["chr"].astype(str) + + "##" + + CircRNACount["start"].astype(str) + + "##" + + CircRNACount["end"].astype(str) + + "##" + + CircRNACount["strand"].astype(str) +) +CircRNACount.set_index(["circRNA_id"], inplace=True) # CircRNACount.to_csv("tmp2",sep="\t",header=True,index=True) -CircRNACount=CircRNACount.merge(CircCoordinates,left_index=True,right_index=True,how="left",sort=False) -CircRNACount.fillna("0",inplace=True) -CircRNACount.to_csv(args.outfile,sep="\t",header=True,index=False) - +CircRNACount = CircRNACount.merge( + CircCoordinates, left_index=True, right_index=True, how="left", sort=False +) +CircRNACount.fillna("0", inplace=True) +CircRNACount.to_csv(args.outfile, sep="\t", header=True, index=False) diff --git a/workflow/scripts/create_mapsplice_per_sample_counts_table.py b/workflow/scripts/create_mapsplice_per_sample_counts_table.py index 0af0b79..487b7da 100755 --- a/workflow/scripts/create_mapsplice_per_sample_counts_table.py +++ b/workflow/scripts/create_mapsplice_per_sample_counts_table.py @@ -3,29 +3,87 @@ # pandas.options.mode.chained_assignment = None -parser = argparse.ArgumentParser(description='Create per sample Counts Table from MapSplice Outputs') -parser.add_argument('--circularRNAstxt', dest='circularRNAstxt', type=str, required=True, - help='circular_RNAs.txt file from MapSplice') -parser.add_argument('--back_spliced_min_reads', dest='back_spliced_min_reads', type=int, required=True, - help='back_spliced minimum read threshold') -parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only') -parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... 
all BSJs in this region are filtered out') -parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only') -parser.add_argument('--host_filter_min', dest='host_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for host') -parser.add_argument('--virus_filter_min', dest='virus_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for virus') -parser.add_argument('--host_filter_max', dest='host_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for host') -parser.add_argument('--virus_filter_max', dest='virus_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for virus') -parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') -parser.add_argument('-o',dest='outfile',required=True,help='output table') -parser.add_argument('-fo',dest='filteredoutfile',required=True,help='filtered output table') +parser = argparse.ArgumentParser( + description="Create per sample Counts Table from MapSplice Outputs" +) +parser.add_argument( + "--circularRNAstxt", + dest="circularRNAstxt", + type=str, + required=True, + help="circular_RNAs.txt file from MapSplice", +) +parser.add_argument( + "--back_spliced_min_reads", + dest="back_spliced_min_reads", + type=int, + required=True, + help="back_spliced minimum read threshold", +) +parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only", +) +parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... 
all BSJs in this region are filtered out", +) +parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only", +) +parser.add_argument( + "--host_filter_min", + dest="host_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_min", + dest="virus_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for virus", +) +parser.add_argument( + "--host_filter_max", + dest="host_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_max", + dest="virus_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for virus", +) +parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. ref.fa.regions", +) +parser.add_argument("-o", dest="outfile", required=True, help="output table") +parser.add_argument( + "-fo", dest="filteredoutfile", required=True, help="filtered output table" +) args = parser.parse_args() # sn=args.samplename @@ -38,7 +96,7 @@ regions["host"] = list() regions["additive"] = list() regions["virus"] = list() -r = open(args.regions,'r') +r = open(args.regions, "r") rlines = r.readlines() r.close() allseqs = list() @@ -54,70 +112,214 @@ host_additive_virus = "virus" regions[host_additive_virus].extend(seq) allseqs.extend(seq) -regions["additive"] = list((set(allseqs)-set(regions["host"]))-set(regions["virus"])) +regions["additive"] = list( + (set(allseqs) - set(regions["host"])) - set(regions["virus"]) +) # load files -circularRNAstxt=pandas.read_csv(args.circularRNAstxt,sep="\t",header=None) +circularRNAstxt = pandas.read_csv(args.circularRNAstxt, sep="\t", header=None) # file has no column lables ... 
add them # ref: https://github.com/Aufiero/circRNAprofiler/blob/master/R/importFilesPredictionTool.R -circularRNAstxt.columns=["chrom", "donor_end", "acceptor_start", "id", "coverage", "strand", "rgb", "block_count", "block_size", "block_distance", "entropy", "flank_case", "flank_string", "min_mismatch", "max_mismatch", "ave_mismatch", "max_min_suffix", "max_min_prefix", "min_anchor_difference", "unique_read_count", "multi_read_count", "paired_read_count", "left_paired_read_count", "right_paired_read_count", "multiple_paired_read_count", "unique_paired_read_count", "single_read_count", "encompassing_read", "doner_start", "acceptor_end", "doner_iosforms", "acceptor_isoforms", "obsolete1", "obsolete2", "obsolete3", "obsolete4", "minimal_doner_isoform_length", "maximal_doner_isoform_length", "minimal_acceptor_isoform_length", "maximal_acceptor_isoform_length", "paired_reads_entropy", "mismatch_per_bp", "anchor_score", "max_doner_fragment", "max_acceptor_fragment", "max_cur_fragment", "min_cur_fragment", "ave_cur_fragment", "doner_encompass_unique", "doner_encompass_multiple", "acceptor_encompass_unique", "acceptor_encompass_multiple", "doner_match_to_normal", "acceptor_match_to_normal", "doner_seq", "acceptor_seq", "match_gene_strand", "annotated_type", "fusion_type", "gene_strand", "annotated_gene_donor", "annotated_gene_acceptor", "dummy"] +circularRNAstxt.columns = [ + "chrom", + "donor_end", + "acceptor_start", + "id", + "coverage", + "strand", + "rgb", + "block_count", + "block_size", + "block_distance", + "entropy", + "flank_case", + "flank_string", + "min_mismatch", + "max_mismatch", + "ave_mismatch", + "max_min_suffix", + "max_min_prefix", + "min_anchor_difference", + "unique_read_count", + "multi_read_count", + "paired_read_count", + "left_paired_read_count", + "right_paired_read_count", + "multiple_paired_read_count", + "unique_paired_read_count", + "single_read_count", + "encompassing_read", + "doner_start", + "acceptor_end", + "doner_iosforms", + 
"acceptor_isoforms", + "obsolete1", + "obsolete2", + "obsolete3", + "obsolete4", + "minimal_doner_isoform_length", + "maximal_doner_isoform_length", + "minimal_acceptor_isoform_length", + "maximal_acceptor_isoform_length", + "paired_reads_entropy", + "mismatch_per_bp", + "anchor_score", + "max_doner_fragment", + "max_acceptor_fragment", + "max_cur_fragment", + "min_cur_fragment", + "ave_cur_fragment", + "doner_encompass_unique", + "doner_encompass_multiple", + "acceptor_encompass_unique", + "acceptor_encompass_multiple", + "doner_match_to_normal", + "acceptor_match_to_normal", + "doner_seq", + "acceptor_seq", + "match_gene_strand", + "annotated_type", + "fusion_type", + "gene_strand", + "annotated_gene_donor", + "annotated_gene_acceptor", + "dummy", +] # 'chrom' is in the format 'donor_chr~acceptor_chr' ... hence needs to be split -circularRNAstxt[['Donor', 'Acceptor']] = circularRNAstxt['chrom'].str.split('~', expand=True) +circularRNAstxt[["Donor", "Acceptor"]] = circularRNAstxt["chrom"].str.split( + "~", expand=True +) # only select rows with ++ or -- strand -circularRNAstxtnew = pandas.concat([circularRNAstxt[circularRNAstxt['strand'] == '++' ],circularRNAstxt[circularRNAstxt['strand'] == '--' ]],ignore_index=True,sort=False) +circularRNAstxtnew = pandas.concat( + [ + circularRNAstxt[circularRNAstxt["strand"] == "++"], + circularRNAstxt[circularRNAstxt["strand"] == "--"], + ], + ignore_index=True, + sort=False, +) # strand is either ++ or -- .. 
needs to be fixed to + or - -circularRNAstxtnew.replace('++','+',inplace=True) -circularRNAstxtnew.replace('--','-',inplace=True) +circularRNAstxtnew.replace("++", "+", inplace=True) +circularRNAstxtnew.replace("--", "-", inplace=True) # subset columns and rename them and fix start/end order -circularRNAstxtnew=circularRNAstxtnew[['Acceptor', 'donor_end', 'acceptor_start', 'strand', 'coverage', 'fusion_type', 'entropy']] +circularRNAstxtnew = circularRNAstxtnew[ + [ + "Acceptor", + "donor_end", + "acceptor_start", + "strand", + "coverage", + "fusion_type", + "entropy", + ] +] -plus_strand = circularRNAstxtnew[circularRNAstxtnew['strand']=='+'] -plus_strand.columns = ['chrom','end','start','strand','read_count','fusion_type', 'entropy'] # start and end need to be switched! +plus_strand = circularRNAstxtnew[circularRNAstxtnew["strand"] == "+"] +plus_strand.columns = [ + "chrom", + "end", + "start", + "strand", + "read_count", + "fusion_type", + "entropy", +] # start and end need to be switched! 
-minus_strand = circularRNAstxtnew[circularRNAstxtnew['strand']=='-'] -minus_strand.columns = ['chrom','start','end','strand','read_count','fusion_type', 'entropy'] +minus_strand = circularRNAstxtnew[circularRNAstxtnew["strand"] == "-"] +minus_strand.columns = [ + "chrom", + "start", + "end", + "strand", + "read_count", + "fusion_type", + "entropy", +] -circularRNAstxtnew = pandas.concat([plus_strand,minus_strand],ignore_index=True,sort=False) +circularRNAstxtnew = pandas.concat( + [plus_strand, minus_strand], ignore_index=True, sort=False +) # circularRNAstxtnew.columns=['chrom','start','end','strand','read_count','fusion_type', 'entropy'] # create mapsplice_annotation column to include "fusion_type" along with "entropy" -circularRNAstxtnew['mapsplice_annotation']=circularRNAstxtnew['fusion_type'].astype(str)+"##"+circularRNAstxtnew['entropy'].astype(str) -circularRNAstxtnew.drop(['fusion_type', 'entropy'],axis=1,inplace=True) -circularRNAstxtnew.fillna(value="-11",inplace=True) -circularRNAstxtnew = circularRNAstxtnew.astype({"chrom": str, "start": int, "end": int, "strand": str, "read_count": int, "mapsplice_annotation": str}) +circularRNAstxtnew["mapsplice_annotation"] = ( + circularRNAstxtnew["fusion_type"].astype(str) + + "##" + + circularRNAstxtnew["entropy"].astype(str) +) +circularRNAstxtnew.drop(["fusion_type", "entropy"], axis=1, inplace=True) +circularRNAstxtnew.fillna(value="-11", inplace=True) +circularRNAstxtnew = circularRNAstxtnew.astype( + { + "chrom": str, + "start": int, + "end": int, + "strand": str, + "read_count": int, + "mapsplice_annotation": str, + } +) # create index -circularRNAstxtnew['circRNA_id']=circularRNAstxtnew['chrom'].astype(str)+"##"+circularRNAstxtnew['start'].astype(str)+"##"+circularRNAstxtnew['end'].astype(str)+"##"+circularRNAstxtnew['strand'].astype(str) -circularRNAstxtnew.set_index(['circRNA_id'],inplace=True) +circularRNAstxtnew["circRNA_id"] = ( + circularRNAstxtnew["chrom"].astype(str) + + "##" + + 
circularRNAstxtnew["start"].astype(str) + + "##" + + circularRNAstxtnew["end"].astype(str) + + "##" + + circularRNAstxtnew["strand"].astype(str) +) +circularRNAstxtnew.set_index(["circRNA_id"], inplace=True) # sort and write out -circularRNAstxtnew.sort_values(by=['chrom','start'],inplace=True) -circularRNAstxtnew.to_csv(args.outfile,sep="\t",header=True,index=False) +circularRNAstxtnew.sort_values(by=["chrom", "start"], inplace=True) +circularRNAstxtnew.to_csv(args.outfile, sep="\t", header=True, index=False) -# filter +# filter # nreads filter -circularRNAstxtnew = circularRNAstxtnew[~circularRNAstxtnew["chrom"].isin(regions["additive"])] -circularRNAstxtnew = circularRNAstxtnew[circularRNAstxtnew["read_count"] >= args.back_spliced_min_reads] +circularRNAstxtnew = circularRNAstxtnew[ + ~circularRNAstxtnew["chrom"].isin(regions["additive"]) +] +circularRNAstxtnew = circularRNAstxtnew[ + circularRNAstxtnew["read_count"] >= args.back_spliced_min_reads +] # host distance/size filter -circularRNAstxtnew_host = circularRNAstxtnew[circularRNAstxtnew["chrom"].isin(regions["host"])] -circularRNAstxtnew_host["dist"] = abs(circularRNAstxtnew_host["start"] - circularRNAstxtnew_host["end"]) -circularRNAstxtnew_host = circularRNAstxtnew_host[circularRNAstxtnew_host["dist"] > args.host_filter_min] -circularRNAstxtnew_host = circularRNAstxtnew_host[circularRNAstxtnew_host["dist"] < args.host_filter_max] -circularRNAstxtnew_host.drop(["dist"],axis=1,inplace=True) +circularRNAstxtnew_host = circularRNAstxtnew[ + circularRNAstxtnew["chrom"].isin(regions["host"]) +] +circularRNAstxtnew_host["dist"] = abs( + circularRNAstxtnew_host["start"] - circularRNAstxtnew_host["end"] +) +circularRNAstxtnew_host = circularRNAstxtnew_host[ + circularRNAstxtnew_host["dist"] > args.host_filter_min +] +circularRNAstxtnew_host = circularRNAstxtnew_host[ + circularRNAstxtnew_host["dist"] < args.host_filter_max +] +circularRNAstxtnew_host.drop(["dist"], axis=1, inplace=True) # virus distance/size 
filter -circularRNAstxtnew_virus = circularRNAstxtnew[circularRNAstxtnew["chrom"].isin(regions["virus"])] -circularRNAstxtnew_virus["dist"] = abs(circularRNAstxtnew_virus["start"] - circularRNAstxtnew_virus["end"]) -circularRNAstxtnew_virus = circularRNAstxtnew_virus[circularRNAstxtnew_virus["dist"] > args.virus_filter_min] -circularRNAstxtnew_virus = circularRNAstxtnew_virus[circularRNAstxtnew_virus["dist"] < args.virus_filter_max] -circularRNAstxtnew_virus.drop(["dist"],axis=1,inplace=True) +circularRNAstxtnew_virus = circularRNAstxtnew[ + circularRNAstxtnew["chrom"].isin(regions["virus"]) +] +circularRNAstxtnew_virus["dist"] = abs( + circularRNAstxtnew_virus["start"] - circularRNAstxtnew_virus["end"] +) +circularRNAstxtnew_virus = circularRNAstxtnew_virus[ + circularRNAstxtnew_virus["dist"] > args.virus_filter_min +] +circularRNAstxtnew_virus = circularRNAstxtnew_virus[ + circularRNAstxtnew_virus["dist"] < args.virus_filter_max +] +circularRNAstxtnew_virus.drop(["dist"], axis=1, inplace=True) -circularRNAstxtnew = pandas.concat([circularRNAstxtnew_host,circularRNAstxtnew_virus]) +circularRNAstxtnew = pandas.concat([circularRNAstxtnew_host, circularRNAstxtnew_virus]) # sort and write out -circularRNAstxtnew.sort_values(by=['chrom','start'],inplace=True) -circularRNAstxtnew.to_csv(args.filteredoutfile,sep="\t",header=True,index=False) \ No newline at end of file +circularRNAstxtnew.sort_values(by=["chrom", "start"], inplace=True) +circularRNAstxtnew.to_csv(args.filteredoutfile, sep="\t", header=True, index=False) diff --git a/workflow/scripts/create_nclscan_per_sample_counts_table.py b/workflow/scripts/create_nclscan_per_sample_counts_table.py index 91560e6..1fca24b 100755 --- a/workflow/scripts/create_nclscan_per_sample_counts_table.py +++ b/workflow/scripts/create_nclscan_per_sample_counts_table.py @@ -1,39 +1,99 @@ import argparse import pandas + def _annotation_int2str(i): - if i==0: + if i == 0: return "Intergenic" - elif i==1: + elif i == 1: return 
"Intragenic" else: return "Unknown" + # pandas.options.mode.chained_assignment = None -parser = argparse.ArgumentParser(description='Create per sample Counts Table from NCLscan Outputs') -parser.add_argument('--result', dest='resultsfile', type=str, required=True, - help='.result file from NCLscan') -parser.add_argument('--back_spliced_min_reads', dest='back_spliced_min_reads', type=int, required=True, - help='back_spliced minimum read threshold') -parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only') -parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') -parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only') -parser.add_argument('--host_filter_min', dest='host_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for host') -parser.add_argument('--virus_filter_min', dest='virus_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for virus') -parser.add_argument('--host_filter_max', dest='host_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for host') -parser.add_argument('--virus_filter_max', dest='virus_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for virus') -parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. 
ref.fa.regions') -parser.add_argument('-o',dest='outfile',required=True,help='output table') -parser.add_argument('-fo',dest='filteredoutfile',required=True,help='filtered output table') +parser = argparse.ArgumentParser( + description="Create per sample Counts Table from NCLscan Outputs" +) +parser.add_argument( + "--result", + dest="resultsfile", + type=str, + required=True, + help=".result file from NCLscan", +) +parser.add_argument( + "--back_spliced_min_reads", + dest="back_spliced_min_reads", + type=int, + required=True, + help="back_spliced minimum read threshold", +) +parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only", +) +parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", +) +parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only", +) +parser.add_argument( + "--host_filter_min", + dest="host_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_min", + dest="virus_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for virus", +) +parser.add_argument( + "--host_filter_max", + dest="host_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_max", + dest="virus_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for virus", +) +parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. 
ref.fa.regions", +) +parser.add_argument("-o", dest="outfile", required=True, help="output table") +parser.add_argument( + "-fo", dest="filteredoutfile", required=True, help="filtered output table" +) args = parser.parse_args() # sn=args.samplename @@ -46,7 +106,7 @@ def _annotation_int2str(i): regions["host"] = list() regions["additive"] = list() regions["virus"] = list() -r = open(args.regions,'r') +r = open(args.regions, "r") rlines = r.readlines() r.close() allseqs = list() @@ -62,17 +122,18 @@ def _annotation_int2str(i): host_additive_virus = "virus" regions[host_additive_virus].extend(seq) allseqs.extend(seq) -regions["additive"] = list((set(allseqs)-set(regions["host"]))-set(regions["virus"])) - +regions["additive"] = list( + (set(allseqs) - set(regions["host"])) - set(regions["virus"]) +) # load files -resultsfile=pandas.read_csv(args.resultsfile,sep="\t",header=None) +resultsfile = pandas.read_csv(args.resultsfile, sep="\t", header=None) # file has no column lables ... add them ... 
the file format is: # ref: https://github.com/TreesLab/NCLscan # | # | Description | ColName -# |----|--------------------------------------------|------------ +# |----|--------------------------------------------|------------ # | 1 | Chromosome name of the donor side (5'ss) | chrd # | 2 | Junction coordinate of the donor side | coordd # | 3 | Strand of the donor side | strandd @@ -86,33 +147,81 @@ def _annotation_int2str(i): # | 11 | Total number of junc-reads | jreads # | 12 | Total number of span-reads | sreads -resultsfile.columns=["chrd", "coordd", "strandd", "chra", "coorda", "stranda", "gened", "genea", "case", "reads", "jreads", "sreads"] +resultsfile.columns = [ + "chrd", + "coordd", + "strandd", + "chra", + "coorda", + "stranda", + "gened", + "genea", + "case", + "reads", + "jreads", + "sreads", +] resultsfile = resultsfile[resultsfile["chrd"] == resultsfile["chra"]] resultsfile = resultsfile[resultsfile["strandd"] == resultsfile["stranda"]] -plus_strand = resultsfile[resultsfile['strandd']=='+'] -plus_strand = plus_strand[["chrd", "coorda", "coordd", "strandd", "reads", "case"]] # start and end need to be switched! -plus_strand.columns = ['chrom','end','start','strand','read_count', 'nclscan_annotation'] - -minus_strand = resultsfile[resultsfile['strandd']=='+'] -minus_strand = minus_strand[["chrd", "coordd", "coorda", "strandd", "reads", "case"]] -minus_strand.columns = ['chrom','end','start','strand','read_count', 'nclscan_annotation'] - -outdf = pandas.concat([plus_strand,minus_strand],ignore_index=True,sort=False) -outdf["nclscan_annotation"] = outdf["nclscan_annotation"] + 1 #change 1 to 2 and 0 to 1 ... as 0 is for no annotation +plus_strand = resultsfile[resultsfile["strandd"] == "+"] +plus_strand = plus_strand[ + ["chrd", "coorda", "coordd", "strandd", "reads", "case"] +] # start and end need to be switched! 
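The relabel-in-swapped-order idiom used here (select columns in one order, then assign names so start/end exchange places without an explicit swap) and the `##`-delimited composite key built further down can be sketched with a hypothetical one-row `.result` frame:

```python
import pandas as pd

# Hypothetical NCLscan-style row; coordd/coorda are the donor/acceptor
# junction coordinates from the .result file.
res = pd.DataFrame({"chrd": ["chr1"], "coordd": [500], "coorda": [100],
                    "strandd": ["+"], "reads": [7], "case": [1]})

# select (chrd, coorda, coordd, ...) but name them (chrom, end, start, ...):
# coorda becomes 'end' and coordd becomes 'start' purely by position
plus = res[["chrd", "coorda", "coordd", "strandd", "reads", "case"]].copy()
plus.columns = ["chrom", "end", "start", "strand", "read_count", "nclscan_annotation"]

# the '##'-delimited composite key, as built later in the script
plus["circRNA_id"] = (plus["chrom"].astype(str) + "##" + plus["start"].astype(str)
                      + "##" + plus["end"].astype(str) + "##" + plus["strand"].astype(str))
```

Because the swap is positional, any future reordering of the selected columns silently changes which field is labeled `start`; an explicit `rename(columns=...)` mapping would be more robust.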
+plus_strand.columns = [ + "chrom", + "end", + "start", + "strand", + "read_count", + "nclscan_annotation", +] + +minus_strand = resultsfile[resultsfile["strandd"] == "-"] +minus_strand = minus_strand[["chrd", "coordd", "coorda", "strandd", "reads", "case"]] +minus_strand.columns = [ + "chrom", + "start", + "end", + "strand", + "read_count", + "nclscan_annotation", +] + +outdf = pandas.concat([plus_strand, minus_strand], ignore_index=True, sort=False) +outdf["nclscan_annotation"] = ( + outdf["nclscan_annotation"] + 1 +) # change 1 to 2 and 0 to 1 ... as 0 is for no annotation outdf["nclscan_annotation"] = outdf["nclscan_annotation"].apply(_annotation_int2str) -outdf = outdf.astype({"chrom": str, "start": int, "end": int, "strand": str, "read_count": int, "nclscan_annotation": str}) +outdf = outdf.astype( + { + "chrom": str, + "start": int, + "end": int, + "strand": str, + "read_count": int, + "nclscan_annotation": str, + } +) # create index -outdf['circRNA_id']=outdf['chrom'].astype(str)+"##"+outdf['start'].astype(str)+"##"+outdf['end'].astype(str)+"##"+outdf['strand'].astype(str) -outdf.set_index(['circRNA_id'],inplace=True) +outdf["circRNA_id"] = ( + outdf["chrom"].astype(str) + + "##" + + outdf["start"].astype(str) + + "##" + + outdf["end"].astype(str) + + "##" + + outdf["strand"].astype(str) +) +outdf.set_index(["circRNA_id"], inplace=True) # sort and write out -outdf.sort_values(by=['chrom','start'],inplace=True) -outdf.to_csv(args.outfile,sep="\t",header=True,index=False) +outdf.sort_values(by=["chrom", "start"], inplace=True) +outdf.to_csv(args.outfile, sep="\t", header=True, index=False) -# filter +# filter # nreads filter outdf = outdf[~outdf["chrom"].isin(regions["additive"])] outdf = outdf[outdf["read_count"] >= args.back_spliced_min_reads] @@ -122,16 +231,16 @@ def _annotation_int2str(i): outdf_host["dist"] = abs(outdf_host["start"] - outdf_host["end"]) outdf_host = outdf_host[outdf_host["dist"] > args.host_filter_min] outdf_host = 
outdf_host[outdf_host["dist"] < args.host_filter_max] -outdf_host.drop(["dist"],axis=1,inplace=True) +outdf_host.drop(["dist"], axis=1, inplace=True) # virus distance/size filter outdf_virus = outdf[outdf["chrom"].isin(regions["virus"])] outdf_virus["dist"] = abs(outdf_virus["start"] - outdf_virus["end"]) outdf_virus = outdf_virus[outdf_virus["dist"] > args.virus_filter_min] outdf_virus = outdf_virus[outdf_virus["dist"] < args.virus_filter_max] -outdf_virus.drop(["dist"],axis=1,inplace=True) +outdf_virus.drop(["dist"], axis=1, inplace=True) -outdf = pandas.concat([outdf_host,outdf_virus]) +outdf = pandas.concat([outdf_host, outdf_virus]) # sort and write out -outdf.sort_values(by=['chrom','start'],inplace=True) -outdf.to_csv(args.filteredoutfile,sep="\t",header=True,index=False) +outdf.sort_values(by=["chrom", "start"], inplace=True) +outdf.to_csv(args.filteredoutfile, sep="\t", header=True, index=False) diff --git a/workflow/scripts/filter_bam.py b/workflow/scripts/filter_bam.py index 5f02829..1c08e8e 100755 --- a/workflow/scripts/filter_bam.py +++ b/workflow/scripts/filter_bam.py @@ -1,27 +1,45 @@ import pysam import argparse + def main(): parser = argparse.ArgumentParser( description="Remove all non-proper-pair, chimeric, secondary, supplementary, unmapped alignments from input BAM file" ) - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=str, - help="Input BAM file") - parser.add_argument("-o","--outbam",dest="outbam",required=True,type=str, - help="Output primary alignment only BAM file") - parser.add_argument('-p',"--pe",dest="pe",required=False,action='store_true', default=False, - help="set this if BAM is paired end") - args = parser.parse_args() + parser.add_argument( + "-i", "--inbam", dest="inbam", required=True, type=str, help="Input BAM file" + ) + parser.add_argument( + "-o", + "--outbam", + dest="outbam", + required=True, + type=str, + help="Output primary alignment only BAM file", + ) + parser.add_argument( + "-p", + "--pe", + 
dest="pe", + required=False, + action="store_true", + default=False, + help="set this if BAM is paired end", + ) + args = parser.parse_args() samfile = pysam.AlignmentFile(args.inbam, "rb") outfile = pysam.AlignmentFile(args.outbam, "wb", template=samfile) for read in samfile.fetch(): - if args.pe and ( read.reference_id != read.next_reference_id ): continue # only works for PE ... for SE read.next_reference_id is -1 - if args.pe and ( not read.is_proper_pair ): continue - if read.is_secondary or read.is_supplementary or read.is_unmapped : continue + if args.pe and (read.reference_id != read.next_reference_id): + continue # only works for PE ... for SE read.next_reference_id is -1 + if args.pe and (not read.is_proper_pair): + continue + if read.is_secondary or read.is_supplementary or read.is_unmapped: + continue outfile.write(read) samfile.close() outfile.close() if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/workflow/scripts/filter_bam_by_readids.py b/workflow/scripts/filter_bam_by_readids.py index b03f426..178ec83 100755 --- a/workflow/scripts/filter_bam_by_readids.py +++ b/workflow/scripts/filter_bam_by_readids.py @@ -3,8 +3,9 @@ import argparse import os import gzip + # """ -# Script takes a BAM file with a list of readids, then +# Script takes a BAM file with a list of readids, then # filters the input BAM for those readids and outputs # only those readid alignments into a new BAM file # @Params: @@ -18,42 +19,53 @@ # path to output BAM file # """ -parser = argparse.ArgumentParser(description='Filter BAM by readids') -parser.add_argument('--inputBAM', dest='inputBAM', type=str, required=True, - help='input BAM file') -parser.add_argument('--outputBAM', dest='outputBAM', type=str, required=True, - help='filtered output BAM file') -parser.add_argument('--readids', dest='readids', type=str, required=True, - help='file with readids to keep (one readid per line)') +parser = argparse.ArgumentParser(description="Filter BAM by 
readids") +parser.add_argument( + "--inputBAM", dest="inputBAM", type=str, required=True, help="input BAM file" +) +parser.add_argument( + "--outputBAM", + dest="outputBAM", + type=str, + required=True, + help="filtered output BAM file", +) +parser.add_argument( + "--readids", + dest="readids", + type=str, + required=True, + help="file with readids to keep (one readid per line)", +) args = parser.parse_args() -split_tup = os.path.splitext(args.readids) +split_tup = os.path.splitext(args.readids) # extract the file name and extension file_name = split_tup[0] file_extension = split_tup[1] -rids_dict=dict() -if file_extension==".gz": - rids=list() - with gzip.open(args.readids,'rt') as readids: - for l in readids: - l = l.strip() - rids_dict[l]=1 +rids_dict = dict() +if file_extension == ".gz": + rids = list() + with gzip.open(args.readids, "rt") as readids: + for l in readids: + l = l.strip() + rids_dict[l] = 1 else: - rids = list(map(lambda x:x.strip(),open(args.readids,'r').readlines())) - rids = list(set(rids)) - for rid in rids: - rids_dict[rid]=1 + rids = list(map(lambda x: x.strip(), open(args.readids, "r").readlines())) + rids = list(set(rids)) + for rid in rids: + rids_dict[rid] = 1 inBAM = pysam.AlignmentFile(args.inputBAM, "rb") outBAM = pysam.AlignmentFile(args.outputBAM, "wb", template=inBAM) -count=0 +count = 0 for read in inBAM.fetch(): - count+=1 - if count%1000000 == 0: - print("%d reads read!"%(count)) - qn=read.query_name - if qn in rids_dict: - outBAM.write(read) + count += 1 + if count % 1000000 == 0: + print("%d reads read!" 
% (count)) + qn = read.query_name + if qn in rids_dict: + outBAM.write(read) inBAM.close() outBAM.close() diff --git a/workflow/scripts/filter_bam_for_BSJs.py b/workflow/scripts/filter_bam_for_BSJs.py index d0473c7..5186bf0 100755 --- a/workflow/scripts/filter_bam_for_BSJs.py +++ b/workflow/scripts/filter_bam_for_BSJs.py @@ -7,7 +7,7 @@ # """ # input is a BAM file containing all BSJ alignments along with some chimeric alignments -# this script filters out the non-BSJ alignments and outputs the BSJ-only alignments to a +# this script filters out the non-BSJ alignments and outputs the BSJ-only alignments to a # new BAM file. # @Params: # @Inputs: @@ -20,38 +20,54 @@ # path to output BAM file # """ + def split_text(s): for k, g in groupby(s, str.isalpha): - yield ''.join(g) + yield "".join(g) + def get_alt_cigars(c): - alt_cigars=[] - x=list(split_text(c)) - if x[1]=="H": - alt_cigars.append("".join(x[2:])) - if x[-1]=="H": - alt_cigars.append("".join(x[:-2])) - if x[1]=="H" and x[-1]=="H": - alt_cigars.append("".join(x[2:-2])) - return alt_cigars + alt_cigars = [] + x = list(split_text(c)) + if x[1] == "H": + alt_cigars.append("".join(x[2:])) + if x[-1] == "H": + alt_cigars.append("".join(x[:-2])) + if x[1] == "H" and x[-1] == "H": + alt_cigars.append("".join(x[2:-2])) + return alt_cigars + pp = pprint.PrettyPrinter(indent=4) -parser = argparse.ArgumentParser(description='Filter readid filtered BAM file for BSJ-only alignments') -parser.add_argument('--inputBAM', dest='inputBAM', type=str, required=True, - help='input BAM file') -parser.add_argument('--outputBAM', dest='outputBAM', type=str, required=True, - help='filtered output BAM file') -parser.add_argument('--readids', dest='readids', type=str, required=True, - help='file with readids to keep (tab-delimited with columns:readid,chrom,strand,site1,site2,cigarlist)') +parser = argparse.ArgumentParser( + description="Filter readid filtered BAM file for BSJ-only alignments" +) +parser.add_argument( + "--inputBAM", 
dest="inputBAM", type=str, required=True, help="input BAM file" +) +parser.add_argument( + "--outputBAM", + dest="outputBAM", + type=str, + required=True, + help="filtered output BAM file", +) +parser.add_argument( + "--readids", + dest="readids", + type=str, + required=True, + help="file with readids to keep (tab-delimited with columns:readid,chrom,strand,site1,site2,cigarlist)", +) args = parser.parse_args() -rids=dict() +rids = dict() inBAM = pysam.AlignmentFile(args.inputBAM, "rb") outBAM = pysam.AlignmentFile(args.outputBAM, "wb", template=inBAM) -# multiple alignments of a read are grouped together by +# multiple alignments of a read are grouped together by # HI i Query hit index ... eg. HI:i:1, HI:i:2 etc. --> See https://samtools.github.io/hts-specs/SAMtags.pdf -# each HI represents a different alignent for the pair and +# each HI represents a different alignent for the pair and # generally contains 3 lines in the alignment file eg: # SRR1731877.10077876 163 chr16 16699505 1 30S53M = 16699513 53 CTACCGTTTCCTGTGATAAGTGCTACTTCTTGAGGCTCTGTTCCATCTTTGTCCCTTTCCAGAGATTTAATCTCTCTCTCTCT ;DDDDHBFHHDG@AAFHHGEHHIIIIIIIIIBDGIEH3DDHGC4?09?BBB0999B?8)./>FH>GHG>==CE@@A>>AE?;; NH:i:4 HI:i:1 AS:i:97 nM:i:0 NM:i:0 SA:Z:chr16,16700448,+,30M53H,1,0; # SRR1731877.10077876 83 chr16 16699513 1 45M = 16699505 -53 TGTTCCATCTTTGTCCCTTTCCAGAGATTTAATCTCTCTCTCTCT DGD>B@?;B@GFC88ECADCFHEE@C<@C:2A>",rids[readid][hi]['sites']) - # print(site in rids[readid][hi]['sites']) - # print(cigars,"====>>",rids[readid][hi]['cigars']) - # print(rids[readid][hi]['cigars'] == cigars) - if site in rids[readid][hi]['sites']: -## TEST #2 -## we know that site is present in sites of this alignment -## next we ensure that all 3 alignments of this HI value are on the same chromosome/reference - references=[] - for read in rids[readid][hi]['alignments']: - references.append(read.reference_name) - if len(list(set(references)))!=1: # same HI but different aligning to different chromosomes - continue - 
rids[readid][hi]['alignments']=list(set(rids[readid][hi]['alignments'])) -## TEST #3.1 -## we know that site is in 'sites' and all 3 alignment from the HI value are on the same chromosome -## next we check if the CIGAR scores of the 3 alignments are the same as the CIGAR scores from the readids file - if rids[readid][hi]['cigars'] == cigars: # lists are sorted before comparison - for read in rids[readid][hi]['alignments']: - outBAM.write(read) - else: -## TEST #3.2 -## some alignments are missed because of extra soft clipping in one of the 3 reported alignments in a single HI value -## eg. -# SRR1731877.16929220 83 chr7 99416198 255 5S48M1S = 99416198 -48 GGAAGTCCACCACCAGAAAACCCGCTACATCTTCGACCTCTTTTACAAGCGGAC FEHC>HHE@GC=GCIIJIGDIJJIJJIJJJJJJIHGJJIIIHHGHHFFFFFCC@ NH:i:1 HI:i:1 AS:i:95 nM:i:0 NM:i:0 -# SRR1731877.16929220 163 chr7 99416198 255 42S48M1S = 99416198 48 CAGAAAACCCGCTACATCTGCGACCTCTTTTACAAGCGGAAATCCACCACCAGAAAACCCGCTACATCTTCGACCTCTTTTACAAGCGGAC @@BFFFFFHHHHHJJJJJJHIJJJJJJJJJJIJJJIIIIGJJCFHHGJHGEHEFFFFEDCDDDDDDDDDDEDDDDDDDDDDDDCDDDDDDD NH:i:1 HI:i:1 AS:i:95 nM:i:0 NM:i:0 SA:Z:chr7,99416206,+,42M49H,255,1; -# SRR1731877.16929220 2209 chr7 99416206 255 42M49H = 99416198 0 CAGAAAACCCGCTACATCTGCGACCTCTTTTACAAGCGGAAA @@BFFFFFHHHHHJJJJJJHIJJJJJJJJJJIJJJIIIIGJJ NH:i:1 HI:i:1 AS:i:39 nM:i:1 NM:i:1 SA:Z:chr7,99416198,+,42S48M1S,255,0; -# the readids file contains -# SRR1731877.16929220 chr7 - 99416197 99416248 42H48M,48M1H,42M49H -# cigars from readids file --> 42H48M,48M1H,42M49H -# cigars from bam file --> 42H48M,5H48M1H,42M49H -# this is fix to include these alignments in the output BAM -# recompare cigars after removing softclippings at the ends of the CIGAR of non-matching cigar string - aminusb=list(set(rids[readid][hi]['cigars'])-set(cigars)) - if len(aminusb)==1: - restcigars=list(set(rids[readid][hi]['cigars'])-set(aminusb)) - altcigars=get_alt_cigars(aminusb[0]) - for ac in altcigars: - newcigars=[] - newcigars.extend(restcigars) - newcigars.append(ac) - 
newcigars.sort() - if newcigars == cigars: - for read in rids[readid][hi]['alignments']: - outBAM.write(read) - break -## TEST #3.3 -## similar to 3.2 some alignments are missed because of extra soft clipping in 2 of the 3 reported alignments in a single HI value -# this is fix for that scenario - if len(aminusb)==2: - commoncigar=list(set(rids[readid][hi]['cigars'])-set(aminusb)) - altcigars1=get_alt_cigars(aminusb[0]) - altcigars2=get_alt_cigars(aminusb[1]) - found=0 - for ac1 in altcigars1: - if found!=0: - break - tmpcigars=[] - tmpcigars.extend(commoncigar) - tmpcigars.append(ac1) - for ac2 in altcigars2: - newcigars=[] - newcigars.extend(tmpcigars) - newcigars.append(ac2) - newcigars.sort() - if newcigars == cigars: - for read in rids[readid][hi]['alignments']: - outBAM.write(read) - found=1 - break + line = line.strip().split("\t") + # print(line) + ## SRR1731877.10077876 chr16 - 16699504 16700478 30H53M,45M,30M53H + ## columns:readid,chrom,strand,site1,site2,cigarlist + ## this is generated by junctions2readids.py from the .junction file from STAR2p + readid = line[0] + chrom = line[1] + strand = line[2] + site1 = line[3] + site2 = line[4] + cigars = line[5].split(",") + cigars.sort() + ## as we are searching for the alignment which represents this occurance + ## (which of the multiple HI values should we report in the output BAM) + ## of this readid in the readids file, + ## TEST #1 + ## We first compare site (or coordinate) + ## If strand is -ve, then site1 is expected to be in the reported alignment + ## but if the strand is +ve, the site2 is expected to be in the 'sites' list + ## note: we have to add 1 to switch from 0-based to 1-based + if strand == "-": + site = int(site1) + 1 + else: + site = int(site2) + 1 + ## readid will always be part of rids... 
but just in case + if not readid in rids: + continue + for hi in rids[readid].keys(): + # print(readid,hi,site) + # print(site,"===>>",rids[readid][hi]['sites']) + # print(site in rids[readid][hi]['sites']) + # print(cigars,"====>>",rids[readid][hi]['cigars']) + # print(rids[readid][hi]['cigars'] == cigars) + if site in rids[readid][hi]["sites"]: + ## TEST #2 + ## we know that site is present in sites of this alignment + ## next we ensure that all 3 alignments of this HI value are on the same chromosome/reference + references = [] + for read in rids[readid][hi]["alignments"]: + references.append(read.reference_name) + if ( + len(list(set(references))) != 1 + ): # same HI but different aligning to different chromosomes + continue + rids[readid][hi]["alignments"] = list(set(rids[readid][hi]["alignments"])) + ## TEST #3.1 + ## we know that site is in 'sites' and all 3 alignment from the HI value are on the same chromosome + ## next we check if the CIGAR scores of the 3 alignments are the same as the CIGAR scores from the readids file + if ( + rids[readid][hi]["cigars"] == cigars + ): # lists are sorted before comparison + for read in rids[readid][hi]["alignments"]: + outBAM.write(read) + else: + ## TEST #3.2 + ## some alignments are missed because of extra soft clipping in one of the 3 reported alignments in a single HI value + ## eg. 
+ # SRR1731877.16929220 83 chr7 99416198 255 5S48M1S = 99416198 -48 GGAAGTCCACCACCAGAAAACCCGCTACATCTTCGACCTCTTTTACAAGCGGAC FEHC>HHE@GC=GCIIJIGDIJJIJJIJJJJJJIHGJJIIIHHGHHFFFFFCC@ NH:i:1 HI:i:1 AS:i:95 nM:i:0 NM:i:0 + # SRR1731877.16929220 163 chr7 99416198 255 42S48M1S = 99416198 48 CAGAAAACCCGCTACATCTGCGACCTCTTTTACAAGCGGAAATCCACCACCAGAAAACCCGCTACATCTTCGACCTCTTTTACAAGCGGAC @@BFFFFFHHHHHJJJJJJHIJJJJJJJJJJIJJJIIIIGJJCFHHGJHGEHEFFFFEDCDDDDDDDDDDEDDDDDDDDDDDDCDDDDDDD NH:i:1 HI:i:1 AS:i:95 nM:i:0 NM:i:0 SA:Z:chr7,99416206,+,42M49H,255,1; + # SRR1731877.16929220 2209 chr7 99416206 255 42M49H = 99416198 0 CAGAAAACCCGCTACATCTGCGACCTCTTTTACAAGCGGAAA @@BFFFFFHHHHHJJJJJJHIJJJJJJJJJJIJJJIIIIGJJ NH:i:1 HI:i:1 AS:i:39 nM:i:1 NM:i:1 SA:Z:chr7,99416198,+,42S48M1S,255,0; + # the readids file contains + # SRR1731877.16929220 chr7 - 99416197 99416248 42H48M,48M1H,42M49H + # cigars from readids file --> 42H48M,48M1H,42M49H + # cigars from bam file --> 42H48M,5H48M1H,42M49H + # this is fix to include these alignments in the output BAM + # recompare cigars after removing softclippings at the ends of the CIGAR of non-matching cigar string + aminusb = list(set(rids[readid][hi]["cigars"]) - set(cigars)) + if len(aminusb) == 1: + restcigars = list(set(rids[readid][hi]["cigars"]) - set(aminusb)) + altcigars = get_alt_cigars(aminusb[0]) + for ac in altcigars: + newcigars = [] + newcigars.extend(restcigars) + newcigars.append(ac) + newcigars.sort() + if newcigars == cigars: + for read in rids[readid][hi]["alignments"]: + outBAM.write(read) + break + ## TEST #3.3 + ## similar to 3.2 some alignments are missed because of extra soft clipping in 2 of the 3 reported alignments in a single HI value + # this is fix for that scenario + if len(aminusb) == 2: + commoncigar = list(set(rids[readid][hi]["cigars"]) - set(aminusb)) + altcigars1 = get_alt_cigars(aminusb[0]) + altcigars2 = get_alt_cigars(aminusb[1]) + found = 0 + for ac1 in altcigars1: + if found != 0: + break + tmpcigars = [] + 
tmpcigars.extend(commoncigar) + tmpcigars.append(ac1) + for ac2 in altcigars2: + newcigars = [] + newcigars.extend(tmpcigars) + newcigars.append(ac2) + newcigars.sort() + if newcigars == cigars: + for read in rids[readid][hi]["alignments"]: + outBAM.write(read) + found = 1 + break outBAM.close() diff --git a/workflow/scripts/filter_bam_for_linear_reads.py b/workflow/scripts/filter_bam_for_linear_reads.py index c8bf79a..085be9e 100755 --- a/workflow/scripts/filter_bam_for_linear_reads.py +++ b/workflow/scripts/filter_bam_for_linear_reads.py @@ -21,33 +21,42 @@ # """ # if readid is NOT in the junctions file then it is a read with no Junction ... aka LINEAR! - + + class Read: def __init__(self): - self.alignments=list() - self.read1exists=False - self.read2exists=False - - def append_alignment(self,alignment): + self.alignments = list() + self.read1exists = False + self.read2exists = False + + def append_alignment(self, alignment): self.alignments.append(alignment) if alignment.is_read1: - self.read1exists=True + self.read1exists = True if alignment.is_read2: - self.read2exists=True - - def is_valid_read(self): - return(self.read1exists and self.read2exists) - + self.read2exists = True + def is_valid_read(self): + return self.read1exists and self.read2exists -parser = argparse.ArgumentParser(description='Filter BAM to exclude BSJs and other chimeric alignments') -parser.add_argument('--inputBAM', dest='inputBAM', type=str, required=True, - help='input BAM file') -parser.add_argument('--outputBAM', dest='outputBAM', type=str, required=True, - help='filtered output BAM file') -parser.add_argument('-j',dest='junctions',required=True,help='chimeric junctions file') -parser.add_argument('-p',dest='paired', help='bam is paired', action='store_true') +parser = argparse.ArgumentParser( + description="Filter BAM to exclude BSJs and other chimeric alignments" +) +parser.add_argument( + "--inputBAM", dest="inputBAM", type=str, required=True, help="input BAM file" +) 
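These BAM-filtering scripts all reduce to a keep/drop predicate on alignment flags plus an O(1) readid-membership test (the dict/set lookups the comments call out). A minimal pysam-free stand-in, with hypothetical names and no real BAM, makes that logic testable in isolation:

```python
# Hypothetical chimeric readids, standing in for the ids collected from the
# 10th column of the Chimeric.out.junction file.
chimeric_ids = {"SRR1.1", "SRR1.7"}

def keep_linear(qname, is_proper_pair, is_secondary,
                is_supplementary, is_unmapped, paired=True):
    """Mirror of the filter pattern above: drop improper pairs (PE only),
    secondary, supplementary, and unmapped alignments, then drop any
    alignment whose readid was seen in the chimeric junctions file."""
    if paired and not is_proper_pair:
        return False
    if is_secondary or is_supplementary or is_unmapped:
        return False
    # set/dict membership is O(1); the equivalent 'in list' scan is O(n)
    return qname not in chimeric_ids
```

In the real scripts the boolean arguments correspond to pysam's `read.is_proper_pair`, `read.is_secondary`, `read.is_supplementary`, and `read.is_unmapped` properties, and `qname` to `read.query_name`.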
+parser.add_argument(
+    "--outputBAM",
+    dest="outputBAM",
+    type=str,
+    required=True,
+    help="filtered output BAM file",
+)
+parser.add_argument(
+    "-j", dest="junctions", required=True, help="chimeric junctions file"
+)
+parser.add_argument("-p", dest="paired", help="bam is paired", action="store_true")
 args = parser.parse_args()
 # rids=list()
 inBAM = pysam.AlignmentFile(args.inputBAM, "rb")
@@ -98,31 +107,36 @@ def is_valid_read(self):
 # get a list of the chimeric readids
-rids_dict=dict()
-with open(args.junctions, 'r') as junc_f:
+rids_dict = dict()
+with open(args.junctions, "r") as junc_f:
     for line in junc_f:
         if "junction_type" in line:
            continue
-        readid=line.split()[9] # 10th column is read-name
-        rids_dict[readid]=1
+        readid = line.split()[9]  # 10th column is read-name
+        rids_dict[readid] = 1
 print(f"Total chimeric readids:{len(rids_dict)}")
-if args.paired: # paired-end
+if args.paired:  # paired-end
     for read in inBAM.fetch(until_eof=True):
-        if not read.is_proper_pair or read.is_secondary or read.is_supplementary or read.is_unmapped:
+        if (
+            not read.is_proper_pair
+            or read.is_secondary
+            or read.is_supplementary
+            or read.is_unmapped
+        ):
             continue
         qname = read.query_name
-        if qname in rids_dict: # "in" dict is much faster than "in" list
-            continue # if readid is in dict then it is a junction read ... so ignore it!
+        if qname in rids_dict:  # "in" dict is much faster than "in" list
+            continue  # if readid is in dict then it is a junction read ... so ignore it!
         else:
             outBAM.write(read)
-else: # single-end
-    incount=0
-    outcount=0
+else:  # single-end
+    incount = 0
+    outcount = 0
     for read in inBAM.fetch(until_eof=True):
-        incount+=1
-        if incount%1000==0:
+        incount += 1
+        if incount % 1000 == 0:
             print(f"{incount/1000000:.4f}m reads read in")
             print(f"{outcount/1000000:.4f}m reads written out")
         if read.is_secondary or read.is_supplementary or read.is_unmapped:
@@ -131,11 +145,10 @@ def is_valid_read(self):
         if qname in rids_dict:
             continue
         else:
-            outcount+=1
+            outcount += 1
             outBAM.write(read)
-
 inBAM.close()
 outBAM.close()
diff --git a/workflow/scripts/filter_bam_for_splice_reads.py b/workflow/scripts/filter_bam_for_splice_reads.py
index ac04f76..51eee92 100755
--- a/workflow/scripts/filter_bam_for_splice_reads.py
+++ b/workflow/scripts/filter_bam_for_splice_reads.py
@@ -1,6 +1,7 @@
 import pysam
 import sys
 import argparse
+
 # """
 # Script takes a STAR 2p BAM file and tab-delimited file with splice junctions in the first 3 columns,
 # and outputs spliced-only alignments
@@ -15,87 +16,96 @@
 # path to output BAM file
 # """
-parser = argparse.ArgumentParser(description='extract spliced reads from bam file')
-parser.add_argument('--inbam',dest='inbam',required=True,help='STAR bam file with index')
-parser.add_argument('--tab',dest='tab',required=True,help='tab file with splice junctions in the first 3 columns')
-parser.add_argument('--outbam',dest='outbam',required=True,help='Output bam filename')
-args=parser.parse_args()
+parser = argparse.ArgumentParser(description="extract spliced reads from bam file")
+parser.add_argument(
+    "--inbam", dest="inbam", required=True, help="STAR bam file with index"
+)
+parser.add_argument(
+    "--tab",
+    dest="tab",
+    required=True,
+    help="tab file with splice junctions in the first 3 columns",
+)
+parser.add_argument(
+    "--outbam", dest="outbam", required=True, help="Output bam filename"
+)
+args = parser.parse_args()
 
-inbam = pysam.AlignmentFile(args.inbam, "rb" )
-outbam = pysam.AlignmentFile(args.outbam, "wb", template=inbam )
+inbam = pysam.AlignmentFile(args.inbam, "rb")
+outbam = pysam.AlignmentFile(args.outbam, "wb", template=inbam)
 tab = open(args.tab)
 junctions = tab.readlines()
 junctions.pop(0)
 tab.close()
-count=0
-threshold=0
-incr=5
+count = 0
+threshold = 0
+incr = 5
 for l in junctions:
-    count+=1
-    if count*100/len(junctions)>threshold:
-        print("%d %% complete!"% (threshold))
-        threshold+=incr
-    l=l.strip().split("\t")
-    c=l[0]
-    s=int(l[1])
-    e=int(l[2])
-# get chromosome name, start and end positions for the junction
-# and fetch reads aligning to this region using "fetch"
-# ref: https://pysam.readthedocs.io/en/latest/api.html#pysam.FastaFile.fetch
-    for read in inbam.fetch(c,s-200,e+200):
-        # for read in inbam.fetch(c):
-# get cigarstring to replace softclips
-        cigar=read.cigarstring
-# replace softclips with hardclip
-        cigar=cigar.replace("S","H")
-        cigart=read.cigartuples
+    count += 1
+    if count * 100 / len(junctions) > threshold:
+        print("%d %% complete!" % (threshold))
+        threshold += incr
+    l = l.strip().split("\t")
+    c = l[0]
+    s = int(l[1])
+    e = int(l[2])
+    # get chromosome name, start and end positions for the junction
+    # and fetch reads aligning to this region using "fetch"
+    # ref: https://pysam.readthedocs.io/en/latest/api.html#pysam.FastaFile.fetch
+    for read in inbam.fetch(c, s - 200, e + 200):
+        # for read in inbam.fetch(c):
+        # get cigarstring to replace softclips
+        cigar = read.cigarstring
+        # replace softclips with hardclip
+        cigar = cigar.replace("S", "H")
+        cigart = read.cigartuples
 
-# if cigartuple contains
-# N BAM_CREF_SKIP 3
-# then it is a spliced read!
+        # if cigartuple contains
+        # N BAM_CREF_SKIP 3
+        # then it is a spliced read!
-# ref: https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.cigartuples -# cigartuples operation list is -# M BAM_CMATCH 0 -# I BAM_CINS 1 -# D BAM_CDEL 2 -# N BAM_CREF_SKIP 3 -# S BAM_CSOFT_CLIP 4 -# H BAM_CHARD_CLIP 5 -# P BAM_CPAD 6 -# = BAM_CEQUAL 7 -# X BAM_CDIFF 8 -# B BAM_CBACK 9 -# cigartuples returns a list of tuples of (operation, length) -# eg. 30M is returned as [(0, 30)] -# N in CIGAR score is index 3 in tuple represents BAM_CREF_SKIP indicative of spliced read - if 3 in list(map(lambda z:z[0],cigart)): -# cigart[list(map(lambda z:z[0],cigart)).index(0):] -# get the first item of each tuple in the list of tuples -# first item will be the operation from the above table -# if 3 is in the new list ... means that there was a BAM_CREF_SKIP -# BAM_CREF_SKIP in CIGAR score right after a match (BAM_CMATCH) -# suggests spliced alignment aka spliced read - cigart=cigart[list(map(lambda z:z[0],cigart)).index(0):] - if cigart[0][0]==0 and cigart[1][0]==3: -# CIGAR has match ... followed by skip ... aka spliced read -# so gather start and end coordinates - start=read.reference_start+cigart[0][1]+1 - end=start+cigart[1][1]-1 + # ref: https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.cigartuples + # cigartuples operation list is + # M BAM_CMATCH 0 + # I BAM_CINS 1 + # D BAM_CDEL 2 + # N BAM_CREF_SKIP 3 + # S BAM_CSOFT_CLIP 4 + # H BAM_CHARD_CLIP 5 + # P BAM_CPAD 6 + # = BAM_CEQUAL 7 + # X BAM_CDIFF 8 + # B BAM_CBACK 9 + # cigartuples returns a list of tuples of (operation, length) + # eg. 30M is returned as [(0, 30)] + # N in CIGAR score is index 3 in tuple represents BAM_CREF_SKIP indicative of spliced read + if 3 in list(map(lambda z: z[0], cigart)): + # cigart[list(map(lambda z:z[0],cigart)).index(0):] + # get the first item of each tuple in the list of tuples + # first item will be the operation from the above table + # if 3 is in the new list ... 
means that there was a BAM_CREF_SKIP + # BAM_CREF_SKIP in CIGAR score right after a match (BAM_CMATCH) + # suggests spliced alignment aka spliced read + cigart = cigart[list(map(lambda z: z[0], cigart)).index(0) :] + if cigart[0][0] == 0 and cigart[1][0] == 3: + # CIGAR has match ... followed by skip ... aka spliced read + # so gather start and end coordinates + start = read.reference_start + cigart[0][1] + 1 + end = start + cigart[1][1] - 1 # print(read) # print(cigart) # print(c+"##"+str(s)+"##"+str(e),start-s,end-e,read.get_reference_positions(full_length=True),read) - if start==s and end==e: -# check if start and end are in the junctions file -# if yes then write to output file + if start == s and end == e: + # check if start and end are in the junctions file + # if yes then write to output file # print(read) # print(cigart) # print(start,end) # print(s,e) # exit() outbam.write(read) - #print(read.query_name,c,s,e,start,end) - #print(read) + # print(read.query_name,c,s,e,start,end) + # print(read) inbam.close() outbam.close() -exit() \ No newline at end of file +exit() diff --git a/workflow/scripts/filter_ciriout.py b/workflow/scripts/filter_ciriout.py index 686ddac..830d128 100755 --- a/workflow/scripts/filter_ciriout.py +++ b/workflow/scripts/filter_ciriout.py @@ -2,6 +2,7 @@ import argparse import inspect + # CIRI2 output file has following columns: # | # | colName | Description | # |----|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -19,105 +20,183 @@ # | 12 | junction_reads_ID | all of the 
circular junction read IDs (split by ",") | # ref: https://ciri-cookbook.readthedocs.io/en/latest/CIRI2.html#an-example-of-running-ciri2 class CIRIOUT: - def __init__(self,entry,chrom="",start=0,end=0,nreads=0,size=0,host_additive_virus="additive",filter_out=False): - self.entry=entry - l=entry.strip().split('\t') - self.chrom=l[1] - self.start=int(l[2])-1 - self.end=int(l[3]) - self.nreads=int(l[4]) - self.size=self.start-self.end - self.filter_out=False - + def __init__( + self, + entry, + chrom="", + start=0, + end=0, + nreads=0, + size=0, + host_additive_virus="additive", + filter_out=False, + ): + self.entry = entry + l = entry.strip().split("\t") + self.chrom = l[1] + self.start = int(l[2]) - 1 + self.end = int(l[3]) + self.nreads = int(l[4]) + self.size = self.start - self.end + self.filter_out = False + # @classmethod - def set_host_additive_virus(self,regions): - self.host_additive_virus=_get_host_additive_virus(regions=regions,seqname=self.chrom) - + def set_host_additive_virus(self, regions): + self.host_additive_virus = _get_host_additive_virus( + regions=regions, seqname=self.chrom + ) + # @classmethod - def filter_by_nreads(self,minreads): - if self.nreads < minreads: self.filter_out=True - + def filter_by_nreads(self, minreads): + if self.nreads < minreads: + self.filter_out = True + # @classmethod - def filter_by_size(self,host_min,host_max,virus_min,virus_max): - if self.host_additive_virus=="host": - if self.size < host_min : self.filter_out=True - if self.size > host_max : self.filter_out=True - elif self.host_additive_virus=="virus": - if self.size < virus_min : self.filter_out=True - if self.size > virus_max : self.filter_out=True + def filter_by_size(self, host_min, host_max, virus_min, virus_max): + if self.host_additive_virus == "host": + if self.size < host_min: + self.filter_out = True + if self.size > host_max: + self.filter_out = True + elif self.host_additive_virus == "virus": + if self.size < virus_min: + self.filter_out = True + if 
self.size > virus_max: + self.filter_out = True else: - self.filter_out=True + self.filter_out = True + - -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") + viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions + -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." 
% (seqname)) -parser = argparse.ArgumentParser(description='Filter CIRI2 Per Sample Counts Table') -parser.add_argument('--ciriout', dest='ciriout', type=str, required=True, - help='ciri out file') -parser.add_argument('--back_spliced_min_reads', dest='back_spliced_min_reads', type=int, required=True, - help='back_spliced minimum read threshold') -parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only') -parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') -parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only') -parser.add_argument('--host_filter_min', dest='host_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for host') -parser.add_argument('--virus_filter_min', dest='virus_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for virus') -parser.add_argument('--host_filter_max', dest='host_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for host') -parser.add_argument('--virus_filter_max', dest='virus_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for virus') -parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. 
ref.fa.regions') -parser.add_argument('-o',dest='outfile',required=True,help='filtered ciriout file') +parser = argparse.ArgumentParser(description="Filter CIRI2 Per Sample Counts Table") +parser.add_argument( + "--ciriout", dest="ciriout", type=str, required=True, help="ciri out file" +) +parser.add_argument( + "--back_spliced_min_reads", + dest="back_spliced_min_reads", + type=int, + required=True, + help="back_spliced minimum read threshold", +) +parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only", +) +parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out", +) +parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only", +) +parser.add_argument( + "--host_filter_min", + dest="host_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_min", + dest="virus_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for virus", +) +parser.add_argument( + "--host_filter_max", + dest="host_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_max", + dest="virus_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for virus", +) +parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. 
ref.fa.regions", +) +parser.add_argument("-o", dest="outfile", required=True, help="filtered ciriout file") args = parser.parse_args() -regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) -outfile = open(args.outfile,'w') -infile = open(args.ciriout,'r') +regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, +) +outfile = open(args.outfile, "w") +infile = open(args.ciriout, "r") alllines = infile.readlines() header = alllines.pop(0) -outfile.write("%s"%(header)) +outfile.write("%s" % (header)) infile.close() for l in alllines: out = CIRIOUT(entry=l) out.set_host_additive_virus(regions=regions) out.filter_by_nreads(args.back_spliced_min_reads) if out.filter_out == False: - out.filter_by_size(host_min=args.host_filter_min,host_max=args.host_filter_max,virus_min=args.virus_filter_min,virus_max=args.virus_filter_max) + out.filter_by_size( + host_min=args.host_filter_min, + host_max=args.host_filter_max, + virus_min=args.virus_filter_min, + virus_max=args.virus_filter_max, + ) if out.filter_out == True: outfile.write(l) outfile.close() diff --git a/workflow/scripts/filter_dcc.py b/workflow/scripts/filter_dcc.py index f518d6a..a9e1707 100755 --- a/workflow/scripts/filter_dcc.py +++ b/workflow/scripts/filter_dcc.py @@ -2,6 +2,7 @@ import argparse import inspect + # DCC counts table input/output file has following columns: # | # | colName | Description | # |----|----------------------|-----------------------------------------------------------------| @@ -12,105 +13,188 @@ # | 5 | read_count | | # | 6 | dcc_annotation | this is JunctionType##Start-End Region from CircCoordinates file| class DCC: - def __init__(self,entry,chrom="",start=0,end=0,nreads=0,size=0,host_additive_virus="additive",filter_out=False): - self.entry=entry - l=entry.strip().split('\t') - self.chrom=l[0] - self.start=int(l[1]) - self.end=int(l[2]) - self.nreads=int(l[4]) - 
self.size=self.start-self.end - self.filter_out=False - + def __init__( + self, + entry, + chrom="", + start=0, + end=0, + nreads=0, + size=0, + host_additive_virus="additive", + filter_out=False, + ): + self.entry = entry + l = entry.strip().split("\t") + self.chrom = l[0] + self.start = int(l[1]) + self.end = int(l[2]) + self.nreads = int(l[4]) + self.size = self.start - self.end + self.filter_out = False + # @classmethod - def set_host_additive_virus(self,regions): - self.host_additive_virus=_get_host_additive_virus(regions=regions,seqname=self.chrom) - + def set_host_additive_virus(self, regions): + self.host_additive_virus = _get_host_additive_virus( + regions=regions, seqname=self.chrom + ) + # @classmethod - def filter_by_nreads(self,minreads): - if self.nreads < minreads: self.filter_out=True - + def filter_by_nreads(self, minreads): + if self.nreads < minreads: + self.filter_out = True + # @classmethod - def filter_by_size(self,host_min,host_max,virus_min,virus_max): - if self.host_additive_virus=="host": - if self.size < host_min : self.filter_out=True - if self.size > host_max : self.filter_out=True - elif self.host_additive_virus=="virus": - if self.size < virus_min : self.filter_out=True - if self.size > virus_max : self.filter_out=True + def filter_by_size(self, host_min, host_max, virus_min, virus_max): + if self.host_additive_virus == "host": + if self.size < host_min: + self.filter_out = True + if self.size > host_max: + self.filter_out = True + elif self.host_additive_virus == "virus": + if self.size < virus_min: + self.filter_out = True + if self.size > virus_max: + self.filter_out = True else: - self.filter_out=True + self.filter_out = True + - -def read_regions(regionsfile,host,additives,viruses): - host=host.split(",") - additives=additives.split(",") - viruses=viruses.split(",") - infile=open(regionsfile,'r') - regions=dict() +def read_regions(regionsfile, host, additives, viruses): + host = host.split(",") + additives = additives.split(",") 
+ viruses = viruses.split(",") + infile = open(regionsfile, "r") + regions = dict() for l in infile.readlines(): l = l.strip().split("\t") - region_name=l[0] - regions[region_name]=dict() - regions[region_name]['sequences']=dict() + region_name = l[0] + regions[region_name] = dict() + regions[region_name]["sequences"] = dict() if region_name in host: - regions[region_name]['host_additive_virus']="host" + regions[region_name]["host_additive_virus"] = "host" elif region_name in additives: - regions[region_name]['host_additive_virus']="additive" + regions[region_name]["host_additive_virus"] = "additive" elif region_name in viruses: - regions[region_name]['host_additive_virus']="virus" + regions[region_name]["host_additive_virus"] = "virus" else: exit("%s has unknown region. Its not a host or a additive or a virus!!") - sequence_names=l[1].split() + sequence_names = l[1].split() for s in sequence_names: - regions[region_name]['sequences'][s]=1 - return regions + regions[region_name]["sequences"][s] = 1 + return regions + -def _get_host_additive_virus(regions,seqname): - for k,v in regions.items(): - if seqname in v['sequences']: - return v['host_additive_virus'] +def _get_host_additive_virus(regions, seqname): + for k, v in regions.items(): + if seqname in v["sequences"]: + return v["host_additive_virus"] else: - exit("Sequence: %s does not have a region."%(seqname)) + exit("Sequence: %s does not have a region." % (seqname)) -parser = argparse.ArgumentParser(description='Filter DCC Per Sample Counts Table') -parser.add_argument('--in_dcc_counts_table', dest='intable', type=str, required=True, - help='DCC in file') -parser.add_argument('--back_spliced_min_reads', dest='back_spliced_min_reads', type=int, required=True, - help='back_spliced minimum read threshold') -parser.add_argument('--host', dest='host', type=str, required=True, - help='host name eg.hg38... 
single value...host_filter_min/host_filter_max filters are applied to this region only') -parser.add_argument('--additives', dest='additives', type=str, required=True, - help='additive name(s) eg.ERCC... comma-separated list... all BSJs in this region are filtered out') -parser.add_argument('--viruses', dest='viruses', type=str, required=True, - help='virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only') -parser.add_argument('--host_filter_min', dest='host_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for host') -parser.add_argument('--virus_filter_min', dest='virus_filter_min', type=int, required=False, default=150, - help='min BSJ size filter for virus') -parser.add_argument('--host_filter_max', dest='host_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for host') -parser.add_argument('--virus_filter_max', dest='virus_filter_max', type=int, required=False, default=5000, - help='max BSJ size filter for virus') -parser.add_argument('--regions', dest='regions', type=str, required=True, - help='regions file eg. ref.fa.regions') -parser.add_argument('--out_dcc_filtered_counts_table',dest='outfile',required=True,help='filtered DCC out file') +parser = argparse.ArgumentParser(description="Filter DCC Per Sample Counts Table") +parser.add_argument( + "--in_dcc_counts_table", dest="intable", type=str, required=True, help="DCC in file" +) +parser.add_argument( + "--back_spliced_min_reads", + dest="back_spliced_min_reads", + type=int, + required=True, + help="back_spliced minimum read threshold", +) +parser.add_argument( + "--host", + dest="host", + type=str, + required=True, + help="host name eg.hg38... single value...host_filter_min/host_filter_max filters are applied to this region only", +) +parser.add_argument( + "--additives", + dest="additives", + type=str, + required=True, + help="additive name(s) eg.ERCC... 
comma-separated list... all BSJs in this region are filtered out", +) +parser.add_argument( + "--viruses", + dest="viruses", + type=str, + required=True, + help="virus name(s) eg.NC_009333.1... comma-separated list...virus_filter_min/virus_filter_max filters are applied to this region only", +) +parser.add_argument( + "--host_filter_min", + dest="host_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_min", + dest="virus_filter_min", + type=int, + required=False, + default=150, + help="min BSJ size filter for virus", +) +parser.add_argument( + "--host_filter_max", + dest="host_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for host", +) +parser.add_argument( + "--virus_filter_max", + dest="virus_filter_max", + type=int, + required=False, + default=5000, + help="max BSJ size filter for virus", +) +parser.add_argument( + "--regions", + dest="regions", + type=str, + required=True, + help="regions file eg. 
ref.fa.regions", +) +parser.add_argument( + "--out_dcc_filtered_counts_table", + dest="outfile", + required=True, + help="filtered DCC out file", +) args = parser.parse_args() -regions = read_regions(regionsfile=args.regions,host=args.host,additives=args.additives,viruses=args.viruses) -outfile = open(args.outfile,'w') -infile = open(args.intable,'r') +regions = read_regions( + regionsfile=args.regions, + host=args.host, + additives=args.additives, + viruses=args.viruses, +) +outfile = open(args.outfile, "w") +infile = open(args.intable, "r") alllines = infile.readlines() header = alllines.pop(0) -outfile.write("%s"%(header)) +outfile.write("%s" % (header)) infile.close() for l in alllines: out = DCC(entry=l) out.set_host_additive_virus(regions=regions) out.filter_by_nreads(args.back_spliced_min_reads) if out.filter_out == False: - out.filter_by_size(host_min=args.host_filter_min,host_max=args.host_filter_max,virus_min=args.virus_filter_min,virus_max=args.virus_filter_max) + out.filter_by_size( + host_min=args.host_filter_min, + host_max=args.host_filter_max, + virus_min=args.virus_filter_min, + virus_max=args.virus_filter_max, + ) if out.filter_out == True: outfile.write(l) outfile.close() diff --git a/workflow/scripts/filter_junction.py b/workflow/scripts/filter_junction.py index 7a276c3..31acac6 100755 --- a/workflow/scripts/filter_junction.py +++ b/workflow/scripts/filter_junction.py @@ -1,6 +1,6 @@ import sys -for i in open(sys.argv[1]).readlines(): - j=i.split("\t") - if (j[0]=="chrKSHV") and (j[3]=="chrKSHV") : - print(i.strip()) +for i in open(sys.argv[1]).readlines(): + j = i.split("\t") + if (j[0] == "chrKSHV") and (j[3] == "chrKSHV"): + print(i.strip()) diff --git a/workflow/scripts/filter_junction_human.py b/workflow/scripts/filter_junction_human.py index c8e38cd..d315ae9 100755 --- a/workflow/scripts/filter_junction_human.py +++ b/workflow/scripts/filter_junction_human.py @@ -1,6 +1,6 @@ import sys -for i in open(sys.argv[1]).readlines(): - 
j=i.split("\t") - if (j[0]!="chrKSHV") and (j[3]==j[0]) : - print(i.strip()) +for i in open(sys.argv[1]).readlines(): + j = i.split("\t") + if (j[0] != "chrKSHV") and (j[3] == j[0]): + print(i.strip()) diff --git a/workflow/scripts/fix_gtfs.py b/workflow/scripts/fix_gtfs.py index 3eee716..5f5615e 100755 --- a/workflow/scripts/fix_gtfs.py +++ b/workflow/scripts/fix_gtfs.py @@ -1,86 +1,98 @@ import argparse import pandas -debug=0 +debug = 0 + def get_attributes(attstr): - att = dict() - attlist = attstr.strip().split(";") - if debug==1: print(attstr) - if debug==1: print(attlist) - for item in attlist: - x = item.strip() - if debug==1: print(x) - x = x.replace("\"","") - if debug==1: print(x) - x = x.split() - if debug==1: print(x) - if len(x)!=2: continue - key = x.pop(0) - key = key.replace(":","") - value = " ".join(x) - value = value.replace(":","_") - att[key] = value - return att + att = dict() + attlist = attstr.strip().split(";") + if debug == 1: + print(attstr) + if debug == 1: + print(attlist) + for item in attlist: + x = item.strip() + if debug == 1: + print(x) + x = x.replace('"', "") + if debug == 1: + print(x) + x = x.split() + if debug == 1: + print(x) + if len(x) != 2: + continue + key = x.pop(0) + key = key.replace(":", "") + value = " ".join(x) + value = value.replace(":", "_") + att[key] = value + return att + def get_attstr(att): - strlist=[] - for k,v in att.items(): - s = "%s \"%s\""%(k,v) - strlist.append(s) - attstr = "; ".join(strlist) - return attstr+";" + strlist = [] + for k, v in att.items(): + s = '%s "%s"' % (k, v) + strlist.append(s) + attstr = "; ".join(strlist) + return attstr + ";" -parser = argparse.ArgumentParser(description='fix gtf file') -parser.add_argument('--ingtf', dest='ingtf', type=str, required=True, - help='input gtf file') -parser.add_argument('--outgtf', dest='outgtf', type=str, required=True, - help='output gtf file') + +parser = argparse.ArgumentParser(description="fix gtf file") +parser.add_argument( + "--ingtf", 
dest="ingtf", type=str, required=True, help="input gtf file"
+)
+parser.add_argument(
+    "--outgtf", dest="outgtf", type=str, required=True, help="output gtf file"
+)
 args = parser.parse_args()
 gene_id_2_gene_name = dict()
-with open(args.ingtf, 'r') as ingtf:
-    for line in ingtf:
-        if line.startswith("#"): continue
-        line = line.strip()
-        line = line.split("\t")
-        if len(line) != 9:
-            print(line)
-            exit("ERROR ... line does not have 9 items!")
-        attributes = get_attributes(line[8])
-        if debug==1: print(line)
-        if debug==1: print(attributes)
-        if not attributes["gene_id"] in gene_id_2_gene_name:
-            if "gene_name" in attributes:
-                gene_id_2_gene_name[attributes["gene_id"]] = attributes["gene_name"]
-            else:
-                gene_id_2_gene_name["gene_id"] = attributes["gene_id"]
+with open(args.ingtf, "r") as ingtf:
+    for line in ingtf:
+        if line.startswith("#"):
+            continue
+        line = line.strip()
+        line = line.split("\t")
+        if len(line) != 9:
+            print(line)
+            exit("ERROR ... line does not have 9 items!")
+        attributes = get_attributes(line[8])
+        if debug == 1:
+            print(line)
+        if debug == 1:
+            print(attributes)
+        if not attributes["gene_id"] in gene_id_2_gene_name:
+            if "gene_name" in attributes:
+                gene_id_2_gene_name[attributes["gene_id"]] = attributes["gene_name"]
+            else:
+                gene_id_2_gene_name["gene_id"] = attributes["gene_id"]
-with open(args.ingtf,'r') as ingtf, open(args.outgtf,'w') as outgtf:
-    for line in ingtf:
-        if line.startswith("#"):
-            outgtf.write(line)
-            continue
-        line = line.strip()
-        line = line.split("\t")
-        attributes = get_attributes(line[8])
-        if not "gene_name" in attributes:
-            if not "gene_id" in attributes:
-                print(line)
-                print(attributes)
-                exit("ERROR in this line!")
-            if not attributes["gene_id"] in gene_id_2_gene_name:
-                print(line)
-                print(attributes)
-                print(attributes["gene_id"])
-                exit("ERROR2 in this line!")
-            attributes["gene_name"] = gene_id_2_gene_name[attributes["gene_id"]]
-            line[8] = get_attstr(attributes)
-        outgtf.write("\t".join(line)+"\n")
+with open("gene_id_2_gene_name.tsv", "w") as tmp:
+    for k, v in gene_id_2_gene_name.items():
+        tmp.write("%s\t%s\n" % (k, v))
-
+with open(args.ingtf, "r") as ingtf, open(args.outgtf, "w") as outgtf:
+    for line in ingtf:
+        if line.startswith("#"):
+            outgtf.write(line)
+            continue
+        line = line.strip()
+        line = line.split("\t")
+        attributes = get_attributes(line[8])
+        if not "gene_name" in attributes:
+            if not "gene_id" in attributes:
+                print(line)
+                print(attributes)
+                exit("ERROR in this line!")
+            if not attributes["gene_id"] in gene_id_2_gene_name:
+                print(line)
+                print(attributes)
+                print(attributes["gene_id"])
+                exit("ERROR2 in this line!")
+            attributes["gene_name"] = gene_id_2_gene_name[attributes["gene_id"]]
+            line[8] = get_attstr(attributes)
+        outgtf.write("\t".join(line) + "\n")
diff --git a/workflow/scripts/fix_refseq_gtf.py b/workflow/scripts/fix_refseq_gtf.py
index 130fe2a..4e06fca 100755
--- a/workflow/scripts/fix_refseq_gtf.py
+++ b/workflow/scripts/fix_refseq_gtf.py
@@ -3,266 +3,293 @@
 # Date: Aug, 2020
 
-import sys,copy,argparse
+import sys, copy, argparse
 
 parser = argparse.ArgumentParser()
-parser.add_argument('-i',dest='ingtf', required=True, type=str, help="Input RefSeq GTF ..downloaded from NCBI ftp server")
-parser.add_argument('-o',dest='outgtf', required=True, type=str, help="Modified Output RefSeq GTF")
+parser.add_argument(
+    "-i",
+    dest="ingtf",
+    required=True,
+    type=str,
+    help="Input RefSeq GTF ..downloaded from NCBI ftp server",
+)
+parser.add_argument(
+    "-o", dest="outgtf", required=True, type=str, help="Modified Output RefSeq GTF"
+)
 args = parser.parse_args()
 
+
 def get_gene_id(column9):
-    x=column9.strip().split()
-    for i,value in enumerate(x):
-        if value=="gene_id":
-            gene_id_index=i+1
+    x = column9.strip().split()
+    for i, value in enumerate(x):
+        if value == "gene_id":
+            gene_id_index = i + 1
             break
-    gene_id=x[gene_id_index]
+    gene_id = x[gene_id_index]
     return gene_id
 
+
 def get_gene_biotype(column9):
-    x=column9.strip().split()
-    found=0
-    for i,value in enumerate(x):
-        if value=="gene_type" or value=="gene_biotype":
-            gene_biotype_index=i+1
-            found=1
+    x = column9.strip().split()
+    found = 0
+    for i, value in enumerate(x):
+        if value == "gene_type" or value == "gene_biotype":
+            gene_biotype_index = i + 1
+            found = 1
             break
-    if found==0:
+    if found == 0:
         return '"unknown";'
-    gene_biotype=x[gene_biotype_index]
+    gene_biotype = x[gene_biotype_index]
     return gene_biotype
 
+
 def get_gene_name(column9):
-    x=column9.strip().split()
-    found=0
-    for i,value in enumerate(x):
-        if value=="gene" or value=="gene_name":
-            gene_index=i+1
-            found=1
+    x = column9.strip().split()
+    found = 0
+    for i, value in enumerate(x):
+        if value == "gene" or value == "gene_name":
+            gene_index = i + 1
+            found = 1
             break
-    if found==0:
+    if found == 0:
         return ""
-    gene_name=x[gene_index]
+    gene_name = x[gene_index]
     return gene_name
 
+
 def get_transcript_id(column9):
-    x=column9.strip().split()
-    found=0
-    for i,value in enumerate(x):
-        if value=="transcript_id":
-            transcript_id_index=i+1
-            found=1
+    x = column9.strip().split()
+    found = 0
+    for i, value in enumerate(x):
+        if value == "transcript_id":
+            transcript_id_index = i + 1
+            found = 1
             break
-    if found==0:
+    if found == 0:
         return '"transcript_id_unknown";'
-    transcript_id=x[transcript_id_index]
+    transcript_id = x[transcript_id_index]
     return transcript_id
 
-def fix_transcript_id(column9,g):
-    x=column9.strip().split()
-    found=0
-    for i,value in enumerate(x):
-        if value=="transcript_id":
-            transcript_id_index=i+1
-            found=1
+
+def fix_transcript_id(column9, g):
+    x = column9.strip().split()
+    found = 0
+    for i, value in enumerate(x):
+        if value == "transcript_id":
+            transcript_id_index = i + 1
+            found = 1
             break
-    x[transcript_id_index]=g
-    if found==0:
+    x[transcript_id_index] = g
+    if found == 0:
         x.append("transcript_id")
         x.append(g)
-    x=" ".join(x)
-    return x
+    x = " ".join(x)
+    return x
 
-def create_new_transript_id(g,i):
-    n=g.split('"')
-    n[-2]+="_transcript_"+str(i)
-    n='"'.join(n)
+
+def create_new_transript_id(g, i):
+    n = g.split('"')
+    n[-2] += "_transcript_" + str(i)
+    n = '"'.join(n)
     return n
 
+
 def are_exons_present(transcript_lines):
     for l in transcript_lines:
-        l_split=l.strip().split("\t")
-        if l_split[2]=="exon":
+        l_split = l.strip().split("\t")
+        if l_split[2] == "exon":
             return True
     else:
         return False
 
-#create genelist
-genelist=[]
-gene_coords=dict()
-all_gtflines=list(filter(lambda x:not x.startswith("#"),open(args.ingtf).readlines()))
-blank_gene_id_lines=[]
+
+# create genelist
+genelist = []
+gene_coords = dict()
+all_gtflines = list(
+    filter(lambda x: not x.startswith("#"), open(args.ingtf).readlines())
+)
+blank_gene_id_lines = []
 for f in all_gtflines:
-    its_a_gene=0
-    if f.strip().split("\t")[2]=="gene":
-        its_a_gene=1
-    gene_id=get_gene_id(f.strip().split("\t")[8])
-    if gene_id=='"";':
+    its_a_gene = 0
+    if f.strip().split("\t")[2] == "gene":
+        its_a_gene = 1
+    gene_id = get_gene_id(f.strip().split("\t")[8])
+    if gene_id == '"";':
         blank_gene_id_lines.append(f)
         continue
     genelist.append(gene_id)
-    if its_a_gene==1 and not gene_id in gene_coords:
-        gene_coords[gene_id]=(int(f.strip().split("\t")[3]),int(f.strip().split("\t")[4]))
-genelist=list(set(genelist))
+    if its_a_gene == 1 and not gene_id in gene_coords:
+        gene_coords[gene_id] = (
+            int(f.strip().split("\t")[3]),
+            int(f.strip().split("\t")[4]),
+        )
+genelist = list(set(genelist))
 # print(genelist)
 # print(len(blank_gene_id_lines))
 
-#get genes2transcripts ... this is only for verifying that every gene has only 1 transript... this is the assumption
-gene_id_2_transcript_ids=dict()
+# get genes2transcripts ... this is only for verifying that every gene has only 1 transript... this is the assumption
+gene_id_2_transcript_ids = dict()
 for g in genelist:
     if not g in gene_id_2_transcript_ids:
-        gene_id_2_transcript_ids[g]=list()
-    lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines))
-    non_gene_lines=list(filter(lambda x:x.split("\t")[2]!="gene",lines_with_gene_id))
+        gene_id_2_transcript_ids[g] = list()
+    lines_with_gene_id = list(filter(lambda x: g in x, all_gtflines))
+    non_gene_lines = list(
+        filter(lambda x: x.split("\t")[2] != "gene", lines_with_gene_id)
+    )
     for l in non_gene_lines:
-        t_id=get_transcript_id(l.strip().split("\t")[8])
-        if t_id!='"transcript_id_unknown";':
+        t_id = get_transcript_id(l.strip().split("\t")[8])
+        if t_id != '"transcript_id_unknown";':
             gene_id_2_transcript_ids[g].append(t_id)
-    gene_id_2_transcript_ids[g]=list(set(gene_id_2_transcript_ids[g]))
+    gene_id_2_transcript_ids[g] = list(set(gene_id_2_transcript_ids[g]))
 
-geneid2transcriptidfile=open(args.ingtf+".geneid2transcriptid",'w')
-for k,v in gene_id_2_transcript_ids.items():
-    geneid2transcriptidfile.write("%s\t%s\n"%(k,v))
+geneid2transcriptidfile = open(args.ingtf + ".geneid2transcriptid", "w")
+for k, v in gene_id_2_transcript_ids.items():
+    geneid2transcriptidfile.write("%s\t%s\n" % (k, v))
 geneid2transcriptidfile.close()
 
-#get genenames
-gene_id_2_gene_name=dict()
+# get genenames
+gene_id_2_gene_name = dict()
 for g in genelist:
     if not g in gene_id_2_gene_name:
-        gene_id_2_gene_name[g]=list()
-    lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines))
-    gene_line=list(filter(lambda x:x.split("\t")[2]=="gene",lines_with_gene_id))
+        gene_id_2_gene_name[g] = list()
+    lines_with_gene_id = list(filter(lambda x: g in x, all_gtflines))
+    gene_line = list(filter(lambda x: x.split("\t")[2] == "gene", lines_with_gene_id))
     # if len(gene_line)==0:
     # for l in lines_with_gene_id:
     # print(l,)
-    gene_line=gene_line[0]
-    gene_name=get_gene_name(gene_line.split("\t")[8])
-    if gene_name=="":
-        gene_name=g
-    gene_id_2_gene_name[g]=gene_name
+    gene_line = gene_line[0]
+    gene_name = get_gene_name(gene_line.split("\t")[8])
+    if gene_name == "":
+        gene_name = g
+    gene_id_2_gene_name[g] = gene_name
 # for k,v in gene_id_2_gene_name.items():
 # print(k,v)
-
-#get transcript coordinates
-gene_id_2_transcript_coordinates=dict()
+
+# get transcript coordinates
+gene_id_2_transcript_coordinates = dict()
 for g in genelist:
     # print("gene=",g)
     if not g in gene_id_2_transcript_coordinates:
-        gene_id_2_transcript_coordinates[g]=list()
-    if len(gene_id_2_transcript_ids[g])==1:
+        gene_id_2_transcript_coordinates[g] = list()
+    if len(gene_id_2_transcript_ids[g]) == 1:
         gene_id_2_transcript_coordinates[g].append(gene_coords[g])
     else:
-        lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines))
-        non_gene_lines=list(filter(lambda x:x.split("\t")[2]!="gene",lines_with_gene_id))
+        lines_with_gene_id = list(filter(lambda x: g in x, all_gtflines))
+        non_gene_lines = list(
+            filter(lambda x: x.split("\t")[2] != "gene", lines_with_gene_id)
+        )
         for t in gene_id_2_transcript_ids[g]:
             # print("transcript=",t)
-            transcript_lines=list(filter(lambda x:t in x,non_gene_lines))
-            coords=[]
+            transcript_lines = list(filter(lambda x: t in x, non_gene_lines))
+            coords = []
             for l in transcript_lines:
                 # print(l.strip())
-                l_split=l.split("\t")
+                l_split = l.split("\t")
                 coords.append(int(l_split[3]))
                 coords.append(int(l_split[4]))
             # print()
-            gene_id_2_transcript_coordinates[g].append((min(coords),max(coords)))
+            gene_id_2_transcript_coordinates[g].append((min(coords), max(coords)))
     # print(gene_id_2_transcript_coordinates[g])
 # for k,v in gene_id_2_transcript_coordinates.items():
-    # print(k,v)
+# print(k,v)
 # exit()
 
-#get gene biotype\
-gene_id_2_gene_biotype=dict()
+# get gene biotype\
+gene_id_2_gene_biotype = dict()
 for g in genelist:
-    lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines))
-    gene_line=list(filter(lambda x:x.split("\t")[2]=="gene",lines_with_gene_id))
-    gene_line=gene_line[0]
-    gene_biotype=get_gene_biotype(gene_line.split("\t")[8])
-    gene_id_2_gene_biotype[g]=gene_biotype
+    lines_with_gene_id = list(filter(lambda x: g in x, all_gtflines))
+    gene_line = list(filter(lambda x: x.split("\t")[2] == "gene", lines_with_gene_id))
+    gene_line = gene_line[0]
+    gene_biotype = get_gene_biotype(gene_line.split("\t")[8])
+    gene_id_2_gene_biotype[g] = gene_biotype
 # for k,v in gene_id_2_gene_biotype.items():
 # print(k,v)
 
-out=open(args.outgtf,'w')
+out = open(args.outgtf, "w")
 for g in genelist:
-    lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines))
-    gene_line=list(filter(lambda x:x.split("\t")[2]=="gene",lines_with_gene_id))
-    gene_line=gene_line[0]
-    gene_line=gene_line.split("\t")
-    others=gene_line.pop(-1)
-    gene_line_copy=copy.copy(gene_line)
+    lines_with_gene_id = list(filter(lambda x: g in x, all_gtflines))
+    gene_line = list(filter(lambda x: x.split("\t")[2] == "gene", lines_with_gene_id))
+    gene_line = gene_line[0]
+    gene_line = gene_line.split("\t")
+    others = gene_line.pop(-1)
+    gene_line_copy = copy.copy(gene_line)
     # other key value pairs to add in the gene_line(col9)
-    others_to_add=[]
+    others_to_add = []
     # print("others=",others)
     for o in others.strip().split("; "):
         # print("o=",o)
-        o2=o.split(" ")
+        o2 = o.split(" ")
         # print("o2=",o2)
-        key=o2[0]
-        value=o2[1:]
-        value=" ".join(value)
+        key = o2[0]
+        value = o2[1:]
+        value = " ".join(value)
         # print("key=",key)
         # print("value=",value)
-        if key in ["gene_id","gene","gene_name","gene_type","gene_biotype"]:
+        if key in ["gene_id", "gene", "gene_name", "gene_type", "gene_biotype"]:
             continue
         else:
             others_to_add.append(key)
             if not ";" in value:
-                others_to_add.append(value+";")
+                others_to_add.append(value + ";")
             else:
                 others_to_add.append(value)
-    col9=[]
+    col9 = []
     col9.append("gene_id")
     col9.append(g)
     col9.append("gene_name")
     col9.append(gene_id_2_gene_name[g])
     col9.append("gene_biotype")
     col9.append(gene_id_2_gene_biotype[g])
-    col9plus=copy.copy(col9)
+    col9plus = copy.copy(col9)
     col9plus.extend(others_to_add)
-    gene_col9=" ".join(col9plus)
+    gene_col9 = " ".join(col9plus)
     gene_line.append(gene_col9)
-    gene_line="\t".join(gene_line)
-    out.write("%s\n"%(gene_line))
+    gene_line = "\t".join(gene_line)
+    out.write("%s\n" % (gene_line))
 
-    non_gene_lines=list(filter(lambda x:x.split("\t")[2]!="gene",lines_with_gene_id))
-    for i,t in enumerate(gene_id_2_transcript_ids[g]):
-        transcript_line=copy.copy(gene_line_copy)
-        transcript_line[2]="transcript"
-        transcript_line[3]=str(gene_id_2_transcript_coordinates[g][i][0])
-        transcript_line[4]=str(gene_id_2_transcript_coordinates[g][i][1])
-        new_trascript_id=create_new_transript_id(g,i+1)
-        transcript_col9=copy.copy(col9)
+    non_gene_lines = list(
+        filter(lambda x: x.split("\t")[2] != "gene", lines_with_gene_id)
+    )
+    for i, t in enumerate(gene_id_2_transcript_ids[g]):
+        transcript_line = copy.copy(gene_line_copy)
+        transcript_line[2] = "transcript"
+        transcript_line[3] = str(gene_id_2_transcript_coordinates[g][i][0])
+        transcript_line[4] = str(gene_id_2_transcript_coordinates[g][i][1])
+        new_trascript_id = create_new_transript_id(g, i + 1)
+        transcript_col9 = copy.copy(col9)
         transcript_col9.append("transcript_id")
         transcript_col9.append(new_trascript_id)
         transcript_col9.append("transcript_name")
        transcript_col9.append(new_trascript_id)
         transcript_col9.append("transcript_type")
         transcript_col9.append(gene_id_2_gene_biotype[g])
-        transcript_col9=" ".join(transcript_col9)
+        transcript_col9 = " ".join(transcript_col9)
         transcript_line.append(transcript_col9)
-        transcript_line="\t".join(transcript_line)
-        out.write("%s\n"%(transcript_line))
+        transcript_line = "\t".join(transcript_line)
+        out.write("%s\n" % (transcript_line))
 
-        transcript_lines=list(filter(lambda x:t in x,non_gene_lines))
-        have_exons=are_exons_present(transcript_lines)
+        transcript_lines = list(filter(lambda x: t in x, non_gene_lines))
+        have_exons = are_exons_present(transcript_lines)
         for l in transcript_lines:
             # print(l)
-            l=l.strip().split("\t")
-            tofix=l.pop(-1)
-            l.append(fix_transcript_id(tofix,new_trascript_id))
-            if l[2]=="CDS" and have_exons==False:
-                l2=copy.copy(l)
-                l2[7]="."
-                l2[2]="exon"
-                l2="\t".join(l2)
-                out.write("%s\n"%(l2))
-            l="\t".join(l)
-            out.write("%s\n"%(l))
+            l = l.strip().split("\t")
+            tofix = l.pop(-1)
+            l.append(fix_transcript_id(tofix, new_trascript_id))
+            if l[2] == "CDS" and have_exons == False:
+                l2 = copy.copy(l)
+                l2[7] = "."
+                l2[2] = "exon"
+                l2 = "\t".join(l2)
+                out.write("%s\n" % (l2))
+            l = "\t".join(l)
+            out.write("%s\n" % (l))
             # print(l)
 out.close()
 
-out=open(args.ingtf+".extralines",'w')
+out = open(args.ingtf + ".extralines", "w")
 for b in blank_gene_id_lines:
     out.write(b)
 out.close()
diff --git a/workflow/scripts/gather_cluster_stats.sh b/workflow/scripts/gather_cluster_stats.sh
index 49c326e..1b9cc98 100755
--- a/workflow/scripts/gather_cluster_stats.sh
+++ b/workflow/scripts/gather_cluster_stats.sh
@@ -64,4 +64,4 @@ echo -ne "##SubmitTime\tHumanSubmitTime\tJobID:JobState:JobName\tAllocNode:Alloc
 while read jid;do
 get_jobid_stats $jid
 done < ${snakemakelogfile}.jobids.lst |sort -k1,1n
-rm -f ${snakemakelogfile}.jobids.lst
\ No newline at end of file
+rm -f ${snakemakelogfile}.jobids.lst
diff --git a/workflow/scripts/get_index_rl.py b/workflow/scripts/get_index_rl.py
index 36265c1..1a30190 100755
--- a/workflow/scripts/get_index_rl.py
+++ b/workflow/scripts/get_index_rl.py
@@ -1,12 +1,13 @@
 import sys
 import gzip
 from itertools import islice
-with gzip.open(sys.argv[1],'r') as fin:
-    for line in islice(fin,1,2) :
-        r=len(line.strip())
-offset=2
-rls=[50,75,100,125,150]
-b=list(map(lambda x:x-int(r),rls))
-c=list(filter(lambda x:x<=(0+offset),b))
+with gzip.open(sys.argv[1], "r") as fin:
+    for line in islice(fin, 1, 2):
+        r = len(line.strip())
+
+offset = 2
+rls = [50, 75, 100, 125, 150]
+b = list(map(lambda x: x - int(r), rls))
+c = list(filter(lambda x: x <= (0 + offset), b))
 print(rls[b.index(max(c))])
diff --git a/workflow/scripts/junctions2readids.py b/workflow/scripts/junctions2readids.py
index c8053c3..192c592 100755
--- a/workflow/scripts/junctions2readids.py
+++ b/workflow/scripts/junctions2readids.py
@@ -39,35 +39,43 @@
 # e. site2
 # f. list of cigars comma-separated (soft-clips are converted to hard-clips)
 
+
 def split_text(s):
     for k, g in groupby(s, str.isalpha):
-        yield ''.join(g)
+        yield "".join(g)
+
 
 def split_cigar(c):
-    cigars=[]
-    if 'p' in c:
-        x=list(split_text(c))
-        cigars.append(''.join(x[:x.index('p')-1]).replace('S','H'))
-        cigars.append(''.join(x[x.index('p')+1:]).replace('S','H'))
-    else:
-        cigars.append(c.replace('S','H'))
-    return cigars
+    cigars = []
+    if "p" in c:
+        x = list(split_text(c))
+        cigars.append("".join(x[: x.index("p") - 1]).replace("S", "H"))
+        cigars.append("".join(x[x.index("p") + 1 :]).replace("S", "H"))
+    else:
+        cigars.append(c.replace("S", "H"))
+    return cigars
+
 
 def get_cigars(l):
-    cigars=[]
-    cigars.extend(split_cigar(l.split()[11]))
-    cigars.extend(split_cigar(l.split()[13]))
-    cigars=list(filter(lambda x:x!='',cigars))
-    return cigars
-
-parser = argparse.ArgumentParser(description="""
+    cigars = []
+    cigars.extend(split_cigar(l.split()[11]))
+    cigars.extend(split_cigar(l.split()[13]))
+    cigars = list(filter(lambda x: x != "", cigars))
+    return cigars
+
+
+parser = argparse.ArgumentParser(
+    description="""
 Extract readids,strand,site,cigar etc. of reads with spliced junction from chimeric junctions file generated using STAR.
-""")
-parser.add_argument('-j',dest='junctions',required=True,help='chimeric junctions file')
+"""
+)
+parser.add_argument(
+    "-j", dest="junctions", required=True, help="chimeric junctions file"
+)
 # parser.add_argument('-r',dest='readids',required=True,help='Output txt file with a readid per line')
 args = parser.parse_args()
 # ofile=open(args.readids,'w')
-with open(args.junctions, 'r') as junc_f:
+with open(args.junctions, "r") as junc_f:
     for line in junc_f:
         if "junction_type" in line:
             continue
@@ -75,9 +83,11 @@ def get_cigars(l):
         if flag < 0:  # junction type : -1=encompassing junction (between the mates)
             continue
         chr1, site1, strand1, chr2, site2, strand2 = line.split()[:6]
-        if chr1 != chr2 or strand1 != strand2: # D & A need to be on the same chrom and same strand
+        if (
+            chr1 != chr2 or strand1 != strand2
+        ):  # D & A need to be on the same chrom and same strand
             continue
-        if strand1 == '+':
+        if strand1 == "+":
             start = int(site2)
             end = int(site1) - 1
         else:
@@ -85,5 +95,7 @@ def get_cigars(l):
             end = int(site2) - 1
         if start > end:
             continue
-        readid=line.split()[9]
-        print("\t".join([readid,chr1,strand1,site1,site2,",".join(get_cigars(line))]))
+        readid = line.split()[9]
+        print(
+            "\t".join([readid, chr1, strand1, site1, site2, ",".join(get_cigars(line))])
+        )
diff --git a/workflow/scripts/make_star_index.sh b/workflow/scripts/make_star_index.sh
index f81cc53..e3e6b9b 100755
--- a/workflow/scripts/make_star_index.sh
+++ b/workflow/scripts/make_star_index.sh
@@ -3,4 +3,4 @@ STAR \
 --runThreadN 56 \
 --runMode genomeGenerate \
 --genomeDir ./STAR_index_no_GTF \
---genomeFastaFiles ./ref.fa
\ No newline at end of file
+--genomeFastaFiles ./ref.fa
diff --git a/workflow/scripts/merge_ReadsPerGene_counts.R b/workflow/scripts/merge_ReadsPerGene_counts.R
index 4eb352b..25fae3b 100755
--- a/workflow/scripts/merge_ReadsPerGene_counts.R
+++ b/workflow/scripts/merge_ReadsPerGene_counts.R
@@ -16,15 +16,15 @@ for (i in 1:length(files)){
   sname=unlist(strsplit(basename(files[i]),"_p2"))[1]
   datasets_unstranded[[sname]]=read_counts(files[i],sname,2)
   datasets_stranded[[sname]]=read_counts(files[i],sname,3)
-  datasets_revstranded[[sname]]=read_counts(files[i],sname,4) 
+  datasets_revstranded[[sname]]=read_counts(files[i],sname,4)
 }
 
-x=Reduce(function(d1, d2) merge(d1, d2, by = "Gene", all.x = TRUE, all.y = FALSE), 
+x=Reduce(function(d1, d2) merge(d1, d2, by = "Gene", all.x = TRUE, all.y = FALSE),
          datasets_unstranded)
-y=Reduce(function(d1, d2) merge(d1, d2, by = "Gene", all.x = TRUE, all.y = FALSE), 
+y=Reduce(function(d1, d2) merge(d1, d2, by = "Gene", all.x = TRUE, all.y = FALSE),
         datasets_stranded)
-z=Reduce(function(d1, d2) merge(d1, d2, by = "Gene", all.x = TRUE, all.y = FALSE), 
+z=Reduce(function(d1, d2) merge(d1, d2, by = "Gene", all.x = TRUE, all.y = FALSE),
         datasets_revstranded)
 
 write.table(x,file="unstranded_STAR_GeneCounts.tsv",quote = FALSE,row.names = FALSE,sep="\t")
diff --git a/workflow/scripts/merge_counts_tables_2_counts_matrix.py b/workflow/scripts/merge_counts_tables_2_counts_matrix.py
index b9bc314..312e6fc 100755
--- a/workflow/scripts/merge_counts_tables_2_counts_matrix.py
+++ b/workflow/scripts/merge_counts_tables_2_counts_matrix.py
@@ -9,35 +9,60 @@
 import os
 import numpy
 
-debug=False
+debug = False
 # no truncations during if debug: print pandas data frames
-pandas.set_option('display.max_rows', None)
-pandas.set_option('display.max_columns', None)
-pandas.set_option('display.width', None)
-pandas.set_option('display.max_colwidth', None)
-
-
-parser = argparse.ArgumentParser(description='Merge per sample counts tables to a single annotated counts matrix')
-parser.add_argument('--per_sample_tables', nargs='+', dest='ctables', type=argparse.FileType('r'), required=True,
-                    help='space separated list of input per-sample count tables')
-parser.add_argument('--lookup_table', dest='lookup', type=argparse.FileType('r'), required=True,
-                    help='annotation lookup table (host-only)')
-parser.add_argument('-o',dest='outfile',required=True,type=argparse.FileType('w'),help='merged countsmatrix')
+pandas.set_option("display.max_rows", None)
+pandas.set_option("display.max_columns", None)
+pandas.set_option("display.width", None)
+pandas.set_option("display.max_colwidth", None)
+
+
+parser = argparse.ArgumentParser(
+    description="Merge per sample counts tables to a single annotated counts matrix"
+)
+parser.add_argument(
+    "--per_sample_tables",
+    nargs="+",
+    dest="ctables",
+    type=argparse.FileType("r"),
+    required=True,
+    help="space separated list of input per-sample count tables",
+)
+parser.add_argument(
+    "--lookup_table",
+    dest="lookup",
+    type=argparse.FileType("r"),
+    required=True,
+    help="annotation lookup table (host-only)",
+)
+parser.add_argument(
+    "-o",
+    dest="outfile",
+    required=True,
+    type=argparse.FileType("w"),
+    help="merged countsmatrix",
+)
 args = parser.parse_args()
 if debug:
     print(args)
 
+
 def prefix_counts(colname):
     # returns true if the col needs to be an int
-    if colname.endswith("_read_count"): return True
-    if colname.endswith("_ntools"): return True
-    if colname.endswith(".length"): return True
+    if colname.endswith("_read_count"):
+        return True
+    if colname.endswith("_ntools"):
+        return True
+    if colname.endswith(".length"):
+        return True
     return False
 
+
 def prefix_annotations(colname):
-    if colname.endswith("_annotation"): return True
+    if colname.endswith("_annotation"):
+        return True
     return False
@@ -48,134 +73,174 @@ def atof(text):
         retval = text
     return retval
 
+
 def natural_keys(text):
-    '''
+    """
     alist.sort(key=natural_keys) sorts in human order
     http://nedbatchelder.com/blog/200712/human_sorting.html
     (See Toothy's implementation in the comments)
     float regex comes from
     https://stackoverflow.com/a/12643073/190597
-    '''
-    return [ atof(c) for c in re.split(r'[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)', str(text)) ]
+    """
+    return [
+        atof(c) for c in re.split(r"[+-]?([0-9]+(?:[.][0-9]*)?|[.][0-9]+)", str(text))
+    ]
+
 
 def get_count_and_annotation_columns(df):
-    count_cols=['circRNA_id2']
-    annotation_cols=['circRNA_id2']
+    count_cols = ["circRNA_id2"]
+    annotation_cols = ["circRNA_id2"]
     for col in df.columns:
         if prefix_counts(col):
             count_cols.append(col)
         if prefix_annotations(col):
             annotation_cols.append(col)
-    return count_cols,annotation_cols
+    return count_cols, annotation_cols
+
 
 def readin_counts_file(f):
-    intable=pandas.read_csv(f,sep="\t",header=0)
-    intable['circRNA_id2']=intable['circRNA_id'].astype(str)+"##"+intable['strand'].astype(str)
-    intable.drop(['circRNA_id','strand'],axis=1,inplace=True)
-    count_cols,annotation_cols=get_count_and_annotation_columns(intable)
-    count_table=intable[count_cols]
-    annotation_table=intable[annotation_cols]
-    count_table.set_index(['circRNA_id2'],inplace=True)
-    annotation_table.set_index(['circRNA_id2'],inplace=True)
-    return(count_table,annotation_table)
+    intable = pandas.read_csv(f, sep="\t", header=0)
+    intable["circRNA_id2"] = (
+        intable["circRNA_id"].astype(str) + "##" + intable["strand"].astype(str)
+    )
+    intable.drop(["circRNA_id", "strand"], axis=1, inplace=True)
+    count_cols, annotation_cols = get_count_and_annotation_columns(intable)
+    count_table = intable[count_cols]
+    annotation_table = intable[annotation_cols]
+    count_table.set_index(["circRNA_id2"], inplace=True)
+    annotation_table.set_index(["circRNA_id2"], inplace=True)
+    return (count_table, annotation_table)
 
 
 # per_sample_files=list(Path(args.folder).rglob("*.circRNA_counts.txt"))
 # per_sample_files=list(filter(lambda x: os.stat(x).st_size !=0, per_sample_files))
 # per_sample_files.sort(key=natural_keys)
-per_sample_files=args.ctables
+per_sample_files = args.ctables
 
-if debug: print(per_sample_files)
+if debug:
+    print(per_sample_files)
 
-annotation_tables=list()
-f=per_sample_files[0]
-if debug: print("Currently reading file:"+str(f))
-ctable,atable=readin_counts_file(f)
+annotation_tables = list()
+f = per_sample_files[0]
+if debug:
+    print("Currently reading file:" + str(f))
+ctable, atable = readin_counts_file(f)
 annotation_tables.append(atable)
-count_matrix=ctable.copy()
+count_matrix = ctable.copy()
 # count_matrix.set_index(['circRNA_id2'],inplace=True)
-if debug: print("Head of this file looks like this:")
-if debug: print(count_matrix.head())
-for i in range(1,len(per_sample_files)):
-    f=per_sample_files[i]
-    if debug: print("Currently reading file:"+str(f))
-    ctable,atable=readin_counts_file(f)
+if debug:
+    print("Head of this file looks like this:")
+if debug:
+    print(count_matrix.head())
+for i in range(1, len(per_sample_files)):
+    f = per_sample_files[i]
+    if debug:
+        print("Currently reading file:" + str(f))
+    ctable, atable = readin_counts_file(f)
     # ctable.set_index(['circRNA_id2'],inplace=True)
-    if debug: print("Head of this file looks like this:")
-    if debug: print(ctable.head())
-    count_matrix=pandas.concat([count_matrix,ctable],axis=1,join="outer",sort=False)
-    count_matrix.fillna(0,inplace=True)
+    if debug:
+        print("Head of this file looks like this:")
+    if debug:
+        print(ctable.head())
+    count_matrix = pandas.concat(
+        [count_matrix, ctable], axis=1, join="outer", sort=False
+    )
+    count_matrix.fillna(0, inplace=True)
     annotation_tables.append(atable)
 
-for i,a in enumerate(annotation_tables):
-    if i==0:
-        amatrix=a.copy()
+for i, a in enumerate(annotation_tables):
+    if i == 0:
+        amatrix = a.copy()
     else:
-        oldi=set(list(amatrix.index))
-        newi=set(list(a.index))
-        toadd=newi-oldi
-        suba=a.loc[list(toadd)]
-        amatrix=pandas.concat([amatrix,suba])
+        oldi = set(list(amatrix.index))
+        newi = set(list(a.index))
+        toadd = newi - oldi
+        suba = a.loc[list(toadd)]
+        amatrix = pandas.concat([amatrix, suba])
 
-if debug: print(count_matrix.head())
-if debug: print(annotation_tables[0].head())
-if debug: print(count_matrix.shape)
-if debug: print(annotation_tables[0].shape)
-if debug: print(annotation_tables[1].shape)
-if debug: print(amatrix.shape)
+if debug:
+    print(count_matrix.head())
+if debug:
+    print(annotation_tables[0].head())
+if debug:
+    print(count_matrix.shape)
+if debug:
+    print(annotation_tables[0].shape)
+if debug:
+    print(annotation_tables[1].shape)
+if debug:
+    print(amatrix.shape)
 
-annotations=pandas.read_csv(args.lookup,sep="\t",header=0)
-annotations_cols=annotations.columns
+annotations = pandas.read_csv(args.lookup, sep="\t", header=0)
+annotations_cols = annotations.columns
 # annotations.set_index([annotations_cols[0]],inplace=True)
-annotations['circRNA_id2']=annotations[annotations_cols[0]].astype(str)+"##"+annotations['strand'].astype(str)
-annotations.set_index(annotations['circRNA_id2'],inplace=True)
+annotations["circRNA_id2"] = (
+    annotations[annotations_cols[0]].astype(str)
+    + "##"
+    + annotations["strand"].astype(str)
+)
+annotations.set_index(annotations["circRNA_id2"], inplace=True)
 # annotations.drop(['strand'],axis=1,inplace=True)
-if debug: print(annotations.head())
-if debug: print(annotations.shape)
-
+if debug:
+    print(annotations.head())
+if debug:
+    print(annotations.shape)
 
 # count_matrix=pandas.concat([count_matrix,annotations],axis=1,join="outer",sort=False)
-cmatrix = pandas.merge(amatrix,annotations,left_index=True,right_index=True,sort=False,how='left')
-cmatrix['circRNA_id2']=cmatrix.index
-count_matrix = pandas.merge(count_matrix,cmatrix,left_index=True,right_index=True,sort=False,how='left')
-count_matrix.replace('.',numpy.nan,inplace=True)
-count_matrix.fillna(0,inplace=True)
+cmatrix = pandas.merge(
+    amatrix, annotations, left_index=True, right_index=True, sort=False, how="left"
+)
+cmatrix["circRNA_id2"] = cmatrix.index
+count_matrix = pandas.merge(
+    count_matrix, cmatrix, left_index=True, right_index=True, sort=False, how="left"
+)
+count_matrix.replace(".", numpy.nan, inplace=True)
+count_matrix.fillna(0, inplace=True)
 # count_matrix.replace(re.compile('\.'),'0', regex=True,inplace=True)
-if debug: print(count_matrix.head())
-if debug: print(count_matrix.shape)
+if debug:
+    print(count_matrix.head())
+if debug:
+    print(count_matrix.shape)
 
-coltypes=dict()
+coltypes = dict()
 for col in count_matrix.columns:
-    coltypes[col]=str
+    coltypes[col] = str
     if prefix_counts(col):
         # count_matrix[[col]].replace(re.compile('\.'),'0', regex=True,inplace=True)
-        coltypes[col]=int
+        coltypes[col] = int
 count_matrix = count_matrix.astype(coltypes)
 
-count_matrix[['circRNA_coord', 'circRNA_strand']] = count_matrix['circRNA_id2'].str.split('##', expand=True)
-count_matrix.drop(['circRNA_id2'],axis=1,inplace=True)
-cols=list(count_matrix.columns)
-col1index=cols.index('circRNA_coord')
-col2index=cols.index('circRNA_strand')
-other_indices=list(set(range(len(cols)))-set([col1index,col2index]))
-new_order=['circRNA_coord','circRNA_strand']
+count_matrix[["circRNA_coord", "circRNA_strand"]] = count_matrix[
+    "circRNA_id2"
+].str.split("##", expand=True)
+count_matrix.drop(["circRNA_id2"], axis=1, inplace=True)
+cols = list(count_matrix.columns)
+col1index = cols.index("circRNA_coord")
+col2index = cols.index("circRNA_strand")
+other_indices = list(set(range(len(cols))) - set([col1index, col2index]))
+new_order = ["circRNA_coord", "circRNA_strand"]
 for i in other_indices:
     new_order.append(cols[i])
-count_matrix=count_matrix[new_order]
-if debug: print(count_matrix.head())
-
-df2 = count_matrix[list(filter(lambda x:x.endswith("_read_count"),list(count_matrix.columns)))]
-df2 = df2.astype('int')
-count_matrix['sum_of_all_counts'] = df2.sum(axis=1)
-df3 = count_matrix[list(filter(lambda x:x.endswith("_ntools"),list(count_matrix.columns)))]
-df3 = df3.astype('int')
-count_matrix['sum_of_all_ntools'] = df3.sum(axis=1)
-
-count_matrix = count_matrix.sort_values(by=['sum_of_all_ntools','sum_of_all_counts'], ascending=False)
-count_matrix.drop(['sum_of_all_ntools','sum_of_all_counts'],axis=1,inplace=True)
-count_matrix.to_csv(args.outfile,sep="\t",header=True,index=False)
-
-
+count_matrix = count_matrix[new_order]
+if debug:
+    print(count_matrix.head())
+
+df2 = count_matrix[
+    list(filter(lambda x: x.endswith("_read_count"), list(count_matrix.columns)))
+]
+df2 = df2.astype("int")
+count_matrix["sum_of_all_counts"] = df2.sum(axis=1)
+df3 = count_matrix[
+    list(filter(lambda x: x.endswith("_ntools"), list(count_matrix.columns)))
+]
+df3 = df3.astype("int")
+count_matrix["sum_of_all_ntools"] = df3.sum(axis=1)
+
+count_matrix = count_matrix.sort_values(
+    by=["sum_of_all_ntools", "sum_of_all_counts"], ascending=False
+)
+count_matrix.drop(["sum_of_all_ntools", "sum_of_all_counts"], axis=1, inplace=True)
+count_matrix.to_csv(args.outfile, sep="\t", header=True, index=False)
diff --git a/workflow/scripts/reformat_hg38_2_hg19.py b/workflow/scripts/reformat_hg38_2_hg19.py
index 14b310a..67ba897 100755
--- a/workflow/scripts/reformat_hg38_2_hg19.py
+++ b/workflow/scripts/reformat_hg38_2_hg19.py
@@ -1,53 +1,53 @@
-f=open("hg19_hg38_annotated_lookup.txt")
-hg38_2_hg19=dict()
+f = open("hg19_hg38_annotated_lookup.txt")
+hg38_2_hg19 = dict()
 for l in f.readlines():
-    l=l.strip().split("\t")
-    hg19ID=l[0]
-    hg38ID=l[1]
-    strand=l[2]
-    circRNA_ID=l[3]
-    genomic_length=l[4]
-    spliced_seq_length=l[5]
-    samples=l[6].split(",")
-    repeats=l[7]
-    annotation=l[8].split(",")
-    best_transcript=l[9]
-    gene_symbol=l[10]
-    circRNA_study=l[11].split(",")
-    if not hg38ID in hg38_2_hg19:
-        hg38_2_hg19[hg38ID]=dict()
-        hg38_2_hg19[hg38ID]['hg19ID']=list()
-        hg38_2_hg19[hg38ID]['circRNA_ID']=list()
-        hg38_2_hg19[hg38ID]['samples']=list()
-        hg38_2_hg19[hg38ID]['annotation']=list()
-        hg38_2_hg19[hg38ID]['circRNA_study']=list()
-    hg38_2_hg19[hg38ID]['hg19ID'].append(hg19ID)
-    hg38_2_hg19[hg38ID]['strand']=strand
-    hg38_2_hg19[hg38ID]['circRNA_ID'].append(circRNA_ID)
-    hg38_2_hg19[hg38ID]['genomic_length']=genomic_length
-    hg38_2_hg19[hg38ID]['spliced_seq_length']=spliced_seq_length
-    hg38_2_hg19[hg38ID]['samples'].extend(samples)
-    hg38_2_hg19[hg38ID]['repeats']=repeats
-    hg38_2_hg19[hg38ID]['annotation'].extend(annotation)
-    hg38_2_hg19[hg38ID]['best_transcript']=best_transcript
-    hg38_2_hg19[hg38ID]['gene_symbol']=gene_symbol
-    hg38_2_hg19[hg38ID]['circRNA_study'].extend(circRNA_study)
-
-#print("\t".join(["hg38ID","hg19ID","strand","circRNA.ID","genomic.length","spliced.seq.length","samples","repeats","annotation","best.transcript","gene.symbol","circRNA.study"]),)
-for k,v in hg38_2_hg19.items():
-    l=list()
-    l.append(k)
-    l.append(",".join(set(v['hg19ID'])))
-    l.append(v['strand'])
-    l.append(",".join(set(v['circRNA_ID'])))
-    l.append(v['genomic_length'])
-    l.append(v['spliced_seq_length'])
-    l.append(",".join(set(v['samples'])))
-    l.append(v['repeats'])
-    l.append(",".join(set(v['annotation'])))
-    l.append(v['best_transcript'])
-    l.append(v['gene_symbol'])
-    l.append(",".join(set(v['circRNA_study'])))
-    print("\t".join(l),)
-
+    l = l.strip().split("\t")
+    hg19ID = l[0]
+    hg38ID = l[1]
+    strand = l[2]
+    circRNA_ID = l[3]
+    genomic_length = l[4]
+    spliced_seq_length = l[5]
+    samples = l[6].split(",")
+    repeats = l[7]
+    annotation = l[8].split(",")
+    best_transcript = l[9]
+    gene_symbol = l[10]
+    circRNA_study = l[11].split(",")
+    if not hg38ID in hg38_2_hg19:
+        hg38_2_hg19[hg38ID] = dict()
+        hg38_2_hg19[hg38ID]["hg19ID"] = list()
+        hg38_2_hg19[hg38ID]["circRNA_ID"] = list()
+        hg38_2_hg19[hg38ID]["samples"] = list()
+        hg38_2_hg19[hg38ID]["annotation"] = list()
+        hg38_2_hg19[hg38ID]["circRNA_study"] = list()
+    hg38_2_hg19[hg38ID]["hg19ID"].append(hg19ID)
+    hg38_2_hg19[hg38ID]["strand"] = strand
+    hg38_2_hg19[hg38ID]["circRNA_ID"].append(circRNA_ID)
+    hg38_2_hg19[hg38ID]["genomic_length"] = genomic_length
+    hg38_2_hg19[hg38ID]["spliced_seq_length"] = spliced_seq_length
+    hg38_2_hg19[hg38ID]["samples"].extend(samples)
+    hg38_2_hg19[hg38ID]["repeats"] = repeats
+    hg38_2_hg19[hg38ID]["annotation"].extend(annotation)
+    hg38_2_hg19[hg38ID]["best_transcript"] = best_transcript
+    hg38_2_hg19[hg38ID]["gene_symbol"] = gene_symbol
+    hg38_2_hg19[hg38ID]["circRNA_study"].extend(circRNA_study)
+# print("\t".join(["hg38ID","hg19ID","strand","circRNA.ID","genomic.length","spliced.seq.length","samples","repeats","annotation","best.transcript","gene.symbol","circRNA.study"]),)
+for k, v in hg38_2_hg19.items():
+    l = list()
+    l.append(k)
+    l.append(",".join(set(v["hg19ID"])))
+    l.append(v["strand"])
+    l.append(",".join(set(v["circRNA_ID"])))
+    l.append(v["genomic_length"])
+    l.append(v["spliced_seq_length"])
+    l.append(",".join(set(v["samples"])))
+    l.append(v["repeats"])
+    l.append(",".join(set(v["annotation"])))
+    l.append(v["best_transcript"])
+    l.append(v["gene_symbol"])
+    l.append(",".join(set(v["circRNA_study"])))
+    print(
+        "\t".join(l),
+    )
diff --git a/workflow/scripts/transcript2gene.py b/workflow/scripts/transcript2gene.py
index 9e5b963..95c29aa 100755
--- a/workflow/scripts/transcript2gene.py
+++ b/workflow/scripts/transcript2gene.py
@@ -1,20 +1,23 @@
 import sys
-def get_id(s,whatid):
-    s=s.split()
-    for i,j in enumerate(s):
-        if j==whatid:
-            r=s[i+1]
-            r=r.replace('"','')
-            r=r.replace(';','')
-            return r
-gtffile=sys.argv[1]
+
+
+def get_id(s, whatid):
+    s = s.split()
+    for i, j in enumerate(s):
+        if j == whatid:
+            r = s[i + 1]
+            r = r.replace('"', "")
+            r = r.replace(";", "")
+            return r
+
+
+gtffile = sys.argv[1]
 for i in open(gtffile).readlines():
-    if i.startswith("#"):
-        continue
-    i=i.strip().split("\t")
-    if i[2]!="transcript":
-        continue
-    gid=get_id(i[8],"gene_id")
-    tid=get_id(i[8],"transcript_id")
-    print("%s\t%s"%(tid,gid))
-
+    if i.startswith("#"):
+        continue
+    i = i.strip().split("\t")
+    if i[2] != "transcript":
+        continue
+    gid = get_id(i[8], "gene_id")
+    tid = get_id(i[8], "transcript_id")
+    print("%s\t%s" % (tid, gid))
diff --git a/workflow/scripts/validate_BSJ_reads_and_split_BSJ_bam_by_strand.py b/workflow/scripts/validate_BSJ_reads_and_split_BSJ_bam_by_strand.py
index 8970797..1e0a3db 100755
--- a/workflow/scripts/validate_BSJ_reads_and_split_BSJ_bam_by_strand.py
+++ 
b/workflow/scripts/validate_BSJ_reads_and_split_BSJ_bam_by_strand.py @@ -10,10 +10,10 @@ 3. BSJ bed file with score(number of reads supporting the BSJ) and strand information Logic (for PE reads): Each BSJ is represented by a 3 alignments in the output BAM file. -Alignment 1 is complete alignment of one of the reads in pair and -Alignments 2 and 3 are split alignment of the mate at two distinct loci on the same reference +Alignment 1 is complete alignment of one of the reads in pair and +Alignments 2 and 3 are split alignment of the mate at two distinct loci on the same reference chromosome. -These alignments are grouped together by the "HI" tags in SAM file. For example, all 3 +These alignments are grouped together by the "HI" tags in SAM file. For example, all 3 alignments for the same BSJ will have the same "HI" value... something like "HI:i:1". BSJ alignment sam bitflag combinations can have 8 different possibilities, 4 from sense strand and 4 from anti-sense strand: @@ -29,12 +29,12 @@ # |<------------------BSJ----------------->| 3. 83,163,2209 4. 339,419,2465 -# R1 -# <------ +# R1 +# <------ # 5'--|------------------------------------------|---3' # 3'--|------------------------------------------|---5' # |------> ------>| -# | R2.2 R2.1 | +# | R2.2 R2.1 | # | | # |<-----------------BSJ-------------------->| 5. 99,147,2193 @@ -49,346 +49,378 @@ # |<------------------BSJ----------------->| 7. 99,147,2145 8. 355, 403, 2401 -# R2 -# <------ +# R2 +# <------ # 5'--|------------------------------------------|---3' # 3'--|------------------------------------------|---5' # |------> ------>| -# | R1.2 R1.1 | +# | R1.2 R1.1 | # | | # |<-----------------BSJ-------------------->| """ class BSJ: - def __init__(self): - self.chrom="" - self.start="" - self.end="" - self.score=0 - self.name="." 
- self.strand="U" - self.bitids=list() - self.rids=list() - - def plusone(self): - self.score+=1 - - def set_strand(self,strand): - self.strand=strand - - def set_chrom(self,chrom): - self.chrom=chrom - - def set_start(self,start): - self.start=start - - def set_end(self,end): - self.end=end - - def append_bitid(self,bitid): - self.bitids.append(bitid) - - def append_rid(self,rid): - self.rids.append(rid) - - def write_out_BSJ(self,outbed): - t=[] - t.append(self.chrom) - t.append(str(self.start)) - t.append(str(self.end)) - t.append(self.name) - t.append(str(self.score)) - t.append(self.strand) - t.append(",".join(self.bitids)) - t.append(",".join(self.rids)) - outbed.write("\t".join(t)+"\n") - + def __init__(self): + self.chrom = "" + self.start = "" + self.end = "" + self.score = 0 + self.name = "." + self.strand = "U" + self.bitids = list() + self.rids = list() + + def plusone(self): + self.score += 1 + + def set_strand(self, strand): + self.strand = strand + + def set_chrom(self, chrom): + self.chrom = chrom + + def set_start(self, start): + self.start = start + + def set_end(self, end): + self.end = end + + def append_bitid(self, bitid): + self.bitids.append(bitid) + + def append_rid(self, rid): + self.rids.append(rid) + + def write_out_BSJ(self, outbed): + t = [] + t.append(self.chrom) + t.append(str(self.start)) + t.append(str(self.end)) + t.append(self.name) + t.append(str(self.score)) + t.append(self.strand) + t.append(",".join(self.bitids)) + t.append(",".join(self.rids)) + outbed.write("\t".join(t) + "\n") + + class Readinfo: - def __init__(self,readid,rname): - self.readid=readid - self.refname=rname - self.alignments=list() - self.bitflags=list() - self.bitid="" - self.strand="." 
- self.start=-1 - self.end=-1 - self.refcoordinates=dict() - self.isread1=dict() - self.isreverse=dict() - self.issecondary=dict() - self.issupplementary=dict() - - def __str__(self): - s = "readid: %s"%(self.readid) - s = "%s\tbitflags: %s"%(s,self.bitflags) - s = "%s\tbitid: %s"%(s,self.bitid) - return s - - def set_refcoordinates(self,bitflag,refpos): - self.refcoordinates[bitflag]=refpos - - def set_read1_reverse_secondary_supplementary(self,bitflag,read): - if read.is_read1: - self.isread1[bitflag]="Y" - else: - self.isread1[bitflag]="N" - if read.is_reverse: - self.isreverse[bitflag]="Y" - else: - self.isreverse[bitflag]="N" - if read.is_secondary: - self.issecondary[bitflag]="Y" - else: - self.issecondary[bitflag]="N" - if read.is_supplementary: - self.issupplementary[bitflag]="Y" - else: - self.issupplementary[bitflag]="N" - - def append_alignment(self,read): - self.alignments.append(read) - - def append_bitflag(self,bf): - self.bitflags.append(bf) - - # def extend_ref_positions(self,refcoords): - # self.refcoordinates.extend(refcoords) - - def generate_bitid(self): - bitlist=sorted(self.bitflags) - self.bitid="##".join(list(map(lambda x:str(x),bitlist))) -# self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) - - def get_strand(self): - if self.bitid=="83##163##2129": - self.strand="+" - elif self.bitid=="339##419##2385": - self.strand="+" - elif self.bitid=="83##163##2209": - self.strand="+" - elif self.bitid=="339##419##2465": - self.strand="+" - elif self.bitid=="99##147##2193": - self.strand="-" - elif self.bitid=="355##403##2449": - self.strand="-" - elif self.bitid=="99##147##2145": - self.strand="-" - elif self.bitid=="355##403##2401": - self.strand="-" - elif self.bitid=="16##2064": - self.strand="+" - elif self.bitid=="272##2320": - self.strand="+" - elif self.bitid=="0##2048": - self.strand="-" - elif self.bitid=="256##2304": - self.strand="-" - elif self.bitid=="153##2201": - self.strand="-" - else: - self.strand="U" - - def 
validate_read(self): - """ - Checks if read is truly a BSJ originitor. - * Defines left, right and middle alignments - * Left and right alignments should not overlap - * Middle alignment should be between left and right alignments - """ - if len(self.bitid.split("##"))==3: - left=-1 - right=-1 - middle=-1 - if self.bitid=="83##163##2129": - left=2129 - right=83 - middle=163 - if self.bitid=="339##419##2385": - left=2385 - right=339 - middle=419 - if self.bitid=="83##163##2209": - left=163 - right=2209 - middle=83 - if self.bitid=="339##419##2465": - left=419 - right=2465 - middle=339 - if self.bitid=="99##147##2145": - left=99 - right=2145 - middle=147 - if self.bitid=="355##403##2401": - left=355 - right=2401 - middle=403 - if self.bitid=="99##147##2193": - left=2193 - right=147 - middle=99 - if self.bitid=="355##403##2449": - left=2449 - right=403 - middle=355 - print(left,right,middle) - if left == -1 or right == -1 or middle == -1: - return False - if not (self.refcoordinates[left][-1] < self.refcoordinates[right][0] and self.refcoordinates[middle][-1] <= self.refcoordinates[right][-1] and self.refcoordinates[middle][0] >= self.refcoordinates[left][0]): - print("HERE") - print(self.refcoordinates[left][-1]) - print(self.refcoordinates[right][0]) - print(self.refcoordinates[middle][-1]) - print(self.refcoordinates[right][-1]) - print(self.refcoordinates[middle][0]) - print(self.refcoordinates[left][0]) - print(self.refcoordinates[left][-1] < self.refcoordinates[right][0]) - print(self.refcoordinates[middle][-1] <= self.refcoordinates[right][-1]) - print(self.refcoordinates[middle][0] >= self.refcoordinates[left][0]) - return False - else: - return True - else: - return False - # print("NOT_THREE",self.readid,self.bitid,self.refcoordinates.keys()) - # if not (self.refcoordinates[163][-1] < self.refcoordinates[2209][0] and self.refcoordinates[83][-1] <= self.refcoordinates[2209][-1] and self.refcoordinates[83][0] >= self.refcoordinates[163][0]): - # 
print(self.readid,self.bitid) - # print(self.refcoordinates.keys()) - # print(self.refcoordinates[163][0],self.refcoordinates[163][-1],"\t",self.refcoordinates[2209][0],self.refcoordinates[2209][-1]) - # print(self.refcoordinates[83][0],self.refcoordinates[83][-1]) - - - def get_start_end(self): - refcoordinates=self.refcoordinates - isread1=self.isread1 - if len(self.isread1)!=3: - refcoords=[] - for i in refcoordinates.keys(): - refcoords.extend(refcoordinates[i]) - else: - l=[] - for i in isread1.keys(): - l.append(isread1[i]) - Ycount=l.count("Y") - Ncount=l.count("N") - if Ycount>Ncount: - useread1="Y" - else: - useread1="N" - refcoords=[] - for i in refcoordinates.keys(): - if isread1[i]==useread1: - refcoords.extend(refcoordinates[i]) - refcoords=sorted(refcoords) - self.start=str(refcoords[0]) - self.end=str(int(refcoords[-1])+1) - - def get_bsjid(self): - t=[] - t.append(self.refname) - t.append(self.start) - t.append(self.end) - t.append(self.strand) - return "##".join(t) - - def write_out_reads(self,outbam): - for r in self.alignments: - outbam.write(r) - - + def __init__(self, readid, rname): + self.readid = readid + self.refname = rname + self.alignments = list() + self.bitflags = list() + self.bitid = "" + self.strand = "." 
+ self.start = -1 + self.end = -1 + self.refcoordinates = dict() + self.isread1 = dict() + self.isreverse = dict() + self.issecondary = dict() + self.issupplementary = dict() + + def __str__(self): + s = "readid: %s" % (self.readid) + s = "%s\tbitflags: %s" % (s, self.bitflags) + s = "%s\tbitid: %s" % (s, self.bitid) + return s + + def set_refcoordinates(self, bitflag, refpos): + self.refcoordinates[bitflag] = refpos + + def set_read1_reverse_secondary_supplementary(self, bitflag, read): + if read.is_read1: + self.isread1[bitflag] = "Y" + else: + self.isread1[bitflag] = "N" + if read.is_reverse: + self.isreverse[bitflag] = "Y" + else: + self.isreverse[bitflag] = "N" + if read.is_secondary: + self.issecondary[bitflag] = "Y" + else: + self.issecondary[bitflag] = "N" + if read.is_supplementary: + self.issupplementary[bitflag] = "Y" + else: + self.issupplementary[bitflag] = "N" + + def append_alignment(self, read): + self.alignments.append(read) + + def append_bitflag(self, bf): + self.bitflags.append(bf) + + # def extend_ref_positions(self,refcoords): + # self.refcoordinates.extend(refcoords) + + def generate_bitid(self): + bitlist = sorted(self.bitflags) + self.bitid = "##".join(list(map(lambda x: str(x), bitlist))) + + # self.bitid=str(bitlist[0])+"##"+str(bitlist[1])+"##"+str(bitlist[2]) + + def get_strand(self): + if self.bitid == "83##163##2129": + self.strand = "+" + elif self.bitid == "339##419##2385": + self.strand = "+" + elif self.bitid == "83##163##2209": + self.strand = "+" + elif self.bitid == "339##419##2465": + self.strand = "+" + elif self.bitid == "99##147##2193": + self.strand = "-" + elif self.bitid == "355##403##2449": + self.strand = "-" + elif self.bitid == "99##147##2145": + self.strand = "-" + elif self.bitid == "355##403##2401": + self.strand = "-" + elif self.bitid == "16##2064": + self.strand = "+" + elif self.bitid == "272##2320": + self.strand = "+" + elif self.bitid == "0##2048": + self.strand = "-" + elif self.bitid == "256##2304": + 
self.strand = "-"
+        elif self.bitid == "153##2201":
+            self.strand = "-"
+        else:
+            self.strand = "U"
+
+    def validate_read(self):
+        """
+        Checks if read is truly a BSJ originator.
+        * Defines left, right and middle alignments
+        * Left and right alignments should not overlap
+        * Middle alignment should be between left and right alignments
+        """
+        if len(self.bitid.split("##")) == 3:
+            left = -1
+            right = -1
+            middle = -1
+            if self.bitid == "83##163##2129":
+                left = 2129
+                right = 83
+                middle = 163
+            if self.bitid == "339##419##2385":
+                left = 2385
+                right = 339
+                middle = 419
+            if self.bitid == "83##163##2209":
+                left = 163
+                right = 2209
+                middle = 83
+            if self.bitid == "339##419##2465":
+                left = 419
+                right = 2465
+                middle = 339
+            if self.bitid == "99##147##2145":
+                left = 99
+                right = 2145
+                middle = 147
+            if self.bitid == "355##403##2401":
+                left = 355
+                right = 2401
+                middle = 403
+            if self.bitid == "99##147##2193":
+                left = 2193
+                right = 147
+                middle = 99
+            if self.bitid == "355##403##2449":
+                left = 2449
+                right = 403
+                middle = 355
+            print(left, right, middle)
+            if left == -1 or right == -1 or middle == -1:
+                return False
+            if not (
+                self.refcoordinates[left][-1] < self.refcoordinates[right][0]
+                and self.refcoordinates[middle][-1] <= self.refcoordinates[right][-1]
+                and self.refcoordinates[middle][0] >= self.refcoordinates[left][0]
+            ):
+                print("HERE")
+                print(self.refcoordinates[left][-1])
+                print(self.refcoordinates[right][0])
+                print(self.refcoordinates[middle][-1])
+                print(self.refcoordinates[right][-1])
+                print(self.refcoordinates[middle][0])
+                print(self.refcoordinates[left][0])
+                print(self.refcoordinates[left][-1] < self.refcoordinates[right][0])
+                print(self.refcoordinates[middle][-1] <= self.refcoordinates[right][-1])
+                print(self.refcoordinates[middle][0] >= self.refcoordinates[left][0])
+                return False
+            else:
+                return True
+        else:
+            return False
+        # print("NOT_THREE",self.readid,self.bitid,self.refcoordinates.keys())
+        # if not 
(self.refcoordinates[163][-1] < self.refcoordinates[2209][0] and self.refcoordinates[83][-1] <= self.refcoordinates[2209][-1] and self.refcoordinates[83][0] >= self.refcoordinates[163][0]): + # print(self.readid,self.bitid) + # print(self.refcoordinates.keys()) + # print(self.refcoordinates[163][0],self.refcoordinates[163][-1],"\t",self.refcoordinates[2209][0],self.refcoordinates[2209][-1]) + # print(self.refcoordinates[83][0],self.refcoordinates[83][-1]) + + def get_start_end(self): + refcoordinates = self.refcoordinates + isread1 = self.isread1 + if len(self.isread1) != 3: + refcoords = [] + for i in refcoordinates.keys(): + refcoords.extend(refcoordinates[i]) + else: + l = [] + for i in isread1.keys(): + l.append(isread1[i]) + Ycount = l.count("Y") + Ncount = l.count("N") + if Ycount > Ncount: + useread1 = "Y" + else: + useread1 = "N" + refcoords = [] + for i in refcoordinates.keys(): + if isread1[i] == useread1: + refcoords.extend(refcoordinates[i]) + refcoords = sorted(refcoords) + self.start = str(refcoords[0]) + self.end = str(int(refcoords[-1]) + 1) + + def get_bsjid(self): + t = [] + t.append(self.refname) + t.append(self.start) + t.append(self.end) + t.append(self.strand) + return "##".join(t) + + def write_out_reads(self, outbam): + for r in self.alignments: + outbam.write(r) + + def get_uniq_readid(r): - rname=r.query_name - hi=r.get_tag("HI") - rid=rname+"##"+str(hi) - return rid + rname = r.query_name + hi = r.get_tag("HI") + rid = rname + "##" + str(hi) + return rid -def get_bitflag(r): - bitflag=str(r).split("\t")[1] - return int(bitflag) +def get_bitflag(r): + bitflag = str(r).split("\t")[1] + return int(bitflag) def main(): - debug = True - parser = argparse.ArgumentParser() - parser.add_argument("-i","--inbam",dest="inbam",required=True,type=argparse.FileType('r'), - help="Input bam file") - parser.add_argument("-p","--plusbam",dest="plusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - 
parser.add_argument("-m","--minusbam",dest="minusbam",required=True,type=argparse.FileType('w'), - help="Output plus strand bam file") - parser.add_argument("-b","--bed",dest="bed",required=True,type=argparse.FileType('w', encoding='UTF-8'), - help="Output BSJ bed file (with strand info)") - args = parser.parse_args() - samfile = pysam.AlignmentFile(args.inbam, "rb") - plusfile = pysam.AlignmentFile(args.plusbam, "wb", template=samfile) - minusfile = pysam.AlignmentFile(args.minusbam, "wb", template=samfile) -# bsjfile = open(args.bed,"w") - bigdict=dict() - for read in samfile.fetch(): - if read.reference_id != read.next_reference_id: continue - rid=get_uniq_readid(read) - if debug:print(rid) - if not rid in bigdict: - bigdict[rid]=Readinfo(rid,read.reference_name) - bigdict[rid].append_alignment(read) - bitflag=get_bitflag(read) - if debug:print(bitflag) - bigdict[rid].append_bitflag(bitflag) - # bigdict[rid].extend_ref_positions(read.get_reference_positions(full_length=False)) - refpos=list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True))) - bigdict[rid].set_refcoordinates(bitflag,refpos) - bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag,read) - # bigdict[rid].extend_ref_positions(list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True)))) - if debug:print(bigdict[rid]) - bsjdict=dict() - bitid_counts=dict() - for rid in bigdict.keys(): - bigdict[rid].generate_bitid() - if debug:print(bigdict[rid]) - bigdict[rid].get_strand() - if not bigdict[rid].validate_read(): - continue - if debug:print("HERE",bigdict[rid]) - bigdict[rid].get_start_end() - # print(bigdict[rid]) - if bigdict[rid].strand=="+": - bigdict[rid].write_out_reads(plusfile) - if bigdict[rid].strand=="-": - bigdict[rid].write_out_reads(minusfile) - bsjid=bigdict[rid].get_bsjid() - if not bsjid in bsjdict: - bsjdict[bsjid]=BSJ() - bsjdict[bsjid].set_chrom(bigdict[rid].refname) - bsjdict[bsjid].set_start(bigdict[rid].start) - 
bsjdict[bsjid].set_end(bigdict[rid].end)
-            bsjdict[bsjid].set_strand(bigdict[rid].strand)
-        bsjdict[bsjid].plusone()
-        bsjdict[bsjid].append_bitid(bigdict[rid].bitid)
-        if not bigdict[rid].bitid in bitid_counts:
-            bitid_counts[bigdict[rid].bitid]=0
-        bitid_counts[bigdict[rid].bitid]+=1
-        bsjdict[bsjid].append_rid(rid)
-
-    for b in bitid_counts.keys():
-        print(b,bitid_counts[b])
-
-    for bsjid in bsjdict.keys():
-        bsjdict[bsjid].write_out_BSJ(args.bed)
-
-    plusfile.close()
-    minusfile.close()
-    samfile.close()
-    args.bed.close()
-
-
+    debug = True
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-i",
+        "--inbam",
+        dest="inbam",
+        required=True,
+        type=argparse.FileType("r"),
+        help="Input bam file",
+    )
+    parser.add_argument(
+        "-p",
+        "--plusbam",
+        dest="plusbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output plus strand bam file",
+    )
+    parser.add_argument(
+        "-m",
+        "--minusbam",
+        dest="minusbam",
+        required=True,
+        type=argparse.FileType("w"),
+        help="Output minus strand bam file",
+    )
+    parser.add_argument(
+        "-b",
+        "--bed",
+        dest="bed",
+        required=True,
+        type=argparse.FileType("w", encoding="UTF-8"),
+        help="Output BSJ bed file (with strand info)",
+    )
+    args = parser.parse_args()
+    samfile = pysam.AlignmentFile(args.inbam, "rb")
+    plusfile = pysam.AlignmentFile(args.plusbam, "wb", template=samfile)
+    minusfile = pysam.AlignmentFile(args.minusbam, "wb", template=samfile)
+    # bsjfile = open(args.bed,"w")
+    bigdict = dict()
+    for read in samfile.fetch():
+        if read.reference_id != read.next_reference_id:
+            continue
+        rid = get_uniq_readid(read)
+        if debug:
+            print(rid)
+        if not rid in bigdict:
+            bigdict[rid] = Readinfo(rid, read.reference_name)
+        bigdict[rid].append_alignment(read)
+        bitflag = get_bitflag(read)
+        if debug:
+            print(bitflag)
+        bigdict[rid].append_bitflag(bitflag)
+        # bigdict[rid].extend_ref_positions(read.get_reference_positions(full_length=False))
+        refpos = list(
+            filter(lambda x: x != None, 
read.get_reference_positions(full_length=True)) + ) + bigdict[rid].set_refcoordinates(bitflag, refpos) + bigdict[rid].set_read1_reverse_secondary_supplementary(bitflag, read) + # bigdict[rid].extend_ref_positions(list(filter(lambda x:x!=None,read.get_reference_positions(full_length=True)))) + if debug: + print(bigdict[rid]) + bsjdict = dict() + bitid_counts = dict() + for rid in bigdict.keys(): + bigdict[rid].generate_bitid() + if debug: + print(bigdict[rid]) + bigdict[rid].get_strand() + if not bigdict[rid].validate_read(): + continue + if debug: + print("HERE", bigdict[rid]) + bigdict[rid].get_start_end() + # print(bigdict[rid]) + if bigdict[rid].strand == "+": + bigdict[rid].write_out_reads(plusfile) + if bigdict[rid].strand == "-": + bigdict[rid].write_out_reads(minusfile) + bsjid = bigdict[rid].get_bsjid() + if not bsjid in bsjdict: + bsjdict[bsjid] = BSJ() + bsjdict[bsjid].set_chrom(bigdict[rid].refname) + bsjdict[bsjid].set_start(bigdict[rid].start) + bsjdict[bsjid].set_end(bigdict[rid].end) + bsjdict[bsjid].set_strand(bigdict[rid].strand) + bsjdict[bsjid].plusone() + bsjdict[bsjid].append_bitid(bigdict[rid].bitid) + if not bigdict[rid].bitid in bitid_counts: + bitid_counts[bigdict[rid].bitid] = 0 + bitid_counts[bigdict[rid].bitid] += 1 + bsjdict[bsjid].append_rid(rid) + for b in bitid_counts.keys(): + print(b, bitid_counts[b]) + for bsjid in bsjdict.keys(): + bsjdict[bsjid].write_out_BSJ(args.bed) -if __name__ == "__main__": - main() + plusfile.close() + minusfile.close() + samfile.close() + args.bed.close() +if __name__ == "__main__": + main()
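
Reviewer note: the `get_strand`/`validate_read` logic in the patch above keys on sorted SAM FLAG triplets such as `83##163##2129`. As a reader aid, here is a minimal sketch of how those integers decompose into the standard bits defined in the SAM specification; the `SAM_BITS` table and `decode_flag` helper are illustrative names for this note, not part of the pipeline:

```python
# Decode a SAM FLAG integer into the standard bit names.
# Bit values are from the SAM specification; the helper name and the
# example flags (taken from the bitid strings above) are illustrative.

SAM_BITS = {
    1: "paired",
    2: "proper_pair",
    4: "unmapped",
    8: "mate_unmapped",
    16: "reverse",
    32: "mate_reverse",
    64: "read1",
    128: "read2",
    256: "secondary",
    512: "qc_fail",
    1024: "duplicate",
    2048: "supplementary",
}


def decode_flag(flag):
    """Return the set of SAM bit names encoded in an integer FLAG."""
    return {name for bit, name in SAM_BITS.items() if flag & bit}


# e.g. 2129 = 2048 + 64 + 16 + 1: a supplementary alignment of a reversed
# read1, i.e. one leg of a split (BSJ-spanning) alignment.
```

Note that 339, 419, and 2385 are just 83, 163, and 2129 with the secondary-alignment bit (256) added, which is why each strand combination in the docstring appears in two variants.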