AB1 Analysis Tool

This folder contains a small Sanger AB1 assembly workflow with three entry points:

  • assemble.py: process one sample folder that normally contains one forward read and one reverse read, with an optional single-strand fallback mode.
  • main.py: batch-run the same assembly logic across all subfolders of a parent folder with multiprocessing, a progress bar, and optional combined FASTA output.
  • web_service.py: run a local web UI that accepts a zip upload or forward/reverse .ab1 files, then serves the generated report pages and downloads.

In this README, AB1 file and ABI file mean the same Sanger trace file format.

What It Does

Given paired ABI trace files, the toolkit:

  • reads sequence, Phred quality values, ABI trace channels, base positions, and selected metadata from .ab1 files
  • trims low-quality ends from each read
  • reverse-complements the reverse read and reverse trace into forward orientation
  • aligns the trimmed forward and reverse-complement reads in overlap style
  • builds a consensus sequence with conservative per-position rules
  • optionally uses paired-read IUPAC mixture calling
  • always detects single-strand candidate mixtures for review, and can optionally apply selected confidence levels back into consensus
  • writes per-sample FASTA, QA HTML, alignment HTML, and Warning.html when assembly cannot complete for that sample
  • batch-processes many sample folders in parallel, shows progress, cleans each sample folder before assembly, can write one combined FASTA, and can generate categorized batch warning index HTML files
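
The reverse-complement step in the list above can be sketched as follows. This is a minimal illustration, not the toolkit's own code: the function name, the dictionary-of-channels layout, and the `trace_length` argument are assumptions made for the example.

```python
# Hypothetical helper; names and data layout are illustrative assumptions.
# Only A, C, G, T and N are complemented; other symbols pass through unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_read(sequence, qualities, traces, base_positions, trace_length):
    """Flip a reverse read into forward orientation.

    sequence: called bases, qualities: Phred value per base,
    traces: dict of channel name ("A", "C", "G", "T") to signal list,
    base_positions: trace index of each called base,
    trace_length: number of samples in each channel.
    """
    rc_sequence = sequence.translate(COMPLEMENT)[::-1]
    rc_qualities = qualities[::-1]
    # Each channel is reversed in time and relabelled with its complementary base.
    swap = {"A": "T", "C": "G", "G": "C", "T": "A"}
    rc_traces = {swap[base]: signal[::-1] for base, signal in traces.items()}
    # Peak positions are mirrored around the end of the trace and reordered.
    rc_positions = [trace_length - 1 - p for p in reversed(base_positions)]
    return rc_sequence, rc_qualities, rc_traces, rc_positions
```

Reversing the signal alone is not enough: the channel labels must be swapped as well, otherwise the peaks would no longer line up with the complemented base calls.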

Expected Input Layout

Single-sample mode

Run assemble.py on a folder that contains:

  • one .ab1 file whose stem ends with F
  • one .ab1 file whose stem ends with R

With --allow-single-strand, the folder may instead contain only one of those files. The available read is mirrored into the missing strand orientation so the normal QA and consensus pipeline can still run.

Example:

sample_001/
  isolate123F.ab1
  isolate123R.ab1
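
Pairing by stem suffix could look roughly like the sketch below. `find_read_pair` is a hypothetical name, and the choice to take the first match per strand is an assumption; the actual script may handle duplicates differently.

```python
from pathlib import Path


def find_read_pair(folder):
    """Return (forward, reverse) .ab1 paths chosen by stem suffix.

    A strand with no matching file is returned as None, which is the
    situation --allow-single-strand is meant to cover.
    """
    forward = reverse = None
    for path in sorted(Path(folder).glob("*.ab1")):
        if path.stem.endswith("F"):
            forward = forward or path  # keep the first forward match (assumption)
        elif path.stem.endswith("R"):
            reverse = reverse or path  # keep the first reverse match (assumption)
    return forward, reverse
```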

Batch mode

Run main.py on a parent folder where each subfolder is one sample:

batch_run/
  sample_001/
    isolate123F.ab1
    isolate123R.ab1
  sample_002/
    isolate456F.ab1
    isolate456R.ab1

Usage

Install dependencies:

uv sync

Assemble one sample folder:

uv run python assemble.py /path/to/sample_folder

Enable paired-read mixture calling:

uv run python assemble.py --use-paired-mixture /path/to/sample_folder

Apply only high-confidence single-strand mixtures back into consensus:

uv run python assemble.py --use-single-mixture high /path/to/sample_folder

Apply high and medium single-strand mixtures back into consensus:

uv run python assemble.py --use-single-mixture medium /path/to/sample_folder

Use different forward and reverse Phred thresholds:

uv run python assemble.py --min-phred-score-per-base 18:25 /path/to/sample_folder
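
The forward:reverse value can be split into two integers along these lines. `parse_phred_pair` is a hypothetical helper written for this README, not necessarily how assemble.py parses the option.

```python
def parse_phred_pair(value):
    """Split a 'forward:reverse' string such as '18:25' into two integer thresholds."""
    forward_text, separator, reverse_text = value.partition(":")
    if not separator:
        raise ValueError(f"expected forward:reverse, got {value!r}")
    return int(forward_text), int(reverse_text)
```

With the command above, the forward read would be held to Phred 18 and the reverse read to Phred 25.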

Tune trimming and overlap thresholds:

uv run python assemble.py \
  --min-phred-score-per-base 20:20 \
  --min-consecutive-high-quality-bases 10 \
  --min-overlap 40 \
  /path/to/sample_folder
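
One plausible reading of how the two trimming options interact is sketched below. It assumes the kept region starts at the first run of consecutive high-quality bases and ends at the last such run; `trim_bounds` is a hypothetical name and the real boundary rule may differ in detail.

```python
def trim_bounds(qualities, min_phred=20, min_run=10):
    """Return (start, end) of the slice to keep, or None if no qualifying run exists.

    Assumption: a run is `min_run` consecutive bases with Phred >= `min_phred`.
    The slice starts at the first run and ends at the end of the last run.
    """
    good = [q >= min_phred for q in qualities]
    run_starts = [
        i for i in range(len(good) - min_run + 1) if all(good[i:i + min_run])
    ]
    if not run_starts:
        return None  # corresponds to a read that cannot be trimmed
    return run_starts[0], run_starts[-1] + min_run
```

The length of the overlap between the two trimmed reads is then what --min-overlap is compared against.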

Clean up one sample folder and exit:

uv run python assemble.py --clean /path/to/sample_folder

Allow a single forward-only or reverse-only file:

uv run python assemble.py --allow-single-strand /path/to/sample_folder

Align the consensus to a reference nucleotide FASTA in codon-aware mode:

uv run python assemble.py \
  --reference-nucleotide-fasta /path/to/reference_nt.fasta \
  /path/to/sample_folder

Batch-run all sample subfolders:

uv run python main.py /path/to/parent_folder

Batch-run with a specific worker count:

uv run python main.py --processes 8 /path/to/parent_folder

Clean up all sample subfolders and exit:

uv run python main.py --clean /path/to/parent_folder

Batch-run and remap combined FASTA headers with an Excel sheet:

uv run python main.py --mapping-xlsx /path/to/mapping.xlsx /path/to/parent_folder

Batch-run with single-strand fallback enabled:

uv run python main.py --allow-single-strand /path/to/parent_folder

Run the web service:

uvicorn web_service:app --host 127.0.0.1 --port 8000

Then open http://127.0.0.1:8000 in a browser.

The --mapping-xlsx file is interpreted as:

  • column 1: sample ID
  • column 2: sample name

Only the combined FASTA headers are remapped. Per-sample FASTA headers are left unchanged.
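
The header remapping can be pictured with the sketch below. It assumes the two spreadsheet columns have already been loaded into a dict of sample ID to sample name, that each FASTA header consists of the sample ID only, and that IDs missing from the mapping are left as they are. `remap_fasta_headers` is a hypothetical name.

```python
def remap_fasta_headers(fasta_text, id_to_name):
    """Replace each '>sample_id' header with its mapped name.

    id_to_name: dict built from column 1 (ID) and column 2 (name).
    Headers whose ID is not in the dict are kept unchanged (assumption).
    """
    lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            sample_id = line[1:].strip()
            line = ">" + id_to_name.get(sample_id, sample_id)
        lines.append(line)
    return "\n".join(lines) + "\n"
```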

Outputs

For each sample folder, successful assembly writes:

sample_001/
  sample_001.fasta
  sample_001_aa_alignment.fasta
  forward_trimmed.fasta
  reverse_rc_trimmed.fasta
  sample_001_alignment.html
  sample_001_QA.html

The web service stores each upload under /tmp/ab1_file_analysis_tool/jobs/<job_id>/<sample_name>/ and exposes:

  • an in-browser report view
  • direct download links for each generated HTML report
  • direct download links for <sample_name>.fasta, forward_trimmed.fasta, and reverse_rc_trimmed.fasta when present
  • one zip bundle that contains all generated HTML reports, consensus FASTA output, trimmed FASTA outputs, and Warning.html when present

If trimming or assembly fails, the folder instead gets:

sample_001/
  Warning.html

Batch mode may also write:

parent_folder/
  parent_folder_combined.fasta
  parent_folder_trim_warning.html
  parent_folder_assembly_warning.html

Reports

QA report

<folder_name>_QA.html includes:

  • a single-strand warning banner at the top when --allow-single-strand mirrored one read into the missing strand
  • Quality Plot: forward and reverse-complement quality plots with threshold, median, and trim-boundary markers
  • Chromatogram: forward and reverse-complement trace plots with shared x-axis controls
  • forward QA table
  • reverse QA table
  • AB1 / ABI: a short explanation of AB1 files, instrument metadata, key ABI sections, and an ASCII hierarchy view

Alignment report

<folder_name>_alignment.html includes:

  • a single-strand warning banner at the top when --allow-single-strand mirrored one read into the missing strand
  • forward and reverse trimming summaries
  • overlap and alignment parameters
  • resolve_consensus_base rule summary
  • the aligned forward, reverse-complement, consensus, read-position, and Phred rows
  • a low-quality table
  • a merged Single Strand Mixture panel with:
    • parameter summary
    • table-header explanation
    • forward candidate mixture table
    • reverse-complement candidate mixture table

All HTML tables support client-side sorting and CSV download.

Consensus Rules

Consensus calling is intentionally conservative:

  • if both aligned bases match and both are at least --min-phred-score-for-paired-base, accept that base
  • if aligned bases match and only one strand is above its own per-read threshold, accept that base
  • if only one strand contributes a base, accept it only if that strand is above its own per-read threshold
  • if bases disagree and both are above their own per-read thresholds, emit an IUPAC mixture only when --use-paired-mixture is enabled
  • otherwise fall back to N
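
The rules above can be written out as a small function. The name resolve_consensus_base comes from the alignment report, but the signature, the use of None for a missing strand, and the choice of >= for "above" are assumptions of this sketch rather than the actual implementation.

```python
# Two-base IUPAC codes, keyed by the unordered pair of bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def resolve_consensus_base(fwd, fwd_q, rev, rev_q, *, min_fwd=20, min_rev=20,
                           min_paired=10, use_paired_mixture=False):
    """Resolve one aligned column; None marks a strand with no base here."""
    fwd_ok = fwd is not None and fwd_q >= min_fwd
    rev_ok = rev is not None and rev_q >= min_rev
    if fwd is None or rev is None:
        # Only one strand contributes: it must pass its own per-read threshold.
        if fwd_ok:
            return fwd
        if rev_ok:
            return rev
        return "N"
    if fwd == rev:
        # Agreement: the lower paired threshold applies to both strands,
        # or one strand alone may pass its own per-read threshold.
        if fwd_q >= min_paired and rev_q >= min_paired:
            return fwd
        if fwd_ok or rev_ok:
            return fwd
        return "N"
    # Disagreement: a mixture is emitted only when explicitly enabled.
    if fwd_ok and rev_ok and use_paired_mixture:
        return IUPAC.get(frozenset((fwd, rev)), "N")
    return "N"
```

For example, a confident A on the forward strand against a confident G on the reverse strand resolves to N by default and to R only with --use-paired-mixture.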

When --use-single-mixture high or --use-single-mixture medium is enabled, selected single-strand mixture calls are mapped back onto trimmed read positions before consensus resolution.

Parameters

assemble.py

  • folder: sample folder containing one *F.ab1 and one *R.ab1 (or only one of them when --allow-single-strand is set)
  • --clean: remove all non-.ab1 outputs in the sample folder and exit
  • --use-paired-mixture: allow two-strand IUPAC mixture calls for high-confidence disagreements
  • --use-single-mixture {high,medium}: apply selected-confidence single-strand mixtures back into consensus; default is disabled
  • --allow-single-strand: allow a single forward-only or reverse-only .ab1 file and mirror it into the missing strand
  • --min-phred-score-per-base: per-read threshold in forward:reverse format; default 20:20
  • --min-phred-score-for-paired-base: minimum Phred score accepted when both strands agree on the same base; default 10
  • --min-consecutive-high-quality-bases: run length used to define trim boundaries; default 10
  • --min-overlap: minimum aligned overlap after trimming; default 40
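  • --detect-single-strand-mixture: detect single-strand candidate mixtures for review; the parser default is already enabled, so passing the flag currently changes nothing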
  • --reference-nucleotide-fasta: optional FASTA with exactly one nucleotide reference whose length is a multiple of 3; aligns the consensus in codon units against that reference, then derives RefAA, ConsAA, and ConsNA for <folder_name>_aa_alignment.fasta and the alignment report
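
The constraints on the reference FASTA (exactly one record, length a multiple of 3) can be checked as in the sketch below. `load_single_reference` and `split_codons` are hypothetical helpers written for illustration; they cover validation and codon splitting only, not the codon-aware alignment itself.

```python
def load_single_reference(fasta_text):
    """Return the sequence of a FASTA that must hold exactly one codon-aligned record."""
    lines = fasta_text.splitlines()
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError(f"expected exactly one record, found {len(headers)}")
    sequence = "".join(
        line.strip() for line in lines if line and not line.startswith(">")
    ).upper()
    if not sequence or len(sequence) % 3 != 0:
        raise ValueError("reference length must be a non-zero multiple of 3")
    return sequence


def split_codons(sequence):
    """Split a validated reference into consecutive 3-base codon units."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
```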

Single-strand mixture detection and QA use these internal defaults:

  • min_peak_ratio=0.25
  • min_secondary_snr=5.0
  • noise_window_radius=20
  • noise_window_exclude_radius=4
  • high_phred_quality=30
  • high_peak_ratio=0.33
  • high_secondary_snr=8.0

main.py

main.py exposes the same assembly parameters plus:

  • --clean: clean each sample subfolder in turn without running assembly, then exit
  • --mapping-xlsx: remap headers in the combined FASTA using column 1 = ID and column 2 = name
  • --processes: number of worker processes used for batch assembly; default 8

web_service.py

  • --host: bind host for the local HTTP server; default 127.0.0.1
  • --port: bind port for the local HTTP server; default 8000
  • implementation: FastAPI app served by Uvicorn, with request-body multipart parsing handled in local code

The upload form accepts either:

  • one .zip file containing one sample with one or two .ab1 files
  • one forward .ab1 and one reverse .ab1 file uploaded separately

For separate uploads, the service renames the saved files to end with F.ab1 and R.ab1 so the existing assembly logic can pair them reliably.

In main.py, normal batch runs also clean each sample subfolder before assembly starts.

At the end of a batch run, main.py prints:

  • Consensus ready: X/Y
  • Warning: N

During the batch run, main.py also shows a progress bar on stderr.
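
The end-of-run summary could be derived from the output files roughly as follows. This sketch assumes a sample counts as ready when its `<folder_name>.fasta` exists and as a warning when `Warning.html` exists; `summarize_batch` is a hypothetical name, and main.py may track these counts differently.

```python
from pathlib import Path


def summarize_batch(parent_folder):
    """Return (ready, total, warnings) counts for the sample subfolders."""
    samples = sorted(p for p in Path(parent_folder).iterdir() if p.is_dir())
    # Assumption: a consensus FASTA named after the folder marks a finished sample.
    ready = [p for p in samples if (p / f"{p.name}.fasta").exists()]
    # Warning detection relies only on the presence of Warning.html.
    warned = [p for p in samples if (p / "Warning.html").exists()]
    return len(ready), len(samples), len(warned)


def print_summary(parent_folder):
    """Print the two summary lines in the format shown above."""
    ready, total, warnings = summarize_batch(parent_folder)
    print(f"Consensus ready: {ready}/{total}")
    print(f"Warning: {warnings}")
```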

If warnings exist, main.py writes one HTML file per warning category, for example <parent_folder>_trim_warning.html and <parent_folder>_assembly_warning.html, each with:

  • the total warning count for that category
  • one section per warning file
  • a relative link to each Warning.html
  • an embedded view of each warning page

Current Limitations

  • The default workflow still expects two reads per sample unless --allow-single-strand is enabled.
  • File pairing depends on filename stem suffixes F and R.
  • The single-strand detector thresholds are still code defaults, not CLI parameters.
  • In single-strand mode, the missing strand is synthesized from the available read, which keeps the pipeline uniform but does not add new experimental evidence.
  • --detect-single-strand-mixture is effectively redundant at the moment because the current parser default is already enabled.
  • Batch warning detection is based on whether Warning.html exists.