hivdb/AB1_file_Analysis_Tool

AB1 Analysis Tool

This folder contains a small Sanger AB1 assembly workflow with two entrypoints:

  • assemble.py: process one sample folder that normally contains one forward read and one reverse read, with an optional single-strand fallback mode.
  • main.py: batch-run the same assembly logic across all subfolders of a parent folder with multiprocessing, a progress bar, and optional combined FASTA output.

In this README, AB1 file and ABI file mean the same Sanger trace file format.

What It Does

Given paired ABI trace files, the toolkit:

  • reads sequence, Phred quality values, ABI trace channels, base positions, and selected metadata from .ab1 files
  • trims low-quality ends from each read
  • reverse-complements the reverse read and reverse trace into forward orientation
  • aligns the trimmed forward and reverse-complement reads in overlap style
  • builds a consensus sequence with conservative per-position rules
  • optionally uses paired-read IUPAC mixture calling
  • always detects single-strand candidate mixtures for review, and can optionally apply selected confidence levels back into consensus
  • writes per-sample FASTA, QA HTML, alignment HTML, and Warning.html when assembly cannot complete for that sample
  • batch-processes many sample folders in parallel, shows progress, cleans each sample folder before assembly, can write one combined FASTA, and can generate categorized batch warning index HTML files
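The trimming and reverse-complement steps above can be sketched as follows. This is a minimal illustration, not the code in assemble.py: the function names find_trim_window and reverse_complement are hypothetical, and it assumes that a trim boundary is the first (or last) run of consecutive bases at or above the Phred threshold.

```python
# Illustrative sketch only; names and exact boundary rule are assumptions.
COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def find_trim_window(quals, min_phred=20, run_length=10):
    """Return a half-open (start, end) window, or None if no run qualifies."""

    def first_run_start(values):
        # Index where the first run of `run_length` good bases begins.
        streak = 0
        for i, q in enumerate(values):
            streak = streak + 1 if q >= min_phred else 0
            if streak == run_length:
                return i - run_length + 1
        return None

    start = first_run_start(quals)
    if start is None:
        return None
    # Scan the reversed qualities to locate the last good run.
    rev_start = first_run_start(quals[::-1])
    return start, len(quals) - rev_start


def reverse_complement(seq, quals):
    """Flip a read and its per-base qualities into the opposite orientation."""
    return seq.upper().translate(COMPLEMENT)[::-1], list(quals)[::-1]
```

A read for which find_trim_window returns None has no usable region under these assumptions, which corresponds to the case where a Warning.html would be written instead of a consensus.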

Expected Input Layout

Single-sample mode

Run assemble.py on a folder that contains:

  • one .ab1 file whose stem ends with F
  • one .ab1 file whose stem ends with R

With --allow-single-strand, the folder may instead contain only one of those files. The available read is mirrored into the missing strand orientation so the normal QA and consensus pipeline can still run.

Example:

sample_001/
  isolate123F.ab1
  isolate123R.ab1
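
The pairing convention above (stem ends with F or R) could be implemented roughly like this. The helper name find_read_pair and its error handling are assumptions made for illustration; the real assemble.py may behave differently, for example when several candidates match.

```python
from pathlib import Path


def find_read_pair(folder, allow_single_strand=False):
    """Return (forward_path, reverse_path); either may be None in single-strand mode."""
    files = sorted(Path(folder).glob("*.ab1"))
    forward = [p for p in files if p.stem.endswith("F")]
    reverse = [p for p in files if p.stem.endswith("R")]

    # Assumption: more than one candidate per strand is treated as an error.
    if len(forward) > 1 or len(reverse) > 1:
        raise ValueError(f"ambiguous F/R files in {folder}")

    fwd = forward[0] if forward else None
    rev = reverse[0] if reverse else None

    if fwd is None and rev is None:
        raise FileNotFoundError(f"no *F.ab1 or *R.ab1 file in {folder}")
    if (fwd is None or rev is None) and not allow_single_strand:
        raise FileNotFoundError(f"expected both *F.ab1 and *R.ab1 in {folder}")
    return fwd, rev
```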

Batch mode

Run main.py on a parent folder where each subfolder is one sample:

batch_run/
  sample_001/
    isolate123F.ab1
    isolate123R.ab1
  sample_002/
    isolate456F.ab1
    isolate456R.ab1

Usage

Install dependencies:

uv sync

Assemble one sample folder:

uv run python assemble.py /path/to/sample_folder

Enable paired-read mixture calling:

uv run python assemble.py --use-paired-mixture /path/to/sample_folder

Apply only high-confidence single-strand mixtures back into consensus:

uv run python assemble.py --use-single-mixture high /path/to/sample_folder

Apply high- and medium-confidence single-strand mixtures back into consensus:

uv run python assemble.py --use-single-mixture medium /path/to/sample_folder

Use different forward and reverse Phred thresholds:

uv run python assemble.py --min-phred-score-per-base 18:25 /path/to/sample_folder
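
For reference, a forward:reverse value such as 18:25 can be parsed along these lines. The function name parse_phred_pair is hypothetical, and rejecting anything other than exactly two non-negative integers is an assumption, not documented behavior.

```python
def parse_phred_pair(value):
    """Parse 'forward:reverse' (for example '18:25') into a tuple of ints."""
    parts = value.split(":")
    if len(parts) != 2:
        raise ValueError(f"expected forward:reverse, got {value!r}")
    forward, reverse = (int(part) for part in parts)
    if forward < 0 or reverse < 0:
        raise ValueError("Phred thresholds must be non-negative")
    return forward, reverse
```

A function of this shape can be passed as the type= argument of argparse's add_argument, since argparse turns a ValueError into a usage error.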

Tune trimming and overlap thresholds:

uv run python assemble.py \
  --min-phred-score-per-base 20:20 \
  --min-consecutive-high-quality-bases 10 \
  --min-overlap 40 \
  /path/to/sample_folder

Clean up one sample folder and exit:

uv run python assemble.py --clean /path/to/sample_folder
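
Conceptually, cleanup removes everything in the sample folder except the .ab1 inputs. The sketch below illustrates that idea only; clean_sample_folder and its dry_run safety default are hypothetical, and it assumes that only regular files directly inside the folder are affected.

```python
from pathlib import Path


def clean_sample_folder(folder, dry_run=True):
    """List (and, unless dry_run, delete) every non-.ab1 file in the folder."""
    removed = []
    for path in sorted(Path(folder).iterdir()):
        # Keep trace files; subdirectories are left alone in this sketch.
        if path.is_file() and path.suffix.lower() != ".ab1":
            if not dry_run:
                path.unlink()
            removed.append(path.name)
    return removed
```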

Allow a single forward-only or reverse-only file:

uv run python assemble.py --allow-single-strand /path/to/sample_folder

Batch-run all sample subfolders:

uv run python main.py /path/to/parent_folder

Batch-run with a specific worker count:

uv run python main.py --processes 8 /path/to/parent_folder

Clean up all sample subfolders and exit:

uv run python main.py --clean /path/to/parent_folder

Batch-run and remap combined FASTA headers with an Excel sheet:

uv run python main.py --mapping-xlsx /path/to/mapping.xlsx /path/to/parent_folder

Batch-run with single-strand fallback enabled:

uv run python main.py --allow-single-strand /path/to/parent_folder

The mapping file is interpreted as:

  • column 1: sample ID
  • column 2: sample name

Only the combined FASTA headers are remapped. Per-sample FASTA headers are left unchanged.
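
The remapping step can be pictured as a dictionary lookup on each header line. In this sketch the mapping is passed in as a plain dict (reading the Excel sheet is left out), remap_fasta_headers is a hypothetical name, and it assumes each header consists of the sample ID alone.

```python
def remap_fasta_headers(fasta_text, id_to_name):
    """Replace '>sample_id' headers with '>sample_name' where a mapping exists."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            sample_id = line[1:].strip()
            # IDs without a mapping entry are kept unchanged.
            line = ">" + id_to_name.get(sample_id, sample_id)
        out.append(line)
    return "\n".join(out) + "\n"
```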

Outputs

For each sample folder, successful assembly writes:

sample_001/
  sample_001.fasta
  forward_trimmed.fasta
  reverse_rc_trimmed.fasta
  sample_001_alignment.html
  sample_001_QA.html

If trimming or assembly fails, the folder instead gets:

sample_001/
  Warning.html

Batch mode may also write:

parent_folder/
  parent_folder_combined.fasta
  parent_folder_trim_warning.html
  parent_folder_assembly_warning.html

Reports

QA report

<folder_name>_QA.html includes:

  • a single-strand warning banner at the top when --allow-single-strand mirrored one read into the missing strand
  • Quality Plot: forward and reverse-complement quality plots with threshold, median, and trim-boundary markers
  • Chromatogram: forward and reverse-complement trace plots with shared x-axis controls
  • forward QA table
  • reverse QA table
  • AB1 / ABI: a short explanation of AB1 files, instrument metadata, key ABI sections, and an ASCII hierarchy view

Alignment report

<folder_name>_alignment.html includes:

  • a single-strand warning banner at the top when --allow-single-strand mirrored one read into the missing strand
  • forward and reverse trimming summaries
  • overlap and alignment parameters
  • resolve_consensus_base rule summary
  • the aligned forward, reverse-complement, consensus, read-position, and Phred rows
  • a low-quality table
  • a merged Single Strand Mixture panel with:
    • parameter summary
    • table-header explanation
    • forward candidate mixture table
    • reverse-complement candidate mixture table

All HTML tables support client-side sorting and CSV download.

Consensus Rules

Consensus calling is intentionally conservative:

  • if both aligned bases match and both are at least --min-phred-score-for-paired-base, accept that base
  • if aligned bases match and only one strand is above its own per-read threshold, accept that base
  • if only one strand contributes a base, accept it only if that strand is above its own per-read threshold
  • if bases disagree and both are above their own per-read thresholds, emit an IUPAC mixture only when --use-paired-mixture is enabled
  • otherwise fall back to N
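
The rules above can be expressed as a small decision function. This is a sketch written from the rule list, not the project's resolve_consensus_base: the name sketch_resolve_consensus_base, the use of None for a missing base, and treating "above" as "at least" are all assumptions.

```python
# Two-base IUPAC ambiguity codes.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def sketch_resolve_consensus_base(fwd_base, fwd_q, rev_base, rev_q, *,
                                  fwd_min=20, rev_min=20, paired_min=10,
                                  use_paired_mixture=False):
    """Apply the conservative per-position rules; None means no base on that strand."""
    fwd_ok = fwd_base is not None and fwd_q >= fwd_min
    rev_ok = rev_base is not None and rev_q >= rev_min

    if fwd_base is not None and rev_base is not None:
        if fwd_base == rev_base:
            # Agreement: accept if both pass the paired threshold,
            # or if at least one strand passes its own threshold.
            if fwd_q >= paired_min and rev_q >= paired_min:
                return fwd_base
            return fwd_base if (fwd_ok or rev_ok) else "N"
        # Disagreement: only a confident conflict may become a mixture.
        if fwd_ok and rev_ok and use_paired_mixture:
            return IUPAC.get(frozenset((fwd_base, rev_base)), "N")
        return "N"

    # Only one strand contributes a base.
    if fwd_base is not None:
        return fwd_base if fwd_ok else "N"
    if rev_base is not None:
        return rev_base if rev_ok else "N"
    return "N"
```

With the defaults, a confident A/G disagreement resolves to N; passing use_paired_mixture=True turns the same position into R.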

When --use-single-mixture high or --use-single-mixture medium is enabled, selected single-strand mixture calls are mapped back onto trimmed read positions before consensus resolution.

Parameters

assemble.py

  • folder: sample folder containing one *F.ab1 and one *R.ab1 (or only one of them when --allow-single-strand is set)
  • --clean: remove all non-.ab1 outputs in the sample folder and exit
  • --use-paired-mixture: allow two-strand IUPAC mixture calls for high-confidence disagreements
  • --use-single-mixture {high,medium}: apply selected-confidence single-strand mixtures back into consensus; default is disabled
  • --allow-single-strand: allow a single forward-only or reverse-only .ab1 file and mirror it into the missing strand
  • --min-phred-score-per-base: per-read threshold in forward:reverse format; default 20:20
  • --min-phred-score-for-paired-base: minimum Phred score accepted when both strands agree on the same base; default 10
  • --min-consecutive-high-quality-bases: run length used to define trim boundaries; default 10
  • --min-overlap: minimum aligned overlap after trimming; default 40
  • --detect-single-strand-mixture: detect single-strand candidate mixtures; already enabled by default in the current code, so passing it changes nothing

Single-strand mixture detection and QA use these internal defaults:

  • min_peak_ratio=0.25
  • min_secondary_snr=5.0
  • noise_window_radius=20
  • noise_window_exclude_radius=4
  • high_phred_quality=30
  • high_peak_ratio=0.33
  • high_secondary_snr=8.0
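
One plausible way these defaults could combine into a confidence level is sketched below. Only the parameter names and values come from the list above; how the tool actually combines them, the function name classify_candidate, and the assumption that a candidate is either high or medium are all guesses made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixtureThresholds:
    # Values copied from the documented internal defaults.
    min_peak_ratio: float = 0.25
    min_secondary_snr: float = 5.0
    high_phred_quality: int = 30
    high_peak_ratio: float = 0.33
    high_secondary_snr: float = 8.0


def classify_candidate(peak_ratio, secondary_snr, phred, t=MixtureThresholds()):
    """Return 'high', 'medium', or None (not a candidate). Assumed logic."""
    # Below either minimum: not reported as a candidate mixture.
    if peak_ratio < t.min_peak_ratio or secondary_snr < t.min_secondary_snr:
        return None
    # All three stricter thresholds met: high confidence.
    if (phred >= t.high_phred_quality
            and peak_ratio >= t.high_peak_ratio
            and secondary_snr >= t.high_secondary_snr):
        return "high"
    return "medium"
```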

main.py

main.py exposes the same assembly parameters plus:

  • --clean: clean every sample subfolder without running assembly, then exit
  • --mapping-xlsx: remap headers in the combined FASTA using column 1 = ID and column 2 = name
  • --processes: number of worker processes used for batch assembly; default 8

Normal batch runs also clean each sample subfolder before assembly starts.

At the end of a batch run it prints:

  • Consensus ready: X/Y
  • Warning: N

During the batch run it also shows a progress bar on stderr.
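
The summary lines can be reproduced from the output files alone. The sketch below counts warnings by the presence of Warning.html, as the tool does; counting a sample as ready when <folder_name>.fasta exists is an assumption, and summarize_batch is a hypothetical helper.

```python
from pathlib import Path


def summarize_batch(parent):
    """Return the two summary lines for a processed parent folder."""
    samples = sorted(p for p in Path(parent).iterdir() if p.is_dir())
    # A sample counts as ready when its consensus FASTA exists (assumed).
    ready = [p for p in samples if (p / f"{p.name}.fasta").exists()]
    warned = [p for p in samples if (p / "Warning.html").exists()]
    return (f"Consensus ready: {len(ready)}/{len(samples)}",
            f"Warning: {len(warned)}")
```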

If warnings exist, main.py writes one HTML file per warning category, for example <parent_folder>_trim_warning.html and <parent_folder>_assembly_warning.html, each with:

  • the total warning count for that category
  • one section per warning file
  • a relative link to each Warning.html
  • an embedded view of each warning page

Current Limitations

  • The default workflow still expects two reads per sample unless --allow-single-strand is enabled.
  • File pairing depends on filename stem suffixes F and R.
  • The single-strand detector thresholds are still code defaults, not CLI parameters.
  • In single-strand mode, the missing strand is synthesized from the available read, which keeps the pipeline uniform but does not add new experimental evidence.
  • --detect-single-strand-mixture is effectively redundant at the moment because the current parser default is already enabled.
  • Batch warning detection is based on whether Warning.html exists.
