bao-mathmod/SCA_NIPT

NIPT SCA Classification and Chromosome Ratio Calculation Pipeline

This repository contains a two-step pipeline for Non-Invasive Prenatal Testing (NIPT) analysis, focusing on Sex Chromosome Aneuploidy (SCA) detection, particularly for Chromosome X.

  1. Chromosome Ratio Calculation Pipeline (Bash): Calculates the ratio of reads mapped to a target chromosome (ChrX or ChrY) relative to Chr1 from a directory of BAM files.
  2. ChrX Aneuploidy Classification Pipeline (Python): Uses the ChrX ratio output from the first pipeline, applies log normalization, trains a Bagging Classifier on labeled training data, and predicts ChrX aneuploidy status (Healthy, XO, XXX) for unlabeled test samples. It can optionally integrate WisecondorX results and generate visualizations.

Overview

This project provides a two-step workflow for NIPT analysis focusing on Sex Chromosome Aneuploidies, specifically XO and XXX detection using ChrX ratios.

  • Step 1 (Bash Script: ratio.sh) processes raw alignment data (BAM files) to compute essential chromosome ratios (specifically ChrX relative to Chr1 for input into Step 2). It outputs these ratios into a CSV file.
  • Step 2 (Python Script: nipt_x.py) takes the CSV file generated in Step 1 as a basis for its input. It requires separate training (with labels) and test (without labels) files derived from this format. It then applies log transformation, trains a machine learning model (Bagging Classifier), predicts the ChrX aneuploidy status (Healthy, XO, XXX) for test samples, and optionally integrates results with WisecondorX and generates plots.

Step 1: Chromosome Ratio Calculation (Bash)

Description

This Bash script (ratio.sh) iterates through the BAM files in a specified input directory. For each file, it counts the primary mapped alignments on the user-selected target chromosome (ChrX or ChrY) and on Chr1 using samtools view -c -F 2308. The -F 2308 filter excludes unmapped reads (4), secondary alignments (256), and supplementary alignments (2048). The script then divides the target chromosome count by the Chr1 count and writes the ratios to a CSV file.

Note: To generate the necessary input for Step 2, you must run this script with the -c X option.
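The counting and ratio logic described above can be sketched in Python as follows. This is only an illustration, not the contents of ratio.sh: the function names are hypothetical, and the contig names passed to samtools (for example chrX and chr1) are an assumption that depends on how the reference genome names its chromosomes.

```python
import subprocess

# SAM flag bits excluded by -F 2308
UNMAPPED = 0x4          # read is unmapped
SECONDARY = 0x100       # secondary alignment
SUPPLEMENTARY = 0x800   # supplementary alignment
EXCLUDE_FLAGS = UNMAPPED | SECONDARY | SUPPLEMENTARY  # 4 + 256 + 2048 = 2308


def count_command(bam_path, contig):
    """Build the samtools command that counts primary mapped alignments on one contig."""
    return ["samtools", "view", "-c", "-F", str(EXCLUDE_FLAGS), bam_path, contig]


def count_reads(bam_path, contig):
    """Run samtools on an indexed BAM file and return the alignment count."""
    result = subprocess.run(
        count_command(bam_path, contig),
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def chromosome_ratio(target_count, chr1_count):
    """Ratio of target chromosome reads to Chr1 reads."""
    if chr1_count == 0:
        raise ValueError("Chr1 read count is zero; the ratio is undefined")
    return target_count / chr1_count
```

With samtools installed, `chromosome_ratio(count_reads(bam, "chrX"), count_reads(bam, "chr1"))` would give the ChrX ratio for one sample, assuming those contig names match the BAM header.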

Requirements

  • Input BAM files, each with a .bai index (samtools view needs the index to query a specific chromosome)
  • samtools (must be installed and accessible in the system's PATH)
  • Bash environment (Linux, macOS, WSL on Windows)

Usage

Execute the script from your terminal:

./ratio.sh -i <input_directory> -o <output_directory> -c <X|Y>

Parameter Descriptions

  • -i: Input directory containing BAM files.
  • -o: Output directory where the CSV file will be saved.
  • -c: Chromosome option. Use X for ChrX ratio calculation or Y for ChrY ratio calculation.

Examples

./ratio.sh -i /mnt/rdisk/tuanthanh/niptune_wf_run/results_bmk_nofil/bwa \
              -o /mnt/rdisk/bao_script/zscore_code/Chr_ratio/Data/Ref/tuanthanh/bwa/ChrX_ratio \
              -c X

Step 2: SCA Classification

Description

This step takes the output of the Chromosome Ratio Calculation pipeline (Step 1) and applies a Bagging Classifier to classify samples for Sex Chromosome Aneuploidies (SCA). The classifier is trained on gold standard reference samples that have been clinically validated and labeled, so the model learns from reliable data.
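As a rough illustration of the classification idea, the sketch below trains scikit-learn's BaggingClassifier (with its default decision tree base estimator) on a single log10-transformed ratio feature. It is not the code in nipt_x.py. The ratio values are synthetic, chosen only so the three classes separate cleanly, and the hyperparameters (n_estimators=25, random_state=0) are assumptions.

```python
import math

from sklearn.ensemble import BaggingClassifier

# Synthetic ratios for illustration only; these are NOT real ChrX/Chr1 values.
train_ratios = {
    "XO": [0.040, 0.041, 0.042, 0.043, 0.044],
    "Healthy": [0.050, 0.051, 0.052, 0.053, 0.054],
    "XXX": [0.060, 0.061, 0.062, 0.063, 0.064],
}

# One feature per sample: the log10 of the ratio.
X_train, y_train = [], []
for label, ratios in train_ratios.items():
    for ratio in ratios:
        X_train.append([math.log10(ratio)])
        y_train.append(label)

# Bagging fits each tree on a bootstrap resample and combines their votes.
model = BaggingClassifier(n_estimators=25, random_state=0)
model.fit(X_train, y_train)

# Unlabeled (again synthetic) test ratios, transformed the same way.
X_test = [[math.log10(ratio)] for ratio in (0.0415, 0.0525, 0.0625)]
predictions = [str(p) for p in model.predict(X_test)]
print(predictions)
```

The key point is that the test samples go through the same log transformation as the training samples before prediction.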

To account for potential bias across samples, the ChrX ratio is log-transformed (log2 or log10, selected with --log_type) before classification. This normalization is intended to make the classifier more robust when distinguishing normal samples from those with chromosomal abnormalities.
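A minimal sketch of the log transformation, using only the standard library. The function name and the error handling are assumptions and do not reflect how nipt_x.py implements the step.

```python
import math


def log_transform(ratio, log_type="log10"):
    """Apply a log2 or log10 transformation to a positive ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive to take a logarithm")
    if log_type == "log2":
        return math.log2(ratio)
    if log_type == "log10":
        return math.log10(ratio)
    raise ValueError("log_type must be 'log2' or 'log10'")


value_log2 = log_transform(0.25, "log2")    # -2.0
value_log10 = log_transform(0.05, "log10")
```

The two options differ only by a constant factor, since log2(x) = log10(x) / log10(2).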

Requirements

  • Python 3.x: Ensure Python 3.x is installed and accessible in your system's PATH.
  • Required Python packages: Install the following packages using pip:
    • pandas
    • numpy
    • scikit-learn
    • matplotlib
    • seaborn
  • Input CSV files: These should be derived from Step 1's output. Refer to the Data Preparation section for details.
  • Optional (for --table option): WisecondorX output _statistics.txt files are required if you integrate WisecondorX results.
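Before running with --table, it can help to confirm that the WisecondorX directory actually contains the expected files. The sketch below lists files ending in _statistics.txt in a flat directory. Treating the file name prefix as the sample ID is an assumption, and the script's own matching logic may differ. The file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

SUFFIX = "_statistics.txt"


def find_statistics_files(wisecondorx_dir):
    """Map a sample ID (file name minus the suffix) to its statistics file."""
    found = {}
    for path in sorted(Path(wisecondorx_dir).glob("*" + SUFFIX)):
        sample_id = path.name[: -len(SUFFIX)]
        found[sample_id] = path
    return found


# Demo on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("S01_statistics.txt", "S02_statistics.txt", "notes.txt"):
        (Path(tmp) / name).write_text("")
    sample_ids = sorted(find_statistics_files(tmp))

print(sample_ids)
```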

Usage

Run the script using the following command:

python nipt_x.py \
    --train prepared_data/train_labeled.csv \
    --test prepared_data/test_unlabeled_with_gender.csv \
    --output_dir results_step2/full_log10 \
    --log_type log10 \
    --plot \
    --table \
    --wisecondorx_dir /path/to/my/wisecondorx_outputs

Parameter Descriptions

--train: Required. Path to the prepared training CSV file (with label column).

--test: Required. Path to the prepared test CSV file (without label column, potentially with gender).

--output_dir: Required. Directory where all output files from this script will be saved. Created if it doesn't exist.

--log_type: Required. Log transformation type for the ChrX_Ratio. Choose either log2 or log10.

--plot: Optional. Generate visualization plots (e.g., decision boundary, example tree).

--table: Optional. Generate the final output table integrating WisecondorX results. Requires --wisecondorx_dir to be specified (or the default path to be valid) and the test CSV to have a gender column.
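The column requirements above can be checked before launching the script. The following is a sketch under assumptions: the exact header spellings (ChrX_Ratio, label, gender) and the Sample column in the demo are guesses based on the parameter descriptions, not confirmed names from nipt_x.py.

```python
import csv
import io


def check_columns(csv_text, role, need_gender=False):
    """Return a list of problems with a prepared CSV; an empty list means it looks fine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = set(reader.fieldnames or [])
    problems = []
    if "ChrX_Ratio" not in columns:
        problems.append("missing ChrX_Ratio column")
    if role == "train" and "label" not in columns:
        problems.append("training file needs a label column")
    if role == "test" and "label" in columns:
        problems.append("test file should not contain a label column")
    if need_gender and "gender" not in columns:
        problems.append("--table requires a gender column in the test file")
    return problems


# Hypothetical file contents for illustration.
train_csv = "Sample,ChrX_Ratio,label\nS01,0.05,Healthy\n"
test_csv = "Sample,ChrX_Ratio\nS02,0.05\n"

train_problems = check_columns(train_csv, role="train")
test_problems = check_columns(test_csv, role="test", need_gender=True)
print(train_problems, test_problems)
```

Here the training file passes, while the test file is flagged because --table was requested without a gender column.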

Examples

python nipt_x.py \
    --train prepared_data/train_labeled.csv \
    --test prepared_data/test_unlabeled_with_gender.csv \
    --output_dir results_step2/full_log10 \
    --log_type log10 \
    --plot \
    --table \
    --wisecondorx_dir /path/to/my/wisecondorx_outputs
