This repository contains a two-step pipeline for Non-Invasive Prenatal Testing (NIPT) analysis, focusing on Sex Chromosome Aneuploidy (SCA) detection, particularly for Chromosome X.
- Chromosome Ratio Calculation Pipeline (Bash): Calculates the ratio of reads mapped to a target chromosome (ChrX or ChrY) relative to Chr1 from a directory of BAM files.
- ChrX Aneuploidy Classification Pipeline (Python): Uses the ChrX ratio output from the first pipeline, applies log normalization, trains a Bagging Classifier on labeled training data, and predicts ChrX aneuploidy status (Healthy, XO, XXX) for unlabeled test samples. It can optionally integrate WisecondorX results and generate visualizations.
This project provides a two-step workflow for NIPT analysis focusing on Sex Chromosome Aneuploidies, specifically XO and XXX detection using ChrX ratios.
- Step 1 (Bash script: ratio.sh) processes raw alignment data (BAM files) to compute essential chromosome ratios (specifically ChrX relative to Chr1 for input into Step 2). It outputs these ratios into a CSV file.
- Step 2 (Python script: nipt_x.py) takes the CSV file generated in Step 1 as a basis for its input. It requires separate training (with labels) and test (without labels) files derived from this format. It then applies log transformation, trains a machine learning model (Bagging Classifier), predicts the ChrX aneuploidy status (Healthy, XO, XXX) for test samples, and optionally integrates results with WisecondorX and generates plots.
This Bash pipeline (ratio.sh) iterates through the BAM files in a specified input directory. For each file, it counts the primary alignments mapped to a user-selected target chromosome (ChrX or ChrY) and to Chr1 using samtools view -c -F 2308; the flag mask 2308 excludes unmapped, secondary, and supplementary alignments. It then divides the target chromosome count by the Chr1 count and writes the resulting ratios to a CSV file.
Note: To generate the necessary input for Step 2, you must run this script with the -c X option.
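The counting and ratio logic described above can be sketched in Python. This is an illustration only, not the contents of ratio.sh: the function names, the contig names (chrX and chr1; your BAM headers may use X and 1 instead), and the example counts are all assumptions. The sketch also shows why -F 2308 keeps only mapped primary alignments: 2308 is the sum of three SAM flag bits.

```python
import subprocess

# SAM flag bits excluded by `-F 2308`
UNMAPPED = 0x4          # read is unmapped
SECONDARY = 0x100       # secondary alignment
SUPPLEMENTARY = 0x800   # supplementary alignment
EXCLUDE_FLAGS = UNMAPPED | SECONDARY | SUPPLEMENTARY  # 4 + 256 + 2048 = 2308


def count_command(bam_path, contig):
    """Build the samtools call that counts mapped primary alignments on one contig."""
    return ["samtools", "view", "-c", "-F", str(EXCLUDE_FLAGS), bam_path, contig]


def count_reads(bam_path, contig):
    """Run samtools (must be on PATH; the BAM must be indexed) and return the count."""
    result = subprocess.run(count_command(bam_path, contig),
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def chromosome_ratio(target_count, chr1_count):
    """Ratio of target-chromosome reads to Chr1 reads."""
    if chr1_count == 0:
        raise ValueError("no reads counted on Chr1; cannot compute a ratio")
    return target_count / chr1_count


# Hypothetical counts, for illustration only
print(EXCLUDE_FLAGS)                          # 2308
print(chromosome_ratio(50_000, 1_000_000))    # 0.05
```

Guarding against a zero Chr1 count avoids a division error on empty or truncated BAM files; whether ratio.sh performs a similar check is not stated here.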
- Input BAM files (indexed; .bai files are implicitly required by samtools view for specific regions)
- samtools (must be installed and accessible in the system's PATH)
- Bash environment (Linux, macOS, WSL on Windows)
Execute the script from your terminal:
./ratio.sh -i <input_directory> -o <output_directory> -c <X|Y>
- -i: Input directory containing BAM files.
- -o: Output directory where the CSV file will be saved.
- -c: Chromosome option. Use X for ChrX ratio calculation or Y for ChrY ratio calculation.
./ratio.sh -i /mnt/rdisk/tuanthanh/niptune_wf_run/results_bmk_nofil/bwa \
-o /mnt/rdisk/bao_script/zscore_code/Chr_ratio/Data/Ref/tuanthanh/bwa/ChrX_ratio \
-c X
This step takes the output of the Chromosome Ratio Calculation pipeline (Step 1) and applies a Bagging Classification algorithm to classify samples for Sex Chromosome Aneuploidies (SCA). The classifier is trained on gold-standard reference samples that have been clinically validated and labeled, so the model learns from reliable labels.
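As a rough sketch of the classification idea, the snippet below trains scikit-learn's BaggingClassifier on a handful of labeled values and predicts the class of unlabeled ones. The feature values are synthetic, and the base learner (a decision tree), the number of estimators, and the random seed are assumptions; the actual settings used by nipt_x.py are not shown here.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic single-feature training set (hypothetical transformed ChrX ratios)
X_train = np.array([
    [-1.41], [-1.40], [-1.39], [-1.38],   # XO: lower ChrX signal
    [-1.31], [-1.30], [-1.30], [-1.29],   # Healthy
    [-1.22], [-1.21], [-1.20], [-1.19],   # XXX: higher ChrX signal
])
y_train = np.array(["XO"] * 4 + ["Healthy"] * 4 + ["XXX"] * 4)

# Each tree is fit on a bootstrap resample of the training set;
# the ensemble aggregates the trees' predictions.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
model.fit(X_train, y_train)

X_test = np.array([[-1.40], [-1.30], [-1.20]])   # unlabeled samples
predictions = model.predict(X_test).tolist()
print(predictions)
```

With clusters this well separated, the three test values should be assigned XO, Healthy, and XXX respectively. Real data will overlap more, which is where aggregating many bootstrap-trained trees is expected to help.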
To reduce potential bias across samples, the ChrX ratio is log-transformed (using either log2 or log10) before classification. The transformation puts the ratios on a common scale, which is intended to make the classifier more robust when distinguishing normal samples from those with chromosomal abnormalities.
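To make the transformation concrete, here is a minimal sketch using only the standard library. The function name and the sample ratios are hypothetical; the two supported bases mirror the log2 and log10 choices mentioned above.

```python
import math


def log_transform(ratios, log_type="log10"):
    """Apply log2 or log10 to a list of strictly positive ratios."""
    funcs = {"log2": math.log2, "log10": math.log10}
    if log_type not in funcs:
        raise ValueError("log_type must be 'log2' or 'log10'")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive before taking a logarithm")
    return [funcs[log_type](r) for r in ratios]


ratios = [0.05, 0.1, 0.2]   # hypothetical ChrX/Chr1 ratios
print([round(v, 3) for v in log_transform(ratios, "log10")])
# [-1.301, -1.0, -0.699]
print([round(v, 3) for v in log_transform([0.25, 0.5, 1.0], "log2")])
# [-2.0, -1.0, 0.0]
```

The two bases differ only by a constant factor, so the choice rescales the feature without changing the ordering of samples.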
- Python 3.x: Ensure Python 3.x is installed and accessible in your system's PATH.
- Required Python packages: Install the following packages using pip:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - seaborn
- Input CSV files: These should be derived from Step 1's output. Refer to the Data Preparation section for details.
- Optional (for the --table option): WisecondorX output _statistics.txt files are required if you integrate WisecondorX results.
Run the script using the following command:
python nipt_x.py \
--train prepared_data/train_labeled.csv \
--test prepared_data/test_unlabeled_with_gender.csv \
--output_dir results_step2/full_log10 \
--log_type log10 \
--plot \
--table \
--wisecondorx_dir /path/to/my/wisecondorx_outputs
- --train: Required. Path to the prepared training CSV file (with label column).
- --test: Required. Path to the prepared test CSV file (without label column, potentially with gender).
- --output_dir: Required. Directory where all output files from this script will be saved. Created if it doesn't exist.
- --log_type: Required. Log transformation type for the ChrX_Ratio. Choose either log2 or log10.
- --plot: Optional. Generate visualization plots (e.g., decision boundary, example tree).
- --table: Optional. Generate the final output table integrating WisecondorX results. Requires --wisecondorx_dir to be specified or the default path to be valid. Also requires the test CSV to have a gender column.
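For readers who want to see how the options above fit together, the following is a hypothetical argparse layout that mirrors the documented flags. It is not taken from nipt_x.py; in particular, leaving --wisecondorx_dir without a default is an assumption, since the script's default path is not stated here.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="ChrX aneuploidy classification (sketch)")
    parser.add_argument("--train", required=True, help="training CSV with a label column")
    parser.add_argument("--test", required=True, help="test CSV without a label column")
    parser.add_argument("--output_dir", required=True, help="directory for all outputs")
    parser.add_argument("--log_type", required=True, choices=["log2", "log10"])
    parser.add_argument("--plot", action="store_true", help="generate plots")
    parser.add_argument("--table", action="store_true", help="build the WisecondorX table")
    # Assumption: no default here; the real script's default path is not documented above
    parser.add_argument("--wisecondorx_dir", default=None)
    return parser


args = build_parser().parse_args([
    "--train", "prepared_data/train_labeled.csv",
    "--test", "prepared_data/test_unlabeled_with_gender.csv",
    "--output_dir", "results_step2/full_log10",
    "--log_type", "log10",
    "--plot", "--table",
    "--wisecondorx_dir", "/path/to/my/wisecondorx_outputs",
])
print(args.log_type, args.plot, args.table)   # log10 True True
```

Restricting --log_type with choices makes the parser reject any value other than log2 or log10 before any data is read.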