👤 Author: Gang Xie, PhD candidate
🏫 Affiliation: PKU-THU-NIBS Joint Graduate Program, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
📧 Email: gangx1e@stu.pku.edu.cn
📅 Date: July 25th, 2025
✅ Version: 1.7
MAPIT-seq (Modification Added to RNA-binding Protein Interacting Transcript Sequencing) identifies RNA-binding protein (RBP) target transcripts by detecting adjacent RNA editing events introduced by both hADAR2dd and rAPOBEC1. This approach allows simultaneous profiling of RBP–RNA interactions and the transcriptome in tissues and single cells. The MAPIT-seq pipeline is designed to detect RNA editing events from hADAR2dd and rAPOBEC1, identify RBP targets, pinpoint high-confidence RBP-binding regions, and discover RBP-binding motifs de novo.
If you find this pipeline useful in your work, please cite our paper:
Cheng, Q.-X.#, Xie, G.#, Zhang, X., Wang, J., Ding, S., Wu, Y.-X., Shi, M., Duan, F.-F., Wan, Z.-L., Wei, J.-J., Xiao, J., Wang, Y. Co-profiling of transcriptome and in situ RNA-protein interactions in single cells and tissues. Nature Methods (2025). https://doi.org/10.1038/s41592-025-02774-4
All software used in our pipeline is available through conda, and we recommend installing it that way:
git clone https://github.com/WangLabPKU/MAPIT-seq
cd MAPIT-seq
conda env create -f env_specific.yml # or use env.yml, more flexible but slower
wget https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64.v479/bedGraphToBigWig
chmod +x Mapit src/MAPIT-seq.sh bedGraphToBigWig
In the commands below, replace /home/gangx/apps/Mapit-seq with the path to your own MAPIT-seq directory.
conda activate Mapit-seq
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
cat <<EOF > $CONDA_PREFIX/etc/conda/activate.d/activate-mapit.sh
#!/usr/bin/env sh
export PATH="\$PATH:/home/gangx/apps/Mapit-seq"
EOF
cat <<EOF > $CONDA_PREFIX/etc/conda/deactivate.d/deactivate-mapit.sh
#!/usr/bin/env sh
export PATH=\$(echo "\$PATH" | tr ':' '\\n' | grep -v '^/home/gangx/apps/Mapit-seq\$' | paste -sd:)
EOF
chmod +x $CONDA_PREFIX/etc/conda/activate.d/activate-mapit.sh $CONDA_PREFIX/etc/conda/deactivate.d/deactivate-mapit.sh
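The deactivate hook above removes the MAPIT-seq directory from PATH by splitting PATH on ':', dropping the exact match, and joining the rest again. A minimal sketch of the same idea as a reusable function follows; the name remove_from_path is hypothetical and not part of MAPIT-seq.

```shell
# Hypothetical helper (not part of MAPIT-seq): print the PATH-like string $1
# with every exact occurrence of directory $2 removed.
remove_from_path() {
    # -x matches whole lines only, -F treats $2 as a fixed string (no regex)
    printf '%s' "$1" | tr ':' '\n' | grep -vxF -- "$2" | paste -sd: -
}
```

For example, `PATH=$(remove_from_path "$PATH" /home/gangx/apps/Mapit-seq)` would have the same effect as the hook, assuming that directory was added exactly as written.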
cd ..
git clone https://github.com/BioinfoUNIBA/REDItools2
conda install mpi4py -c bioconda -c conda-forge
cd REDItools2
pip install -r requirements.txt
Please apply the following changes before running the pipeline:
- Add from functools import reduce at line 17 in src/cineca/parallel_reditools.py
- Replace line 573 in src/cineca/parallel_reditools.py with: keys = list(chromosomes.keys())
- In src/cineca/reditools.py, replace sys.maxint (line 817) with sys.maxsize
- In src/cineca/reditools.py, replace "w" (line 912) with "wt"
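The four edits above can also be scripted. The sketch below assumes GNU sed and assumes that your REDItools2 checkout still has the line numbers quoted above; the function name apply_reditools_patches is hypothetical. Check the result with git diff before relying on it.

```shell
# Sketch only: assumes GNU sed and an unmodified REDItools2 checkout whose
# line numbers match the ones cited in this README.
apply_reditools_patches() {
    repo="$1"
    par="$repo/src/cineca/parallel_reditools.py"
    red="$repo/src/cineca/reditools.py"

    # Rewrite line 573 first (keeping its indentation), because inserting
    # the import at line 17 shifts every later line down by one.
    sed -i -E '573s/^([[:space:]]*).*$/\1keys = list(chromosomes.keys())/' "$par"
    sed -i '17i from functools import reduce' "$par"

    # Python 3 replacements in reditools.py, restricted to the cited lines.
    sed -i '817s/sys\.maxint/sys.maxsize/' "$red"
    sed -i '912s/"w"/"wt"/' "$red"
}
```

Run it once from the directory that contains the clone, e.g. `apply_reditools_patches REDItools2`. Running it a second time would insert the import again.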
cd ..
git clone https://github.com/YeoLab/FLARE
conda install snakemake -c bioconda -c conda-forge
pip install deeptools gffutils pyfaidx Bio
Prepare these files beforehand:
- Reference genome FASTA
- Annotation files (.gtf, .gff3)
- RepeatMasker annotation
- Known SNP datasets, split by chromosome (e.g. dbSNP, 1000 Genomes, EVS/EVA)
- Abundant RNA (rRNA, tRNA, mt-rRNA and mt-tRNA) sequences (provided in the ref folder); build their index with BWA-MEM
All configuration is stored in conf/GenomeVersion.json, e.g. conf/GRCh38.json.
Below is an example using the human genome (GRCh38):
cd "your_ref_path"
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/GRCh38.p13.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.chr_patch_hapl_scaff.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.chr_patch_hapl_scaff.annotation.gff3.gz
# RepeatMasker
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz # for mouse, replace hg38 with mm10/mm39
gzip -d *.gz
# ERCC spike-in
wget https://assets.thermofisher.cn/TFS-Assets/LSG/manuals/ERCC92.zip
unzip ERCC92.zip
🔺 We recommend retaining only autosomal, allosomal (sex chromosomes), and mitochondrial sequences—i.e., entries with the "chr" prefix—in both FASTA and GTF/GFF annotation files.
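One way to apply this recommendation is to filter on the "chr" prefix with awk. The sketch below is an illustration, not the authors' procedure; the function names keep_chr_fasta and keep_chr_annotation are hypothetical, and both read from standard input.

```shell
# Keep only FASTA records whose header line starts with ">chr".
keep_chr_fasta() {
    awk '/^>/ { keep = ($0 ~ /^>chr/) } keep'
}

# Keep comment lines and GTF/GFF3 rows whose first column starts with "chr".
keep_chr_annotation() {
    awk -F'\t' '/^#/ || $1 ~ /^chr/'
}
```

For example, `keep_chr_fasta < GRCh38.p13.genome.fa > GRCh38.p13.genome.chr.fa` (the output file name is an assumption). Use the same filtered set of sequences for the FASTA and for both annotation files.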
Download known variant annotations: dbSNP, 1000 Genomes, EVS and EVA
genomeVersion=GRCh38
mkdir "your_ref_path"/${genomeVersion}_SNP
cd "your_ref_path"/${genomeVersion}_SNP
# human (hg38/GRCh38)
wget https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/GATK/All_20180418.vcf.gz # use the file in the VCF/GATK folder; column 1 starts with "chr"
# mouse (mm10/GRCm38)
wget https://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz
# human (hg38/GRCh38)
wget http://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz
tar -xvf ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz
mkdir -p EVS_split_chr # output directory used by the loop below
for i in {1..22} X Y; do
awk '{if(substr($1, 1, 1) == "#"){print $0}else if((length($4) == 1) && (length($5) == 1)) {gsub("MT","M");{if($1 ~ "chr") print $0; else print "chr"$0 }}}' ESP6500SI-V2-SSA137.GRCh38-liftover.chr${i}.snps_indels.vcf | gzip > EVS_split_chr/chr${i}.gz
done
# human (hg38/GRCh38): or download the EVS_split_chr archive from Zenodo
wget -O EVS_split_chr.zip "https://zenodo.org/records/17089899/files/EVS_split_chr.zip?download=1"
unzip EVS_split_chr.zip
# mouse (mm10/GRCm38)
wget http://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_3/by_species/mus_musculus/GRCm38.p4/GCA_000001635.6_current_ids.vcf.gz
mkdir EVA_split_chr
zcat GCA_000001635.6_current_ids.vcf.gz | awk -v dir_SNP=EVA_split_chr '{if(substr($1, 1, 1) == "#"){print $0 > "EVA_header"}else if((length($4) == 1) && (length($5) == 1)) {gsub("MT","M");{if($1 ~ "chr") print $0 > dir_SNP"/"$1; else print "chr"$0 > dir_SNP"/chr"$1 }}}'
gzip EVA_split_chr/chr*
# mouse (mm39/GRCm39)
# Not recommended due to the lack of supporting dbSNP data for mm39 in NCBI. Consider lifting over from mm10/GRCm38 if needed.
wget http://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_3/by_species/mus_musculus/GRCm39/GCA_000001635.9_current_ids.vcf.gz
# human
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
mkdir 1000genomes_split_chr
zcat ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz | awk -v dir_SNP=1000genomes_split_chr '{if(substr($1, 1, 1) == "#"){print $0 > "1000genomes_header"}else if((length($4) == 1) && (length($5) == 1)) {gsub("MT","M");{if($1 ~ "chr") print $0 > dir_SNP"/"$1; else print "chr"$0 > dir_SNP"/chr"$1 }}}'
gzip 1000genomes_split_chr/chr*
The 1000 Genomes Project has no SNP data for mouse, so create empty placeholder files in a void_split_chr folder instead.
# mouse
# genome_fasta must point to the mouse genome FASTA prepared above
mkdir -p void_split_chr
chroms=($(grep '>' "$genome_fasta" | sed 's/>//' | awk '{print $1}' | grep 'chr'))
for chr in "${chroms[@]}"; do touch void_split_chr/${chr}; done
gzip void_split_chr/chr*
File names in the *_split_chr directories should begin with the "chr" prefix. If they do not, manually prepend "chr" to keep naming consistent across datasets.
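If a split directory ends up with names such as 1.gz or MT.gz, the prefix can be added in bulk. This is a minimal sketch under the assumption that the only differences are a missing "chr" prefix and "MT" in place of "M"; the function name ensure_chr_prefix is hypothetical.

```shell
# Hypothetical helper: rename files in a *_split_chr directory so that every
# file name starts with "chr" (e.g. "1.gz" -> "chr1.gz", "MT.gz" -> "chrM.gz").
ensure_chr_prefix() {
    dir="$1"
    for f in "$dir"/*; do
        [ -e "$f" ] || continue                      # empty directory
        name=$(basename "$f")
        case "$name" in
            chr*) ;;                                 # already consistent
            MT*)  mv -- "$f" "$dir/chrM${name#MT}" ;; # mitochondrion: MT -> chrM
            *)    mv -- "$f" "$dir/chr$name" ;;
        esac
    done
}
```

Usage: `ensure_chr_prefix 1000genomes_split_chr`. This only renames files; the chromosome names inside each VCF must already carry the "chr" prefix, as produced by the awk commands above.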
- MAPIT configuration
After downloading the genome sequence (.fasta), genome annotations (.gff3), RepeatMasker annotations, and known SNP datasets split by chromosome, you can run the following command to generate the configuration file. This step includes building the genome sequence index, extracting gene element annotations, and creating chromosome-split dbSNP VCF files.
Mapit config --genomeVersion GRCh38 \
--genomeFasta "full_path"/GRCh38.p13.genome.fa \
--ERCC "full_path"/"ERCC.fa" \
--species human \
--outpath "full_path" \
--genomeAnno "full_path"/gencode.v40.chr_patch_hapl_scaff.annotation.gff3 \
--rmsk "full_path"/rmsk.txt \
--dbSNP "full_path"/GRCh38_SNP/All_20180418.vcf.gz \
--1000Genomes "full_path"/GRCh38_SNP/1000genomes_split_chr \
--EVSEVA "full_path"/GRCh38_SNP/EVS_split_chr \
--Reditools "full_path"/Directory_of_RediTools2.0_software \
--FLARE "full_path"/Directory_of_FLARE_software
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version (e.g., GRCh38)
  - -s {human,mouse}, --species {human,mouse}: Species identifier (human or mouse)
  - -f GENOMEFASTA, --genomeFasta GENOMEFASTA: Path to genome FASTA file
  - -E ERCC, --ERCC ERCC: Path to ERCC spike-in FASTA file
  - -a GENOMEANNO, --genomeAnno GENOMEANNO: Path to genome annotation file in GFF3 format
  - -r RMSK, --rmsk RMSK: Path to RepeatMasker annotation file downloaded from UCSC
  - -o OUTPATH, --outpath OUTPATH: Output directory for annotation files for Mapit
  - --dbSNP DBSNP: Path to dbSNP VCF file
  - --1000Genomes 1000GENOMES: Directory of VCF files of 1000Genomes split by chromosome
  - --EVSEVA EVSEVA: Directory of VCF files of EVS/EVA split by chromosome
  - --hisat2Index HISAT2INDEX: (Optional) Path to pre-built HISAT2 index
  - --Reditools REDITOOLS: Directory of RediTools2.0 software
  - --FLARE FLARE: Directory of FLARE software
  - --overwrite: Overwrite existing configuration file if present
This command will generate a GenomeVersion.json configuration file under the conf directory of the specified output path, which is required for downstream MAPIT-seq analysis.
- FLARE configuration
To run the FLARE pipeline, you must first generate region files that define genomic intervals for cluster identification. This can be done using the generate_regions.py script. See the FLARE repository for details.
"full_path_to_FLARE"/workflow_FLARE/scripts/generate_regions.py <full/path/to/your/genome/gtf/file> <genome_name>_regions
- Homer configuration
To keep analyses standardized, HOMER organizes promoters, genome sequences and annotations into packages. See http://homer.ucsd.edu/homer/introduction/configure.html for details.
perl configureHomer.pl -install hg38
---
config:
theme: 'base'
themeVariables:
primaryColor: '#25b1bbff'
primaryTextColor: '#fff'
primaryBorderColor: '#055516ff'
lineColor: '#bacdf1ff'
secondaryColor: '#006100'
tertiaryColor: '#fff'
---
graph TD;
A{{Clean reads}}-->B(**mapping**
Two-round uniquely mapping);
B-->A1
C(**finetuning**
Fine tune alignments);
C-->D(**callediting**
GATK RNA editing detection);
D-->E(**calltargets**
Differential editing analysis to call RBP binding targets)
B-->A2
F(**prepare**
Prepare files for SAILOR);
F-->G(**SAILOR**
RNA editing identification);
G-->H(**FLARE**
Edit cluster identification);
H-->I(**hc_cluster**
High-confidence edit cluster identification)
I-->J(HOMER
RBP binding motif *de novo* identification)
B-->A3
K(featureCounts
Gene expression quantification)
K-->L(DESeq2
Differential expressed gene analysis)
L-->M(clusterProfiler
Functional analysis)
subgraph A1[Transcript binding level]
direction TB
C
D
E
end
subgraph A2[RBP-binding site level]
direction TB
F
G
H
I
J
end
subgraph A3[Expression level]
direction TB
K
L
M
end
N(***RBP-RNA interactome
and transcriptome
co-profiling***)
A1-.->N
A2-.->N
A3-.->N
style A fill: #7cb8c7ff,stroke:#fff,stroke-width:2px;
style B stroke:#fff,stroke-width:2px;
style C fill: #6e9beeff,stroke:#fff,stroke-width:2px;
style D fill: #6e9beeff,stroke:#fff,stroke-width:2px;
style E fill: #6e9beeff,stroke:#fff,stroke-width:2px;
style F fill: #56caa4ff,stroke:#fff,stroke-width:2px;
style G fill: #56caa4ff,stroke:#fff,stroke-width:2px;
style H fill: #56caa4ff,stroke:#fff,stroke-width:2px;
style I fill: #56caa4ff,stroke:#fff,stroke-width:2px;
style J fill: #56caa4ff,stroke:#fff,stroke-width:2px;
style K fill: #a0ca69ff,stroke:#fff,stroke-width:2px;
style L fill: #a0ca69ff,stroke:#fff,stroke-width:2px;
style M fill: #a0ca69ff,stroke:#fff,stroke-width:2px;
style N fill: #b1a054ff,stroke:#fff,stroke-width:2px;
style A1 fill: #ffffff16,stroke:#bbb,stroke-width:2px;
style A2 fill: #ffffff16,stroke:#bbb,stroke-width:2px;
style A3 fill: #ffffff16,stroke:#bbb,stroke-width:2px;
This is a dual-omics analysis framework for MAPIT-seq data that enables the simultaneous interrogation of RBP–RNA interactions and gene expression profiles. The example code provided here demonstrates a bulk MAPIT-seq workflow focused on RBP target identification. For the gene-level transcriptome analysis component, please refer to standard RNA-seq analysis procedures (external tools and workflows are typically used, e.g., featureCounts, DESeq2, and clusterProfiler).
In addition to the bulk workflow, we also provide customized pipelines for single-cell and long-read MAPIT-seq data. These can be found in the pipeline/ directory.
The MAPIT-seq pipeline consists of five main steps. The first step combines abundant RNA filtering and read alignment:
Mapit mapping -v GRCh38 --fq R1.fq.gz --fq2 R2.fq.gz --rna-strandness FR -n EN -r 1 -o Mapit_result -t THREAD
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version (e.g., GRCh38)
  - --fq FQ: FASTQ file for R1 (paired-end) or single-end reads
  - --fq2 FQ2: FASTQ file for R2 (paired-end reads only)
  - --rna-strandness {F,R,FR,RF}: Strand-specific information for HISAT2. For single-end reads, use F or R. For paired-end reads, use either FR or RF. A detailed description of this option is available in the HISAT2 manual (https://ccb.jhu.edu/software/hisat2/manual.shtml)
  - -n SAMPLENAME, --sampleName SAMPLENAME: Sample name prefix (avoid underscores)
  - -r REPLICATE, --replicate REPLICATE: Replicate ID used as the second prefix in output files
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: ./Mapit_result)
  - -t THREAD, --thread THREAD: Maximum threads used for computation (default: 10)
  - --ERCC: Add this flag if ERCC spike-in controls were used (default: False)
This step generates 4 subdirectories in the output path and outputs mapping .bam files.
Mapit finetuning -v GRCh38 -n EN -r 1 -o "output_path"/Mapit_result
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version
  - -n SAMPLENAME, --sampleName SAMPLENAME: Sample name prefix
  - -r REPLICATE, --replicate REPLICATE: Replicate ID used as the second prefix in output files
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: ./Mapit_result). Must match the path used in the mapping step.
Supports simultaneous analysis of multiple samples.
Mapit callediting -v GRCh38 --sampleList EN,EM,NN,NM -o "output_path"/Mapit_result --prefix RBP -e Both [-t THREAD]
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version
  - --sampleList SAMPLELIST: Comma-separated list of sample names
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: "./Mapit_result"). Must match the path used in the mapping and finetuning steps
  - --prefix PREFIX: Prefix for output directories within the result path
  - -e {ADAR,APOBEC,Both}, --enzyme {ADAR,APOBEC,Both}: RNA-editing enzymes used. Both (ADAR and APOBEC): MAPIT-seq; ADAR: HyperTRIBE or TRIBE; APOBEC: STAMP (default: Both)
  - -t THREAD, --thread THREAD: Maximum threads used for computation (default: 10)
Performs differential RNA editing analysis. Results are saved to: ${outpath}/6-Edit_calling/${prefix}${treatSampleName}/.
Mapit calltargets -v GRCh38 -i "output_path"/Mapit_result/6-Edit_calling/RBP/RBP_Edit_GE_RPE_DATA.tsv --treatName EM --controlName EN -o "output_path"/Mapit_result -p 0.05
- Options:
  - -h, --help: Show this help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version
  - -i INPUTEDIT, --inputEdit INPUTEDIT: Input table of RNA editing events (file generated in the callediting step)
  - -l {transcript,gene,Both}, --level {transcript,gene,Both}: Perform differential editing analysis at the transcript or gene (including introns) level
  - --treatName TREATNAME: Sample name for treatment samples in output directory
  - --controlName CONTROLNAME: Sample name for control samples in output directory
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: "./Mapit_result"). Must match the path used in the mapping, finetuning and callediting steps
  - -b BINSIZE, --binSize BINSIZE: The length of bins that are split from transcripts (default: 50)
  - -c COVERAGE, --coverage COVERAGE: Minimum coverage for valid editing sites (default: 10)
  - --supply-zero: Supply zero to non-editing bins (default: False)
  - -p PVALUE, --pvalue PVALUE: Maximum p-value for Poisson filter of editing sites (default: 0.1)
  - --dropTreatRep DROPTREATREP: Replicate IDs to exclude for treatment samples in output directory (e.g. 1,4,5)
  - --dropControlRep DROPCONTROLREP: Replicate IDs to exclude for control samples in output directory (e.g. 1,4,5)
  - --singleCell: Flag for single-cell input (replicate = individual cell) (default: False)
- Prepare files for SAILOR workflow.
Mapit prepare -v GRCh38 -n EN -r 1 -o "output_path"/Mapit_result
- Options: see the finetuning step
- Run SAILOR workflow
Mapit SAILOR -v GRCh38 -n EN -r 1 -c coverage -o "output_path"/Mapit_result -t THREAD
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version
  - -n SAMPLENAME, --sampleName SAMPLENAME: Sample name prefix
  - -r REPLICATE, --replicate REPLICATE: Replicate ID used as the second prefix in output files
  - -c COVERAGE, --coverage COVERAGE: Minimum coverage for valid editing sites (default: 10)
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: "./Mapit_result"). Must match the path used in the mapping step
  - -t THREAD, --thread THREAD: Maximum threads used for computation (default: 10)
- Run FLARE workflow to identify edit clusters
Mapit FLARE -v GRCh38 -e AG -n EN -r 1 --regions <genome_name>_regions -o "output_path"/Mapit_result -t THREAD
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version
  - -n SAMPLENAME, --sampleName SAMPLENAME: Sample name prefix
  - -r REPLICATE, --replicate REPLICATE: Replicate ID used as the second prefix in output files
  - --regions REGIONS: Directory of FLARE region files for the genome (see the FLARE configuration part)
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: "./Mapit_result"). Must match the path used in the above steps
  - -t THREAD, --thread THREAD: Maximum threads used for computation (default: 10)
  - -e {AG,CT}, --edittype {AG,CT}: Editing types
Mapit hc_cluster -v GRCh38 -n EN -o "output_path"/Mapit_result -s 150
- Options:
  - -h, --help: Show help message and exit
  - -v GENOMEVERSION, --genomeVersion GENOMEVERSION: Genome build version
  - -n SAMPLENAME, --sampleName SAMPLENAME: Sample name prefix
  - -o OUTPATH, --outpath OUTPATH: Output directory (default: "./Mapit_result"). Must match the path used in the above steps
  - -s SLOPLENGTH, --sloplength SLOPLENGTH: Length by which high-confidence clusters are extended on both the upstream and downstream sides
After completing all analysis steps, the Mapit_result/ directory will contain multiple output files and subdirectories corresponding to each processing stage. For a detailed description of the output structure, please refer to the structure.md file in the MAPIT-seq repository.
See the LICENSE file for more details.
© 2025 WangLabPKU.
