- Introduction
- Tools
- Installation
- Usage
- Use
u4falignfor manipulating alignment results - Difference between
read-SAMandfrag-SAM
Falign is a sequence alignment toolkit for long noisy chromosome conformation capture (3C) reads, such as Pore-C.
Falign is written in C and C++ programming language.
Falign relies on the HTSLIB library. We have dowanloaded one copy of that and save in ./htslib/htslib-1.19.1.tar.bz2 to make it convenient for off-line installation.
Two tools are released together in this toolkit.
falign. The alignment tool.dip3d. A utility used internally byDip3D, a diploid human 3D genome construction algorithm.
$ git clone https://github.com/xiaochuanle/Falign.git$ cd Falign
$ ./htslib/install-htslib.shIn Step 2 above, we install HTSLIB in directory ./htslib/htslib:
$ ls ./htslib/htslib
bin include lib shareWe have to tell Falign where to link HTSLIB. To do this, we export the location of this directory to an environment variable name LIBHTS:
$ export LIBHTS=$(pwd)/htslib/htslib
$ echo $LIBHTS
/share/home/chuanlex/chenying/data/Falign/htslib/htslibBefore installing Falign, make sure you have executed Step 3 above. Otherwise the linker will complain that it cannot find -lhts.
$ cd src/ && make && cd ..Falign is installed in the directory Linux-amd64/bin/:
$ ls Linux-amd64/bin/
dip3d falignThe directory ara-sample-data/ provides sample reference file reference.fa.gz and sample reads file reads.fq.gz.
./Linux-amd64/bin/falign build-repeat ara-sample-data/reference.fa.gz ara-sample-data/reference.fa.gz.repeat.bed./Linux-amd64/bin/falign -repeat_bed ara-sample-data/reference.fa.gz.repeat.bed -num_threads 4 ara-sample-data/reference.fa.gz ara-sample-data/reads.fq.gz > map.pafThe mapping results are output to the file map.paf in PAF format.
Falign supports different output formats (specified by the -outfmt option):
- PAF
- SAM
- BAM
- FRAG-SAM
- FRAG-BAM
In each output result (note that every alignment has for offsets: read start, read end, reference start, reference end), falign adds the following additional fields:
qS:i:the nearest restiction enzyme site to the read start positionqE:i:the nearest restiction enzyme site to the read end positionvS:i:the nearest restriction enzyme site to the reference start positionvE:i:the nearest restriction enzyme site to the reference end positionpi:f:percentage of identity of the alignmentSA:Z:a homologous map of the fragment
By convention, in SAM mapping results, a read may contain multiple mapping results. For saving space, the read sequence is usually only presented in the first mapping result of this read. In the second to last mapping results of this read, the sequence field is filled by a start *. falign outputs SAM mapping results in this manner. And we call SAM mapping results output in this manner read-SAM. Since a Pore-C read always contains multiple fragments, a Pore-C read usually contains many mapping results:
0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1794202 1 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1791136 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1733063 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 16 4 12727850 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1713207 60 ...In the example above, the read 0cd79600-51cf-4255-a6f7-0e9660721e85 contains five alignment results.
Some tools such as whatshap taking BAM format as input will complain if there exist too many duplicate read names. In this case we suggest transfer read-SAM to frag-SAM. In frag-SAM every fragment is treadted as an individual read. We can use the sam2frag-sam command in the u4falign to transfer read-SAM to frag-SAM:
$ u4falign sam2frag-sam map.sam > frag-map.samAfter transformation, we have
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:000:0000000000:0001794201 16 2 1794202 1 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:001:0000000000:0001791135 16 2 1791136 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:002:0000000000:0001733062 0 2 1733063 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:003:0000000001:0012727849 16 4 12727850 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:004:0000000000:0001713206 0 2 1713207 60 ...In frag-SAM, the sequence field in every alignment results is represented by the fragment sequence (not the whose read sequence). Note that we add a suffix to every read name to avoid duplicate read names in the SAM file. The meanings of the fields in the suffix string is
read-id:fragment-id:reference-sequence-id:reference-mapping-position
- Chuan-Le Xiao, xiaochuanle@126.com
- Ying Chen, chenying2016@gmail.com
GPLv3