Single Amino Acids Polymorphism Statistics (SAAPS) is a Python3-based character interface program that can be used and deployed on multiple systems, including Windows, Linux, and MacOS. The software is mainly used to quickly calculate single amino acid polymorphism (SAP), Shannon information entropy at each site of the sequence, and to analyze sequence features and cluster analysis by using Onehot encoding and dimensionality reduction algorithms.
The raw input file of the SAAPS software is very simple, as long as the sequence file after alignment (nucleotide or amino acid), it can be used for processing and analysis. The main workflow of SAAPS is as follows: Firstly, by analyzing the characteristics of each sequence, the SAPs and Shannon information entropy of each point are calculated, and the corresponding result figures can be drawn at the same time. Then SAAPS screens the key sites according to the specified information entropy range, and transforms them into quantificable data by using Onehot encoding. Finally, the certain dimensionality reduction algorithm (PCA or t-SNE) is used to calculate the similarity level between sequences and draw the clustering scatter plot.
SAAPS is developed via Python 3, and you can get and install in following ways.
SAAPS has been distributed to the standard library pf PYPI, and can be easily installed by the tool pip.
pip install saaps
saaps -h
In addition to the pip method, you can download and install it using the file setup.py.
You need to download the repository, and then run:
python setup.py install
saaps -h
You can use "-h" or "--help" to get the help document. The following is a brief introduction to the each parameter of SAAPS:
Input Options
| Parameter | Description |
|---|---|
| -i, --input | the path of input file |
| -sr, --shannon_result | the path of shannon result |
Basic Options
| Parameter | Description |
|---|---|
| -dg, --delete_gap | delete the columns containing gaps in sequences |
| -t, --translate | translate nucleotides to amino acids |
| -cs, --cut_start | the start fragment/site of target sequence |
| -ce, --cut_end | the end fragment/site of target sequence |
| -osn, --output_seq_name | output sequence names |
| -csn, --change_seq_name | change the sequence names |
Polymorphism Options
| Parameter | Description |
|---|---|
| -cp, --compute_ppm | compute polymorphism |
| -pp, --plot_ppm | plot polymorphism figure |
Plotting Options
| Parameter | Description |
|---|---|
| -fw, --figure_width | the width of figure |
| -fh, --figure_height | the height of figure |
| -fname, --figure_name | the name of figure |
| -c, --color | the palette of figure |
| -pac, --print_all_colors | print all supported palettes |
| -dpi, --DPI | the dpi of the figure |
| -fm, --figure_format | the format of figure |
| -tp, --transparent | make the background of the figure transparent |
| -sl, --scatter_label | display scatter plot labels |
Shannon Entropy Options
| Parameter | Description |
|---|---|
| -sha, --shannon | compute shannon entropy |
| -smax, --shannon_max | the maximum of the shannon value (IC) |
| -smin, --shannon_min | the minimum of the shannon value (IC) |
| -pic, --plot_ic | plot IC scatter figure |
Dimension Reduction Options
| Parameter | Description |
|---|---|
| -oh, --onehot | OneHot Encoding |
| -pca, --PCA | dimensionality reduction by PCA |
| -tsne, --TSNE | dimensionality reduction by t-SNE |
| -tpp, --tSNE_perplexity | the perplexity in t_SNE, default is 3 |
| -tlr, --tSNE_learn_rate | the learning rate in t_SNE, default is 200 |
| -tni, --tSNE_n_iter | the n_iter in t_SNE, default is 5000 |
| -trs, --tSNE_random_state | the random state in t_SNE, default is 1 |
| -pc, --plot_cluster | plot clustering scatter plot |
Output Options
| Parameter | Description |
|---|---|
| -o, --output_dir | the directory path of output file, the default is the current user document folder |
| -pre, --prefix | the prefix of the output file name |
This chapter uses the test data ('testSeq.txt' in the test folder) provided by the software as an example to describe the main functions of SAAPS. Note that all sample files in this manual are generated by the software using test data (" testSeq.txt "), and users can follow the tutorial to learn it before exploring other features on their own.
Users needs to prepare an alignment sequence file (nucleotide or amino acid file), the alignment process can be completed by MAFFT and other software, and the format of the sequence file after alignment should be "sequence name + (line/multiple lines) sequence body" structure. The aligned sequence file should look like this:
>TestSeqNo.1 Sequence1
GPQSGAIYVG……LWLDEEAMEQ
>TestSeqNo.2 Sequence2
GPLGQQSGAV……LWLDDEVMEQ
...
Before the subsequent analysis, it is usually necessary to format the sequence, such as whether the nucleotide sequence needs to be translated into amino acids, whether the sequence name needs to be changed, and so on. This allows us to get the best results in the subsequent analysis and drawing. Here we use the example of modifying the sequence name:
saaps -i your/seq/path -osn -o your/output/path
Code explanation: -i your/seq/path indicates that the matched sequence file is passed. -osn indicates the sequence name in the output sequence file. By default, the output is in a .csv table. -o your/output/path indicates the output path. If this parameter is not specified, the result will be output in the default path.After the program is executed, a file named SeqName.csv will be generated, which can be used for users to modify the sequence names individually. It is recommended to save the modified new names in the second column, so that subsequent programs can modify the sequence names in batches. The format of the file is as follows:
| Old_Name | |
|---|---|
| >TestSeqNo.1 Sequence1 | |
| >TestSeqNo.1 Sequence2 | |
| ... |
saaps -i your/seq/path -csn your/names/file/path -o your/output/path
Code explanation:In this step, -csn your/names/file/path represents the matching table of the old and new names in the sequence, which is the table output in the previous step. The pairing table format for the sequence names is as follows:
| Old_Name | New_Name |
|---|---|
| >TestSeqNo.1 Sequence1 | >Seq1 |
| >TestSeqNo.1 Sequence2 | >Seq2 |
| ... | ... |
The modified sequence file is named "NewNameSeq.txt" by default. The comparison between the original sequence file and the modified sequence file is as follows:
| Old Seq File | New Seq File |
|---|---|
| >TestSeqNo.1 Sequence1 | >Seq1 |
| GPQSGAIYVG……LWLDEEAMEQ | GPQSGAIYVG……LWLDEEAMEQ |
| ... | ... |
Although SAAPS was developed only for the analysis of SAP, the basis of polymorphism analysis is statistical analysis of the symbols and frequencies of each site in the sequence, so SAAPS can also be applied to sequences such as nucleotides.
saaps -i your/seq/path -cp -pp -o your/output/path
Code explanation:-csn your/names/file/path represents the matching table of the old and new names in the sequence, which is the table output in the previous step. -cp represents to activate the computational polymorphism function. -pp represents that the polymorphism result is plotted.
The statistical result of polymorphism analysis will generate three files named Concise_Result.txt, Detail_Result.txt, and Seq_Matrix.csv. If the drawing parameter (-pp) is set, A resulting graph (default) named SAPs.pdf is generated accordingly. The main result files are explained as follows:
Concise_Result.txt: a concise result of polymorphism analysis, which mainly records the frequency of amino acids at each site, and is also an input file for the subsequent calculation of information entropy at each point. The format is shown as follows (part) :
Site 1 (G, 15, 100.00%)
Site 2 (K, 7, 46.67%) (R, 1, 6.67%) (C, 1, 6.67%) …
… … … … …
Detail_Result.txt: detailed result of polymorphism analysis, which records the sequence name of each amino acid at each site. Its file content is shown below (part) :
| col1 | col2 | col3 | coln |
|---|---|---|---|
| Site1 | (G: All Sequences) | ||
| Site2 | (P: >Seq2,>Seq1) | (K: >Seq10,>Seq15,>Seq7,>Seq13,>Seq9,>Seq12,>Seq14) | ... |
| Site3 | (F: Most Sequences) | (L: >Seq2) | ... |
| ... | ... | ... | ... |
Seq_Matrix.csv: indicates the sequence amino acid matrix. The main contents of the table are as follows (part) :
| Seq1 | Seq2 | Seq3 | Seq4 | Seq5 | Seq6 | Seq7 | ... |
|---|---|---|---|---|---|---|---|
| G | G | G | G | G | G | G | ... |
| P | P | A | A | A | A | K | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
SAPs.pdf: Polymorphism distribution map. The example figure is shown below, and users can customize the resulting graph according to the drawing parameters provided by the software:
In the SAAPS calculation results, information content (IC value) is used to replace information entropy, and the higher the IC value, the more conservative the amino acid at the site. During SAAPS information entropy analysis, two result files, Shannon_log.txt and Shannon_IC_Result.csv, are generated by default.
saaps -i your/Concise/result/path -sha -pic -o your/output/path
Code explanation: When calculating information entropy, we need to pass the Concise_Result.txt file in the previous step of polymorphism analysis, -sha means to activate the information entropy calculation function, -pic means to draw the information entropy result, and a picture file named IC-Plot.pdf is generated in the output path by default.
Shannon_log.txt: indicates the simple statistical result of IC analysis. The contents of the document are as follows (part) :
Basic Statistic Information of the Shannon Entropy (IC)
IC
Count 150
mean 3.8647165333333335
... ...
Shannon_IC_Result.csv: IC values of all sites in sequence. The table contains all site IC values, which are convenient for the subsequent drawing of IC figure. The contents of the table are as follows (part) :
| SITE | IC |
|---|---|
| Site1 | 4.45943 |
| Site2 | 2.52919 |
| Site3 | 3.75947 |
| Site4 | 3.75947 |
| ... | ... |
IC-Plot.pdf: Sequence IC scatter figure. The figure is as follows:
saaps -i your/Seq/Matrix/path -sr your/IC/Result/path -smin ICmin -smax ICmax -o your/output/path
Code explanation:During the Onehot encoding process, the Seq_Matrix.csv file from the polymorphism analysis in the previous step and the Shannon_IC_Result.csv file from the information entropy calculation need to be passed in. If only the sites within the threshold range need to be converted, You can use -smin and -smax to set the IC threshold range. In the Onehot encoding step, the software automatically generates 4 files: OneHot-Original.csv, One-Hot-ConciseMatrix.csv, OneHot-Transform.csv, and One-Hot-ForIntersect.csv.
OneHot-Original.csv: a table of polymorphic loci selected according to the IC threshold range and used for onehot encoding. The file content is shown as follows (part) :
| 3 | 4 | 5 | 6 | 7 | 8 | ... | |
|---|---|---|---|---|---|---|---|
| >Seq1 | Q | S | G | A | I | Y | ... |
| >Seq2 | L | G | Q | Q | S | G | --- |
| ... | ... | ... | ... | ... | ... | ... | ... |
One-Hot-ConciseMatrix.csv: the process file after the Onehot encoding of OneHot-Original.csv, where each amino acid is represented by the corresponding binary number. The file content is shown as follows (part) :
| 3 | ... | |
|---|---|---|
| Seq1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] | ... |
| Seq2 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | ... |
| ... | ... | ... |
OneHot-Transform.csv: the result of upgrading the dimensions according to the One-Hot-ConciseMatrix.csv file, in which the binary list of each point is expanded into the corresponding number dimensions, and each dimension only contains the corresponding values in the binary list. The main contents of the file are shown as follows (part) :
| 3_1 | 3_2 | 3_3 | .. | |
|---|---|---|---|---|
| Seq1 | 0 | 0 | 0 | ... |
| Seq2 | 0 | 0 | 0 | ... |
| ... | ... | ... | ... | ... |
One-Hot-ForIntersect.csv: indicates the record list of sequence polymorphism loci. The table records the amino acids of each sequence at the polymorphic site, which can be used for differential complement analysis of subsequent sets, etc. The contents of the table are as follows (part) :
| 3 | 4 | 5 | 6 | 7 | ... | |
|---|---|---|---|---|---|---|
| >Seq1 | 3Q | 4S | 5G | 6A | 7I | ... |
| >Seq2 | 3L | 4G | 5Q | 6Q | 7S | ... |
| >Seq3 | 3F | 4G | 5Q | 6Q | 7S | ... |
| ... | ... | ... | ... | ... | ... | ... |
The function of Onehot encoding is to convert classified data into numerical data that can be quantitatively analyzed. After the onehot encoding conversion, the information of each location of each sequence is converted into high-dimensional binary data. Therefore, in order to better analyze the sequence features, we usually adopt a certain dimensionality reduction algorithm to analyze it and observe the relationship between each sequence. Here we take the most commonly used PCA algorithm as an example.
saaps -i your/Transform/result/path -pca -pc -o your/output/path
Code explanation:In the process of dimensionality reduction analysis, the OneHot-Transform.csv file in the previous step needs to be passed in; -pca indicates that the PCA dimensionality reduction function is activated. -pc means to draw a scatter plot based on the dimensionality reduction results. This process produces two result files called PCA.csv and PCA-Plot.pdf.
PCA.csv: PCA calculation result of sequence file. The contents of the document are as follows (part) :
| 0 | 1 | 2 | 3 | ... | |
|---|---|---|---|---|---|
| Seq1 | 6.254129 | -2.01068 | -0.27758 | -0.33611 | ... |
| Seq2 | -0.74108 | -0.47749 | -0.1452 | -1.75072 | ... |
| ... | ... | ... | ... | ... | ... |
PCA-plot.pdf: Based on the PCA analysis results, the scatter Plot is generated according to the default parameters, and the example figure is as follows:


