Signal peptides are short sequences at the N-terminus of proteins that direct them to the secretory pathway and are typically cleaved after translocation (1). In silico prediction of signal peptides is crucial for protein functional annotation and subcellular localization.
Objective: Retrieve positive and negative datasets of eukaryotic proteins from UniProtKB.
The full description of the procedure can be found in the README.md of the data_colection folder.
| Section | Title |
|---|---|
| a | Selection criteria |
| b | Filtering the Positive Dataset |
| c | Implementation notes |
| d | Output files |
| e | Reproducibility |
| Datasets | Positive | Negative |
|---|---|---|
| Query | (existence:1) AND (length:[40 TO *]) AND (reviewed:true) AND (fragment:false) AND (taxonomy_id:2759) AND (ft_signal_exp:*) | (existence:1) AND (length:[40 TO *]) AND (reviewed:true) AND (fragment:false) AND (taxonomy_id:2759) NOT (ft_signal:*) AND ((cc_scl_term_exp:SL-0091) OR (cc_scl_term_exp:SL-0191) OR (cc_scl_term_exp:SL-0173) OR (cc_scl_term_exp:SL-0204) OR (cc_scl_term_exp:SL-0209) OR (cc_scl_term_exp:SL-0039)) |
| No. entries | Before filtering: 2,949; after filtering: 2,932 | 20,615 |
| Output | eukarya_SP_pos.tsv, pos.fasta | eukarya_SP_neg.tsv, neg.fasta |
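As an illustration of how a query from the table can be turned into a request, here is a minimal sketch that builds a search URL for the public UniProt REST endpoint. It only constructs the URL and does not download anything. The function name `build_search_url` and the choice of `tsv` output with a page size of 500 are assumptions made for this example; the scripts in the data collection folder may work differently.

```python
from urllib.parse import urlencode

# Public UniProtKB search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

# Positive-set query, copied from the table above.
POSITIVE_QUERY = (
    "(existence:1) AND (length:[40 TO *]) AND (reviewed:true) "
    "AND (fragment:false) AND (taxonomy_id:2759) AND (ft_signal_exp:*)"
)


def build_search_url(query, fmt="tsv", size=500):
    """Return a URL-encoded search URL (hypothetical helper, not from the repo)."""
    params = {"query": query, "format": fmt, "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


url = build_search_url(POSITIVE_QUERY)
print(url)
```

Results larger than one page have to be fetched page by page, so a real download script would loop over the pages returned by the API.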
Objective: Reduce redundancy in the datasets, generate training and benchmarking sets, and create 5-fold cross-validation subsets for robust model evaluation.
The full description of the procedure can be found in the README.md of the data_split folder.
| Section | Title |
|---|---|
| a | Clustering |
| b | Extract Representative IDs |
| c | Metadata Collection |
| d | Data Splitting and Cross-Validation |
| e | Output |
Clustering
| Dataset | Input sequences | No. of clusters | File |
|---|---|---|---|
| Positive | 2,932 | 1,093 | cluster-results-pos_rep_seq.fasta |
| Negative | 20,615 | 8,934 | cluster-results-neg_rep_seq.fasta |
Extract Representative IDs and Metadata Collection
The ID lists were randomized and split. The output files were used to filter the collective .tsv file. Two .tsv files were obtained to organize metadata related to positive and negative datasets.
| Section | Scripts | Files |
|---|---|---|
| b | extract_rep_ids.py | neg_rep_id.txt pos_rep_id.txt |
| c | organizing_metadata.py | |
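The ID extraction step can be sketched as follows. This is not the content of `extract_rep_ids.py`; it is a minimal illustration that assumes the representative FASTA headers are either a bare accession or in the UniProt `db|accession|name` style. The accessions in the usage example are placeholders.

```python
def extract_rep_ids(fasta_lines):
    """Collect one ID per FASTA header line (illustrative sketch)."""
    ids = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        tokens = line[1:].split()
        if not tokens:
            continue  # skip empty headers
        parts = tokens[0].split("|")
        # 'sp|ACCESSION|NAME' -> accession is the second field;
        # otherwise keep the first token as the ID.
        ids.append(parts[1] if len(parts) >= 3 else tokens[0])
    return ids


example = [
    ">sp|P00001|EXAMPLE_HUMAN placeholder description",
    "MKTLLLTLVVVTIVCLDLGYT",
    ">Q00002",
    "MSTNPKPQRKTKRNTNRRPQD",
]
print(extract_rep_ids(example))  # ['P00001', 'Q00002']
```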
Output
| Set / Fold | Negative sequences | Positive sequences | Total sequences |
|---|---|---|---|
| Benchmarking | 1,787 | 219 | 2,006 |
| Fold 1 | 1,430 | 175 | 1,605 |
| Fold 2 | 1,430 | 175 | 1,605 |
| Fold 3 | 1,429 | 175 | 1,604 |
| Fold 4 | 1,429 | 175 | 1,604 |
| Fold 5 | 1,429 | 174 | 1,603 |
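The set sizes in the table are consistent with holding out about 20% of each class for benchmarking and dealing the remaining IDs into five folds. The sketch below reproduces those sizes under that assumption. The 80/20 ratio is inferred from the numbers rather than stated here, and the seed and function name are hypothetical.

```python
import random


def split_ids(ids, benchmark_fraction=0.2, n_folds=5, seed=42):
    """Shuffle IDs, hold out a benchmarking set, and deal the rest into folds."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_bench = round(len(shuffled) * benchmark_fraction)
    benchmark, training = shuffled[:n_bench], shuffled[n_bench:]
    # Round-robin assignment keeps fold sizes within one of each other.
    folds = [training[i::n_folds] for i in range(n_folds)]
    return benchmark, folds


pos_bench, pos_folds = split_ids(range(1093))
neg_bench, neg_folds = split_ids(range(8934))
print(len(pos_bench), [len(f) for f in pos_folds])  # 219 [175, 175, 175, 175, 174]
print(len(neg_bench), [len(f) for f in neg_folds])  # 1787 [1430, 1430, 1429, 1429, 1429]
```

Splitting positives and negatives separately, as done here, keeps the class ratio roughly constant across the benchmarking set and every fold.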
Objective: Understand the structure and characteristics of the dataset.
The data visualization step provides insights into the characteristics of the positive and negative protein datasets used in this study. The plots were generated in Python using matplotlib and seaborn.
The full description of the procedure can be found in the README.md of the data_analysis folder.
| Section | Title |
|---|---|
| a | Analyses |
| b | Plot Summary |
| c | Results |
| Description and Plot Type | Dataset | Filename |
|---|---|---|
| Kingdom distribution (Pie & Bar) | All | kingdom_dist.pdf |
| Species distribution (Pie & Bar) | All | species_dist.pdf |
| Sequence length distribution (KDE Plot, Boxplot, Histogram) | All | seq_length.pdf |
| Signal Peptide length distribution (KDE Plot, Boxplot, Histogram) | Positive | SP_length.pdf |
| Residue composition (Bar Plot) | All compared to SwissProt | residue_composition.pdf |
| Signal Peptide cleavage site logos (Sequence Logo) | Positive | logo.pdf |
Note: All plots and analyses are reproducible using the uploaded Data_Visualization.ipynb notebook.
Objective: Classify eukaryotic protein sequences with respect to the presence or absence of a signal peptide (SP) using a position-specific weight matrix (PSWM)-based approach inspired by the von Heijne method.
| Section | Title |
|---|---|
| a | Data Organization |
| b | Training |
| b.1 | Position-Specific Weight Matrix Computation |
| c | Validation |
| c.1 | Sequence Scoring |
| c.2 | Optimal Threshold Selection |
| d | Testing |
| d.1 | Sequences Classification |
| e | Performance Evaluation |
The detailed workflow and implementation can be found in the vonHeijne/ directory.
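To make the training and scoring steps concrete, here is a minimal sketch of a PSWM built from cleavage-site windows and used to score a sequence with a sliding window. It is not the code in the vonHeijne/ directory. The pseudocount of 1, the 90-residue search region, the caller-supplied background frequencies, and the function names are all assumptions for this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pswm(windows, background, pseudocount=1.0):
    """Log-odds matrix from equal-length windows aligned on the cleavage site."""
    width = len(windows[0])
    # Start every cell at the pseudocount so no frequency is zero.
    counts = [{aa: pseudocount for aa in AMINO_ACIDS} for _ in range(width)]
    for window in windows:
        for pos, aa in enumerate(window):
            if aa in counts[pos]:  # ignore non-standard residues
                counts[pos][aa] += 1
    pswm = []
    for column in counts:
        total = sum(column.values())
        pswm.append(
            {aa: math.log(column[aa] / total / background[aa]) for aa in AMINO_ACIDS}
        )
    return pswm


def score_sequence(sequence, pswm, search_limit=90):
    """Best window score within the first `search_limit` residues."""
    width = len(pswm)
    region = sequence[:search_limit]
    best = -math.inf
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        # Residues missing from the matrix contribute 0 to the score.
        score = sum(pswm[i].get(aa, 0.0) for i, aa in enumerate(window))
        best = max(best, score)
    return best


# Toy example with a uniform background and 3-residue windows.
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_pswm = build_pswm(["MAL", "MAL", "MAV"], uniform)
print(score_sequence("MALK", toy_pswm) > score_sequence("KKKK", toy_pswm))  # True
```

A sequence is then predicted to carry a signal peptide when its best window score reaches the chosen threshold.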
| Metric | Value |
|---|---|
| Accuracy | 0.9320 ± 0.0085 |
| Precision | 0.6830 ± 0.0646 |
| Recall | 0.7300 ± 0.0560 |
| F1 Score | 0.7012 ± 0.0226 |
| MCC | 0.6664 ± 0.0258 |
| Threshold | 8.8089 ± 0.5967 |
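The threshold row is the mean and spread of one selected threshold per cross-validation round. As a sketch of how such a threshold could be chosen, the code below picks, for each validation fold, the score cut-off that maximizes F1 and then averages the per-fold values. The selection criterion is an assumption, since it is not stated here, and the scores and labels are toy data.

```python
from statistics import mean


def best_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 for the rule score >= threshold."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Two toy validation folds: (scores, labels) with 1 = signal peptide.
toy_folds = [
    ([1.0, 3.0, 7.5, 9.0, 10.5, 12.0], [0, 0, 0, 1, 1, 1]),
    ([2.0, 4.0, 8.0, 8.5, 11.0], [0, 0, 1, 0, 1]),
]
fold_thresholds = [best_threshold(s, y)[0] for s, y in toy_folds]
final_threshold = mean(fold_thresholds)
print(fold_thresholds, final_threshold)  # [9.0, 8.0] 8.5
```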
Objective: Classify eukaryotic protein sequences with respect to the presence or absence of a signal peptide (SP) by building a Support Vector Machine (SVM) on features extracted from the training dataset sequences.
| Section | Title |
|---|---|
| a | Data Organization |
| b | Features Definition |
| c | Training and Validation |
| c.1 | Feature Extraction and Scaling |
| c.2 | Grid Search Over Hyperparameters |
| c.3 | Features Selection (Random Forest) |
| d | Model Testing |
| d.1 | Over Selected Features |
| d.2 | Over All Features |
The best models were selected by a grid search over the hyperparameters, using the Matthews correlation coefficient (MCC) as the performance metric.
Overall, the models built utilizing all features performed best.
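The selection loop behind a grid search can be sketched without committing to a particular library. In the example below, `evaluate` is a hypothetical callable that would train an SVM with the given hyperparameters (for instance with scikit-learn's `SVC`) and return the validation MCC. Here it is replaced by a toy function so the block runs on its own. The grid lists only values that appear in the tables of this section, and the grid actually searched may be larger.

```python
from itertools import product

# Values taken from the result tables; the full grid is an assumption.
PARAM_GRID = {
    "kernel": ["rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01],
}


def grid_search(param_grid, evaluate):
    """Try every combination and keep the one with the highest score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation MCC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for 'train on training folds, return validation MCC'."""
    bonus = 0.05 if params["gamma"] == 0.01 else 0.0
    return 0.8 - 0.01 * abs(params["C"] - 10) + bonus


best_params, best_score = grid_search(PARAM_GRID, toy_evaluate)
print(best_params)  # {'kernel': 'rbf', 'C': 10, 'gamma': 0.01}
```

Running the search once per cross-validation round yields one row of selected hyperparameters per round.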
Selected Hyperparameters and best validation MCC:
| Round | Kernel | C | Gamma | MCC |
|---|---|---|---|---|
| 1 | 'rbf' | 1 | 0.01 | 0.823 |
| 2 | 'rbf' | 10 | 'scale' | 0.877 |
| 3 | 'rbf' | 10 | 0.01 | 0.822 |
| 4 | 'rbf' | 10 | 0.01 | 0.844 |
| 5 | 'rbf' | 10 | 0.01 | 0.856 |
Performance Evaluation Metrics over Testing data:
| Metrics | Value |
|---|---|
| MCC | 0.826 ± 0.030 |
| Precision | 0.851 ± 0.038 |
| Recall | 0.841 ± 0.039 |
| Accuracy | 0.967 ± 0.007 |
| F1 score | 0.845 ± 0.026 |
Selected Hyperparameters and best validation MCC:
| Round | Kernel | C | Gamma | MCC |
|---|---|---|---|---|
| 1 | 'rbf' | 0.1 | 'scale' | 0.802 |
| 2 | 'rbf' | 10 | 'scale' | 0.849 |
| 3 | 'rbf' | 10 | 0.01 | 0.807 |
| 4 | 'rbf' | 1 | 'scale' | 0.807 |
| 5 | 'rbf' | 1 | 'scale' | 0.857 |
Performance Evaluation Metrics over Testing data:
| Metrics | Value |
|---|---|
| MCC | 0.801 ± 0.018 |
| Precision | 0.849 ± 0.034 |
| Recall | 0.796 ± 0.041 |
| Accuracy | 0.962 ± 0.003 |
| F1 score | 0.821 ± 0.016 |
Objective: Evaluate how well the von Heijne and SVM models classify the eukaryotic protein sequences in the benchmarking dataset by presence (1) or absence (0) of a signal peptide (SP).
| Section | Title |
|---|---|
| a | Data Organization |
| b | Training: Building PSWM on Training data |
| c | Testing: Classification of Testing data |
| d | Performance Evaluation |
The classification threshold was the average of the best threshold values obtained in the cross-validation step of the von Heijne implementation. The description can be found in the dedicated folder: 04_vonHeijne.
The performance of the classifier built utilizing the whole training dataset is summarized in the following table:
| Metric | Value |
|---|---|
| Accuracy | 0.9312 |
| Precision | 0.6614 |
| Recall | 0.7580 |
| F1-score | 0.7064 |
| MCC | 0.6696 |
| Threshold | 8.8089 |
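For reference, the metrics in the table follow from the four confusion-matrix counts as sketched below. The counts in the usage example are not reported here. They are the integers that reproduce the rounded values in the table given the benchmarking set sizes (219 positives, 1,787 negatives), and should be read as a reconstruction.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Reconstructed counts (assumption): 166 + 53 = 219 positives,
# 1702 + 85 = 1787 negatives.
metrics = classification_metrics(tp=166, fp=85, tn=1702, fn=53)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC uses all four counts, which makes it more informative than accuracy on a set where negatives outnumber positives by roughly eight to one.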
- Owji, H., Nezafat, N., Negahdaripour, M., HajiEbrahimi, A., & Ghasemi, Y. (2018). A Comprehensive Review of Signal Peptides: Structure, Roles, and Applications. European Journal of Cell Biology, 97. doi:10.1016/j.ejcb.2018.06.003.