
Complete Python pipeline for signal peptide detection combining classical and machine learning approaches.


kianinsilico/SignalPeptidePrediction

 
 


Prediction of Secretory Signal Peptide Presence in Eukaryotic Proteins

Laboratory of Bioinformatics 2 2025/2026 - Alma Mater Studiorum Università di Bologna

Abstract

Signal peptides are short sequences at the N-terminus of proteins that direct them to the secretory pathway and are typically cleaved after translocation (1). In silico prediction of signal peptides is therefore important for functional annotation and for predicting protein localization.

1. Data Collection

Objective: retrieve positive and negative datasets of eukaryotic proteins from UniProtKB.

The full description of the procedure can be found in the README.md of the data_colection folder.

Workflow

Section Title
a Selection criteria
b Filtering the Positive Dataset
c Implementation notes
d Output files
e Reproducibility

Results

| Datasets | Positive | Negative |
| --- | --- | --- |
| Query | `(existence:1) AND (length:[40 TO *]) AND (reviewed:true) AND (fragment:false) AND (taxonomy_id:2759) AND (ft_signal_exp:*)` | `(existence:1) AND (length:[40 TO *]) AND (reviewed:true) AND (fragment:false) AND (taxonomy_id:2759) NOT (ft_signal:*) AND ((cc_scl_term_exp:SL-0091) OR (cc_scl_term_exp:SL-0191) OR (cc_scl_term_exp:SL-0173) OR (cc_scl_term_exp:SL-0204) OR (cc_scl_term_exp:SL-0209) OR (cc_scl_term_exp:SL-0039))` |
| No. entries | Before filtering: 2,949; after filtering: 2,932 | 20,615 |
| Output | eukarya_SP_pos.tsv, pos.fasta | eukarya_SP_neg.tsv, neg.fasta |
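The positive-set query above can be sent to the UniProt REST API. The following is a minimal sketch rather than the repository's own script: it only builds the request URL and sends nothing. The `uniprotkb/search` endpoint and its `query`, `format`, and `size` parameters are part of the public UniProt REST API; the helper name `build_search_url` and the page size of 500 are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Positive-set query copied from the Results table.
POSITIVE_QUERY = (
    "(existence:1) AND (length:[40 TO *]) AND (reviewed:true) "
    "AND (fragment:false) AND (taxonomy_id:2759) AND (ft_signal_exp:*)"
)

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="tsv", size=500):
    """Return a UniProtKB search URL (hypothetical helper; no request is sent)."""
    params = {"query": query, "format": fmt, "size": size}
    return UNIPROT_SEARCH + "?" + urlencode(params)


url = build_search_url(POSITIVE_QUERY)
print(url)
```

Fetching the URL with any HTTP client returns one page of results. Larger result sets are paginated, so a real download script would need to follow the pages until all entries are retrieved.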

2. Data Preparation

Objective: Reduce redundancy in the datasets, generate training and benchmarking sets, and create 5-fold cross-validation subsets for robust model evaluation.

The full description of the procedure can be found in the README.md of the data_split folder.

Workflow

Section Title
a Clustering
b Extract Representative IDs
c Metadata Collection
d Data Splitting and Cross-Validation
e Output

Results

Clustering

Dataset Input sequences No. of clusters File
Positive 2,932 1,093 cluster-results-pos_rep_seq.fasta
Negative 20,615 8,934 cluster-results-neg_rep_seq.fasta

Extract Representative IDs and Metadata Collection

The representative ID lists were shuffled and split. The resulting ID files were then used to filter the combined .tsv file, yielding two .tsv files that hold the metadata for the positive and negative datasets.

Section Scripts Files
b extract_rep_ids.py neg_rep_id.txt
pos_rep_id.txt
c 20,615 organizing_metadata.py
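Step b collects the identifiers of the cluster representatives from the representative-sequence FASTA files. The sketch below shows one way to do this; it is not the content of `extract_rep_ids.py`. It assumes the headers are either UniProt-style (`sp|ACCESSION|NAME`) or a bare accession, and the second example record (`Q9XYZ1`) is made up for illustration.

```python
def extract_rep_ids(fasta_lines):
    """Collect one identifier per FASTA header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()[0]  # first token after '>'
            parts = header.split("|")
            # UniProt-style headers look like 'sp|P12345|AATM_RABIT';
            # otherwise the whole token is taken as the identifier.
            ids.append(parts[1] if len(parts) >= 3 else header)
    return ids


example = [
    ">sp|P12345|AATM_RABIT Aspartate aminotransferase",
    "MALLHSARVL",
    ">Q9XYZ1",  # hypothetical bare-accession header
    "MKTAYIAKQR",
]
rep_ids = extract_rep_ids(example)
print(rep_ids)
```

In practice the function would be called on an open file handle, and the returned list written to a text file with one ID per line.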

Output

Set / Fold Negative sequences Positive sequences Total sequences
Benchmarking 1,787 219 2,006
Fold 1 1,430 175 1,605
Fold 2 1,430 175 1,605
Fold 3 1,429 175 1,604
Fold 4 1,429 175 1,604
Fold 5 1,429 174 1,603
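The set sizes in the table are consistent with holding out 20% of each class for benchmarking and dividing the remainder into five near-equal folds. The sketch below reproduces those sizes under that assumption; the function name, the seed, and the use of integer placeholders instead of real IDs are illustrative, and the authors' actual procedure is described in the data_split README.

```python
import random


def split_dataset(ids, benchmark_fraction=0.2, n_folds=5, seed=42):
    """Shuffle IDs, hold out a benchmarking set, and split the rest into folds."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_bench = round(len(ids) * benchmark_fraction)
    benchmark, training = ids[:n_bench], ids[n_bench:]

    # The first `extra` folds receive one additional sequence.
    base, extra = divmod(len(training), n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(training[start:start + size])
        start += size
    return benchmark, folds


# Placeholder IDs with the cluster counts reported above.
neg_bench, neg_folds = split_dataset(range(8934))
pos_bench, pos_folds = split_dataset(range(1093))
print(len(neg_bench), [len(f) for f in neg_folds])
print(len(pos_bench), [len(f) for f in pos_folds])
```

Splitting the two classes separately keeps the positive-to-negative ratio nearly identical across the benchmarking set and every fold.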

3. Data Analysis and Visualization

Objective: understand the structure and characteristics of the dataset.

The data visualization step provides insights into the characteristics of the positive and negative protein datasets used in this study. The plots were generated in Python using matplotlib and seaborn.

The full description of the procedure can be found in the README.md of the data_analysis folder.

Workflow

Section Title
a Analyses
b Plot Summary
c Results

Results

| Description (plot type) | Dataset | Filename |
| --- | --- | --- |
| Kingdom distribution (pie and bar) | All | kingdom_dist.pdf |
| Species distribution (pie and bar) | All | species_dist.pdf |
| Sequence length distribution (KDE plot, boxplot, histogram) | All | seq_length.pdf |
| Signal peptide length distribution (KDE plot, boxplot, histogram) | Positive | SP_length.pdf |
| Residue composition (bar plot) | All, compared to SwissProt | residue_composition.pdf |
| Signal peptide cleavage site logos (sequence logo) | Positive | logo.pdf |

Note: All plots and analyses are reproducible using the uploaded Data_Visualization.ipynb notebook.
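As an illustration of the residue composition analysis, the sketch below computes relative amino-acid frequencies over a set of sequences. It is not taken from the notebook; the function name and the two toy sequences are made up, and the real analysis compares the result against SwissProt background frequencies.

```python
from collections import Counter


def residue_composition(sequences):
    """Return the relative frequency of each residue across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


toy_sequences = ["MKLLA", "MALLS"]  # made-up examples
composition = residue_composition(toy_sequences)
print(composition)
```

The resulting dictionary can be passed directly to a bar plot, with one bar per residue.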

4. The von Heijne method for SP detection

Objective: Classify eukaryotic protein sequences with respect to the presence or absence of a signal peptide (SP) using a position-specific weight matrix (PSWM)-based approach inspired by the von Heijne method.

Workflow

Section Title
a Data Organization
b Training
b.1 Position-Specific Weight Matrix Computation
c Validation
c.1 Sequence Scoring
c.2 Optimal Threshold Selection
d Testing
d.1 Sequences Classification
e Performance Evaluation

The detailed workflow and implementation can be found in the vonHeijne/ directory.
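Steps b.1 and c.1 can be sketched as follows. This is a simplified illustration, not the code in the vonHeijne/ directory: the pseudocount of 1, the uniform background, the 90-residue search region, and the three-residue toy windows are all assumptions. A real implementation would use longer windows around annotated cleavage sites and a SwissProt-derived background.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pswm(windows, background, pseudocount=1.0):
    """Build a log-odds PSWM from equal-length windows around cleavage sites."""
    pswm = []
    for pos in range(len(windows[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in windows:
            if window[pos] in counts:  # skip non-standard residues
                counts[window[pos]] += 1
        total = sum(counts.values())
        pswm.append({aa: math.log(counts[aa] / total / background[aa])
                     for aa in AMINO_ACIDS})
    return pswm


def score_sequence(seq, pswm, search_len=90):
    """Return the best window score within the first `search_len` residues."""
    width = len(pswm)
    region = seq[:search_len]
    best = -math.inf
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        # Non-standard residues contribute 0 to the score.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pswm, window))
        best = max(best, score)
    return best


uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_pswm = build_pswm(["ALA", "ASA", "VQA"], uniform)  # made-up windows
motif_score = score_sequence("MKKALAGG", toy_pswm)
other_score = score_sequence("MKKDDDGG", toy_pswm)
print(motif_score, other_score)
```

A sequence is then classified as SP-positive when its best window score reaches the chosen threshold.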

Results

Metric Value
Accuracy 0.9320 ± 0.0085
Precision 0.6830 ± 0.0646
Recall 0.7300 ± 0.0560
F1 Score 0.7012 ± 0.0226
MCC 0.6664 ± 0.0258
Threshold 8.8089 ± 0.5967
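The threshold reported above comes from step c.2. One common way to select it, shown here as a sketch, is to scan the validation scores and keep the value that maximizes a chosen metric. Using F1 as the criterion is an assumption, as are the function names and the toy scores; the repository may select the threshold differently.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when sequences scoring >= threshold are called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels):
    """Try each observed score as a threshold and keep the first best one."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up validation scores and labels (1 = SP, 0 = no SP).
val_scores = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
val_labels = [0, 0, 0, 1, 1, 1]
threshold, f1 = best_threshold(val_scores, val_labels)
print(threshold, f1)
```

Repeating this for each cross-validation round gives one threshold per round, which can then be summarized as a mean and spread.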

5. SVM classifier for SP detection

Objective: Classify eukaryotic protein sequences by the presence or absence of a signal peptide (SP) using a Support Vector Machine (SVM) trained on features extracted from the training dataset sequences.

Workflow

Section Title
a Data Organization
b Features Definition
c Training and Validation
c.1 Feature Extraction and Scaling
c.2 Grid Search Over Hyperparameters
c.3 Features Selection (Random Forest)
d Model Testing
d.1 Over Selected Features
d.2 Over All Features
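Steps c.1 and c.2 can be sketched with scikit-learn as shown below. This is an assumption-laden illustration, not the repository's code: the data are synthetic, the grid is a small example, and `GridSearchCV` with 3-fold splitting stands in for whatever fold rotation the authors used. Scaling is placed inside the pipeline so that it is fitted only on the training part of each split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the extracted feature matrix.
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.85, 0.15], random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Example grid; the values searched in the repository may differ.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring=make_scorer(matthews_corrcoef),  # select models by MCC
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

MCC is a sensible selection metric here because the classes are strongly imbalanced, which makes accuracy alone uninformative.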

Results

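The best models were selected by a grid search over the hyperparameters, using MCC as the performance metric.
Overall, the models trained on all features performed slightly better on the testing data (MCC 0.826 ± 0.030) than those trained on selected features (MCC 0.801 ± 0.018).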

Models trained over ALL FEATURES:

Selected Hyperparameters and best validation MCC:

Round Kernel C Gamma MCC
1 'rbf' 1 0.01 0.823
2 'rbf' 10 'scale' 0.877
3 'rbf' 10 0.01 0.822
4 'rbf' 10 0.01 0.844
5 'rbf' 10 0.01 0.856

Performance Evaluation Metrics over Testing data:

Metrics Value
MCC 0.826 ± 0.030
Precision 0.851 ± 0.038
Recall 0.841 ± 0.039
Accuracy 0.967 ± 0.007
F1 score 0.845 ± 0.026

Models trained over SELECTED FEATURES:

Selected Hyperparameters and best validation MCC:

Round Kernel C Gamma MCC
1 'rbf' 0.1 'scale' 0.802
2 'rbf' 10 'scale' 0.849
3 'rbf' 10 0.01 0.807
4 'rbf' 1 'scale' 0.807
5 'rbf' 1 'scale' 0.857

Performance Evaluation Metrics over Testing data:

Metrics Value
MCC 0.801 ± 0.018
Precision 0.849 ± 0.034
Recall 0.796 ± 0.041
Accuracy 0.962 ± 0.003
F1 score 0.821 ± 0.016

6. Performance evaluation of the von Heijne and SVM classifiers

Objective: Evaluate how well the von Heijne and SVM models classify the eukaryotic protein sequences in the benchmarking dataset by the presence (1) or absence (0) of a signal peptide (SP).

6.1 Von Heijne Classifier Performance

Workflow

Section Title
a Data Organization
b Training: building the PSWM on the training data
c Testing: classification of the testing data
d Performance Evaluation

Results

The classification threshold was the mean of the best thresholds obtained during cross-validation of the von Heijne implementation, which is described in the dedicated folder 04_vonHeijne.
The performance of the classifier built on the whole training dataset is summarized in the following table:

Metric Value
Accuracy 0.9312
Precision 0.6614
Recall 0.7580
F1-score 0.7064
MCC 0.6696
Threshold 8.8089
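The metrics in this table follow from a confusion matrix. The sketch below computes them from the four counts. The counts used (TP=166, FP=85, TN=1702, FN=53) are not taken from the repository: they were inferred from the reported metrics and the benchmarking set sizes (219 positives, 1,787 negatives), and they reproduce the table values to four decimals.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Inferred counts (assumption), consistent with the benchmarking set sizes.
metrics = classification_metrics(tp=166, fp=85, tn=1702, fn=53)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, roughly one in three predicted signal peptides is a false positive, which explains why precision is lower than recall despite the high accuracy.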

References

  1. Owji, H., Nezafat, N., Negahdaripour, M., HajiEbrahimi, A., & Ghasemi, Y. (2018). A Comprehensive Review of Signal Peptides: Structure, Roles, and Applications. European Journal of Cell Biology, 97. doi:10.1016/j.ejcb.2018.06.003
