Prediction of Secretory Signal Peptide Presence in Eukaryotic Proteins

Laboratory of Bioinformatics 2 2025/2026 - Alma Mater Studiorum Università di Bologna

Abstract

Signal peptides are short sequences at the N-terminus of proteins that direct them to the secretory pathway and are typically cleaved after translocation (1). In-silico prediction of signal peptides is crucial for functional annotation and localization.

1. Data Collection

Objective: retrieve positive and negative datasets of eukaryotic proteins from UniProtKB.

The full description of the procedure can be found in the README.md of the data_collection folder.

Workflow

Section	Title
a	Selection criteria
b	Filtering the Positive Dataset
c	Implementation notes
d	Output files
e	Reproducibility

Results

Datasets	Positive	Negative
Query	`(existence:1) AND (length:[40 TO *]) AND (reviewed:true) AND (fragment:false) AND (taxonomy_id:2759) AND (ft_signal_exp:*)`	`(existence:1) AND (length:[40 TO *]) AND (reviewed:true) AND (fragment:false) AND (taxonomy_id:2759) NOT (ft_signal:*) AND ((cc_scl_term_exp:SL-0091) OR (cc_scl_term_exp:SL-0191) OR (cc_scl_term_exp:SL-0173) OR (cc_scl_term_exp:SL-0204) OR (cc_scl_term_exp:SL-0209) OR (cc_scl_term_exp:SL-0039))`
No. entries	2,949 before filtering; 2,932 after filtering	20,615
Output	eukarya_SP_pos.tsv, pos.fasta	eukarya_SP_neg.tsv, neg.fasta
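
The two queries share most of their clauses, so they can be assembled from common parts. The following is a minimal sketch of that idea, not the project's retrieval script: the function names are hypothetical, and the `*` wildcards are written out explicitly on the assumption that the queries use the standard UniProtKB open-range and "any value" syntax.

```python
# Subcellular-location terms copied from the negative query in the table above.
NEGATIVE_SL_TERMS = ["SL-0091", "SL-0191", "SL-0173", "SL-0204", "SL-0209", "SL-0039"]

# Clauses shared by the positive and the negative query.
COMMON_CLAUSES = [
    "(existence:1)",
    "(length:[40 TO *])",
    "(reviewed:true)",
    "(fragment:false)",
    "(taxonomy_id:2759)",
]


def build_positive_query() -> str:
    """Proteins with an experimentally annotated signal peptide."""
    return " AND ".join(COMMON_CLAUSES + ["(ft_signal_exp:*)"])


def build_negative_query(sl_terms=NEGATIVE_SL_TERMS) -> str:
    """Proteins without any signal peptide annotation, restricted to the
    experimentally supported subcellular locations in sl_terms."""
    locations = " OR ".join(f"(cc_scl_term_exp:{term})" for term in sl_terms)
    return " AND ".join(COMMON_CLAUSES) + f" NOT (ft_signal:*) AND ({locations})"


if __name__ == "__main__":
    print(build_positive_query())
    print(build_negative_query())
```

Keeping the shared clauses in one list makes it harder for the two datasets to drift apart when a selection criterion changes.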

2. Data Preparation

Objective: reduce redundancy in the datasets, generate training and benchmarking sets, and create 5-fold cross-validation subsets for robust model evaluation.

The full description of the procedure can be found in the README.md of the data_split folder.

Workflow

Section	Title
a	Clustering
b	Extract Representative IDs
c	Metadata Collection
d	Data Splitting and Cross-Validation
e	Output

Results

Clustering

Dataset	Input sequences	No. of clusters	File
Positive	2,932	1,093	cluster-results-pos_rep_seq.fasta
Negative	20,615	8,934	cluster-results-neg_rep_seq.fasta

Extract Representative IDs and Metadata Collection

The representative ID lists were shuffled and split. The resulting ID files were then used to filter the collective .tsv file, yielding two .tsv files that hold the metadata for the positive and negative datasets.

Section	Scripts	Files
b	extract_rep_ids.py	neg_rep_id.txt, pos_rep_id.txt
c	organizing_metadata.py	two .tsv metadata files (positive and negative)
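
Step b can be illustrated with a short sketch that collects identifiers from the headers of a representative-sequence FASTA file. This is not the content of extract_rep_ids.py. It assumes headers are either UniProt-style (`>sp|ACCESSION|NAME ...`) or a bare accession, and the second record in the example is made up.

```python
import io


def extract_rep_ids(fasta_handle):
    """Return one identifier per FASTA record, in file order."""
    ids = []
    for line in fasta_handle:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        token = header.split()[0] if header else ""
        parts = token.split("|")
        # UniProt-style headers carry the accession in the second field.
        ids.append(parts[1] if len(parts) >= 3 else token)
    return ids


# Small in-memory example; the second record is a hypothetical accession.
example = io.StringIO(
    ">sp|P01308|INS_HUMAN Insulin\nMALWMRLLPL\n"
    ">Q9XYZ1\nMKTAYIAK\n"
)
rep_ids = extract_rep_ids(example)
print(rep_ids)
```

The same function works on a real file by passing `open("cluster-results-pos_rep_seq.fasta")` in place of the in-memory example.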

Output

Set / Fold	Negative sequences	Positive sequences	Total sequences
Benchmarking	1,787	219	2,006
Fold 1	1,430	175	1,605
Fold 2	1,430	175	1,605
Fold 3	1,429	175	1,604
Fold 4	1,429	175	1,604
Fold 5	1,429	174	1,603
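
The set sizes in the table are consistent with holding out about 20% of each class for benchmarking and dealing the remaining IDs into five folds. The sketch below reproduces those sizes under that assumption. The 80/20 proportion, the seed, the round-robin fold assignment, and the function name are illustrative choices, not taken from the project's scripts.

```python
import random


def split_ids(ids, benchmark_fraction=0.2, n_folds=5, seed=42):
    """Shuffle ids, hold out a benchmarking set, and split the rest into folds."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_bench = round(len(shuffled) * benchmark_fraction)
    benchmark = shuffled[:n_bench]
    training = shuffled[n_bench:]
    # Round-robin assignment keeps fold sizes within one sequence of each other.
    folds = [training[i::n_folds] for i in range(n_folds)]
    return benchmark, folds


# Placeholder IDs with the same counts as the clustered datasets.
pos_ids = [f"POS{i}" for i in range(1093)]
neg_ids = [f"NEG{i}" for i in range(8934)]

pos_bench, pos_folds = split_ids(pos_ids)
neg_bench, neg_folds = split_ids(neg_ids)
print(len(pos_bench), [len(f) for f in pos_folds])
print(len(neg_bench), [len(f) for f in neg_folds])
```

Splitting positives and negatives separately, as done here, keeps the class ratio roughly constant across the benchmarking set and every fold.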

3. Data Analysis and Visualization

Objective: understand the structure and characteristics of the dataset.

The data visualization step provides insights into the characteristics of the positive and negative protein datasets used in this study. The plots were generated in Python using matplotlib and seaborn.

The full description of the procedure can be found in the README.md of the data_analysis folder.

Workflow

Section	Title
a	Analyses
b	Plot Summary
c	Results

Results

Description and Plot Type	Dataset	Filename
Kingdom distribution (Pie & Bar)	All	kingdom_dist.pdf
Species distribution (Pie & Bar)	All	species_dist.pdf
Sequence length distribution (KDE Plot, Boxplot, Histogram)	All	seq_length.pdf
Signal Peptide length distribution (KDE Plot, Boxplot, Histogram)	Positive	SP_length.pdf
Residue composition (Bar Plot)	All compared to SwissProt	residue_composition.pdf
Signal Peptide cleavage site logos (Sequence Logo)	Positive	logo.pdf

Note: All plots and analyses are reproducible using the uploaded Data_Visualization.ipynb notebook.
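
As an illustration of the residue composition analysis listed above, the sketch below computes the percentage of each of the 20 standard amino acids over a set of sequences. It is a simplified stand-in for the notebook code: the function name is hypothetical, non-standard residues are skipped by assumption, and the SwissProt reference frequencies are left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_composition(sequences):
    """Return the percentage of each standard amino acid across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore anything outside the 20 standard residues (e.g. X, U, B, Z).
        counts.update(res for res in seq.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


composition = residue_composition(["AAL", "L"])
print(composition["A"], composition["L"])
```

The returned dictionary can be passed to a bar plot with one bar per residue, next to the corresponding reference frequencies.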

4. The von Heijne method for SP detection

Objective: Classify eukaryotic protein sequences with respect to the presence or absence of a signal peptide (SP) using a position-specific weight matrix (PSWM)-based approach inspired by the von Heijne method.

Workflow

Section	Title
a	Data Organization
b	Training
b.1	Position-Specific Weight Matrix Computation
c	Validation
c.1	Sequence Scoring
c.2	Optimal Threshold Selection
d	Testing
d.1	Sequences Classification
e	Performance Evaluation

The detailed workflow and implementation can be found in the 04_vonHeijne/ directory.
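
Steps b.1 and c.1 can be sketched as follows. This is a minimal illustration of a PSWM-based scorer, not the implementation in the project directory. The uniform background, the pseudocount of 1, the natural logarithm, and the 90-residue search region are all assumptions, and the three-residue windows in the example are toy data rather than real cleavage-site windows.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pswm(windows, background=None, pseudocount=1.0):
    """Build a log-odds matrix from equal-length windows aligned on the cleavage site."""
    if background is None:
        # Assumption: uniform background; a SwissProt composition could be used instead.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pswm = []
    for pos in range(len(windows[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in windows:
            if window[pos] in counts:
                counts[window[pos]] += 1
        total = sum(counts.values())
        pswm.append(
            {aa: math.log(counts[aa] / total / background[aa]) for aa in AMINO_ACIDS}
        )
    return pswm


def score_sequence(sequence, pswm, search_length=90):
    """Slide the matrix along the N-terminal region and return the best window score."""
    width = len(pswm)
    region = sequence[:search_length]
    best = -math.inf
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        # Residues missing from the matrix contribute a neutral score of 0.
        score = sum(column.get(res, 0.0) for column, res in zip(pswm, window))
        best = max(best, score)
    return best


toy_pswm = build_pswm(["AAA", "AAA", "AAC"])
print(score_sequence("GGGAAAGGG", toy_pswm), score_sequence("GGGGGGGGG", toy_pswm))
```

A sequence is then predicted to carry a signal peptide when its best window score reaches the selected threshold.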

Results

Metric	Value
Accuracy	0.9320 ± 0.0085
Precision	0.6830 ± 0.0646
Recall	0.7300 ± 0.0560
F1 Score	0.7012 ± 0.0226
MCC	0.6664 ± 0.0258
Threshold	8.8089 ± 0.5967
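
The metrics in the table can all be derived from the confusion counts of a thresholded score. The sketch below shows one way to compute them and to pick a threshold on a validation set. Selecting the threshold that maximizes MCC is an assumption made for illustration, since the criterion actually used is described in the project directory, and the function names and toy scores are hypothetical.

```python
import math


def classification_metrics(labels, predictions):
    """Compute accuracy, precision, recall, F1 and MCC for binary labels (0/1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the one with the highest MCC."""
    best_t, best_mcc = None, -math.inf
    for t in sorted(set(scores)):
        predictions = [1 if s >= t else 0 for s in scores]
        mcc = classification_metrics(labels, predictions)["mcc"]
        if mcc > best_mcc:
            best_t, best_mcc = t, mcc
    return best_t, best_mcc


# Toy validation scores: negatives score low, positives score high.
threshold, mcc = best_threshold([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
print(threshold, mcc)
```

Repeating the selection once per cross-validation round gives one threshold per round, which can then be summarized as a mean and standard deviation.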

5. SVM classifier for SP detection

Objective: Classify eukaryotic protein sequences with respect to the presence or absence of a signal peptide (SP) by building a Support Vector Machine (SVM) on features extracted from the training dataset sequences.

Workflow

Section	Title
a	Data Organization
b	Features Definition
c	Training and Validation
c.1	Feature Extraction and Scaling
c.2	Grid Search Over Hyperparameters
c.3	Features Selection (Random Forest)
d	Model Testing
d.1	Over Selected Features
d.2	Over All Features
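
Steps c.1 and c.2 can be sketched with scikit-learn as shown below. This is an illustrative setup, not the project's training code. `GridSearchCV` with its internal cross-validation stands in for the project's own rotation of training and validation folds, the hyperparameter grid is a guess, and the feature matrix is synthetic because the real features are defined in the 05_SVMs folder.

```python
import numpy as np
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def grid_search_svm(X_train, y_train, cv=3):
    """Scale the features, then search SVM hyperparameters using MCC as the score."""
    pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
    # Assumed grid; the values actually explored may differ.
    param_grid = {
        "svm__kernel": ["rbf", "linear"],
        "svm__C": [0.1, 1, 10],
        "svm__gamma": ["scale", 0.01],
    }
    search = GridSearchCV(
        pipeline, param_grid, scoring=make_scorer(matthews_corrcoef), cv=cv
    )
    search.fit(X_train, y_train)
    return search


# Synthetic two-class data standing in for the extracted sequence features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 4)), rng.normal(3.0, 1.0, (40, 4))])
y = np.array([0] * 40 + [1] * 40)

search = grid_search_svm(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline means it is fitted only on the training portion of each split, which avoids leaking validation statistics into the model.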

Results

The best models were selected through a grid search over the hyperparameters, using MCC as the performance metric.
Overall, the models trained on all features performed best.

Models trained over ALL FEATURES:

Selected Hyperparameters and best validation MCC:

Round	Kernel	C	Gamma	MCC
1	'rbf'	1	0.01	0.823
2	'rbf'	10	'scale'	0.877
3	'rbf'	10	0.01	0.822
4	'rbf'	10	0.01	0.844
5	'rbf'	10	0.01	0.856

Performance Evaluation Metrics over Testing data:

Metrics	Value
MCC	0.826 ± 0.030
Precision	0.851 ± 0.038
Recall	0.841 ± 0.039
Accuracy	0.967 ± 0.007
F1 score	0.845 ± 0.026

Models trained over SELECTED FEATURES:

Selected Hyperparameters and best validation MCC:

Round	Kernel	C	Gamma	MCC
1	'rbf'	0.1	'scale'	0.802
2	'rbf'	10	'scale'	0.849
3	'rbf'	10	0.01	0.807
4	'rbf'	1	'scale'	0.807
5	'rbf'	1	'scale'	0.857

Performance Evaluation Metrics over Testing data:

Metrics	Value
MCC	0.801 ± 0.018
Precision	0.849 ± 0.034
Recall	0.796 ± 0.041
Accuracy	0.962 ± 0.003
F1 score	0.821 ± 0.016

6. Performance evaluation of the von Heijne and the SVM classifiers

Objective: Evaluate how well the von Heijne and SVM models classify the eukaryotic protein sequences in the benchmarking dataset by the presence (1) or absence (0) of a signal peptide (SP).

6.1: von Heijne Classifier Performance

Workflow

Section	Title
a	Data Organization
b	Training: Building PSWM on Training data
c	Testing: Classification of Testing data
d	Performance Evaluation

Results

The classification threshold was the average of the best threshold values obtained in the cross-validation step of the von Heijne implementation, which is described in the dedicated folder 04_vonHeijne.
The performance of the classifier built on the whole training dataset is summarized in the following table:

Metric	Value
Accuracy	0.9312
Precision	0.6614
Recall	0.7580
F1-score	0.7064
MCC	0.6696
Threshold	8.8089

References

1. Owji, H., Nezafat, N., Negahdaripour, M., Hajiebrahimi, A., & Ghasemi, Y. (2018). A Comprehensive Review of Signal Peptides: Structure, Roles, and Applications. European Journal of Cell Biology, 97. doi:10.1016/j.ejcb.2018.06.003
