RandyHaddad/Protein-Structure-Prediction

Protein Structure Prediction with CNNs

This notebook, CNN_Randy.ipynb, provides a comprehensive workflow for predicting protein secondary structure from amino acid sequences using Convolutional Neural Networks (CNNs).

Overview

Protein structure prediction is a key challenge in bioinformatics, as a protein's function is determined by its 3D structure. Experimental methods to determine structure are expensive and time-consuming; deep learning, particularly CNNs, offers a scalable computational alternative.

This notebook focuses on predicting two levels of secondary structure:

  • sst3: Three-state classification (helix, strand, coil)
  • sst8: Eight-state classification (more granular)
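The relationship between the two label sets can be sketched as a per-residue lookup. This is a minimal illustration that assumes the dataset uses the DSSP-style eight-state alphabet (H, G, I, E, B, T, S, C) and the common reduction to three states; the symbols actually used in your CSV may differ, and `SST8_TO_SST3` and `reduce_sst8` are hypothetical names, not taken from the notebook.

```python
# Assumed DSSP-style eight-state alphabet; check the symbols in your own data.
SST8_TO_SST3 = {
    "H": "H", "G": "H", "I": "H",   # helix types collapse to helix
    "E": "E", "B": "E",             # strand and bridge collapse to strand
    "T": "C", "S": "C", "C": "C",   # turn, bend, and coil collapse to coil
}


def reduce_sst8(sst8: str) -> str:
    """Collapse an eight-state label string to three states, residue by residue."""
    return "".join(SST8_TO_SST3[state] for state in sst8)


example = reduce_sst8("CHHHGGTEEBSC")
print(example)
```

Both label strings have one symbol per residue, so the reduced string always has the same length as the input.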

Workflow

The notebook follows these main steps:

1. Introduction & Setup

  • Explains the challenge and project goals.
  • Installs the required Python packages (PyTorch, pandas, scikit-learn, matplotlib, seaborn, and others).
  • Sets random seeds for reproducibility.
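Seeding might look like the sketch below. The helper name `set_seed` and the default value 42 are assumptions for illustration; the notebook's own seeding cell may differ. NumPy and PyTorch are seeded only if they are installed, so the snippet also runs in a bare Python environment.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)
```

Re-seeding before each experiment makes random draws repeatable, although some GPU operations can remain non-deterministic even with fixed seeds.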

2. Data Loading & Analysis

  • Loads the protein sequence dataset (CSV format, e.g. from Google Drive).
  • Inspects the dataset and its columns (such as pdb_id, chain_code, seq, sst3, and sst8).
  • Performs descriptive statistics and visualizations:
    • Sequence length distribution
    • Amino acid frequency distribution
    • Secondary structure class imbalance for sst3 and sst8
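The three statistics listed above can be computed with the standard library alone. The sketch below uses two made-up records in place of the real CSV rows; the notebook itself most likely works on a pandas DataFrame and plots the results, which is not shown here.

```python
from collections import Counter

# Toy records standing in for dataset rows (invented values, for illustration only).
records = [
    {"seq": "MKTAYIAK", "sst3": "CCHHHHEC"},
    {"seq": "GSHM", "sst3": "CCEE"},
]

# Sequence length distribution: one length per record.
lengths = [len(r["seq"]) for r in records]

# Amino acid frequency: count residues across all sequences.
aa_counts = Counter("".join(r["seq"] for r in records))

# Class balance: count per-residue labels and convert to shares.
label_counts = Counter("".join(r["sst3"] for r in records))
total = sum(label_counts.values())
label_share = {label: count / total for label, count in label_counts.items()}

print(lengths)
print(aa_counts.most_common(3))
print(label_share)
```

Counting labels per residue rather than per sequence matters here, because the model predicts one class for every position.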

3. Data Preprocessing

  • User-configurable preprocessing parameters:
    • Cutoff length for filtering sequences
    • Padding/cropping length
    • Option to filter non-standard amino acids
    • Embedding method for encoding protein sequences (e.g., one-hot, learnable, ProtBERT, TAPE)
  • Functions for filtering, cropping, padding, and encoding sequences and secondary structure labels.
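The filtering, cropping, padding, and encoding steps can be sketched as follows for the one-hot option. All function names, the padding label `-1`, and the H/E/C label symbols are assumptions made for this illustration; they are not taken from the notebook.

```python
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(STANDARD_AA)}
SST3_INDEX = {"H": 0, "E": 1, "C": 2}          # assumed label symbols
PAD_LABEL = -1                                  # assumed marker for padded positions


def keep_sequence(seq, cutoff_len, standard_only=True):
    """Return True if the sequence passes the length and alphabet filters."""
    if len(seq) > cutoff_len:
        return False
    if standard_only and any(aa not in AA_INDEX for aa in seq):
        return False
    return True


def one_hot(seq, max_len):
    """Crop to max_len, one-hot encode each residue, and zero-pad to max_len rows."""
    rows = []
    for aa in seq[:max_len]:
        row = [0.0] * len(STANDARD_AA)
        if aa in AA_INDEX:                      # unknown residues stay all-zero
            row[AA_INDEX[aa]] = 1.0
        rows.append(row)
    rows.extend([0.0] * len(STANDARD_AA) for _ in range(max_len - len(rows)))
    return rows


def encode_labels(sst, max_len):
    """Crop to max_len, map labels to integer ids, and pad with PAD_LABEL."""
    ids = [SST3_INDEX[s] for s in sst[:max_len]]
    return ids + [PAD_LABEL] * (max_len - len(ids))


features = one_hot("AC", max_len=4)
labels = encode_labels("HE", max_len=4)
print(len(features), len(features[0]), labels)
```

Using a distinct padding label lets a loss function skip padded positions, for example through the `ignore_index` argument of PyTorch's `CrossEntropyLoss`.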

4. Exploratory Data Analysis (EDA)

  • Analysis of class imbalances
  • Strategies for handling long sequences and class imbalance, such as dropping long sequences or padding/cropping to a fixed length
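One common way to act on a class-imbalance analysis is to weight each class inversely to its frequency. The sketch below is an illustration of that general technique, not a strategy the notebook is confirmed to use; `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(label_strings):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    counts = Counter("".join(label_strings))
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


weights = inverse_frequency_weights(["CCHHHHEC", "CCEE"])
print(weights)
```

With this formula a perfectly balanced dataset gives every class a weight of 1.0, and the weights could be passed to a weighted loss function during training.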

Usage

To use this notebook:

  1. Upload your protein sequence CSV to your storage (e.g., Google Drive).
  2. Update the file path in the notebook to point to your data.
  3. Run all cells sequentially to:
    • Install dependencies
    • Load and preprocess the data
    • Visualize class distributions
    • Prepare the data for CNN input
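Step 2 amounts to changing a single path variable. The sketch below shows the idea with the standard `csv` module and an in-memory sample so it runs anywhere; `DATA_PATH` is a hypothetical location, the sample rows are invented, and the notebook itself most likely reads the file with pandas instead.

```python
import csv
import io

# Hypothetical path; replace it with the location of your own CSV.
DATA_PATH = "/content/drive/MyDrive/protein_data.csv"

# Columns the later preprocessing steps rely on.
REQUIRED_COLUMNS = {"seq", "sst3", "sst8"}


def load_records(handle):
    """Read CSV rows into dicts and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# In-memory sample with invented rows, used here instead of a real file.
sample = io.StringIO(
    "pdb_id,chain_code,seq,sst8,sst3\n"
    "1ABC,A,MKT,CHH,CHH\n"
    "2XYZ,B,GS,CC,CC\n"
)
records = load_records(sample)
print(len(records), records[0]["seq"])

# With a real file, the call would look like this:
# with open(DATA_PATH, newline="") as f:
#     records = load_records(f)
```

Checking the column names up front surfaces a wrong file path or an unexpected CSV layout before any preprocessing runs.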

Requirements

  • Python 3.10+
  • PyTorch
  • pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn

You can install dependencies via:

pip install torch torchvision torchaudio pandas matplotlib numpy scikit-learn seaborn

Notebook Structure

  • Introduction & Setup: Project description and environment setup.
  • Data Loading & Analysis: Loads and analyzes the protein dataset.
  • Data Preprocessing: Filters, crops, and encodes sequences for CNN input.
  • EDA & Visualization: Shows class and amino acid distributions.

Notes

  • The notebook is designed for use in Google Colab but can be adapted to other environments.
  • The dataset path must be updated to reflect your local or cloud storage.
  • Preprocessing parameters are configurable for experimentation.
