Protein Structure Prediction with CNNs

This notebook, CNN_Randy.ipynb, provides a workflow for predicting protein secondary structure from amino acid sequences using Convolutional Neural Networks (CNNs).

Overview

Protein structure prediction is a key challenge in bioinformatics because a protein's function is largely determined by its 3D structure. Determining structure experimentally is expensive and time-consuming; deep learning, particularly CNNs, offers a scalable computational alternative.

This notebook focuses on predicting two levels of secondary structure:

sst3: Three-state classification (helix, strand, coil)
sst8: Eight-state classification (a finer-grained subdivision of the three sst3 states)

Workflow

The notebook follows these main steps:

1. Introduction & Setup

Explains the challenge and project goals.
Installs the required Python packages, including PyTorch, pandas, scikit-learn, matplotlib, and seaborn.
Sets random seeds for reproducibility.
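
Seeding could look like the following minimal sketch. The helper name `set_seed` and the seed value 42 are assumptions for illustration, not taken from the notebook; NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators the workflow is likely to use."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws.
set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
```

Seeding makes runs repeatable on the same machine and library versions; some GPU operations can still be non-deterministic.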

2. Data Loading & Analysis

Loads the protein sequence dataset (CSV format, e.g., from Google Drive).
Inspects the dataset and its columns (e.g., pdb_id, chain_code, seq, sst3, sst8).
Performs descriptive statistics and visualizations:
- Sequence length distribution
- Amino acid frequency distribution
- Secondary structure class imbalance for sst3 and sst8
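
The statistics listed above can be sketched with the standard library alone. The rows below are invented placeholders, not real data; only the column names (pdb_id, chain_code, seq, sst3) come from this README, and the single-letter labels H, E, C for helix, strand, and coil are an assumption about the label alphabet.

```python
import csv
import io
from collections import Counter

# Invented toy rows standing in for the real CSV file.
TOY_CSV = """pdb_id,chain_code,seq,sst3
XXX1,A,MKVLA,CHHHC
XXX2,A,GSEEVKV,CCEEEEC
XXX3,B,AAL,CHC
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))

# Sequence length distribution
lengths = [len(r["seq"]) for r in rows]

# Amino acid frequencies over all residues
aa_counts = Counter("".join(r["seq"] for r in rows))

# Per-residue class frequencies for the three-state labels
sst3_counts = Counter("".join(r["sst3"] for r in rows))
total = sum(sst3_counts.values())
sst3_fractions = {label: n / total for label, n in sst3_counts.items()}

print(lengths)
print(aa_counts.most_common(3))
print(sst3_fractions)
```

With pandas, the same quantities could be obtained from `pd.read_csv(path)` followed by `df["seq"].str.len()` and a `Counter` over the concatenated columns.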

3. Data Preprocessing

User-configurable preprocessing parameters:
- Cutoff length for filtering sequences
- Padding/cropping length
- Option to filter non-standard amino acids
- Embedding method for encoding protein sequences (e.g., one-hot, learnable, ProtBERT, TAPE)
Functions for filtering, cropping, padding, and encoding sequences and secondary structure labels.
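
A minimal sketch of the filtering, cropping, padding, and one-hot encoding steps is shown below. All function names, the label alphabet (H, E, C), and the padding marker -1 are assumptions for illustration; the notebook's own functions may differ and would likely return NumPy arrays or PyTorch tensors rather than lists.

```python
from typing import List

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(STANDARD_AA)}
SST3_INDEX = {"H": 0, "E": 1, "C": 2}   # assumed label alphabet
PAD_LABEL = -1                          # assumed marker for padded positions


def keep_sequence(seq: str, cutoff: int, standard_only: bool = True) -> bool:
    """Return True if seq passes the length cutoff and residue filter."""
    if len(seq) > cutoff:
        return False
    return not standard_only or all(aa in AA_INDEX for aa in seq)


def one_hot_encode(seq: str, length: int) -> List[List[int]]:
    """Crop or zero-pad seq to `length` rows of 20 one-hot columns."""
    rows = []
    for aa in seq[:length]:
        row = [0] * len(STANDARD_AA)
        if aa in AA_INDEX:              # non-standard residues stay all-zero
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    rows.extend([0] * len(STANDARD_AA) for _ in range(length - len(rows)))
    return rows


def encode_labels(sst: str, length: int) -> List[int]:
    """Crop or pad a label string to `length` integer class ids."""
    ids = [SST3_INDEX[s] for s in sst[:length]]
    return ids + [PAD_LABEL] * (length - len(ids))


encoded = one_hot_encode("ACD", length=5)
labels = encode_labels("CHH", length=5)
```

Using a dedicated padding label lets a loss function skip padded positions, so they do not count as a real class during training.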

4. Exploratory Data Analysis (EDA)

Analysis of class imbalances
Strategies for dealing with long sequences and class imbalance (e.g., dropping long sequences or padding/cropping to a fixed length)
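
One common way to counter class imbalance, not named in this README and shown here only as an assumed option, is to weight each class by its inverse frequency. The function name and the toy labels below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def inverse_frequency_weights(label_strings: Iterable[str]) -> Dict[str, float]:
    """Weight each class by total / (n_classes * class_count)."""
    counts = Counter("".join(label_strings))
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Invented toy labels: 6 coil (C), 3 helix (H), 1 strand (E) residues
weights = inverse_frequency_weights(["CCCHHH", "CCCE"])
print(weights)
```

Rarer classes receive larger weights. In PyTorch, such weights could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, whose `ignore_index` argument can additionally mask padded positions.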

Usage

To use this notebook:

Upload your protein sequence CSV to your storage (e.g., Google Drive).
Update the file path in the notebook to point to your data.
Run all cells sequentially to:
- Install dependencies
- Load and preprocess the data
- Visualize class distributions
- Prepare the data for CNN input

Requirements

Python 3.10+
PyTorch
pandas
NumPy
Matplotlib
seaborn
scikit-learn

You can install dependencies via:

pip install torch torchvision torchaudio pandas matplotlib numpy scikit-learn seaborn

File Structure

Introduction & Setup: Project description and environment setup.
Data Loading & Analysis: Loads and analyzes the protein dataset.
Data Preprocessing: Filters, crops, and encodes sequences for CNN input.
EDA & Visualization: Shows class and amino acid distributions.

Notes

The notebook is designed for use in Google Colab but can be adapted to other environments.
The dataset path must be updated to reflect your local or cloud storage.
Preprocessing parameters are configurable for experimentation.
