This notebook, CNN_Randy.ipynb, provides a comprehensive workflow for predicting protein secondary structure from amino acid sequences using Convolutional Neural Networks (CNNs).
Protein structure prediction is a key challenge in bioinformatics, as a protein's function is determined by its 3D structure. Experimental methods to determine structure are expensive and time-consuming; deep learning, particularly CNNs, offers a scalable computational alternative.
This notebook focuses on predicting two levels of secondary structure:
- sst3: Three-state classification (helix, strand, coil)
- sst8: Eight-state classification (more granular)
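The two label sets are related: eight-state labels are commonly collapsed into three states. The following is a minimal sketch of that reduction, assuming the labels use the DSSP letter codes (H, G, I, E, B, T, S, C) and the widely used grouping of H/G/I as helix, E/B as strand, and everything else as coil. The helper name `sst8_to_sst3` is hypothetical, and the dataset's own sst3 column may follow a different convention.

```python
# Commonly used reduction of DSSP eight-state labels to three states.
# Assumption: labels use DSSP letter codes; the notebook's dataset may differ.
SST8_TO_SST3 = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10, and pi helices -> helix
    "E": "E", "B": "E",            # extended strand, isolated bridge -> strand
    "T": "C", "S": "C", "C": "C",  # turn, bend, coil -> coil
}


def sst8_to_sst3(sst8: str) -> str:
    """Collapse an eight-state label string into a three-state one (hypothetical helper)."""
    return "".join(SST8_TO_SST3[label] for label in sst8)
```

An unknown label raises a `KeyError` here, which surfaces unexpected codes early instead of silently mapping them to coil.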
The notebook follows these main steps:
- Explains the challenge and project goals.
- Installs the required Python packages (PyTorch, pandas, scikit-learn, matplotlib, seaborn, and others).
- Sets random seeds for reproducibility.
- Loads the protein sequence dataset (CSV format, e.g., from Google Drive).
- Inspects the dataset and its columns (e.g., pdb_id, chain_code, seq, sst3, sst8).
- Performs descriptive statistics and visualizations:
  - Sequence length distribution
  - Amino acid frequency distribution
  - Secondary structure class imbalance for sst3 and sst8
- Exposes user-configurable preprocessing parameters:
  - Cutoff length for filtering sequences
  - Padding/cropping length
  - Option to filter non-standard amino acids
  - Embedding method for encoding protein sequences (e.g., one-hot, learnable, ProtBERT, TAPE)
- Defines functions for filtering, cropping, padding, and encoding sequences and secondary structure labels.
- Analyzes class imbalances.
- Describes strategies for handling long sequences and class imbalance (e.g., dropping long sequences, padding/cropping).
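The filtering, cropping, padding, and one-hot encoding steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the function names, the `-` padding character, and the choice to encode padding and unknown residues as all-zero rows are all hypothetical.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def keep_sequence(seq: str, cutoff_len: int, standard_only: bool = True) -> bool:
    """Return True if the sequence passes the length and residue filters."""
    if len(seq) > cutoff_len:
        return False
    if standard_only and any(aa not in AA_INDEX for aa in seq):
        return False
    return True


def crop_or_pad(seq: str, target_len: int, pad_char: str = "-") -> str:
    """Crop long sequences and right-pad short ones to exactly target_len."""
    return seq[:target_len].ljust(target_len, pad_char)


def one_hot(seq: str) -> List[List[int]]:
    """Encode each residue as a 20-dimensional indicator row."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:  # padding and unknown residues stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


# Usage with made-up sequences: only "MKV" passes both filters.
seqs = ["ACDX", "ACDEFG", "MKV"]
kept = [s for s in seqs if keep_sequence(s, cutoff_len=5)]
encoded = [one_hot(crop_or_pad(s, target_len=4)) for s in kept]
```

Each encoded sequence is a `target_len` by 20 nested list, which could then be converted to a tensor for CNN input. The label strings would need the same cropping and padding so that positions stay aligned.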
To use this notebook:
- Upload your protein sequence CSV to your storage (e.g., Google Drive).
- Update the file path in the notebook to point to your data.
- Run all cells sequentially to:
  - Install dependencies
  - Load and preprocess the data
  - Visualize class distributions
  - Prepare the data for CNN input
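As a rough sketch of the load-and-inspect steps, the snippet below reads a CSV and computes the per-residue fraction of each sst3 class, which is the quantity a class distribution plot would show. It uses only the standard library so that it runs without dependencies (the notebook itself relies on pandas), and an in-memory string with made-up rows stands in for the real file. The function name `class_fractions` is hypothetical.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for the real CSV; the rows are made up for illustration.
SAMPLE_CSV = """pdb_id,chain_code,seq,sst3
1ABC,A,MKVLA,CHHHC
2XYZ,B,GEEK,CEEC
"""


def class_fractions(rows, label_column="sst3"):
    """Return the per-residue fraction of each secondary structure class."""
    counts = Counter()
    for row in rows:
        counts.update(row[label_column])  # counts each label character
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
fractions = class_fractions(rows)
```

To run this against a real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for your dataset path.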
Requirements:
- Python 3.10+
- PyTorch
- pandas
- NumPy
- Matplotlib
- Seaborn
- scikit-learn
You can install dependencies via:
pip install torch torchvision torchaudio pandas matplotlib numpy scikit-learn seaborn
The notebook is organized into the following sections:
- Introduction & Setup: Project description and environment setup.
- Data Loading & Analysis: Loads and analyzes the protein dataset.
- Data Preprocessing: Filters, crops, and encodes sequences for CNN input.
- EDA & Visualization: Shows class and amino acid distributions.
Notes:
- The notebook is designed for use in Google Colab but can be adapted to other environments.
- The dataset path must be updated to reflect your local or cloud storage.
- Preprocessing parameters are configurable for experimentation.
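One way to keep the configurable parameters together is a small validated settings object. The sketch below is hypothetical: the class name, field names, and default values are placeholders and are not taken from the notebook. Only the list of embedding methods comes from the description above.

```python
from dataclasses import dataclass

# Embedding options named in the notebook description.
EMBEDDING_METHODS = ("one-hot", "learnable", "ProtBERT", "TAPE")


@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical container for the preprocessing options; defaults are placeholders."""

    cutoff_length: int = 1000        # drop sequences longer than this
    pad_length: int = 512            # pad or crop every sequence to this length
    filter_nonstandard: bool = True  # drop sequences with non-standard residues
    embedding: str = "one-hot"

    def __post_init__(self):
        # Fail early on values that would break later preprocessing steps.
        if self.cutoff_length <= 0 or self.pad_length <= 0:
            raise ValueError("lengths must be positive")
        if self.embedding not in EMBEDDING_METHODS:
            raise ValueError(f"unknown embedding method: {self.embedding!r}")


config = PreprocessConfig(pad_length=256, embedding="learnable")
```

Freezing the dataclass keeps one experiment's settings from being changed by accident partway through a run.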