Complete collection of 6 bioinformatics projects — covering sequence alignment, multiple alignment, profile HMMs, phylogenetic trees, and virus genome classification using neural networks.

Bioinformatics Course - Complete Project Collection

A comprehensive collection of 6 bioinformatics projects covering fundamental algorithms, sequence analysis, phylogenetics, and machine learning applications in computational biology. Developed as part of the Bioinformatics course at Amirkabir University of Technology.

📋 Overview

This repository contains a complete journey through computational biology and bioinformatics, from basic biological concepts to advanced machine learning applications for viral genome classification. Each project builds upon previous knowledge, creating a comprehensive learning path in modern bioinformatics.

Course Information:

  • Institution: Amirkabir University of Technology (Tehran Polytechnic) - Spring 2022
  • Author: Amirmehdi Zarrinnezhad

What You'll Find:

  • 5 Theoretical & Programming Assignments + 1 Final Machine Learning Project
  • Complete implementations with detailed documentation
  • Step-by-step explanations of algorithms and methodologies
  • Real biological data analysis and interpretation
  • Runnable code with a README for every project

🧬 Projects

Project 1: Basic Biology

Type: Theoretical Analysis
Topics: Stem cells, Gene expression, DNA function, Vestigial traits

A foundational exploration of molecular biology concepts essential for understanding bioinformatics algorithms.

Key Concepts:

  • Stem cell biology and differentiation
  • Gene diversity vs genetic variability
  • Central Dogma: DNA → RNA → Protein
  • Vestigial traits as evolutionary evidence
  • Genotype-phenotype relationships

Deliverables:

  • Comprehensive theoretical report
  • Critical analysis of biological systems
  • Evolutionary biology examples

Project 2: Pairwise Alignment

Type: Programming + Theoretical
Topics: Semi-global alignment, Dynamic programming, Scoring matrices

Implementation of sequence alignment algorithms for comparing protein sequences.

Key Features:

  • ✅ Semi-global alignment algorithm (Needleman-Wunsch variant)
  • ✅ PAM250 substitution matrix
  • ✅ Multiple optimal alignment detection
  • ✅ Manual calculations (Needleman-Wunsch, Smith-Waterman)

Technologies: Python (pure implementation, no libraries)

Highlights:

# Finds ALL optimal alignments
aligned_seqs = semi_global_alignment(seq1, seq2, PAM250, gap_penalty=-9)

# Example output:
# Score: 20
# HEAGAWGHE-
# ---PAW-HEA
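
The dynamic-programming idea behind semi-global alignment can be sketched as follows. This is a minimal illustration, not the repository's `semi_global_alignment` function: it returns only the best score (no traceback), it assumes end gaps are free on both sequences, and `toy_score` is a hypothetical match/mismatch scorer standing in for PAM250.

```python
def semi_global_score(a, b, score, gap=-9):
    """Best semi-global alignment score: leading and trailing gaps are free."""
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stay at 0 because leading gaps are not penalised.
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    # Trailing gaps are free, so the optimum lies in the last row or last column.
    return max(max(dp[-1]), max(row[-1] for row in dp))


def toy_score(x, y):
    """Hypothetical scorer used only for this example (not PAM250)."""
    return 2 if x == y else -1


print(semi_global_score("ACGT", "CGTA", toy_score))
```

Listing every optimal alignment, as the project does, would additionally require a traceback from each cell that attains the maximum.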

Project 3: Multiple Alignment

Type: Programming + Theoretical
Topics: Star alignment, Block-based refinement, FASTA, BLAST

Advanced MSA implementation with iterative improvement and database search analysis.

Key Features:

  • ✅ Star alignment algorithm (center-star heuristic)
  • ✅ Block-based iterative refinement
  • ✅ Sum-of-pairs scoring
  • ✅ FASTA vs BLAST comparison
  • ✅ Algorithm complexity analysis

Technologies: Python

Performance:

Initial Score: 51
Final Score:   60  (+17.6% improvement)
Method:        Iterative block realignment
Convergence:   Automatic detection
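
Sum-of-pairs scoring, the objective that the refinement step improves, can be sketched as below. The match, mismatch, and gap values are illustrative assumptions; the project may use a substitution matrix and different penalties.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum pairwise scores over every column of an alignment.

    `msa` is a list of equal-length strings that use '-' for gaps.
    """
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # gap-gap pairs score 0 in this sketch
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

An iterative refinement loop would realign one block, recompute this score, and keep the change only if the score increases.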

Theoretical Topics:

  • FASTA k-tuple matching
  • BLAST word search trees
  • Heuristic vs exhaustive search

Project 4: Profile & HMM

Type: Programming + Theoretical
Topics: PSSM, Profile search, PSI-BLAST, HMMs

Profile-based sequence search with pseudocount smoothing and HMM analysis.

Key Features:

  • ✅ Profile construction from MSA
  • ✅ Position-Specific Scoring Matrix (PSSM)
  • ✅ Log-odds scoring with pseudocount
  • ✅ Subsequence search with gap insertion
  • ✅ PSI-BLAST mechanism analysis

Technologies: Python, math library

Algorithm:

# Build profile from MSA
profile = build_profile(msa, pseudocount=2)

# Search query sequence
best_match = search_with_profile(query, profile)
# Output: H-L-P (with optimal gap placement)
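
For readers unfamiliar with log-odds profiles, here is a minimal sketch of how a PSSM with pseudocounts might be computed. It is not the repository's `build_profile`: the function names, the uniform background distribution, and the treatment of '-' as an ordinary alphabet symbol are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_pssm(msa, alphabet, pseudocount=2):
    """Return one {symbol: log2-odds score} dict per alignment column.

    Assumes every character in `msa` (including '-') appears in `alphabet`
    and that the background distribution is uniform.
    """
    background = 1 / len(alphabet)
    denom = len(msa) + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*msa):
        counts = Counter(column)
        pssm.append({
            s: math.log2(((counts[s] + pseudocount) / denom) / background)
            for s in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Score a gapped or ungapped window of the same length as the profile."""
    return sum(col[ch] for col, ch in zip(pssm, window))


pssm = build_pssm(["HLP", "HLP", "H-P"], alphabet="HLP-")
print(round(score_window(pssm, "HLP"), 3))
```

The pseudocount keeps unseen symbols from receiving a score of negative infinity, which matters when the alignment contains only a few sequences.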

Theoretical Topics:

  • PSI-BLAST and profile drift
  • Forward algorithm (HMM)
  • Viterbi algorithm (optimal path)
  • Sequence logos
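
As a companion to the HMM topics above, the Viterbi recursion can be sketched in a few lines. The two-state model (AT-rich vs GC-rich regions) and all of its probabilities are invented for this example and do not come from the course material.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability, path) of the most likely state sequence."""
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
        for s in states
    }
    for symbol in obs[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximises the path probability into s.
            score, path = max(
                (best[p][0] + math.log(trans_p[p][s]), best[p][1]) for p in states
            )
            step[s] = (score + math.log(emit_p[s][symbol]), path + [s])
        best = step
    return max(best.values())


states = ("AT-rich", "GC-rich")
start_p = {"AT-rich": 0.5, "GC-rich": 0.5}
trans_p = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
emit_p = {
    "AT-rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC-rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
log_prob, path = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path)
```

Replacing `max` with a log-sum over predecessors turns the same recursion into the forward algorithm, which gives the total probability of the sequence rather than the single best path.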

Project 5: Phylogenetic Trees

Type: Theoretical Analysis
Topics: UPGMA, Neighbor-Joining, Parsimony, Maximum Likelihood

Comprehensive exploration of phylogenetic tree construction methods.

Key Topics:

  • ✅ Distance-based methods (UPGMA, NJ)
  • ✅ Character-based methods (Parsimony)
  • ✅ Probabilistic methods (Maximum Likelihood)
  • ✅ Tree comparison and evaluation
  • ✅ Algorithm complexity analysis

Methods Compared:

Method      Speed      Accuracy  Molecular Clock
UPGMA       Fast       Moderate  Required
NJ          Fast       Good      Not required
Parsimony   Slow       Good      Not required
ML          Very Slow  Best      Flexible

Analysis Includes:

  • Manual UPGMA tree construction
  • Neighbor-Joining with corrected distances
  • Parsimony scoring (Fitch's algorithm)
  • ML probability calculations
  • Exhaustive search vs branch-and-bound
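
Parsimony scoring with Fitch's algorithm can be illustrated for a single character on a fixed rooted binary tree. The nested-tuple tree encoding and the leaf names are assumptions chosen to keep the sketch short; the project itself works through these calculations by hand.

```python
def fitch(tree, leaf_states):
    """Return (candidate state set at the root, minimum number of changes).

    `tree` is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D")).
    `leaf_states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):                  # leaf: observed state, no changes
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:                                 # children agree: no extra change
        return common, left_cost + right_cost
    # Children disagree: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


tree = (("A", "B"), ("C", "D"))
states = {"A": "G", "B": "G", "C": "T", "D": "G"}
print(fitch(tree, states))
```

Summing this cost over all alignment columns gives the parsimony score of one tree topology; exhaustive search or branch-and-bound then compares topologies.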

Project 6: Machine Learning

Type: Machine Learning Project
Topics: K-mer features, Neural networks, Multi-class classification

Virus genome classification with a multilayer perceptron (MLP) trained on k-mer frequency features.

Problem:

  • Classify DNA sequences into 6 virus classes
  • Variable-length sequences (hundreds to thousands of bp)
  • Small dataset (1,320 training samples)

Solution:

DNA Sequence → K-mer Extraction (k=2) → 16 Features → MLP → Class (1-6)
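
The feature-extraction step of this pipeline can be sketched with `itertools.product`. The function name is hypothetical, and normalising by the number of counted windows is an assumption about what "length-normalized" means in the notebook.

```python
from itertools import product


def kmer_features(sequence, k=2, alphabet="ACGT"):
    """Return k-mer frequencies in a fixed order (4**k values for DNA)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:           # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on very short input
    return [counts[kmer] / total for kmer in kmers]


features = kmer_features("ACGTACGTTTGA")
print(len(features))
```

Because the frequencies sum to 1 regardless of sequence length, genomes of a few hundred and a few thousand base pairs map to vectors on the same scale.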

Model Architecture:

Input (16 features)
    ↓
Hidden Layer 1 (64 neurons, ReLU)
    ↓
Hidden Layer 2 (64 neurons, ReLU)
    ↓
Hidden Layer 3 (64 neurons, ReLU)
    ↓
Output (6 classes, Softmax)
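
To make the layer shapes concrete, the following pure-Python forward pass mirrors the 16 → 64 → 64 → 64 → 6 layout. It is only a sketch: the weights are random and untrained, and all helper names are hypothetical. The project itself uses scikit-learn, where a comparable configuration would presumably be `MLPClassifier(hidden_layer_sizes=(64, 64, 64), activation="relu")`.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random small weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    activations = features
    for weights, biases in layers[:-1]:       # hidden layers use ReLU
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]              # output layer uses softmax
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
sizes = [16, 64, 64, 64, 6]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward([1 / 16] * 16, layers)
print(len(probs))
```

The predicted class is the index of the largest probability; training adjusts the weights so that this index matches the labelled virus class.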

Technologies: Python, scikit-learn, pandas, numpy

Performance:

  • 100% accuracy on development set (180 samples)
  • ~100% accuracy on test set (400 samples)
  • Top 10% performance in class
  • CPU training (<5 minutes)

Key Innovation:

  • Optimal k-mer size (k=2) through systematic experimentation
  • Length-normalized features for fair comparison
  • Balanced architecture preventing overfitting

🛠️ Technologies Used

Programming Languages

  • Python 3.x - Primary language for all implementations

Core Libraries

Data Processing:

  • pandas - Data manipulation and CSV handling
  • numpy - Numerical computations
  • itertools - Combinatorial operations (k-mer generation)

Machine Learning:

  • scikit-learn - MLP classifier, metrics, preprocessing
  • tensorflow/keras - Alternative deep learning experiments

Bioinformatics:

  • Custom implementations (no external bio libraries)
  • Pure Python algorithms for educational purposes

Visualization:

  • matplotlib - Learning curves and performance plots

Development Tools

  • Jupyter Notebook - Interactive development and documentation
  • Git - Version control
  • Quera - Submission and evaluation platform

📊 Learning Path

Project 1: Basic Biology
    ↓ (Understand biological foundations)
Project 2: Pairwise Alignment
    ↓ (Dynamic programming, scoring schemes)
Project 3: Multiple Alignment
    ↓ (Heuristic algorithms, database search)
Project 4: Profile & HMM
    ↓ (Position-specific scoring, probabilistic models)
Project 5: Phylogenetic Trees
    ↓ (Evolutionary relationships, tree construction)
Project 6: Machine Learning
    ↓ (Deep learning for genome classification)

Complete Bioinformatics Pipeline!

Skills Progression:

  1. Theoretical foundations → Biological understanding
  2. Algorithm implementation → Dynamic programming
  3. Heuristic methods → Speed vs accuracy trade-offs
  4. Probabilistic models → HMMs, likelihood
  5. Phylogenetic analysis → Evolutionary inference
  6. Machine learning → Modern AI applications

🚀 Getting Started

Prerequisites

Python 3.7 or higher
pip (Python package manager)

Installation

Clone the repository:

git clone https://github.com/zamirmehdi/Bioinformatics-Course.git
cd Bioinformatics-Course

Install dependencies:

pip install pandas numpy scikit-learn matplotlib jupyter

Running Projects

For Python scripts:

cd "2- Pairwise Sequence Alignment/src"
python semi_global_alignment.py < input.txt

For Jupyter notebooks:

cd "Virus Classification (Final Project)/src"
jupyter notebook BioInformatics_FinalProject.ipynb

For theoretical projects:

  • Navigate to project folder
  • Review README.md for detailed explanations
  • Check Report.pdf for solutions (Persian)

📁 Repository Structure

Bioinformatics-Course/
│
├── 1- Basic biology/
│   ├── Instruction.pdf
│   ├── Report.pdf
│   └── README.md
│
├── 2- Pairwise Sequence Alignment/
│   ├── docs/
│   │   ├── Programming Instruction.pdf
│   │   └── Theoretical/
│   │       ├── Instruction.pdf
│   │       └── Report.pdf
│   ├── src/
│   │   └── semi_global_alignment.py
│   └── README.md
│
├── 3- Multiple Sequence Alignment - DB Search/
│   ├── docs/
│   │   ├── Programming Instruction MSA.pdf
│   │   └── Theoretical/
│   │       ├── Instruction.pdf
│   │       ├── Report.pdf
│   │       └── cstar.pdf
│   ├── src/
│   │   └── main.py
│   └── README.md
│
├── 4- Profile - Hidden Markov model/
│   ├── docs/
│   │   ├── Programming Instruction - Profile.pdf
│   │   └── Theoretical/
│   │       ├── Instruction.pdf
│   │       └── Report.pdf
│   ├── src/
│   │   └── Profile.py
│   └── README.md
│
├── 5- Phylogenetic Trees/
│   ├── Instruction.pdf
│   ├── Report.pdf
│   └── README.md
│
├── Virus Classification (Final Project)/
│   ├── data/
│   │   ├── training_set.csv
│   │   ├── development_set.csv
│   │   └── test_set.csv
│   ├── docs/
│   │   ├── Instruction.pdf
│   │   └── Report.pdf
│   ├── src/
│   │   └── BioInformatics_FinalProject.ipynb
│   ├── README.md (Part 1)
│   └── README_PART2.md (Part 2)
│
└── README.md (This file)

📊 Quick Links

Project                  Type           Status       README
1. Basic Biology         Theory         ✅ Complete  View
2. Sequence Alignment    Code + Theory  ✅ Complete  View
3. MSA & DB Search       Code + Theory  ✅ Complete  View
4. Profile & HMM         Code + Theory  ✅ Complete  View
5. Phylogenetic Trees    Theory         ✅ Complete  View
6. Virus Classification  ML Project     ✅ Complete  View

🎓 Key Learning Outcomes

After completing these projects, you will be able to:

Algorithms & Data Structures

✅ Implement dynamic programming for sequence alignment
✅ Design heuristic algorithms for NP-hard problems
✅ Optimize time and space complexity
✅ Handle variable-length biological data

Bioinformatics

✅ Perform pairwise and multiple sequence alignment
✅ Search biological databases efficiently
✅ Build and use profiles for sequence search
✅ Construct phylogenetic trees
✅ Apply machine learning to genomic data

Machine Learning

✅ Extract features from biological sequences
✅ Design neural network architectures
✅ Prevent overfitting on small datasets
✅ Evaluate models with proper metrics
✅ Tune hyperparameters systematically

Software Engineering

✅ Write clean, documented, maintainable code
✅ Structure projects professionally
✅ Create comprehensive documentation
✅ Use version control (Git)
✅ Follow best practices


📈 Project Statistics

Metric                  Value
Total Projects          6 (5 assignments + 1 final)
Lines of Code           ~2,000+
Programming Projects    4
Theoretical Projects    2
Algorithms Implemented  15+
Documentation Pages     100+ (combined READMEs)
Test Cases Passed       100%
Final Grade             Excellent (Top 10%)

🏆 Highlights & Achievements

Technical Achievements

  • 100% test case success across all programming projects
  • Near-perfect ML model (100% dev, ~100% test accuracy)
  • Efficient implementations (CPU-only, fast execution)
  • Multiple optimal solutions (Project 2 - finds ALL alignments)
  • Iterative refinement (Project 3 - automatic improvement)

Documentation Quality

  • Comprehensive READMEs for every project
  • Step-by-step explanations with examples
  • Visual diagrams and algorithm illustrations
  • Code comments in English
  • Bilingual support (English docs, Persian reports)

Academic Impact

  • Top 10% performance in final project
  • Complete assignment portfolio (6/6 completed)
  • High-quality reports with detailed analysis
  • Reproducible results with clear instructions

🔬 Real-World Applications

Medical & Clinical

  • Disease diagnosis: Sequence-based pathogen identification
  • Personalized medicine: Genetic variant analysis
  • Drug discovery: Protein target identification
  • Epidemiology: Outbreak tracking and surveillance

Research & Academia

  • Evolutionary biology: Phylogenetic studies
  • Comparative genomics: Cross-species analysis
  • Protein function: Homology-based prediction
  • Gene discovery: Novel sequence identification

Biotechnology

  • Genetic engineering: CRISPR guide design
  • Synthetic biology: Sequence optimization
  • Bioinformatics tools: Algorithm development
  • Data analysis: High-throughput sequencing

📚 References & Resources

Textbooks

  • Biological Sequence Analysis - Durbin et al.
  • Introduction to Computational Molecular Biology - Setubal & Meidanis
  • Algorithms on Strings, Trees, and Sequences - Gusfield

🤝 Contributing

While this is a personal academic repository, contributions are welcome!

Ways to contribute:

  • 🐛 Report bugs or issues
  • 💡 Suggest improvements
  • 📖 Improve documentation
  • ✨ Add new features or optimizations

Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

ℹ️ Course Information

Author: Amirmehdi Zarrinnezhad
Course: Bioinformatics
University: Amirkabir University of Technology (Tehran Polytechnic) - Fall 2022
Language: English (README), Persian (Instruction and Report PDFs)
GitHub Link: https://github.com/zamirmehdi/Bioinformatics-Course


📧 Contact

Questions or collaborations? Feel free to reach out!
📧 Email: amzarrinnezhad@gmail.com
💬 Open an Issue
🌐 GitHub: @zamirmehdi


If you found this project helpful, please consider giving it a star!

Amirmehdi Zarrinnezhad
