A comparative genomics algorithm, the reciprocal smallest distance (RSD) algorithm, adapted to run within Amazon's Elastic Compute Cloud (EC2). The project was developed at the Wall Lab and the Laboratory for Personalized Medicine (Tonellato Lab) at the Center for Biomedical Informatics (CBMI), Harvard Medical School.

Parul-Kudtarkar/cloud-computing-genomics

Cloud Computing for Comparative Genomics

Scalable ortholog computation using Amazon's Elastic MapReduce (EMR) and the Reciprocal Smallest Distance (RSD) algorithm

Publications

This repository contains the implementation described in:

Overview

This package enables distributed computation of orthologous genes across multiple genomes using Amazon's cloud infrastructure. The BLAST and RSD work is split into independent MapReduce map tasks that run in parallel across the cluster, which reduces the wall-clock time required for large-scale comparative genomics analyses.

Key Features

  • Scalable: Process multiple genomes in parallel using cloud resources
  • Cost-effective: Pay-as-you-go cloud computing model
  • Distributed: Leverages Hadoop MapReduce for efficient computation
  • Comprehensive: Includes BLAST pre-computation and RSD ortholog estimation
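
For readers new to RSD, the reciprocity test at its core can be sketched as follows. This is a toy illustration on a precomputed distance table, not the code in RSD_standalone/: the function name, the sequence identifiers, and the distance values are all hypothetical, and the real pipeline derives candidates from BLAST hits and distances from the ClustalW and PAML codeml steps.

```python
def reciprocal_smallest_distance(dist_ab, dist_ba):
    """Return (a, b, distance) triples where a and b are each other's closest sequence.

    dist_ab maps each sequence in genome A to {sequence in B: distance};
    dist_ba is the same kind of table computed in the opposite direction.
    """
    orthologs = []
    for a, hits in sorted(dist_ab.items()):
        if not hits:
            continue
        b = min(hits, key=hits.get)            # closest B sequence to a
        back = dist_ba.get(b, {})
        if back and min(back, key=back.get) == a:  # is a also closest to b?
            orthologs.append((a, b, hits[b]))
    return orthologs


# Toy distances (hypothetical identifiers and values).
dist_ab = {"A1": {"B1": 0.12, "B2": 0.40}, "A2": {"B1": 0.30, "B2": 0.35}}
dist_ba = {"B1": {"A1": 0.12, "A2": 0.30}, "B2": {"A1": 0.40, "A2": 0.35}}
pairs = reciprocal_smallest_distance(dist_ab, dist_ba)
print(pairs)
```

Here A2's closest sequence is B1, but B1's closest sequence is A1, so only the A1/B1 pair passes the reciprocal test.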

Repository Structure

cloud-computing-genomics/
├── Scripts/
│   ├── blastmapper.py         # Mapper for BLAST estimation
│   ├── rsdmapper.py           # Mapper for ortholog estimation
│   ├── generate_blastrunner.py # Generate BLAST runner commands
│   └── generate_rsdrunner.py   # Generate RSD runner commands
├── RSD_standalone/            # RSD algorithm implementation
├── executables.tar.gz        # Required binaries (PAML, ClustalW)
├── blastout.sh               # Setup BLAST output directories
├── example/                  # Example genomes and runner files
├── log/                      # EMR logs directory
├── blast_result/             # BLAST results placeholder
└── ortholog_result/          # Ortholog results placeholder

Prerequisites

Software Requirements

Required Binaries

Important: Washington University BLAST 2.0 requires a license agreement before downloading.

You'll need:

  • blastp (protein BLAST tool)
  • xdget
  • xdformat
  • BLAST matrix folder
  • ClustalW (included)
  • PAML codeml (included)

Installation & Setup

Step 1: Prepare Genomes

Format your FASTA genomes with unique prefix identifiers:

# Each FASTA entry should have a unique prefix
# Strip problematic characters from name fields
# Format for blastp using xdformat

See example/genomes/ for properly formatted examples.
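
A minimal sketch of the header clean-up described above, in case it helps to see the steps concretely. It assumes that the identifier is the first whitespace-delimited token of the header and that any character outside letters, digits, '.', '_' and '-' counts as problematic; both are assumptions, so compare against the files in example/genomes/ before relying on it. The function name and the "hsa" prefix are hypothetical.

```python
import re


def prefix_fasta_headers(lines, prefix):
    """Yield FASTA lines with `prefix` prepended to each header identifier."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token as the identifier.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # Assumption: replace anything outside [A-Za-z0-9._-] with '_'.
            name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
            yield f">{prefix}_{name}"
        else:
            yield line  # sequence lines pass through unchanged


records = [">sp|P12345|ABC_HUMAN some description", "MKTAYIAK", ">seq 2", "MSLN"]
cleaned = list(prefix_fasta_headers(records, "hsa"))
print("\n".join(cleaned))
```

Note that this sketch drops the free-text description after the identifier; keep it if downstream tools need it.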

Step 2: Configure Executables

  1. Download BLAST binaries (after licensing)
  2. Set permissions:
    chmod 755 blastp xdget xdformat
  3. Package executables:
    tar czf executables.tar.gz executables/

Step 3: Generate Runner Files

Create a genomeslist file containing genome names, then:

# Generate BLAST runner
python generate_blastrunner.py \
    --source /path/to/genomeslist \
    --destination /path/to/blastrunner

# Generate RSD runner
python generate_rsdrunner.py \
    --source /path/to/genomeslist \
    --destination /path/to/rsdrunner
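
The line format that generate_blastrunner.py and generate_rsdrunner.py actually write is not reproduced here. The sketch below only illustrates how a genomes list might expand into pairwise work items, assuming one genome name per line, one BLAST job per ordered genome pair, and one RSD job per unordered pair. Those assumptions and the function names are illustrative, not taken from the scripts.

```python
from itertools import combinations, permutations


def read_genomes(text):
    """Parse a genomes list: one genome name per non-empty line (assumed format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def blast_pairs(genomes):
    """Ordered (query, subject) pairs: each genome is searched against every other."""
    return list(permutations(genomes, 2))


def rsd_pairs(genomes):
    """Unordered genome pairs: one ortholog estimation per pair."""
    return list(combinations(genomes, 2))


genomes = read_genomes("genomeA\ngenomeB\ngenomeC\n")
print(len(blast_pairs(genomes)), "BLAST jobs,", len(rsd_pairs(genomes)), "RSD jobs")
```

Under these assumptions the number of work items grows quadratically with the number of genomes, which is worth keeping in mind when choosing the cluster size.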

Step 4: Configure S3 Bucket

Update S3 bucket references in:

  • RSD_standalone/Blast_compute.py (line 101)
  • RSD_standalone/RSD.py (line 845)

Replace <s3bucketname> with your actual bucket name.
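
If you prefer not to edit the files by hand, the substitution can be scripted. This is a small sketch that assumes the placeholder appears literally as <s3bucketname> in the two files listed above; the helper name set_bucket is hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "<s3bucketname>"


def set_bucket(path, bucket):
    """Replace every placeholder occurrence in `path`; return how many were replaced."""
    p = Path(path)
    text = p.read_text()
    count = text.count(PLACEHOLDER)
    if count:
        p.write_text(text.replace(PLACEHOLDER, bucket))
    return count


# Example usage (run from the repository root):
# for name in ("RSD_standalone/Blast_compute.py", "RSD_standalone/RSD.py"):
#     print(name, set_bucket(name, "your-bucket"))
```

A return value of 0 means the placeholder was not found, which is a useful check that the file has not already been edited.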

Step 5: Upload to S3

# Package genomes and RSD standalone
tar czf genomes.tar.gz genomes/
tar czf RSD_standalone.tar.gz RSD_standalone/

# Upload all required files
s3cmd put *.py *.tar.gz *.sh /path/to/blastrunner /path/to/rsdrunner s3://your-bucket/
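
The streaming steps later in this README reference the archives as executables.tar.gz and genomes.tar.gz, so the packaged files need gzip compression and a .tar.gz name. A sketch of the same packaging step using Python's standard tarfile module is shown below; the helper name make_archive is hypothetical.

```python
import tarfile
from pathlib import Path


def make_archive(directory, out_dir="."):
    """Create <name>.tar.gz from `directory`, keeping the directory as the top-level entry."""
    directory = Path(directory)
    archive = Path(out_dir) / f"{directory.name}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths relative, e.g. genomes/<file> rather than an absolute path
        tar.add(str(directory), arcname=directory.name)
    return archive


# Example usage (run from the directory that contains genomes/ and RSD_standalone/):
# for name in ("genomes", "RSD_standalone"):
#     print(make_archive(name))
```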

Running on AWS EMR

1. Create EMR Cluster

./elastic-mapreduce --create --alive \
    --name "ortholog-computation" \
    --num-instances 4 \
    --instance-type c1.xlarge \
    --log-uri s3n://your-bucket/log

2. Setup HDFS Directories

# Get job flow ID
./elastic-mapreduce --list --active

# Create placeholders
./elastic-mapreduce --jobflow YOUR_JOB_ID \
    --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
    --args s3://your-bucket/blastout.sh

3. Run BLAST Pre-computation

# Copy input files
./elastic-mapreduce --jobflow YOUR_JOB_ID \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
    --args s3://your-bucket/blastrunner,hdfs:///home/hadoop/blastrunner

# Run BLAST mapper
./elastic-mapreduce -j YOUR_JOB_ID --stream \
    --input hdfs:///home/hadoop/blastrunner \
    --mapper s3n://your-bucket/blastmapper.py \
    --reducer NONE \
    --cache-archive s3n://your-bucket/executables.tar.gz#executables \
    --cache-archive s3n://your-bucket/genomes.tar.gz#genomes \
    --jobconf mapred.map.tasks=10 \
    --jobconf mapred.task.timeout=604800000

4. Run Ortholog Estimation

# Copy RSD runner
./elastic-mapreduce --jobflow YOUR_JOB_ID \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
    --args s3://your-bucket/rsdrunner,hdfs:///home/hadoop/rsdrunner

# Run RSD mapper
./elastic-mapreduce -j YOUR_JOB_ID --stream \
    --input hdfs:///home/hadoop/rsdrunner \
    --mapper s3n://your-bucket/rsdmapper.py \
    --reducer NONE \
    --cache-archive s3n://your-bucket/executables.tar.gz#executables \
    --cache-archive s3n://your-bucket/genomes.tar.gz#genomes \
    --output hdfs:///home/hadoop/output
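
Both streaming steps above run map-only jobs (--reducer NONE): Hadoop streaming passes the lines of the runner file to the mapper on standard input, and whatever the mapper prints becomes the job output. The sketch below shows the general shape of such a mapper, assuming each runner line is a complete command to execute. It is written for Python 3 and is not the code in blastmapper.py or rsdmapper.py; the function names are hypothetical.

```python
import shlex
import subprocess
import sys


def run_commands(lines, runner=subprocess.call):
    """Run each non-empty input line as a command; return (command, exit status) pairs."""
    results = []
    for line in lines:
        command = line.strip()
        if not command:
            continue
        # shlex.split turns the line into an argument list without invoking a shell.
        results.append((command, runner(shlex.split(command))))
    return results


def main():
    # Hadoop streaming reads map output from stdout as key<TAB>value lines.
    for command, status in run_commands(sys.stdin):
        print(f"{command}\t{status}")


# Dry run with a stand-in runner so nothing is executed here.
demo = run_commands(["echo a b\n", "\n"], runner=lambda argv: 0)
print(demo)
# A real mapper script would call main() instead of the dry run above.
```

Passing the runner as a parameter keeps the loop testable without launching BLAST or RSD.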

Monitoring

Monitor job progress in the Hadoop web UI by using FoxyProxy together with an SSH tunnel to the master node:

  1. Establish SOCKS proxy on local machine
  2. Create SSH tunnel to master node
  3. Access Hadoop UI through browser
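
The first two steps are typically a single ssh invocation with dynamic port forwarding (-D), which opens a local SOCKS proxy that FoxyProxy can point at. The sketch below only builds that command. The local port 8157, the key file path, and the master host name are placeholders chosen for illustration, and the "hadoop" login is assumed to be the cluster's SSH user.

```python
def socks_tunnel_command(master_dns, key_file, local_port=8157, user="hadoop"):
    """Build an ssh command that opens a local SOCKS proxy through the master node."""
    return [
        "ssh",
        "-i", key_file,         # private key of the EC2 key pair used for the cluster
        "-N",                   # do not run a remote command, only forward
        "-D", str(local_port),  # dynamic (SOCKS) forwarding on this local port
        f"{user}@{master_dns}",
    ]


cmd = socks_tunnel_command("ec2-203-0-113-10.compute-1.amazonaws.com", "mykey.pem")
print(" ".join(cmd))
# To open the tunnel from Python: subprocess.run(cmd) (blocks until interrupted).
```

With the tunnel open, configure FoxyProxy to use a SOCKS proxy at localhost on the chosen port for the cluster's host names.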

Retrieving Results

# Terminate cluster
./elastic-mapreduce --terminate -j YOUR_JOB_ID

# Download results
s3cmd get -r s3://your-bucket/ortholog_result/ ./results/

Configuration Options

MapReduce Parameters

Parameter                                 Default     Description
mapred.map.tasks                          10          Number of map tasks
mapred.task.timeout                       604800000   Task timeout in ms (7 days)
mapred.tasktracker.map.tasks.maximum      7-8         Maximum concurrent map tasks per node
mapred.map.tasks.speculative.execution    false       Speculative execution of map tasks (disabled)
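
The timeout value is easier to reason about as a duration: 604800000 ms is 7 days. The sketch below derives that number and formats per-job settings as --jobconf arguments of the form used in the streaming commands earlier. The helper names are hypothetical, and the settings dictionary is only an example.

```python
from datetime import timedelta


def timeout_ms(**duration):
    """Convert a duration such as days=7 into milliseconds."""
    return int(timedelta(**duration).total_seconds() * 1000)


def jobconf_args(settings):
    """Format settings as repeated --jobconf key=value arguments."""
    args = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = str(value).lower()  # Hadoop expects true/false in lowercase
        args += ["--jobconf", f"{key}={value}"]
    return args


settings = {
    "mapred.map.tasks": 10,
    "mapred.task.timeout": timeout_ms(days=7),
    "mapred.map.tasks.speculative.execution": False,
}
print(" ".join(jobconf_args(settings)))
```

A long timeout matters here because a single BLAST or RSD task can run for hours without writing output, and Hadoop would otherwise kill it as unresponsive.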

Notes

  • Ensure all FASTA headers are unique and properly formatted
  • Monitor S3 costs as intermediate results are stored there
  • Consider using spot instances for cost savings
  • Results are stored in tab-delimited format

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or support, please open an issue on GitHub.

Acknowledgments

  • Amazon Web Services for cloud infrastructure
  • Washington University for BLAST tools
  • Contributors to PAML and ClustalW

For detailed methodology and performance benchmarks, please refer to our publications.
