Scalable ortholog computation using Amazon's Elastic MapReduce (EMR) and the Reciprocal Smallest Distance (RSD) algorithm
This repository contains the implementation described in:
- Wall, D.P., Kudtarkar, P., Fusaro, V., Pivovarov, R., Patil, P., & Tonellato, P. (2010). Cloud computing for comparative genomics. BMC Bioinformatics, 11(1), 259.
- Kudtarkar, P., DeLuca, T.F., Fusaro, V.A., Tonellato, P.J., & Wall, D.P. (2010). Cost-effective cloud computing: a case study using the comparative genomics tool Roundup. Evolutionary Bioinformatics, 6, 197–203.
This package enables distributed computation of orthologous genes across multiple genomes using Amazon's cloud infrastructure. By leveraging MapReduce, it significantly reduces the computational time required for large-scale comparative genomics analyses.
- Scalable: Process multiple genomes in parallel using cloud resources
- Cost-effective: Pay-as-you-go cloud computing model
- Distributed: Leverages Hadoop MapReduce for efficient computation
- Comprehensive: Includes BLAST pre-computation and RSD ortholog estimation
Repository layout:

```
cloud-computing-genomics/
├── Scripts/
│   ├── blastmapper.py            # Mapper for BLAST estimation
│   ├── rsdmapper.py              # Mapper for ortholog estimation
│   ├── generate_blastrunner.py   # Generate BLAST runner commands
│   └── generate_rsdrunner.py     # Generate RSD runner commands
├── RSD_standalone/               # RSD algorithm implementation
├── executables.tar.gz            # Required binaries (PAML, ClustalW)
├── blastout.sh                   # Setup BLAST output directories
├── example/                      # Example genomes and runner files
├── log/                          # EMR logs directory
├── blast_result/                 # BLAST results placeholder
└── ortholog_result/              # Ortholog results placeholder
```
Requirements:

- Python 2.7+
- Amazon S3 Tools (s3cmd)
- Elastic MapReduce Ruby CLI
- An active AWS account with S3 and EMR access
Important: Washington University BLAST 2.0 requires a license agreement before downloading.
You'll need:

- blastp (protein BLAST tool)
- xdget
- xdformat
- BLAST matrix folder
- ClustalW (included)
- PAML codeml (included)
Format your FASTA genomes with unique prefix identifiers:

- Each FASTA entry should have a unique prefix
- Strip problematic characters from name fields
- Format for blastp using xdformat

See example/genomes/ for properly formatted examples.
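As a small aid, here is a hypothetical pre-flight check (not part of this repository) that flags duplicate headers and characters likely to break downstream parsing before you format a genome:

```python
# Hypothetical pre-flight check (not part of the repo): flag duplicate
# FASTA headers and characters that tend to break downstream parsing.
import re
import sys

def check_fasta_headers(path):
    seen = set()
    clean = True
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            if not line.startswith(">"):
                continue
            header = line[1:].strip()
            if re.search(r"[^\w.|-]", header):
                print("line %d: problematic characters in %r" % (lineno, header))
                clean = False
            if header in seen:
                print("line %d: duplicate header %r" % (lineno, header))
                clean = False
            seen.add(header)
    return clean

if __name__ == "__main__":
    sys.exit(0 if check_fasta_headers(sys.argv[1]) else 1)
```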
Prepare the executables:

- Download the BLAST binaries (after licensing)
- Set permissions: `chmod 777 blastp xdget xdformat`
- Package them: `tar czf executables.tar.gz executables/`
Create a genomeslist file containing genome names, then:

```
# Generate BLAST runner
python generate_blastrunner.py \
  --source /path/to/genomeslist \
  --destination /path/to/blastrunner

# Generate RSD runner
python generate_rsdrunner.py \
  --source /path/to/genomeslist \
  --destination /path/to/rsdrunner
```
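The runner files are plain-text inputs for Hadoop streaming. As a rough, hypothetical sketch of the idea behind the generator scripts (the real format is defined by Scripts/generate_blastrunner.py and Scripts/generate_rsdrunner.py), assuming one work item per genome pair per line:

```python
# Hypothetical sketch only; see Scripts/generate_blastrunner.py for the
# actual runner format. Emits one line of work per ordered genome pair.
import itertools

def write_runner(genomeslist_path, runner_path):
    with open(genomeslist_path) as fh:
        genomes = [line.strip() for line in fh if line.strip()]
    with open(runner_path, "w") as out:
        for query, subject in itertools.permutations(genomes, 2):
            out.write("%s %s\n" % (query, subject))

write_runner("genomeslist", "blastrunner")
```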
Update the S3 bucket references in:

- RSD_standalone/Blast_compute.py (line 101)
- RSD_standalone/RSD.py (line 845)

Replace `<s3bucketname>` with your actual bucket name.
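A quick way to make both substitutions at once (a convenience snippet, not part of the repository; swap in your own bucket name):

```python
# Convenience snippet (not part of the repo): replace the
# <s3bucketname> placeholder with your actual bucket name.
for path in ("RSD_standalone/Blast_compute.py", "RSD_standalone/RSD.py"):
    with open(path) as fh:
        text = fh.read()
    with open(path, "w") as fh:
        fh.write(text.replace("<s3bucketname>", "your-bucket-name"))
```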
Package and upload everything to S3:

```
# Package genomes and RSD standalone
tar czf genomes.tar.gz genomes/
tar czf RSD_standalone.tar.gz RSD_standalone/

# Upload all required files
s3cmd put *.py *.tar.gz *.sh s3://your-bucket-name/
```

Launch the cluster and note its job flow ID:

```
./elastic-mapreduce --create --alive \
  --name "ortholog-computation" \
  --num-instances 4 \
  --instance-type c1.xlarge \
  --log-uri s3n://your-bucket/log

# Get job flow ID
./elastic-mapreduce --list --active
```
```
# Create placeholders
./elastic-mapreduce --jobflow YOUR_JOB_ID \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --args s3://your-bucket/blastout.sh

# Copy input files
./elastic-mapreduce --jobflow YOUR_JOB_ID \
  --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
  --args s3://your-bucket/blastrunner,hdfs:///home/hadoop/blastrunner
```
```
# Run BLAST mapper
./elastic-mapreduce -j YOUR_JOB_ID --stream \
  --input hdfs:///home/hadoop/blastrunner \
  --mapper s3n://your-bucket/blastmapper.py \
  --reducer NONE \
  --cache-archive s3n://your-bucket/executables.tar.gz#executables \
  --cache-archive s3n://your-bucket/genomes.tar.gz#genomes \
  --jobconf mapred.map.tasks=10 \
  --jobconf mapred.task.timeout=604800000

# Copy RSD runner
./elastic-mapreduce --jobflow YOUR_JOB_ID \
  --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
  --args s3://your-bucket/rsdrunner,hdfs:///home/hadoop/rsdrunner
```
```
# Run RSD mapper
./elastic-mapreduce -j YOUR_JOB_ID --stream \
  --input hdfs:///home/hadoop/rsdrunner \
  --mapper s3n://your-bucket/rsdmapper.py \
  --reducer NONE \
  --cache-archive s3n://your-bucket/executables.tar.gz#executables \
  --cache-archive s3n://your-bucket/genomes.tar.gz#genomes \
  --output hdfs:///home/hadoop/output
```
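Conceptually, both mapper scripts are thin wrappers: Hadoop streaming feeds each line of the runner file to the mapper on stdin, and the mapper executes the corresponding work item. A minimal, hypothetical sketch of that pattern (the real logic lives in Scripts/blastmapper.py and Scripts/rsdmapper.py):

```python
#!/usr/bin/env python
# Hypothetical streaming-mapper sketch; the real mappers are
# Scripts/blastmapper.py and Scripts/rsdmapper.py.
import subprocess
import sys

for line in sys.stdin:
    task = line.strip()
    if not task:
        continue
    # run the BLAST/RSD work item for one genome pair
    rc = subprocess.call(task, shell=True)
    # emit a tab-separated key/value pair so the task's outcome is recorded
    print("%s\t%d" % (task, rc))
```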
Monitor job progress using FoxyProxy with an SSH tunnel:

- Establish a SOCKS proxy on your local machine
- Create an SSH tunnel to the master node
- Access the Hadoop UI through your browser
When both steps have finished, shut the cluster down and fetch the results:

```
# Terminate cluster
./elastic-mapreduce --terminate -j YOUR_JOB_ID

# Download results
s3cmd get -r s3://your-bucket/ortholog_result/ ./results/
```

Useful Hadoop job parameters (c1.xlarge nodes expose 8 virtual cores, hence the 7-8 ceiling on concurrent map tasks):

| Parameter | Default | Description |
|---|---|---|
| `mapred.map.tasks` | 10 | Number of map tasks |
| `mapred.task.timeout` | 604800000 | Task timeout (ms) |
| `mapred.tasktracker.map.tasks.maximum` | 7-8 | Max map tasks per node |
| `mapred.map.tasks.speculative.execution` | false | Disable speculative execution |
Notes:

- Ensure all FASTA headers are unique and properly formatted
- Monitor S3 costs, as intermediate results are stored there
- Consider using spot instances for cost savings
- Results are stored in tab-delimited format (see the sketch below)
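As a starting point for downstream analysis, here is a minimal reader for the tab-delimited result files. The column layout (query, subject, distance) and the file path are assumptions; adjust both to match the files your run actually produced:

```python
# Minimal sketch for reading tab-delimited ortholog results.
# The (query, subject, distance) column layout is an assumption;
# adjust the indices to match your actual output files.
import csv

def read_orthologs(path):
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) >= 3:
                yield row[0], row[1], float(row[2])

# "results/example.txt" is a hypothetical path for illustration
for query_id, subject_id, distance in read_orthologs("results/example.txt"):
    print("%s\t%s\t%s" % (query_id, subject_id, distance))
```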
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or support, please open an issue on GitHub.
Acknowledgments:

- Amazon Web Services for cloud infrastructure
- Washington University for BLAST tools
- Contributors to PAML and ClustalW
For detailed methodology and performance benchmarks, please refer to our publications.