A pipeline for compound data curation and target prediction analysis using the PubChem API and SwissTargetPrediction.
This project provides a two-stage pipeline for compound analysis:
- Data Curation Stage: Extract CID and SMILES from compound names using the PubChem API
- Target Prediction Stage: Predict molecular targets using the SwissTargetPrediction web interface
- Automated PubChem Integration: Fetch compound IDs and SMILES strings from compound names
- Robust Name Handling: Supports Greek letters (α, β, γ) and name variations
- Quality Control: Validates and cleans data before analysis
- Comprehensive Logging: Detailed logs for troubleshooting and monitoring
- Error Handling: Retry mechanisms and fallback options
- Semi-Automated SwissTarget: Browser automation for target prediction
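The PubChem lookup behind the first feature can be sketched as below. This is a minimal illustration using PubChem's public PUG REST interface and the standard library, not the notebook's actual code; the function names `build_property_url`, `parse_properties`, and `lookup` are hypothetical. The key under which PubChem returns the SMILES string may differ between releases, so the parser accepts any key ending in "SMILES".

```python
import json
import urllib.error
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name: str) -> str:
    """Build a PUG REST URL asking for the SMILES of a compound name."""
    # Percent-encode the name so spaces and Greek letters survive the request.
    quoted = urllib.parse.quote(name, safe="")
    return f"{PUG_REST}/compound/name/{quoted}/property/CanonicalSMILES/JSON"


def parse_properties(payload: dict):
    """Return (cid, smiles) from a property response, or None if it is empty."""
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    if not rows:
        return None
    first = rows[0]
    # Accept any SMILES-like key rather than hard-coding one spelling.
    smiles = next((v for k, v in first.items() if k.upper().endswith("SMILES")), None)
    return first.get("CID"), smiles


def lookup(name: str, timeout: float = 20.0):
    """Fetch (cid, smiles) for one name; None when PubChem has no match."""
    try:
        with urllib.request.urlopen(build_property_url(name), timeout=timeout) as resp:
            return parse_properties(json.load(resp))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # PubChem answers 404 for unknown names
            return None
        raise
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline.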
curated_compound_analysis/
├── notebooks/
│   └── curated_compound_swiss_target.ipynb
├── scripts/
│   └── swisstarget.py
├── data/
│   └── compound_sample_data.csv
├── requirements.txt
├── README.md
└── docs/
    ├── installation.md
    ├── usage.md
    └── troubleshooting.md
- Python 3.7 or higher
- Google Chrome browser (for SwissTarget automation)
- ChromeDriver (see ChromeDriver Setup)
git clone https://github.com/yourusername/curated_compound_analysis.git
cd curated_compound_analysis
pip install -r requirements.txt
The script will automatically download ChromeDriver using webdriver-manager.
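For reference, the automatic download typically looks like the sketch below when using Selenium 4 with webdriver-manager. The function name `make_driver` is hypothetical and the real `swisstarget.py` may configure Chrome differently; the 60-second page-load timeout matches the configuration listed in this README.

```python
def make_driver():
    """Start Chrome with a ChromeDriver fetched by webdriver-manager."""
    # Imported lazily so this module can be loaded before dependencies are installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # install() downloads a driver matching the local Chrome and returns its path.
    driver_path = ChromeDriverManager().install()
    driver = webdriver.Chrome(service=Service(driver_path))
    driver.set_page_load_timeout(60)  # seconds
    return driver
```

If the automatic download fails (for example behind a proxy), fall back to the manual steps.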
- Check your Chrome version: chrome://version/
- Download the matching ChromeDriver from ChromeDriver Downloads
- Extract and place chromedriver.exe in the project folder
- Download ChromeDriver
- Add to system PATH
- Verify: chromedriver --version
- Prepare Input Data
  - Create a CSV file with compound names in a 'Name' column
  - Example: compound_data.csv
- Run Curation Notebook
  jupyter notebook notebooks/curated_compound_swiss_target.ipynb
- Execute All Cells
  - Cell 1: Setup and configuration
  - Cell 2: Input data validation
  - Cell 3: PubChem API functions
  - Cell 4: Data processing and CID/SMILES extraction
  - Cell 5: Quality control and cleaning
  - Cell 6: Download results (optional)
- Output Files
  - compounds_with_cid_smiles.csv: All processing results
  - data_compounds_final_full.csv: Clean dataset for SwissTarget
  - curation_log.txt: Detailed processing log
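The quality control step that turns the full results into the clean dataset can be pictured roughly as follows. This is a sketch under assumptions: the column names `CID` and `SMILES` and the rule of dropping duplicate CIDs are illustrative and may not match what the notebook does.

```python
def clean_rows(rows):
    """Keep rows that have both a CID and a SMILES, dropping duplicate CIDs."""
    seen = set()
    cleaned = []
    for row in rows:
        cid = str(row.get("CID") or "").strip()
        smiles = str(row.get("SMILES") or "").strip()
        if not cid or not smiles:
            continue  # lookup failed or returned an incomplete record
        if cid in seen:
            continue  # same compound already kept under another name
        seen.add(cid)
        cleaned.append(row)
    return cleaned
```

Rows are plain dictionaries here (for example from `csv.DictReader`), so every other column passes through unchanged.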
- Prepare Input
  - Use the clean dataset from Stage 1
  - Rename to data_compounds_final_full.csv
- Run SwissTarget Script
  python scripts/swisstarget.py
- Manual Interaction Required
  - Script opens Chrome browser for each compound
  - Download results manually (CSV/Excel/PDF)
  - Take screenshots as needed
  - Close browser to continue to next compound
- Output Organization
  - Results saved in timestamped folders
  - Individual folders for each compound
  - Processing logs included
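One way to produce the output layout described above is sketched here. The folder prefix `results_`, the timestamp format, and the helper names are assumptions for illustration; the script's real naming scheme may differ.

```python
import re
from datetime import datetime
from pathlib import Path


def safe_folder_name(name: str) -> str:
    """Replace characters that are awkward in file names with underscores."""
    cleaned = re.sub(r"[^\w.-]+", "_", name.strip())
    return cleaned.strip("_") or "unnamed"


def make_run_folders(base: Path, compound_names, now=None):
    """Create <base>/results_<timestamp>/<compound>/ and return the paths."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    run_dir = base / f"results_{stamp}"
    folders = {}
    for name in compound_names:
        folder = run_dir / safe_folder_name(name)
        folder.mkdir(parents=True, exist_ok=True)
        folders[name] = folder
    return run_dir, folders
```

Because `\w` matches Unicode letters in Python 3, a name such as L-α-PALMITIN keeps its Greek letter in the folder name.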
Your input CSV should have this structure:
"Name","Formula","Annot. DeltaMass [ppm]","Calc. MW","RT [min]","Area (Max.)"
"Rhynchophylline","C22 H28 N2 O4","-3.08","384.20372","6.81","23288814985.171"
"Mitragynine","C23 H30 N2 O4","-3.35","398.21922","8.635","14364893452.408"
"deacetylvindoline","C23 H30 N2 O5","-2.8","414.21431","7.443","5066565186.2213"
"L-α-PALMITIN","C19 H38 O4","-3.15","330.27597","14.341","925221458.83439"
"1-Stearoylglycerol","C21 H42 O4","-3.24","358.30715","15.378","693539239.29014"
Additional columns are preserved in the output.
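A quick way to check an input file against this structure before running the notebook is sketched below, using only the standard library. The function name `load_compounds` is hypothetical, and the second sample row (with a blank name) is made up to show how empty names are skipped.

```python
import csv
import io


def load_compounds(handle, name_column="Name"):
    """Read the CSV and return (rows, skipped), where skipped counts blank names."""
    reader = csv.DictReader(handle)
    if name_column not in (reader.fieldnames or []):
        raise ValueError(f"input CSV needs a '{name_column}' column")
    rows, skipped = [], 0
    for row in reader:
        name = (row[name_column] or "").strip()
        if not name:
            skipped += 1
            continue
        row[name_column] = name
        rows.append(row)  # every other column is carried along untouched
    return rows, skipped


# Usage with an in-memory sample; for a real file, open it with
# open(path, newline="", encoding="utf-8") so Greek letters are read correctly.
sample = io.StringIO('"Name","Formula"\n"Mitragynine","C23 H30 N2 O4"\n" ","C19 H38 O4"\n')
rows, skipped = load_compounds(sample)
```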
- Request delay: 0.5 seconds (respects rate limits)
- Timeout: 20 seconds per request
- Retry mechanism: Name variations for better success rates
- Browser timeout: 60 seconds for page load
- Results timeout: 240 seconds for predictions
- Retry attempts: 3 per compound
- Target organism: Homo sapiens (default)
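The request delay and the name-variation retry listed above could be combined along these lines. The specific variation rules (spelling out Greek letters, lowercasing, replacing hyphens with spaces) are assumptions chosen for illustration, and `lookup` stands for any function that returns `None` when a name is not found.

```python
import time

GREEK_TO_LATIN = {"α": "alpha", "β": "beta", "γ": "gamma"}


def name_variations(name: str):
    """Yield candidate spellings of a compound name, original first."""
    seen = set()
    spelled = name
    for greek, latin in GREEK_TO_LATIN.items():
        spelled = spelled.replace(greek, latin)
    for candidate in (name, spelled, spelled.lower(), spelled.replace("-", " ")):
        candidate = candidate.strip()
        if candidate and candidate not in seen:
            seen.add(candidate)
            yield candidate


def resolve(name, lookup, delay=0.5):
    """Try each variation in turn, pausing between requests; return the first hit."""
    for i, candidate in enumerate(name_variations(name)):
        if i:
            time.sleep(delay)  # space out requests to respect rate limits
        result = lookup(candidate)
        if result is not None:
            return candidate, result
    return None
```

Passing the lookup function in as an argument keeps the retry logic independent of the network, so it can be exercised with a stub.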
Typical success rates based on compound types:
- Well-known compounds: 85-95%
- Natural products: 70-85%
- Synthetic compounds: 60-80%
- Rare/proprietary compounds: 40-60%
- ChromeDriver Version Mismatch
  Solution: Update ChromeDriver to match your Chrome version
- PubChem API Timeout
  Solution: Check your internet connection and check compound names for typos
- SwissTarget Page Load Issues
  Solution: Retry on a stable internet connection and check that the website is accessible
- Missing Dependencies
  pip install --upgrade -r requirements.txt
For detailed diagnostics, check the generated log files:
- curation_log.txt: PubChem processing logs
- process_log.txt: SwissTarget automation logs
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- PubChem for compound data API
- SwissTargetPrediction for target prediction service
- Selenium for web automation
- ChromeDriver for browser automation
If you encounter issues or have questions:
- Check the troubleshooting guide
- Search existing issues
- Create a new issue with detailed information
If you use this pipeline in your research, please cite:
@software{curated_compound_analysis,
title={Compound Analysis Pipeline: Automated PubChem and SwissTarget Integration},
author={Nandatama, Engki},
year={2024},
url={https://github.com/yourusername/curated_compound_analysis}
}
⭐ Star this repository if you find it helpful! ⭐