ProtDomRetrieverSuite

ProtDomRetrieverSuite builds on ProtDomRetriever, adding a comprehensive graphical interface and extended functionality for protein domain analysis. It retains core features, such as retrieving domain information from InterPro, while introducing support for AlphaFold structure downloads and domain-specific PDB structure processing.

Created by Nicolas-Frédéric Lipp, PhD

ProtDomRetriever Illustration: A visual representation of ProtDomRetrieverSuite’s functionality, generated using AI tools. The exact AI prompt is available in assets/ai_prompt_example.txt.
GUI Screenshot: An example of the graphical user interface of ProtDomRetrieverSuite, illustrating the input configuration, optional steps, and progress tracking.

Keywords

Protein Domain Analysis • Bioinformatics Tools • InterPro • UniProt • AlphaFold • Structural Bioinformatics

Features

Core Features (from ProtDomRetriever)

Retrieve domain information for multiple UniProtKB accessions
Filter domains based on specified InterPro entries
Select longest domains when multiple entries overlap
Generate TSV output with domain ranges
Create FASTA files for the retrieved protein domains
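
The "longest domain" rule above can be illustrated with a short sketch. This is not the suite's actual implementation; it is a minimal greedy approach, assuming each domain is a hypothetical `(entry, start, end)` tuple with 1-based inclusive positions, and assuming that any shared residue counts as an overlap.

```python
from typing import List, Tuple

# Hypothetical record layout: (entry accession, start, end), 1-based inclusive
Domain = Tuple[str, int, int]


def select_longest_domains(domains: List[Domain]) -> List[Domain]:
    """Keep the longest domains and drop any domain overlapping a kept one."""
    kept: List[Domain] = []
    # Visit candidates longest first; ties are broken by start position
    by_length = sorted(domains, key=lambda d: (-(d[2] - d[1] + 1), d[1]))
    for candidate in by_length:
        _, start, end = candidate
        # A candidate is accepted only if it shares no residue with a kept domain
        if all(end < k_start or start > k_end for _, k_start, k_end in kept):
            kept.append(candidate)
    # Return the survivors in sequence order
    return sorted(kept, key=lambda d: d[1])


if __name__ == "__main__":
    hits = [("IPR018159", 10, 120), ("SSF46966", 15, 100), ("IPR018159", 200, 260)]
    print(select_longest_domains(hits))
```

In this example the `SSF46966` hit lies inside the longer `IPR018159` hit, so only the two non-overlapping `IPR018159` ranges are kept.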

New Features

Modern graphical user interface with dark mode
Real-time progress tracking and logging
AlphaFold structure download integration
PDB structure trimming based on domain ranges
Improved error handling and recovery
Multi-threaded processing for better performance

System Requirements

Hardware Requirements

Display resolution: Minimum 870x800 pixels
RAM: 4GB minimum (≥ 8GB recommended for large datasets)
Storage: Space requirements depend on dataset size and features used:
- Basic analysis: < 100MB
- With AlphaFold/PDB structures: ~300KB per structure
- With trimmed structures: additional ~300KB per structure
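
For planning purposes, the figures above can be turned into a rough disk-space estimate. The helper below is only a back-of-the-envelope sketch: the function name is hypothetical, the 100 MB base is treated as an upper bound, and 1 MB is taken as 1000 KB.

```python
def estimate_storage_mb(n_structures: int, trimmed: bool = False,
                        kb_per_structure: float = 300.0,
                        base_mb: float = 100.0) -> float:
    """Rough upper-bound estimate of disk usage in MB."""
    # Trimming stores a second file of roughly the same size per structure
    per_structure_kb = kb_per_structure * (2 if trimmed else 1)
    return base_mb + n_structures * per_structure_kb / 1000.0


if __name__ == "__main__":
    # 1000 proteins with AlphaFold download and trimming enabled
    print(estimate_storage_mb(1000, trimmed=True), "MB")
```

With these assumptions, 1000 structures need about 400 MB, or about 700 MB when trimming is enabled.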

Software Requirements

Python 3.8 or newer
Internet connection for API access (InterPro, UniProt, AlphaFold)
Graphics system capable of supporting tkinter GUI

Operating Systems

macOS (Sequoia 15.0+ supported)
Linux
Windows

Quick Installation

To get started, make sure you have Python 3.8+ installed. Open a terminal/command prompt and install ProtDomRetrieverSuite using either method:

Option 1: Direct Installation from GitHub

pip install git+https://github.com/NicoFrL/protdomretrieversuite.git

Option 2: Local Installation

# Clone the repository
git clone https://github.com/NicoFrL/protdomretrieversuite.git

# Navigate to the cloned repository
cd protdomretrieversuite

# Install the package using pip
pip install .

For detailed installation instructions, including system-specific setup and troubleshooting, see INSTALL.md. For example, on macOS Sequoia 15.0+ the terminal may print "Python[XXXXX:XXXXX] +[IMKInputSession subclass]: chose IMKInputSession_Legacy"; this message is harmless and can be ignored.

Configuration

The application automatically saves your last used configuration (input/output paths, selected options) to config.json and restores it on next launch for a smoother workflow. An example configuration file config.json.example.json is provided in the repository.

Configuration options:

input_file: Path to input file containing UniProtKB accessions
output_dir: Directory where analysis results will be saved
enable_fasta_retrieval: Download FASTA sequences with domain positions
enable_af_download: Download AlphaFold structures
enable_pdb_trimming: Enable domain-based PDB structure trimming
accept_custom_pdbs: Allow using custom PDB files
custom_pdb_strict: Strict validation for custom PDB files
pdb_source_dir: Directory containing custom PDB files
interpro_entries: InterPro entries for domain filtering (comma-separated)
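
As an illustration of how such a configuration could be read and written, here is a minimal sketch using the option names listed above. The value types (booleans for the `enable_*` and PDB flags, strings for paths and entries) and the function names are assumptions; refer to config.json.example.json for the layout the application actually uses.

```python
import json
from pathlib import Path

# Option names come from the list above; the default values are assumptions
DEFAULT_CONFIG = {
    "input_file": "",
    "output_dir": "",
    "enable_fasta_retrieval": False,
    "enable_af_download": False,
    "enable_pdb_trimming": False,
    "accept_custom_pdbs": False,
    "custom_pdb_strict": True,
    "pdb_source_dir": "",
    "interpro_entries": "",
}


def load_config(path: str) -> dict:
    """Return the stored configuration, falling back to defaults."""
    config = dict(DEFAULT_CONFIG)
    file = Path(path)
    if file.exists():
        try:
            stored = json.loads(file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            stored = {}  # A corrupt file is ignored rather than fatal
        if isinstance(stored, dict):
            # Unknown keys are dropped so stale options cannot leak in
            config.update({k: v for k, v in stored.items() if k in DEFAULT_CONFIG})
    return config


def save_config(path: str, config: dict) -> None:
    """Write the configuration as indented JSON."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
```

Merging the stored values over a default dictionary keeps the file forward-compatible: options missing from an older config.json simply take their default value.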

Quick Start

Launch the application from your terminal/command prompt: protdomretrieversuite
Select the example input file tests/seed_test/input_test1.txt
Select an output folder, for example tests/seed_test/output/
Enter the example entries from tests/seed_test/entries_test1.txt
Press "▶ Run Analysis"!

Usage

Preparing Input Data:

Create an input file: Prepare a .txt file containing a list of UniProtKB accessions in a single column with no header. These accessions should correspond to the proteins you want to analyze. UniProtKB accessions (Swiss-Prot/TrEMBL) provide a universal protein numbering system that ensures accurate identification.
Select InterPro features: Decide which protein features you want to analyze, using entries from either the InterPro database or its consortium member databases.
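
Before launching an analysis, it can help to check the input file for malformed accessions. The sketch below is independent of the suite: it uses the accession pattern documented by UniProt, and the function name and the choice to drop duplicates are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Accession format as documented by UniProt (6 or 10 characters)
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def read_accessions(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split input lines into valid accessions and rejected entries."""
    valid: List[str] = []
    rejected: List[str] = []
    seen = set()
    for raw in lines:
        accession = raw.strip().upper()
        if not accession:
            continue  # Skip blank lines
        if UNIPROT_ACCESSION.match(accession):
            if accession not in seen:  # Keep the first occurrence only
                seen.add(accession)
                valid.append(accession)
        else:
            rejected.append(raw.strip())
    return valid, rejected


if __name__ == "__main__":
    sample = ["Q02201", "P12345", "", "A0AA96LI61", "not_an_id"]
    print(read_accessions(sample))
```

To check a real file, pass an open file handle, for example `read_accessions(open("input_test1.txt", encoding="utf-8"))`.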

Examples of Databases and Entries

Below are examples of protein classification databases and the entry formats that ProtDomRetriever accepts. Use them as a reference when specifying the InterPro entries to analyze:

Database	Entry Format (Example)
InterPro	`IPR000001`
CATH-Gene3D	`G3DSA:1.10.10.10`
CDD	`cd00001`
HAMAP	`MF_00001`
PANTHER	`PTHR10000`
Pfam	`PF00001`
PIRSF	`PIRSF000005`
PRINTS	`PR00001`
PROSITE Patterns	`PS00001`
PROSITE Profiles	`PS01031`
SMART	`SM00002`
SFLD	`SFLDF00001`
SUPERFAMILY	`SSF100879`
NCBIfam	`NF000124`
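
The entry formats in the table can be recognized programmatically, which is handy for catching typos before a run. The patterns below are inferred only from the examples in the table (prefix followed by digits), so they are assumptions rather than official format definitions, and the function name is hypothetical.

```python
import re
from typing import Optional

# (database name, pattern inferred from the example entries in the table)
ENTRY_PATTERNS = [
    ("InterPro", r"IPR\d+"),
    ("CATH-Gene3D", r"G3DSA:\d+(?:\.\d+){3}"),
    ("CDD", r"cd\d+"),
    ("HAMAP", r"MF_\d+"),
    ("PANTHER", r"PTHR\d+"),
    ("Pfam", r"PF\d+"),
    ("PIRSF", r"PIRSF\d+"),
    ("PRINTS", r"PR\d+"),
    ("PROSITE", r"PS\d+"),  # Patterns and profiles share the PS prefix
    ("SMART", r"SM\d+"),
    ("SFLD", r"SFLD[A-Z]\d+"),
    ("SUPERFAMILY", r"SSF\d+"),
    ("NCBIfam", r"NF\d+"),
]


def guess_database(entry: str) -> Optional[str]:
    """Return the member database an entry appears to belong to, or None."""
    candidate = entry.strip()
    for name, pattern in ENTRY_PATTERNS:
        if re.fullmatch(pattern, candidate):
            return name
    return None


if __name__ == "__main__":
    for entry in ["IPR000001", "G3DSA:1.10.10.10", "SSF100879", "XYZ123"]:
        print(entry, "->", guess_database(entry))
```

Because each pattern must match the whole entry, prefixes that look alike (`PR`, `PS`, `PF`, `PIRSF`, `PTHR`) cannot be confused with one another.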

For more information about protein classification (family, domain, sequence feature) and protein signatures (patterns, profiles, fingerprints, hidden Markov models (HMMs)), see the EMBL-EBI tutorial.

Starting the Application

protdomretrieversuite

Using the Interface

Select an input file containing UniProtKB accessions (one per line)
```
# Example input file:
Q02201
P12345
A0AA96LI61
```
Choose an output directory for results (any location on your computer)
Enter InterPro entries for domain filtering (one per line or comma-separated)
```
# Example InterPro entries:
IPR018159
SSF46966

# or
IPR018159, SSF46966
```
Select optional processing steps:
- FASTA sequence retrieval
  - Downloads a single FASTA file with domain positions in the headers
- AlphaFold structure download
- PDB structure trimming
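
Since entries may be given one per line or comma-separated, both layouts can be normalized into a single list. The following sketch shows one way to do that; it is not taken from the suite, and treating `#` as a comment marker is an assumption based on the examples above.

```python
import re
from typing import List


def parse_entries(text: str) -> List[str]:
    """Normalize entries given one per line and/or comma-separated."""
    entries: List[str] = []
    for line in text.splitlines():
        # Assumption: anything after '#' is a comment and is ignored
        line = line.split("#", 1)[0]
        for token in re.split(r"[,\s]+", line):
            if token and token not in entries:  # Drop blanks and duplicates
                entries.append(token)
    return entries


if __name__ == "__main__":
    print(parse_entries("IPR018159\nSSF46966"))
    print(parse_entries("IPR018159, SSF46966"))
```

Both calls produce the same list, so the two input styles shown above are interchangeable under this scheme.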

Output Files

The suite generates several output files, depending on the selected options:

File Name	Description
`domain_analysis.tsv`	Comprehensive domain information for all input proteins in a tab-separated file.
`domain_ranges.txt`	Text file listing the start and end ranges of the detected domains.
`domain_sequences.fasta`	Contains FASTA sequences of domains if the retrieval option is enabled.
`alphafold_structures/`	A directory storing AlphaFold-predicted structures downloaded during analysis.
`trimmed_structures/`	Stores PDB files trimmed to match the specific domain ranges.
`trimming_summary.json`	Trimming information, including timestamps, sources, and the number of processed files with their paths.
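
To make the idea behind `trimmed_structures/` concrete, the sketch below trims a PDB file to a residue range using the fixed-column layout of the PDB format (chain ID in column 22, residue number in columns 23-26). It is a simplified illustration and not the suite's trimming code: the function name is hypothetical, insertion codes are ignored, and all non-coordinate records are dropped.

```python
from typing import Iterable, List, Optional


def trim_pdb_lines(lines: Iterable[str], start: int, end: int,
                   chain: Optional[str] = None) -> List[str]:
    """Keep ATOM/HETATM records whose residue number is within [start, end]."""
    kept: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line[:6].strip() not in ("ATOM", "HETATM"):
            continue  # Headers, TER, CONECT, etc. are dropped in this sketch
        try:
            res_seq = int(line[22:26])  # Columns 23-26: residue sequence number
        except ValueError:
            continue  # Malformed or truncated record
        if chain is not None and line[21] != chain:  # Column 22: chain ID
            continue
        if start <= res_seq <= end:
            kept.append(line)
    kept.append("END")
    return kept


if __name__ == "__main__":
    template = "ATOM  {0:>5}  CA  ALA A{1:>4}       0.000   0.000   0.000  1.00  0.00           C"
    structure = [template.format(i, i) for i in range(1, 11)]
    for record in trim_pdb_lines(structure, 3, 5):
        print(record)
```

The residue numbering in AlphaFold models follows the UniProtKB sequence, which is why domain ranges expressed in UniProtKB coordinates can be applied directly to the residue number column.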

Performance Notes

Multi-threaded processing for efficient API requests
Rate limiting implemented to respect API guidelines
Memory usage scales with input size
For large datasets (>1000 proteins), consider:
- Breaking input into smaller batches
- Ensuring stable internet connection
- Having sufficient disk space for structure files
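
The combination of batching, multi-threading, and rate limiting described above can be sketched as follows. This is a generic pattern rather than the suite's code: the names are hypothetical, `fetch_one` stands in for whatever function performs the actual API request, and the worker count and request interval are placeholder values, not the limits used by the application.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield list(items[i:i + size])


class RateLimiter:
    """Space out calls from all threads by at least `min_interval` seconds."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next free slot before releasing the lock
            self._next_allowed = max(now, self._next_allowed) + self._min_interval
        if delay:
            time.sleep(delay)


def fetch_all(accessions: Sequence[str], fetch_one: Callable[[str], object],
              max_workers: int = 4, min_interval: float = 0.1) -> Dict[str, object]:
    """Run `fetch_one` for every accession in a rate-limited thread pool."""
    limiter = RateLimiter(min_interval)

    def task(accession: str) -> object:
        limiter.wait()
        return fetch_one(accession)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs results correctly
        return dict(zip(accessions, pool.map(task, accessions)))


if __name__ == "__main__":
    ids = ["Q02201", "P12345", "A0AA96LI61"]
    for batch in batched(ids, 2):
        print(fetch_all(batch, lambda acc: len(acc)))
```

Processing one batch at a time keeps memory usage bounded, and a failed batch can be rerun without repeating the whole dataset.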

Examples

Example datasets are provided in the tests directory:

Test Dataset 1 (input_test1.txt, entries_test1.txt)
Test Dataset 2 (input_test2.txt, entries_test2.txt)
Test Dataset 3 (input_test3.txt, entries_test3.txt)

Support

If you encounter any issues or have questions, you can:

Check the log files in your output directory for detailed debugging information.
Open an issue on the GitHub repository.
Contact the developer directly through GitHub.

Author

Nicolas-Frédéric Lipp, PhD
https://github.com/NicoFrL

License

This project is distributed under a Custom Academic and Non-Commercial License.
It is free to use for educational, research, and non-profit purposes.
For commercial use, please refer to the LICENSE file or contact the author for more information.

Acknowledgments

InterPro database and API: includes rate limiting guidelines and API details on GitHub.
AlphaFold DB: provides access to predicted protein structures with API documentation.
UniProt database: comprehensive protein information with programmatic access details.

Development Notes

This project was developed with assistance from AI language models to enhance code structure, adhere to best practices, and improve documentation. The scientific approach and core algorithm were entirely designed and implemented by the author.
