Architectural decisions and rationale. For introduction, see README. For API details, see SPEC.
Design goals:
- Interpretability: Model decisions trace back to functional groups and structural patterns
- Modularity: Mix and match GNN architectures, chemical rules, and subgraph patterns
- Extensibility: Add new components without modifying core code
- Accessibility: Configure experiments using chemistry terminology
Decision: Express all components (data, models, knowledge) in relational logic via PyNeuraLogic.
Rationale:
- Molecular graphs map naturally to relational logic (atoms as entities, bonds as relations)
- Functional groups are inherently relational (e.g., "carbonyl is a carbon double-bonded to oxygen")
- Enables automatic differentiation through logical rules
- Provides built-in visualization of learned weights
Trade-off: Requires Java runtime; steeper learning curve for users unfamiliar with logic programming.
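To make the "atoms as entities, bonds as relations" mapping concrete, here is a minimal sketch in plain Python (deliberately not the PyNeuraLogic API): a molecule as relational facts, and the carbonyl example from above expressed as a rule over those facts.

```python
# Illustrative sketch only: relational facts for formaldehyde (H2C=O).
# Atoms are entities (id -> element); bonds are relations (a, b, order).
atoms = {0: "C", 1: "O", 2: "H", 3: "H"}
bonds = {(0, 1, 2), (0, 2, 1), (0, 3, 1)}

def carbonyl(c, o):
    """carbonyl(C, O) <= atom(C, 'C'), atom(O, 'O'), bond(C, O, order=2)."""
    return atoms.get(c) == "C" and atoms.get(o) == "O" and (c, o, 2) in bonds

# Derive every carbonyl group by querying the rule over all entity pairs.
matches = [(c, o) for c in atoms for o in atoms if carbonyl(c, o)]
print(matches)  # [(0, 1)]
```

In PyNeuraLogic the same pattern becomes a weighted logical rule, which is what makes the learned weights differentiable and visualizable.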
Decision: Provide three architecture types (BARE, CCE, CCD) for knowledge base integration.
Rationale: Different research questions require different strategies:
- BARE: Establishes baselines; measures KB contribution independently
- CCE: Tests whether chemical priors improve learning
- CCD: Tests whether learned representations align with chemical concepts
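One schematic reading of the three strategies, inferred from the rationale above (the composition and function shapes below are illustrative stand-ins, not the project's API): BARE uses the GNN alone, CCE lets the KB shape what the GNN learns from, and CCD applies the KB to what the GNN produced.

```python
# Toy stand-ins: "gnn" and "kb" are plain callables here, not real models.
def bare(gnn, x):
    return gnn(x)          # GNN alone: the no-KB baseline

def cce(gnn, kb, x):
    return gnn(kb(x))      # KB enriches the input: do chemical priors aid learning?

def ccd(gnn, kb, x):
    return kb(gnn(x))      # KB reads GNN output: do representations match concepts?

double = lambda x: 2 * x   # toy "GNN"
inc = lambda x: x + 1      # toy "KB"
print(bare(double, 3), cce(double, inc, 3), ccd(double, inc, 3))  # 6 8 7
```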
Decision: Split the knowledge base into two independent components: chemical rules and subgraph patterns.
Rationale: This separation enables:
- Using only chemical knowledge (interpretability studies)
- Using only structural patterns (architecture comparisons)
- Combining both (maximum expressiveness)
- Independent ablation studies
Decision: Option to constrain all KB weights to dimension 1.
Rationale: With scalar weights, each functional group's contribution becomes a single interpretable number (e.g., "nitro contributes +0.3 to toxicity, hydroxyl contributes -0.1").
Trade-off: Reduced model capacity; use for interpretation, not maximum performance.
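A toy illustration of the scalar-weight regime: with every KB weight held at dimension 1, a prediction decomposes into one number per pattern. The weights below are invented for illustration, not learned values.

```python
# Hypothetical learned scalar weights, one per functional group.
contributions = {"nitro": 0.3, "hydroxyl": -0.1}

def toxicity_logit(groups_present):
    # The readout is a plain sum of scalar per-group contributions,
    # which is exactly what makes the model directly interpretable.
    return sum(contributions[g] for g in groups_present)

print(round(toxicity_logit({"nitro", "hydroxyl"}), 2))  # 0.2
```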
Decision: Datasets define their own atom/bond vocabularies and mappings.
Rationale: Different sources use different conventions (TUD datasets include explicit hydrogens; SMILES strings may leave them implicit). Encapsulating these details keeps the rest of the system data-source agnostic.
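A minimal sketch of this encapsulation; the class and mappings below are hypothetical stand-ins for what a dataset would define.

```python
# Each dataset owns the mapping from its raw atom labels to one shared
# canonical symbol set, so downstream code never sees source conventions.
class AtomVocabulary:
    def __init__(self, mapping):
        self.mapping = mapping  # raw label -> canonical element symbol

    def encode(self, raw_label):
        return self.mapping[raw_label]

# TUD-style graphs label atoms with integer node types...
tud_vocab = AtomVocabulary({0: "C", 1: "N", 2: "O", 3: "H"})
# ...while SMILES-derived graphs already carry element symbols.
smiles_vocab = AtomVocabulary({"C": "C", "N": "N", "O": "O"})

assert tud_vocab.encode(2) == smiles_vocab.encode("O") == "O"
```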
```
┌─────────────────────────────────────────────────────────────────────┐
│ Pipeline │
│ Orchestrates training, evaluation, and inference │
└─────────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌─────────────────────────────┐
│ Dataset │ │ Model │ │ Knowledge Base │
│ │ │ │ │ │
│ - Load data │ │ - GNN layers │ │ ┌─────────┐ ┌────────────┐ │
│ - Define │ │ - Message │ │ │Chemical │ │ Subgraph │ │
│ vocabulary │ │ passing │ │ │ Rules │ │ Patterns │ │
│ - Create │ │ - Pooling │ │ └─────────┘ └────────────┘ │
│ template │ │ │ │ │
└──────────────┘ └──────────────┘ └─────────────────────────────┘
│ │ │
└────────────────────┴──────────────────────┘
│
▼
┌──────────────┐
│ ChemTemplate │
│ Base class │
│ for logical │
│ rule sets │
└──────────────┘
```
To add a new model:
- Create a class extending `Model`
- Implement `build_layer()` to define message passing
- Register in `MODEL_REGISTRY` (`models/models.py`)
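The extension steps above can be sketched as follows; `Model`, `build_layer()`, and `MODEL_REGISTRY` are the names from the text, but the bodies here are illustrative stand-ins, not the project's implementation.

```python
# Stand-in registry; the real one lives in models/models.py.
MODEL_REGISTRY = {}

def register(name):
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

class Model:
    def build_layer(self, layer_index):
        raise NotImplementedError

@register("my_gnn")
class MyGNN(Model):
    def build_layer(self, layer_index):
        # Return this layer's message-passing rules (placeholder here).
        return f"rules for layer {layer_index}"

model = MODEL_REGISTRY["my_gnn"]()
print(model.build_layer(0))  # rules for layer 0
```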
To add new chemical rules:
- Create a class extending `KnowledgeBase`
- Define patterns in `create_template()` using relational predicates
- Wire into `get_chem_rules()` (`knowledge_base/chemrules.py`)
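A sketch of the same steps for chemical rules; `KnowledgeBase`, `create_template()`, and `get_chem_rules()` follow the text, while the class bodies and the Prolog-style rule string are stand-ins for illustration.

```python
class KnowledgeBase:
    def create_template(self):
        raise NotImplementedError

class HydroxylRules(KnowledgeBase):
    def create_template(self):
        # Relational predicate, written as a Prolog-style string for
        # illustration: hydroxyl(O) holds when an oxygen is single-bonded
        # to a hydrogen.
        return ["hydroxyl(O) <= atom(O, o), atom(H, h), bond(O, H, single)"]

def get_chem_rules(sources=()):
    # Stand-in for the wiring in knowledge_base/chemrules.py: collect the
    # templates of every registered rule source into one rule set.
    rules = []
    for source in sources:
        rules.extend(source.create_template())
    return rules

print(get_chem_rules([HydroxylRules()]))
```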
To add new subgraph patterns:
- Create a class extending `KnowledgeBase`
- Define structural patterns in `create_template()`
- Wire into `get_subgraphs()` (`knowledge_base/subgraphs.py`)
To add a new dataset:
- For standard formats: extend `Dataset` and implement `load_data()`
- For SMILES data: use `SmilesDataset` directly
- Register in `DATASET_CLASSES` if reusable (`datasets/datasets.py`)
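The dataset route can be sketched the same way; `Dataset`, `load_data()`, and `DATASET_CLASSES` are the names from the text, and the loader body is a stand-in.

```python
# Stand-in registry; the real one lives in datasets/datasets.py.
DATASET_CLASSES = {}

class Dataset:
    def load_data(self):
        raise NotImplementedError

class MyTUDataset(Dataset):
    def load_data(self):
        # Parse the source format into (graph, label) pairs here; this
        # stub returns one carbonyl-containing toy graph.
        return [({"atoms": {0: "C", 1: "O"}, "bonds": {(0, 1, 2)}}, 1)]

# Register by name so experiments can refer to the dataset reusably.
DATASET_CLASSES["my_tud"] = MyTUDataset

samples = DATASET_CLASSES["my_tud"]().load_data()
print(len(samples))  # 1
```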
Known limitations:
- Implicit hydrogens: some functional group patterns assume explicit hydrogens, so they may miss matches on implicit-hydrogen inputs
- Single-task only: no multi-task or transfer learning support
- Binary classification and regression only: no multi-class classification
Future directions:
- Generation tasks (molecular design)
- Multi-task learning
- Attention visualization over functional groups