CausalKnowledgeTrace: Interactive Literature-Based Causal Structure Mapping, Graph Generation, Visualization, and Refinement
CausalKnowledgeTrace (CKT) helps researchers build causal knowledge graphs from published biomedical literature. The system automatically extracts and organizes causal relationships between biological concepts (genes, proteins, diseases, drugs, etc.) to support hypothesis generation and study design in observational research.
Users specify an exposure and outcome of interest. They can constrain the search by publication year, causal predicate type, and minimum number of supporting articles per relationship. CKT constructs initial partially directed acyclic graphs (PDAGs) representing causal structures between biomedical concepts. Users then edit these graphs interactively to remove unnecessary nodes and edges. CKT can export graphs and evidence from the literature for downstream analysis. A user manual is available at this link.
CKT queries SemMedDB, a database containing subject-predicate-object triples (e.g., "Smoking CAUSES Lung Cancer") extracted from 37+ million PubMed titles and abstracts using the SemRep natural language processing system. Each relationship is linked to its supporting literature, allowing users to trace claims back to primary evidence.
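As a rough illustration of how triples stay linked to their supporting literature, the sketch below groups extracted assertions by triple and collects the articles behind each one. The `Predication` type, its field names, and the placeholder article IDs are assumptions made for this example; they are not SemMedDB's actual schema or CKT's internal representation.

```python
from collections import defaultdict
from typing import NamedTuple


class Predication(NamedTuple):
    """One extracted assertion. Field names are illustrative only."""
    subject: str
    predicate: str
    obj: str
    pmid: str  # placeholder article identifier


def group_evidence(rows):
    """Map each (subject, predicate, object) triple to its sorted supporting PMIDs."""
    evidence = defaultdict(set)
    for row in rows:
        evidence[(row.subject, row.predicate, row.obj)].add(row.pmid)
    return {triple: sorted(pmids) for triple, pmids in evidence.items()}


rows = [
    Predication("Smoking", "CAUSES", "Lung Cancer", "PMID-A"),
    Predication("Smoking", "CAUSES", "Lung Cancer", "PMID-B"),
    Predication("Smoking", "CAUSES", "Lung Cancer", "PMID-A"),  # same article twice
    Predication("Asbestos", "CAUSES", "Lung Cancer", "PMID-C"),
]
evidence = group_evidence(rows)
for triple, pmids in evidence.items():
    print(triple, "supported by", len(pmids), "article(s)")
```

Using a set per triple means that several sentences from the same article count as one supporting article.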
CKT consists of two integrated components:
- Python engine: Implements graph construction and causal structure learning algorithms, and exports results for statistical analysis
- Shiny web application: Provides interactive visualization, parameter configuration, and iterative graph refinement
- Query Configuration: Users specify an exposure and outcome of interest using UMLS (Unified Medical Language System) identifiers or free text search. Configurable parameters include publication year range, causal predicate types (CAUSES, INHIBITS, STIMULATES, PREVENTS, DISRUPTS), minimum article support thresholds, and degrees of separation (currently limited to 3 degrees between exposure and outcome).
- Graph Construction: CKT builds initial partially directed acyclic graphs (PDAGs) representing potential causal pathways connecting the exposure to the outcome. Edge directions are inferred from temporal precedence, biological plausibility, and semantic predicate types extracted from the literature.
- Interactive Refinement: Users iteratively remove spurious associations, biologically implausible relationships, or irrelevant variables through the web interface. This step incorporates domain expertise to improve graph quality.
- Export and Documentation: Refined graphs and supporting evidence (PubMed IDs, semantic predicates, citation counts) are exported for downstream causal analysis and documentation.
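The query parameters in this workflow (year range, predicate types, minimum article support) can be pictured as a filter over extracted assertions. The following is a minimal sketch under stated assumptions: `QueryConfig`, `filter_relationships`, and the record layout are hypothetical names invented for this example, and the records use abstract placeholder concepts, not real literature findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical container for the query parameters described above."""
    min_year: int
    max_year: int
    predicates: frozenset
    min_articles: int


def filter_relationships(records, config):
    """Return {triple: article count} for triples that pass every filter.

    `records` is an iterable of (subject, predicate, object, pmid, year) tuples.
    """
    support = {}
    for subject, predicate, obj, pmid, year in records:
        if predicate not in config.predicates:
            continue  # predicate type not selected
        if not (config.min_year <= year <= config.max_year):
            continue  # outside the publication year range
        support.setdefault((subject, predicate, obj), set()).add(pmid)
    return {
        triple: len(pmids)
        for triple, pmids in support.items()
        if len(pmids) >= config.min_articles
    }


records = [
    ("A", "CAUSES", "B", "p1", 2005),
    ("A", "CAUSES", "B", "p2", 2012),
    ("A", "CAUSES", "B", "p3", 1990),    # dropped: before min_year
    ("B", "AFFECTS", "C", "p4", 2010),   # dropped: predicate not selected
    ("B", "PREVENTS", "C", "p5", 2010),  # dropped later: only one article
]
config = QueryConfig(2000, 2020, frozenset({"CAUSES", "PREVENTS"}), 2)
kept = filter_relationships(records, config)
print(kept)
```

Counting distinct articles after the year and predicate filters means the support threshold applies only to evidence that falls inside the requested window.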
In development
The furtherAnalysis module performs systematic causal variable classification to support rigorous epidemiological analysis. Tools in this module:
- Classify variables as confounders, mediators, or colliders relative to the exposure-outcome relationship
- Apply graph traversal algorithms to retain variables within the causal vicinity while removing extraneous nodes
- Compute minimal sufficient adjustment sets satisfying the back-door criterion for unbiased causal effect estimation
- Identify adjustment strategies that block confounding paths while avoiding collider bias, M-bias, and butterfly bias
- Suggest, from a user-supplied list of measured variables, the best-matching minimal sufficient adjustment sets, which may include proxy confounders
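To make the variable roles concrete, here is a minimal sketch that labels nodes of a small DAG using simplified ancestor and descendant rules. The rules are assumptions chosen for illustration (a confounder is an ancestor of the exposure that also reaches the outcome without passing through the exposure; a mediator lies on a directed path from exposure to outcome; a collider is a common effect of both). They are not necessarily the definitions the furtherAnalysis module implements, and the function names are hypothetical.

```python
def reachable(graph, start, blocked=None):
    """Nodes reachable from `start` along directed edges, never entering `blocked`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt != blocked and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def classify(graph, exposure, outcome):
    """Label every other node as confounder, mediator, collider, or other."""
    rev = reverse(graph)
    causes_x = reachable(rev, exposure)
    causes_y_avoiding_x = reachable(rev, outcome, blocked=exposure)
    causes_y = reachable(rev, outcome)
    effects_x = reachable(graph, exposure)
    effects_x_avoiding_y = reachable(graph, exposure, blocked=outcome)
    effects_y = reachable(graph, outcome)

    roles = {}
    for node in sorted((set(graph) | set(rev)) - {exposure, outcome}):
        if node in effects_x and node in causes_y:
            roles[node] = "mediator"
        elif node in causes_x and node in causes_y_avoiding_x:
            roles[node] = "confounder"
        elif node in effects_x_avoiding_y and node in effects_y:
            roles[node] = "collider"
        else:
            roles[node] = "other"
    return roles


# Toy DAG: Z -> X, Z -> Y, X -> M -> Y, X -> C <- Y, Y -> D
graph = {"Z": {"X", "Y"}, "X": {"M", "C"}, "M": {"Y"}, "Y": {"C", "D"}}
roles = classify(graph, "X", "Y")
print(roles)
```

The `blocked` argument keeps a pure downstream effect of the outcome (here `D`) from being mistaken for a collider, and keeps a cause that reaches the outcome only through the exposure from being counted as a confounder.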
The advanced analysis tools currently function on small example graphs but encounter computational challenges on literature-derived graphs due to:
- Cyclic relationships: Extracted literature relationships may contain feedback loops that violate the acyclic assumption required for standard causal inference algorithms. Biological systems often exhibit genuine bidirectional causation (e.g., inflammation causes oxidative stress, which further exacerbates inflammation).
- Markov equivalence classes: Many edge orientations in literature-derived graphs are ambiguous, resulting in equivalence classes of graphs that encode identical conditional independence relationships but different causal interpretations. The number of candidate orientations grows exponentially (up to 2^k for k ambiguous edges), making exhaustive computation intractable for large graphs.
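The exponential growth can be seen in a brute-force sketch that tries every orientation of the ambiguous edges and keeps the acyclic ones. This is only an illustration of the search-space size: it checks acyclicity alone, whereas membership in a Markov equivalence class also constrains v-structures, and the function names are hypothetical.

```python
from itertools import product


def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed edge list contains no cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)


def acyclic_orientations(nodes, directed, undirected):
    """Yield each orientation of the undirected edges that keeps the graph acyclic."""
    for flips in product((False, True), repeat=len(undirected)):
        oriented = [
            (b, a) if flip else (a, b)
            for (a, b), flip in zip(undirected, flips)
        ]
        candidate = list(directed) + oriented
        if is_acyclic(nodes, candidate):
            yield candidate


nodes = ["A", "B", "C"]
undirected = [("A", "B"), ("B", "C"), ("A", "C")]
valid = list(acyclic_orientations(nodes, [], undirected))
print(len(valid), "acyclic orientations out of", 2 ** len(undirected))
```

With three ambiguous edges the loop already visits 8 candidates; each additional ambiguous edge doubles that number, which is why exhaustive enumeration does not scale to literature-derived graphs.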
Planned solutions include:
- Cycle detection and resolution: Implementing algorithms to identify feedback loops and apply domain-guided strategies for cycle breaking or collapsing cyclic components into latent variables
- Constraint-based orientation: Using temporal information, intervention evidence, and biological knowledge to reduce the equivalence class search space
- Approximate inference methods: Developing heuristic algorithms that identify near-optimal adjustment sets without exhaustive enumeration of all possible graph orientations
- User-guided disambiguation: Enabling interactive edge orientation based on expert knowledge to progressively reduce uncertainty
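As a sketch of the first planned solution, the code below finds groups of nodes that sit on a common feedback loop and collapses one such group into a single node. It uses a simple mutual-reachability test, which is fine for small graphs but is not an efficient algorithm; the function names, the merged-node label, and the "Exposure" and "Outcome" placeholders are assumptions for illustration, not CKT's implementation.

```python
def reachable(graph, start):
    """Nodes reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def cyclic_components(graph):
    """Return sorted groups of nodes that lie on a shared feedback loop."""
    nodes = set(graph) | {n for dsts in graph.values() for n in dsts}
    reach = {n: reachable(graph, n) for n in nodes}
    components, assigned = [], set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        # Two nodes share a component when each can reach the other.
        group = {m for m in reach[node] if node in reach[m]}
        assigned |= group
        if len(group) > 1 or node in graph.get(node, ()):
            components.append(sorted(group))
    return components


def collapse(graph, component, label):
    """Replace all nodes in `component` with one node named `label`."""
    members = set(component)
    collapsed = {}
    for src, dsts in graph.items():
        new_src = label if src in members else src
        for dst in dsts:
            new_dst = label if dst in members else dst
            if new_src != new_dst:  # drop edges internal to the component
                collapsed.setdefault(new_src, set()).add(new_dst)
    return collapsed


graph = {
    "Exposure": {"Inflammation"},
    "Inflammation": {"Oxidative stress"},
    "Oxidative stress": {"Inflammation", "Outcome"},
}
loops = cyclic_components(graph)
merged = collapse(graph, loops[0], "Inflammation/Oxidative stress")
print(loops)
print(merged)
```

Collapsing trades detail for acyclicity: the merged node can be treated as a single variable by downstream algorithms, at the cost of hiding the internal feedback structure.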
This framework supports rigorous causal inference from observational biomedical data by enabling:
- Systematic exploration of alternative causal hypotheses represented in published literature
- Identification of potential confounders requiring measurement and adjustment in epidemiological studies
- Sensitivity analyses examining how conclusions change under different assumptions about causal directionality
- Hypothesis generation for experimental validation of putative causal relationships
- Literature-based justification for variable selection in statistical models
- Interactive Visualization: Web-based DAG exploration with zoom, pan, and node interaction
- Graphical Causal Modeling: Automated assembly of causal relationships from biomedical literature, given UMLS (Unified Medical Language System) Concept Unique Identifiers (CUIs) for the exposure and outcome of interest
- Evidence Analysis: PMID-based evidence tracking and strength assessment
- Performance Optimized: Binary formats, caching, and vectorized operations for large graphs
- Configurable Analysis: Enter multiple CUIs for the exposure and/or outcome; examine 1st, 2nd, or 3rd degree relationships
- Multiple Formats: R DAG files, JSON assertions, optimized binary formats
- Interactive Network Visualization: Explore DAGs with zoom, pan, and node selection capabilities
- Dynamic Node Information: Click on nodes to see detailed information and evidence
- Physics Controls: Adjust network layout parameters in real-time
- Statistics Dashboard: View network statistics and node distributions
- Color-coded Categories: Three-category system (Exposure/Outcome/Other) with optimized performance
- Flexible Data Loading: Load DAG structures from generated files or upload custom R files
- Graph Configuration Interface: Configure parameters for knowledge graph generation
- Enhanced CUI Search: Searchable interface for medical concept selection with semantic type information
- Efficient Loading: Fast loading for large graphs
- Automated Knowledge Graph Generation: Create causal graphs from SemMedDB biomedical literature
- Multiple CUI Support: Handle multiple Concept Unique Identifiers for exposures and outcomes
- K-hop Analysis: Configurable relationship depth (1-3 hops) for comprehensive graph traversal
- Markov Blanket Analysis: Advanced causal inference with Markov blanket computation
- Blacklist Filtering: Filter out generic or unwanted concepts during graph creation
- Multiple Output Formats: Generate R DAG objects, JSON assertion files, and optimized binary formats
- Performance Monitoring: Detailed timing analysis and execution metrics
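The k-hop and blacklist features can be pictured as a bounded breadth-first search that skips unwanted concepts. The sketch below is an assumption-laden illustration: it ignores edge direction during traversal, and `k_hop_neighborhood` and the placeholder concept names are hypothetical, not the engine's actual traversal code.

```python
from collections import deque


def k_hop_neighborhood(edges, seeds, max_hops, blacklist=frozenset()):
    """Return {node: hop distance} for nodes within `max_hops` of any seed.

    Edges touching a blacklisted node are ignored, so such nodes are never
    visited and cannot act as bridges to other concepts.
    """
    neighbors = {}
    for src, dst in edges:
        if src in blacklist or dst in blacklist:
            continue
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


edges = [
    ("X", "A"), ("A", "B"), ("B", "C"), ("C", "D"),
    ("X", "Generic"), ("Generic", "E"),
]
within_two = k_hop_neighborhood(edges, ["X"], 2, blacklist={"Generic"})
print(within_two)
```

Passing several seeds supports multiple CUIs for one exposure or outcome, since every seed starts at distance 0.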
The project is organized into two main components with clear separation of concerns:
CausalKnowledgeTrace/
├── README.md                     # This documentation file
├── docker-compose.yaml           # Docker Compose configuration
├── run_app.R                     # Launch script for Shiny application
├── user_input.yaml               # Configuration file (generated by Shiny app)
├── .env                          # Database credentials (create from doc/sample.env)
│
├── docker/                       # Docker configuration files
├── doc/                          # Installation guides and setup files
│   ├── DOCKER_INSTALLATION.md    # Docker installation guide
│   ├── MANUAL_INSTALLATION.md    # Manual installation guide
│   ├── sample.env                # Sample environment variables (copy to .env)
│   ├── environment.yaml          # Conda environment specification
│   ├── requirements.txt          # Python dependencies
│   └── packages.R                # R package installation script
│
├── shiny_app/                    # Shiny Web Application Component
│   ├── app.R                     # Main Shiny application
│   ├── modules/                  # Modular UI/server components
│   ├── server/                   # Server-side logic
│   ├── ui/                       # UI components
│   └── utils/                    # Utility functions
│
├── graph_creation/               # Graph Creation Engine Component
│   ├── pushkin.py                # Main entry point
│   ├── cli_interface.py          # Command line interface
│   ├── analysis_core.py          # Core analysis classes
│   ├── database_operations.py    # Database queries
│   ├── graph_operations.py       # Graph construction
│   ├── example/                  # Example scripts
│   └── result/                   # Generated output files
│
└── furtherAnalysis/              # Advanced causal analysis tools (in development)
CausalKnowledgeTrace uses SemMedDB, a database derived from the UMLS Metathesaurus. A free UMLS license is required before installation.
Why is this required? CausalKnowledgeTrace extracts causal relationships from SemMedDB, which is derived from the UMLS (Unified Medical Language System) Metathesaurus maintained by the National Library of Medicine. The NLM requires users to obtain a free license to access UMLS-derived resources.
How to obtain your license:
- Visit the UMLS Metathesaurus License Agreement
- Create an account or sign in with existing credentials
- Complete the license application (takes ~5 minutes)
- Wait for approval (typically 1-2 business days)
- You'll receive confirmation via email
Installation note: You can complete software installation steps while waiting for license approval. However, you'll need your approved license before downloading the database.
- Disk Space: At least 50GB free (for database and dependencies)
- RAM: 8GB minimum, 16GB recommended
- Operating System: Linux, macOS, or Windows
Before proceeding with either installation method, complete these common steps:
Option A: Clone with Git (Recommended)
Git allows you to easily pull future updates to the project.
# Install Git if needed
# Linux: sudo apt-get install git
# macOS: brew install git
# Windows: https://git-scm.com/download/win
# Verify Git installation
git --version
# Should display: git version 2.x.x or higher
# Clone the repository
git clone git@github.com:unmtransinfo/CausalKnowledgeTrace.git
cd CausalKnowledgeTrace
# To get future updates later:
# git pull origin main
Option B: Download as ZIP
If you don't want to install Git:
- Download: Download ZIP from GitHub
- Extract the ZIP file
- Open terminal/command prompt and navigate to the extracted directory
Download the SemMedDB database backup file from OneDrive (requires UMLS license):
Download Link: causalehr_backup.tar.gz from OneDrive
Note: The file is approximately 25GB. Download may take several minutes depending on your internet connection. The file will typically download to your Downloads folder.
Move the downloaded file to the project directory and extract it:
# Navigate to the project directory
cd CausalKnowledgeTrace
# Move the downloaded file from Downloads folder to current directory
# On Linux/macOS:
mv ~/Downloads/causalehr_backup.tar.gz .
# On Windows (in Git Bash or PowerShell):
# mv ~/Downloads/causalehr_backup.tar.gz .
# Or simply drag and drop the file from Downloads to the CausalKnowledgeTrace folder
# Extract the backup file
tar -xzf causalehr_backup.tar.gz
# Verify the backup directory exists
ls -la causalehr_backup/
You should see multiple .dat.gz files and a toc.dat file in the causalehr_backup/ directory.
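If you prefer a scripted check over reading the directory listing, the following sketch tests for the two things mentioned above: a toc.dat file and at least one .dat.gz file. The helper name `check_backup` is hypothetical and not part of the project.

```python
from pathlib import Path


def check_backup(directory):
    """Return a list of problems found in an extracted backup directory."""
    root = Path(directory)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "toc.dat").is_file():
        problems.append("toc.dat is missing")
    if not any(root.glob("*.dat.gz")):
        problems.append("no .dat.gz files found")
    return problems


if __name__ == "__main__":
    issues = check_backup("causalehr_backup")
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("Backup directory looks complete.")
```

Run it from the project directory after extraction; an empty problem list only confirms that the expected files are present, not that their contents are intact.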
Both installation methods require setting up database credentials in a .env file. Here's a quick preview:
# Copy the sample environment file
cp doc/sample.env .env
# Edit with your credentials (detailed instructions in installation guides)
nano .env # or use your preferred editor
Note: Detailed instructions for configuring the .env file are provided in each installation guide below. You can complete this step now or during the installation process.
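Before starting the application, it can help to confirm that no credential in .env was left blank. The sketch below parses simple KEY=VALUE lines and reports empty or absent keys. The key names shown (DB_HOST, DB_USER, DB_PASSWORD, DB_NAME) are hypothetical placeholders; use the names that actually appear in doc/sample.env.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in required if not values.get(key)]


# Hypothetical file contents and key names, for illustration only.
sample = """
# Database settings
DB_HOST=localhost
DB_USER=
DB_PASSWORD='example-password'
"""
values = parse_env(sample)
missing = missing_keys(values, ["DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME"])
print("Missing or empty:", missing)
```

To check a real file, read it with `open(".env").read()` and pass the text to `parse_env`. This parser handles only plain KEY=VALUE lines and does not support multi-line values or variable expansion.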
Now that you have completed the common setup steps, choose your installation method:
Best for: Quick setup, testing, and most users
Time: ~20 minutes (including database restoration)
Prerequisites: Docker and Docker Compose only
Docker provides a containerized environment with all dependencies pre-configured. This is the fastest and easiest way to get started.
Complete Docker Installation Guide (doc/DOCKER_INSTALLATION.md)
Best for: Development, customization, and advanced users
Time: ~45 minutes
Prerequisites: PostgreSQL 16, Conda, Python 3.11, R 4.5.1
Manual installation gives you full control over the environment and is recommended for development and production deployments.
Complete Manual Installation Guide (doc/MANUAL_INSTALLATION.md)
For detailed usage instructions, see: CKT Usage Instructions
For troubleshooting help, please refer to the installation guide you used:
- Docker Installation: See Docker Troubleshooting
- Manual Installation: See Manual Troubleshooting
If you encounter issues not covered in the installation guides:
- Check the logs: The application outputs detailed error messages to the console
- GitHub Issues: Open an issue with:
- Your operating system and version
- Installation method (Docker or Manual)
- Error messages (copy the full text)
- Steps you've already tried
- Email support: Contact Scott Malec (SMalec@salud.unm.edu) or Rajesh Upadhayaya (RAJESHUPADHAYAYA@salud.unm.edu) to schedule a walk-through session