INDRA BERT: Statement Extraction using Fine-tuned BERT Models

A comprehensive system for extracting structured biological statements from text using fine-tuned BERT models, designed to integrate with the INDRA knowledge assembly platform.

Overview

INDRA BERT is a three-stage pipeline that extracts structured biological statements from scientific text:

Named Entity Recognition (NER): Identifies biological entities (proteins, genes, etc.)
Statement Classification: Determines the type of relationship between entity pairs
Role Assignment: Assigns specific roles (subject, object, enzyme, etc.) to entities within statements

The system outputs structured statements in INDRA-compatible format, enabling integration with the broader INDRA ecosystem for knowledge assembly and reasoning.

Features

Multi-stage Pipeline: Combines NER, classification, and role assignment for comprehensive statement extraction
Batch Processing: Efficient processing of multiple texts with optimized batching
INDRA Integration: Native compatibility with INDRA statement format and processing pipeline
Pre-trained Models: Ready-to-use models available on Hugging Face Hub
Configurable Confidence: Adjustable confidence thresholds for quality control
Multiple Tokenization: Support for NLTK and spaCy sentence tokenization

Architecture

Core Components

IndraStructuredExtractor: Main pipeline class that orchestrates the three-stage extraction process
AgentNERModel: Named entity recognition for biological entities
IndraStmtClassifier: Statement type classification using custom BERT head
IndraAgentsTagger: Role assignment for entities within statements

Model Pipeline

Text Input → Sentence Tokenization → NER → Entity Pairing → Statement Classification → Role Assignment → Structured Output

Installation

pip install indra-bert

Dependencies

Python >= 3.9
PyTorch
Transformers
NLTK
NumPy
scikit-learn
tqdm

Quick Start

Basic Usage

from indra_bert import IndraStructuredExtractor

# Initialize the extractor with pre-trained models
extractor = IndraStructuredExtractor(
    ner_model_path="thomaslim6793/indra_bert_ner_agent_detection",
    stmt_model_path="thomaslim6793/indra_bert_indra_stmt_classifier",
    role_model_path="thomaslim6793/indra_bert_indra_stmt_agents_role_assigner",
    stmt_conf_threshold=0.95
)

# Extract statements from text
text = "AKT phosphorylates BAD at Ser136."
statements = extractor.extract_structured_statements(text)

# Get INDRA-compatible JSON format
indra_statements = extractor.get_json_indra_stmts(text)

Batch Processing

# Process multiple texts efficiently
texts = [
    "AKT phosphorylates BAD at Ser136.",
    "EGFR activates MAPK signaling pathway.",
    "p53 inhibits cell proliferation."
]

# Batch extraction
batch_statements = extractor.extract_structured_statements_batch(texts)
indra_batch = extractor.get_json_indra_stmts_batch(texts)

Model Details

Pre-trained Models

The system uses three fine-tuned BERT models:

NER Model (thomaslim6793/indra_bert_ner_agent_detection)
- Token classification for biological entity detection
- BIO tagging scheme for entity spans
Statement Classifier (thomaslim6793/indra_bert_indra_stmt_classifier)
- Custom classification head for statement types
- Supports various INDRA statement types (Phosphorylation, Activation, etc.)
Role Assigner (thomaslim6793/indra_bert_indra_stmt_agents_role_assigner)
- Token classification for role assignment
- Assigns specific roles to entities within statements

Statement Types

The system recognizes various biological statement types including:

Phosphorylation
Dephosphorylation
Activation
Inhibition
Binding
And more...

Output Format

Structured Statements

{
    'original_text': 'AKT phosphorylates BAD at Ser136.',
    'entity_pair': [
        {'text': 'AKT', 'start': 0, 'end': 3, 'label': 'PROTEIN'},
        {'text': 'BAD', 'start': 16, 'end': 19, 'label': 'PROTEIN'}
    ],
    'stmt_pred': {
        'label': 'Phosphorylation',
        'confidence': 0.98,
        'raw_output': {...}
    },
    'role_pred': {
        'roles': [
            {'role': 'enz', 'text': 'AKT', 'start': 0, 'end': 3},
            {'role': 'sub', 'text': 'BAD', 'start': 16, 'end': 19}
        ],
        'raw_output': {...}
    }
}

INDRA JSON Format

{
    "type": "Phosphorylation",
    "enz": {
        "name": "AKT",
        "db_refs": {"TEXT": "AKT"}
    },
    "sub": {
        "name": "BAD", 
        "db_refs": {"TEXT": "BAD"}
    },
    "evidence": [{
        "source_api": "indra_bert",
        "text": "AKT phosphorylates BAD at Ser136.",
        "annotations": {
            "agents": {
                "raw_text": ["AKT", "BAD"],
                "coords": [[0, 3], [16, 19]]
            }
        }
    }]
}

Training

Training Individual Models

Each component can be trained independently:

# Train NER model
cd indra_bert/ner_agent_detector
python train.py --train_data path/to/train.json --val_data path/to/val.json

# Train statement classifier
cd indra_bert/indra_stmt_classifier
python train.py --train_data path/to/train.json --val_data path/to/val.json

# Train role assigner
cd indra_bert/indra_agent_role_assigner
python train.py --train_data path/to/train.json --val_data path/to/val.json

Data Format

Training data should be in JSON format with appropriate annotations for each model type. See the individual training scripts for detailed format specifications.

Evaluation

The project includes benchmark notebooks and evaluation scripts:

benchmark.ipynb: Performance evaluation on test datasets
experiment.ipynb: Experimental analysis and comparison
stmt_classifier_basic_benchmark.py: Basic benchmarking script

Integration with INDRA

INDRA BERT is designed to integrate seamlessly with the INDRA platform:

from indra.sources.indra_bert.api import process_texts

# Process texts and get INDRA statements
texts = ["AKT phosphorylates BAD at Ser136."]
results = process_texts(texts, stmt_conf_threshold=0.95)

# Access INDRA statement objects
for result in results:
    for stmt in result.statements:
        print(f"Statement type: {stmt.__class__.__name__}")
        print(f"Agents: {stmt.agent_list()}")

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the BSD-2 License - see the LICENSE file for details.

Citation

If you use INDRA BERT in your research, please cite:

@software{indra_bert,
  title={INDRA BERT: Statement Extraction using Fine-tuned BERT Models},
  author={Lim, Thomas and Gyori, Benjamin M.},
  year={2025},
  url={https://github.com/gyorilab/indra_bert}
}

Acknowledgments

This work builds upon the INDRA platform and leverages state-of-the-art transformer models for biological text mining.

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
indra_bert		indra_bert
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

INDRA BERT: Statement Extraction using Fine-tuned BERT Models

Overview

Features

Architecture

Core Components

Model Pipeline

Installation

Dependencies

Quick Start

Basic Usage

Batch Processing

Model Details

Pre-trained Models

Statement Types

Output Format

Structured Statements

INDRA JSON Format

Training

Training Individual Models

Data Format

Evaluation

Integration with INDRA

Contributing

License

Citation

Acknowledgments

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

gyorilab/indra_bert

Folders and files

Latest commit

History

Repository files navigation

INDRA BERT: Statement Extraction using Fine-tuned BERT Models

Overview

Features

Architecture

Core Components

Model Pipeline

Installation

Dependencies

Quick Start

Basic Usage

Batch Processing

Model Details

Pre-trained Models

Statement Types

Output Format

Structured Statements

INDRA JSON Format

Training

Training Individual Models

Data Format

Evaluation

Integration with INDRA

Contributing

License

Citation

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages