Feature Description
Add an unstructured text processor: a module that processes unstructured textual documents and maps them into the existing Semantic Layer within intugle/data-tools.
This involves two key steps:
- RDF Generation – Convert unstructured or semi-structured textual content (already parsed through OCR or other text extraction models) into an RDF or RDF★ (RDF-star) graph representation capturing entities, attributes, and relationships.
- Semantic Mapping – Align or overlay the generated RDF graph with the existing Semantic Model in intugle, allowing automatic linkage of extracted entities, relationships, and concepts to known semantic nodes, business terms, or ontologies.
The goal is to extend intugle’s reach beyond structured data sources to unstructured textual sources, for semantic enrichment, discovery, and integration.
Problem Statement
Organizations often have vast amounts of textual information — contracts, reports, forms, invoices, research notes — that contain valuable business context. While OCR and text extraction tools can provide raw text, they do not capture meaning, relationships, or contextual structure.
This feature will enable users to:
- Convert extracted text into structured semantic triples (RDF/RDF★).
- Integrate textual insights directly into the Semantic Model.
- Query and reason uniformly across structured and unstructured sources.
Without this capability, users rely on ad-hoc NLP scripts or manual tagging processes that don’t integrate with the semantic layer or metadata framework of intugle/data-tools.
Proposed Solution
Introduce a two-stage pipeline inside intugle/data-tools:
Step 1 – RDF/RDF★ Extraction
- Input: Text blocks or page-wise extracted text from OCR or NLP pipelines.
- Processing:
- Entity recognition and classification (NER).
- Relationship extraction and contextual linking.
- Optional co-reference resolution to unify mentions.
- Output: RDF or RDF★ triples representing subjects, predicates, and objects.
- Example RDF triples:
(Invoice_123, hasAmount, 5400)
(Invoice_123, issuedBy, Vendor_A)
- RDF★ allows annotation of triples with metadata (e.g., provenance, confidence).
- Configuration Options:
- Support for multiple text parsers or NLP backends (spaCy, Hugging Face, etc.).
- Pluggable ontology templates for domain-specific schemas.
- Control over granularity (sentence-level vs. document-level triples).
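A minimal sketch of Step 1, assuming a pluggable extractor interface. Every name here (Triple, Extractor, RegexInvoiceExtractor, to_turtle_star) is hypothetical and not part of intugle. The two regular expressions stand in for a real NER and relation-extraction backend such as spaCy, and the confidence value is a placeholder rather than a model score.

```python
import re
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Triple:
    """One (subject, predicate, object) statement plus RDF-star style annotations."""
    subject: str
    predicate: str
    obj: str
    annotations: dict = field(default_factory=dict)  # e.g. {"confidence": 0.9}


class Extractor(Protocol):
    """Hypothetical backend interface; a spaCy or Hugging Face adapter could implement it."""

    def extract(self, text: str, source: str) -> list[Triple]: ...


class RegexInvoiceExtractor:
    """Toy backend: two hand-written patterns instead of NER and relation extraction."""

    ISSUED_BY = re.compile(r"Invoice (\d+) was issued by (Vendor \w+)")
    AMOUNT = re.compile(r"Invoice (\d+)\b.*?amount of \$([\d,]+)")

    def extract(self, text: str, source: str) -> list[Triple]:
        triples = []
        for m in self.ISSUED_BY.finditer(text):
            triples.append(Triple(f"Invoice_{m.group(1)}", "issuedBy",
                                  m.group(2).replace(" ", "_"),
                                  {"confidence": 0.9, "source": source}))
        for m in self.AMOUNT.finditer(text):
            triples.append(Triple(f"Invoice_{m.group(1)}", "hasAmount",
                                  m.group(2).replace(",", ""),
                                  {"confidence": 0.9, "source": source}))
        return triples


def to_turtle_star(t: Triple) -> list[str]:
    """Render the triple, then one statement per annotation about the quoted triple."""
    obj = t.obj if t.obj.isdigit() else f":{t.obj}"  # digits become a numeric literal
    base = f":{t.subject} :{t.predicate} {obj}"
    lines = [f"{base} ."]
    for key, value in t.annotations.items():
        literal = value if isinstance(value, float) else f'"{value}"'
        lines.append(f"<< {base} >> :{key} {literal} .")
    return lines


text = "Invoice 123 was issued by Vendor A on March 4, 2024 for an amount of $5,400."
for triple in RegexInvoiceExtractor().extract(text, source="page-1"):
    print("\n".join(to_turtle_star(triple)))
```

Keeping the backend behind a small interface is one way to support the multiple NLP backends listed above; the serializer only depends on Triple, not on how the triples were produced.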
Step 2 – Semantic Mapping
- Input: RDF or RDF★ graph generated from Step 1.
- Processing:
- Match entities and relationships to existing semantic nodes in the model.
- Auto-create or suggest new nodes when unmapped entities are found.
- Allow configurable thresholds for matching (string similarity, embeddings).
- Output:
- Enhanced Semantic Model with linked or new concepts integrated.
- Visualization of new connections and confidence scores.
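The matching step could look roughly like the sketch below, which uses plain string similarity from Python's standard library. The function names and the node list are hypothetical; an embedding-based matcher would replace similarity() while keeping the same threshold logic.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1], treating underscores as spaces."""
    def norm(s: str) -> str:
        return s.replace("_", " ").lower()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def map_entities(entities, known_nodes, match_threshold=0.85):
    """Link each extracted entity to its best-matching known node.

    Returns {entity: (node_or_None, score)}. A None node means no candidate
    reached the threshold, so a new node should be suggested instead.
    """
    mapping = {}
    for entity in entities:
        best = max(known_nodes, key=lambda n: similarity(entity, n), default=None)
        score = similarity(entity, best) if best is not None else 0.0
        mapping[entity] = (best, score) if score >= match_threshold else (None, score)
    return mapping


# Hypothetical node labels from an existing semantic model.
known_nodes = ["Vendor A", "Invoice", "Customer"]
mapping = map_entities(["Vendor_A", "Warehouse_9"], known_nodes)
for entity, (node, score) in mapping.items():
    action = f"link to '{node}'" if node else "suggest new node"
    print(f"{entity}: {action} (score {score:.2f})")
```

Returning the score alongside each decision keeps the confidence available for the visualization mentioned in the output list.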
Integration Points
- Can be exposed as:
- A Python API (intugle.semantic.unstructured or similar).
- A CLI command (intugle text-to-semantic ...).
- Output compatible with existing semantic features (SemanticModel, DataProduct, etc.).
- Optional storage/export of RDF graphs as Turtle, JSON-LD, or Parquet for downstream analytics.
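As an illustration of the export option, the sketch below serializes triples to JSON-LD and to N-Triples (a line-based subset of Turtle) using only the standard library. The function names, the four-field tuple layout, and the http://example.org/ namespace are assumptions made for this example; a real implementation would more likely delegate to an RDF library.

```python
import json

EX = "http://example.org/"  # placeholder namespace, not an intugle IRI


def to_jsonld(triples):
    """Group (subject, predicate, object, is_literal) tuples into JSON-LD nodes."""
    nodes = {}
    for subject, predicate, obj, is_literal in triples:
        node = nodes.setdefault(subject, {"@id": EX + subject})
        value = obj if is_literal else {"@id": EX + obj}
        node.setdefault(predicate, []).append(value)
    return {"@context": {"@vocab": EX}, "@graph": list(nodes.values())}


def to_ntriples(triples):
    """One statement per line; literals are written as plain strings for simplicity."""
    lines = []
    for subject, predicate, obj, is_literal in triples:
        term = json.dumps(str(obj)) if is_literal else f"<{EX}{obj}>"
        lines.append(f"<{EX}{subject}> <{EX}{predicate}> {term} .")
    return "\n".join(lines)


triples = [
    ("Invoice_123", "hasAmount", 5400, True),
    ("Invoice_123", "issuedBy", "Vendor_A", False),
]
print(json.dumps(to_jsonld(triples), indent=2))
print(to_ntriples(triples))
```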
Use Case
Domain: Cross-domain (Finance, Retail, Healthcare, Research)
Workflow:
- OCR pipeline extracts text from scanned reports or documents. (Assumed to exist already; building it is out of scope.)
- Text blocks are passed to the new text_to_semantic() function.
- RDF triples are generated from named entities and relationships in text.
- RDF is then mapped into the active Semantic Model:
- Linking new entities (e.g., organizations, people, identifiers) to known nodes.
- Enriching the model with contextual knowledge derived from the document.
- Analysts can now run semantic queries or discover insights across both tabular and text-based datasets.
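To make the last workflow step concrete, here is a small sketch of a query that spans both kinds of source once their triples sit in one graph. The match() helper and the sample data (including the locatedIn fact) are invented for illustration; a real deployment would query through the semantic layer or SPARQL instead.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


# Triples derived from a table row (structured source); illustrative data.
tabular = {("Vendor_A", "locatedIn", "Berlin")}
# Triples derived from document text (unstructured source).
textual = {("Invoice_123", "issuedBy", "Vendor_A"),
           ("Invoice_123", "hasAmount", "5400")}

graph = tabular | textual

# Where is the issuer of Invoice_123 located? The first hop uses a text-derived
# triple, the second hop a table-derived one.
issuers = [o for _, _, o in match(graph, subject="Invoice_123", predicate="issuedBy")]
cities = [o for v in issuers for _, _, o in match(graph, subject=v, predicate="locatedIn")]
print(cities)
```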
Alternative Solutions
- Standalone NLP pipelines (spaCy, CoreNLP, LlamaIndex) – useful for extraction, but lack semantic integration or model awareness.
- Knowledge graph builders (Neo4j, GraphDB) – handle graph data well (GraphDB stores RDF natively; Neo4j is a property graph database) but are not integrated with intugle’s semantic and data product ecosystem.
- Manual ontology tagging – time-consuming and error-prone.
The proposed feature unifies text understanding and semantic modeling within a single consistent framework.
Examples
# Proposed usage (illustrative; this API is not implemented yet)
from intugle import TextToSemanticProcessor, SemanticModel

# Step 1: Convert unstructured text to RDF triples
text_input = """
Invoice 123 was issued by Vendor A on March 4, 2024 for an amount of $5,400.
"""
processor = TextToSemanticProcessor(model="en_core_web_lg", output_format="rdf_star")
rdf_graph = processor.parse(text_input)

# Step 2: Map RDF to the existing semantic model
semantic_model = SemanticModel.load("finance_semantic_model.json")
# match_threshold: minimum similarity required to link an entity to a known node
semantic_model.overlay(rdf_graph, match_threshold=0.85)
semantic_model.save("finance_semantic_model_enriched.json")