[FEATURE] Unstructured text processor

## Feature Description
Unstructured text processor – a module that processes unstructured textual documents and maps them into the existing **Semantic Layer** within `intugle/data-tools`.  
This involves two key steps:

1. **RDF Generation** – Convert unstructured or semi-structured textual content (already parsed through OCR or other text extraction models) into an RDF or RDF★ (RDF-star) graph representation capturing entities, attributes, and relationships.
2. **Semantic Mapping** – Align or overlay the generated RDF graph with the existing Semantic Model in `intugle`, allowing automatic linkage of extracted entities, relationships, and concepts to known semantic nodes, business terms, or ontologies.

The goal is to extend `intugle`’s ability to not just process structured data sources but also **unstructured textual sources** for semantic enrichment, discovery, and integration.

## Problem Statement
Organizations often have vast amounts of textual information — contracts, reports, forms, invoices, research notes — that contain valuable business context. While OCR and text extraction tools can provide raw text, they do not capture **meaning**, **relationships**, or **contextual structure**.

This feature will enable users to:
- Convert extracted text into structured semantic triples (RDF/RDF★).
- Integrate textual insights directly into the **Semantic Model**.
- Enable unified querying and reasoning across structured and unstructured sources.

Without this capability, users rely on ad-hoc NLP scripts or manual tagging processes that don’t integrate with the semantic layer or metadata framework of `intugle/data-tools`.

## Proposed Solution
Introduce a two-stage pipeline inside `intugle/data-tools`:

### Step 1 – RDF/RDF★ Extraction
- **Input**: Text blocks or page-wise extracted text from OCR or NLP pipelines.
- **Processing**:
  - Entity recognition and classification (NER).
  - Relationship extraction and contextual linking.
  - Optional co-reference resolution to unify mentions.
- **Output**: RDF or RDF★ triples representing subjects, predicates, and objects.
  - Example RDF triples:  
    `(Invoice_123, hasAmount, 5400)`  
    `(Invoice_123, issuedBy, Vendor_A)`  
  - RDF★ allows annotation of triples with metadata (e.g., provenance, confidence).

- **Configuration Options**:
  - Support for multiple text parsers or NLP backends (spaCy, Hugging Face, etc.).
  - Pluggable ontology templates for domain-specific schemas.
  - Control over granularity (sentence-level vs. document-level triples).
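
As a rough illustration of the Step 1 output shape, here is a minimal sketch that uses only the standard library. A regular expression stands in for the NER and relationship-extraction backend, and every name in it (`Triple`, `AnnotatedTriple`, `extract_invoice_triples`, the `issuedOn` predicate, the annotation keys) is hypothetical and not part of `intugle`. The confidence value is a placeholder, not a real model score.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """A plain (subject, predicate, object) statement."""
    subject: str
    predicate: str
    obj: str


@dataclass
class AnnotatedTriple:
    """A triple plus RDF-star style metadata about the statement itself."""
    triple: Triple
    annotations: dict = field(default_factory=dict)


# Toy pattern standing in for NER + relation extraction (assumption:
# a real backend such as spaCy would replace this).
INVOICE_PATTERN = re.compile(
    r"Invoice (?P<number>\d+) was issued by (?P<vendor>[A-Z][\w ]*?) on "
    r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}) "
    r"for an amount of \$(?P<amount>[\d,]+(?:\.\d+)?)"
)


def extract_invoice_triples(text, source="unknown"):
    """Return annotated triples for every invoice sentence found in text."""
    results = []
    for match in INVOICE_PATTERN.finditer(text):
        invoice = f"Invoice_{match['number']}"
        vendor = match["vendor"].strip().replace(" ", "_")
        amount = match["amount"].replace(",", "")
        facts = [
            Triple(invoice, "issuedBy", vendor),
            Triple(invoice, "issuedOn", match["date"]),
            Triple(invoice, "hasAmount", amount),
        ]
        for fact in facts:
            # Placeholder provenance and confidence attached per statement.
            results.append(
                AnnotatedTriple(fact, {"source": source, "confidence": 1.0})
            )
    return results


if __name__ == "__main__":
    sentence = (
        "Invoice 123 was issued by Vendor A on March 4, 2024 "
        "for an amount of $5,400."
    )
    for item in extract_invoice_triples(sentence, source="page-1"):
        print(item.triple, item.annotations)
```

Keeping annotations beside the triple, rather than inside it, lets the same statement be emitted as plain RDF (annotations dropped) or as RDF★ (annotations attached to the quoted triple).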

### Step 2 – Semantic Mapping
- **Input**: RDF or RDF★ graph generated from Step 1.
- **Processing**:
  - Match entities and relationships to existing semantic nodes in the model.
  - Auto-create or suggest new nodes when unmapped entities are found.
  - Allow configurable thresholds for matching (string similarity, embeddings).
- **Output**:
  - Enhanced Semantic Model with linked or new concepts integrated.
  - Visualization of new connections and confidence scores.
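
The threshold-based matching described above could look roughly like the following sketch. It uses string similarity from the standard library's `difflib` only; embedding-based matching is not shown. The function names (`normalize`, `similarity`, `map_entities`) and the flat list of known node labels are assumptions for illustration, not the existing `SemanticModel` interface.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase a label and treat underscores as spaces before comparing."""
    return label.replace("_", " ").strip().lower()


def similarity(a, b):
    """String similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_entities(extracted, known_nodes, threshold=0.85):
    """Link each extracted entity to its best-matching known node.

    Returns (linked, suggested): `linked` maps entity -> known node for
    matches at or above the threshold; `suggested` lists entities that
    had no sufficiently close match and could become new nodes.
    """
    linked, suggested = {}, []
    for entity in extracted:
        best = max(
            known_nodes,
            key=lambda node: similarity(entity, node),
            default=None,
        )
        if best is not None and similarity(entity, best) >= threshold:
            linked[entity] = best
        else:
            suggested.append(entity)
    return linked, suggested


if __name__ == "__main__":
    linked, suggested = map_entities(
        ["Vendor_A", "Invoice_123"], ["vendor a", "Customer"]
    )
    print("linked:", linked)
    print("suggested new nodes:", suggested)
```

Returning unmatched entities separately, instead of creating nodes immediately, leaves room for the "auto-create or suggest" choice to be a configuration option.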

### Integration Points
- Can be exposed as:
  - A **Python API** (`intugle.semantic.unstructured` or similar).
  - A **CLI command** (`intugle text-to-semantic ...`).
- Output compatible with existing semantic features (`SemanticModel`, `DataProduct`, etc.).
- Optional storage/export of RDF graphs as Turtle, JSON-LD, or Parquet for downstream analytics.
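
For the Turtle export option, a minimal serialization sketch is shown below. It is written against plain tuples and the standard library, with a hypothetical `to_turtle` helper, an example base IRI, and a hypothetical `literal_predicates` parameter that decides which objects are written as string literals. It covers plain RDF only and does not handle RDF★ annotations, datatypes, or names that are not valid Turtle local names.

```python
def escape_literal(value):
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_turtle(triples, base_iri="http://example.org/",
              literal_predicates=("hasAmount",)):
    """Serialize (subject, predicate, object) tuples as Turtle text.

    Objects of predicates listed in `literal_predicates` are written as
    string literals; all other objects are written as prefixed names.
    """
    lines = [f"@prefix ex: <{base_iri}> .", ""]
    for subject, predicate, obj in triples:
        if predicate in literal_predicates:
            rendered = f'"{escape_literal(obj)}"'
        else:
            rendered = f"ex:{obj}"
        lines.append(f"ex:{subject} ex:{predicate} {rendered} .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_turtle([
        ("Invoice_123", "hasAmount", "5400"),
        ("Invoice_123", "issuedBy", "Vendor_A"),
    ]))
```

In practice an RDF library would likely handle serialization; the sketch only shows what the exported statements from the Step 1 example would look like.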

## Use Case
**Domain**: Cross-domain (Finance, Retail, Healthcare, Research)

**Workflow**:
1. An OCR pipeline extracts text from scanned reports or documents. (This step is assumed to exist already and is out of scope for this feature.)
2. Text blocks are passed to the new `text_to_semantic()` function.
3. RDF triples are generated from named entities and relationships in text.
4. RDF is then mapped into the active Semantic Model:
   - Linking new entities (e.g., organizations, people, identifiers) to known nodes.
   - Enriching the model with contextual knowledge derived from the document.
5. Analysts can now run semantic queries or discover insights across both tabular and text-based datasets.

## Alternative Solutions
- Standalone NLP pipelines (spaCy, CoreNLP, LlamaIndex) – useful for extraction, but lack semantic integration or model awareness.
- Knowledge graph builders (Neo4j, GraphDB) – handle RDF well but are not integrated with `intugle`’s semantic and data product ecosystem.
- Manual ontology tagging – time-consuming and error-prone.

The proposed feature unifies **text understanding** and **semantic modeling** within a single consistent framework.

## Examples
```python
from intugle import TextToSemanticProcessor, SemanticModel

# Step 1: Convert unstructured text to RDF triples
text_input = """
Invoice 123 was issued by Vendor A on March 4, 2024 for an amount of $5,400.
"""
processor = TextToSemanticProcessor(model="en_core_web_lg", output_format="rdf_star")
rdf_graph = processor.parse(text_input)

# Step 2: Map RDF to the existing semantic model
semantic_model = SemanticModel.load("finance_semantic_model.json")
semantic_model.overlay(rdf_graph, match_threshold=0.85)

semantic_model.save("finance_semantic_model_enriched.json")
```