Feature Description
NoSQL database parser – a way to extract collections and parse them into parent–child hierarchical tables with relationships preserved. The goal is to translate one or more NoSQL collections into a relational model that is suitable for tabular/analytical storage formats such as Parquet and Apache Iceberg.
Key capabilities:
- Introspect NoSQL collections and infer their structure
- Flatten nested documents and arrays into normalized tables
- Preserve parent–child relationships via generated keys/foreign keys
- Emit a relational schema and data suitable for downstream analytical engines and lakehouse formats
Problem Statement
Many analytical and BI tools work best with relational/tabular data, while application data is often stored in NoSQL systems with deeply nested, schema-flexible documents. Today, mapping those collections into a clean relational model requires custom, one-off ETL logic that is hard to maintain and reuse.
This feature will allow NoSQL collections to be mapped into a larger analytical model that can be used by downstream data scientists and business teams for reporting, experimentation, and decision-making. Having a reusable parser inside intugle/data-tools will standardize this process and reduce repeated ETL work across projects.
Proposed Solution
Introduce a NoSQL → Relational parser module within intugle/data-tools with:
- Input connectors
  - Pluggable adapters for common NoSQL sources (e.g., MongoDB, DynamoDB, Firestore, generic JSON/NDJSON files).
  - Ability to work from:
    - A live connection (sampling documents), or
    - A stored sample export.
- Schema inference & modeling
  - Infer field types from sample documents with a configurable sample size.
  - Detect nested objects and arrays and:
    - Create separate child tables for repeated/nested structures.
    - Generate synthetic primary keys when none exist.
    - Generate foreign keys to preserve parent–child relationships.
  - Handle:
    - Optional / sparse fields
    - Polymorphic subdocuments (union of fields, with nulls where absent)
    - Simple type coercion (e.g., string → timestamp when patterns match).
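As a rough illustration of the flattening rules above, here is a minimal sketch (not the proposed implementation; the function and key names are hypothetical): scalars stay on the current row, nested objects are inlined with a dotted prefix, and arrays of objects become child-table rows linked back to the parent via a synthetic key.

```python
import uuid


def flatten(doc, table, tables, parent=None):
    """Split one nested document into per-table rows (illustrative only)."""
    row = {"_id": str(uuid.uuid4())}  # synthetic primary key
    if parent:
        parent_table, parent_id = parent
        row[f"{parent_table}_id"] = parent_id  # generated foreign key
    for key, value in doc.items():
        if isinstance(value, dict):
            # Nested object: inline as prefixed columns on the same row
            for k, v in value.items():
                row[f"{key}.{k}"] = v
        elif isinstance(value, list):
            # Repeated structure: emit rows into a child table
            for item in value:
                flatten(item, f"{table}_{key}", tables, (table, row["_id"]))
        else:
            row[key] = value
    tables.setdefault(table, []).append(row)
    return tables


tables = flatten(
    {"order_id": 1, "customer": {"name": "Ada"},
     "items": [{"sku": "A"}, {"sku": "B"}]},
    "orders", {},
)
# tables now holds an "orders" table plus an "orders_items" child table
# whose rows carry an orders_id foreign key back to the parent row.
```

A real implementation would additionally merge schemas across many sample documents and apply the type-coercion and override rules described above.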
- Relational output
  - Produce a relational schema description (JSON/YAML model or SQL DDL).
  - Optionally materialize:
    - Parquet files per table.
    - Iceberg table definitions targeting a given catalog/warehouse.
  - Provide mapping metadata:
    - Original JSON path → table/column
    - Relationship graph (parent table, child table, join keys).
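One possible shape for that mapping metadata, sketched as a plain Python structure (all key names are illustrative, not a committed format):

```python
# Hypothetical mapping metadata a run could emit alongside the tables.
mapping = {
    "columns": [
        # original JSON path -> relational table/column
        {"json_path": "$.order_id", "table": "orders", "column": "order_id"},
        {"json_path": "$.items[*].sku", "table": "order_items", "column": "sku"},
    ],
    "relationships": [
        # relationship graph: parent table, child table, join keys
        {"parent": "orders", "child": "order_items",
         "parent_key": "order_id", "child_key": "order_id"},
    ],
}
```

Keeping this metadata machine-readable lets downstream tools reconstruct joins without inspecting the original documents.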
- Configuration & overrides
  - User-configurable rules:
    - Explicit primary key fields (when they exist in the NoSQL data).
    - Naming conventions for tables/columns.
    - Rules for flattening vs. normalizing arrays.
    - Type overrides (e.g., force `amount` to `DECIMAL(18,2)`).
  - Support for a declarative config file (YAML/JSON) so runs are reproducible.
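Such a declarative config file might look roughly like this (illustrative YAML only; the key names mirror the Python config dict in the Examples section and are not final):

```yaml
# Hypothetical intugle NoSQL parser config – structure is not final
primary_keys:
  orders: [order_id]
  order_items: [order_id, item_id]
type_overrides:
  orders.order_date: TIMESTAMP
  order_items.price: DECIMAL(18,2)
naming:
  table_case: snake
  column_case: snake
arrays:
  orders.items:
    mode: child_table
    table_name: order_items
```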
- APIs & CLI
  - Python API for programmatic use in data pipelines.
  - CLI command (e.g., `intugle nosql-to-relational ...`) that:
    - Connects to a source
    - Infers the schema
    - Exports relational tables to Parquet / Iceberg and/or emits DDL.
Use Case
Domain: Analytics / Data Engineering for applications backed by NoSQL stores
Workflow:
- Application data is stored in a NoSQL database (e.g., MongoDB) with nested documents.
  - Example: a `users` collection with embedded `addresses`, `orders`, and `order_items`.
- For analytics, the team wants:
  - Clean relational tables (`users`, `addresses`, `orders`, `order_items`) in Parquet/Iceberg.
  - Stable schemas that BI tools and data scientists can query with SQL.
- Current approach:
  - Custom Python jobs or ad-hoc scripts flatten/normalize documents for each project.
  - Logic is duplicated, and changes in the source schema are hard to propagate.
- With this feature:
  - A data engineer runs the NoSQL parser in `intugle/data-tools` against the NoSQL collections.
  - The parser:
    - Infers structure and types.
    - Emits a relational model and materializes Parquet/Iceberg tables.
  - Downstream:
    - Data scientists query the resulting tables directly.
    - Business users build dashboards without needing to understand the original NoSQL structure.
Alternative Solutions
- Custom ETL scripts: Handwritten Python/Scala jobs using generic JSON processing to flatten documents.
  - Downsides: Non-standard, brittle, and time-consuming to maintain.
- Vendor-specific tools (e.g., cloud warehouse native JSON flattening):
  - Often tied to a specific platform.
  - Limited control over how nested arrays become child tables and how relationships are modeled.
- Manual modeling in BI tools:
  - Import raw JSON exports and model relationships manually.
  - Does not scale and easily becomes inconsistent.
None of these provide a reusable, open, and configurable parser integrated into intugle/data-tools with explicit relationship modeling and Iceberg/Parquet-friendly outputs.
Examples
```python
from intugle import NoSQLToRelationalParser, MongoSource
from intugle import IcebergTarget, ParquetTarget

# exact module layout TBD – following the existing `from intugle import SemanticModel` style

# 1. Define the NoSQL source
source = MongoSource(
    uri="mongodb://user:pass@host:27017",
    database="app_db",
    collection="orders",
    sample_size=10_000,
)

# 2. Configure output targets
parquet_target = ParquetTarget(
    output_path="s3://my-bucket/analytics/orders_relational/"
)
iceberg_target = IcebergTarget(
    catalog_name="main",
    namespace="analytics",
    table_prefix="orders_",
)

# 3. Optional: configuration overrides
config = {
    "primary_keys": {
        "orders": ["order_id"],
        "order_items": ["order_id", "item_id"],
    },
    "type_overrides": {
        "orders.order_date": "TIMESTAMP",
        "order_items.price": "DECIMAL(18,2)",
    },
    "naming": {
        "table_case": "snake",
        "column_case": "snake",
    },
    "arrays": {
        # Example: treat `orders.items` as its own child table
        "orders.items": {"mode": "child_table", "table_name": "order_items"},
    },
}

# 4. Run the parser
parser = NoSQLToRelationalParser(source=source, config=config)

# Infer the relational model (tables, columns, relationships)
relational_model = parser.infer_model()

# 5. Materialize tables
parser.write(parquet_target)
parser.write(iceberg_target)

# `relational_model` can also be exported as JSON/YAML for inspection or versioning
```