[FEATURE] NoSQL Parser #107

@prinkanintugle

Feature Description

NoSQL database parser – a way to extract collections and parse them into parent–child hierarchical tables with relationships preserved. The goal is to translate one or more NoSQL collections into a relational model that is suitable for tabular/analytical storage formats such as Parquet and Apache Iceberg.

Key capabilities:

  • Introspect NoSQL collections and infer their structure
  • Flatten nested documents and arrays into normalized tables (a minimal sketch follows this list)
  • Preserve parent–child relationships via generated keys/foreign keys
  • Emit a relational schema and data suitable for downstream analytical engines and lakehouse formats
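
To make the flattening idea concrete, below is a minimal sketch in plain Python (no intugle APIs). The orders/order_items names and the flatten_order helper are illustrative assumptions for a document with a single nested array, not part of the proposed module:

import uuid


def flatten_order(doc: dict) -> dict:
    """Split one order document into 'orders' and 'order_items' row lists."""
    # Use the document's own id if present, otherwise generate a synthetic key
    order_id = doc.get("_id") or str(uuid.uuid4())
    # Parent row: everything except the nested array, plus the (possibly synthetic) key
    parent = {k: v for k, v in doc.items() if k != "items"}
    parent["order_id"] = order_id
    # Child rows: one per array element, each carrying a foreign key back to the parent
    children = [
        {"order_id": order_id, "item_index": i, **item}
        for i, item in enumerate(doc.get("items", []))
    ]
    return {"orders": [parent], "order_items": children}


doc = {"_id": "o-1", "customer": "acme",
       "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}
print(flatten_order(doc))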

Problem Statement

Many analytical and BI tools work best with relational/tabular data, while application data is often stored in NoSQL systems with deeply nested, schema-flexible documents. Today, mapping those collections into a clean relational model requires custom, one-off ETL logic that is hard to maintain and reuse.

This feature will allow NoSQL collections to be mapped into a larger analytical model that can be used by downstream data scientists and business teams for reporting, experimentation, and decision-making. Having a reusable parser inside intugle/data-tools will standardize this process and reduce repeated ETL work across projects.

Proposed Solution

Introduce a NoSQL → Relational parser module within intugle/data-tools with:

  1. Input connectors

    • Pluggable adapters for common NoSQL sources (e.g., MongoDB, DynamoDB, Firestore, generic JSON/NDJSON files).
    • Ability to work from:
      • A live connection (sampling documents), or
      • A stored sample export.
  2. Schema inference & modeling

    • Infer field types from sample documents with configurable sample size.
    • Detect nested objects and arrays and:
      • Create separate child tables for repeated/nested structures.
      • Generate synthetic primary keys when none exist.
      • Generate foreign keys to preserve parent–child relationships.
    • Handle:
      • Optional / sparse fields
      • Polymorphic subdocuments (union of fields with nulls where absent)
      • Simple type coercion (e.g., string → timestamp when patterns match).
  3. Relational output

    • Produce a relational schema description (JSON/YAML model or SQL DDL).
    • Optionally materialize:
      • Parquet files per table.
      • Iceberg table definitions targeting a given catalog/warehouse.
    • Provide mapping metadata (see the sketch after this list):
      • Original JSON path → table/column
      • Relationship graph (parent table, child table, join keys).
  4. Configuration & overrides

    • User-configurable rules:
      • Explicit primary key fields (when they exist in the NoSQL data).
      • Naming conventions for tables/columns.
      • Rules for flattening vs. normalizing arrays.
      • Type overrides (e.g., force amount to DECIMAL(18,2)).
    • Support for a declarative config file (YAML/JSON) so runs are reproducible.
  5. APIs & CLI

    • Python API for programmatic use in data pipelines.
    • CLI command (e.g., intugle nosql-to-relational ...) that:
      • Connects to a source
      • Infers the schema
      • Exports relational tables to Parquet / Iceberg and/or emits DDL.
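
To make the output of items 2–3 concrete, the sketch below shows one hypothetical shape for the emitted model description and mapping metadata. The dictionary keys, JSON-path syntax, and table/column names are illustrative assumptions, not a committed format:

# Hypothetical shape of the emitted relational model / mapping metadata.
# Keys and names here are placeholders for discussion, not a final schema.
relational_model = {
    "tables": {
        "orders": {
            "columns": {"order_id": "STRING", "order_date": "TIMESTAMP"},
            "primary_key": ["order_id"],
        },
        "order_items": {
            "columns": {"order_id": "STRING", "item_id": "STRING", "price": "DECIMAL(18,2)"},
            "primary_key": ["order_id", "item_id"],
        },
    },
    "column_mappings": {
        "$.order_date": "orders.order_date",       # original JSON path -> table/column
        "$.items[*].price": "order_items.price",
    },
    "relationships": [
        {"parent": "orders", "child": "order_items", "join_keys": ["order_id"]},
    ],
}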

Use Case

Domain: Analytics / Data Engineering for applications backed by NoSQL stores

Workflow:

  • Application data is stored in a NoSQL database (e.g., MongoDB) with nested documents.
    • Example: users collection with embedded addresses, orders, and order_items.
  • For analytics, the team wants:
    • Clean relational tables (users, addresses, orders, order_items) in Parquet/Iceberg.
    • Stable schemas that BI tools and data scientists can query with SQL.
  • Current approach:
    • Custom Python jobs or ad-hoc scripts flatten/normalize documents for each project.
    • Logic is duplicated, and changes in the source schema are hard to propagate.
  • With this feature:
    • A data engineer runs the NoSQL parser in intugle/data-tools against the NoSQL collections.
    • The parser:
      • Infers structure and types.
      • Emits a relational model and materializes Parquet/Iceberg tables.
    • Downstream:
      • Data scientists query the resulting tables directly.
      • Business users build dashboards without needing to understand the original NoSQL structure.

Alternative Solutions

  • Custom ETL scripts: Handwritten Python/Scala jobs using generic JSON processing to flatten documents.
    • Downsides: Non-standard, brittle, and time-consuming to maintain.
  • Vendor-specific tools (e.g., cloud warehouse native JSON flattening):
    • Often tied to a specific platform.
    • Limited control over how nested arrays become child tables and how relationships are modeled.
  • Manual modeling in BI tools:
    • Import raw JSON exports and model relationships manually.
    • Does not scale and easily becomes inconsistent.

None of these provide a reusable, open, and configurable parser integrated into intugle/data-tools with explicit relationship modeling and Iceberg/Parquet-friendly outputs.

Examples

from intugle import NoSQLToRelationalParser, MongoSource
from intugle import IcebergTarget, ParquetTarget
# exact module layout TBD – following the existing `from intugle import SemanticModel` style


# 1. Define the NoSQL source
source = MongoSource(
    uri="mongodb://user:pass@host:27017",
    database="app_db",
    collection="orders",
    sample_size=10_000,
)

# 2. Configure output targets
parquet_target = ParquetTarget(
    output_path="s3://my-bucket/analytics/orders_relational/"
)

iceberg_target = IcebergTarget(
    catalog_name="main",
    namespace="analytics",
    table_prefix="orders_",
)

# 3. Optional: configuration overrides
config = {
    "primary_keys": {
        "orders": ["order_id"],
        "order_items": ["order_id", "item_id"],
    },
    "type_overrides": {
        "orders.order_date": "TIMESTAMP",
        "order_items.price": "DECIMAL(18,2)",
    },
    "naming": {
        "table_case": "snake",
        "column_case": "snake",
    },
    "arrays": {
        # Example: treat `orders.items` as its own child table
        "orders.items": {"mode": "child_table", "table_name": "order_items"},
    },
}

# 4. Run the parser
parser = NoSQLToRelationalParser(source=source, config=config)

# Infer relational model (tables, columns, relationships)
relational_model = parser.infer_model()

# 5. Materialize tables
parser.write(parquet_target)
parser.write(iceberg_target)

# 'relational_model' can also be exported as JSON/YAML for inspection or versioning
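
As a follow-on to the comment above, exporting the inferred model for inspection or versioning might look like the sketch below; to_yaml and to_json are placeholder method names, not an existing intugle API:

# 6. Export the inferred model (placeholder method names, not the final API)
relational_model.to_yaml("orders_model.yml")    # human-reviewable, versionable model
relational_model.to_json("orders_model.json")   # machine-readable mapping metadata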
