Feature Description
NoSQL database parser – a way to extract collections and parse them into parent–child hierarchical tables with relationships preserved. The goal is to translate one or more NoSQL collections into a relational model that is suitable for tabular/analytical storage formats such as Parquet and Apache Iceberg.
Key capabilities:
- Introspect NoSQL collections and infer their structure
- Flatten nested documents and arrays into normalized tables
- Preserve parent–child relationships via generated keys/foreign keys
- Emit a relational schema and data suitable for downstream analytical engines and lakehouse formats
Problem Statement
Many analytical and BI tools work best with relational/tabular data, while application data is often stored in NoSQL systems with deeply nested, schema-flexible documents. Today, mapping those collections into a clean relational model requires custom, one-off ETL logic that is hard to maintain and reuse.
This feature will allow NoSQL collections to be mapped into a larger analytical model that can be used by downstream data scientists and business teams for reporting, experimentation, and decision-making. Having a reusable parser inside intugle/data-tools will standardize this process and reduce repeated ETL work across projects.
Proposed Solution
Introduce a NoSQL → Relational parser module within intugle/data-tools with:
- Input connectors
  - Pluggable adapters for common NoSQL sources (e.g., MongoDB, DynamoDB, Firestore, generic JSON/NDJSON files).
  - Ability to work from:
    - A live connection (sampling documents), or
    - A stored sample export.
- Schema inference & modeling
  - Infer field types from sample documents with a configurable sample size.
  - Detect nested objects and arrays and:
    - Create separate child tables for repeated/nested structures.
    - Generate synthetic primary keys when none exist.
    - Generate foreign keys to preserve parent–child relationships.
  - Handle:
    - Optional / sparse fields
    - Polymorphic subdocuments (union of fields, with nulls where absent)
    - Simple type coercion (e.g., string → timestamp when patterns match).
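As a rough illustration of the flattening rules above, here is a minimal sketch (not the proposed implementation; the function and key names are hypothetical): scalars stay on the current row, nested objects are inlined with a dotted prefix, and arrays of objects become child-table rows linked back to the parent via a synthetic key.

```python
import uuid


def flatten(doc, table, tables, parent=None):
    """Split one nested document into per-table rows (illustrative only)."""
    row = {"_id": str(uuid.uuid4())}  # synthetic primary key
    if parent:
        parent_table, parent_id = parent
        row[f"{parent_table}_id"] = parent_id  # generated foreign key
    for key, value in doc.items():
        if isinstance(value, dict):
            # Nested object: inline as prefixed columns on the same row
            for k, v in value.items():
                row[f"{key}.{k}"] = v
        elif isinstance(value, list):
            # Repeated structure: emit rows into a child table
            for item in value:
                flatten(item, f"{table}_{key}", tables, (table, row["_id"]))
        else:
            row[key] = value
    tables.setdefault(table, []).append(row)
    return tables


tables = flatten(
    {"order_id": 1, "customer": {"name": "Ada"},
     "items": [{"sku": "A"}, {"sku": "B"}]},
    "orders", {},
)
# tables now holds an "orders" table plus an "orders_items" child table
# whose rows carry an orders_id foreign key back to the parent row.
```

A real implementation would additionally merge schemas across many sample documents and apply the type-coercion and override rules described above.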
- Relational output
  - Produce a relational schema description (JSON/YAML model or SQL DDL).
  - Optionally materialize:
    - Parquet files per table.
    - Iceberg table definitions targeting a given catalog/warehouse.
  - Provide mapping metadata:
    - Original JSON path → table/column
    - Relationship graph (parent table, child table, join keys).
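One possible shape for that mapping metadata, sketched as a plain Python structure (all key names are illustrative, not a committed format):

```python
# Hypothetical mapping metadata a run could emit alongside the tables.
mapping = {
    "columns": [
        # original JSON path -> relational table/column
        {"json_path": "$.order_id", "table": "orders", "column": "order_id"},
        {"json_path": "$.items[*].sku", "table": "order_items", "column": "sku"},
    ],
    "relationships": [
        # relationship graph: parent table, child table, join keys
        {"parent": "orders", "child": "order_items",
         "parent_key": "order_id", "child_key": "order_id"},
    ],
}
```

Keeping this metadata machine-readable lets downstream tools reconstruct joins without inspecting the original documents.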
- Configuration & overrides
  - User-configurable rules:
    - Explicit primary key fields (when they exist in the NoSQL data).
    - Naming conventions for tables/columns.
    - Rules for flattening vs. normalizing arrays.
    - Type overrides (e.g., force `amount` to `DECIMAL(18,2)`).
  - Support for a declarative config file (YAML/JSON) so runs are reproducible.
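Such a declarative config file might look roughly like this (illustrative YAML only; the key names mirror the Python config dict in the Examples section and are not final):

```yaml
# Hypothetical intugle NoSQL parser config – structure is not final
primary_keys:
  orders: [order_id]
  order_items: [order_id, item_id]
type_overrides:
  orders.order_date: TIMESTAMP
  order_items.price: DECIMAL(18,2)
naming:
  table_case: snake
  column_case: snake
arrays:
  orders.items:
    mode: child_table
    table_name: order_items
```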
- APIs & CLI
  - Python API for programmatic use in data pipelines.
  - CLI command (e.g., `intugle nosql-to-relational ...`) that:
    - Connects to a source
    - Infers the schema
    - Exports relational tables to Parquet / Iceberg and/or emits DDL.
Use Case
Domain: Analytics / Data Engineering for applications backed by NoSQL stores
Workflow:
- Application data is stored in a NoSQL database (e.g., MongoDB) with nested documents.
  - Example: a `users` collection with embedded `addresses`, `orders`, and `order_items`.
- For analytics, the team wants:
  - Clean relational tables (`users`, `addresses`, `orders`, `order_items`) in Parquet/Iceberg.
  - Stable schemas that BI tools and data scientists can query with SQL.
- Current approach:
  - Custom Python jobs or ad-hoc scripts flatten/normalize documents for each project.
  - Logic is duplicated, and changes in the source schema are hard to propagate.
- With this feature:
  - A data engineer runs the NoSQL parser in `intugle/data-tools` against the NoSQL collections.
  - The parser:
    - Infers structure and types.
    - Emits a relational model and materializes Parquet/Iceberg tables.
  - Downstream:
    - Data scientists query the resulting tables directly.
    - Business users build dashboards without needing to understand the original NoSQL structure.
Alternative Solutions
- Custom ETL scripts: Handwritten Python/Scala jobs using generic JSON processing to flatten documents.
  - Downsides: Non-standard, brittle, and time-consuming to maintain.
- Vendor-specific tools (e.g., cloud warehouse native JSON flattening):
  - Often tied to a specific platform.
  - Limited control over how nested arrays become child tables and how relationships are modeled.
- Manual modeling in BI tools:
  - Import raw JSON exports and model relationships manually.
  - Does not scale and easily becomes inconsistent.
None of these provide a reusable, open, and configurable parser integrated into intugle/data-tools with explicit relationship modeling and Iceberg/Parquet-friendly outputs.
Examples
```python
from intugle import NoSQLToRelationalParser, MongoSource
from intugle import IcebergTarget, ParquetTarget

# exact module layout TBD – following the existing `from intugle import SemanticModel` style

# 1. Define the NoSQL source
source = MongoSource(
    uri="mongodb://user:pass@host:27017",
    database="app_db",
    collection="orders",
    sample_size=10_000,
)

# 2. Configure output targets
parquet_target = ParquetTarget(
    output_path="s3://my-bucket/analytics/orders_relational/"
)
iceberg_target = IcebergTarget(
    catalog_name="main",
    namespace="analytics",
    table_prefix="orders_",
)

# 3. Optional: configuration overrides
config = {
    "primary_keys": {
        "orders": ["order_id"],
        "order_items": ["order_id", "item_id"],
    },
    "type_overrides": {
        "orders.order_date": "TIMESTAMP",
        "order_items.price": "DECIMAL(18,2)",
    },
    "naming": {
        "table_case": "snake",
        "column_case": "snake",
    },
    "arrays": {
        # Example: treat `orders.items` as its own child table
        "orders.items": {"mode": "child_table", "table_name": "order_items"},
    },
}

# 4. Run the parser
parser = NoSQLToRelationalParser(source=source, config=config)

# Infer the relational model (tables, columns, relationships)
relational_model = parser.infer_model()

# 5. Materialize tables
parser.write(parquet_target)
parser.write(iceberg_target)

# `relational_model` can also be exported as JSON/YAML for inspection or versioning
```