
Data Quality Test Generator (dbt, Great Expectations, Soda, SQL) #110

@Empreiteiro

Description

Summary

Automatically generate data quality test configurations from source profiling data. Output tests in multiple formats: dbt schema.yml tests, Great Expectations suites, Soda checks, or raw SQL assertions. Users copy these into their existing testing frameworks.

Problem

  • The platform computes rich profiling (nulls, ranges, unique counts, top values, distributions) but this data is only used for LLM context — never for test generation
  • Writing data quality tests manually is tedious and error-prone
  • Teams often skip quality testing because the setup cost is too high
  • Different teams use different testing frameworks (dbt, Great Expectations, Soda) — each has its own config syntax

Proposed Solution

Test Types Generated

| Profiling Signal | Generated Test |
| --- | --- |
| 0 nulls in column | not_null |
| Unique count = row count | unique |
| Low cardinality (< 20 values) | accepted_values with value list |
| Numeric range (min/max) | range_check (value between min and max) |
| All values match pattern | regex_match (email, phone, URL) |
| FK candidate detected | relationships test |
| Date column present | freshness check |
| Row count baseline | row_count threshold |
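The mapping above can be sketched as a rule-based generator. This is a minimal illustration only: the profile dict keys used here (row_count, null_count, unique_count, top_values, min, max) are assumptions, not necessarily the platform's actual sample_profile schema.

```python
def suggest_tests(column: str, profile: dict) -> list[dict]:
    """Map profiling signals for one column to candidate test specs.

    `profile` is a hypothetical per-column profile dict; keys are assumed
    here for illustration and may differ from sample_profile.
    """
    tests = []
    rows = profile["row_count"]

    # 0 nulls observed -> not_null
    if profile.get("null_count") == 0:
        tests.append({"column": column, "test": "not_null"})

    # Every value distinct -> unique
    if rows > 0 and profile.get("unique_count") == rows:
        tests.append({"column": column, "test": "unique"})

    # Low cardinality (< 20 values) -> accepted_values with the observed list
    top_values = profile.get("top_values") or []
    if 0 < len(top_values) < 20 and profile.get("unique_count", rows) < 20:
        tests.append({"column": column, "test": "accepted_values",
                      "values": [v for v, _count in top_values]})

    # Numeric min/max present -> range check
    if profile.get("min") is not None and profile.get("max") is not None:
        tests.append({"column": column, "test": "range_check",
                      "min": profile["min"], "max": profile["max"]})

    return tests
```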

Output Formats

  1. dbt schema.yml: Native dbt test syntax with tests: blocks
  2. dbt custom SQL tests: tests/ directory SQL files for complex checks
  3. Great Expectations: JSON suite with expectation configs
  4. Soda Checks: YAML check definitions for Soda Core
  5. Raw SQL: Standalone SQL assertions (a SELECT COUNT(*) with the violation condition in the WHERE clause; zero rows means pass)
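As an illustration of output format 1, a generated dbt schema.yml could look like the following. The model and column names are hypothetical; the test syntax is standard dbt:

```yaml
version: 2

models:
  - name: customers        # hypothetical model name
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'banned']
```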

How It Works

  1. User selects source and output format
  2. Platform reads sample_profile and relationship metadata
  3. LLM enriches with context-aware suggestions (e.g., email pattern detection, PII flagging)
  4. User reviews and adjusts suggested tests
  5. Export as copy-friendly text or downloadable config files
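For the raw-SQL output, the exported artifact would be a standalone assertion query that counts violating rows, with zero meaning pass. Table name and pattern below are illustrative, not generated by the platform:

```sql
-- Assertion: customers.email is non-null and roughly email-shaped.
-- The test passes when this query returns a count of 0.
SELECT COUNT(*) AS violations
FROM customers
WHERE email IS NULL
   OR email NOT LIKE '%_@_%._%';
```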

Technical Notes

  • Input: Source.metadata_.sample_profile (nulls, types, numeric stats, top_values) from crud.py lines 66-94
  • Relationship tests: Use suggest_source_relationships() from sql_utils.py for FK-based tests
  • LLM enhancement: Beyond rule-based generation, LLM can suggest semantic tests (e.g., "this looks like an email column — add format validation")
  • New endpoint: POST /api/sources/{id}/generate-quality-tests with params: format (dbt/ge/soda/sql), strictness (lenient/moderate/strict)
  • No execution: Tests are text output — the platform does not run them
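A minimal sketch of parameter validation for the proposed endpoint (POST /api/sources/{id}/generate-quality-tests). The helper name and error messages are hypothetical; only the allowed parameter values come from the notes above:

```python
# Allowed values per the endpoint params described above.
VALID_FORMATS = {"dbt", "ge", "soda", "sql"}
VALID_STRICTNESS = {"lenient", "moderate", "strict"}


def validate_generation_params(fmt: str, strictness: str) -> dict:
    """Reject unknown format/strictness values before generation starts.

    Hypothetical helper: the real endpoint might validate via its web
    framework's request schema instead.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}")
    if strictness not in VALID_STRICTNESS:
        raise ValueError(f"strictness must be one of {sorted(VALID_STRICTNESS)}")
    return {"format": fmt, "strictness": strictness}
```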

Acceptance Criteria

  • Auto-detect not_null, unique, accepted_values, range checks from profiling
  • Generate tests in at least 3 formats (dbt, Great Expectations, Soda)
  • LLM suggests semantic tests beyond rule-based (email format, date patterns, PII)
  • Relationship/FK tests generated from detected cross-source relationships
  • Strictness levels: lenient (critical only), moderate, strict (all possible tests)
  • Export as individual files or bundled config
  • Preview mode: show which tests would be generated before exporting
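The strictness levels could be implemented as a filter over the full set of rule-generated tests. Which test types count as "critical" is an assumption here, not a decided policy:

```python
# Assumed tiering: which tests are "critical" is illustrative only.
CRITICAL = {"not_null", "unique", "relationships"}
MODERATE_EXTRA = {"accepted_values", "range_check"}


def filter_by_strictness(tests: list[dict], strictness: str) -> list[dict]:
    """Keep only the tests allowed at the given strictness level.

    lenient  -> critical tests only
    moderate -> critical plus value/range checks
    strict   -> everything the rules produced
    """
    if strictness == "strict":
        return tests
    allowed = CRITICAL if strictness == "lenient" else CRITICAL | MODERATE_EXTRA
    return [t for t in tests if t["test"] in allowed]
```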

Metadata

Labels: enhancement (New feature or request)