Summary
Automatically generate data quality test configurations from source profiling data. Output tests in multiple formats: dbt schema.yml tests, Great Expectations suites, Soda checks, or raw SQL assertions. Users copy these into their existing testing frameworks.
Problem
- The platform computes rich profiling (nulls, ranges, unique counts, top values, distributions) but this data is only used for LLM context — never for test generation
- Writing data quality tests manually is tedious and error-prone
- Teams often skip quality testing because the setup cost is too high
- Different teams use different testing frameworks (dbt, Great Expectations, Soda) — each has its own config syntax
Proposed Solution
Test Types Generated
| Profiling Signal | Generated Test |
| --- | --- |
| 0 nulls in column | `not_null` |
| Unique count = row count | `unique` |
| Low cardinality (< 20 values) | `accepted_values` with list |
| Numeric range (min/max) | `range_check` (value between min and max) |
| All values match pattern | `regex_match` (email, phone, URL) |
| FK candidate detected | `relationships` test |
| Date column present | freshness check |
| Row count baseline | `row_count` threshold |
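The mapping above is mechanical enough to sketch as a rule-based pass over per-column profiling stats. This is a minimal illustration, not the platform's implementation: the field names (`null_count`, `unique_count`, `top_values`, `min`, `max`) are assumptions about the shape of `sample_profile`.

```python
# Sketch only: maps assumed sample_profile fields to candidate test specs.
def suggest_tests(column: dict, row_count: int) -> list[dict]:
    """Return candidate test specs for one profiled column."""
    tests = []
    if column.get("null_count") == 0:
        tests.append({"test": "not_null"})
    unique_count = column.get("unique_count", 0)
    if unique_count == row_count:
        tests.append({"test": "unique"})
    if 0 < unique_count < 20:
        # top_values assumed to be (value, frequency) pairs
        values = [v for v, _ in column.get("top_values") or []]
        tests.append({"test": "accepted_values", "values": values})
    if column.get("min") is not None and column.get("max") is not None:
        tests.append({"test": "range_check",
                      "min": column["min"], "max": column["max"]})
    return tests
```

The LLM pass would then layer semantic suggestions (email patterns, PII flags) on top of these deterministic rules.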
Output Formats
- dbt schema.yml: native dbt test syntax with `tests:` blocks
- dbt custom SQL tests: `tests/` directory SQL files for complex checks
- Great Expectations: JSON suite with expectation configs
- Soda Checks: YAML check definitions for Soda Core
- Raw SQL: Standalone SQL assertions (SELECT COUNT(*) WHERE violation)
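As a concrete example of the dbt output, here is a sketch of a renderer that turns generated test specs into a schema.yml fragment. The test-spec dict shape is an assumption carried over for illustration; `not_null`, `unique`, and `accepted_values` are real dbt built-in generic tests.

```python
# Sketch only: renders an assumed test-spec shape as dbt schema.yml text.
def render_dbt_schema(model: str, column_tests: dict[str, list[dict]]) -> str:
    lines = ["version: 2", "models:", f"  - name: {model}", "    columns:"]
    for col, tests in column_tests.items():
        lines.append(f"      - name: {col}")
        lines.append("        tests:")
        for t in tests:
            if t["test"] == "accepted_values":
                values = ", ".join(repr(v) for v in t["values"])
                lines.append("          - accepted_values:")
                lines.append(f"              values: [{values}]")
            else:
                lines.append(f"          - {t['test']}")
    return "\n".join(lines)
```

The Great Expectations and Soda renderers would be parallel functions over the same specs, which keeps the rule/LLM generation step format-agnostic.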
How It Works
- User selects source and output format
- Platform reads `sample_profile` and relationship metadata
- LLM enriches with context-aware suggestions (e.g., email pattern detection, PII flagging)
- User reviews and adjusts suggested tests
- Export as copy-friendly text or downloadable config files
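For the raw SQL export in the flow above, each suggested test can compile to a standalone assertion that counts violating rows, with zero meaning the test passes. A minimal sketch, again assuming the illustrative test-spec shape:

```python
# Sketch only: compiles an assumed test spec into a violation-count query.
def to_sql_assertion(table: str, column: str, test: dict) -> str:
    if test["test"] == "not_null":
        cond = f"{column} IS NULL"
    elif test["test"] == "range_check":
        cond = f"{column} NOT BETWEEN {test['min']} AND {test['max']}"
    elif test["test"] == "accepted_values":
        allowed = ", ".join(repr(v) for v in test["values"])
        cond = f"{column} NOT IN ({allowed})"
    else:
        raise ValueError(f"no SQL template for {test['test']}")
    return f"SELECT COUNT(*) AS violations FROM {table} WHERE {cond}"
```

Because the output is plain text, the same specs can be exported in any of the supported formats without the platform ever executing a query.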
Technical Notes
- Input: `Source.metadata_.sample_profile` (nulls, types, numeric stats, top_values) from `crud.py` lines 66-94
- Relationship tests: use `suggest_source_relationships()` from `sql_utils.py` for FK-based tests
- LLM enhancement: Beyond rule-based generation, LLM can suggest semantic tests (e.g., "this looks like an email column — add format validation")
- New endpoint: `POST /api/sources/{id}/generate-quality-tests` with params: `format` (dbt/ge/soda/sql), `strictness` (lenient/moderate/strict)
- No execution: Tests are text output — the platform does not run them
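One way the `strictness` parameter could behave is as a threshold on how cleanly the sample must satisfy a signal before the corresponding test is emitted. The thresholds and semantics below are illustrative assumptions, not part of the spec:

```python
# Sketch only: illustrative strictness thresholds, not the spec'd behavior.
STRICTNESS = {
    "lenient": 1.00,   # emit only tests the sample satisfies perfectly
    "moderate": 0.99,  # tolerate ~1% sample noise
    "strict": 0.95,    # emit aggressively; user prunes false positives
}

def emit_not_null(null_count: int, row_count: int, strictness: str) -> bool:
    """Emit a not_null test when the non-null fraction meets the threshold."""
    non_null_fraction = 1 - null_count / row_count
    return non_null_fraction >= STRICTNESS[strictness]
```

Under this reading, stricter settings propose more tests from noisier signals, relying on the review step to discard false positives.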
Acceptance Criteria