
Data Quality Test Generator (dbt, Great Expectations, Soda, SQL) #110

@Empreiteiro

Description

Summary

Automatically generate data quality test configurations from source profiling data. Output tests in multiple formats: dbt schema.yml tests, Great Expectations suites, Soda checks, or raw SQL assertions. Users copy these into their existing testing frameworks.

Problem

  • The platform computes rich profiling (nulls, ranges, unique counts, top values, distributions) but this data is only used for LLM context — never for test generation
  • Writing data quality tests manually is tedious and error-prone
  • Teams often skip quality testing because the setup cost is too high
  • Different teams use different testing frameworks (dbt, Great Expectations, Soda) — each has its own config syntax

Proposed Solution

Test Types Generated

| Profiling Signal | Generated Test |
| --- | --- |
| 0 nulls in column | not_null |
| Unique count = row count | unique |
| Low cardinality (< 20 values) | accepted_values with value list |
| Numeric range (min/max) | range_check (value between min and max) |
| All values match pattern | regex_match (email, phone, URL) |
| FK candidate detected | relationships test |
| Date column present | freshness check |
| Row count baseline | row_count threshold |
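The mapping above can be sketched as a rule-based generator. This is a minimal illustration only: the profile dict keys used here (row_count, null_count, unique_count, top_values, min, max) are assumptions, not necessarily the platform's actual sample_profile schema.

```python
def suggest_tests(column: str, profile: dict) -> list[dict]:
    """Map profiling signals for one column to candidate test specs.

    `profile` is a hypothetical per-column profile dict; keys are assumed
    here for illustration and may differ from sample_profile.
    """
    tests = []
    rows = profile["row_count"]

    # 0 nulls observed -> not_null
    if profile.get("null_count") == 0:
        tests.append({"column": column, "test": "not_null"})

    # Every value distinct -> unique
    if rows > 0 and profile.get("unique_count") == rows:
        tests.append({"column": column, "test": "unique"})

    # Low cardinality (< 20 values) -> accepted_values with the observed list
    top_values = profile.get("top_values") or []
    if 0 < len(top_values) < 20 and profile.get("unique_count", rows) < 20:
        tests.append({"column": column, "test": "accepted_values",
                      "values": [v for v, _count in top_values]})

    # Numeric min/max present -> range check
    if profile.get("min") is not None and profile.get("max") is not None:
        tests.append({"column": column, "test": "range_check",
                      "min": profile["min"], "max": profile["max"]})

    return tests
```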

Output Formats

  1. dbt schema.yml: Native dbt test syntax with tests: blocks
  2. dbt custom SQL tests: tests/ directory SQL files for complex checks
  3. Great Expectations: JSON suite with expectation configs
  4. Soda Checks: YAML check definitions for Soda Core
  5. Raw SQL: Standalone SQL assertions (a SELECT COUNT(*) with the violation condition in the WHERE clause; zero rows means pass)
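As an illustration of output format 1, a generated dbt schema.yml could look like the following. The model and column names are hypothetical; the test syntax is standard dbt:

```yaml
version: 2

models:
  - name: customers        # hypothetical model name
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'banned']
```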

How It Works

  1. User selects source and output format
  2. Platform reads sample_profile and relationship metadata
  3. LLM enriches with context-aware suggestions (e.g., email pattern detection, PII flagging)
  4. User reviews and adjusts suggested tests
  5. Export as copy-friendly text or downloadable config files
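For the raw-SQL output, the exported artifact would be a standalone assertion query that counts violating rows, with zero meaning pass. Table name and pattern below are illustrative, not generated by the platform:

```sql
-- Assertion: customers.email is non-null and roughly email-shaped.
-- The test passes when this query returns a count of 0.
SELECT COUNT(*) AS violations
FROM customers
WHERE email IS NULL
   OR email NOT LIKE '%_@_%._%';
```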

Technical Notes

  • Input: Source.metadata_.sample_profile (nulls, types, numeric stats, top_values) from crud.py lines 66-94
  • Relationship tests: Use suggest_source_relationships() from sql_utils.py for FK-based tests
  • LLM enhancement: Beyond rule-based generation, LLM can suggest semantic tests (e.g., "this looks like an email column — add format validation")
  • New endpoint: POST /api/sources/{id}/generate-quality-tests with params: format (dbt/ge/soda/sql), strictness (lenient/moderate/strict)
  • No execution: Tests are text output — the platform does not run them
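A minimal sketch of parameter validation for the proposed endpoint (POST /api/sources/{id}/generate-quality-tests). The helper name and error messages are hypothetical; only the allowed parameter values come from the notes above:

```python
# Allowed values per the endpoint params described above.
VALID_FORMATS = {"dbt", "ge", "soda", "sql"}
VALID_STRICTNESS = {"lenient", "moderate", "strict"}


def validate_generation_params(fmt: str, strictness: str) -> dict:
    """Reject unknown format/strictness values before generation starts.

    Hypothetical helper: the real endpoint might validate via its web
    framework's request schema instead.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}")
    if strictness not in VALID_STRICTNESS:
        raise ValueError(f"strictness must be one of {sorted(VALID_STRICTNESS)}")
    return {"format": fmt, "strictness": strictness}
```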

Acceptance Criteria

  • Auto-detect not_null, unique, accepted_values, range checks from profiling
  • Generate tests in at least 3 formats (dbt, Great Expectations, Soda)
  • LLM suggests semantic tests beyond rule-based (email format, date patterns, PII)
  • Relationship/FK tests generated from detected cross-source relationships
  • Strictness levels: lenient (critical only), moderate, strict (all possible tests)
  • Export as individual files or bundled config
  • Preview mode: show which tests would be generated before exporting
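The strictness levels could be implemented as a filter over the full set of rule-generated tests. Which test types count as "critical" is an assumption here, not a decided policy:

```python
# Assumed tiering: which tests are "critical" is illustrative only.
CRITICAL = {"not_null", "unique", "relationships"}
MODERATE_EXTRA = {"accepted_values", "range_check"}


def filter_by_strictness(tests: list[dict], strictness: str) -> list[dict]:
    """Keep only the tests allowed at the given strictness level.

    lenient  -> critical tests only
    moderate -> critical plus value/range checks
    strict   -> everything the rules produced
    """
    if strictness == "strict":
        return tests
    allowed = CRITICAL if strictness == "lenient" else CRITICAL | MODERATE_EXTRA
    return [t for t in tests if t["test"] in allowed]
```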

Metadata

Labels: enhancement (New feature or request)