Nyctea

A Polars-optimized data validation library with Pydantic-based schemas, inspired by packages such as Pandera, Patito, and Dataframely.

Features

  • Flexible validation profiles: Switch between strict validation and lenient sanitization
  • Observable outcomes: Comprehensive validation reports track what was fixed, what failed, and what passed
  • Two orthogonal controls: Separate nullable (data quality) from on_failure (processing behavior)
  • Auto-enforced nullability: nullable=False is now automatically enforced
  • Predictable pipeline: Fixed order ensures checks run on final dtypes

Installation

pip install nyctea

Quick Start

Basic Validation (Strict Mode)

import polars as pl
from nyctea import validate, SchemaModel
from nyctea.functions import FunctionRegistry

# Define schema
schema = SchemaModel.from_dict({
    "columns": {
        "age": {
            "dtype": "Int64",
            "nullable": False,  # Auto-enforced!
            "checks": [{"name": "positive"}]
        },
        "name": {
            "dtype": "Utf8",
            "nullable": False
        }
    }
})

# Create function registry
registry = FunctionRegistry()

@registry.column_check(name="positive")
def positive(col: pl.Expr) -> pl.Expr:
    return col.gt(0)

# Validate data
df = pl.DataFrame({
    "age": [25, -5, 30],
    "name": ["Alice", "Bob", None]
})

result = validate(df, schema, registry)

# Check results
print(result.errors)  # Shows validation failures
print(result.report.summary())  # Human-readable summary

Validation Pipeline

Nyctea uses a fixed, predictable pipeline order:

1. Column resolution (synonym mapping)
2. Count original nulls
3. Frame parsers (DataFrame-level transformations)
4. Column parsers (string transformations)
5. COERCE (dtype casting) - runs before checks so they see final dtypes
6. Frame checks
7. Column checks (validation on coerced dtype)
   - Auto-inject non_null check if nullable=False
8. Build error report (captures ALL failures)
9. NULLIFY failures where on_failure="null"
10. Final nullable check (safety assertion)
11. Build validation report
12. Return (data + errors + report)

Key insight: Coercion happens BEFORE checks, so checks run on the final dtype (e.g., check age > 0 on Int64, not on string).

Validation Profiles

Nyctea supports three validation profiles:

Strict Profile (default)

All failures raise errors. This is the traditional validation behavior.

profile: strict  # Can omit, it's the default
columns:
  age:
    dtype: Int64
    nullable: false  # Now actually enforced!
    checks:
      - name: positive
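Assuming the YAML form maps one-to-one onto the dict form accepted by `SchemaModel.from_dict` (shown in the Quick Start), a YAML profile can be loaded with PyYAML:

```python
import yaml

yaml_text = """
profile: strict
columns:
  age:
    dtype: Int64
    nullable: false
    checks:
      - name: positive
"""
schema_dict = yaml.safe_load(yaml_text)
# schema = SchemaModel.from_dict(schema_dict)  # hypothetical: assumes YAML mirrors the dict schema
```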

Clean Profile

Lenient mode: nullify failures for nullable columns, raise for non-nullable.

profile: clean
columns:
  age:
    dtype: Int64
    nullable: true  # Failures → null
    checks:
      - name: positive
  id:
    dtype: Utf8
    nullable: false  # Failures → error (even in clean mode)
    checks:
      - name: non_empty

Audit Profile

Like strict, but with enhanced reporting (for tracking data quality issues).

profile: audit
columns:
  age:
    dtype: Int64
    nullable: false
    checks:
      - name: positive

Column-Level Control

Override profile defaults on a per-column basis using on_failure:

profile: strict  # Default is strict
columns:
  age:
    dtype: Int64
    nullable: true
    on_failure: "null"  # Override: lenient for this column only (note the quotes)
    checks:
      - name: positive
  name:
    dtype: Utf8
    nullable: false  # Follows profile: strict

Configuration Rules

  • on_failure: "raise" - Stop validation and report errors (default for strict)
  • on_failure: "null" - Set failing values to null and continue (requires nullable: true)
  • on_failure: null - Inherit from schema profile

Important: You cannot have on_failure: "null" with nullable: false - this is validated at schema parse time.

Validation Reports

Every validation returns both an error DataFrame and a comprehensive report:

result = validate(df, schema, registry)

# Traditional error DataFrame (backward compatible)
print(result.errors)
# ┌──────────┬───────────┬───────┐
# │ column   │ check     │ count │
# │ ---      │ ---       │ ---   │
# │ str      │ str       │ u32   │
# ╞══════════╪═══════════╪═══════╡
# │ age      │ positive  │ 5     │
# └──────────┴───────────┴───────┘

# New validation report
print(result.report.summary())
# Validation Report (Profile: clean)
# Rows: 95/100 valid (95.0%)
#
# Column Issues:
#   age:
#     Check failures: 5
#     Nullified: 5
#     Final nulls: 5

# Per-column stats
for col, stats in result.report.columns.items():
    print(f"{col}:")
    print(f"  Coercion failures: {stats.coercion_failures}")
    print(f"  Check failures: {stats.check_failures}")
    print(f"  Nullified: {stats.nullified}")
    print(f"  Final nulls: {stats.final_null_count}")

Examples

Example 1: Strict Validation

schema = SchemaModel.from_dict({
    "profile": "strict",
    "columns": {
        "patient_id": {
            "dtype": "Utf8",
            "nullable": False,
            "checks": [{"name": "unique"}]
        },
        "age": {
            "dtype": "Int64",
            "nullable": False,
            "checks": [{"name": "between", "args": {"min": 0, "max": 120}}]
        }
    }
})

# This will raise on any validation failures
result = validate(df, schema, registry)

Example 2: Lenient Sanitization

schema = SchemaModel.from_dict({
    "profile": "clean",
    "columns": {
        "optional_age": {
            "dtype": "Int64",
            "nullable": True,  # Required for lenient mode
            "checks": [{"name": "positive"}]
        },
        "required_id": {
            "dtype": "Utf8",
            "nullable": False,  # Still strict for this column
        }
    }
})

result = validate(df, schema, registry)

# Negative ages → null (lenient)
# Missing IDs → error (strict)
print(result.report.summary())

Example 3: Mixed Mode

schema = SchemaModel.from_dict({
    "profile": "strict",  # Default is strict
    "columns": {
        "sensor_reading": {
            "dtype": "Float64",
            "nullable": True,
            "on_failure": "null",  # Override: lenient
            "checks": [{"name": "in_range", "args": {"min": -40, "max": 85}}]
        },
        "timestamp": {
            "dtype": "Datetime",
            "nullable": False,  # Follows profile: strict
        }
    }
})

# Out-of-range sensor readings → null
# Invalid timestamps → error
result = validate(df, schema, registry)

Example 4: Coercion Strategy

# Strict coercion: raise on failures
result = validate(df, schema, registry, coerce_strategy="strict")

# Lenient coercion: nullify on failures
result = validate(df, schema, registry, coerce_strategy="null_on_failure")

# Check what failed to coerce
for col, stats in result.report.columns.items():
    if stats.coercion_failures > 0:
        print(f"{col}: {stats.coercion_failures} coercion failures")

Backward Compatibility

  • Default profile="strict" preserves existing validation behavior
  • Default on_failure=None inherits from profile
  • Existing schemas work unchanged

Breaking change (justified):

  • nullable=False is now automatically enforced (was documented but not implemented)
  • This closes a gap between documentation and implementation

Advanced Usage

Custom Parsers

@registry.column_parser(name="trim_whitespace")
def trim(col: pl.Expr) -> pl.Expr:
    return col.str.strip_chars()

@registry.column_parser(name="to_uppercase")
def uppercase(col: pl.Expr) -> pl.Expr:
    return col.str.to_uppercase()

Custom Checks

@registry.column_check(name="email_format")
def is_email(col: pl.Expr) -> pl.Expr:
    return col.str.contains(r"^[\w\.-]+@[\w\.-]+\.\w+$")

@registry.column_check(name="in_list")
def in_list(col: pl.Expr, values: list) -> pl.Expr:
    return col.is_in(values)

Frame-Level Operations

@registry.frame_parser(name="deduplicate")
def dedup(lf: pl.LazyFrame) -> pl.LazyFrame:
    return lf.unique()

@registry.frame_check(name="min_rows")
def min_rows(lf: pl.LazyFrame, count: int) -> pl.LazyFrame:
    # Count rows without materializing the whole frame
    n_rows = lf.select(pl.len()).collect().item()
    if n_rows < count:
        raise ValueError(f"Frame has fewer than {count} rows ({n_rows})")
    return lf

Error Reporting Modes

Control error detail level:

from nyctea.engine import ErrorReportConfig

# Summary mode (default) - just counts
config = ErrorReportConfig(mode="summary")
result = validate(df, schema, registry, error_report=config)

# Rows mode - counts + row indices
config = ErrorReportConfig(mode="rows", limit=100)
result = validate(df, schema, registry, error_report=config)

# Cells mode - individual failing cells with values
config = ErrorReportConfig(mode="cells", include_values=True, limit=10)
result = validate(df, schema, registry, error_report=config)

Design Principles

  1. Two orthogonal knobs: nullable (data quality requirement) ⊥ on_failure (processing behavior)
  2. Profile-based defaults: Common patterns get names (strict/clean/audit)
  3. Column overrides profile: Explicit on_failure at column level wins
  4. Observable outcomes: Track what was nullified, what failed, what passed
  5. Predictable pipeline: Fixed order, coercion before checks

License

MIT

Contributing

Contributions are welcome! Please open an issue or pull request.
