A Polars-optimized data validation library with Pydantic schemas, inspired by packages like Pandera, Patito, and Dataframely.
- Flexible validation profiles: Switch between strict validation and lenient sanitization
- Observable outcomes: Comprehensive validation reports track what was fixed, what failed, and what passed
- Two orthogonal controls: Separate `nullable` (data quality) from `on_failure` (processing behavior)
- Auto-enforced nullability: `nullable=False` is now automatically enforced
- Predictable pipeline: Fixed order ensures checks run on final dtypes
```shell
pip install nyctea
```

```python
import polars as pl
from nyctea import validate, SchemaModel
from nyctea.functions import FunctionRegistry

# Define schema
schema = SchemaModel.from_dict({
    "columns": {
        "age": {
            "dtype": "Int64",
            "nullable": False,  # Auto-enforced!
            "checks": [{"name": "positive"}]
        },
        "name": {
            "dtype": "Utf8",
            "nullable": False
        }
    }
})

# Create function registry
registry = FunctionRegistry()

@registry.column_check(name="positive")
def positive(col: pl.Expr) -> pl.Expr:
    return col.gt(0)

# Validate data
df = pl.DataFrame({
    "age": [25, -5, 30],
    "name": ["Alice", "Bob", None]
})
result = validate(df, schema, registry)

# Check results
print(result.errors)            # Shows validation failures
print(result.report.summary())  # Human-readable summary
```

Nyctea uses a fixed, predictable pipeline order:
1. Column resolution (synonym mapping)
2. Count original nulls
3. Frame parsers (DataFrame-level transformations)
4. Column parsers (string transformations)
5. COERCE (dtype casting) ← moved before checks for cleaner semantics
6. Frame checks
7. Column checks (validation on the coerced dtype)
   - Auto-injects a `non_null` check if `nullable=False`
8. Build error report (captures ALL failures)
9. NULLIFY failures where `on_failure="null"`
10. Final nullable check (safety assertion)
11. Build validation report
12. Return (data + errors + report)
Key insight: Coercion happens BEFORE checks, so checks run on the final dtype (e.g., `age > 0` is checked on Int64, not on a string).
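The coerce-before-check ordering can be pictured with a small plain-Python sketch (no Nyctea or Polars involved; `coerce_int` and `check_positive` are hypothetical stand-ins for pipeline steps 5 and 7):

```python
def coerce_int(values):
    """Step 5 (COERCE): cast raw strings to integers; failed casts become None."""
    out = []
    for v in values:
        try:
            out.append(int(v))
        except (TypeError, ValueError):
            out.append(None)
    return out

def check_positive(values):
    """Step 7 (column checks): the check sees coerced integers, never raw strings."""
    return [v is not None and v > 0 for v in values]

raw = ["25", "-5", "oops"]
coerced = coerce_int(raw)         # [25, -5, None]
passed = check_positive(coerced)  # [True, False, False]
```

Because coercion runs first, the `positive` check never has to reason about string comparison semantics.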
Nyctea supports three validation profiles:
All failures raise errors. This is the traditional validation behavior.
```yaml
profile: strict  # Can be omitted; it's the default
columns:
  age:
    dtype: Int64
    nullable: false  # Now actually enforced!
    checks:
      - name: positive
```

Lenient mode: nullify failures for nullable columns, raise for non-nullable.
```yaml
profile: clean
columns:
  age:
    dtype: Int64
    nullable: true  # Failures → null
    checks:
      - name: positive
  id:
    dtype: Utf8
    nullable: false  # Failures → error (even in clean mode)
    checks:
      - name: non_empty
```

Like strict, but with enhanced reporting (for tracking data quality issues).
```yaml
profile: audit
columns:
  age:
    dtype: Int64
    nullable: false
    checks:
      - name: positive
```

Override profile defaults on a per-column basis using `on_failure`:
```yaml
profile: strict  # Default is strict
columns:
  age:
    dtype: Int64
    nullable: true
    on_failure: "null"  # Override: lenient for this column only
    checks:
      - name: positive
  name:
    dtype: Utf8
    nullable: false  # Follows profile: strict
```

- `on_failure: "raise"` - Stop validation and report errors (default for strict)
- `on_failure: "null"` - Set failing values to null and continue (requires `nullable: true`)
- `on_failure: null` - Inherit from the schema profile
Important: You cannot combine `on_failure: "null"` with `nullable: false`; this is rejected at schema parse time.
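A minimal sketch of that parse-time guard (a hypothetical helper, not Nyctea's actual internals):

```python
def check_failure_policy(nullable: bool, on_failure):
    """Reject the contradictory combination at schema parse time:
    nullifying failures on a non-nullable column would immediately
    violate nullable=False at the final nullable check."""
    if on_failure == "null" and not nullable:
        raise ValueError('on_failure: "null" requires nullable: true')
    return on_failure
```

Failing fast here means the contradiction surfaces when the schema is built, not halfway through a validation run.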
Every validation returns both an error DataFrame and a comprehensive report:
```python
result = validate(df, schema, registry)

# Traditional error DataFrame (backward compatible)
print(result.errors)
# ┌────────┬──────────┬───────┐
# │ column │ check    │ count │
# │ ---    │ ---      │ ---   │
# │ str    │ str      │ u32   │
# ╞════════╪══════════╪═══════╡
# │ age    │ positive │ 5     │
# └────────┴──────────┴───────┘

# New validation report
print(result.report.summary())
# Validation Report (Profile: clean)
# Rows: 95/100 valid (95.0%)
#
# Column Issues:
#   age:
#     Check failures: 5
#     Nullified: 5
#     Final nulls: 5

# Per-column stats
for col, stats in result.report.columns.items():
    print(f"{col}:")
    print(f"  Coercion failures: {stats.coercion_failures}")
    print(f"  Check failures: {stats.check_failures}")
    print(f"  Nullified: {stats.nullified}")
    print(f"  Final nulls: {stats.final_null_count}")
```

```python
schema = SchemaModel.from_dict({
    "profile": "strict",
    "columns": {
        "patient_id": {
            "dtype": "Utf8",
            "nullable": False,
            "checks": [{"name": "unique"}]
        },
        "age": {
            "dtype": "Int64",
            "nullable": False,
            "checks": [{"name": "between", "args": {"min": 0, "max": 120}}]
        }
    }
})

# This will raise on any validation failure
result = validate(df, schema, registry)
```

```python
schema = SchemaModel.from_dict({
    "profile": "clean",
    "columns": {
        "optional_age": {
            "dtype": "Int64",
            "nullable": True,  # Required for lenient mode
            "checks": [{"name": "positive"}]
        },
        "required_id": {
            "dtype": "Utf8",
            "nullable": False  # Still strict for this column
        }
    }
})

result = validate(df, schema, registry)
# Negative ages → null (lenient)
# Missing IDs → error (strict)
print(result.report.summary())
```

```python
schema = SchemaModel.from_dict({
    "profile": "strict",  # Default is strict
    "columns": {
        "sensor_reading": {
            "dtype": "Float64",
            "nullable": True,
            "on_failure": "null",  # Override: lenient
            "checks": [{"name": "in_range", "args": {"min": -40, "max": 85}}]
        },
        "timestamp": {
            "dtype": "Datetime",
            "nullable": False  # Follows profile: strict
        }
    }
})

# Out-of-range sensor readings → null
# Invalid timestamps → error
result = validate(df, schema, registry)
```

```python
# Strict coercion: raise on failures
result = validate(df, schema, registry, coerce_strategy="strict")

# Lenient coercion: nullify on failures
result = validate(df, schema, registry, coerce_strategy="null_on_failure")

# Check what failed to coerce
for col, stats in result.report.columns.items():
    if stats.coercion_failures > 0:
        print(f"{col}: {stats.coercion_failures} coercion failures")
```

- Default `profile="strict"` preserves existing validation behavior
- Default `on_failure=None` inherits from the profile
- Existing schemas work unchanged

Breaking change (justified):

- `nullable=False` is now automatically enforced (it was documented but not implemented)
- This closes the gap between documentation and implementation
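The auto-enforcement can be pictured as a check-injection step (a hypothetical sketch; the function name and config shape are illustrative, not Nyctea's internals):

```python
def compile_checks(column_cfg: dict) -> list:
    # Copy the user-declared checks, then auto-inject a non_null check
    # whenever the column declares nullable=False (see pipeline step 7).
    checks = list(column_cfg.get("checks", []))
    if not column_cfg.get("nullable", True):
        checks.append({"name": "non_null"})
    return checks
```

So a schema that declares `nullable: false` gets null enforcement even if the author never wrote a `non_null` check by hand.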
```python
@registry.column_parser(name="trim_whitespace")
def trim(col: pl.Expr) -> pl.Expr:
    return col.str.strip_chars()

@registry.column_parser(name="to_uppercase")
def uppercase(col: pl.Expr) -> pl.Expr:
    return col.str.to_uppercase()
```

```python
@registry.column_check(name="email_format")
def is_email(col: pl.Expr) -> pl.Expr:
    return col.str.contains(r"^[\w\.-]+@[\w\.-]+\.\w+$")

@registry.column_check(name="in_list")
def in_list(col: pl.Expr, values: list) -> pl.Expr:
    return col.is_in(values)
```

```python
@registry.frame_parser(name="deduplicate")
def dedup(lf: pl.LazyFrame) -> pl.LazyFrame:
    return lf.unique()

@registry.frame_check(name="min_rows")
def min_rows(lf: pl.LazyFrame, count: int) -> pl.LazyFrame:
    if lf.collect().height < count:
        raise ValueError(f"Frame has fewer than {count} rows")
    return lf
```

Control the error detail level:
```python
from nyctea.engine import ErrorReportConfig

# Summary mode (default): just counts
config = ErrorReportConfig(mode="summary")
result = validate(df, schema, registry, error_report=config)

# Rows mode: counts + row indices
config = ErrorReportConfig(mode="rows", limit=100)
result = validate(df, schema, registry, error_report=config)

# Cells mode: individual failing cells with their values
config = ErrorReportConfig(mode="cells", include_values=True, limit=10)
result = validate(df, schema, registry, error_report=config)
```

- Two orthogonal knobs: `nullable` (data quality requirement) ⊥ `on_failure` (processing behavior)
- Profile-based defaults: Common patterns get names (`strict`/`clean`/`audit`)
- Column overrides profile: Explicit `on_failure` at the column level wins
- Observable outcomes: Track what was nullified, what failed, what passed
- Predictable pipeline: Fixed order, coercion before checks
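The "column overrides profile" precedence reduces to a one-liner. The mapping below is an assumption for illustration (the profile descriptions above suggest `clean` leans toward nullifying while `strict` and `audit` raise), and the helper is hypothetical, not Nyctea's API:

```python
# Assumed profile defaults, for illustration only.
PROFILE_DEFAULTS = {"strict": "raise", "clean": "null", "audit": "raise"}

def effective_on_failure(profile: str, column_on_failure=None) -> str:
    # An explicit column-level on_failure wins; None inherits the profile default.
    if column_on_failure is not None:
        return column_on_failure
    return PROFILE_DEFAULTS[profile]
```

Keeping the resolution this simple is what makes per-column behavior predictable from the schema alone.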
MIT
Contributions are welcome! Please open an issue or pull request.