A practical tool for checking data quality and validating type conversions.
- Overview
- Installation
- Quick Start
- Data Sources
- Command Overview
- Using check command
- Using schema command
- Use Cases
- Case 1: Customer Data Quality Check
- Case 2: E-commerce Order Validation
- Case 3: Excel Financial Report Validation
- Case 4: Pre-migration Data Validation
- Case 5: Legacy System Data Cleanup
- Case 6: API Data Interface Validation
- Case 7: Batch File Validation
- Case 8: Data Validation in CI/CD
- Case 9: Data Science Preprocessing Validation
- Output and Results
- Configuration
- Troubleshooting
ValidateLite is a Python command-line tool designed for data quality validation. It provides two main validation approaches:
Quick validation with vlite check
- Perfect for ad-hoc data checks and exploration
- Single rule validation with immediate feedback
- Great for debugging and development
Schema-based validation with vlite schema
- Comprehensive validation using JSON schema files
- Batch processing for multiple rules and tables
- Features the powerful Desired Type functionality for type conversion validation
What makes ValidateLite special?
The standout feature is Desired Type validation - it doesn't just check if your data fits a schema, it tells you whether your data can be safely converted to a different type. This is invaluable for:
- Data migration planning
- System upgrades
- ETL process validation
- Data quality assessment before transformations
Supported data sources:
- Files: CSV, Excel, JSON
- Databases: MySQL, PostgreSQL, SQLite
Install from PyPI:
pip install validatelite

Or install from source:
git clone https://github.com/litedatum/validatelite.git
cd validatelite
pip install -e .

Verify the installation:
vlite --version

ValidateLite works with:
- Python 3.8+
- pandas (for Excel/CSV processing)
- SQLAlchemy (for database connections)
- Click (for CLI interface)
Database drivers are optional:
- MySQL: `pip install pymysql`
- PostgreSQL: `pip install psycopg2-binary`
- SQLite: built into Python
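If you're not sure which optional drivers are already installed in your environment, a quick way to check from Python (a small sketch, not part of ValidateLite itself):

```python
import importlib.util

def available_drivers():
    """Report which optional database drivers are importable."""
    optional = {"mysql": "pymysql", "postgresql": "psycopg2"}
    # sqlite3 ships with CPython, so it is always available
    found = {"sqlite": True}
    for name, module in optional.items():
        found[name] = importlib.util.find_spec(module) is not None
    return found

print(available_drivers())
```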
Here are some simple examples to get you started:
# Check for missing email addresses
vlite check --conn customers.csv --table customers --rule "not_null(email)"

# Run several checks at once
vlite check --conn data.csv --table data \
--rule "not_null(id)" \
--rule "unique(email)" \
--rule "range(age, 18, 99)"

# Check if string data can be converted to proper types
vlite schema --conn messy_data.csv --rules cleanup_schema.json

Sample schema file (cleanup_schema.json):
{
"rules": [
{
"field": "user_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "salary",
"type": "string",
"desired_type": "float(10,2)",
"required": true
}
]
}

This will tell you exactly which records can't be converted from string to integer/float.
ValidateLite connects to various data sources with a simple connection string approach.
CSV Files:
--conn data.csv
--conn /path/to/data.csv
--conn file://data.csv

Excel Files:
--conn report.xlsx
--conn /path/to/report.xlsx
# For multi-sheet Excel files, specify the sheet
--conn report.xlsx --table "Sheet1"

JSON Files:
--conn data.json
--conn /path/to/data.json

MySQL:
--conn "mysql://username:password@host:port/database"
--conn "mysql://user:pass@localhost:3306/sales"PostgreSQL:
--conn "postgresql://username:password@host:port/database"
--conn "postgres://user:pass@localhost:5432/analytics"SQLite:
--conn "sqlite:///path/to/database.db"
--conn "sqlite:///data/local.db"Keep sensitive connection details out of your commands:
# Set environment variables
export DB_HOST="localhost"
export DB_USER="analyst"
export DB_PASSWORD="secret123"
export DB_NAME="sales"
# Build connection string
export MYSQL_URL="mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/${DB_NAME}"
# Use in commands
vlite check --conn "$MYSQL_URL" --table users --rule "not_null(email)"ValidateLite offers two commands for different validation needs.
Quick data quality checks for immediate feedback:
vlite check --conn <data_source> --table <table_name> --rule "<rule>" [options]

Key features:
- Instant validation without config files
- Flexible inline rule definitions
- Fast feedback for development and debugging
- One rule at a time execution
Best for:
- Development phase testing
- Data exploration and analysis
- Quick data quality checks
- Debugging and troubleshooting
Comprehensive validation using schema files:
# Single table validation
vlite schema --conn <data_source> --table <table_name> --rules <schema.json> [options]
# Multi-table validation (tables defined in schema)
vlite schema --conn <data_source> --rules <schema.json> [options]

Key features:
- Schema-driven with JSON schema files
- Batch validation for multiple tables and rules
- Type conversion analysis with Desired Type functionality
- Structured configuration for reuse and version control
Best for:
- Production data quality monitoring
- Pre-migration data validation
- ETL pipeline data validation
- Automated testing in CI/CD
Schema file syntax differences:
When using the --table parameter, your schema should contain field-level rules:
{
"rules": [
{
"field": "email",
"type": "string(255)",
"desired_type": "string(100)",
"required": true
}
]
}

When not using the --table parameter, your schema should contain table-level definitions:
{
"tables": [
{
"name": "users",
"fields": [
{
"field": "email",
"type": "string(255)",
"desired_type": "string(100)",
"required": true
}
]
}
]
}

ValidateLite provides validation rules covering the most common aspects of data quality.
| Category | Rule Type | Purpose |
|---|---|---|
| Completeness | NOT_NULL | Check for missing values |
| Uniqueness | UNIQUE | Find duplicate values |
| Format | REGEX | Validate patterns |
| Format | DATE_FORMAT | Check date formats |
| Value | ENUM | Validate against allowed values |
| Range | RANGE | Check numeric ranges |
Check for missing values:
# Basic not-null check
--rule "not_null(email)"
# With custom message
--rule "not_null(customer_id, 'Customer ID is required')"
# Check multiple columns
--rule "not_null(first_name)"
--rule "not_null(last_name)"
--rule "not_null(email)"Find duplicate records:
# Check for duplicate emails
--rule "unique(email)"
# Check for duplicate combinations
--rule "unique(first_name, last_name, birth_date)"
# Check with filter conditions
--rule "unique(username) WHERE status = 'active'"REGEX pattern validation:
# Email format validation
--rule "regex(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')"
# Phone number format
--rule "regex(phone, '^\\+?1?[0-9]{10,14}$')"
# Product code format
--rule "regex(product_code, '^[A-Z]{2}[0-9]{4}$')"DATE_FORMAT validation:
# Basic syntax
--rule "date_format(column_name, 'format_pattern')"Supported date format patterns:
| Pattern | Example | Description |
|---|---|---|
| `YYYY-MM-DD` | 2023-12-25 | ISO date format |
| `MM/DD/YYYY` | 12/25/2023 | US date format |
| `DD/MM/YYYY` | 25/12/2023 | European date format |
| `YYYYMMDD` | 20231225 | Compact date format |
| `DD.MM.YYYY` | 25.12.2023 | German date format |
| `YYYY-MM-DD HH:MI:SS` | 2023-12-25 14:30:00 | DateTime format |
| `DD-MMM-YYYY` | 25-Dec-2023 | Month abbreviation format |
| `YYYY/MM/DD` | 2023/12/25 | Slash-separated format |
Format components:
- `YYYY` or `yyyy` - Four-digit year
- `MM` or `mm` - Two-digit month (01-12)
- `DD` or `dd` - Two-digit day (01-31)
- `HH` or `hh` - Two-digit hour (00-23)
- `MI` or `mi` - Two-digit minute (00-59)
- `SS` or `ss` - Two-digit second (00-59)
# Examples
--rule "date_format(created_at, 'YYYY-MM-DD HH:MI:SS')"
--rule "date_format(birth_date, 'MM/DD/YYYY')"
--rule "date_format(event_date, 'DD.MM.YYYY')"Database support:
- MySQL: Native support for all formats
- PostgreSQL: Uses regex pre-validation + Python verification
- SQLite: Uses custom function validation
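Conceptually, these patterns map onto the usual strptime directives. The sketch below illustrates that mapping; it is not ValidateLite's internal implementation. Note that strptime also rejects impossible dates such as February 31st:

```python
from datetime import datetime

# Illustrative mapping from the pattern tokens above to strptime directives
PATTERN_MAP = {
    "YYYY": "%Y", "MMM": "%b", "MM": "%m", "DD": "%d",
    "HH": "%H", "MI": "%M", "SS": "%S",
}

def to_strptime(pattern: str) -> str:
    """Translate a pattern like 'YYYY-MM-DD' into a strptime format string."""
    # Replace longer tokens first so 'MMM' is not consumed as 'MM' + 'M'
    for token in sorted(PATTERN_MAP, key=len, reverse=True):
        pattern = pattern.replace(token, PATTERN_MAP[token])
    return pattern

def matches(value: str, pattern: str) -> bool:
    """True if `value` conforms to `pattern` and is a real calendar date."""
    try:
        datetime.strptime(value, to_strptime(pattern))
        return True
    except ValueError:
        return False

print(matches("2023-12-25", "YYYY-MM-DD"))  # True
print(matches("2023-02-31", "YYYY-MM-DD"))  # False: no Feb 31st
```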
ENUM (allowed values) validation:
# Status field validation
--rule "enum(status, 'active', 'inactive', 'pending')"
# Priority levels
--rule "enum(priority, 'low', 'medium', 'high', 'critical')"
# Boolean-like values
--rule "enum(is_verified, 'true', 'false', '1', '0')"Numeric range validation:
# Age validation
--rule "range(age, 0, 120)"
# Price validation with decimals
--rule "range(price, 0.01, 999999.99)"
# Percentage validation
--rule "range(completion_rate, 0.0, 100.0)"
# Year validation
--rule "range(birth_year, 1900, 2024)"For complex validation scenarios, use JSON rule files:
Basic rule file (validation_rules.json):
{
"rules": [
{
"name": "email_required",
"type": "NOT_NULL",
"target": {
"database": "sales_db",
"table": "customers",
"column": "email"
},
"severity": "HIGH"
},
{
"name": "unique_customer_email",
"type": "UNIQUE",
"target": {
"database": "sales_db",
"table": "customers",
"column": "email"
},
"severity": "HIGH"
},
{
"name": "valid_age_range",
"type": "RANGE",
"target": {
"database": "sales_db",
"table": "customers",
"column": "age"
},
"parameters": {
"min_value": 18,
"max_value": 99
},
"severity": "MEDIUM"
}
]
}

Using rule files:
vlite check --conn "mysql://user:pass@host:3306/sales_db" \
--table customers --rules validation_rules.json

This is where ValidateLite really shines! It provides a rich type system and data conversion validation capabilities.
ValidateLite supports these fundamental data types:
| Type | Description | Examples |
|---|---|---|
| `string` | Text data | "John", "Hello World" |
| `integer` | Whole numbers | 42, -17, 0 |
| `float` | Decimal numbers | 3.14, -0.5, 100.00 |
| `boolean` | True/false values | true, false |
| `date` | Date values | 2023-12-25 |
| `datetime` | Date and time values | 2023-12-25 14:30:00 |
ValidateLite provides intuitive data type definition syntax with precise type constraints:
{
"field": "username",
"type": "string(50)", // Max length 50 characters
"required": true
}

String type definition syntax:
- `string(100)` - Max length 100 characters
- `string(10,50)` - Length between 10 and 50 characters
- `string` - No length restrictions
{
"field": "price",
"type": "float(10,2)", // Precision 10, scale 2
"required": true
}

Float type definition syntax:
- `float(10,2)` - Precision 10, scale 2 (2 decimal places)
- `float(8,3)` - Precision 8, scale 3 (3 decimal places)
- `float` - Standard float
{
"field": "created_at",
"type": "datetime('YYYY-MM-DD HH:MI:SS')", // Specific datetime format
"required": true
}

DateTime type definition syntax:
- `datetime('YYYY-MM-DD HH:MI:SS')` - Specific datetime format
- `date('YYYY-MM-DD')` - Specific date format
- `datetime` - Standard datetime format
Desired Type is ValidateLite's most valuable feature! It lets you validate whether data can be safely converted to a target type, which is crucial for data migration, system upgrades, and data cleaning scenarios.
Traditional validation just checks if data matches a schema. Desired Type goes further - it tells you if your messy string data can actually be converted to proper types like integers or dates.
Example scenario: You have a CSV file where everything is stored as strings:
user_id: "123"(should be integer)salary: "75000.50"(should be float)join_date: "2023-01-15"(should be date)
Desired Type validation will tell you exactly which records can be converted and which ones will cause problems.
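The core idea can be sketched in a few lines of plain Python: attempt the conversion for every value and record the rows that fail. This is an illustration of the concept, not ValidateLite's actual engine:

```python
def conversion_report(values, convert):
    """Try `convert` on every value and report the rows that would fail."""
    failures = []
    for row, value in enumerate(values, start=1):
        try:
            convert(value)
        except (ValueError, TypeError):
            failures.append((row, value))
    total = len(values)
    return {"total": total, "convertible": total - len(failures), "failures": failures}

# e.g. can the user_id column become an integer?
report = conversion_report(["123", "456", "abc", "789"], int)
print(report["convertible"], "of", report["total"])  # 3 of 4
print(report["failures"])                            # [(3, 'abc')]
```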
Desired Type uses the same type definition syntax for precise validation:
{
"transactions": {
"rules": [
{
"field": "amount",
"type": "string", // Current: string data
"desired_type": "float(12,2)", // Target: decimal with 12 precision, 2 scale
"required": true
},
{
"field": "transaction_date",
"type": "string", // Current: string data
"desired_type": "datetime('YYYY-MM-DD')", // Target: specific datetime format
"required": true
},
{
"field": "description",
"type": "string(500)", // Current: long strings
"desired_type": "string(200)", // Target: shorter strings
"required": true
}
]
}
}

The same type definition syntax applies when planning a migration:
{
"migration_analysis": {
"rules": [
{
"field": "legacy_id",
"type": "string(50)", // Current: string with max 50 chars
"desired_type": "integer", // Target: integer
"required": true
},
{
"field": "legacy_amount",
"type": "string", // Current: free-form string
"desired_type": "float(10,2)", // Target: precise decimal
"required": true
},
{
"field": "legacy_timestamp",
"type": "string", // Current: string timestamp
"desired_type": "datetime('YYYY-MM-DD HH:MI:SS')", // Target: structured datetime
"required": true
}
]
}
}

What you get from Desired Type validation:
- Count of records that can be converted successfully
- Count of problematic records that would fail conversion
- Sample data showing exactly what the problems are
- Conversion feasibility percentage
- Specific error patterns in your data
ValidateLite analyzes type conversion compatibility and reports three possible outcomes:
All data can be safely converted to the desired type.
Example:
Field: user_id
Current Type: string → Desired Type: integer
Result: ✅ COMPATIBLE (500/500 records can be converted)
Some data can be converted, but some records have issues.
Example:
Field: salary
Current Type: string → Desired Type: float(10,2)
Result: ⚠️ PARTIAL (487/500 records can be converted)
Issues: 13 records contain non-numeric characters
Most or all data cannot be converted to the desired type.
Example:
Field: comments
Current Type: string → Desired Type: integer
Result: ❌ INCOMPATIBLE (0/500 records can be converted)
Issues: Text data cannot be converted to integers
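Classifying a field into these three outcomes boils down to its conversion rate. A minimal sketch follows; the 50% cut-off between PARTIAL and INCOMPATIBLE is an illustrative assumption, not ValidateLite's documented threshold:

```python
def classify(convertible: int, total: int, partial_threshold: float = 0.5) -> str:
    """Map a conversion rate onto the three outcome labels.

    The 50% cut-off between PARTIAL and INCOMPATIBLE is illustrative;
    ValidateLite's own thresholds may differ.
    """
    if total == 0 or convertible == total:
        return "COMPATIBLE"
    if convertible / total >= partial_threshold:
        return "PARTIAL"
    return "INCOMPATIBLE"

print(classify(500, 500))  # COMPATIBLE
print(classify(487, 500))  # PARTIAL
print(classify(0, 500))    # INCOMPATIBLE
```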
ValidateLite uses smart conversion validation strategies:
For numeric conversions:
- Removes common formatting (spaces, commas, currency symbols)
- Handles scientific notation
- Validates decimal precision and scale
- Checks for overflow conditions
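A sketch of what such normalization can look like; the exact symbols stripped and the fallback behaviour are illustrative assumptions, not ValidateLite's internals:

```python
import re

def normalize_numeric(raw: str):
    """Strip common formatting before attempting a numeric conversion.

    Handles thousands separators, currency symbols, and surrounding
    spaces; float() itself accepts scientific notation such as "1e3".
    Returns None when nothing numeric remains.
    """
    cleaned = raw.strip().replace(",", "")
    cleaned = re.sub(r"[$€£]", "", cleaned)
    try:
        return float(cleaned)
    except ValueError:
        return None

print(normalize_numeric("$1,234.50"))  # 1234.5
print(normalize_numeric("1e3"))        # 1000.0
print(normalize_numeric("sixty"))      # None
```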
For date/datetime conversions:
- Attempts multiple common date formats
- Validates actual date values (no Feb 31st)
- Handles timezone considerations
- Checks for impossible dates
For string length validation:
- Measures actual character length
- Considers UTF-8 encoding
- Validates against target length constraints
For type compatibility:
- Checks if larger types can fit into smaller ones
- Validates precision/scale requirements for decimals
- Ensures no data loss during conversion
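For decimals, "no data loss" means a value must fit the target precision and scale exactly. A sketch using Python's decimal module; rejecting values that would need rounding is an illustrative design choice, since a real system might round instead:

```python
from decimal import Decimal, InvalidOperation

def fits_decimal(raw: str, precision: int, scale: int) -> bool:
    """True if `raw` fits float(precision, scale) without loss:
    at most (precision - scale) digits before the decimal point
    and at most `scale` digits after it."""
    try:
        value = Decimal(raw)
    except InvalidOperation:
        return False
    if not value.is_finite():  # Decimal also parses 'NaN' and 'Infinity'
        return False
    _, digits, exponent = value.as_tuple()
    frac_digits = max(-exponent, 0)              # digits after the point
    int_digits = max(len(digits) + exponent, 0)  # digits before the point
    return int_digits <= precision - scale and frac_digits <= scale

print(fits_decimal("75000.50", 10, 2))     # True: fits float(10,2)
print(fits_decimal("123.456", 10, 2))      # False: scale 3 would be truncated
print(fits_decimal("12345678901", 10, 2))  # False: too many integer digits
```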
Comprehensive validation output: When you run Desired Type validation, you get detailed information about:
- Which fields can be safely converted
- Which data needs cleaning
- Specific failure samples and suggested fixes
This section provides complete usage scenarios showcasing Desired Type functionality.
Background: You have a customer database that's been collecting data for years. Data quality has declined and you need to assess what can be cleaned up.
Dataset: Customer table with mixed data quality
customer_id,name,email,phone,age,registration_date,is_premium
1,John Smith,john@email.com,555-1234,25,2023-01-15,true
2,"Jane, Doe",jane@email.com,,35,01/15/2023,1
3,Bob Johnson,invalid-email,555-ABCD,age_unknown,2023/1/15,yes
4,"Mike Wilson",mike@email.com,5551234567,45,2023-01-15,falseQuick validation with check command:
# Check for basic data quality issues
vlite check --conn customers.csv --table customers \
--rule "not_null(customer_id)" \
--rule "unique(email)" \
--rule "regex(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')" \
--rule "not_null(phone)" \
--verbose

Schema validation for cleanup planning:
Create customer_cleanup.json:
{
"rules": [
{
"field": "customer_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "age",
"type": "string",
"desired_type": "integer",
"required": false,
"min": 18,
"max": 100
},
{
"field": "registration_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true
},
{
"field": "is_premium",
"type": "string",
"desired_type": "boolean",
"required": true
}
]
}

# Analyze what can be cleaned up
vlite schema --conn customers.csv --rules customer_cleanup.json --verbose

This tells you exactly which customer records have data quality issues and what types of problems exist.
Background: Validate daily order data before processing payments and shipments.
# Comprehensive order validation
vlite check --conn "mysql://user:pass@db:3306/ecommerce" --table orders \
--rule "not_null(order_id)" \
--rule "unique(order_id)" \
--rule "not_null(customer_id)" \
--rule "range(total_amount, 0.01, 999999.99)" \
--rule "enum(status, 'pending', 'paid', 'shipped', 'delivered', 'cancelled')" \
--rule "date_format(created_at, 'YYYY-MM-DD HH:MI:SS')" \
--verbose

Background: Monthly financial reports come in Excel format and need validation before importing into the accounting system.
Excel file structure (monthly_report.xlsx):
- Sheet: "Revenue"
- Columns: transaction_id, amount, currency, transaction_date, category
Multi-sheet validation:
First, check what sheets are available:
vlite schema --conn monthly_report.xlsx --rules basic_schema.json

Then validate specific sheets:
# Validate Revenue sheet
vlite schema --conn monthly_report.xlsx --table "Revenue" --rules revenue_schema.json
# Validate Expenses sheet
vlite schema --conn monthly_report.xlsx --table "Expenses" --rules expense_schema.jsonRevenue validation schema (revenue_schema.json):
{
"rules": [
{
"field": "transaction_id",
"type": "string",
"desired_type": "string(20)",
"required": true
},
{
"field": "amount",
"type": "string",
"desired_type": "float(15,2)",
"required": true,
"min": 0.01
},
{
"field": "transaction_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true
}
],
"strict_mode": true
}

Background: Before migrating from a legacy system to a modern database, you need to validate that all data can be properly converted and identify cleanup requirements.
Legacy system data characteristics:
- Everything stored as VARCHAR
- Inconsistent date formats
- Mixed boolean representations
- Unreliable numeric formatting
Migration readiness schema (migration_readiness.json):
{
"users": {
"rules": [
{
"field": "user_id",
"type": "string(50)",
"desired_type": "integer",
"required": true
},
{
"field": "email",
"type": "string(500)",
"desired_type": "string(255)",
"required": true
},
{
"field": "created_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')", // Target: standard date format
"required": true
},
{
"field": "last_login",
"type": "string",
"desired_type": "datetime('YYYY-MM-DD HH:MI:SS')", // Target: standard datetime
"required": false
},
{
"field": "is_active",
"type": "string",
"desired_type": "boolean",
"required": true
}
],
"strict_mode": false
}
}

# Analyze migration readiness
vlite schema --conn "mysql://legacy:pass@old-db:3306/legacy_db" \
--rules migration_readiness.json --output json > migration_report.json
# Get detailed conversion analysis
vlite schema --conn "mysql://legacy:pass@old-db:3306/legacy_db" \
--rules migration_readiness.json --verbose

Expected output:
Migration Readiness Report
==========================
Table: users
Total records: 10,543
Type conversion analysis:
┌─────────────────┬──────────┬──────────┬──────────┬─────────────────┐
│ Field │ From │ To │ Status │ Issues │
├─────────────────┼──────────┼──────────┼──────────┼─────────────────┤
│ user_id │ string │ integer │ ✅ OK │ - │
│ email │ string │ string │ ⚠️ WARN │ 12 too long │
│ created_date │ string │ date │ ⚠️ WARN │ 45 bad formats │
│ last_login │ string │ datetime │ ❌ ISSUES │ 234 bad formats │
│ is_active │ string │ boolean │ ⚠️ WARN │ 8 unclear values│
└─────────────────┴──────────┴──────────┴──────────┴─────────────────┘
Field: created_date
✓ Field exists (string)
✓ Non-null constraint
✗ Type conversion validation (string → date('YYYY-MM-DD')): 45 incompatible records
Failure samples:
Row 12: "2023/12/25" (slash format, needs standardization)
Row 34: "Dec 25, 2023" (English format)
Row 67: "25.12.2023" (European format)
Recommended cleanup:
1. Standardize date formats to YYYY-MM-DD
2. Trim email fields that exceed 255 characters
3. Normalize boolean values (true/false only)
4. Fix malformed datetime values
This gives you a complete roadmap for data cleanup before migration.
Background: You inherit a legacy system with years of accumulated data quality issues. You need to understand the scope of cleanup required.
Legacy data issues:
- Mixed encodings
- Inconsistent data entry
- No validation for years
- Multiple date formats
- Currency symbols in numeric fields
Cleanup assessment schema (legacy_cleanup.json):
{
"rules": [
{
"field": "customer_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "first_name",
"type": "string(1000)",
"desired_type": "string(50)",
"required": true
},
{
"field": "salary",
"type": "string",
"desired_type": "float(10,2)",
"required": false,
"min": 0
},
{
"field": "hire_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true
},
{
"field": "department_id",
"type": "string",
"desired_type": "integer",
"required": true
}
],
"strict_mode": false
}

Cleanup process:
# Step 1: Assess current state
vlite schema --conn legacy_data.csv --rules legacy_cleanup.json \
--output json > cleanup_assessment.json
# Step 2: Get detailed samples
vlite schema --conn legacy_data.csv --rules legacy_cleanup.json \
--verbose > cleanup_details.txt
# Step 3: Validate after initial cleanup
# (after running data cleaning scripts)
vlite schema --conn cleaned_data.csv --rules legacy_cleanup.json \
--verbose

Sample output showing improvement:
Before cleanup:
salary field: 1,234 records with currency symbols ($, €, £)
hire_date field: 567 records with inconsistent formats
After cleanup:
salary field: 23 records still need manual review
hire_date field: 12 records still need manual review
Background: Validate data received from external APIs before processing.
API validation schema (api_validation.json):
{
"rules": [
{
"field": "user_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "timestamp",
"type": "string",
"desired_type": "datetime('YYYY-MM-DD HH:MI:SS')", // Internal: standard format
"required": true
},
{
"field": "amount",
"type": "string",
"desired_type": "float(12,2)",
"required": true,
"min": 0
}
]
}

# Validate API response data
vlite schema --conn api_response.json --rules api_validation.json

Background: Process multiple files in a batch operation.
#!/bin/bash
# validate_batch.sh
for file in data_files/*.csv; do
echo "Validating $file..."
vlite schema --conn "$file" --rules batch_schema.json \
--output json > "reports/$(basename "$file" .csv)_report.json"
done
echo "Validation complete. Check reports/ directory for results."Background: Integrate data quality checks into your CI/CD pipeline to catch data compatibility issues before they reach production.
Create .github/workflows/data-validation.yml:
name: Data Quality and Type Conversion Validation

on:
  push:
    paths:
      - 'data/**'
      - 'schemas/**'
  pull_request:
    paths:
      - 'data/**'
      - 'schemas/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install ValidateLite
        run: pip install validatelite

      - name: Basic data quality validation
        run: |
          vlite check --conn data/customers.csv --table customers \
            --rules schemas/customer_rules.json

      - name: Type conversion feasibility analysis
        run: |
          vlite schema --conn data/legacy_data.xlsx \
            --rules schemas/modernization_schema.json \
            --output json > type_conversion_report.json

      - name: Check conversion compatibility
        run: |
          # Check for incompatible type conversions
          python scripts/check_conversion_feasibility.py type_conversion_report.json

      - name: Upload validation reports
        uses: actions/upload-artifact@v2
        with:
          name: validation-reports
          path: |
            type_conversion_report.json
            validation_*.log

Helper script (scripts/check_conversion_feasibility.py):
#!/usr/bin/env python3
import json
import sys

def check_conversion_feasibility(report_file):
    """Check type conversion feasibility"""
    with open(report_file, 'r') as f:
        report = json.load(f)

    failed_conversions = []
    for result in report.get('results', []):
        if result.get('rule_type') == 'DESIRED_TYPE' and result.get('status') == 'FAILED':
            failed_conversions.append({
                'field': result.get('column'),
                'failed_count': result.get('failed_count'),
                'total_count': result.get('total_count'),
                'failure_rate': result.get('failed_count', 0) / result.get('total_count', 1)
            })

    if failed_conversions:
        print("❌ Type conversion issues found:")
        for conversion in failed_conversions:
            print(f" - Field {conversion['field']}: {conversion['failed_count']}/{conversion['total_count']} "
                  f"records cannot convert ({conversion['failure_rate']:.1%})")

        # Block merge if failure rate exceeds threshold
        max_failure_rate = max(c['failure_rate'] for c in failed_conversions)
        if max_failure_rate > 0.05:  # 5% threshold
            print(f"❌ Type conversion failure rate {max_failure_rate:.1%} exceeds 5% threshold. Blocking merge.")
            sys.exit(1)
        else:
            print(f"⚠️ Type conversion failure rate {max_failure_rate:.1%} is within acceptable range.")
    else:
        print("✅ All type conversion validations passed.")

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: python check_conversion_feasibility.py <report.json>")
        sys.exit(1)
    check_conversion_feasibility(sys.argv[1])
- Early problem detection - Find data compatibility issues before code merge
- Automated validation - No manual data quality checks needed
- Block problematic merges - Prevent incompatible data changes from reaching main branch
- Detailed reporting - Help developers understand specific issues
Background: Data scientists need to preprocess raw data including cleaning, type conversion, and format standardization. Before starting model development, it's crucial to validate data quality and assess conversion feasibility.
Raw survey dataset (raw_survey_data.csv):
id,age,income,satisfaction_score,join_date,is_premium,location
1,25.5,50000.0,8.2,2023-01-15,True,New York
2,,"60K",7.8,15/01/2023,1,California
3,thirty,75000,nine,2023-1-20,yes,Texas
4,45,$85000,6.5,2023/01/22,0,Florida
5,52,95000.50,4.9,Jan 25 2023,false,Washington

Preprocessing requirements:
- Age field needs conversion to integer (handle text and decimals)
- Income field needs standardization to numeric (remove currency symbols and letters)
- Satisfaction scores need conversion to standard numeric values
- Date formats need standardization
- Boolean fields need standardization
- Geographic locations need standardization
Create preprocessing validation schema (preprocessing_schema.json):
{
"rules": [
{
"field": "id",
"type": "string",
"desired_type": "integer",
"required": true,
"description": "Unique user identifier"
},
{
"field": "age",
"type": "string",
"desired_type": "integer",
"required": true,
"min": 18,
"max": 100,
"description": "Age needs conversion to integer, range 18-100"
},
{
"field": "income",
"type": "string",
"desired_type": "float(10,2)",
"required": true,
"min": 0,
"description": "Income needs conversion to numeric, remove non-digit characters"
},
{
"field": "satisfaction_score",
"type": "string",
"desired_type": "float(3,1)",
"required": true,
"min": 1.0,
"max": 10.0,
"description": "Satisfaction score, 1-10 scale"
},
{
"field": "join_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true,
"description": "Join date, standardize to YYYY-MM-DD format"
},
{
"field": "is_premium",
"type": "string",
"desired_type": "boolean",
"required": true,
"description": "Premium status, standardize to true/false"
},
{
"field": "location",
"type": "string(50)",
"desired_type": "string(20)",
"required": true,
"description": "Geographic location, standardize length"
}
],
"strict_mode": false,
"case_insensitive": true
}

Run preprocessing validation:
# Step 1: Check data quality and conversion feasibility
vlite schema --conn raw_survey_data.csv \
--rules preprocessing_schema.json \
--output json > preprocessing_report.json
# Step 2: Analyze conversion issues
vlite schema --conn raw_survey_data.csv \
--rules preprocessing_schema.json \
--verbose

Expected output:
Data Preprocessing Validation Report
====================================
Table: raw_survey_data
Total records: 5
Conversion validation results:
┌─────────────────────┬──────────┬──────────┬──────────┬────────────────┐
│ Field │ From │ To │ Status │ Issues │
├─────────────────────┼──────────┼──────────┼──────────┼────────────────┤
│ id │ string │ integer │ ✅ OK │ - │
│ age │ string │ integer │ ⚠️ PARTIAL│ 2 text values │
│ income │ string │ float │ ⚠️ PARTIAL│ Format issues │
│ satisfaction_score │ string │ float │ ⚠️ PARTIAL│ 1 text value │
│ join_date │ string │ date │ ❌ ISSUES │ Multiple formats│
│ is_premium │ string │ boolean │ ⚠️ PARTIAL│ Format issues │
│ location │ string │ string │ ✅ OK │ - │
└─────────────────────┴──────────┴──────────┴──────────┴────────────────┘
Detailed issue analysis:
• age field: Row 2 (empty), Row 3 ("thirty") cannot convert to integer
• income field: Row 2 ("60K"), Row 4 ("$85000") contain non-numeric characters
• satisfaction_score field: Row 3 ("nine") cannot convert to numeric
• join_date field: Detected 3 different date formats, needs standardization
• is_premium field: Multiple boolean representations (True/1/yes/0/false)
Data cleaning recommendations:
1. Establish missing value strategy for age field
2. Standardize income field format, remove symbols and units
3. Create text-to-numeric mapping rules (nine→9)
4. Standardize date format parsing rules
5. Unify boolean value representation standards
Create data cleaning script (clean_data.py):
import pandas as pd
import re
from datetime import datetime

def clean_survey_data(input_file, output_file):
    """Clean survey data"""
    df = pd.read_csv(input_file)

    # Clean age field
    def clean_age(age):
        if pd.isna(age):
            return None
        if str(age).lower() == 'thirty':
            return 30
        try:
            return int(float(str(age)))
        except ValueError:
            return None

    # Clean income field
    def clean_income(income):
        if pd.isna(income):
            return None
        # Remove all non-digit characters (except decimal point)
        cleaned = re.sub(r'[^\d.]', '', str(income))
        try:
            return float(cleaned)
        except ValueError:
            return None

    # Clean satisfaction score
    def clean_satisfaction(score):
        if pd.isna(score):
            return None
        if str(score).lower() == 'nine':
            return 9.0
        try:
            return float(score)
        except ValueError:
            return None

    # Clean date field
    def clean_date(date_str):
        if pd.isna(date_str):
            return None
        # Try multiple date formats
        formats = ['%Y-%m-%d', '%d/%m/%Y', '%Y/%m/%d', '%b %d %Y']
        for fmt in formats:
            try:
                return datetime.strptime(str(date_str), fmt).strftime('%Y-%m-%d')
            except ValueError:
                continue
        return None

    # Clean boolean field
    def clean_boolean(value):
        if pd.isna(value):
            return False
        str_val = str(value).lower()
        return str_val in ['true', '1', 'yes', 'y']

    # Apply cleaning rules
    df['age'] = df['age'].apply(clean_age)
    df['income'] = df['income'].apply(clean_income)
    df['satisfaction_score'] = df['satisfaction_score'].apply(clean_satisfaction)
    df['join_date'] = df['join_date'].apply(clean_date)
    df['is_premium'] = df['is_premium'].apply(clean_boolean)
    df['location'] = df['location'].str.strip()

    # Save cleaned data
    df.to_csv(output_file, index=False)
    print(f"Cleaning complete, results saved to {output_file}")

if __name__ == '__main__':
    clean_survey_data('raw_survey_data.csv', 'cleaned_survey_data.csv')

Validate cleaned data:
# Validate cleaned data
vlite schema --conn cleaned_survey_data.csv \
--rules preprocessing_schema.json \
--verbose
# Output should show all conversion validations passing

Workflow script (data_preprocessing_workflow.sh):
#!/bin/bash
echo "Starting data preprocessing workflow..."
# 1. Initial data quality assessment
echo "Step 1: Assess raw data quality"
vlite schema --conn raw_survey_data.csv \
--rules preprocessing_schema.json \
--output json > initial_assessment.json
# 2. Execute data cleaning
echo "Step 2: Execute data cleaning"
python clean_data.py
# 3. Validate cleaning results
echo "Step 3: Validate cleaning results"
vlite schema --conn cleaned_survey_data.csv \
--rules preprocessing_schema.json \
--output json > final_validation.json
# 4. Generate data quality report
echo "Step 4: Generate data quality report"
python generate_quality_report.py initial_assessment.json final_validation.json
echo "Data preprocessing workflow complete!"This scenario shows data scientists how to use ValidateLite for:
- Data quality assessment - Understanding raw data issues
- Conversion feasibility analysis - Evaluating cleaning strategy effectiveness
- Cleaning validation - Ensuring processed data meets modeling requirements
- Automated workflow - Standardized data preprocessing pipeline
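The `generate_quality_report.py` script invoked in the workflow above is not part of ValidateLite. A minimal sketch, assuming the `validation_summary` layout shown in the JSON output section below, could look like this:

```python
import json
import sys

def summarize(results: dict) -> dict:
    """Pull the headline numbers out of a `vlite ... --output json` payload."""
    summary = results["validation_summary"]
    return {"failed": summary["failed_rules"], "status": summary["overall_status"]}

def report(initial: dict, final: dict) -> str:
    """Compare the before/after assessments and format a short report."""
    before, after = summarize(initial), summarize(final)
    return "\n".join([
        "Data Quality Report",
        f"Rules failing before cleaning: {before['failed']}",
        f"Rules failing after cleaning:  {after['failed']}",
        f"Rules fixed by cleaning:       {before['failed'] - after['failed']}",
        f"Final status: {after['status']}",
    ])

if __name__ == "__main__":
    # Usage: python generate_quality_report.py initial.json final.json
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        print(report(json.load(f1), json.load(f2)))
```

Keeping the comparison logic in a pure `report()` function makes the script easy to unit-test without fixture files.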
ValidateLite provides two output formats: a human-readable table and machine-readable JSON. Understanding the output helps you quickly identify data quality issues.
Default table output provides a clear overview:
Data Validation Results
=======================
Connection: customers.csv
Table: customers
Rules executed: 5
Validation time: 1.23s
┌─────────────────┬──────────┬──────────┬──────────┬─────────────────┐
│ Rule │ Type │ Status │ Failed │ Details │
├─────────────────┼──────────┼──────────┼──────────┼─────────────────┤
│ email_required │ NOT_NULL │ ✅ PASS │ 0/1000 │ All records OK │
│ unique_email │ UNIQUE │ ❌ FAIL │ 12/1000 │ 12 duplicates │
│ valid_age │ RANGE │ ⚠️ WARN │ 3/1000 │ 3 out of range │
│ phone_format │ REGEX │ ✅ PASS │ 0/1000 │ All valid │
│ status_enum │ ENUM │ ❌ FAIL │ 25/1000 │ Invalid values │
└─────────────────┴──────────┴──────────┴──────────┴─────────────────┘
Overall Status: FAILED (2 rules failed)
Verbose table output includes sample data:
vlite check --conn data.csv --table users --rule "unique(email)" --verbose

Validation Results (Verbose)
============================
Rule: unique_email
Type: UNIQUE
Status: ❌ FAILED
Failed records: 12 out of 1000 total
Sample failures:
┌─────┬─────────────────────┬─────────────┐
│ Row │ Email │ Occurrences │
├─────┼─────────────────────┼─────────────┤
│ 145 │ john@email.com │ 3 │
│ 298 │ mary@email.com │ 2 │
│ 456 │ bob@company.com │ 2 │
│ 789 │ admin@system.com │ 5 │
└─────┴─────────────────────┴─────────────┘
Recommendation: Review duplicate email addresses and decide on deduplication strategy.
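Acting on that recommendation is up to you, but for CSV sources a small pandas helper (not part of ValidateLite; keeping the first occurrence is just one possible strategy) can apply the deduplication:

```python
import pandas as pd

def dedupe_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with repeated email addresses, keeping the first occurrence."""
    return df.drop_duplicates(subset=["email"], keep="first")

# Usage: clean the file flagged by the unique(email) rule, then re-validate.
# df = pd.read_csv("data.csv")
# dedupe_emails(df).to_csv("data_deduped.csv", index=False)
```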
JSON output is perfect for automation and integration:
vlite schema --conn data.csv --rules schema.json --output json

{
  "validation_summary": {
    "connection": "data.csv",
    "table": "users",
    "total_rules": 5,
    "passed_rules": 3,
    "failed_rules": 2,
    "warning_rules": 0,
    "validation_time": "1.23s",
    "overall_status": "FAILED"
  },
  "results": [
    {
      "rule_id": "email_required",
      "rule_type": "NOT_NULL",
      "column": "email",
      "status": "PASSED",
      "total_count": 1000,
      "failed_count": 0,
      "failure_rate": 0.0,
      "message": "All records have non-null email values"
    },
    {
      "rule_id": "email_unique",
      "rule_type": "UNIQUE",
      "column": "email",
      "status": "FAILED",
      "total_count": 1000,
      "failed_count": 12,
      "failure_rate": 0.012,
      "message": "Found 12 duplicate email addresses",
      "sample_data": [
        {"row": 145, "email": "john@email.com", "occurrences": 3},
        {"row": 298, "email": "mary@email.com", "occurrences": 2}
      ]
    },
    {
      "rule_id": "salary_conversion",
      "rule_type": "DESIRED_TYPE",
      "column": "salary",
      "status": "FAILED",
      "current_type": "string",
      "desired_type": "float(10,2)",
      "total_count": 1000,
      "failed_count": 45,
      "failure_rate": 0.045,
      "message": "45 records cannot be converted from string to float(10,2)",
      "conversion_analysis": {
        "compatible_records": 955,
        "incompatible_records": 45,
        "common_issues": [
          "Currency symbols ($, €, £)",
          "Thousands separators (,)",
          "Text values (N/A, TBD)"
        ]
      }
    }
  ]
}

ValidateLite uses clear exit codes for automation:
| Exit Code | Meaning | Description |
|---|---|---|
| 0 | Success | All validations passed |
| 1 | Validation Failed | One or more rules failed |
| 2 | Usage Error | Invalid command line arguments |
| 3 | Connection Error | Cannot connect to data source |
| 4 | File Error | File not found or permission issues |
| 5 | Configuration Error | Invalid schema or rule format |
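These codes can be consumed from any language, not just shell. A Python sketch (assuming `vlite` is on your `PATH`) that runs a check and maps the return code back to the table above:

```python
import subprocess

# Documented vlite exit codes (see the table above).
EXIT_MEANINGS = {
    0: "Success - all validations passed",
    1: "Validation failed - one or more rules failed",
    2: "Usage error - invalid command line arguments",
    3: "Connection error - cannot connect to data source",
    4: "File error - file not found or permission issues",
    5: "Configuration error - invalid schema or rule format",
}

def describe_exit(code):
    """Translate a vlite exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code: {code}")

def run_check(conn, table, rule):
    """Run a single vlite check and return (exit_code, meaning)."""
    result = subprocess.run(
        ["vlite", "check", "--conn", conn, "--table", table, "--rule", rule]
    )
    return result.returncode, describe_exit(result.returncode)
```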
Using exit codes in scripts:
#!/bin/bash
vlite check --conn data.csv --table users --rule "not_null(email)"
exit_code=$?
case $exit_code in
    0)
        echo "✅ Data validation passed"
        ;;
    1)
        echo "❌ Data validation failed - check the output above"
        exit 1
        ;;
    *)
        echo "💥 Validation error (code: $exit_code)"
        exit $exit_code
        ;;
esac

Save results to files:
# Save table output
vlite check --conn data.csv --table users --rule "unique(email)" > validation_report.txt
# Save JSON output
vlite schema --conn data.csv --rules schema.json --output json > results.json
# Save both stdout and stderr
vlite check --conn data.csv --table users --rule "unique(email)" &> full_output.log
# Append to existing files
vlite check --conn data.csv --table users --rule "range(age, 0, 120)" >> daily_checks.log

Parse JSON results:
import json

# Load validation results
with open('results.json', 'r') as f:
    results = json.load(f)

# Check overall status
if results['validation_summary']['overall_status'] == 'FAILED':
    print("Validation failed!")

# Get failed rules
failed_rules = [r for r in results['results'] if r['status'] == 'FAILED']
for rule in failed_rules:
    print(f"Rule {rule['rule_id']}: {rule['failed_count']} failures")

ValidateLite supports several configuration methods, from simple command-line parameters to full configuration files, to suit different usage scenarios.
Database connections:
# MySQL connection
export DB_HOST="production-db.company.com"
export DB_USER="data_analyst"
export DB_PASSWORD="secure_password"
export DB_NAME="analytics"
export MYSQL_URL="mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/${DB_NAME}"
# PostgreSQL connection
export PG_HOST="warehouse.company.com"
export PG_USER="reporting_user"
export PG_PASSWORD="another_secure_password"
export PG_NAME="data_warehouse"
export POSTGRES_URL="postgresql://${PG_USER}:${PG_PASSWORD}@${PG_HOST}:5432/${PG_NAME}"
# Use in commands
vlite check --conn "$MYSQL_URL" --table customers --rule "not_null(email)"

Performance tuning:
# Query timeouts (in seconds)
export VLITE_QUERY_TIMEOUT=300
export VLITE_CONNECTION_TIMEOUT=30
# Memory limits
export VLITE_MAX_SAMPLE_SIZE=1000
export VLITE_BATCH_SIZE=10000
# Parallel processing
export VLITE_MAX_WORKERS=4

Advanced connection string options:
# MySQL with SSL
--conn "mysql://user:pass@host:3306/db?ssl_ca=/path/to/ca.pem&ssl_cert=/path/to/cert.pem"
# PostgreSQL with connection pool
--conn "postgresql://user:pass@host:5432/db?pool_size=10&max_overflow=20"
# SQLite with custom timeout
--conn "sqlite:///data.db?timeout=20"

Connection string with table specification:
# Include table name in connection string
--conn "mysql://user:pass@host:3306/database.table_name"
# Override with command line parameter
--conn "mysql://user:pass@host:3306/database.table_name" --table "different_table"

For large datasets:
{
  "performance": {
    "query_timeout": 600,
    "sample_size": 5000,
    "batch_size": 50000,
    "parallel_workers": 8,
    "memory_limit": "2GB"
  },
  "rules": [
    {
      "field": "user_id",
      "type": "string",
      "desired_type": "integer",
      "required": true
    }
  ]
}

For development/testing:
{
  "performance": {
    "query_timeout": 30,
    "sample_size": 100,
    "batch_size": 1000,
    "parallel_workers": 2
  }
}

This section helps you solve common issues when using ValidateLite, especially with type conversion validation.
| Error Message | Possible Cause | Solution |
|---|---|---|
| `Connection timeout` | Database unreachable | Check host, port, and network connectivity |
| `Authentication failed` | Wrong credentials | Verify username and password |
| `Database not found` | Wrong database name | Check database name in connection string |
| `File not found: data.csv` | Wrong file path | Use absolute path or check current directory |
| `Permission denied` | File access rights | Check file permissions or run with proper rights |
| Error Message | Possible Cause | Solution |
|---|---|---|
| `Invalid JSON schema` | Malformed JSON | Validate JSON syntax with a JSON validator |
| `Unknown rule type: INVALID` | Typo in rule type | Use valid rule types: NOT_NULL, UNIQUE, RANGE, etc. |
| `Missing required field: field` | Schema missing field name | Add "field" property to rule definition |
| `Table 'users' not found` | Wrong table name | Check table name and database connection |
| Error Message | Possible Cause | Solution |
|---|---|---|
| `Invalid type syntax: float(10)` | Wrong type definition format | Use the correct format: float(10,2) |
| `Conflicting conversion: datetime to integer` | Impossible type conversion | Check that the desired_type setting is reasonable |
| `Type conversion timeout` | Conversion validation timed out | Increase the conversion_timeout config or reduce the data size |
| `Precision must be greater than scale` | Wrong float precision config | Ensure precision > scale |
Debug connection problems:
# Test basic connectivity
vlite check --conn "mysql://user:pass@host:3306/db" --table "information_schema.tables" --rule "not_null(table_name)"
# Verbose connection debugging
vlite check --conn data.csv --table nonexistent --rule "not_null(id)" --verbose

Common connection string fixes:
# Wrong: Missing protocol
--conn "user:pass@host:3306/database"
# Right: Include protocol
--conn "mysql://user:pass@host:3306/database"
# Wrong: Incorrect port for PostgreSQL
--conn "postgresql://user:pass@host:3306/database"
# Right: Use PostgreSQL default port
--conn "postgresql://user:pass@host:5432/database"
# Wrong: Relative path issues
--conn "data/file.csv"
# Right: Use absolute path
--conn "/full/path/to/data/file.csv"

Debug type conversion issues:
# Check what types are detected
vlite schema --conn data.csv --rules schema.json --verbose
# Test conversion with smaller sample
vlite schema --conn data.csv --rules schema.json --sample-size 100

Common type conversion fixes:
// Wrong: Impossible conversion
{
  "field": "description",
  "type": "string",
  "desired_type": "integer"  // Text cannot become numbers
}

// Right: Reasonable conversion
{
  "field": "description",
  "type": "string(1000)",
  "desired_type": "string(500)"  // Truncate long text
}

// Wrong: Invalid precision/scale
{
  "field": "amount",
  "type": "string",
  "desired_type": "float(2,10)"  // Scale > precision
}

// Right: Valid precision/scale
{
  "field": "amount",
  "type": "string",
  "desired_type": "float(12,2)"  // Precision > scale
}

Handle problematic data:
# Script to identify problematic records
import json

with open('validation_results.json') as f:
    results = json.load(f)

for result in results['results']:
    if result['rule_type'] == 'DESIRED_TYPE' and result['status'] == 'FAILED':
        print(f"Field: {result['column']}")
        print(f"Conversion: {result['current_type']} → {result['desired_type']}")
        print(f"Failed: {result['failed_count']}/{result['total_count']}")
        if 'sample_data' in result:
            print("Sample problematic values:")
            for sample in result['sample_data'][:5]:
                print(f"  Row {sample['row']}: {sample['value']}")
        print()

Get help:
# Show command help
vlite check --help
vlite schema --help
# Show version
vlite --version
# Test with minimal example
vlite check --conn /dev/null --table test --rule "not_null(id)" 2>&1

If you're still having issues, the most common problems are:
- Connection strings - Double-check your database connection details
- File paths - Use absolute paths when in doubt
- Type definitions - Make sure your desired_type conversions make sense
- JSON syntax - Validate your schema files with a JSON checker
ValidateLite is designed to give you clear error messages, so read them carefully - they usually point directly to the problem!
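The checklist above can be partially automated. This pre-flight helper is not part of ValidateLite; it is a sketch that catches the connection-string, file-path, and JSON-syntax items before you ever invoke `vlite`:

```python
import json
import os
import re

def preflight(conn, rules_path=None):
    """Catch common ValidateLite setup mistakes before invoking vlite."""
    issues = []
    if "://" in conn:
        # Database URLs need one of the supported schemes.
        if not re.match(r"^(mysql|postgresql|sqlite)://", conn):
            issues.append(f"Unrecognized scheme in connection string: {conn}")
    elif not os.path.isfile(conn):
        # File sources must exist; relative paths are a frequent culprit.
        issues.append(f"File not found: {conn} (try an absolute path)")
    if rules_path is not None:
        # Schema files must at least be valid JSON.
        try:
            with open(rules_path) as f:
                json.load(f)
        except FileNotFoundError:
            issues.append(f"Rules file not found: {rules_path}")
        except json.JSONDecodeError as exc:
            issues.append(f"Invalid JSON in {rules_path}: {exc}")
    return issues
```

Run it in CI before the real validation step so setup mistakes fail fast with a targeted message instead of a generic exit code.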