A practical tool for checking data quality and validating type conversions.
- Overview
- Installation
- Quick Start
- Data Sources
- Command Overview
- Using check command
- Using schema command
- Use Cases
- Case 1: Customer Data Quality Check
- Case 2: E-commerce Order Validation
- Case 3: Excel Financial Report Validation
- Case 4: Pre-migration Data Validation
- Case 5: Legacy System Data Cleanup
- Case 6: API Data Interface Validation
- Case 7: Batch File Validation
- Case 8: Data Validation in CI/CD
- Case 9: Data Science Preprocessing Validation
- Output and Results
- Configuration
- Troubleshooting
ValidateLite is a Python command-line tool designed for data quality validation. It provides two main validation approaches:
Quick validation with vlite check
- Perfect for ad-hoc data checks and exploration
- Single rule validation with immediate feedback
- Great for debugging and development
Schema-based validation with vlite schema
- Comprehensive validation using JSON schema files
- Batch processing for multiple rules and tables
- Features the powerful Desired Type functionality for type conversion validation
What makes ValidateLite special?
The standout feature is Desired Type validation - it doesn't just check if your data fits a schema, it tells you whether your data can be safely converted to a different type. This is invaluable for:
- Data migration planning
- System upgrades
- ETL process validation
- Data quality assessment before transformations
Supported data sources:
- Files: CSV, Excel, JSON
- Databases: MySQL, PostgreSQL, SQLite
Install from PyPI:
pip install validatelite

Or install from source:
git clone https://github.com/litedatum/validatelite.git
cd validatelite
pip install -e .

Verify the installation:
vlite --version

ValidateLite works with:
- Python 3.8+
- pandas (for Excel/CSV processing)
- SQLAlchemy (for database connections)
- Click (for CLI interface)
Database drivers are optional:
- MySQL: `pip install pymysql`
- PostgreSQL: `pip install psycopg2-binary`
- SQLite: built into Python
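If you're not sure which optional drivers are already installed in your environment, a quick way to check from Python (a small sketch, not part of ValidateLite itself):

```python
import importlib.util

def available_drivers():
    """Report which optional database drivers are importable."""
    optional = {"mysql": "pymysql", "postgresql": "psycopg2"}
    # sqlite3 ships with CPython, so it is always available
    found = {"sqlite": True}
    for name, module in optional.items():
        found[name] = importlib.util.find_spec(module) is not None
    return found

print(available_drivers())
```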
Here are some simple examples to get you started:
# Check for missing email addresses
vlite check --conn customers.csv --table customers --rule "not_null(email)"

# Run several checks at once
vlite check --conn data.csv --table data \
--rule "not_null(id)" \
--rule "unique(email)" \
--rule "range(age, 18, 99)"

# Check if string data can be converted to proper types
vlite schema --conn messy_data.csv --rules cleanup_schema.json

Sample schema file (cleanup_schema.json):
{
"rules": [
{
"field": "user_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "salary",
"type": "string",
"desired_type": "float(10,2)",
"required": true
}
]
}

This will tell you exactly which records can't be converted from string to integer/float.
ValidateLite connects to various data sources with a simple connection string approach.
CSV Files:
--conn data.csv
--conn /path/to/data.csv
--conn file://data.csv

Excel Files:
--conn report.xlsx
--conn /path/to/report.xlsx
# For multi-sheet Excel files, specify the sheet
--conn report.xlsx --table "Sheet1"

JSON Files:
--conn data.json
--conn /path/to/data.json

MySQL:
--conn "mysql://username:password@host:port/database"
--conn "mysql://user:pass@localhost:3306/sales"PostgreSQL:
--conn "postgresql://username:password@host:port/database"
--conn "postgres://user:pass@localhost:5432/analytics"SQLite:
--conn "sqlite:///path/to/database.db"
--conn "sqlite:///data/local.db"Keep sensitive connection details out of your commands:
# Set environment variables
export DB_HOST="localhost"
export DB_USER="analyst"
export DB_PASSWORD="secret123"
export DB_NAME="sales"
# Build connection string
export MYSQL_URL="mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/${DB_NAME}"
# Use in commands
vlite check --conn "$MYSQL_URL" --table users --rule "not_null(email)"ValidateLite offers two commands for different validation needs.
Quick data quality checks for immediate feedback:
vlite check --conn <data_source> --table <table_name> --rule "<rule>" [options]

Key features:
- Instant validation without config files
- Flexible inline rule definitions
- Fast feedback for development and debugging
- One rule at a time execution
Best for:
- Development phase testing
- Data exploration and analysis
- Quick data quality checks
- Debugging and troubleshooting
Comprehensive validation using schema files:
# Single table validation
vlite schema --conn <data_source> --table <table_name> --rules <schema.json> [options]
# Multi-table validation (tables defined in schema)
vlite schema --conn <data_source> --rules <schema.json> [options]

Key features:
- Schema-driven with JSON schema files
- Batch validation for multiple tables and rules
- Type conversion analysis with Desired Type functionality
- Structured configuration for reuse and version control
Best for:
- Production data quality monitoring
- Pre-migration data validation
- ETL pipeline data validation
- Automated testing in CI/CD
Schema file syntax differences:
When using the --table parameter, your schema should contain field-level rules:
{
"rules": [
{
"field": "email",
"type": "string(255)",
"desired_type": "string(100)",
"required": true
}
]
}

When not using the --table parameter, your schema should contain table-level definitions:
{
"tables": [
{
"name": "users",
"fields": [
{
"field": "email",
"type": "string(255)",
"desired_type": "string(100)",
"required": true
}
]
}
]
}

ValidateLite provides validation rules covering the most common aspects of data quality.
| Category | Rule Type | Purpose |
|---|---|---|
| Completeness | NOT_NULL | Check for missing values |
| Uniqueness | UNIQUE | Find duplicate values |
| Format | REGEX | Validate patterns |
| Format | DATE_FORMAT | Check date formats |
| Value | ENUM | Validate against allowed values |
| Range | RANGE | Check numeric ranges |
Check for missing values:
# Basic not-null check
--rule "not_null(email)"
# With custom message
--rule "not_null(customer_id, 'Customer ID is required')"
# Check multiple columns
--rule "not_null(first_name)"
--rule "not_null(last_name)"
--rule "not_null(email)"Find duplicate records:
# Check for duplicate emails
--rule "unique(email)"
# Check for duplicate combinations
--rule "unique(first_name, last_name, birth_date)"
# Check with filter conditions
--rule "unique(username) WHERE status = 'active'"REGEX pattern validation:
# Email format validation
--rule "regex(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')"
# Phone number format
--rule "regex(phone, '^\\+?1?[0-9]{10,14}$')"
# Product code format
--rule "regex(product_code, '^[A-Z]{2}[0-9]{4}$')"DATE_FORMAT validation:
# Basic syntax
--rule "date_format(column_name, 'format_pattern')"Supported date format patterns:
| Pattern | Example | Description |
|---|---|---|
| `YYYY-MM-DD` | 2023-12-25 | ISO date format |
| `MM/DD/YYYY` | 12/25/2023 | US date format |
| `DD/MM/YYYY` | 25/12/2023 | European date format |
| `YYYYMMDD` | 20231225 | Compact date format |
| `DD.MM.YYYY` | 25.12.2023 | German date format |
| `YYYY-MM-DD HH:MI:SS` | 2023-12-25 14:30:00 | DateTime format |
| `DD-MMM-YYYY` | 25-Dec-2023 | Month abbreviation format |
| `YYYY/MM/DD` | 2023/12/25 | Slash-separated format |
Format components:
- `YYYY` or `yyyy` - Four-digit year
- `MM` or `mm` - Two-digit month (01-12)
- `DD` or `dd` - Two-digit day (01-31)
- `HH` or `hh` - Two-digit hour (00-23)
- `MI` or `mi` - Two-digit minute (00-59)
- `SS` or `ss` - Two-digit second (00-59)
# Examples
--rule "date_format(created_at, 'YYYY-MM-DD HH:MI:SS')"
--rule "date_format(birth_date, 'MM/DD/YYYY')"
--rule "date_format(event_date, 'DD.MM.YYYY')"Database support:
- MySQL: Native support for all formats
- PostgreSQL: Uses regex pre-validation + Python verification
- SQLite: Uses custom function validation
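Conceptually, these patterns map onto the usual strptime directives. The sketch below illustrates that mapping; it is not ValidateLite's internal implementation. Note that strptime also rejects impossible dates such as February 31st:

```python
from datetime import datetime

# Illustrative mapping from the pattern tokens above to strptime directives
PATTERN_MAP = {
    "YYYY": "%Y", "MMM": "%b", "MM": "%m", "DD": "%d",
    "HH": "%H", "MI": "%M", "SS": "%S",
}

def to_strptime(pattern: str) -> str:
    """Translate a pattern like 'YYYY-MM-DD' into a strptime format string."""
    # Replace longer tokens first so 'MMM' is not consumed as 'MM' + 'M'
    for token in sorted(PATTERN_MAP, key=len, reverse=True):
        pattern = pattern.replace(token, PATTERN_MAP[token])
    return pattern

def matches(value: str, pattern: str) -> bool:
    """True if `value` conforms to `pattern` and is a real calendar date."""
    try:
        datetime.strptime(value, to_strptime(pattern))
        return True
    except ValueError:
        return False

print(matches("2023-12-25", "YYYY-MM-DD"))  # True
print(matches("2023-02-31", "YYYY-MM-DD"))  # False: no Feb 31st
```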
ENUM (allowed values) validation:
# Status field validation
--rule "enum(status, 'active', 'inactive', 'pending')"
# Priority levels
--rule "enum(priority, 'low', 'medium', 'high', 'critical')"
# Boolean-like values
--rule "enum(is_verified, 'true', 'false', '1', '0')"Numeric range validation:
# Age validation
--rule "range(age, 0, 120)"
# Price validation with decimals
--rule "range(price, 0.01, 999999.99)"
# Percentage validation
--rule "range(completion_rate, 0.0, 100.0)"
# Year validation
--rule "range(birth_year, 1900, 2024)"For complex validation scenarios, use JSON rule files:
Basic rule file (validation_rules.json):
{
"rules": [
{
"name": "email_required",
"type": "NOT_NULL",
"target": {
"database": "sales_db",
"table": "customers",
"column": "email"
},
"severity": "HIGH"
},
{
"name": "unique_customer_email",
"type": "UNIQUE",
"target": {
"database": "sales_db",
"table": "customers",
"column": "email"
},
"severity": "HIGH"
},
{
"name": "valid_age_range",
"type": "RANGE",
"target": {
"database": "sales_db",
"table": "customers",
"column": "age"
},
"parameters": {
"min_value": 18,
"max_value": 99
},
"severity": "MEDIUM"
}
]
}

Using rule files:
vlite check --conn "mysql://user:pass@host:3306/sales_db" \
--table customers --rules validation_rules.json

This is where ValidateLite really shines! It provides a rich type system and data conversion validation capabilities.
ValidateLite supports these fundamental data types:
| Type | Description | Examples |
|---|---|---|
| `string` | Text data | "John", "Hello World" |
| `integer` | Whole numbers | 42, -17, 0 |
| `float` | Decimal numbers | 3.14, -0.5, 100.00 |
| `boolean` | True/false values | true, false |
| `date` | Date values | 2023-12-25 |
| `datetime` | Date and time values | 2023-12-25 14:30:00 |
ValidateLite provides intuitive data type definition syntax with precise type constraints:
{
"field": "username",
"type": "string(50)", // Max length 50 characters
"required": true
}

String type definition syntax:
- `string(100)` - Max length 100 characters
- `string(10,50)` - Length between 10 and 50 characters
- `string` - No length restrictions
{
"field": "price",
"type": "float(10,2)", // Precision 10, scale 2
"required": true
}

Float type definition syntax:
- `float(10,2)` - Precision 10, scale 2 (2 decimal places)
- `float(8,3)` - Precision 8, scale 3 (3 decimal places)
- `float` - Standard float
{
"field": "created_at",
"type": "datetime('YYYY-MM-DD HH:MI:SS')", // Specific datetime format
"required": true
}

DateTime type definition syntax:
- `datetime('YYYY-MM-DD HH:MI:SS')` - Specific datetime format
- `date('YYYY-MM-DD')` - Specific date format
- `datetime` - Standard datetime format
Desired Type is ValidateLite's most valuable feature! It lets you validate whether data can be safely converted to a target type, which is crucial for data migration, system upgrades, and data cleaning scenarios.
Traditional validation just checks if data matches a schema. Desired Type goes further - it tells you if your messy string data can actually be converted to proper types like integers or dates.
Example scenario: You have a CSV file where everything is stored as strings:
user_id: "123"(should be integer)salary: "75000.50"(should be float)join_date: "2023-01-15"(should be date)
Desired Type validation will tell you exactly which records can be converted and which ones will cause problems.
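The core idea can be sketched in a few lines of plain Python: attempt the conversion for every value and record the rows that fail. This is an illustration of the concept, not ValidateLite's actual engine:

```python
def conversion_report(values, convert):
    """Try `convert` on every value and report the rows that would fail."""
    failures = []
    for row, value in enumerate(values, start=1):
        try:
            convert(value)
        except (ValueError, TypeError):
            failures.append((row, value))
    total = len(values)
    return {"total": total, "convertible": total - len(failures), "failures": failures}

# e.g. can the user_id column become an integer?
report = conversion_report(["123", "456", "abc", "789"], int)
print(report["convertible"], "of", report["total"])  # 3 of 4
print(report["failures"])                            # [(3, 'abc')]
```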
Desired Type uses the same type definition syntax for precise validation:
{
"transactions": {
"rules": [
{
"field": "amount",
"type": "string", // Current: string data
"desired_type": "float(12,2)", // Target: decimal with 12 precision, 2 scale
"required": true
},
{
"field": "transaction_date",
"type": "string", // Current: string data
"desired_type": "datetime('YYYY-MM-DD')", // Target: specific datetime format
"required": true
},
{
"field": "description",
"type": "string(500)", // Current: long strings
"desired_type": "string(200)", // Target: shorter strings
"required": true
}
]
}
}

The same type definition syntax applies when planning a migration:
{
"migration_analysis": {
"rules": [
{
"field": "legacy_id",
"type": "string(50)", // Current: string with max 50 chars
"desired_type": "integer", // Target: integer
"required": true
},
{
"field": "legacy_amount",
"type": "string", // Current: free-form string
"desired_type": "float(10,2)", // Target: precise decimal
"required": true
},
{
"field": "legacy_timestamp",
"type": "string", // Current: string timestamp
"desired_type": "datetime('YYYY-MM-DD HH:MI:SS')", // Target: structured datetime
"required": true
}
]
}
}

What you get from Desired Type validation:
- Count of records that can be converted successfully
- Count of problematic records that would fail conversion
- Sample data showing exactly what the problems are
- Conversion feasibility percentage
- Specific error patterns in your data
ValidateLite analyzes type conversion compatibility and reports three possible outcomes:
All data can be safely converted to the desired type.
Example:
Field: user_id
Current Type: string → Desired Type: integer
Result: ✅ COMPATIBLE (500/500 records can be converted)
Some data can be converted, but some records have issues.
Example:
Field: salary
Current Type: string → Desired Type: float(10,2)
Result: ⚠️ PARTIAL (487/500 records can be converted)
Issues: 13 records contain non-numeric characters
Most or all data cannot be converted to the desired type.
Example:
Field: comments
Current Type: string → Desired Type: integer
Result: ❌ INCOMPATIBLE (0/500 records can be converted)
Issues: Text data cannot be converted to integers
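Classifying a field into these three outcomes boils down to its conversion rate. A minimal sketch follows; the 50% cut-off between PARTIAL and INCOMPATIBLE is an illustrative assumption, not ValidateLite's documented threshold:

```python
def classify(convertible: int, total: int, partial_threshold: float = 0.5) -> str:
    """Map a conversion rate onto the three outcome labels.

    The 50% cut-off between PARTIAL and INCOMPATIBLE is illustrative;
    ValidateLite's own thresholds may differ.
    """
    if total == 0 or convertible == total:
        return "COMPATIBLE"
    if convertible / total >= partial_threshold:
        return "PARTIAL"
    return "INCOMPATIBLE"

print(classify(500, 500))  # COMPATIBLE
print(classify(487, 500))  # PARTIAL
print(classify(0, 500))    # INCOMPATIBLE
```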
ValidateLite uses smart conversion validation strategies:
For numeric conversions:
- Removes common formatting (spaces, commas, currency symbols)
- Handles scientific notation
- Validates decimal precision and scale
- Checks for overflow conditions
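A sketch of what such normalization can look like; the exact symbols stripped and the fallback behaviour are illustrative assumptions, not ValidateLite's internals:

```python
import re

def normalize_numeric(raw: str):
    """Strip common formatting before attempting a numeric conversion.

    Handles thousands separators, currency symbols, and surrounding
    spaces; float() itself accepts scientific notation such as "1e3".
    Returns None when nothing numeric remains.
    """
    cleaned = raw.strip().replace(",", "")
    cleaned = re.sub(r"[$€£]", "", cleaned)
    try:
        return float(cleaned)
    except ValueError:
        return None

print(normalize_numeric("$1,234.50"))  # 1234.5
print(normalize_numeric("1e3"))        # 1000.0
print(normalize_numeric("sixty"))      # None
```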
For date/datetime conversions:
- Attempts multiple common date formats
- Validates actual date values (no Feb 31st)
- Handles timezone considerations
- Checks for impossible dates
For string length validation:
- Measures actual character length
- Considers UTF-8 encoding
- Validates against target length constraints
For type compatibility:
- Checks if larger types can fit into smaller ones
- Validates precision/scale requirements for decimals
- Ensures no data loss during conversion
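For decimals, "no data loss" means a value must fit the target precision and scale exactly. A sketch using Python's decimal module; rejecting values that would need rounding is an illustrative design choice, since a real system might round instead:

```python
from decimal import Decimal, InvalidOperation

def fits_decimal(raw: str, precision: int, scale: int) -> bool:
    """True if `raw` fits float(precision, scale) without loss:
    at most (precision - scale) digits before the decimal point
    and at most `scale` digits after it."""
    try:
        value = Decimal(raw)
    except InvalidOperation:
        return False
    if not value.is_finite():  # Decimal also parses 'NaN' and 'Infinity'
        return False
    _, digits, exponent = value.as_tuple()
    frac_digits = max(-exponent, 0)              # digits after the point
    int_digits = max(len(digits) + exponent, 0)  # digits before the point
    return int_digits <= precision - scale and frac_digits <= scale

print(fits_decimal("75000.50", 10, 2))     # True: fits float(10,2)
print(fits_decimal("123.456", 10, 2))      # False: scale 3 would be truncated
print(fits_decimal("12345678901", 10, 2))  # False: too many integer digits
```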
Comprehensive validation output: When you run Desired Type validation, you get detailed information about:
- Which fields can be safely converted
- Which data needs cleaning
- Specific failure samples and suggested fixes
This section provides complete usage scenarios showcasing Desired Type functionality.
Background: You have a customer database that's been collecting data for years. Data quality has declined and you need to assess what can be cleaned up.
Dataset: Customer table with mixed data quality
customer_id,name,email,phone,age,registration_date,is_premium
1,John Smith,john@email.com,555-1234,25,2023-01-15,true
2,"Jane, Doe",jane@email.com,,35,01/15/2023,1
3,Bob Johnson,invalid-email,555-ABCD,age_unknown,2023/1/15,yes
4,"Mike Wilson",mike@email.com,5551234567,45,2023-01-15,falseQuick validation with check command:
# Check for basic data quality issues
vlite check --conn customers.csv --table customers \
--rule "not_null(customer_id)" \
--rule "unique(email)" \
--rule "regex(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')" \
--rule "not_null(phone)" \
--verbose

Schema validation for cleanup planning:
Create customer_cleanup.json:
{
"rules": [
{
"field": "customer_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "age",
"type": "string",
"desired_type": "integer",
"required": false,
"min": 18,
"max": 100
},
{
"field": "registration_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true
},
{
"field": "is_premium",
"type": "string",
"desired_type": "boolean",
"required": true
}
]
}

# Analyze what can be cleaned up
vlite schema --conn customers.csv --rules customer_cleanup.json --verbose

This tells you exactly which customer records have data quality issues and what types of problems exist.
Background: Validate daily order data before processing payments and shipments.
# Comprehensive order validation
vlite check --conn "mysql://user:pass@db:3306/ecommerce" --table orders \
--rule "not_null(order_id)" \
--rule "unique(order_id)" \
--rule "not_null(customer_id)" \
--rule "range(total_amount, 0.01, 999999.99)" \
--rule "enum(status, 'pending', 'paid', 'shipped', 'delivered', 'cancelled')" \
--rule "date_format(created_at, 'YYYY-MM-DD HH:MI:SS')" \
--verbose

Background: Monthly financial reports come in Excel format and need validation before importing into the accounting system.
Excel file structure (monthly_report.xlsx):
- Sheet: "Revenue"
- Columns: transaction_id, amount, currency, transaction_date, category
Multi-sheet validation:
First, check what sheets are available:
vlite schema --conn monthly_report.xlsx --rules basic_schema.json

Then validate specific sheets:
# Validate Revenue sheet
vlite schema --conn monthly_report.xlsx --table "Revenue" --rules revenue_schema.json
# Validate Expenses sheet
vlite schema --conn monthly_report.xlsx --table "Expenses" --rules expense_schema.jsonRevenue validation schema (revenue_schema.json):
{
"rules": [
{
"field": "transaction_id",
"type": "string",
"desired_type": "string(20)",
"required": true
},
{
"field": "amount",
"type": "string",
"desired_type": "float(15,2)",
"required": true,
"min": 0.01
},
{
"field": "transaction_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true
}
],
"strict_mode": true
}

Background: Before migrating from a legacy system to a modern database, you need to validate that all data can be properly converted and identify cleanup requirements.
Legacy system data characteristics:
- Everything stored as VARCHAR
- Inconsistent date formats
- Mixed boolean representations
- Unreliable numeric formatting
Migration readiness schema (migration_readiness.json):
{
"users": {
"rules": [
{
"field": "user_id",
"type": "string(50)",
"desired_type": "integer",
"required": true
},
{
"field": "email",
"type": "string(500)",
"desired_type": "string(255)",
"required": true
},
{
"field": "created_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')", // Target: standard date format
"required": true
},
{
"field": "last_login",
"type": "string",
"desired_type": "datetime('YYYY-MM-DD HH:MI:SS')", // Target: standard datetime
"required": false
},
{
"field": "is_active",
"type": "string",
"desired_type": "boolean",
"required": true
}
],
"strict_mode": false
}
}

# Analyze migration readiness
vlite schema --conn "mysql://legacy:pass@old-db:3306/legacy_db" \
--rules migration_readiness.json --output json > migration_report.json
# Get detailed conversion analysis
vlite schema --conn "mysql://legacy:pass@old-db:3306/legacy_db" \
--rules migration_readiness.json --verbose

Expected output:
Migration Readiness Report
==========================
Table: users
Total records: 10,543
Type conversion analysis:
┌─────────────────┬──────────┬──────────┬──────────┬─────────────────┐
│ Field │ From │ To │ Status │ Issues │
├─────────────────┼──────────┼──────────┼──────────┼─────────────────┤
│ user_id │ string │ integer │ ✅ OK │ - │
│ email │ string │ string │ ⚠️ WARN │ 12 too long │
│ created_date │ string │ date │ ⚠️ WARN │ 45 bad formats │
│ last_login │ string │ datetime │ ❌ ISSUES │ 234 bad formats │
│ is_active │ string │ boolean │ ⚠️ WARN │ 8 unclear values│
└─────────────────┴──────────┴──────────┴──────────┴─────────────────┘
Field: created_date
✓ Field exists (string)
✓ Non-null constraint
✗ Type conversion validation (string → date('YYYY-MM-DD')): 45 incompatible records
Failure samples:
Row 12: "2023/12/25" (slash format, needs standardization)
Row 34: "Dec 25, 2023" (English format)
Row 67: "25.12.2023" (European format)
Recommended cleanup:
1. Standardize date formats to YYYY-MM-DD
2. Trim email fields that exceed 255 characters
3. Normalize boolean values (true/false only)
4. Fix malformed datetime values
This gives you a complete roadmap for data cleanup before migration.
Background: You inherit a legacy system with years of accumulated data quality issues. You need to understand the scope of cleanup required.
Legacy data issues:
- Mixed encodings
- Inconsistent data entry
- No validation for years
- Multiple date formats
- Currency symbols in numeric fields
Cleanup assessment schema (legacy_cleanup.json):
{
"rules": [
{
"field": "customer_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "first_name",
"type": "string(1000)",
"desired_type": "string(50)",
"required": true
},
{
"field": "salary",
"type": "string",
"desired_type": "float(10,2)",
"required": false,
"min": 0
},
{
"field": "hire_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true
},
{
"field": "department_id",
"type": "string",
"desired_type": "integer",
"required": true
}
],
"strict_mode": false
}

Cleanup process:
# Step 1: Assess current state
vlite schema --conn legacy_data.csv --rules legacy_cleanup.json \
--output json > cleanup_assessment.json
# Step 2: Get detailed samples
vlite schema --conn legacy_data.csv --rules legacy_cleanup.json \
--verbose > cleanup_details.txt
# Step 3: Validate after initial cleanup
# (after running data cleaning scripts)
vlite schema --conn cleaned_data.csv --rules legacy_cleanup.json \
--verbose

Sample output showing improvement:
Before cleanup:
salary field: 1,234 records with currency symbols ($, €, £)
hire_date field: 567 records with inconsistent formats
After cleanup:
salary field: 23 records still need manual review
hire_date field: 12 records still need manual review
Background: Validate data received from external APIs before processing.
API validation schema (api_validation.json):
{
"rules": [
{
"field": "user_id",
"type": "string",
"desired_type": "integer",
"required": true
},
{
"field": "timestamp",
"type": "string",
"desired_type": "datetime('YYYY-MM-DD HH:MI:SS')", // Internal: standard format
"required": true
},
{
"field": "amount",
"type": "string",
"desired_type": "float(12,2)",
"required": true,
"min": 0
}
]
}

# Validate API response data
vlite schema --conn api_response.json --rules api_validation.json

Background: Process multiple files in a batch operation.
#!/bin/bash
# validate_batch.sh
for file in data_files/*.csv; do
echo "Validating $file..."
vlite schema --conn "$file" --rules batch_schema.json \
--output json > "reports/$(basename "$file" .csv)_report.json"
done
echo "Validation complete. Check reports/ directory for results."Background: Integrate data quality checks into your CI/CD pipeline to catch data compatibility issues before they reach production.
Create .github/workflows/data-validation.yml:
name: Data Quality and Type Conversion Validation

on:
  push:
    paths:
      - 'data/**'
      - 'schemas/**'
  pull_request:
    paths:
      - 'data/**'
      - 'schemas/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install ValidateLite
        run: pip install validatelite

      - name: Basic data quality validation
        run: |
          vlite check --conn data/customers.csv --table customers \
            --rules schemas/customer_rules.json

      - name: Type conversion feasibility analysis
        run: |
          vlite schema --conn data/legacy_data.xlsx \
            --rules schemas/modernization_schema.json \
            --output json > type_conversion_report.json

      - name: Check conversion compatibility
        run: |
          # Check for incompatible type conversions
          python scripts/check_conversion_feasibility.py type_conversion_report.json

      - name: Upload validation reports
        uses: actions/upload-artifact@v2
        with:
          name: validation-reports
          path: |
            type_conversion_report.json
            validation_*.log

Helper script (scripts/check_conversion_feasibility.py):
#!/usr/bin/env python3
import json
import sys

def check_conversion_feasibility(report_file):
    """Check type conversion feasibility"""
    with open(report_file, 'r') as f:
        report = json.load(f)

    failed_conversions = []
    for result in report.get('results', []):
        if result.get('rule_type') == 'DESIRED_TYPE' and result.get('status') == 'FAILED':
            failed_conversions.append({
                'field': result.get('column'),
                'failed_count': result.get('failed_count'),
                'total_count': result.get('total_count'),
                'failure_rate': result.get('failed_count', 0) / result.get('total_count', 1)
            })

    if failed_conversions:
        print("❌ Type conversion issues found:")
        for conversion in failed_conversions:
            print(f" - Field {conversion['field']}: {conversion['failed_count']}/{conversion['total_count']} "
                  f"records cannot convert ({conversion['failure_rate']:.1%})")

        # Block merge if failure rate exceeds threshold
        max_failure_rate = max(c['failure_rate'] for c in failed_conversions)
        if max_failure_rate > 0.05:  # 5% threshold
            print(f"❌ Type conversion failure rate {max_failure_rate:.1%} exceeds 5% threshold. Blocking merge.")
            sys.exit(1)
        else:
            print(f"⚠️ Type conversion failure rate {max_failure_rate:.1%} is within acceptable range.")
    else:
        print("✅ All type conversion validations passed.")

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: python check_conversion_feasibility.py <report.json>")
        sys.exit(1)
    check_conversion_feasibility(sys.argv[1])
- Early problem detection - Find data compatibility issues before code merge
- Automated validation - No manual data quality checks needed
- Block problematic merges - Prevent incompatible data changes from reaching main branch
- Detailed reporting - Help developers understand specific issues
Background: Data scientists need to preprocess raw data including cleaning, type conversion, and format standardization. Before starting model development, it's crucial to validate data quality and assess conversion feasibility.
Raw survey dataset (raw_survey_data.csv):
id,age,income,satisfaction_score,join_date,is_premium,location
1,25.5,50000.0,8.2,2023-01-15,True,New York
2,,"60K",7.8,15/01/2023,1,California
3,thirty,75000,nine,2023-1-20,yes,Texas
4,45,$85000,6.5,2023/01/22,0,Florida
5,52,95000.50,4.9,Jan 25 2023,false,Washington

Preprocessing requirements:
- Age field needs conversion to integer (handle text and decimals)
- Income field needs standardization to numeric (remove currency symbols and letters)
- Satisfaction scores need conversion to standard numeric values
- Date formats need standardization
- Boolean fields need standardization
- Geographic locations need standardization
Create preprocessing validation schema (preprocessing_schema.json):
{
"rules": [
{
"field": "id",
"type": "string",
"desired_type": "integer",
"required": true,
"description": "Unique user identifier"
},
{
"field": "age",
"type": "string",
"desired_type": "integer",
"required": true,
"min": 18,
"max": 100,
"description": "Age needs conversion to integer, range 18-100"
},
{
"field": "income",
"type": "string",
"desired_type": "float(10,2)",
"required": true,
"min": 0,
"description": "Income needs conversion to numeric, remove non-digit characters"
},
{
"field": "satisfaction_score",
"type": "string",
"desired_type": "float(3,1)",
"required": true,
"min": 1.0,
"max": 10.0,
"description": "Satisfaction score, 1-10 scale"
},
{
"field": "join_date",
"type": "string",
"desired_type": "date('YYYY-MM-DD')",
"required": true,
"description": "Join date, standardize to YYYY-MM-DD format"
},
{
"field": "is_premium",
"type": "string",
"desired_type": "boolean",
"required": true,
"description": "Premium status, standardize to true/false"
},
{
"field": "location",
"type": "string(50)",
"desired_type": "string(20)",
"required": true,
"description": "Geographic location, standardize length"
}
],
"strict_mode": false,
"case_insensitive": true
}

Run preprocessing validation:
# Step 1: Check data quality and conversion feasibility
vlite schema --conn raw_survey_data.csv \
--rules preprocessing_schema.json \
--output json > preprocessing_report.json
# Step 2: Analyze conversion issues
vlite schema --conn raw_survey_data.csv \
--rules preprocessing_schema.json \
--verbose

Expected output:
Data Preprocessing Validation Report
====================================
Table: raw_survey_data
Total records: 5
Conversion validation results:
┌─────────────────────┬──────────┬──────────┬──────────┬────────────────┐
│ Field │ From │ To │ Status │ Issues │
├─────────────────────┼──────────┼──────────┼──────────┼────────────────┤
│ id │ string │ integer │ ✅ OK │ - │
│ age │ string │ integer │ ⚠️ PARTIAL│ 2 text values │
│ income │ string │ float │ ⚠️ PARTIAL│ Format issues │
│ satisfaction_score │ string │ float │ ⚠️ PARTIAL│ 1 text value │
│ join_date │ string │ date │ ❌ ISSUES │ Multiple formats│
│ is_premium │ string │ boolean │ ⚠️ PARTIAL│ Format issues │
│ location │ string │ string │ ✅ OK │ - │
└─────────────────────┴──────────┴──────────┴──────────┴────────────────┘
Detailed issue analysis:
• age field: Row 2 (empty), Row 3 ("thirty") cannot convert to integer
• income field: Row 2 ("60K"), Row 4 ("$85000") contain non-numeric characters
• satisfaction_score field: Row 3 ("nine") cannot convert to numeric
• join_date field: Detected 3 different date formats, needs standardization
• is_premium field: Multiple boolean representations (True/1/yes/0/false)
Data cleaning recommendations:
1. Establish missing value strategy for age field
2. Standardize income field format, remove symbols and units
3. Create text-to-numeric mapping rules (nine→9)
4. Standardize date format parsing rules
5. Unify boolean value representation standards
Create data cleaning script (clean_data.py):
import pandas as pd
import re
from datetime import datetime

def clean_survey_data(input_file, output_file):
    """Clean survey data"""
    df = pd.read_csv(input_file)

    # Clean age field
    def clean_age(age):
        if pd.isna(age):
            return None
        if str(age).lower() == 'thirty':
            return 30
        try:
            return int(float(str(age)))
        except ValueError:
            return None

    # Clean income field
    def clean_income(income):
        if pd.isna(income):
            return None
        # Remove all non-digit characters (except decimal point)
        cleaned = re.sub(r'[^\d.]', '', str(income))
        try:
            return float(cleaned)
        except ValueError:
            return None

    # Clean satisfaction score
    def clean_satisfaction(score):
        if pd.isna(score):
            return None
        if str(score).lower() == 'nine':
            return 9.0
        try:
            return float(score)
        except ValueError:
            return None

    # Clean date field
    def clean_date(date_str):
        if pd.isna(date_str):
            return None
        # Try multiple date formats
        formats = ['%Y-%m-%d', '%d/%m/%Y', '%Y/%m/%d', '%b %d %Y']
        for fmt in formats:
            try:
                return datetime.strptime(str(date_str), fmt).strftime('%Y-%m-%d')
            except ValueError:
                continue
        return None

    # Clean boolean field
    def clean_boolean(value):
        if pd.isna(value):
            return False
        str_val = str(value).lower()
        return str_val in ['true', '1', 'yes', 'y']

    # Apply cleaning rules
    df['age'] = df['age'].apply(clean_age)
    df['income'] = df['income'].apply(clean_income)
    df['satisfaction_score'] = df['satisfaction_score'].apply(clean_satisfaction)
    df['join_date'] = df['join_date'].apply(clean_date)
    df['is_premium'] = df['is_premium'].apply(clean_boolean)
    df['location'] = df['location'].str.strip()

    # Save cleaned data
    df.to_csv(output_file, index=False)
    print(f"Cleaning complete, results saved to {output_file}")

if __name__ == '__main__':
    clean_survey_data('raw_survey_data.csv', 'cleaned_survey_data.csv')

Validate cleaned data:
# Validate cleaned data
vlite schema --conn cleaned_survey_data.csv \
--rules preprocessing_schema.json \
--verbose
# Output should show all conversion validations passing

Workflow script (data_preprocessing_workflow.sh):
#!/bin/bash
echo "Starting data preprocessing workflow..."
# 1. Initial data quality assessment
echo "Step 1: Assess raw data quality"
vlite schema --conn raw_survey_data.csv \
--rules preprocessing_schema.json \
--output json > initial_assessment.json
# 2. Execute data cleaning
echo "Step 2: Execute data cleaning"
python clean_data.py
# 3. Validate cleaning results
echo "Step 3: Validate cleaning results"
vlite schema --conn cleaned_survey_data.csv \
--rules preprocessing_schema.json \
--output json > final_validation.json
# 4. Generate data quality report
echo "Step 4: Generate data quality report"
python generate_quality_report.py initial_assessment.json final_validation.json
echo "Data preprocessing workflow complete!"This scenario shows data scientists how to use ValidateLite for:
- Data quality assessment - Understanding raw data issues
- Conversion feasibility analysis - Evaluating cleaning strategy effectiveness
- Cleaning validation - Ensuring processed data meets modeling requirements
- Automated workflow - Standardized data preprocessing pipeline
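The `generate_quality_report.py` script invoked in the workflow above is not part of ValidateLite. A minimal sketch, assuming the `validation_summary` layout shown in the JSON output section below, could look like this:

```python
import json
import sys

def summarize(results: dict) -> dict:
    """Pull the headline numbers out of a `vlite ... --output json` payload."""
    summary = results["validation_summary"]
    return {"failed": summary["failed_rules"], "status": summary["overall_status"]}

def report(initial: dict, final: dict) -> str:
    """Compare the before/after assessments and format a short report."""
    before, after = summarize(initial), summarize(final)
    return "\n".join([
        "Data Quality Report",
        f"Rules failing before cleaning: {before['failed']}",
        f"Rules failing after cleaning:  {after['failed']}",
        f"Rules fixed by cleaning:       {before['failed'] - after['failed']}",
        f"Final status: {after['status']}",
    ])

if __name__ == "__main__":
    # Usage: python generate_quality_report.py initial.json final.json
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        print(report(json.load(f1), json.load(f2)))
```

Keeping the comparison logic in a pure `report()` function makes the script easy to unit-test without fixture files.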
ValidateLite provides two output formats: a human-readable table and machine-readable JSON. Understanding the output helps you quickly identify data quality issues.
Default table output provides a clear overview:
Data Validation Results
=======================
Connection: customers.csv
Table: customers
Rules executed: 5
Validation time: 1.23s
┌─────────────────┬──────────┬──────────┬──────────┬─────────────────┐
│ Rule │ Type │ Status │ Failed │ Details │
├─────────────────┼──────────┼──────────┼──────────┼─────────────────┤
│ email_required │ NOT_NULL │ ✅ PASS │ 0/1000 │ All records OK │
│ unique_email │ UNIQUE │ ❌ FAIL │ 12/1000 │ 12 duplicates │
│ valid_age │ RANGE │ ⚠️ WARN │ 3/1000 │ 3 out of range │
│ phone_format │ REGEX │ ✅ PASS │ 0/1000 │ All valid │
│ status_enum │ ENUM │ ❌ FAIL │ 25/1000 │ Invalid values │
└─────────────────┴──────────┴──────────┴──────────┴─────────────────┘
Overall Status: FAILED (2 rules failed)
Verbose table output includes sample data:
vlite check --conn data.csv --table users --rule "unique(email)" --verbose

Validation Results (Verbose)
============================
Rule: unique_email
Type: UNIQUE
Status: ❌ FAILED
Failed records: 12 out of 1000 total
Sample failures:
┌─────┬─────────────────────┬─────────────┐
│ Row │ Email │ Occurrences │
├─────┼─────────────────────┼─────────────┤
│ 145 │ john@email.com │ 3 │
│ 298 │ mary@email.com │ 2 │
│ 456 │ bob@company.com │ 2 │
│ 789 │ admin@system.com │ 5 │
└─────┴─────────────────────┴─────────────┘
Recommendation: Review duplicate email addresses and decide on deduplication strategy.
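Acting on that recommendation is up to you, but for CSV sources a small pandas helper (not part of ValidateLite; keeping the first occurrence is just one possible strategy) can apply the deduplication:

```python
import pandas as pd

def dedupe_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with repeated email addresses, keeping the first occurrence."""
    return df.drop_duplicates(subset=["email"], keep="first")

# Usage: clean the file flagged by the unique(email) rule, then re-validate.
# df = pd.read_csv("data.csv")
# dedupe_emails(df).to_csv("data_deduped.csv", index=False)
```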
JSON output is perfect for automation and integration:
vlite schema --conn data.csv --rules schema.json --output json

{
  "validation_summary": {
    "connection": "data.csv",
    "table": "users",
    "total_rules": 5,
    "passed_rules": 3,
    "failed_rules": 2,
    "warning_rules": 0,
    "validation_time": "1.23s",
    "overall_status": "FAILED"
  },
  "results": [
    {
      "rule_id": "email_required",
      "rule_type": "NOT_NULL",
      "column": "email",
      "status": "PASSED",
      "total_count": 1000,
      "failed_count": 0,
      "failure_rate": 0.0,
      "message": "All records have non-null email values"
    },
    {
      "rule_id": "email_unique",
      "rule_type": "UNIQUE",
      "column": "email",
      "status": "FAILED",
      "total_count": 1000,
      "failed_count": 12,
      "failure_rate": 0.012,
      "message": "Found 12 duplicate email addresses",
      "sample_data": [
        {"row": 145, "email": "john@email.com", "occurrences": 3},
        {"row": 298, "email": "mary@email.com", "occurrences": 2}
      ]
    },
    {
      "rule_id": "salary_conversion",
      "rule_type": "DESIRED_TYPE",
      "column": "salary",
      "status": "FAILED",
      "current_type": "string",
      "desired_type": "float(10,2)",
      "total_count": 1000,
      "failed_count": 45,
      "failure_rate": 0.045,
      "message": "45 records cannot be converted from string to float(10,2)",
      "conversion_analysis": {
        "compatible_records": 955,
        "incompatible_records": 45,
        "common_issues": [
          "Currency symbols ($, €, £)",
          "Thousands separators (,)",
          "Text values (N/A, TBD)"
        ]
      }
    }
  ]
}

ValidateLite uses clear exit codes for automation:
| Exit Code | Meaning | Description |
|---|---|---|
| 0 | Success | All validations passed |
| 1 | Validation Failed | One or more rules failed |
| 2 | Usage Error | Invalid command line arguments |
| 3 | Connection Error | Cannot connect to data source |
| 4 | File Error | File not found or permission issues |
| 5 | Configuration Error | Invalid schema or rule format |
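These codes can be consumed from any language, not just shell. A Python sketch (assuming `vlite` is on your `PATH`) that runs a check and maps the return code back to the table above:

```python
import subprocess

# Documented vlite exit codes (see the table above).
EXIT_MEANINGS = {
    0: "Success - all validations passed",
    1: "Validation failed - one or more rules failed",
    2: "Usage error - invalid command line arguments",
    3: "Connection error - cannot connect to data source",
    4: "File error - file not found or permission issues",
    5: "Configuration error - invalid schema or rule format",
}

def describe_exit(code):
    """Translate a vlite exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code: {code}")

def run_check(conn, table, rule):
    """Run a single vlite check and return (exit_code, meaning)."""
    result = subprocess.run(
        ["vlite", "check", "--conn", conn, "--table", table, "--rule", rule]
    )
    return result.returncode, describe_exit(result.returncode)
```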
Using exit codes in scripts:
#!/bin/bash
vlite check --conn data.csv --table users --rule "not_null(email)"
exit_code=$?
case $exit_code in
    0)
        echo "✅ Data validation passed"
        ;;
    1)
        echo "❌ Data validation failed - check the output above"
        exit 1
        ;;
    *)
        echo "💥 Validation error (code: $exit_code)"
        exit $exit_code
        ;;
esac

Save results to files:
# Save table output
vlite check --conn data.csv --table users --rule "unique(email)" > validation_report.txt
# Save JSON output
vlite schema --conn data.csv --rules schema.json --output json > results.json
# Save both stdout and stderr
vlite check --conn data.csv --table users --rule "unique(email)" &> full_output.log
# Append to existing files
vlite check --conn data.csv --table users --rule "range(age, 0, 120)" >> daily_checks.log

Parse JSON results:
import json

# Load validation results
with open('results.json', 'r') as f:
    results = json.load(f)

# Check overall status
if results['validation_summary']['overall_status'] == 'FAILED':
    print("Validation failed!")

# Get failed rules
failed_rules = [r for r in results['results'] if r['status'] == 'FAILED']
for rule in failed_rules:
    print(f"Rule {rule['rule_id']}: {rule['failed_count']} failures")

ValidateLite supports several configuration methods, from simple command-line parameters to full configuration files, to suit different usage scenarios.
Database connections:
# MySQL connection
export DB_HOST="production-db.company.com"
export DB_USER="data_analyst"
export DB_PASSWORD="secure_password"
export DB_NAME="analytics"
export MYSQL_URL="mysql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:3306/${DB_NAME}"
# PostgreSQL connection
export PG_HOST="warehouse.company.com"
export PG_USER="reporting_user"
export PG_PASSWORD="another_secure_password"
export PG_NAME="data_warehouse"
export POSTGRES_URL="postgresql://${PG_USER}:${PG_PASSWORD}@${PG_HOST}:5432/${PG_NAME}"
# Use in commands
vlite check --conn "$MYSQL_URL" --table customers --rule "not_null(email)"

Performance tuning:
# Query timeouts (in seconds)
export VLITE_QUERY_TIMEOUT=300
export VLITE_CONNECTION_TIMEOUT=30
# Memory limits
export VLITE_MAX_SAMPLE_SIZE=1000
export VLITE_BATCH_SIZE=10000
# Parallel processing
export VLITE_MAX_WORKERS=4

Advanced connection string options:
# MySQL with SSL
--conn "mysql://user:pass@host:3306/db?ssl_ca=/path/to/ca.pem&ssl_cert=/path/to/cert.pem"
# PostgreSQL with connection pool
--conn "postgresql://user:pass@host:5432/db?pool_size=10&max_overflow=20"
# SQLite with custom timeout
--conn "sqlite:///data.db?timeout=20"

Connection string with table specification:
# Include table name in connection string
--conn "mysql://user:pass@host:3306/database.table_name"
# Override with command line parameter
--conn "mysql://user:pass@host:3306/database.table_name" --table "different_table"

For large datasets:
{
  "performance": {
    "query_timeout": 600,
    "sample_size": 5000,
    "batch_size": 50000,
    "parallel_workers": 8,
    "memory_limit": "2GB"
  },
  "rules": [
    {
      "field": "user_id",
      "type": "string",
      "desired_type": "integer",
      "required": true
    }
  ]
}

For development/testing:
{
  "performance": {
    "query_timeout": 30,
    "sample_size": 100,
    "batch_size": 1000,
    "parallel_workers": 2
  }
}

This section helps you solve common issues when using ValidateLite, especially with type conversion validation.
| Error Message | Possible Cause | Solution |
|---|---|---|
| `Connection timeout` | Database unreachable | Check host, port, and network connectivity |
| `Authentication failed` | Wrong credentials | Verify username and password |
| `Database not found` | Wrong database name | Check database name in connection string |
| `File not found: data.csv` | Wrong file path | Use absolute path or check current directory |
| `Permission denied` | File access rights | Check file permissions or run with proper rights |
| Error Message | Possible Cause | Solution |
|---|---|---|
| `Invalid JSON schema` | Malformed JSON | Validate JSON syntax with a JSON validator |
| `Unknown rule type: INVALID` | Typo in rule type | Use valid rule types: NOT_NULL, UNIQUE, RANGE, etc. |
| `Missing required field: field` | Schema missing field name | Add "field" property to rule definition |
| `Table 'users' not found` | Wrong table name | Check table name and database connection |
| Error Message | Possible Cause | Solution |
|---|---|---|
| `Invalid type syntax: float(10)` | Wrong type definition format | Use the correct format: float(10,2) |
| `Conflicting conversion: datetime to integer` | Impossible type conversion | Check that the desired_type setting is reasonable |
| `Type conversion timeout` | Conversion validation timed out | Increase the conversion_timeout config or reduce the data size |
| `Precision must be greater than scale` | Wrong float precision config | Ensure precision > scale |
Debug connection problems:
# Test basic connectivity
vlite check --conn "mysql://user:pass@host:3306/db" --table "information_schema.tables" --rule "not_null(table_name)"
# Verbose connection debugging
vlite check --conn data.csv --table nonexistent --rule "not_null(id)" --verbose

Common connection string fixes:
# Wrong: Missing protocol
--conn "user:pass@host:3306/database"
# Right: Include protocol
--conn "mysql://user:pass@host:3306/database"
# Wrong: Incorrect port for PostgreSQL
--conn "postgresql://user:pass@host:3306/database"
# Right: Use PostgreSQL default port
--conn "postgresql://user:pass@host:5432/database"
# Wrong: Relative path issues
--conn "data/file.csv"
# Right: Use absolute path
--conn "/full/path/to/data/file.csv"

Debug type conversion issues:
# Check what types are detected
vlite schema --conn data.csv --rules schema.json --verbose
# Test conversion with smaller sample
vlite schema --conn data.csv --rules schema.json --sample-size 100

Common type conversion fixes:
// Wrong: Impossible conversion
{
  "field": "description",
  "type": "string",
  "desired_type": "integer"  // Text cannot become numbers
}

// Right: Reasonable conversion
{
  "field": "description",
  "type": "string(1000)",
  "desired_type": "string(500)"  // Truncate long text
}

// Wrong: Invalid precision/scale
{
  "field": "amount",
  "type": "string",
  "desired_type": "float(2,10)"  // Scale > precision
}

// Right: Valid precision/scale
{
  "field": "amount",
  "type": "string",
  "desired_type": "float(12,2)"  // Precision > scale
}

Handle problematic data:
# Script to identify problematic records
import json

with open('validation_results.json') as f:
    results = json.load(f)

for result in results['results']:
    if result['rule_type'] == 'DESIRED_TYPE' and result['status'] == 'FAILED':
        print(f"Field: {result['column']}")
        print(f"Conversion: {result['current_type']} → {result['desired_type']}")
        print(f"Failed: {result['failed_count']}/{result['total_count']}")
        if 'sample_data' in result:
            print("Sample problematic values:")
            for sample in result['sample_data'][:5]:
                print(f"  Row {sample['row']}: {sample['value']}")
        print()

Get help:
# Show command help
vlite check --help
vlite schema --help
# Show version
vlite --version
# Test with minimal example
vlite check --conn /dev/null --table test --rule "not_null(id)" 2>&1

If you're still having issues, the most common problems are:
- Connection strings - Double-check your database connection details
- File paths - Use absolute paths when in doubt
- Type definitions - Make sure your desired_type conversions make sense
- JSON syntax - Validate your schema files with a JSON checker
ValidateLite is designed to give you clear error messages, so read them carefully - they usually point directly to the problem!
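The checklist above can be partially automated. This pre-flight helper is not part of ValidateLite; it is a sketch that catches the connection-string, file-path, and JSON-syntax items before you ever invoke `vlite`:

```python
import json
import os
import re

def preflight(conn, rules_path=None):
    """Catch common ValidateLite setup mistakes before invoking vlite."""
    issues = []
    if "://" in conn:
        # Database URLs need one of the supported schemes.
        if not re.match(r"^(mysql|postgresql|sqlite)://", conn):
            issues.append(f"Unrecognized scheme in connection string: {conn}")
    elif not os.path.isfile(conn):
        # File sources must exist; relative paths are a frequent culprit.
        issues.append(f"File not found: {conn} (try an absolute path)")
    if rules_path is not None:
        # Schema files must at least be valid JSON.
        try:
            with open(rules_path) as f:
                json.load(f)
        except FileNotFoundError:
            issues.append(f"Rules file not found: {rules_path}")
        except json.JSONDecodeError as exc:
            issues.append(f"Invalid JSON in {rules_path}: {exc}")
    return issues
```

Run it in CI before the real validation step so setup mistakes fail fast with a targeted message instead of a generic exit code.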