
CLI-first data validation tool for engineers to quickly check files, databases, and pipelines against YAML-defined rules. Lightweight, fast, and deterministic.


Squrtech/datacheck


CLI-first data validation tool for data engineers

CI | Python 3.10+ | License: Apache 2.0 | PyPI version

CLI Guide | Python API | Examples | PyPI


What is DataCheck?

DataCheck validates files (CSV, Parquet) and databases (including PostgreSQL, MySQL, and SQL Server) against rule-based checks. Write validation rules in YAML, run them from the terminal or in CI/CD, and fail fast when data doesn't meet expectations.

Built for speed and simplicity, DataCheck runs locally or in pipelines with no external services required. It returns clear pass/fail results with actionable error details.

What DataCheck is NOT

  • Not a data observability platform — No dashboards, SaaS, or continuous monitoring
  • Not an anomaly detection system — Rules are explicit, not ML-based
  • Not a replacement for dbt tests — Simpler setup, works outside dbt projects

Quick Start

pip install datacheck-cli

Create validation.yaml:

checks:
  - name: user_id_check
    column: user_id
    rules:
      not_null: true
      unique: true

Run validation:

datacheck validate data.csv -c validation.yaml
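To try the command above without real data, a small input file helps. The following is a minimal sketch that writes a hypothetical data.csv using only the Python standard library. The email column and all row values are invented for illustration, and it is an assumption that an empty CSV field counts as a missing value for not_null.

```python
import csv

# Hypothetical sample rows: one duplicate and one empty user_id, chosen so
# that the not_null and unique rules in validation.yaml have something to flag.
SAMPLE_ROWS = [
    {"user_id": "1", "email": "a@example.com"},
    {"user_id": "2", "email": "b@example.com"},
    {"user_id": "2", "email": "c@example.com"},  # duplicate user_id
    {"user_id": "", "email": "d@example.com"},   # empty user_id
]


def write_sample_csv(path="data.csv"):
    """Write the sample rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "email"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


if __name__ == "__main__":
    write_sample_csv()
```

Running the script and then the validate command should exercise both rules on the user_id column.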

For detailed CLI usage, see the CLI Guide.


Use Cases

  • Pipeline gates: Validate data before loading into a warehouse
  • CI/CD checks: Fail builds when data quality drops
  • Pre-deployment validation: Check exports before sending to external systems
  • Local development: Test data quality on your laptop before pushing code

Features

Supported Data Sources

Category       Formats
Files          CSV, Parquet
Databases      PostgreSQL, MySQL, SQL Server, SQLite, DuckDB
Cloud Storage  AWS S3, Google Cloud Storage, Azure Blob

Validation Rules

Rule            Description
not_null        No missing/NULL values
unique          No duplicate values
min / max       Numeric range bounds
regex           Pattern matching
allowed_values  Whitelist validation
data_type       Type checking (string, int, float, bool, date)
length          String length constraints
custom          User-defined functions
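To make the rule descriptions above concrete, here is a minimal sketch of what a few of them mean, written in plain Python. It is not DataCheck's implementation. The function names, the inclusive bounds, the full-string regex match, and the treatment of empty strings as missing are all assumptions made for illustration.

```python
import re


def not_null(values):
    """True when no value is missing (None or empty string, by assumption)."""
    return all(v is not None and v != "" for v in values)


def unique(values):
    """True when no value appears more than once."""
    return len(set(values)) == len(values)


def in_range(values, minimum=None, maximum=None):
    """True when every value lies within the bounds that are set (inclusive)."""
    return all(
        (minimum is None or v >= minimum) and (maximum is None or v <= maximum)
        for v in values
    )


def matches_regex(values, pattern):
    """True when every string matches the whole pattern."""
    compiled = re.compile(pattern)
    return all(compiled.fullmatch(v) is not None for v in values)


def allowed_values(values, allowed):
    """True when every value belongs to the allowed set."""
    allowed_set = set(allowed)
    return all(v in allowed_set for v in values)


# Hypothetical columns used only to exercise the helpers.
user_ids = [1, 2, 3]
ages = [25, 41, 130]
statuses = ["active", "inactive", "active"]

results = {
    "user_id not_null": not_null(user_ids),
    "user_id unique": unique(user_ids),
    "age in [0, 120]": in_range(ages, minimum=0, maximum=120),
    "status allowed": allowed_values(statuses, ["active", "inactive"]),
}
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

In this sample only the age check fails, because 130 exceeds the upper bound of 120.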

Output

  • Terminal: Color-coded, human-readable results
  • JSON: Structured output for automation
  • Slack: Real-time notifications via webhook

Performance

  • Parallel execution for large datasets (auto-enabled for 10K+ rows)
  • Row sampling: random, stratified, or top-N
  • Validates 1M rows with 10 rules in 2-3 seconds
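The three sampling strategies named above differ in which rows they keep. The following plain-Python sketch shows one way to read each of them. It is an illustration under assumptions, not DataCheck's sampling code: the function names, the fixed seed, and the rule of keeping at least one row per group are all hypothetical choices.

```python
import random
from collections import defaultdict


def random_sample(rows, n, seed=0):
    """Pick up to n rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from each group so small groups stay represented."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group, and never more than the group holds.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


def top_n(rows, n):
    """Take the first n rows in their original order."""
    return rows[:n]


# Hypothetical data: 90 rows for "US" and 10 rows for "NZ".
rows = [{"id": i, "country": "US" if i % 10 else "NZ"} for i in range(100)]

print(len(random_sample(rows, 5)))
print(sorted({r["country"] for r in stratified_sample(rows, "country", 0.1)}))
print([r["id"] for r in top_n(rows, 3)])
```

A purely random sample of a few rows can miss the small "NZ" group entirely, which is the case stratified sampling is meant to cover. Top-N is the cheapest option but only reflects the start of the file.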

Installation

# Basic
pip install datacheck-cli

# With database support (quotes stop shells such as zsh from expanding the brackets)
pip install "datacheck-cli[postgresql]"
pip install "datacheck-cli[mysql]"
pip install "datacheck-cli[mssql]"

# With cloud storage
pip install "datacheck-cli[cloud]"

# All features
pip install "datacheck-cli[all]"

Documentation

Guide       Description
CLI Guide   Complete command-line reference
Python API  Using DataCheck as a Python library
Streaming   Large file streaming (Experimental)
Examples    Working examples by topic

CI/CD Integration

DataCheck uses standard exit codes:

Code  Meaning
0     All rules passed
1     Some rules failed
2     Configuration error
3     Data loading error
4     Unexpected error

GitHub Actions example:

- name: Validate Data
  run: |
    pip install datacheck-cli
    datacheck validate data.csv -c validation.yaml
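Outside GitHub Actions, the same exit codes can gate a pipeline step from Python. The sketch below shells out to the validate command shown in this README and translates the return code using the table above. The function names are hypothetical, and the sketch deliberately avoids DataCheck's Python API, which is covered in its own guide.

```python
import subprocess
import sys

# Exit codes and meanings as listed in the table above.
EXIT_CODE_MEANINGS = {
    0: "All rules passed",
    1: "Some rules failed",
    2: "Configuration error",
    3: "Data loading error",
    4: "Unexpected error",
}


def describe_exit_code(code):
    """Translate an exit code into its documented meaning."""
    return EXIT_CODE_MEANINGS.get(code, f"Unknown exit code {code}")


def run_gate(data_path, config_path):
    """Run the validate command and return its exit code."""
    completed = subprocess.run(
        ["datacheck", "validate", data_path, "-c", config_path]
    )
    return completed.returncode


def main():
    """Validate data.csv, report the outcome on stderr, and return the code."""
    code = run_gate("data.csv", "validation.yaml")
    print(describe_exit_code(code), file=sys.stderr)
    return code
```

Ending a script with `sys.exit(main())` passes the code on to the calling scheduler or CI runner, so a non-zero result stops the pipeline in the same way as the GitHub Actions step.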

Requirements

  • Python: 3.10, 3.11, or 3.12
  • Core dependencies: typer, pandas, pyyaml, rich, pyarrow

Project Status

DataCheck is production-ready for CLI and Python API usage.

Note: The streaming module (datacheck.streaming) is experimental and uses a different rule system. See the Streaming guide for details.


Development

git clone https://github.com/squrtech/datacheck.git
cd datacheck
poetry install
poetry run pytest

See CONTRIBUTING.md for contribution guidelines.


License

Apache License 2.0 - see LICENSE for details.


Support


Built for data engineers who need fast, deterministic validation without heavy frameworks.

Copyright 2026 Squrtech
