CLI-first data validation tool for data engineers
CLI Guide • Python API • Examples • PyPI
DataCheck validates files (CSV, Parquet) and databases (PostgreSQL, MySQL, SQL Server) against rule-based checks. Write validation rules in YAML, run them from the terminal or CI/CD, and fail fast when data doesn't meet expectations.
Built for speed and simplicity, DataCheck runs locally or in pipelines with no external services required. It returns clear pass/fail results with actionable error details.
- Not a data observability platform — No dashboards, SaaS, or continuous monitoring
- Not an anomaly detection system — Rules are explicit, not ML-based
- Not a replacement for dbt tests — Simpler setup, works outside dbt projects
Install:

    pip install datacheck-cli

Create validation.yaml:

    checks:
      - name: user_id_check
        column: user_id
        rules:
          not_null: true
          unique: true

Run validation:

    datacheck validate data.csv -c validation.yaml

For detailed CLI usage, see the CLI Guide.
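To see a failing run, it helps to have a file with known defects. The following is a minimal sketch, not part of DataCheck: the helper name `write_sample_csv`, the `email` column, and the sample rows are all made up for illustration. It writes a small CSV whose `user_id` column contains one duplicate and one empty value, which are the two conditions the quick-start rules target.

```python
import csv
from pathlib import Path

# Hypothetical sample rows (not from the DataCheck docs): one duplicate
# user_id and one empty user_id, so both quick-start rules have work to do.
SAMPLE_ROWS = [
    {"user_id": "1", "email": "a@example.com"},
    {"user_id": "2", "email": "b@example.com"},
    {"user_id": "2", "email": "c@example.com"},  # duplicate user_id
    {"user_id": "", "email": "d@example.com"},   # empty user_id
]


def write_sample_csv(path: Path) -> Path:
    """Write SAMPLE_ROWS to `path` as a CSV file with a header row."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["user_id", "email"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path
```

Calling `write_sample_csv(Path("data.csv"))` produces a file that can be passed to the validate command above. Whether the empty field is reported as a `not_null` failure depends on how DataCheck reads empty CSV cells, which this sketch only assumes.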
- Pipeline gates: Validate data before loading into a warehouse
- CI/CD checks: Fail builds when data quality drops
- Pre-deployment validation: Check exports before sending to external systems
- Local development: Test data quality on your laptop before pushing code
| Category | Formats |
|---|---|
| Files | CSV, Parquet |
| Databases | PostgreSQL, MySQL, SQL Server, SQLite, DuckDB |
| Cloud Storage | AWS S3, Google Cloud Storage, Azure Blob |
| Rule | Description |
|---|---|
| `not_null` | No missing/NULL values |
| `unique` | No duplicate values |
| `min` / `max` | Numeric range bounds |
| `regex` | Pattern matching |
| `allowed_values` | Whitelist validation |
| `data_type` | Type checking (string, int, float, bool, date) |
| `length` | String length constraints |
| `custom` | User-defined functions |
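As a rough illustration of what several of these rules mean, the sketch below expresses them as plain Python functions that return the positions of offending values. This is not DataCheck's implementation: the function names are hypothetical, `None` stands in for a missing value, and the choice to require a full-string regex match and to skip missing values in every rule except `not_null` are assumptions made here.

```python
import re
from typing import Any, Collection, Optional, Sequence


def not_null(values: Sequence[Any]) -> list[int]:
    """Indexes of missing values (None stands in for NULL)."""
    return [i for i, v in enumerate(values) if v is None]


def unique(values: Sequence[Any]) -> list[int]:
    """Indexes of repeated values; the first occurrence is not flagged."""
    seen: set[Any] = set()
    repeats: list[int] = []
    for i, v in enumerate(values):
        if v is None:
            continue  # assumption: missing values are left to not_null
        if v in seen:
            repeats.append(i)
        seen.add(v)
    return repeats


def in_range(values: Sequence[Optional[float]],
             minimum: Optional[float] = None,
             maximum: Optional[float] = None) -> list[int]:
    """Indexes of values below `minimum` or above `maximum` (min / max)."""
    bad: list[int] = []
    for i, v in enumerate(values):
        if v is None:
            continue
        too_low = minimum is not None and v < minimum
        too_high = maximum is not None and v > maximum
        if too_low or too_high:
            bad.append(i)
    return bad


def matches(values: Sequence[Any], pattern: str) -> list[int]:
    """Indexes of values whose text does not fully match `pattern` (regex)."""
    compiled = re.compile(pattern)
    return [i for i, v in enumerate(values)
            if v is not None and compiled.fullmatch(str(v)) is None]


def allowed(values: Sequence[Any], allowed_values: Collection[Any]) -> list[int]:
    """Indexes of values outside `allowed_values`."""
    return [i for i, v in enumerate(values)
            if v is not None and v not in allowed_values]
```

Under these assumptions a rule passes when its function returns an empty list. Returning positions rather than a boolean is one way to support the kind of actionable error detail described earlier.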
- Terminal: Color-coded, human-readable results
- JSON: Structured output for automation
- Slack: Real-time notifications via webhook
- Parallel execution for large datasets (auto-enabled for 10K+ rows)
- Row sampling: random, stratified, or top-N
- Validates 1M rows with 10 rules in 2-3 seconds
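The three sampling modes listed above can be sketched in plain Python as follows. This illustrates the general techniques only and is not DataCheck's code: the function names, the `seed` and `fraction` parameters, and the choice to keep at least one row per stratum are assumptions.

```python
import random
from collections import defaultdict
from typing import Any, Hashable, Optional


def top_n(rows: list[dict[str, Any]], n: int) -> list[dict[str, Any]]:
    """Top-N sampling: keep the first n rows in file order."""
    return rows[:n]


def random_sample(rows: list[dict[str, Any]], n: int,
                  seed: Optional[int] = None) -> list[dict[str, Any]]:
    """Random sampling: n rows drawn without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(rows, min(n, len(rows)))


def stratified_sample(rows: list[dict[str, Any]], key: str, fraction: float,
                      seed: Optional[int] = None) -> list[dict[str, Any]]:
    """Stratified sampling: draw the same fraction from each group of `key`."""
    rng = random.Random(seed)
    groups: dict[Hashable, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample: list[dict[str, Any]] = []
    for members in groups.values():
        # Assumption: every group contributes at least one row.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample
```

Stratified sampling matters for validation because a purely random sample can miss small groups entirely, and rule violations concentrated in those groups would then go unnoticed.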
Installation options:

    # Basic
    pip install datacheck-cli

    # With database support (extras are quoted so shells such as zsh do not expand the brackets)
    pip install "datacheck-cli[postgresql]"
    pip install "datacheck-cli[mysql]"
    pip install "datacheck-cli[mssql]"

    # With cloud storage
    pip install "datacheck-cli[cloud]"

    # All features
    pip install "datacheck-cli[all]"

| Guide | Description |
|---|---|
| CLI Guide | Complete command-line reference |
| Python API | Using DataCheck as a Python library |
| Streaming | Large file streaming (Experimental) |
| Examples | Working examples by topic |
DataCheck uses standard exit codes:
| Code | Meaning |
|---|---|
| `0` | All rules passed |
| `1` | Some rules failed |
| `2` | Configuration error |
| `3` | Data loading error |
| `4` | Unexpected error |
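When the CLI is called from a Python pipeline step, the exit codes in the table can be given names and checked explicitly. The sketch below assumes only the `datacheck validate <file> -c <config>` invocation shown earlier and the five codes listed above. The enum, its member names, and the helper functions are hypothetical and not part of DataCheck.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above; member names are made up here."""
    PASSED = 0
    RULES_FAILED = 1
    CONFIG_ERROR = 2
    DATA_LOADING_ERROR = 3
    UNEXPECTED_ERROR = 4


def describe(code: int) -> str:
    """Turn a numeric exit code into a short human-readable label."""
    try:
        return ExitCode(code).name.replace("_", " ").lower()
    except ValueError:
        return f"unknown exit code {code}"


def run_validation(data_path: str, config_path: str) -> int:
    """Run the CLI and return its exit code without raising on failure."""
    result = subprocess.run(
        ["datacheck", "validate", data_path, "-c", config_path],
        check=False,  # inspect the return code ourselves
    )
    return result.returncode


def is_data_problem(code: int) -> bool:
    """True when the run completed but the data broke at least one rule."""
    return code == ExitCode.RULES_FAILED
```

Separating code `1` from codes `2` to `4` lets a pipeline distinguish bad data, which might be quarantined, from a broken configuration or environment, which usually needs a human to fix.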
GitHub Actions example:

    - name: Validate Data
      run: |
        pip install datacheck-cli
        datacheck validate data.csv -c validation.yaml

- Python: 3.10, 3.11, or 3.12
- Core dependencies: typer, pandas, pyyaml, rich, pyarrow
DataCheck is production-ready for CLI and Python API usage.
Note: The streaming module (`datacheck.streaming`) is experimental and uses a different rule system. See Streaming Guide for details.
To set up a development environment and run the tests:

    git clone https://github.com/squrtech/datacheck.git
    cd datacheck
    poetry install
    poetry run pytest

See CONTRIBUTING.md for contribution guidelines.
Apache License 2.0 - see LICENSE for details.
- Issues: GitHub Issues
- Changelog: CHANGELOG.md
Built for data engineers who need fast, deterministic validation without heavy frameworks.
Copyright 2026 Squrtech