# dataculpa-snapshot

Dataset quality analysis through richness metrics and information-gain based feature selection.

Author: Mitch Haile | Organization: Data Culpa, Inc. | Website: www.dataculpa.com
## Quick Start

```bash
# Install
pip install dataculpa-snapshot

# Analyze a dataset
dataculpa-snapshot data.csv

# Or use the API
python3 -c "
from dataculpa_snapshot import analyze_dataset_richness
import pandas as pd
df = pd.read_csv('data.csv')
result = analyze_dataset_richness(df)
print('Recommended columns:', result['mixes']['lean']['columns'])
"
```

## Table of Contents

- Overview
- Installation
- Features
- Quick Examples
- CLI Usage
- Python API
- Use Cases
- API Reference
- Contributing
- License
## Overview

dataculpa-snapshot provides a comprehensive framework for analyzing dataset quality:
- Column richness (0-1): Combines fill rate with balanced entropy
- Schema analysis: Analyzes diversity in dict/JSON structures
- Dependency detection: Uses normalized mutual information
- Information-gain selection: Greedy algorithm for optimal column ordering
- Mutual exclusivity: Finds columns with non-overlapping nulls and similar dependencies
- Recommendations: Suggests "lean" (efficient) and "max" (comprehensive) column sets
- Distribution characterization: Full PMF retention, type detection, descriptive stats, numeric binning
- Dataset comparison: KL divergence, JS divergence, PSI, L1/TV distance for drift detection
## Installation

```bash
pip install dataculpa-snapshot
```

Or install from source:

```bash
git clone https://github.com/dataculpa/dataculpa-snapshot.git
cd dataculpa-snapshot
pip install -e .
```

For development (tests, linters):

```bash
pip install -e ".[dev]"
```

Requirements:
- Python >= 3.8
- numpy >= 1.20.0
- pandas >= 1.3.0
- scikit-learn >= 1.0.0
Optional: Install Graphviz to render dependency graphs.
## Features

### Column richness

Quantifies column quality with a single score (0-1):

```
richness = fill_rate × balanced_entropy
```

where `balanced_entropy = 4e(1 − e)` peaks at `e = 0.5` (structured but diverse).
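The formula above can be sketched directly with pandas. This is an illustration of the scoring idea only, not the library's `column_richness` implementation (which also handles sampling and type detection):

```python
import numpy as np
import pandas as pd

def sketch_richness(series):
    """Illustrative richness score: fill_rate * balanced entropy."""
    fill_rate = series.notna().mean()
    probs = series.dropna().value_counts(normalize=True)
    if len(probs) <= 1:
        e = 0.0  # constant (or empty) column: no diversity
    else:
        # Shannon entropy, normalized to [0, 1] by log(#distinct values)
        e = float(-np.sum(probs * np.log(probs)) / np.log(len(probs)))
    balanced = 4 * e * (1 - e)  # peaks at e = 0.5
    return float(fill_rate * balanced)

s = pd.Series(["a", "a", "a", "b", None])  # 80% filled, skewed values
print(f"{sketch_richness(s):.3f}")  # → 0.490
```

A fully constant column scores 0 even when completely filled, which matches the intent: high fill with no diversity carries no information.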
### Information-gain selection

Greedy algorithm that orders columns by incremental value:

- Start with the richest column
- Iteratively add the column with the maximum gain: `R_j × (1 - max_dep(j, S))`
- Produces "lean" (90% of gain) and "max" (all columns) mixes
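The greedy loop above fits in a few lines. `sketch_greedy_path` below is an illustrative stand-in for the library's `greedy_info_gain_path` (whose internals may differ), using the stated gain formula:

```python
import numpy as np

def sketch_greedy_path(richness, dep):
    """Illustrative greedy ordering: repeatedly pick the column whose
    marginal gain R_j * (1 - max_dep(j, S)) over the selected set S
    is largest. A sketch, not the library's implementation."""
    selected, remaining = [], set(range(len(richness)))
    while remaining:
        def gain(j):
            if not selected:
                return richness[j]  # first pick: raw richness
            redundancy = max(dep[j][k] for k in selected)
            return richness[j] * (1 - redundancy)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

R = np.array([0.9, 0.8, 0.5])
D = np.array([[1.0, 0.9, 0.1],   # column 1 is nearly redundant with column 0
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(sketch_greedy_path(R, D))  # → [0, 2, 1]
```

Note how the independent column 2 outranks the individually richer but redundant column 1: redundancy discounts richness.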
### Mutual exclusivity detection

Finds column pairs with:

- Non-overlapping nulls: when one has a value, the other is null
- Similar dependencies: both relate to the same columns

Use cases:

- Schema evolution (`old_email` → `new_email`)
- Mutually exclusive alternatives (`home_phone` vs `mobile_phone`)
- Conditional fields (`residential_address` vs `business_address`)
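The "non-overlapping nulls" signal is easy to check by hand. A minimal sketch with a hypothetical `old_email` → `new_email` migration (the library's `analyze_null_overlap` computes this and more):

```python
import pandas as pd

# Hypothetical migration: old_email was phased out as new_email was adopted.
df = pd.DataFrame({
    "old_email": ["a@x.com", "b@x.com", None, None],
    "new_email": [None, None, "c@y.com", "d@y.com"],
})

# Fraction of rows where BOTH columns hold a value; a value near zero
# suggests the pair is mutually exclusive.
both_non_null = (df["old_email"].notna() & df["new_email"].notna()).mean()
print(f"rows with both non-null: {both_non_null:.0%}")  # → 0%
```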
### Distribution characterization and comparison

Profile and compare column distributions across dataset versions:
- Type detection: Automatically classifies columns as categorical, numeric discrete, or numeric continuous
- PMF retention: Stores the full probability mass function (not just summary scalars)
- Descriptive stats: Mean, std, median, quantiles, skewness, kurtosis for numeric columns
- Numeric binning: Quantile or uniform binning for continuous data
- Divergence metrics: JS divergence, KL divergence, PSI, L1/total variation distance
- Category drift: Detects new/dropped categories between versions
- DataFrame-level comparison: One-call comparison across all shared columns
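The divergence metrics above reduce to simple formulas over two aligned PMFs. A minimal sketch of two of them (illustrative only; the library's `compare_profiles` also handles smoothing, bin alignment, and category drift):

```python
import numpy as np

def sketch_js(p, q, eps=1e-10):
    """Jensen-Shannon divergence between two aligned PMFs (natural log)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sketch_psi(p, q, eps=1e-10):
    """Population Stability Index over the same bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum((p - q) * np.log(p / q)))

baseline = [0.5, 0.3, 0.2]  # yesterday's binned distribution
current = [0.2, 0.3, 0.5]   # today's, over the same bins
print(f"JS={sketch_js(baseline, current):.4f}")
print(f"PSI={sketch_psi(baseline, current):.4f}")
```

For this example PSI comes out around 0.55, well above the common 0.2 alert threshold used later in this README, while JS divergence stays small in absolute terms; the two metrics scale differently, so thresholds are metric-specific.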
## Quick Examples

### Analyze a dataset

```python
from dataculpa_snapshot import analyze_dataset_richness
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze_dataset_richness(df, row_sample_frac=0.25)

# Get recommended columns
lean_cols = result["mixes"]["lean"]["columns"]
print(f"Lean mix ({len(lean_cols)} columns):", lean_cols)
```

### Score a single column

```python
from dataculpa_snapshot import column_richness

metrics = column_richness(df["my_column"])
print(f"Richness: {metrics['richness']:.3f}")
print(f"Fill rate: {metrics['fill_rate']:.3f}")
```

### Find mutually exclusive columns

```python
from dataculpa_snapshot import (
    build_dependency_matrix,
    find_mutually_exclusive_columns,
)

dep_matrix = build_dependency_matrix(df, list(df.columns))
pairs = find_mutually_exclusive_columns(
    df, list(df.columns), dep_matrix,
    max_null_overlap=0.3,
    min_dep_similarity=0.7,
)
for pair in pairs[:5]:
    print(f"{pair['col1']} <-> {pair['col2']}: score={pair['mutual_exclusivity_score']:.3f}")
```

### Detect distribution drift

```python
from dataculpa_snapshot import compare_dataframe_profiles

drift = compare_dataframe_profiles(df_v1, df_v2)
print(drift[["column", "js_divergence", "psi", "l1_distance"]])

# Flag columns that have drifted
drifted = drift[drift["js_divergence"] > 0.05]
for _, row in drifted.iterrows():
    print(f"DRIFT: {row['column']} (JS={row['js_divergence']:.4f})")
```

### Compare a single column across versions

```python
from dataculpa_snapshot import characterize_distribution, compare_profiles
import numpy as np

# Profile baseline
prof_baseline = characterize_distribution(df_v1["score"])

# Profile current using baseline's bin edges for alignment
prof_current = characterize_distribution(
    df_v2["score"],
    bin_edges=np.array(prof_baseline["bin_edges"]),
)

result = compare_profiles(prof_baseline, prof_current)
print(f"JS divergence: {result['js_divergence']:.4f}")
print(f"KL divergence: {result['kl_divergence_p_q']:.4f}")
print(f"PSI: {result['psi']:.4f}")
print(f"Mean shift: {result['mean_delta']:.2f}")
```

## CLI Usage

```bash
# Basic analysis
dataculpa-snapshot data.csv

# With options
dataculpa-snapshot data.parquet \
    --row-frac 0.5 \
    --lean-frac 0.85 \
    --output-prefix my_analysis \
    --dep-min-weight 0.3
```

### Output files

| File | Description |
|---|---|
| `{prefix}_column_profile.csv` | Richness metrics per column |
| `{prefix}_dep_matrix.csv` | Pairwise dependency matrix |
| `{prefix}_info_gain_path.csv` | Greedy column ordering |
| `{prefix}_mixes.json` | Lean/max recommendations |
| `{prefix}_column_deps.dot` | Dependency graph (render with Graphviz) |
| `{prefix}_mutual_exclusive.csv` | Mutually exclusive pairs |
### Options

```
--row-frac FRAC          Fraction of rows to sample (default: 0.25)
--col-sample-size N      Per-column sample size
--lean-frac FRAC         Lean mix threshold (default: 0.9)
--output-prefix PREFIX   Output file prefix
--dep-min-weight WEIGHT  Min dependency for graph edges (default: 0.3)
--dep-top-k K            Max edges per node in graph (default: 3)
--dep-max-edges N        Max total edges in graph (default: 200)
```
## Python API

### Full analysis pipeline

```python
from dataculpa_snapshot import analyze_dataset_richness

result = analyze_dataset_richness(
    df,
    row_sample_frac=0.25,
    col_sample_size=10000,
    lean_fraction=0.9,
)

# Access results
col_profile = result["column_profile"]
dep_matrix = result["dependency_matrix"]
mixes = result["mixes"]
```

### Column and schema richness

```python
from dataculpa_snapshot import column_richness, schema_arrangement_richness

# Single column
metrics = column_richness(df["column_name"])

# Dict/JSON column
schema_metrics = schema_arrangement_richness(df["json_column"])
```

### Whole-DataFrame profile

```python
from dataculpa_snapshot import dataframe_richness_profile

col_profile, schema_metrics = dataframe_richness_profile(df)
top_cols = col_profile.sort_values("richness", ascending=False).head(10)
```

### Dependency matrix and graph

```python
from dataculpa_snapshot import build_dependency_matrix, dependency_graphviz

cols = list(df.columns)
dep_matrix = build_dependency_matrix(df, cols)

# Generate graph
dot_src = dependency_graphviz(cols, dep_matrix, min_weight=0.3)
with open("deps.dot", "w") as f:
    f.write(dot_src)
# Render: dot -Tpng deps.dot -o deps.png
```

### Information-gain path and mixes

```python
from dataculpa_snapshot import (
    greedy_info_gain_path,
    choose_core_mixes_from_info_gain,
)
import numpy as np

richness_vec = np.array([col_profile.loc[c, "richness"] for c in cols])
path = greedy_info_gain_path(richness_vec, dep_matrix)
mixes = choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)
print("Lean mix:", mixes["lean"]["columns"])
```

### Ranking given a prior column

```python
from dataculpa_snapshot import rank_columns_given_prior

# What columns to add given you already have user_id?
ranked = rank_columns_given_prior(
    cols=list(df.columns),
    col_profile=col_profile,
    dep_matrix=dep_matrix,
    prior_col_name="user_id",
)
for rec in ranked[:5]:
    print(f"{rec['column']}: {rec['conditional_richness']:.3f}")
```

### Mutual exclusivity

```python
from dataculpa_snapshot import find_mutually_exclusive_columns, analyze_null_overlap

# Find mutually exclusive pairs
pairs = find_mutually_exclusive_columns(
    df,
    cols=list(df.columns),
    dep_matrix=dep_matrix,
    max_null_overlap=0.3,    # Max 30% rows with both non-null
    min_dep_similarity=0.7,  # Min 70% dependency similarity
    top_k=10,
)
for pair in pairs:
    print(f"{pair['col1']} <-> {pair['col2']}")
    print(f"  Score: {pair['mutual_exclusivity_score']:.3f}")
    print(f"  Null overlap: {pair['null_overlap']:.1%}")

# Detailed analysis
detail = analyze_null_overlap(df, "col1", "col2")
print(f"Both non-null: {detail['pct_both_non_null']:.1f}%")
```

## Use Cases

### Feature selection for ML

```python
result = analyze_dataset_richness(df)
features = result["mixes"]["lean"]["columns"]
X_train = df[features]
```

### Data quality auditing

```python
col_profile, _ = dataframe_richness_profile(df)
low_quality = col_profile[col_profile["richness"] < 0.1]
print("Low quality columns:", low_quality.index.tolist())
```

### Schema migration detection

```python
pairs = find_mutually_exclusive_columns(df, cols, dep_matrix)
for pair in pairs:
    if pair['mutual_exclusivity_score'] > 0.8:
        print(f"⚠️ Possible migration: {pair['col1']} → {pair['col2']}")
```

### JSON schema diversity

```python
col_profile, schema_metrics = dataframe_richness_profile(df)
for col, metrics in schema_metrics.items():
    print(f"{col}: {metrics['unique_schema_arrangements']} unique schemas")
```

### Drift monitoring

```python
from dataculpa_snapshot import compare_dataframe_profiles

# Compare yesterday's data to today's
drift = compare_dataframe_profiles(df_yesterday, df_today)

# Alert on significant drift
for _, row in drift.iterrows():
    if row["psi"] > 0.2:
        print(f"ALERT: {row['column']} PSI={row['psi']:.3f}")
        if row.get("new_categories"):
            print(f"  New values appeared: {row['new_categories']}")
```

## API Reference

`column_richness(series, sample_size=None, random_state=42)`
- Compute richness for a single column
- Returns: Dict with `richness`, `fill_rate`, `entropy_norm`, `entropy_balanced`, etc.
`schema_arrangement_richness(series, sample_size=None, random_state=42)`

- Richness over dict/JSON schema arrangements
- Returns: Dict with schema statistics and top arrangements

`dataframe_richness_profile(df, sample_size=None, random_state=42, detect_schema_cols=True)`

- Profile all columns in a DataFrame
- Returns: (col_profile DataFrame, schema_metrics dict)

`analyze_dataset_richness(df, row_sample_frac=0.25, col_sample_size=None, random_state=42, lean_fraction=0.9)`

- Full analysis pipeline
- Returns: Dict with `column_profile`, `dependency_matrix`, `mixes`, etc.

`normalized_mi(x, y)`

- Normalized mutual information between two series (0-1)

`build_dependency_matrix(df, cols)`

- Build pairwise dependency matrix using normalized MI
- Returns: numpy array (n_cols × n_cols)
`dependency_graphviz(cols, dep_matrix, min_weight=0.2, top_k_per_node=3, max_edges=200, graph_name="ColumnDependencies")`

- Generate Graphviz DOT string for dependency graph

`greedy_info_gain_path(richness_vec, dep_matrix)`

- Greedy ordering of columns by information gain
- Returns: List of dicts with cumulative gains

`choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)`

- Extract lean and max mixes from greedy path
- Returns: Dict with `lean`, `max`, and `path` DataFrame

`rank_columns_given_prior(cols, col_profile, dep_matrix, prior_col_name)`

- Rank columns by conditional richness given a prior column
- Returns: List of dicts sorted by conditional richness

`find_mutually_exclusive_columns(df, cols, dep_matrix, max_null_overlap=0.3, min_dep_similarity=0.7, top_k=20)`

- Find columns with non-overlapping nulls and similar dependencies
- Returns: List of dicts with exclusivity scores

`analyze_null_overlap(df, col1, col2)`

- Detailed null overlap analysis between two columns
- Returns: Dict with counts, percentages, and overlap metrics

`detect_column_type(series)`

- Classify column as `'categorical'`, `'numeric_discrete'`, or `'numeric_continuous'`

`bin_numeric_series(series, n_bins=50, strategy='quantile', bin_edges=None)`

- Bin numeric data into a PMF (quantile or uniform strategy)
- Returns: (bin_edges, pmf) arrays
`characterize_distribution(series, pmf_labels=None, pmf_probs=None, n_bins=50, bin_strategy='quantile', bin_edges=None)`

- Full distribution profile: type, PMF, fill rate, unique count
- Numeric columns also get: mean, std, median, quantiles, skewness, kurtosis, bin edges
- Pass `bin_edges` from a reference profile for aligned comparison

`compare_profiles(profile_p, profile_q, smoothing=1e-10)`

- Compare two column profiles (from `characterize_distribution`)
- Returns: Dict with `js_divergence`, `kl_divergence_p_q`, `psi`, `l1_distance`, `fill_rate_delta`, `new_categories`, `dropped_categories`, and numeric stat deltas

`compare_dataframe_profiles(df_p, df_q, sample_size=None, random_state=42, n_bins=50)`

- Compare two DataFrames column-by-column on shared columns
- Automatically aligns numeric bin edges from P to Q
- Returns: DataFrame with one row per column, all divergence metrics
## Contributing

Contributions welcome! Please:

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Make your changes
- Run tests and linters: `pytest`, `black .`, `flake8`
- Commit: `git commit -m "Add feature"`
- Push and create a Pull Request
Development setup:

```bash
git clone https://github.com/dataculpa/dataculpa-snapshot.git
cd dataculpa-snapshot
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
```

Code style:

- Follow PEP 8
- Use Black for formatting (line length: 88)
- Add type hints where appropriate
- Write docstrings for public functions
## License

Apache License 2.0 - see the LICENSE file for details.

## Links
- Website: www.dataculpa.com
- GitHub: github.com/dataculpa/dataculpa-snapshot
- Issues: GitHub Issues
## Citation

```bibtex
@software{dataculpa_snapshot,
  title = {dataculpa-snapshot: Dataset Quality Analysis and Feature Selection},
  author = {Haile, Mitch},
  organization = {Data Culpa, Inc.},
  year = {2025},
  url = {https://www.dataculpa.com}
}
```

---

Version 0.2.0 | Built with ❤️ by Data Culpa, Inc.