dataculpa-snapshot

Dataset quality analysis through richness metrics and information-gain based feature selection.


Author: Mitch Haile | Organization: Data Culpa, Inc. | Website: www.dataculpa.com


Quick Start

# Install
pip install dataculpa-snapshot

# Analyze a dataset
dataculpa-snapshot data.csv

# Or use the API
python3 -c "
from dataculpa_snapshot import analyze_dataset_richness
import pandas as pd
df = pd.read_csv('data.csv')
result = analyze_dataset_richness(df)
print('Recommended columns:', result['mixes']['lean']['columns'])
"


Overview

dataculpa-snapshot provides a comprehensive framework for analyzing dataset quality:

  • Column richness (0-1): Combines fill rate with balanced entropy
  • Schema analysis: Analyzes diversity in dict/JSON structures
  • Dependency detection: Uses normalized mutual information
  • Information-gain selection: Greedy algorithm for optimal column ordering
  • Mutual exclusivity: Finds columns with non-overlapping nulls and similar dependencies
  • Recommendations: Suggests "lean" (efficient) and "max" (comprehensive) column sets
  • Distribution characterization: Full PMF retention, type detection, descriptive stats, numeric binning
  • Dataset comparison: KL divergence, JS divergence, PSI, L1/TV distance for drift detection

Installation

From PyPI (when published)

pip install dataculpa-snapshot

From Source

git clone https://github.com/dataculpa/dataculpa-snapshot.git
cd dataculpa-snapshot
pip install -e .

Development

pip install -e ".[dev]"

Requirements:

  • Python >= 3.8
  • numpy >= 1.20.0
  • pandas >= 1.3.0
  • scikit-learn >= 1.0.0

Optional: Install Graphviz to render dependency graphs.


Features

1. Column Richness

Quantifies column quality with a single score (0-1):

richness = fill_rate × balanced_entropy

where balanced_entropy = 4e(1 - e), e is the column's normalized entropy, and the product peaks at e = 0.5 (structured but diverse).
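The formula above can be sketched in plain numpy/pandas. This is a hypothetical re-implementation for illustration, not the library's exact code; the handling of constant columns here is a simplifying assumption:

```python
import numpy as np
import pandas as pd

def richness_sketch(series: pd.Series) -> float:
    """Sketch of richness = fill_rate * balanced_entropy."""
    n = len(series)
    non_null = series.dropna()
    fill_rate = len(non_null) / n if n else 0.0
    counts = non_null.value_counts()
    k = len(counts)
    if k <= 1:
        return 0.0  # constant or empty column: no diversity to reward
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    e = entropy / np.log(k)       # normalized entropy in [0, 1]
    balanced = 4 * e * (1 - e)    # peaks at e = 0.5
    return fill_rate * balanced

s = pd.Series(["a", "a", "a", "b", None, "a", "a", "b"])
print(richness_sketch(s))
```

A fully-filled but constant column scores 0, as does an empty one; mid-entropy, well-filled columns score highest.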

2. Information-Gain Selection

Greedy algorithm that orders columns by incremental value:

  1. Start with the richest column
  2. Iteratively add the column with the largest gain: R_j × (1 - max_dep(j, S)), where S is the set of columns already selected
  3. Produce "lean" (first 90% of cumulative gain) and "max" (all columns) mixes
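The steps above can be sketched as follows. This is a simplified illustration under the gain rule quoted above; the library's actual implementation (and its tie-breaking) may differ:

```python
import numpy as np

def greedy_path_sketch(richness, dep):
    """Greedy ordering: next column maximizes R_j * (1 - max dependency to selected)."""
    n = len(richness)
    remaining = set(range(n))
    order = [int(np.argmax(richness))]  # start with the richest column
    remaining.remove(order[0])
    while remaining:
        def gain(j):
            # penalize columns strongly dependent on anything already chosen
            max_dep = max(dep[j][s] for s in order)
            return richness[j] * (1 - max_dep)
        best = max(remaining, key=gain)
        order.append(best)
        remaining.remove(best)
    return order

richness = np.array([0.9, 0.8, 0.5])
dep = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(greedy_path_sketch(richness, dep))  # [0, 2, 1]
```

Note how column 1, despite high standalone richness (0.8), is deferred: its 0.95 dependency on column 0 leaves it almost no incremental gain.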

3. Mutual Exclusivity Detection

Finds column pairs with:

  • Non-overlapping nulls: When one has a value, the other is null
  • Similar dependencies: Both relate to the same columns

Use cases:

  • Schema evolution (old_email → new_email)
  • Mutually exclusive alternatives (home_phone vs mobile_phone)
  • Conditional fields (residential_address vs business_address)

4. Distribution Characterization & Comparison (NEW v0.3.0)

Profile and compare column distributions across dataset versions:

  • Type detection: Automatically classifies columns as categorical, numeric discrete, or numeric continuous
  • PMF retention: Stores the full probability mass function (not just summary scalars)
  • Descriptive stats: Mean, std, median, quantiles, skewness, kurtosis for numeric columns
  • Numeric binning: Quantile or uniform binning for continuous data
  • Divergence metrics: JS divergence, KL divergence, PSI, L1/total variation distance
  • Category drift: Detects new/dropped categories between versions
  • DataFrame-level comparison: One-call comparison across all shared columns
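To make the divergence metrics concrete, here is a minimal sketch of JS divergence and PSI over two aligned PMFs, written against their textbook definitions (natural log). It is not the package's internal code, and the smoothing constant is an assumption:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between aligned PMFs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def psi(p, q, eps=1e-10):
    """Population Stability Index, with smoothing for empty bins."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum((p - q) * np.log(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(js_divergence(p, q), psi(p, q))
```

Both metrics are 0 for identical PMFs and grow with drift; PSI > 0.2 is a commonly used alert threshold, as in the drift example later in this README.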

Quick Examples

Analyze a Dataset

from dataculpa_snapshot import analyze_dataset_richness
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze_dataset_richness(df, row_sample_frac=0.25)

# Get recommended columns
lean_cols = result["mixes"]["lean"]["columns"]
print(f"Lean mix ({len(lean_cols)} columns):", lean_cols)

Column Richness

from dataculpa_snapshot import column_richness

metrics = column_richness(df["my_column"])
print(f"Richness: {metrics['richness']:.3f}")
print(f"Fill rate: {metrics['fill_rate']:.3f}")

Find Mutually Exclusive Columns

from dataculpa_snapshot import (
    build_dependency_matrix,
    find_mutually_exclusive_columns,
)

dep_matrix = build_dependency_matrix(df, list(df.columns))
pairs = find_mutually_exclusive_columns(
    df, list(df.columns), dep_matrix,
    max_null_overlap=0.3,
    min_dep_similarity=0.7,
)

for pair in pairs[:5]:
    print(f"{pair['col1']} <-> {pair['col2']}: score={pair['mutual_exclusivity_score']:.3f}")

Compare Two Dataset Versions

from dataculpa_snapshot import compare_dataframe_profiles

drift = compare_dataframe_profiles(df_v1, df_v2)
print(drift[["column", "js_divergence", "psi", "l1_distance"]])

# Flag columns that have drifted
drifted = drift[drift["js_divergence"] > 0.05]
for _, row in drifted.iterrows():
    print(f"DRIFT: {row['column']} (JS={row['js_divergence']:.4f})")

Characterize and Compare a Single Column

from dataculpa_snapshot import characterize_distribution, compare_profiles
import numpy as np

# Profile baseline
prof_baseline = characterize_distribution(df_v1["score"])

# Profile current using baseline's bin edges for alignment
prof_current = characterize_distribution(
    df_v2["score"],
    bin_edges=np.array(prof_baseline["bin_edges"]),
)

result = compare_profiles(prof_baseline, prof_current)
print(f"JS divergence: {result['js_divergence']:.4f}")
print(f"KL divergence: {result['kl_divergence_p_q']:.4f}")
print(f"PSI:           {result['psi']:.4f}")
print(f"Mean shift:    {result['mean_delta']:.2f}")

CLI Usage

Basic Analysis

dataculpa-snapshot data.csv

Custom Parameters

dataculpa-snapshot data.parquet \
  --row-frac 0.5 \
  --lean-frac 0.85 \
  --output-prefix my_analysis \
  --dep-min-weight 0.3

Output Files

| File | Description |
| --- | --- |
| `{prefix}_column_profile.csv` | Richness metrics per column |
| `{prefix}_dep_matrix.csv` | Pairwise dependency matrix |
| `{prefix}_info_gain_path.csv` | Greedy column ordering |
| `{prefix}_mixes.json` | Lean/max recommendations |
| `{prefix}_column_deps.dot` | Dependency graph (render with Graphviz) |
| `{prefix}_mutual_exclusive.csv` | Mutually exclusive pairs |

CLI Options

--row-frac FRAC          Fraction of rows to sample (default: 0.25)
--col-sample-size N      Per-column sample size
--lean-frac FRAC         Lean mix threshold (default: 0.9)
--output-prefix PREFIX   Output file prefix
--dep-min-weight WEIGHT  Min dependency for graph edges (default: 0.3)
--dep-top-k K           Max edges per node in graph (default: 3)
--dep-max-edges N       Max total edges in graph (default: 200)

Python API

Full Analysis Pipeline

from dataculpa_snapshot import analyze_dataset_richness

result = analyze_dataset_richness(
    df,
    row_sample_frac=0.25,
    col_sample_size=10000,
    lean_fraction=0.9,
)

# Access results
col_profile = result["column_profile"]
dep_matrix = result["dependency_matrix"]
mixes = result["mixes"]

Column-Level Analysis

from dataculpa_snapshot import column_richness, schema_arrangement_richness

# Single column
metrics = column_richness(df["column_name"])

# Dict/JSON column
schema_metrics = schema_arrangement_richness(df["json_column"])

DataFrame Profile

from dataculpa_snapshot import dataframe_richness_profile

col_profile, schema_metrics = dataframe_richness_profile(df)
top_cols = col_profile.sort_values("richness", ascending=False).head(10)

Dependency Analysis

from dataculpa_snapshot import build_dependency_matrix, dependency_graphviz

cols = list(df.columns)
dep_matrix = build_dependency_matrix(df, cols)

# Generate graph
dot_src = dependency_graphviz(cols, dep_matrix, min_weight=0.3)
with open("deps.dot", "w") as f:
    f.write(dot_src)
# Render: dot -Tpng deps.dot -o deps.png

Custom Column Selection

from dataculpa_snapshot import (
    greedy_info_gain_path,
    choose_core_mixes_from_info_gain,
)
import numpy as np

richness_vec = np.array([col_profile.loc[c, "richness"] for c in cols])
path = greedy_info_gain_path(richness_vec, dep_matrix)
mixes = choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)

print("Lean mix:", mixes["lean"]["columns"])

Conditional Richness

from dataculpa_snapshot import rank_columns_given_prior

# What columns to add given you already have user_id?
ranked = rank_columns_given_prior(
    cols=list(df.columns),
    col_profile=col_profile,
    dep_matrix=dep_matrix,
    prior_col_name="user_id",
)

for rec in ranked[:5]:
    print(f"{rec['column']}: {rec['conditional_richness']:.3f}")

Mutual Exclusivity Detection

from dataculpa_snapshot import find_mutually_exclusive_columns, analyze_null_overlap

# Find mutually exclusive pairs
pairs = find_mutually_exclusive_columns(
    df,
    cols=list(df.columns),
    dep_matrix=dep_matrix,
    max_null_overlap=0.3,      # Max 30% rows with both non-null
    min_dep_similarity=0.7,     # Min 70% dependency similarity
    top_k=10,
)

for pair in pairs:
    print(f"{pair['col1']} <-> {pair['col2']}")
    print(f"  Score: {pair['mutual_exclusivity_score']:.3f}")
    print(f"  Null overlap: {pair['null_overlap']:.1%}")

# Detailed analysis
detail = analyze_null_overlap(df, "col1", "col2")
print(f"Both non-null: {detail['pct_both_non_null']:.1f}%")

Use Cases

Feature Selection for ML

result = analyze_dataset_richness(df)
features = result["mixes"]["lean"]["columns"]
X_train = df[features]

Data Quality Assessment

col_profile, _ = dataframe_richness_profile(df)
low_quality = col_profile[col_profile["richness"] < 0.1]
print("Low quality columns:", low_quality.index.tolist())

Schema Evolution Detection

pairs = find_mutually_exclusive_columns(df, cols, dep_matrix)
for pair in pairs:
    if pair['mutual_exclusivity_score'] > 0.8:
        print(f"⚠️ Possible migration: {pair['col1']} → {pair['col2']}")

Schema Analysis for JSON Columns

col_profile, schema_metrics = dataframe_richness_profile(df)
for col, metrics in schema_metrics.items():
    print(f"{col}: {metrics['unique_schema_arrangements']} unique schemas")

Dataset Drift Detection

from dataculpa_snapshot import compare_dataframe_profiles

# Compare yesterday's data to today's
drift = compare_dataframe_profiles(df_yesterday, df_today)

# Alert on significant drift
for _, row in drift.iterrows():
    if row["psi"] > 0.2:
        print(f"ALERT: {row['column']} PSI={row['psi']:.3f}")
    if row.get("new_categories"):
        print(f"  New values appeared: {row['new_categories']}")

API Reference

Core Functions

column_richness(series, sample_size=None, random_state=42)

  • Compute richness for a single column
  • Returns: Dict with richness, fill_rate, entropy_norm, entropy_balanced, etc.

schema_arrangement_richness(series, sample_size=None, random_state=42)

  • Richness over dict/JSON schema arrangements
  • Returns: Dict with schema statistics and top arrangements

dataframe_richness_profile(df, sample_size=None, random_state=42, detect_schema_cols=True)

  • Profile all columns in a DataFrame
  • Returns: (col_profile DataFrame, schema_metrics dict)

analyze_dataset_richness(df, row_sample_frac=0.25, col_sample_size=None, random_state=42, lean_fraction=0.9)

  • Full analysis pipeline
  • Returns: Dict with column_profile, dependency_matrix, mixes, etc.

Dependency Analysis

normalized_mi(x, y)

  • Normalized mutual information between two series (0-1)
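As a rough illustration of what a normalized MI score measures, here is a from-scratch sketch using a contingency table, normalized by the geometric mean of the marginal entropies. The normalization choice is an assumption; the library may normalize differently:

```python
import numpy as np

def normalized_mi_sketch(x, y):
    """NMI from a contingency table: I(X;Y) / sqrt(H(X) * H(Y))."""
    xv, xi = np.unique(np.asarray(x), return_inverse=True)
    yv, yi = np.unique(np.asarray(y), return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    for i, j in zip(xi, yi):
        joint[i, j] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    outer = np.outer(px, py)
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))
    denom = np.sqrt(H(px) * H(py))
    return float(mi / denom) if denom > 0 else 0.0

# One-to-one mapping between the columns -> perfect dependency
print(normalized_mi_sketch(["x", "x", "y", "y", "z", "z"], [1, 1, 2, 2, 3, 3]))  # 1.0
```

Independent columns score near 0, a one-to-one mapping scores 1, making the measure comparable across column pairs regardless of cardinality.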

build_dependency_matrix(df, cols)

  • Build pairwise dependency matrix using normalized MI
  • Returns: numpy array (n_cols × n_cols)

dependency_graphviz(cols, dep_matrix, min_weight=0.2, top_k_per_node=3, max_edges=200, graph_name="ColumnDependencies")

  • Generate Graphviz DOT string for dependency graph

Information Gain & Selection

greedy_info_gain_path(richness_vec, dep_matrix)

  • Greedy ordering of columns by information gain
  • Returns: List of dicts with cumulative gains

choose_core_mixes_from_info_gain(path, cols, lean_fraction=0.9)

  • Extract lean and max mixes from greedy path
  • Returns: Dict with lean, max, and path DataFrame

rank_columns_given_prior(cols, col_profile, dep_matrix, prior_col_name)

  • Rank columns by conditional richness given a prior column
  • Returns: List of dicts sorted by conditional richness

Mutual Exclusivity Detection

find_mutually_exclusive_columns(df, cols, dep_matrix, max_null_overlap=0.3, min_dep_similarity=0.7, top_k=20)

  • Find columns with non-overlapping nulls and similar dependencies
  • Returns: List of dicts with exclusivity scores

analyze_null_overlap(df, col1, col2)

  • Detailed null overlap analysis between two columns
  • Returns: Dict with counts, percentages, and overlap metrics

Distribution Characterization & Comparison

detect_column_type(series)

  • Classify column as 'categorical', 'numeric_discrete', or 'numeric_continuous'

bin_numeric_series(series, n_bins=50, strategy='quantile', bin_edges=None)

  • Bin numeric data into a PMF (quantile or uniform strategy)
  • Returns: (bin_edges, pmf) arrays
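Quantile binning into a PMF can be sketched with numpy. This is a hypothetical equivalent of `bin_numeric_series` with tie and edge handling simplified, not the package's actual implementation:

```python
import numpy as np

def quantile_bin_sketch(values, n_bins=4):
    """Bin values at quantile edges and return (edges, pmf)."""
    values = np.asarray(values, float)
    # quantile edges give (approximately) equal-mass bins;
    # np.unique collapses duplicate edges caused by heavy ties
    edges = np.unique(np.quantile(values, np.linspace(0, 1, n_bins + 1)))
    counts, _ = np.histogram(values, bins=edges)
    pmf = counts / counts.sum()
    return edges, pmf

rng = np.random.default_rng(0)
edges, pmf = quantile_bin_sketch(rng.normal(size=1000), n_bins=4)
print(pmf)  # each bin holds roughly a quarter of the mass
```

Reusing one profile's `edges` to bin another sample is what makes two PMFs directly comparable, which is why `characterize_distribution` accepts a `bin_edges` argument.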

characterize_distribution(series, pmf_labels=None, pmf_probs=None, n_bins=50, bin_strategy='quantile', bin_edges=None)

  • Full distribution profile: type, PMF, fill rate, unique count
  • Numeric columns also get: mean, std, median, quantiles, skewness, kurtosis, bin edges
  • Pass bin_edges from a reference profile for aligned comparison

compare_profiles(profile_p, profile_q, smoothing=1e-10)

  • Compare two column profiles (from characterize_distribution)
  • Returns: Dict with js_divergence, kl_divergence_p_q, psi, l1_distance, fill_rate_delta, new_categories, dropped_categories, and numeric stat deltas

compare_dataframe_profiles(df_p, df_q, sample_size=None, random_state=42, n_bins=50)

  • Compare two DataFrames column-by-column on shared columns
  • Automatically aligns numeric bin edges from P to Q
  • Returns: DataFrame with one row per column, all divergence metrics

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes
  4. Run tests and linters: pytest, black ., flake8
  5. Commit: git commit -m "Add feature"
  6. Push and create a Pull Request

Development Setup

git clone https://github.com/dataculpa/dataculpa-snapshot.git
cd dataculpa-snapshot
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

Code Style

  • Follow PEP 8
  • Use Black for formatting (line length: 88)
  • Add type hints where appropriate
  • Write docstrings for public functions

License

Apache License 2.0 - See LICENSE file for details.




Citation

@software{dataculpa_snapshot,
  title = {dataculpa-snapshot: Dataset Quality Analysis and Feature Selection},
  author = {Haile, Mitch},
  organization = {Data Culpa, Inc.},
  year = {2025},
  url = {https://www.dataculpa.com}
}

Version 0.3.0 | Built with ❤️ by Data Culpa, Inc.

About

A tool for normalizing data quality metrics into a single measure for columns (richness) and tables (joint richness)
