dyxgb - Dynamic XGBoost

A flexible CLI tool for XGBoost training and prediction with support for multiple data sources, hyperparameter tuning, and both classification and regression tasks.

Unix Philosophy: dyxgb is designed for composability. Commands work as Unix filters with clean stdin/stdout contracts, predictable exit codes, and pipe-friendly defaults.

Features

  • Unix-Friendly: stdin/stdout support, composable commands, clean I/O contracts
  • Multiple Data Sources: Load data from CSV, Parquet, JSON files or databases (SQLite, DuckDB, PostgreSQL)
  • Classification & Regression: Support for both task types with appropriate metrics
  • Hyperparameter Tuning: Integrated Optuna for automated hyperparameter optimization
  • Interactive Mode: InquirerPy-powered prompts for exploratory workflows
  • Single Artifact: Model, encoder, and pipeline bundled in one .dyxgb file
  • Feature Importance: Extract and export feature importance scores
  • Evaluation Metrics: Accuracy, precision, recall, F1, and ROC AUC for classification; MSE, RMSE, MAE, and R² for regression

Installation

# Clone the repository
git clone https://github.com/sxndmxn/dyxgb.git
cd dyxgb

# Install with uv (recommended)
uv sync

# With optional dependencies
uv sync --all-extras

# Or with pip
pip install -e .
pip install -e ".[all]"  # with all extras

Optional Dependencies

# Interactive mode (InquirerPy prompts)
pip install -e ".[interactive]"

# Hyperparameter tuning (Optuna)
pip install -e ".[tuning]"

# PostgreSQL support
pip install -e ".[postgres]"

# Everything
pip install -e ".[all]"

CLI Contract

dyxgb follows Unix philosophy conventions:

I/O Behavior

Stream  Content
stdout  Data only (CSV, JSONL, JSON) - never logs or progress
stderr  Logs, progress, human-readable output
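Because stdout carries only data, logs and results can be captured independently. A minimal sketch of the pattern, using a stand-in command (with dyxgb you would run e.g. `dyxgb predict -m model.dyxgb < new.csv` in its place):

```shell
# Stand-in for a dyxgb command: data on stdout, a log line on stderr.
{ echo "row,1"; echo "progress: done" >&2; } > out.csv 2> run.log

cat out.csv   # data only: row,1
cat run.log   # logs only: progress: done
```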

Exit Codes

Code  Meaning
0     Success
1     Runtime error (I/O failure, model load failure, etc.)
2     Usage error (invalid args, unsupported formats, etc.)
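These codes make failures scriptable. A sketch that maps the documented codes to messages; the `sh -c` line is a stand-in for a real dyxgb invocation:

```shell
# Map dyxgb's documented exit codes to human-readable messages.
explain_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "runtime error" ;;
    2) echo "usage error" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

sh -c 'exit 2'            # stand-in for a dyxgb usage error
status=$?
explain_exit "$status"    # prints: usage error
```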

Default Behavior

Command     stdin              stdout              Default Format
train       Rejected (exit 2)  N/A                 .dyxgb bundle file
predict     CSV/JSONL          Predictions         CSV
evaluate    N/A                Metrics             JSON object
importance  N/A                Feature importance  CSV

Unix Pipelines

dyxgb commands are designed for composition:

# Predict as a filter (stdin -> stdout)
dyxgb predict --model model.dyxgb < new.csv > preds.csv

# JSONL input/output
cat data.jsonl | dyxgb predict -m model.dyxgb --input-format jsonl --output-format jsonl > preds.jsonl

# Evaluate and parse with jq
dyxgb evaluate -s test.csv -m model.dyxgb --target y | jq '.accuracy'

# Get top 5 features as CSV
dyxgb importance -m model.dyxgb --top 5 > top_features.csv

# Chain with other tools
dyxgb importance -m model.dyxgb | head -6 | column -t -s,

# Combine train + evaluate
dyxgb train -s train.csv --target y -o model.dyxgb && \
  dyxgb evaluate -s test.csv -m model.dyxgb --target y > metrics.json

Quick Start

Train a Model

# Basic training (outputs model.dyxgb bundle)
dyxgb train \
  --source data/train.csv \
  --target label \
  --output model.dyxgb

# With hyperparameter tuning
dyxgb train \
  --source data/train.parquet \
  --target price \
  --task regression \
  --tune \
  --tune-trials 100

Make Predictions

# From file
dyxgb predict \
  --source data/new_data.csv \
  --model model.dyxgb \
  --output predictions.csv

# From stdin (pipe-friendly default)
dyxgb predict --model model.dyxgb < new_data.csv > preds.csv

# JSONL format
dyxgb predict -m model.dyxgb \
  --input-format jsonl \
  --output-format jsonl < data.jsonl

Evaluate Model

# Get metrics as JSON (default: stdout)
dyxgb evaluate \
  --source test.csv \
  --model model.dyxgb \
  --target label

# Save to file
dyxgb evaluate -s test.csv -m model.dyxgb --target y --output metrics.json

Feature Importance

# CSV to stdout (default)
dyxgb importance --model model.dyxgb

# Top N features
dyxgb importance -m model.dyxgb --top 10

# JSONL format
dyxgb importance -m model.dyxgb --output-format jsonl

# Save to file
dyxgb importance -m model.dyxgb --output importance.parquet

Interactive Mode

# Run interactive wizard (requires interactive extra)
dyxgb interactive

Database Sources

# PostgreSQL
dyxgb train \
  --source "postgres://user:pass@localhost:5432/mydb" \
  --query "SELECT * FROM training_data" \
  --target label

# DuckDB
dyxgb train \
  --source "duckdb:///data/analytics.duckdb" \
  --table "features" \
  --target churn

Configuration File

Create a config.yaml file for reproducible workflows:

data:
  train:
    type: file
    path: "data/train.parquet"

transforms:
  rename:
    txn_amt: amount
  cast:
    amount: float
    age: int
  missing:
    strategy: median
  features:
    - name: amount_log
      function: log
      column: amount
    - name: amount_per_age
      function: ratio
      columns: [amount, age]
  encode:
    columns:
      - category
  scale:
    method: standard
    columns:
      - amount
      - amount_log

model:
  task: classification
  target: label
  features:
    - amount
    - age
    - amount_log
    - amount_per_age
    - category
  hyperparameters:
    n_estimators: 300
    max_depth: 6
    learning_rate: 0.1

# Train with config
dyxgb train --config config.yaml

# Override config options
dyxgb train --config config.yaml --tune --tune-trials 200

See config.yaml or config.example.yaml for a complete example.

Feature Engineering

Feature engineering lives under transforms and runs in a fixed order: rename -> cast -> missing -> features -> encode -> scale. The pipeline is fitted on training data and stored in the .dyxgb bundle, so predict and evaluate reuse the exact same transforms.

Each entry in transforms.features needs a name and either:

  • function plus column or columns for a built-in registry function (list available functions with dyxgb functions)
  • expr for a raw Polars expression string (advanced; pl is in scope)

Features run after missing-value handling and before encoding/scaling, so add any derived categoricals to encode.columns and derived numerics to scale.columns (or omit scale.columns to auto-scale). model.features should reference the post-transform column names.
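A minimal transforms.features sketch showing both forms; the column names are illustrative, not taken from the project:

```yaml
transforms:
  features:
    # Built-in registry function: name + function + column(s)
    - name: amount_log
      function: log
      column: amount
    # Raw Polars expression string (advanced; pl is in scope)
    - name: amount_sq
      expr: "pl.col('amount') ** 2"
```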

Output Schemas

Predictions

Classification (CSV/JSONL):

predicted_label,confidence,prob_ClassA,prob_ClassB
A,0.85,0.85,0.15
B,0.92,0.08,0.92
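Since predictions are plain CSV on stdout, they compose with standard text tools. A sketch that keeps only low-confidence rows for review; the heredoc stands in for `dyxgb predict` output:

```shell
# Keep the header plus rows whose confidence (column 2) is below 0.90.
low_conf=$(awk -F, 'NR == 1 || $2 < 0.90' <<'EOF'
predicted_label,confidence,prob_ClassA,prob_ClassB
A,0.85,0.85,0.15
B,0.92,0.08,0.92
EOF
)
printf '%s\n' "$low_conf"   # keeps the header and the A row only
```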

Regression (CSV/JSONL):

predicted_value
123.45
67.89

Evaluation Metrics (JSON)

Classification:

{
  "accuracy": 0.92,
  "precision": 0.91,
  "recall": 0.93,
  "f1": 0.92,
  "roc_auc": 0.96
}
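Because metrics land on stdout as a single JSON object, they work well as a quality gate in scripts. A sketch using jq (already shown in the pipelines above); the inline string stands in for `dyxgb evaluate -s test.csv -m model.dyxgb --target y`:

```shell
# Stand-in for `dyxgb evaluate ...` output.
metrics='{"accuracy": 0.92, "precision": 0.91, "recall": 0.93, "f1": 0.92, "roc_auc": 0.96}'

# `jq -e` exits non-zero when the filter result is false, so it gates the `if`.
if printf '%s' "$metrics" | jq -e '.accuracy >= 0.9' > /dev/null; then
  echo "accuracy gate passed"
fi
```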

Regression:

{
  "mse": 0.0234,
  "rmse": 0.153,
  "mae": 0.12,
  "r2": 0.94
}

Feature Importance (CSV)

feature,importance
feature_3,0.4521
feature_1,0.3234
feature_2,0.2245

Supported Data Sources

Source        URI Format       Example
CSV           File path        data/train.csv
Parquet       File path        data/train.parquet
JSON          File path        data/train.json
JSONL/NDJSON  File path        data/train.jsonl
SQLite        sqlite:///path   sqlite:///data/db.sqlite
DuckDB        duckdb:///path   duckdb:///data/analytics.duckdb
PostgreSQL    postgres://...   postgres://user:pass@host:5432/db

Legacy Compatibility

The new bundle format (.dyxgb) is preferred, but legacy separate files are still supported:

# Legacy format (model.json + encoder.joblib)
dyxgb predict \
  --source data.csv \
  --model model.json \
  --encoder encoder.joblib \
  --output predictions.csv

CLI Reference

# Show help
dyxgb --help

# Command-specific help
dyxgb train --help
dyxgb predict --help
dyxgb evaluate --help
dyxgb importance --help
dyxgb interactive --help

Development

# Install dev dependencies
uv sync --group dev

# Run tests
uv run pytest

# Run contract tests specifically
uv run pytest tests/test_cli_contracts.py -v

# Type checking
uv run mypy src/dyxgb

# Linting
uv run ruff check src/dyxgb

License

MIT
