Add data-audit specification and dbt tooling infrastructure#80

Merged
michaelbarton merged 28 commits into master from claude/fix-status-checks-0nymq
Mar 10, 2026

@michaelbarton
Owner

Summary

This PR introduces a comprehensive specification for a new data-audit Python package alongside supporting dbt analysis tooling and Neovim integration. The changes establish the foundation for LLM-powered auditing of heterogeneous data artifacts (dbt models, notebooks, flat files) with autonomous orchestration capabilities.

Key Changes

Core Specification (dbt/SPEC.md)

  • Comprehensive specification for the data-audit package covering:
    • Problem statement and goals for unified artifact auditing
    • Support for multiple artifact types: dbt models, schema files, Jupyter notebooks, Quarto documents, and flat data files
    • Architecture including core data models (Finding, AuditResult, AuditPlan, AuditTask)
    • Auditor protocol for extensible artifact handling
    • LLM backend abstraction supporting multiple providers (Anthropic, cursor-agent, litellm)
    • Orchestrator ("project manager") for autonomous follow-up audits with budget controls
    • CLI interface and programmatic API design
    • Prompt template system with conditional rendering
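
For orientation, the data models and the Auditor protocol named above might look roughly like the sketch below. Only the class names (Finding, AuditResult, AuditPlan, AuditTask, Auditor) come from the spec summary; every field, method name, and the toy `SpacesInNameAuditor` are assumptions made for illustration, not the contents of dbt/SPEC.md.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    """One issue reported by an audit (field names are assumed)."""
    severity: str  # e.g. "high", "medium", "low"
    title: str
    detail: str = ""


@dataclass
class AuditResult:
    """Findings produced by auditing a single artifact."""
    artifact: Path
    findings: list[Finding] = field(default_factory=list)


@dataclass
class AuditTask:
    """One unit of work: audit this artifact at this follow-up depth."""
    artifact: Path
    depth: int = 0


@dataclass
class AuditPlan:
    """Ordered tasks the orchestrator intends to run."""
    tasks: list[AuditTask] = field(default_factory=list)


class Auditor(Protocol):
    """Structural interface: anything with these two methods is an auditor."""
    def can_audit(self, artifact: Path) -> bool: ...
    def audit(self, task: AuditTask) -> AuditResult: ...


class SpacesInNameAuditor:
    """Toy auditor (hypothetical): flags CSV file names that contain spaces."""

    def can_audit(self, artifact: Path) -> bool:
        return artifact.suffix == ".csv"

    def audit(self, task: AuditTask) -> AuditResult:
        result = AuditResult(artifact=task.artifact)
        if " " in task.artifact.name:
            result.findings.append(Finding("low", "File name contains spaces"))
        return result
```

Because `Auditor` is a `typing.Protocol`, a new artifact type only needs a class with matching methods; no base class has to be inherited.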

dbt Analysis Tools

  • dbt_batch_audit.py: Batch auditing script that:

    • Compiles dbt models and gathers context (compiled SQL, sample rows, lineage, existing tests)
    • Runs audits in parallel across multiple models and LLMs
    • Synthesizes findings into a consolidated report with cross-model impact analysis
    • Includes downstream propagation tracing for bug impact assessment
  • dbt_analyse.py: Interactive single-model analysis tool that:

    • Compiles a model and gathers full context
    • Launches cursor-agent for interactive LLM-driven analysis
    • Supports custom prompt templates
  • Prompt Templates:

    • dbt_deep_analysis.md: Comprehensive audit checklist covering schema, joins, filters, grain, data quality, performance, and test coverage
    • dbt_quick_analysis.md: Quick inline code review with improvement suggestions
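
The parallel fan-out across models and LLMs described for dbt_batch_audit.py can be pictured with a thread pool, as in this minimal sketch. The function `run_audits` and the `audit_one` callback are hypothetical names; the real script may structure this differently (for example by shelling out to an agent per pair).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_audits(models, llms, audit_one, max_workers=4):
    """Run audit_one(model, llm) for every (model, llm) pair in parallel.

    Returns a dict keyed by (model, llm). Threads suit this workload because
    each audit mostly waits on an external process or network call.
    """
    pairs = list(product(models, llms))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input pairs.
        reports = pool.map(lambda pair: audit_one(*pair), pairs)
        return dict(zip(pairs, reports))
```

A synthesis step could then take the returned dict and group the reports by model before building the consolidated report.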

Neovim Integration (nvim/lua/config/dbt.lua)

  • Comprehensive dbt keymaps and helpers:
    • <leader>dg: Jump to ref/source under cursor
    • <leader>df: Fuzzy model picker with run/build/test actions
    • <leader>do: Open compiled SQL in split
    • <leader>d/: Search across models
    • <leader>dr/dR: Run model (dr: model only; dR: model plus all downstream dependents)
    • <leader>db: Build model (run + test)
    • <leader>dc: Compile model
    • <leader>dt: Test model
    • <leader>ds: Show query results without materializing (dbt show)
    • <leader>dp: Preview sample rows in split
    • <leader>dv: Pipe output to visidata for interactive exploration

Configuration Updates

  • nvim/lua/config/keymaps.lua: Migrate wiki search and wiki link insertion from fzf-lua to telescope, since fzf-lua is not installed and the config uses telescope via LazyVim
  • ansible/tasks/neovim.yml: Add dbt directory symlink alongside ftplugin
  • .github/workflows/ansible.yml: Add CI workflow for playbook syntax checking and execution
  • nvim/lua/plugins/language.lua: Add SQL to treesitter language list

Notable Implementation Details

  • Template rendering uses a lightweight Handlebars-style syntax ({{var}}, {{#if var}}, {{^if var}}) so that prompt sections render only when their context (for example lineage or existing tests) is available
  • The batch audit's synthesis prompt traces bug propagation through model dependency chains
  • dbt commands use uv run for consistent Python environment management
  • Neovim integration uses toggleterm for async command execution while maintaining editor focus
  • Budget controls in orchestrator prevent runaway analysis (max_depth, max_tasks, max_wall_clock, max_tokens)
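
As an illustration of the budget controls, the sketch below shows one way the four limits could gate each follow-up task. The limit names (max_depth, max_tasks, max_wall_clock, max_tokens) come from the description above; the `Budget` class, its default values, and the `allows`/`record` methods are assumptions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Tracks consumption against the four orchestrator limits."""
    max_depth: int = 3              # assumed default
    max_tasks: int = 20             # assumed default
    max_wall_clock: float = 600.0   # seconds, assumed default
    max_tokens: int = 200_000       # assumed default
    tasks_run: int = 0
    tokens_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def allows(self, depth: int) -> bool:
        """True only while every limit still has headroom."""
        elapsed = time.monotonic() - self.started_at
        return (
            depth <= self.max_depth
            and self.tasks_run < self.max_tasks
            and self.tokens_used < self.max_tokens
            and elapsed < self.max_wall_clock
        )

    def record(self, tokens: int) -> None:
        """Account for one finished task and the tokens it consumed."""
        self.tasks_run += 1
        self.tokens_used += tokens
```

An orchestrator loop would call `allows(task.depth)` before starting each follow-up audit and `record(...)` after it finishes, so any single exhausted limit stops further work.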

https://claude.ai/code/session_01BPxJNKYipwwrx17jvyv6rM

claude and others added 28 commits March 1, 2026 23:55
Removes the dependency on the community.general collection which
was producing a warning about not supporting the installed Ansible
version. Uses npm install --prefix directly instead.

https://claude.ai/code/session_01Vm5EEsQ5uFKoni6qWEDQd8
Runs the full playbook (minus macOS-only launch agents) on
ubuntu-latest: syntax check then an actual apply. Installs
Neovim, Python 3.11, and virtualenv so the neovim setup tasks
(pip venv, Lazy sync, treesitter) can run too.

https://claude.ai/code/session_01Vm5EEsQ5uFKoni6qWEDQd8
Add <leader>dr (dbt run), <leader>dc (dbt compile), and <leader>dt
(dbt test) keybindings that format the current SQL file with sqlfmt
via conform, save, then send the dbt command to toggleterm. Also add
sql to treesitter ensure_installed for better SQL highlighting.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
- <leader>dR: dbt run -s model+ (model and all downstream dependents)
- <leader>db: dbt build (run + test in DAG order)
- <leader>ds: dbt show (preview query results without materializing)

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
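
The `model+` selector above picks a model together with everything downstream of it. The traversal behind that idea, which is also what downstream propagation tracing needs, can be sketched as a breadth-first walk over a child map. The graph and model names here are made up; dbt resolves the real graph from its own project metadata.

```python
from collections import deque


def downstream(children, model):
    """Return model followed by all of its descendants, nearest first.

    children maps each model name to the models that directly depend on it.
    """
    seen = {model}
    order = [model]
    queue = deque([model])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:  # guard against revisiting shared children
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical dependency graph for illustration only.
example_graph = {
    "stg_orders": ["orders"],
    "orders": ["revenue", "customer_ltv"],
    "revenue": ["dashboard"],
}
```

For a bug found in `orders`, `downstream(example_graph, "orders")` lists the models whose output may be affected, in order of distance from the source.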
- <leader>dg: jump to ref() or source() under cursor
- <leader>df: fzf model picker with ctrl-r/b/t to run/build/test
- <leader>do: open compiled SQL in readonly vsplit
- <leader>d/: grep across all models (find columns, CTEs, etc.)

Inspired by fzf-dbt CLI tool, adapted for neovim with fzf-lua.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
<leader>da: quick analysis with sonnet — reviews the current model
and suggests improvements in non-interactive mode.

<leader>dA: deep analysis with sonnet thinking — interrogates the
duckdb database and cross-references with the current model to check
data quality, joins, types, and more.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
The <leader>dA keymap now compiles the model first, then passes
the compiled SQL and first 20 rows from dbt show as extra context
to the claude agent for more informed analysis.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
Prompts now live in nvim/prompts/ as markdown files with {{var}}
template placeholders. The quick analysis prompt is loaded and
substituted in Lua; the deep analysis prompt uses sed at runtime
to inject compiled SQL and sample rows before passing to claude.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
- <leader>da now replaces the buffer with the annotated SQL (undo with :u)
- Updated prompt to allow brief explanations alongside suggestions
- All dbt commands (run, build, test, compile, show) now use uv run

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
<leader>dA now opens a new tmux window named 'dbt:<model>' that
compiles the model, gathers sample rows, then starts an interactive
claude session with that context so you can discuss the model.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
Runs dbt show --limit 20 asynchronously and displays the results
in a read-only scratch buffer. Press q to close.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
fzf-lua is not installed; the config uses telescope via LazyVim.
Converts wiki search, wiki insert link, dbt model finder (with
C-r/C-b/C-t actions), and dbt grep to telescope equivalents.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
Pipes `dbt show --output csv --limit 500` into visidata via
toggleterm. Formats and saves the file first like other dbt commands.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
Uses a dedicated toggleterm Terminal with close_on_exit and an
on_exit callback instead of the shared terminal, so focus returns
to the previous window when visidata is quit.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
dbt show only supports --output json/text, not csv. Use
--output json --log-format json and pipe through a python
script that extracts the preview data and converts to CSV.

https://claude.ai/code/session_016johXLfEd6P4umaT14YQEQ
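
A rough sketch of the JSON-to-CSV conversion step described in the commit above follows. It assumes each log line is a JSON object and that the preview rows appear under `data.preview`, possibly as a JSON-encoded string; that location is an assumption and should be checked against the output of the installed dbt version. The function names are hypothetical.

```python
import csv
import io
import json


def extract_preview(log_lines):
    """Return the preview rows from JSON log lines, or [] if none are found.

    Assumption: rows live under event["data"]["preview"], either as a list
    of dicts or as a JSON-encoded string of that list.
    """
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON output mixed into the stream
        data = event.get("data") if isinstance(event, dict) else None
        if isinstance(data, dict) and data.get("preview"):
            preview = data["preview"]
            return json.loads(preview) if isinstance(preview, str) else preview
    return []


def rows_to_csv(rows):
    """Serialize a list of uniform dicts to CSV text with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a pipeline, such a script would read the log lines from standard input and write the CSV to standard output for visidata to consume.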
- Close file handles properly in dbt_analyse.py (use `with` statements)
- Add PEP 723 inline metadata to dbt_analyse.py for uv compatibility
- Move `import re` to module level in dbt_batch_audit.py
- Write context to a temp file in dbt_analyse.py to avoid ARG_MAX limits
- Add subprocess timeouts (900s audit, 1200s synthesis) to prevent hangs
- Order template substitutions to avoid placeholder injection

https://claude.ai/code/session_01RHSUYqsWy6xLfAC9tFTy66
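
Two of the fixes above, writing context to a temp file and adding subprocess timeouts, combine naturally. The sketch below is one possible shape, not the code in dbt_analyse.py: `run_with_context` is a hypothetical helper, and the command it runs is supplied by the caller.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_context(command, context, timeout):
    """Run command with the context passed as a file path, not an argument.

    Large contexts (compiled SQL plus sample rows) can exceed the operating
    system's argument-size limit, so only the temp file's path goes on the
    command line. Returns stdout, or None if the timeout expires.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as handle:
        handle.write(context)
        context_path = Path(handle.name)
    try:
        completed = subprocess.run(
            [*command, str(context_path)],
            capture_output=True,
            text=True,
            timeout=timeout,  # subprocess.run kills the child on expiry
        )
        return completed.stdout
    except subprocess.TimeoutExpired:
        return None
    finally:
        context_path.unlink(missing_ok=True)  # always clean up the temp file


# Stand-in for an LLM CLI: a Python one-liner that reports the file size.
count_chars = [sys.executable, "-c", "import sys; print(len(open(sys.argv[1]).read()))"]
```

With the timeouts named in the commit, an audit call would pass `timeout=900` and a synthesis call `timeout=1200`.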
- Rewrite dbt_deep_analysis.md with a comprehensive 8-section audit
  checklist covering schema/types, join correctness, filters, grain,
  data quality, performance, test coverage gaps, and upstream risks
- Add structured output format (findings table, evidence queries,
  suggested dbt test YAML snippets)
- Add conditional template sections ({{#if lineage}}, {{#if existing_tests}})
  with a lightweight Handlebars-style renderer in both scripts
- Gather model lineage (parents/children) via `dbt ls` selectors
- Scan schema.yml files for existing test definitions and include them
  so the LLM focuses on coverage gaps rather than redundant suggestions
- Add pyyaml dependency to both scripts' PEP 723 metadata

https://claude.ai/code/session_01RHSUYqsWy6xLfAC9tFTy66
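
To make the conditional template sections concrete, here is a minimal renderer sketch. The `{{var}}`, `{{#if var}}`, and `{{^if var}}` forms come from the PR description; the `{{/if}}` closing tag, the `render` function, and the lack of nested-block support are assumptions. Variables are substituted in a single regex pass, which is a different technique from the ordered substitutions mentioned in an earlier commit but addresses the same placeholder-injection concern, since inserted values are never rescanned.

```python
import re

# {{#if name}}...{{/if}} renders when name is truthy; {{^if name}} when falsy.
_BLOCK = re.compile(r"\{\{([#^])if (\w+)\}\}(.*?)\{\{/if\}\}", re.DOTALL)
_VAR = re.compile(r"\{\{(\w+)\}\}")


def render(template, context):
    """Render conditional blocks first, then substitute variables once."""

    def resolve_block(match):
        kind, name, body = match.groups()
        truthy = bool(context.get(name))
        # Keep the body when truthiness matches the block kind.
        return body if truthy == (kind == "#") else ""

    def resolve_var(match):
        # Unknown variables render as an empty string.
        return str(context.get(match.group(1), ""))

    without_blocks = _BLOCK.sub(resolve_block, template)
    # Single pass: a value containing "{{other}}" is inserted verbatim.
    return _VAR.sub(resolve_var, without_blocks)


example_template = (
    "Model: {{name}}\n"
    "{{#if lineage}}Parents: {{lineage}}\n{{/if}}"
    "{{^if existing_tests}}No tests found.\n{{/if}}"
)
```

Rendering `example_template` with only `name` and `lineage` set keeps the lineage section and the "No tests found." fallback; supplying `existing_tests` drops the fallback.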
- Remove unused `import json`
- Remove stub `get_data_profile` that always returned empty string
- Simplify redundant glob patterns in get_existing_tests (*.yml already
  covers schema.yml, _schema.yml, *_models.yml)

https://claude.ai/code/session_01RHSUYqsWy6xLfAC9tFTy66
Move all dbt helpers and keymaps from keymaps.lua into a dedicated
config/dbt.lua, loaded via require("config.dbt"). Keeps keymaps.lua
focused on general-purpose bindings.

https://claude.ai/code/session_01RHSUYqsWy6xLfAC9tFTy66
Covers artifact types (dbt models, notebooks, flat files, Quarto docs),
the PM orchestrator loop, structured findings model, LLM backend
abstraction, CLI/API design, and phased implementation plan.

https://claude.ai/code/session_019YdrjfrB6Lgu5QTZzrnXdb
Merge master and reformat dbt_analyse.py and dbt_batch_audit.py to
pass the CI formatting checks.

https://claude.ai/code/session_01BPxJNKYipwwrx17jvyv6rM
…hecks-0nymq

# Conflicts:
#	.github/workflows/ansible.yml
@michaelbarton michaelbarton merged commit ca41eb3 into master Mar 10, 2026
2 checks passed
@michaelbarton michaelbarton deleted the claude/fix-status-checks-0nymq branch March 10, 2026 01:58

2 participants