diff --git a/docs/source/workbook/_downloads/track_d_headers/bank_transactions_minimal.csv b/docs/source/workbook/_downloads/track_d_headers/bank_transactions_minimal.csv
new file mode 100644
index 0000000..e8a09f8
--- /dev/null
+++ b/docs/source/workbook/_downloads/track_d_headers/bank_transactions_minimal.csv
@@ -0,0 +1 @@
+bank_txn_id,posted_date,description,amount
diff --git a/docs/source/workbook/_downloads/track_d_headers/chart_of_accounts_minimal.csv b/docs/source/workbook/_downloads/track_d_headers/chart_of_accounts_minimal.csv
new file mode 100644
index 0000000..cd195e2
--- /dev/null
+++ b/docs/source/workbook/_downloads/track_d_headers/chart_of_accounts_minimal.csv
@@ -0,0 +1 @@
+account_id,account_name,account_type,normal_side
diff --git a/docs/source/workbook/_downloads/track_d_headers/gl_detail_minimal.csv b/docs/source/workbook/_downloads/track_d_headers/gl_detail_minimal.csv
new file mode 100644
index 0000000..16c1dee
--- /dev/null
+++ b/docs/source/workbook/_downloads/track_d_headers/gl_detail_minimal.csv
@@ -0,0 +1 @@
+txn_id,date,description,account_id,debit,credit
diff --git a/docs/source/workbook/_downloads/track_d_headers/invoices_minimal.csv b/docs/source/workbook/_downloads/track_d_headers/invoices_minimal.csv
new file mode 100644
index 0000000..9eedf34
--- /dev/null
+++ b/docs/source/workbook/_downloads/track_d_headers/invoices_minimal.csv
@@ -0,0 +1 @@
+invoice_id,invoice_date,customer_id,amount_total,status,paid_date
diff --git a/docs/source/workbook/track_d.rst b/docs/source/workbook/track_d.rst
index 5110b3c..5df32c0 100644
--- a/docs/source/workbook/track_d.rst
+++ b/docs/source/workbook/track_d.rst
@@ -142,7 +142,7 @@ By default, Track D scripts write artifacts to:
 * ``outputs/track_d/figures/`` — charts created by chapters that plot results
 
 For a practical walkthrough of the most common output files (CSV/JSON/PNG) and how to use them in a
-write-up, see :doc:`track_d_outputs_guide`.
+write-up, see :doc:`track_d_outputs_guide`.
+To apply the same workflow to your own exports, see :doc:`track_d_my_own_data`.
 
 A typical chapter produces:
diff --git a/docs/source/workbook/track_d_lab_ta_notes.rst b/docs/source/workbook/track_d_lab_ta_notes.rst
index 57830ad..dca620b 100644
--- a/docs/source/workbook/track_d_lab_ta_notes.rst
+++ b/docs/source/workbook/track_d_lab_ta_notes.rst
@@ -21,6 +21,7 @@ If you have 5 minutes before lab, skim these pages:
 * :doc:`track_d_student_edition` — the “book-style” Track D entry point
 * :doc:`track_d_dataset_map` — the table mental model + intentional QC “warts”
 * :doc:`track_d_outputs_guide` — what each generated output means and how to interpret it
+* :doc:`track_d_my_own_data` — a practical "bring your own data" bridge (30-minute recipe)
 
 1. Learning goals
 =================
diff --git a/docs/source/workbook/track_d_my_own_data.rst b/docs/source/workbook/track_d_my_own_data.rst
index e183a01..8e1cdcd 100644
--- a/docs/source/workbook/track_d_my_own_data.rst
+++ b/docs/source/workbook/track_d_my_own_data.rst
@@ -4,183 +4,343 @@
 Track D: Apply what you learned to your data
 ============================================
 
-This page is the “bridge” between *the Track D running case* (NSO v1 + LedgerLab)
+This page is the *bridge* between the Track D running case (LedgerLab + NSO v1)
 and your own accounting / bookkeeping / finance data.
 
-**Goal:** take the same analyst habits you practiced in Track D (contracts, checks, joins,
-reproducible outputs, and clear communication) and apply them to real data responsibly.
+**The promise of Track D:** you leave with analyst habits that transfer: -If you haven’t yet, start here: +- think in **tables** (grain + keys + joins) +- write **contracts** before analysis +- run **checks** (duplicates, missingness, reconciliations) +- produce **reproducible outputs** (tables + charts + short memo) -- :doc:`track_d_student_edition` (Student Edition entry point) -- :doc:`track_d_dataset_map` (dataset mental model: tables, keys, intentional QC “warts”) -- :doc:`track_d_outputs_guide` (what each output file means and how to read it) +If you haven’t yet, skim these pages first: -What “your own data” usually looks like -======================================= +- :doc:`track_d_student_edition` (entry point) +- :doc:`track_d_dataset_map` (what the Track D tables are) +- :doc:`track_d_outputs_guide` (what the Track D scripts write) -Most student projects look like one of these: +Two paths +========= -1) **Accounting exports** - - QuickBooks / Xero exports (GL detail, trial balance, A/R aging, A/P aging) - - Bank CSV exports (transactions, balances) - - Payroll exports (wage detail, remittances) +Most student projects fall into one of these paths. +Pick the one that matches what you *actually* have today. -2) **Operational tables that explain the numbers** - - invoices / sales orders - - purchase orders - - inventory movements - - time sheets / payroll runs +.. list-table:: Choose your path + :header-rows: 1 + :widths: 28 72 -3) **A “chart of accounts”** - - account_id + account_name + type (“Asset”, “Liability”, …) + * - Path + - When it’s a good fit + * - **Path A: exports only** + - You have bank transactions, invoices, bills, payroll exports, but **no** clean GL detail export. + You can still build a monthly "TB-style" table (a consistent monthly rollup) and do real analysis. + * - **Path B: GL export** + - You have a **General Ledger detail** (or journal export) with debits/credits by account. 
+ You can validate fast and get to trial-balance and statement style outputs quickly. -The Track D case teaches you how to *think in tables*: -each table has a grain, keys, expected constraints, and known failure modes. +The "first 30 minutes" recipe below is intentionally minimal. +You can make it fancier later. -The rule: do NOT start with modeling — start with contracts -=========================================================== +Setup once (recommended) +======================== -Before you run any stats, write down: +Create a small project folder structure and keep raw exports untouched. -- **What is a row?** (grain) -- **What uniquely identifies a row?** (primary key) -- **What columns are required?** -- **What values are allowed?** (domain checks) -- **What must be true across tables?** (invariants) +.. code-block:: text -This is what turns you into an “accounting-data analyst” instead of a “Python runner.” + my_trackd_project/ + raw/ # untouched exports + working/ # renamed columns, cleaned values + outputs/ # your tables + charts + short memo + notes/ # assumptions, mapping notes, QA notes -A practical workflow you can reuse -================================== +**Privacy note:** remove names / emails / account numbers before sharing. -Step 0 — Make a safe copy -------------------------- +Path A: exports only (bank + invoices/bills) +============================================ + +You can do meaningful analysis without a full GL. +Your first goal is a **monthly rollup table** you can trust. +Think of it as "trial-balance style": consistent columns, consistent sign conventions, +reproducible from your exports. 
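Before the column contract, it helps to see what the ``raw/``-to-``working/`` normalization step looks like in code. This is a minimal sketch: the raw header names (``Txn Date``, ``Debit``, ``Credit``) are hypothetical stand-ins for whatever your bank actually exports, so adjust the rename mapping to your file.

```python
import pandas as pd

# Hypothetical raw export (header names vary by bank; adjust the mapping below).
raw = pd.DataFrame(
    {
        "Txn Date": ["01/15/2024", "01/20/2024"],
        "Description": ["Customer deposit", "Office rent"],
        "Debit": [None, 1200.00],   # outflows, as this hypothetical bank exports them
        "Credit": [2500.00, None],  # inflows, same hypothetical layout
    }
)

# 1) Normalize headers once, in one place.
clean = raw.rename(columns={"Txn Date": "posted_date", "Description": "description"})

# 2) Collapse separate debit/credit columns into one signed amount
#    (inflows positive, outflows negative), removing the originals.
debit = pd.to_numeric(clean.pop("Debit"), errors="coerce").fillna(0.0)
credit = pd.to_numeric(clean.pop("Credit"), errors="coerce").fillna(0.0)
clean["amount"] = credit - debit

# 3) Parse dates with an explicit format so surprises fail loudly.
clean["posted_date"] = pd.to_datetime(clean["posted_date"], format="%m/%d/%Y")
```

The result carries exactly the ``posted_date`` / ``description`` / ``amount`` contract, so the rest of Path A applies unchanged; save it under ``working/`` rather than editing the raw export.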
+ +Minimum required columns +------------------------ + +**Bank transactions (CSV)** + +- ``posted_date`` (or ``date``): transaction date (parseable) +- ``description``: text description +- ``amount``: numeric amount (see sign conventions below) +- *(strongly recommended)* ``bank_txn_id``: unique id from the bank export -Work on a copy of the data, not the original export. +**Invoices / sales (CSV)** (if you have them) -- Remove names / sensitive fields if you’re sharing work. -- Keep a read-only “raw/” folder and an editable “working/” folder. -- Record the export date and the system (QuickBooks/Xero/bank). +- ``invoice_id``: unique id +- ``invoice_date`` (or ``date``): invoice date +- ``amount``: invoice total (or subtotal + tax) +- *(optional but very useful)* ``customer_id`` or ``customer`` +- *(optional)* ``paid_date`` or ``status`` -Step 1 — Normalize column names +Sign conventions (decide early) ------------------------------- -Real exports have messy headers. -Pick one naming style (snake_case is easiest) and normalize: +Pick one convention and stick to it: + +- **cash inflows are positive** +- **cash outflows are negative** + +If your bank export uses the opposite (or splits debit/credit into separate columns), +normalize it during the "working" step. + +First 30 minutes recipe +----------------------- + +1) Copy exports into ``raw/``. +2) Create cleaned versions in ``working/`` with consistent headers. +3) Produce **one table + one chart + one sanity check**. + +Example script (save as ``working/path_a_30min.py`` and run with ``python working/path_a_30min.py``): + +.. 
code-block:: python + + from pathlib import Path + import pandas as pd + import matplotlib.pyplot as plt + + ROOT = Path(__file__).resolve().parents[1] + raw_bank = ROOT / "raw" / "bank.csv" + out_dir = ROOT / "outputs" + out_dir.mkdir(exist_ok=True) + + bank = pd.read_csv(raw_bank) + + # --- Required columns (rename if needed) + required = {"posted_date", "description", "amount"} + missing = required - set(bank.columns) + if missing: + raise ValueError(f"Bank export missing columns: {sorted(missing)}") + + # --- Dates + bank["posted_date"] = pd.to_datetime(bank["posted_date"], errors="coerce") + if bank["posted_date"].isna().any(): + bad = bank.loc[bank["posted_date"].isna(), :].head(5) + raise ValueError(f"Some bank dates failed to parse. Examples:\n{bad}") + + # --- Amounts + bank["amount"] = pd.to_numeric(bank["amount"], errors="coerce") + if bank["amount"].isna().any(): + bad = bank.loc[bank["amount"].isna(), :].head(5) + raise ValueError(f"Some bank amounts failed to parse. Examples:\n{bad}") + + # --- Optional QA: duplicate IDs + if "bank_txn_id" in bank.columns: + dup_ids = bank.loc[bank["bank_txn_id"].duplicated(), "bank_txn_id"].unique() + if len(dup_ids) > 0: + print(f"WARNING: duplicate bank_txn_id (showing up to 10): {dup_ids[:10]}") + + # --- Month rollup + bank["month"] = bank["posted_date"].dt.to_period("M").astype(str) -- ``Invoice #`` → ``invoice_id`` -- ``Txn Date`` → ``date`` -- ``Customer`` → ``customer`` + g = bank.groupby("month")["amount"] + monthly = pd.DataFrame( + { + "cash_in": g.apply(lambda s: s[s > 0].sum()), + "cash_out": g.apply(lambda s: -s[s < 0].sum()), + "net_cash": g.sum(), + "n_txns": g.size(), + } + ).reset_index() -Tip: keep a small mapping note in your project folder (even a plain text file). + # Sanity check: net_cash = cash_in - cash_out + max_err = (monthly["cash_in"] - monthly["cash_out"] - monthly["net_cash"]).abs().max() + if max_err > 1e-6: + raise ValueError(f"Sanity check failed (cash_in - cash_out != net_cash). 
max_err={max_err}") -Step 2 — Convert money fields to numeric ----------------------------------------- + # Save the table + monthly.to_csv(out_dir / "monthly_cash_summary.csv", index=False) -Accounting exports often include: + # One chart: net cash by month + plt.figure() + plt.plot(monthly["month"], monthly["net_cash"], marker="o") + plt.xticks(rotation=45, ha="right") + plt.title("Net cash flow by month") + plt.tight_layout() + plt.savefig(out_dir / "net_cash_by_month.png", dpi=150) -- currency symbols (``$1,234.56``) -- parentheses for negatives (``(123.45)``) -- commas as thousands separators + print("Wrote:") + print(" -", out_dir / "monthly_cash_summary.csv") + print(" -", out_dir / "net_cash_by_month.png") -Convert these into numeric columns early, and confirm: +What you just did is the Track D pattern: -- missing values are handled -- signs are correct -- totals match the source system +- define a contract (required columns) +- normalize types (dates, numeric) +- run one QA check (duplicates) +- produce a reproducible table + chart -Step 3 — Build “checkpoint” tables ----------------------------------- +From there, you can grow into richer questions (cash forecasting, seasonality, outliers, +customer payment behavior if you have invoices). -The Track D case repeatedly uses “checkpoint” tables: +Path B: GL export (journal / general ledger detail) +=================================================== -- **GL journal** (debits/credits by txn) -- **Trial balance** (ending balances by account) -- **Statements** (IS/BS/CF rollups by month) +If you have GL detail, you can validate and build trial-balance style tables fast. -For your own data, try to create at least one checkpoint table that you trust -(e.g., a Trial Balance export) so you can validate your reconstruction. 
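All of Path B rests on one invariant: every journal entry posts equal debits and credits. As a minimal sketch (the two-row ``gl`` frame and its account ids are made up for illustration):

```python
import pandas as pd

# A toy GL with one balanced journal entry (hypothetical accounts).
gl = pd.DataFrame(
    {
        "txn_id": ["J1", "J1"],
        "account_id": ["1000", "4000"],
        "debit": [500.0, 0.0],
        "credit": [0.0, 500.0],
    }
)

# The double-entry invariant: total debits equal total credits.
imbalance = gl["debit"].sum() - gl["credit"].sum()
assert abs(imbalance) < 1e-6, f"GL out of balance by {imbalance}"
```

When this check fails on a real export, the usual culprit is a filtered or truncated export rather than broken bookkeeping; testing the same invariant per month and per transaction id helps you localize where the export went wrong.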
+Minimum required columns +------------------------ -Step 4 — Run the minimum set of QA checks ------------------------------------------ +At minimum: -Borrow these habits from Track D: +- ``date``: transaction date +- ``account_id`` (or ``account``): account identifier +- ``debit``: numeric debit amount (0 if none) +- ``credit``: numeric credit amount (0 if none) -- **Uniqueness checks:** keys unique (no duplicate IDs) -- **Completeness:** required columns non-null -- **Range checks:** dates in expected window, amounts not absurd -- **Reconciliation checks:** totals tie out -- **Invariants:** debits = credits; Assets = Liabilities + Equity (when applicable) +Strongly recommended: -Document failures. In accounting analytics, failures are often the most useful results. +- ``txn_id`` (or ``journal_entry_id``): lets you check balance per transaction +- ``description`` -Step 5 — Only then: analysis + story ------------------------------------- +Even better (optional): -Once the data is clean enough, the “stats” becomes meaningful: +- a Chart of Accounts export you can join on ``account_id`` with ``account_name`` and ``account_type`` -- trends (month-over-month) -- segmentation (customer/vendor/product) -- variability (outliers, tails, seasonality) -- drivers (regression with careful interpretation) -- risk (probability of cash shortfalls, DSO tails) +First 30 minutes recipe +----------------------- -How to translate Track D tables to your tables -============================================== +Example script (save as ``working/path_b_30min.py`` and run with ``python working/path_b_30min.py``): -Use the mental model page as your guide: +.. code-block:: python -- :doc:`track_d_dataset_map` + from pathlib import Path + import pandas as pd + import matplotlib.pyplot as plt -**You don’t need every table.** -Pick the tables that let you answer your question *and* support validity checks. 
+ ROOT = Path(__file__).resolve().parents[1] + raw_gl = ROOT / "raw" / "gl_detail.csv" + out_dir = ROOT / "outputs" + out_dir.mkdir(exist_ok=True) -Examples: + gl = pd.read_csv(raw_gl) -- If your goal is **cash forecasting**, you need bank transactions + expected inflows/outflows. -- If your goal is **A/R collection analysis**, you need invoices + collections + customer IDs. -- If your goal is **profitability diagnostics**, you need revenue + COGS + operating expenses, ideally by month. + required = {"date", "account_id", "debit", "credit"} + missing = required - set(gl.columns) + if missing: + raise ValueError(f"GL export missing columns: {sorted(missing)}") -What to do when the data is “weird” -=================================== + gl["date"] = pd.to_datetime(gl["date"], errors="coerce") + if gl["date"].isna().any(): + bad = gl.loc[gl["date"].isna(), :].head(5) + raise ValueError(f"Some GL dates failed to parse. Examples:\n{bad}") -Track D intentionally includes “warts” to train you. + for col in ["debit", "credit"]: + gl[col] = pd.to_numeric(gl[col], errors="coerce").fillna(0.0) -Real data has the same issues, and the response is the same: + gl["month"] = gl["date"].dt.to_period("M").astype(str) -- Identify the anomaly -- Explain why it matters -- Decide whether to fix, exclude, or model it explicitly -- Document the decision + # Check 1: debits = credits per month (basic reconciliation) + monthly_dc = gl.groupby("month")[["debit", "credit"]].sum().reset_index() + monthly_dc["diff"] = monthly_dc["debit"] - monthly_dc["credit"] -Common “weird” patterns: + max_diff = monthly_dc["diff"].abs().max() + if max_diff > 1e-6: + print(monthly_dc) + raise ValueError( + "Debits != credits by month. This often means the export was filtered, " + "or some lines are missing. Fix this before you trust any analysis." 
+ ) -- duplicate transaction IDs -- negative inventory (timing, stockouts, backorders) -- negative A/R (credits, misapplied payments) -- month boundary issues (late postings, backdated entries) + # Optional Check 2: debits = credits per txn_id (stronger) + if "txn_id" in gl.columns: + per_txn = gl.groupby("txn_id")[["debit", "credit"]].sum() + per_txn["diff"] = per_txn["debit"] - per_txn["credit"] + bad = per_txn[per_txn["diff"].abs() > 1e-6] + if len(bad) > 0: + raise ValueError( + f"Found {len(bad)} unbalanced txn_id rows. " + "Your export may be incomplete or mis-keyed." + ) -A responsible sharing checklist -=============================== + # Build a monthly TB-style table: debit, credit, net (debit-positive convention) + tb = gl.groupby(["month", "account_id"], as_index=False)[["debit", "credit"]].sum() + tb["net"] = tb["debit"] - tb["credit"] -Before you share a report, make sure you can answer: + tb.to_csv(out_dir / "tb_monthly.csv", index=False) -- What data source(s) did you use and when was it exported? -- What transformations did you apply? -- What checks did you run, and what failed? -- What limitations remain? -- What decisions might change if the data is incomplete? 
+ # One chart: total debits and credits by month (a quick integrity picture) + plt.figure() + plt.plot(monthly_dc["month"], monthly_dc["debit"], marker="o", label="debit") + plt.plot(monthly_dc["month"], monthly_dc["credit"], marker="o", label="credit") + plt.xticks(rotation=45, ha="right") + plt.title("GL totals by month (should match)") + plt.legend() + plt.tight_layout() + plt.savefig(out_dir / "gl_debits_credits_by_month.png", dpi=150) + + print("Wrote:") + print(" -", out_dir / "tb_monthly.csv") + print(" -", out_dir / "gl_debits_credits_by_month.png") + +You now have: + +- a **reconciliation check** (debits == credits) +- a **monthly TB-style table** you can analyze +- a **chart** you can put in a memo + +From here, you can add one join (Chart of Accounts) and start doing Track D-style +statement rollups or account-level diagnostics. + +Common pitfalls (and what to do) +================================ + +Date parsing +------------ + +- Mixed formats (``MM/DD/YYYY`` vs ``YYYY-MM-DD``) cause silent errors. +- Always use ``pd.to_datetime(..., errors="coerce")`` and *fail fast* if you get nulls. + +Signs and "negative" exports +---------------------------- + +- Some systems export credits as negative numbers instead of a separate credit column. +- Decide your convention, normalize once, and document it in ``notes/assumptions.txt``. + +Duplicate IDs +------------- + +- Bank feeds can duplicate rows (Track D includes this on purpose). +- If you have IDs, check duplicates and decide: remove exact duplicates, or keep and flag. + +Missing months / incomplete exports +----------------------------------- + +- If debits != credits by month, the export is likely filtered or incomplete. +- Fix the export first. Don’t "patch" it with analysis code. + +Month boundaries +---------------- + +- Accruals, backdated entries, and late postings create confusing month swings. 
+- Track D teaches the right response: *describe it, measure it, and document the limitation.* + +Template header pack (copy/paste) +================================= + +If you want a starting point without hunting for column names, these tiny CSV templates +are safe to download and copy into your own project. + +- :download:`gl_detail_minimal.csv <_downloads/track_d_headers/gl_detail_minimal.csv>` +- :download:`chart_of_accounts_minimal.csv <_downloads/track_d_headers/chart_of_accounts_minimal.csv>` +- :download:`bank_transactions_minimal.csv <_downloads/track_d_headers/bank_transactions_minimal.csv>` +- :download:`invoices_minimal.csv <_downloads/track_d_headers/invoices_minimal.csv>` Next steps ========== -If you want a guided assignment structure, see the upcoming Track D assignment pages -(coming as separate docs PRs) that include: - -- a “my own data” project template -- a rubric that rewards good contracts + checks + communication -- example memos and charts that avoid common mistakes +Once you can produce one table + one chart + one check, you are ready for: -For now, you can still practice the core workflow by using Track D’s reproducible case -and mirroring the same steps with your own exports. 
+- a short "executive summary" memo (what you found + what limits confidence)
+- deeper Track D-style joins (drivers, segmentation)
+- classroom labs and rubrics (the next Track D docs pages)
diff --git a/docs/source/workbook/track_d_student_edition.rst b/docs/source/workbook/track_d_student_edition.rst
index 145b22f..cc669da 100644
--- a/docs/source/workbook/track_d_student_edition.rst
+++ b/docs/source/workbook/track_d_student_edition.rst
@@ -52,6 +52,7 @@ These pages live inside the workbook documentation subtree (they build cleanly o
 - :doc:`track_d` — the Track D workbook page (run list + where to start)
 - :doc:`track_d_dataset_map` — the **dataset mental model** (what each table is, how they relate, and why some rows are “warts” on purpose)
 - :doc:`track_d_outputs_guide` — how to read the outputs folders and key CSV artifacts
+- :doc:`track_d_my_own_data` — bring your own exports: a 30-minute "one table + one chart + one check" recipe
 - :doc:`track_d_lab_ta_notes` — a lab handout + TA notes (walkthrough + interpretation)
 
 What you are building (the pipeline)