diff --git a/docs/source/workbook/index.rst b/docs/source/workbook/index.rst index 9ab6452..5ecbcc2 100644 --- a/docs/source/workbook/index.rst +++ b/docs/source/workbook/index.rst @@ -15,5 +15,7 @@ PyStatsV1 Workbook track_c track_d_student_edition track_d + track_d_dataset_map track_d_outputs_guide + track_d_my_own_data track_d_lab_ta_notes \ No newline at end of file diff --git a/docs/source/workbook/track_d.rst b/docs/source/workbook/track_d.rst index 3f92b4c..5110b3c 100644 --- a/docs/source/workbook/track_d.rst +++ b/docs/source/workbook/track_d.rst @@ -111,6 +111,10 @@ and are the same every time (seed=123). * key files include event logs (AR/AP/payroll/inventory), a bank statement feed, a generated general ledger, and monthly statements and trial balances +Before you run deeper chapters, skim: + +* :doc:`track_d_dataset_map` — what each table is, how tables relate, and why some rows look “weird” on purpose + If you want to see the full data dictionary for NSO v1: * `NSO v1 data dictionary cheat sheet (main docs) `_ diff --git a/docs/source/workbook/track_d_dataset_map.rst b/docs/source/workbook/track_d_dataset_map.rst new file mode 100644 index 0000000..fb791a0 --- /dev/null +++ b/docs/source/workbook/track_d_dataset_map.rst @@ -0,0 +1,307 @@ +Track D Dataset Map +=================== + +This page gives you a **dataset mental model** for Track D. + +If Track D is “learn to analyze accounting data,” then this page is the **map**: + +* what each table is, +* how tables relate, +* and why some rows look “wrong” on purpose (because real data is messy). + +.. tip:: + + Run ``pystatsv1 workbook run d00_peek_data`` first. + It prints a quick preview of the tables and writes a short summary markdown file + under ``outputs/track_d``. 
+
+Where the datasets live
+-----------------------
+
+When you run:
+
+* ``pystatsv1 workbook init --track d``
+
+...the workbook starter is created and the canonical (seeded) Track D datasets are
+installed into the workbook folder under:
+
+* ``data/synthetic/ledgerlab_ch01_seed123/``
+* ``data/synthetic/nso_v1_seed123/``
+
+These are small, deterministic datasets (seed=123) designed for learning.
+
+Two dataset families
+--------------------
+
+Track D ships two canonical dataset families (both deterministic with ``seed=123``):
+
+1) **LedgerLab (Ch01)** — a compact, clean “toy business” general ledger example.
+2) **NSO v1 running case** — a larger, more realistic multi-module accounting dataset
+   (A/R, A/P, bank, inventory, payroll, taxes, schedules) that rolls up into a general
+   ledger and financial statements.
+
+Using *two* datasets is deliberate, and you will use both:
+
+* LedgerLab is where you learn the **accounting database basics**: the invariants and
+  the “shape” of ledger data.
+* NSO v1 is where you practice **being an accounting-data analyst**: messy source
+  tables, reconciliations, and quality control.
+
+
+LedgerLab (Ch01): COA → GL → TB → Statements
+--------------------------------------------
+
+LedgerLab is intentionally small so you can see the whole pipeline at once.
+ +:: + + chart_of_accounts.csv + | + v + gl_journal.csv (journal lines: debits/credits) + | + v + trial_balance_monthly.csv + | + v + statements_is_monthly.csv + statements_bs_monthly.csv + statements_cf_monthly.csv + +**What the tables mean** + +``chart_of_accounts.csv`` + The chart of accounts (COA): the “dictionary” of accounts. + +``gl_journal.csv`` + The general ledger journal in long format. + + * One business transaction (``txn_id``) typically appears as **multiple lines**. + * Debits and credits should balance **within each ``txn_id``**. + * Lines carry account metadata (name/type/normal side) to make analysis easier. + +``trial_balance_monthly.csv`` + A monthly trial balance per account. + + * It aggregates the journal by account and month. + * It exposes the **ending balance** and the **ending side** (Debit/Credit). + +``statements_*_monthly.csv`` + Monthly financial statement lines. + + * IS: revenue, COGS, expenses, net income + * BS: assets, liabilities, equity totals + * CF: net income plus working-capital deltas + +**Why LedgerLab matters** + +Before you can analyze “business performance,” you must trust the accounting pipeline. +LedgerLab lets you practice asking: + +* “Is each transaction balanced?” +* “Does the accounting equation hold?” +* “Do TB and statements agree?” + +Those checks show up in the Track D workbook outputs and are part of what makes you a stronger analyst. + +NSO v1 running case +------------------- + +NSO v1 is a **running business case** designed to look like the output of several operational systems. +In real organizations, you rarely start from a clean GL export. +You start from **subledgers and operational logs**, then reconcile and validate. + +The NSO flow (high level) +^^^^^^^^^^^^^^^^^^^^^^^^^ + +In NSO, the operational tables describe what happened. The GL journal is the accounting translation. 
+ +:: + + (AR events) (AP events) (Bank statement) + \ | / + \ | / + v v v + (Inventory movements) (Payroll + tax) (Schedules) + \ | / + \ | / + v v v + gl_journal.csv + | + v + trial_balance_monthly.csv + | + v + statements_is_monthly.csv / statements_bs_monthly.csv / statements_cf_monthly.csv + +Source tables (what they represent) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Below is a quick “what is this?” guide. These are the tables you analyze, reconcile, and use to build features. + +``ar_events.csv`` + Customer invoices and collections. + + * Invoices increase A/R (``ar_delta`` positive). + * Collections decrease A/R and increase cash received. + +``ap_events.csv`` + Vendor invoices and payments. + + * Invoices increase A/P (``ap_delta`` positive). + * Payments reduce A/P and reduce cash. + +``bank_statement.csv`` + What the bank says happened. + + * One row per bank posting (cash in/out). + * ``gl_txn_id`` links a bank row to the accounting transaction that explains it. + +``inventory_movements.csv`` + Purchases and sales issues by SKU. + + * Purchases add units and cost. + * Sales issues remove units and cost. + +``payroll_events.csv`` + Payroll accruals, withholding, employer taxes, and cash payments. + +``sales_tax_events.csv`` + Sales tax collected (liability increases) and remittances (liability decreases). + +``fixed_assets.csv`` + Asset master file (what was purchased, when it went into service, and how it should be depreciated). + +``depreciation_schedule.csv`` + The depreciation schedule (a helper table you can audit). + +``debt_schedule.csv`` + Loan activity: beginning balance, interest, principal, ending balance. + +``equity_events.csv`` + Owner contributions and draws. 
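The ``ar_events.csv`` description above can be made concrete with a small pandas sketch: roll the signed ``ar_delta`` values up by ``month``, then cumulate them into an ending A/R balance. This is a sketch on invented toy rows, not the real NSO data — only the ``month`` and ``ar_delta`` column names come from the descriptions here, and the ``event_type`` column is hypothetical:

```python
import pandas as pd

# Toy stand-in for ar_events.csv: invoices increase A/R (positive ar_delta),
# collections decrease it (negative ar_delta).
ar_events = pd.DataFrame(
    {
        "month": ["2024-01", "2024-01", "2024-02", "2024-02"],
        "event_type": ["invoice", "collection", "invoice", "collection"],
        "ar_delta": [1000.0, -400.0, 800.0, -900.0],
    }
)

# Net monthly change in A/R, then a running (ending) balance per month.
monthly_delta = ar_events.groupby("month")["ar_delta"].sum().sort_index()
ending_ar = monthly_delta.cumsum()

print(ending_ar)
# 2024-01 ends at 600.0; 2024-02 ends at 500.0
```

The same delta-then-running-balance pattern applies to ``ap_delta`` in ``ap_events.csv``.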
+ +Accounting output tables +^^^^^^^^^^^^^^^^^^^^^^^^ + +Just like LedgerLab, NSO has accounting outputs: + +* ``gl_journal.csv`` — the accounting translation in long format +* ``trial_balance_monthly.csv`` — monthly balances per account +* ``statements_*_monthly.csv`` — monthly statement lines + +Your job in Track D is to learn how to: + +* explain statements using the underlying operational tables, +* reconcile cash and working-capital movements, +* and detect data quality problems before they mislead your analysis. + +Keys and joins cheat sheet +-------------------------- + +Most Track D work is “join, aggregate, compare.” These keys show up across tables. + +.. list-table:: Keys you will use again and again + :header-rows: 1 + :widths: 20 80 + + * - Key + - Meaning and typical use + * - ``month`` + - Accounting/reporting period in ``YYYY-MM`` format. Use it to group, pivot, and line up monthly outputs. + * - ``txn_id`` + - The **accounting transaction id**. In ``gl_journal.csv`` it identifies all lines that belong to one balanced transaction. + * - ``doc_id`` + - Human-friendly document id (invoice/payment/asset/etc.) often used to trace a transaction in reports. + * - ``account_id`` + - Chart-of-accounts id. Join ``gl_journal.csv`` and ``trial_balance_monthly.csv`` to ``chart_of_accounts.csv`` on this. + * - ``invoice_id`` + - Subledger document id for AR/AP. Often connects an invoice event to a later collection/payment event. + * - ``bank_txn_id`` + - Bank-provided transaction identifier. Should be unique in a perfect world (but see the QC section below). + * - ``gl_txn_id`` + - Link from a bank row back to the accounting transaction that explains it. + * - ``sku`` + - Product identifier used to group inventory movements and compute unit economics. + * - ``asset_id`` / ``loan_id`` + - Keys for fixed asset and debt schedules. 
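The ``account_id`` and ``txn_id`` rows of the cheat sheet can be exercised with a small pandas sketch: join journal lines to the COA, then verify that each ``txn_id`` balances. The mini tables below are hypothetical stand-ins with invented rows, not the real NSO schemas:

```python
import pandas as pd

# Hypothetical mini versions of the two tables; real column names may differ.
chart_of_accounts = pd.DataFrame(
    {
        "account_id": ["1000", "4000"],
        "account_name": ["Cash", "Sales Revenue"],
        "account_type": ["Asset", "Revenue"],
    }
)
gl_journal = pd.DataFrame(
    {
        "txn_id": ["T1", "T1"],
        "account_id": ["1000", "4000"],
        "debit": [250.0, 0.0],
        "credit": [0.0, 250.0],
    }
)

# Join journal lines to the COA on account_id, then check that each
# txn_id balances (total debits == total credits).
lines = gl_journal.merge(chart_of_accounts, on="account_id", how="left")
per_txn = lines.groupby("txn_id")[["debit", "credit"]].sum()
unbalanced = per_txn[per_txn["debit"] != per_txn["credit"]]

print(unbalanced.empty)  # True when every transaction balances
```

Using ``how="left"`` keeps journal lines even when an account is missing from the COA — which is itself a QC finding worth flagging.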
+ +A practical join pattern +^^^^^^^^^^^^^^^^^^^^^^^^ + +A common Track D pattern is: + +1) Start from a statement line (for example, ``Sales Revenue`` in ``statements_is_monthly.csv``). +2) Find the accounting lines that feed it (filter ``gl_journal.csv`` to revenue accounts). +3) Trace back to operational drivers (AR events, cash sales, product mix, seasonality). + +This is the accounting-analyst version of “feature engineering.” + +Intentional QC issues (the “warts”) +----------------------------------- + +Some tables include a few “warts” on purpose. +They exist so you can practice the kind of quality-control checks analysts do in real jobs. + +.. admonition:: Example warts you should notice + :class: note + + * **Duplicate bank transaction id** + + In ``bank_statement.csv`` you should see a duplicated ``bank_txn_id`` row (it’s tagged in the description output by ``d00_peek_data``). + This teaches you not to assume the bank feed is automatically clean. + + * **Negative inventory can appear** + + Inventory can go negative early in a period. + That can mean a stockout, a late purchase posting, or a timing mismatch between operational and accounting systems. + The analysis lesson is: *don’t hide it; explain it and measure its impact.* + + * **Aggregated rows vs. transactional rows** + + Some operational tables mix “transaction-like” rows with monthly totals. + For example, ``sales_tax_events.csv`` can have collection rows without a concrete ``txn_id``. + That teaches you to design joins carefully and document assumptions. + +The point is not to “fix the dataset.” +The point is to learn to write analyses that remain correct even when the inputs are imperfect. + +How to use this map in a lab +---------------------------- + +A simple lab rhythm that works well: + +1) **Peek**: run ``d00_peek_data`` and skim the previews. +2) **Pick a question**: choose a business question (profitability, cash, working capital, seasonality, drivers). 
+3) **Choose a level**:
+
+   * start at **statements** if you are answering an executive question,
+   * start at **trial balance** if you are reconciling,
+   * start at **operational tables** if you are building drivers.
+
+4) **Trace and explain**:
+
+   * statements → ledger lines (GL) → operational drivers,
+   * and annotate any QC issues you encounter.
+
+5) **Write the story**: the end product is a short, well-documented explanation.
+
+For more detail on what each script writes to ``outputs/track_d/`` and how to read it, see:
+
+* :doc:`track_d_outputs_guide`
+
+If you want to run Track D on your own exported data later, see:
+
+* :doc:`track_d_my_own_data`
diff --git a/docs/source/workbook/track_d_lab_ta_notes.rst b/docs/source/workbook/track_d_lab_ta_notes.rst
index cddd409..57830ad 100644
--- a/docs/source/workbook/track_d_lab_ta_notes.rst
+++ b/docs/source/workbook/track_d_lab_ta_notes.rst
@@ -13,6 +13,15 @@ This handout is for a TA running a lab section where students install **PyStatsV
 It includes what to say, what students should see, and how to explain the output.
 
+Recommended pre-reading (TA)
+----------------------------
+
+If you have 5 minutes before lab, skim these pages:
+
+* :doc:`track_d_student_edition` — the “book-style” Track D entry point
+* :doc:`track_d_dataset_map` — the table mental model + intentional QC “warts”
+* :doc:`track_d_outputs_guide` — what each generated output means and how to interpret it
+
 1. Learning goals
 =================
diff --git a/docs/source/workbook/track_d_my_own_data.rst b/docs/source/workbook/track_d_my_own_data.rst
new file mode 100644
index 0000000..e183a01
--- /dev/null
+++ b/docs/source/workbook/track_d_my_own_data.rst
@@ -0,0 +1,186 @@
+.. _track_d_my_own_data:
+
+============================================
+Track D: Apply what you learned to your data
+============================================
+
+This page is the “bridge” between *the Track D running case* (NSO v1 + LedgerLab)
+and your own accounting / bookkeeping / finance data.
+
+**Goal:** take the same analyst habits you practiced in Track D (contracts, checks, joins,
+reproducible outputs, and clear communication) and apply them to real data responsibly.
+
+If you haven’t yet, start here:
+
+- :doc:`track_d_student_edition` (Student Edition entry point)
+- :doc:`track_d_dataset_map` (dataset mental model: tables, keys, intentional QC “warts”)
+- :doc:`track_d_outputs_guide` (what each output file means and how to read it)
+
+What “your own data” usually looks like
+=======================================
+
+Most student projects look like one of these:
+
+1) **Accounting exports**
+
+   - QuickBooks / Xero exports (GL detail, trial balance, A/R aging, A/P aging)
+   - Bank CSV exports (transactions, balances)
+   - Payroll exports (wage detail, remittances)
+
+2) **Operational tables that explain the numbers**
+
+   - invoices / sales orders
+   - purchase orders
+   - inventory movements
+   - time sheets / payroll runs
+
+3) **A “chart of accounts”**
+
+   - ``account_id`` + ``account_name`` + type (“Asset”, “Liability”, …)
+
+The Track D case teaches you how to *think in tables*:
+each table has a grain, keys, expected constraints, and known failure modes.
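A minimal sketch of a grain check, using a hypothetical bank export that contains the same kind of duplicated ``bank_txn_id`` wart the Track D case plants on purpose (toy data; all values are invented):

```python
import pandas as pd

# Hypothetical bank export; the repeated id mirrors the "duplicate
# bank_txn_id" wart that Track D includes intentionally.
bank = pd.DataFrame(
    {
        "bank_txn_id": ["B1", "B2", "B2", "B3"],
        "amount": [100.0, -50.0, -50.0, 25.0],
    }
)

# Grain check: if bank_txn_id is the primary key, every value must be unique.
# keep=False marks *all* rows involved in a duplicate, not just the repeats.
dupes = bank[bank["bank_txn_id"].duplicated(keep=False)]

print(len(dupes))  # 2 rows share the same bank_txn_id
```

On your own data, run the same check on whatever column you claim is the table's primary key — before you join on it.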
+ +The rule: do NOT start with modeling — start with contracts +=========================================================== + +Before you run any stats, write down: + +- **What is a row?** (grain) +- **What uniquely identifies a row?** (primary key) +- **What columns are required?** +- **What values are allowed?** (domain checks) +- **What must be true across tables?** (invariants) + +This is what turns you into an “accounting-data analyst” instead of a “Python runner.” + +A practical workflow you can reuse +================================== + +Step 0 — Make a safe copy +------------------------- + +Work on a copy of the data, not the original export. + +- Remove names / sensitive fields if you’re sharing work. +- Keep a read-only “raw/” folder and an editable “working/” folder. +- Record the export date and the system (QuickBooks/Xero/bank). + +Step 1 — Normalize column names +------------------------------- + +Real exports have messy headers. +Pick one naming style (snake_case is easiest) and normalize: + +- ``Invoice #`` → ``invoice_id`` +- ``Txn Date`` → ``date`` +- ``Customer`` → ``customer`` + +Tip: keep a small mapping note in your project folder (even a plain text file). 
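Step 1 can be sketched in pandas with an explicit rename mapping. The raw headers below mirror the examples above; the sample row is invented for illustration:

```python
import pandas as pd

# Hypothetical raw export with messy headers.
raw = pd.DataFrame(
    {"Invoice #": ["INV-1"], "Txn Date": ["2024-01-05"], "Customer": ["Acme"]}
)

# Keep the mapping explicit (and copy it into your notes file) instead of
# renaming ad hoc in several places.
RENAMES = {
    "Invoice #": "invoice_id",
    "Txn Date": "date",
    "Customer": "customer",
}
clean = raw.rename(columns=RENAMES)

print(list(clean.columns))  # ['invoice_id', 'date', 'customer']
```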
+ +Step 2 — Convert money fields to numeric +---------------------------------------- + +Accounting exports often include: + +- currency symbols (``$1,234.56``) +- parentheses for negatives (``(123.45)``) +- commas as thousands separators + +Convert these into numeric columns early, and confirm: + +- missing values are handled +- signs are correct +- totals match the source system + +Step 3 — Build “checkpoint” tables +---------------------------------- + +The Track D case repeatedly uses “checkpoint” tables: + +- **GL journal** (debits/credits by txn) +- **Trial balance** (ending balances by account) +- **Statements** (IS/BS/CF rollups by month) + +For your own data, try to create at least one checkpoint table that you trust +(e.g., a Trial Balance export) so you can validate your reconstruction. + +Step 4 — Run the minimum set of QA checks +----------------------------------------- + +Borrow these habits from Track D: + +- **Uniqueness checks:** keys unique (no duplicate IDs) +- **Completeness:** required columns non-null +- **Range checks:** dates in expected window, amounts not absurd +- **Reconciliation checks:** totals tie out +- **Invariants:** debits = credits; Assets = Liabilities + Equity (when applicable) + +Document failures. In accounting analytics, failures are often the most useful results. + +Step 5 — Only then: analysis + story +------------------------------------ + +Once the data is clean enough, the “stats” becomes meaningful: + +- trends (month-over-month) +- segmentation (customer/vendor/product) +- variability (outliers, tails, seasonality) +- drivers (regression with careful interpretation) +- risk (probability of cash shortfalls, DSO tails) + +How to translate Track D tables to your tables +============================================== + +Use the mental model page as your guide: + +- :doc:`track_d_dataset_map` + +**You don’t need every table.** +Pick the tables that let you answer your question *and* support validity checks. 
+ +Examples: + +- If your goal is **cash forecasting**, you need bank transactions + expected inflows/outflows. +- If your goal is **A/R collection analysis**, you need invoices + collections + customer IDs. +- If your goal is **profitability diagnostics**, you need revenue + COGS + operating expenses, ideally by month. + +What to do when the data is “weird” +=================================== + +Track D intentionally includes “warts” to train you. + +Real data has the same issues, and the response is the same: + +- Identify the anomaly +- Explain why it matters +- Decide whether to fix, exclude, or model it explicitly +- Document the decision + +Common “weird” patterns: + +- duplicate transaction IDs +- negative inventory (timing, stockouts, backorders) +- negative A/R (credits, misapplied payments) +- month boundary issues (late postings, backdated entries) + +A responsible sharing checklist +=============================== + +Before you share a report, make sure you can answer: + +- What data source(s) did you use and when was it exported? +- What transformations did you apply? +- What checks did you run, and what failed? +- What limitations remain? +- What decisions might change if the data is incomplete? + +Next steps +========== + +If you want a guided assignment structure, see the upcoming Track D assignment pages +(coming as separate docs PRs) that include: + +- a “my own data” project template +- a rubric that rewards good contracts + checks + communication +- example memos and charts that avoid common mistakes + +For now, you can still practice the core workflow by using Track D’s reproducible case +and mirroring the same steps with your own exports. 
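As a concrete starting point, the Step 2 money-cleanup habit described earlier can be sketched as a small pandas helper. The function name and sample values are invented for illustration:

```python
import pandas as pd

# Hypothetical export with accounting-style amounts: currency symbols,
# thousands separators, and parentheses for negatives.
raw = pd.Series(["$1,234.56", "(123.45)", "$0.00"])

def to_numeric_money(s: pd.Series) -> pd.Series:
    """Strip $ and commas, turn (x) into -x, and convert to float."""
    cleaned = (
        s.str.replace("$", "", regex=False)
         .str.replace(",", "", regex=False)
         .str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    )
    return pd.to_numeric(cleaned)

amounts = to_numeric_money(raw)
print(amounts.tolist())  # [1234.56, -123.45, 0.0]

# Step 4 habit: no silent NaNs, and the total should tie out to the source system.
assert amounts.isna().sum() == 0
```

If a check like the final assertion fails, that is a result worth documenting, not just an error to silence.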
diff --git a/docs/source/workbook/track_d_student_edition.rst b/docs/source/workbook/track_d_student_edition.rst index aa270f6..145b22f 100644 --- a/docs/source/workbook/track_d_student_edition.rst +++ b/docs/source/workbook/track_d_student_edition.rst @@ -49,7 +49,9 @@ These pages live inside the workbook documentation subtree (they build cleanly o - :doc:`quickstart` — first-time setup and commands - :doc:`workflow` — how “run” vs “check” works, outputs conventions, troubleshooting -- :doc:`track_d` — Track D workbook quickstart + dataset map (seed=123) +- :doc:`track_d` — the Track D workbook page (run list + where to start) +- :doc:`track_d_dataset_map` — the **dataset mental model** (what each table is, how they relate, and why some rows are “warts” on purpose) +- :doc:`track_d_outputs_guide` — how to read the outputs folders and key CSV artifacts - :doc:`track_d_lab_ta_notes` — a lab handout + TA notes (walkthrough + interpretation) What you are building (the pipeline)