feat: add tableau-extract-calcs script and tableau-upstream slash command #3473

Closed
GabyRangelB wants to merge 14 commits into main from claude/feat/tableau-upstream-tool

Conversation

@GabyRangelB
Contributor

Summary

  • Adds scripts/tableau-extract-calcs.py to extract calculated fields from Tableau workbooks (local .twbx/.twb or downloaded from Tableau Server). Supports exposure-aware mode that resolves the workbook LSID and depends_on models directly from models/exposures/tableau.yml.
  • Adds .claude/commands/tableau-upstream.md — a /tableau-upstream slash command that guides Claude through the full upstream workflow: extract calcs → identify target rpt_tableau__ model → translate Tableau formula syntax to BigQuery SQL → update model SQL and YAML → validate with dbt show.

Closes #3472

Test plan

  • Run uv run scripts/tableau-extract-calcs.py --exposure gradebook_and_gpa_dashboard --list-only and confirm it prints the workbook LSID and depends_on models from tableau.yml without making network calls
  • Run the script against a local .twbx file and confirm the markdown table of calculated fields is printed correctly
  • Run /tableau-upstream in a new Claude session and confirm the guided workflow prompts correctly at each step

🤖 Generated with Claude Code

…mand

Adds two artifacts to support moving Tableau calculated fields upstream
into dbt models:

- scripts/tableau-extract-calcs.py: extracts calculated fields from a
  .twbx/.twb file (local or downloaded from Tableau Server). Supports
  exposure-aware mode that looks up the workbook LSID and depends_on
  models directly from models/exposures/tableau.yml.

- .claude/commands/tableau-upstream.md: /tableau-upstream slash command
  that guides Claude through the full workflow — run the extraction
  script, identify the target rpt_tableau__ model, translate Tableau
  formula syntax to BigQuery SQL, update the model SQL and YAML, and
  validate with dbt show.

Closes #3472

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@GabyRangelB GabyRangelB requested a review from a team as a code owner March 17, 2026 23:10
@github-actions
Contributor

github-actions bot commented Mar 17, 2026

Claude finished @GabyRangelB's task in 1m 10s


Code Review: PR #3473 — tableau-extract-calcs + tableau-upstream command

Tasks

  • Gather context and read changed files
  • Review scripts/tableau-extract-calcs.py
  • Review .claude/commands/tableau-upstream.md
  • Post final review

scripts/tableau-extract-calcs.py

Overall this is clean, well-structured, and follows the project's uv inline-script convention correctly. A few issues worth addressing:

Bug: extract_twb_bytes_from_bytes defined after it is called

extract_twb_bytes_from_bytes is called at lines 256 and 272 but is defined at line 278 — after main(). In Python this only works if main() is called at runtime (which it is via if __name__ == "__main__"), so it won't cause a NameError at import time. But it's confusing ordering. Move the function up near extract_twb_bytes (around line 32) so all XML-extraction helpers are co-located.

Unhandled StopIteration in zip traversal

# lines 30-31 and 281
twb_name = next(n for n in zf.namelist() if n.endswith(".twb"))

If the zip contains no .twb entry this raises a bare StopIteration. Should be:

twb_name = next((n for n in zf.namelist() if n.endswith(".twb")), None)
if twb_name is None:
    raise ValueError(f"No .twb file found in archive: {path}")

No signout call after Tableau Server auth

signin() acquires a server token but there is no corresponding sign-out call (POST /auth/signout in the Tableau REST API). For an admin-level personal access token this may not matter, but for a username/password session it leaves an active session on the server. Consider adding a signout() helper and calling it in a try/finally block.
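A minimal sketch of the try/finally shape, with the download and sign-out steps injected as callables (the names here are illustrative, not the script's actual helpers):

```python
def run_with_signout(download, signout):
    """Run a download step, always invalidating the session afterward.

    `download` and `signout` are placeholder callables; in the real script,
    signout would POST to /api/<version>/auth/signout with the X-Tableau-Auth
    header before the token is discarded.
    """
    try:
        return download()
    finally:
        # Runs on success and on error, so no server session is left open.
        signout()
```

Injecting the two steps keeps the cleanup guarantee testable without a live server.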

stream=True on download but resp.content used

Line 89–91:

resp = requests.get(url, ..., stream=True, timeout=60)
resp.raise_for_status()
return resp.content

stream=True defers body loading, but resp.content then loads the entire response into memory anyway. Either use stream=True with chunked reading (appropriate for very large workbooks), or drop stream=True and keep resp.content. For typical .twbx sizes, dropping stream=True is simpler.
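If large workbooks ever make chunked reading worthwhile, the accumulation side can be sketched independently of requests; with requests you would feed it `resp.iter_content(chunk_size=...)` as the chunk iterable:

```python
def collect_chunks(chunks):
    """Accumulate an iterable of byte chunks into a single bytes object,
    using a bytearray to avoid repeated bytes concatenation."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
    return bytes(buf)

# With requests this pairing would look like (sketch, not the script's code):
#   resp = requests.get(url, stream=True, timeout=60)
#   body = collect_chunks(resp.iter_content(chunk_size=1 << 20))
```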

Credential handling

Passwords are collected via getpass.getpass() and passed as plain strings. This is fine for a CLI script, but worth noting: the password is briefly present in the signin() JSON payload in memory. No issue for typical use, just a heads-up if secrets scanning is a concern.

Minor: workbook_id can be None at line 244 check

get_workbook_id_from_exposure returns str | None. The if not workbook_id: check at line 244 correctly handles None and "". This is fine.

Minor: --list-only with --file silently ignored

If a user passes --file myfile.twbx --list-only, the --list-only flag is silently ignored (the file path hits the early return at line 226 before the list_only check). Consider either:

  • Wiring --list-only to be only valid with --exposure (via argparse group constraints), or
  • At minimum noting in the --list-only help string that it only applies to --exposure mode.
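A sketch of the post-parse guard: argparse group constraints cannot express "flag A requires flag B" directly, so a manual check after parse_args is the usual pattern (flag names follow the PR description; the helper names are illustrative):

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(prog="tableau-extract-calcs")
    p.add_argument("--file", help="local .twbx/.twb path")
    p.add_argument("--exposure", help="exposure name from tableau.yml")
    p.add_argument(
        "--list-only",
        action="store_true",
        help="print the workbook LSID and depends_on models only "
             "(only meaningful with --exposure)",
    )
    return p


def parse_and_validate(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.list_only and not args.exposure:
        # Fail loudly instead of silently ignoring the flag.
        parser.error("--list-only requires --exposure")
    return args
```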

.claude/commands/tableau-upstream.md

Clean, well-structured command. The step-by-step flow is logical and the translation table covers the common cases. A few observations:

Step 2 hard-codes kipptaf path

src/dbt/kipptaf/models/extracts/tableau/rpt_tableau__<name>.sql

The script itself hard-codes EXPOSURES_PATH to kipptaf as well. This is fine since the Tableau extract workflow only applies to kipptaf, but worth a short comment in both places so future contributors know this is intentional.

Step 3 translation table — missing ATTR() and MIN()/MAX() as scalar

Tableau authors frequently use ATTR([Field]) (returns the single value of a dimension if unambiguous) and MIN([Field])/MAX([Field]) as row-level scalars (not aggregates). These translate to a plain column reference in BigQuery SQL. Not blocking, but adding them would prevent confusion when encountered.

Step 5 validation sequence

The dagster-dbt project prepare-and-package command is listed as a prerequisite for dbt show, which is correct. However, the instruction says "Then via the dbt MCP tool or terminal" — it may be worth being explicit that when using the dbt MCP tool, prepare-and-package may already have been run by a previous step, so Claude should check rather than re-run it unconditionally (it can be slow).

No mention of dbt compile as a faster first check

Running dbt show on a model with new columns added is the right final check, but dbt compile --select <model> is faster for catching SQL syntax errors before hitting the warehouse. Consider adding it between editing the SQL and running dbt show.


Summary

The core logic is solid and the script is a genuine productivity tool. The main items to address before merging:

  1. Bug: Move extract_twb_bytes_from_bytes before main() (function ordering)
  2. Bug: Handle StopIteration in .twb zip name lookup
  3. Improvement: Clarify --list-only only applies to --exposure mode
  4. Nice-to-have: Drop stream=True or use it with chunked reading consistently

The slash command is ready to merge as-is; the notes above are optional improvements.

GabyRangelB and others added 8 commits March 17, 2026 23:20
- Replace xml.etree.ElementTree with defusedxml to resolve Bandit B314
- Default --server and --site from TABLEAU_SERVER_ADDRESS / TABLEAU_SITE_ID
  env vars (already injected via 1Password for all team members)
- Authenticate via TABLEAU_TOKEN_NAME + TABLEAU_PERSONAL_ACCESS_TOKEN env
  vars by default (no password prompt); fall back to --username/password
  only when PAT env vars are absent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
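The fallback order described above can be sketched as a small resolver (the env var names come from the commit message; the return shape is illustrative):

```python
import os


def resolve_auth(env=None):
    """Prefer PAT env vars; otherwise signal that the caller should prompt."""
    env = os.environ if env is None else env
    token_name = env.get("TABLEAU_TOKEN_NAME")
    token_secret = env.get("TABLEAU_PERSONAL_ACCESS_TOKEN")
    if token_name and token_secret:
        return ("pat", token_name, token_secret)
    # Caller would prompt via getpass here rather than reading a password env var.
    return ("password", None, None)
```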
Tableau Server returns XML by default. Without the Accept header the
signin and workbook list endpoints return XML, causing a JSONDecodeError
when the script tries to parse the response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
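The request shape can be sketched with stdlib urllib for illustration (the real script uses requests, and the API version and URL here are placeholders):

```python
import json
import urllib.request


def build_signin_request(server, payload, api_version="3.22"):
    """Build a signin request that forces JSON: Tableau Server answers in XML
    unless the Accept header asks for application/json."""
    return urllib.request.Request(
        f"{server}/api/{api_version}/auth/signin",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",  # without this, XML comes back
        },
        method="POST",
    )
```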
Without --datasource, the script now prints a table of unique data
sources and their calc counts, then exits with instructions to re-run
with -d. This avoids flooding output with 983 fields across all sources
when the user only wants one.

Also fixes Bandit B314 cleanly by dropping the stdlib xml import
entirely (defusedxml.ElementTree handles all parsing) and removing
the type annotation from private helpers where the Element type was
only needed as a hint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… command

Script: when run with no arguments, launches a guided prompt sequence
— numbered exposure list, Tableau Server download, numbered datasource
picker — so users don't need to know any CLI flags.

Slash command: rewritten to drive the same workflow through Claude chat.
Removed all workbook-specific examples (replaced with generic
placeholders) so the command works for any exposure. Also fixed
markdown lint warnings (table alignment, fenced code block languages).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design for smart classification of Tableau calculated fields into
READY / NEEDS WORK / LOD / SKIP categories, with semantic layer
recommendations for LOD expressions and slash command UX updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design for Step 4b in /tableau-upstream: trace calculated field refs
through the dbt DAG, recommend closest-to-extract insertion point,
draft BigQuery SQL at that layer, and let the analyst choose how far
upstream to push each field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers field categorization (READY/NEEDS WORK/LOD/SKIP), --model flag,
LOD mart-matching, grouped script output, and /tableau-upstream slash
command updates (Step 4b lineage tracing, output format preference,
LOD/NEEDS WORK handling). Reviewed and approved via automated spec
review loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@GabyRangelB GabyRangelB requested a review from a team as a code owner March 18, 2026 00:29
GabyRangelB and others added 2 commits March 18, 2026 00:34
…iterals upstream

- Re-prompt with valid range when user enters an out-of-bounds number
  for workbook or datasource selection (Steps 1 and 2)
- Before marking bare org-name string literals as SKIP, check
  int_extracts__student_enrollments for a district/equivalent column
  and propose adding it to the extract instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
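The re-prompt loop described above can be sketched with injected I/O so it is testable (function and parameter names are illustrative):

```python
def pick_option(options, read=input, write=print):
    """Prompt until the user enters a number in 1..len(options); return the 0-based index."""
    while True:
        raw = read(f"Select 1-{len(options)}: ").strip()
        if raw.isdigit() and 1 <= int(raw) <= len(options):
            return int(raw) - 1
        # Re-prompt with the valid range instead of crashing on bad input.
        write(f"Please enter a number between 1 and {len(options)}.")
```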
@GabyRangelB
Contributor Author

@claude please review this PR for vulnerabilities and/or exposed sensitive information

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Claude finished @GabyRangelB's task in 1m 12s


Security Review: PR #3473

Tasks

  • Gather context
  • Read scripts/tableau-extract-calcs.py for security issues
  • Read .claude/commands/tableau-upstream.md for sensitive info
  • Read docs files for sensitive info
  • Post final security review

Security Review Results

No hardcoded credentials or sensitive information found. The PR is clean in that regard. Below is a full assessment of security-relevant patterns.


scripts/tableau-extract-calcs.py

✅ Strengths

  • No hardcoded secrets. All credentials (TABLEAU_TOKEN_NAME, TABLEAU_PERSONAL_ACCESS_TOKEN, TABLEAU_SERVER_ADDRESS, TABLEAU_SITE_ID) are read from environment variables. Correct pattern for a CLI tool.
  • defusedxml used for XML parsing (line 18). This is the right choice — it protects against XML bomb (billion laughs), XXE (XML External Entity), and related XML injection attacks. A plain xml.etree.ElementTree parse of untrusted .twbx files would be exploitable.
  • TLS verification not disabled. No verify=False on any requests call, so certificate validation is active by default.
  • No shell execution. The script never uses subprocess, os.system, or eval. No command injection surface.
  • Password collected via getpass (line 168), not via sys.argv or an env var — avoids leaking credentials into shell history or ps output.

⚠️ Findings

1. No HTTPS enforcement on server URL — medium risk for internal tool

Lines 121, 139, 182, 195: The server URL is taken from --server or $TABLEAU_SERVER_ADDRESS and used directly in requests without validating that it starts with https://. If an operator accidentally configures http://, the PAT or password is transmitted in plaintext over the network.

# line 121 — no scheme check before sending credentials
url = f"{server}/api/{TABLEAU_API_VERSION}/auth/signin"

Recommend adding a guard in get_auth_token or main:

if not server.startswith("https://"):
    raise SystemExit("Server URL must use HTTPS to protect credentials.")

2. Auth token never signed out — low risk

signin_pat / signin_password (lines 119–150) return a session token, but there is no corresponding POST /auth/signout call anywhere. For PAT-based auth the token is short-lived and scoped, so the practical impact is low. For username/password sessions it leaves an active session on the server until it expires server-side. A try/finally signout would be cleaner hygiene.

3. Tableau API filter value not sanitized — very low risk for internal use

Line 198:

params={"filter": f"name:eq:{workbook_name}"},

Tableau's REST API filter syntax uses commas and colons as delimiters. A workbook name containing these characters (e.g. "foo:eq:bar,name:neq") could potentially manipulate the filter query. Since workbook_name comes from the operator's own CLI argument and the tool is used internally, the practical risk is negligible — but worth noting if the script is ever called with user-supplied input.
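The delimiter-stripping guard is tiny; a sketch (the helper name is illustrative, though a later commit in this PR strips ':' and ',' the same way):

```python
def sanitize_filter_value(value):
    """Strip Tableau REST filter delimiters (':' and ',') from a filter value
    so a crafted workbook name cannot inject extra filter clauses."""
    return value.replace(":", "").replace(",", "")


# usage sketch:
#   params = {"filter": f"name:eq:{sanitize_filter_value(workbook_name)}"}
```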

4. yaml.safe_load used correctly — no issue

Line 213: yaml.safe_load (not yaml.load) is used to parse the exposures YAML. This is the safe pattern and avoids arbitrary code execution via crafted YAML.


.claude/commands/tableau-upstream.md and docs/ files

  • No credentials, tokens, server addresses, or internal system names that aren't already documented elsewhere in the repo.
  • The docs reference generic model naming conventions (rpt_tableau__*), BigQuery dataset names, and dbt MCP tool names — none of these are sensitive.
  • No PII, no API keys, no internal hostnames.

Summary

| Issue | Severity | Location |
| --- | --- | --- |
| No HTTPS enforcement before sending credentials | Medium | get_auth_token, lines 164–169 |
| Session token not signed out after use | Low | signin_pat/signin_password, lines 119–150 |
| Tableau API filter value not sanitized | Very low / informational | find_workbook_id_by_name, line 198 |
| Hardcoded credentials | None | |
| Exposed secrets in docs or command file | None | |
| XML injection risk | None (mitigated by defusedxml) | |

The only item worth addressing before merge is the HTTPS enforcement check — it's a one-liner guard that prevents accidental credential exposure if the env var or flag is misconfigured. The rest are informational.

GabyRangelB and others added 2 commits March 18, 2026 13:50
…adable names

- Add _build_calc_name_map() to map Calculation_* and *(copy)_* XML names
  to their human captions within a datasource element
- Add _resolve_calc_refs() to substitute those IDs in formulas before display
- Wire both into _extract_fields() so all output (and future classification)
  sees human-readable field names instead of opaque Tableau internal IDs
- Add Task 0 to the implementation plan covering this with TDD steps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
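One way to sketch the substitution step (regex-based; the helper name comes from the commit message, but its body here is an assumption):

```python
import re


def resolve_calc_refs(formula, name_map):
    """Replace internal Tableau IDs like [Calculation_123] inside a formula
    with their human-readable captions, leaving unknown refs untouched."""
    def _sub(match):
        inner = match.group(1)
        return f"[{name_map.get(inner, inner)}]"
    return re.sub(r"\[([^\]]+)\]", _sub, formula)
```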
- Enforce HTTPS before sending credentials: get_auth_token() now raises
  SystemExit if the server URL does not start with https://
- Sign out session token after use: add signout() and wrap all three
  download paths (interactive, --exposure, --workbook) in try/finally
  so the token is always invalidated, even on error
- Sanitize Tableau API filter value: strip ':' and ',' from workbook
  names in find_workbook_id_by_name() to prevent filter injection

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…anatory comment

CodeQL flagged the bare `except: pass` as an empty handler with no explanation.
Keep best-effort semantics but name the exception, add a comment clarifying the
intent, and print to stderr when TABLEAU_DEBUG is set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
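The named-exception, debug-gated shape described above looks roughly like this (the helper name is illustrative):

```python
import os
import sys


def best_effort(action, label="cleanup"):
    """Run a best-effort action; swallow failures, but surface them on stderr
    when TABLEAU_DEBUG is set, so the handler is no longer a silent `except: pass`."""
    try:
        action()
    except Exception as exc:  # intentional: cleanup must never mask the real error
        if os.environ.get("TABLEAU_DEBUG"):
            print(f"{label} failed: {exc}", file=sys.stderr)
```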
Member

cbini commented Mar 30, 2026

Superseded by PR #3537: tableau-analyze-workbook.py replaces tableau-extract-calcs.py with viz usage tracking, a 4-bucket verdict system, and tests. See #3500.

@cbini cbini closed this Mar 30, 2026
@cbini cbini deleted the claude/feat/tableau-upstream-tool branch March 30, 2026 22:08
@cbini cbini restored the claude/feat/tableau-upstream-tool branch March 31, 2026 14:32


Development

Successfully merging this pull request may close these issues.

feat: tableau-to-dbt upstream workflow tool
