📝 Walkthrough

A new dbt gold model is introduced that aggregates order data by customer, computing metrics including total orders, total revenue, average order value, order date ranges, and customer lifetime days from staging tables.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
dbt/models/gold/gold_order_summary.sql (4)
Line 19: Prefer explicit `group by o.customer_id` over positional `group by 1`

Positional references are fragile: inserting a column before `customer_id` in the select list silently changes the grouping semantics without a compile error.

♻️ Proposed fix

```diff
- group by 1
+ group by o.customer_id
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@dbt/models/gold/gold_order_summary.sql` at line 19, replace the positional GROUP BY with an explicit column reference: change the current "group by 1" to "group by o.customer_id" so the aggregation in the gold_order_summary query (the grouping for alias o.customer_id) remains correct and robust to column reordering.
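The fragility this comment describes is easy to reproduce outside dbt. The following sketch uses Python's built-in `sqlite3` with an invented `orders` table (not the project's real data): the same `group by 1` silently changes meaning when a column is added in front of `customer_id`.

```python
# Illustration of the "group by 1" hazard with a hypothetical table.
# The ordinal binds to whatever column is first in the SELECT list,
# so reordering columns changes the aggregation with no error.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (order_id int, customer_id int, amount real);
    insert into orders values (1, 10, 5.0), (2, 10, 7.0), (3, 20, 3.0);
""")

# Intended query: one row per customer.
by_customer = conn.execute(
    "select customer_id, count(*) from orders group by 1"
).fetchall()

# Someone inserts order_id before customer_id; "group by 1" now
# groups by order_id, producing one row per order instead.
by_order = conn.execute(
    "select order_id, customer_id, count(*) from orders group by 1"
).fetchall()

print(len(by_customer))  # 2 rows: customers 10 and 20
print(len(by_order))     # 3 rows: one per order_id
```

An explicit `group by customer_id` would have made the second query fail loudly (ambiguous intent) rather than return different results.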
Line 21: Avoid `select *` in a gold-layer model

`select * from final` will silently include any new columns added to upstream CTEs, making the model's public schema non-deterministic and harder to document/test. Explicit column enumeration is standard practice for gold-layer tables.

♻️ Proposed fix

```diff
-select * from final
+select
+    customer_id,
+    total_orders,
+    total_revenue,
+    avg_order_value,
+    first_order_date,
+    last_order_date,
+    customer_lifetime_days
+from final
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@dbt/models/gold/gold_order_summary.sql` at line 21, the model currently uses a wildcard "select * from final", which makes the gold table schema non-deterministic; replace the wildcard with an explicit, ordered column list from the final CTE (customer_id, total_orders, total_revenue, avg_order_value, first_order_date, last_order_date, customer_lifetime_days), preserving aliases, and update any downstream tests/docs to match the explicit schema defined in gold_order_summary.sql.
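The schema-drift risk can also be shown concretely. This sketch (hypothetical table and column names, Python stdlib `sqlite3`) simulates an upstream CTE gaining a column and shows the wildcard select widening the downstream schema without any change to the model itself.

```python
# Sketch of the "select *" hazard: when an upstream relation gains a
# column, a wildcard select silently changes the output schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table final (customer_id int, total_orders int)")

# Column names the wildcard resolves to today.
cols_before = [d[0] for d in conn.execute("select * from final").description]

# Upstream change: a new column appears in the relation.
conn.execute("alter table final add column refund_total real")

# The same unchanged query now exposes a wider schema.
cols_after = [d[0] for d in conn.execute("select * from final").description]

print(cols_before)  # ['customer_id', 'total_orders']
print(cols_after)   # ['customer_id', 'total_orders', 'refund_total']
```

An explicit column list pins the contract: the downstream schema only changes when the model file does.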
Lines 1-21: Missing `{{ config() }}` block; materialization and tags not enforced in the model

The PR description specifies materialization `table` and tags `orders`, `revenue`, `gold`, but there is no `{{ config() }}` block in the file. Without it the model inherits project-level defaults, which may not be `table`, and the tags won't be attached for use in `dbt run --select tag:gold`.

♻️ Proposed addition

```diff
+{{ config(
+    materialized='table',
+    tags=['orders', 'revenue', 'gold']
+) }}
+
 -- Gold table: Order summary metrics by customer segment
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@dbt/models/gold/gold_order_summary.sql` around lines 1-21, this model is missing a dbt config block, so it won't enforce materialization and tags; add a top-of-file config for the gold_order_summary model to set materialized='table' and tags=['orders','revenue','gold'] so the model uses table materialization and is selectable by tag:gold; place the config before the CTEs (before the with orders as (...) block) so it applies to the final select.
Line 16: Use `{{ dbt.datediff() }}` instead of raw `datediff()` for cross-database portability

`datediff` argument order differs between warehouses: BigQuery uses `DATETIME_DIFF(end, start, granularity)` while other databases use `DATEDIFF(datepart, start, end)`. This model will fail on BigQuery with the raw `datediff()` syntax.

dbt v1.2+ provides the `dbt.datediff()` macro for cross-database compatibility. It accepts SQL expressions (including aggregates in grouped queries) and automatically translates to the correct warehouse syntax.

♻️ Proposed fix: use the `dbt.datediff` macro

```diff
- datediff('day', min(o.order_date), max(o.order_date)) as customer_lifetime_days
+ {{ dbt.datediff("min(o.order_date)", "max(o.order_date)", "day") }} as customer_lifetime_days
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@dbt/models/gold/gold_order_summary.sql` at line 16, the expression calculating customer_lifetime_days uses a raw datediff(...) call which is not portable to BigQuery; replace datediff('day', min(o.order_date), max(o.order_date)) with the macro form {{ dbt.datediff("min(o.order_date)", "max(o.order_date)", "day") }} so the customer_lifetime_days calculation is cross-database compatible with the correct argument ordering.
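Whichever syntax the warehouse wants, the quantity being computed is the same: whole days between a customer's first and last order. A plain-Python sketch with invented dates makes the intended semantics of `customer_lifetime_days` explicit:

```python
# What customer_lifetime_days means, independent of warehouse syntax:
# day count from the earliest to the latest order date.
from datetime import date

# Hypothetical order dates for one customer.
order_dates = [date(2024, 1, 5), date(2024, 3, 1), date(2024, 2, 10)]

lifetime_days = (max(order_dates) - min(order_dates)).days
print(lifetime_days)  # 56 days between 2024-01-05 and 2024-03-01
```

Note this is a difference, not an inclusive span: a customer with a single order gets 0, not 1.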
```sql
customers as (
    select * from {{ ref('stg_customers') }}
),
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, let's read the actual file content to verify the review claims
cat -n dbt/models/gold/gold_order_summary.sql
```

Repository: zingleai/dataPortal_demo
Length of output: 1736

🏁 Script executed:

```bash
#!/bin/bash
# Search for stg_customers definition to understand the data structure
find . -type f -name "*stg_customers*" | head -20
```

Repository: zingleai/dataPortal_demo
Length of output: 428

🏁 Script executed:

```bash
#!/bin/bash
# Look for schema tests on stg_customers to check for uniqueness constraints
fd "schema" --type f -g "*.yml" -g "*.yaml" | xargs rg -l "stg_customers" 2>/dev/null | head -10
```

Repository: zingleai/dataPortal_demo
Length of output: 574

🏁 Script executed:

```bash
#!/bin/bash
# Read the actual gold_order_summary.sql file
cat -n dbt/models/gold/gold_order_summary.sql
```

Repository: zingleai/dataPortal_demo
Length of output: 865

🏁 Script executed:

```bash
#!/bin/bash
# Read the stg_customers model
cat -n dbt/models/staging/stg_customers.sql
```

Repository: zingleai/dataPortal_demo
Length of output: 547

🏁 Script executed:

```bash
#!/bin/bash
# Read the schema.yml which should define the unique test
cat -n dbt/models/staging/schema.yml
```

Repository: zingleai/dataPortal_demo
Length of output: 13065
Remove unnecessary customers CTE join or clarify its purpose

The `customers` CTE is joined but only `c.customer_id` is selected in the output, which is identical to `o.customer_id` under the join predicate. The columns available from `stg_customers` (`first_name`, `last_name`) are not included. This creates two issues:

- Silent data loss: the `INNER JOIN` silently drops any order rows whose `customer_id` doesn't exist in `stg_customers`. If there are orphaned orders (e.g., due to data quality issues), `total_revenue` and `total_orders` will be understated without warning.
- Unnecessary complexity: if customer enrichment columns are not intended, the join adds complexity without benefit. You can aggregate directly from `orders`:
Proposed fix: remove redundant join

```diff
-customers as (
-    select * from {{ ref('stg_customers') }}
-),
 final as (
     select
-        c.customer_id,
+        o.customer_id,
         count(distinct o.order_id) as total_orders,
         sum(o.amount) as total_revenue,
         avg(o.amount) as avg_order_value,
         min(o.order_date) as first_order_date,
         max(o.order_date) as last_order_date,
         datediff('day', min(o.order_date), max(o.order_date)) as customer_lifetime_days
-    from orders o
-    join customers c on o.customer_id = c.customer_id
-    group by 1
+    from orders o
+    group by o.customer_id
 )
```

If customer enrichment columns (`first_name`, `last_name`) are intended, add them to the output and document the enrichment intent in a comment.
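The silent-data-loss claim is worth seeing end to end. This sketch uses Python's stdlib `sqlite3` with invented rows, including one orphaned order whose `customer_id` has no match in `customers`: the inner join understates revenue, while aggregating straight from `orders` does not.

```python
# Demonstration of inner-join row dropping with hypothetical data:
# order 2 references customer 99, who is missing from customers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (order_id int, customer_id int, amount real);
    create table customers (customer_id int);
    insert into orders values (1, 10, 100.0), (2, 99, 50.0);
    insert into customers values (10);
""")

# Revenue through the inner join: the orphaned order vanishes.
joined_revenue = conn.execute("""
    select sum(o.amount)
    from orders o
    join customers c on o.customer_id = c.customer_id
""").fetchone()[0]

# Revenue aggregated directly from orders: nothing is dropped.
direct_revenue = conn.execute(
    "select sum(amount) from orders"
).fetchone()[0]

print(joined_revenue)  # 100.0, understated with no warning
print(direct_revenue)  # 150.0
```

A `LEFT JOIN` from `orders` would preserve the row too, which is why the comment offers that as the alternative when enrichment columns are genuinely wanted.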
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dbt/models/gold/gold_order_summary.sql` around lines 5-7, the customers CTE (customers and {{ ref('stg_customers') }})
(customers and {{ ref('stg_customers') }}) is redundant and causes silent
row-dropping because you INNER JOIN to it but only select c.customer_id (which
equals o.customer_id); either remove the customers CTE and its join and
aggregate directly from the orders source (e.g., use orders/order_summary CTEs)
or, if you intend customer enrichment, keep customers, include enrichment
columns (e.g., first_name, last_name) in the SELECT, and convert the INNER JOIN
to a LEFT JOIN to avoid losing orders for missing customers; also add a brief
comment explaining the enrichment intent if you retain the join.
Summary

Order summary metrics by customer: total orders, revenue, avg order value, lifetime days.

Tags: `orders`, `revenue`, `gold`
Criticality: P1
Models (1): `gold_order_summary`
Lineage
Data Quality Tests (2)
Data Quality Tests (2)
Generated by Data Portal