The original inspiration for a SQL 201 project was this tweet from Teej. I started writing this as a blog post earlier this year, but Ben Nour gave me the idea to just make this a repo and call it a day.
- Writing Debuggable SQL
- Anti-patterns
- Gotchas
- Joins
- Set Operations
- Window Functions
- String Stuff
- Non-standard Features
- See Also
-- it's easier to comment out columns in this
select
user_id
, timestamp
, num_likes
-- , num_blocks
from table
-- than in this
select
user_id
, timestamp
, num_likes--,
--num_blocks
from table -- it's easier to comment out filters in this
select
user_id
, timestamp
, num_likes
from table
where 1=1
--AND num_blocks > 0
and num_likes is not null
-- than in this
select
user_id
, timestamp
, num_likes
from table
where --num_blocks > 0
--AND
num_likes is not null As Teej says, common table expressions (CTEs) are the closest thing SQL has to import statements. They help organize logic by isolating transformations, and let you define a sub-table to work with or from.
**One business rule per CTE** is a good rule of thumb, just to keep things atomic.
with base as (
SELECT
user_id
, MAX(timstamp) latest_order
FROM table
WHERE 1=1
group by 2
)
select
base.user_id
, orders.order_id
from base
left join orders
on base.user_id = orders.user_id
and base.latest_order = orders.placed_atYou can also write CTEs as sub-queries of a sort, but I personally find this harder to read. More jumping back and forth required IMO. This is most readable if you just want to do one subquery and join on and on, usually building up flags.
select
base.user_id,
orders.order_id
from (
SELECT
user_id
, MAX(timstamp) latest_order
FROM table
WHERE 1=1
group by 2
) as base
left join orders
on base.user_id = orders.user_id
and base.latest_order = orders.placed_atA blanket distinct in a query is SQL code smell. I try and handle duplicates by making an explicit choice - usually taking the most recent row per unique ID.
-- smelly!
select distinct
user_id
from table
-- better, explicitly filters to the most recent row
select
user_id
from table
qualify rank() over (partition over user_id order by timestamp desc) = 1 Sorting is expensive, and should be avoided until necessary. If you need to do any intermediate ordering, I would do it explicitly using a window function.
-- intemediate ordering, not great
WITH base AS (
select
user_id
, num_likes
from table
order by num_likes DESC -- Pointless ordering here
), filtered_base AS (
select
user_id
, num_likes
from base
order by num_likes ASC -- Re-ordering the same data again for no reason
)
select user_id
from filtered_base
order by num_likes desc -- Final ordering that actually matters
limit 1;
-- better!
select
user_id
, num_likes
from table
qualify rank() over(order by num_likes desc) = 1 These are usually subqueries in the WHERE clause to meet a specific filtering criteria. These can get very expensive since you're doing nested evaluation (the query executes once for every row in the main query, meaning that the inner query is driven by the outer one, unlike in a normal subquery where the inner query executes once and first), and are better off as atomic CTEs and explicit joins.
-- correlated sub-query
select
e.name
, e.salary
from Employee e
WHERE 1=1
AND e.salary > (
SELECT AVG(salary)
FROM Employee
WHERE departmentId = e.departmentId
) More examples on the Wikipedia page.
E.g. (‘%jess%’ vs. ‘jess%’)
Or vice versa! I am just a left-join purist.
SQL's execution process first performs the join (in this case, a LEFT JOIN) and then applies the filter in the WHERE clause. Since non-matching rows from the right table result in NULL values, those NULL rows won't pass the condition in the WHERE clause (like WHERE right_table.some_column = 'X'). As a result, those rows are filtered out, mimicking the behavior of an INNER JOIN.
You can add filters to the ON clause of the join to get around this. The result table will still have everything in the left table, but only the right table info you've filtered for would be filled in.
-- THIS IS EFFECTIVELY AN INNER JOIN!
SELECT table_a.id
FROM table_a
LEFT JOIN table_b
ON table_a.id = table_b.id
WHERE table_b.ds = CURRENT_DATE
-- Good left join :)
SELECT table_a.id,
FROM table_a
LEFT JOIN table_b
ON table_a.id = table_b.id AND table_b.ds = CURRENT_DATEThis is also slightly different between Presto and Hive.
Also note that this does not effect anti-joins (as in below), since those explicitly deal with nulls. A left join only behaves like an inner join when you don't account for the null values in the right table.
-- An anti-join, just getting the days that are in table a that are not in table b
SELECT table_a.id
FROM table_a
LEFT JOIN table_b
ON table_a.id = table_b.id
WHERE table_b.ds is null In and not in do not count nulls. Use them only when you're dealing with specific values and you know you won't get any NULLs. EXISTS and NOT EXISTS are better for verifying prescence / absence of relationships without needing exact matches.
A case statement will return the first true condition it hits and not evaluate all the others, so you want your most restrictive conditions first. If no conditions hit and there's no else to default a result to, the case statement will return null.
This is slightly implementation specific, so definitely check the docs for your specific RDBMS.
If you want a more comprehensive introduction to how to think about joins, I'd suggest:
- Julia Evans' nice rules for joins
- Sarah Anoke's intro to joins
- Justin Jaffrey's ways of thinking about joins
- Randy Au's Can we stop with the SQL joins venn diagram insanity
Cross joins give a row for each possible pairing of a row from Table A and a row from Table B. They don't require any join keys, since they're exhaustive.
Comma joins are also default cross joins.
The common pattern that I see is cross-join unnest-ing to explode a struct.
Imagine this table:
| user_id | favorite_colors |
|---|---|
| 1 | ["red", "blue"] |
| 2 | ["green", "yellow", "blue"] |
| 3 | ["black"] |
SELECT
user_i
, color
FROM
users,
UNNEST(favorite_colors) AS color
-- in presto this is
-- CROSS JOIN UNNEST (users.favorite_colors) AS c(colors)
-- where the c is the new exploded table alias and colors is the column names Gets:
| user_id | color |
|---|---|
| 1 | red |
| 1 | blue |
| 2 | green |
| 2 | yellow |
| 2 | blue |
| 3 | black |
If you join a table to itself, you get a sort of poor-man's version of nested iteration. For example, these two things are equivalent.
select id
from table_a a
left join table_a a1 USING (id)
where a.time < a1.time final_set = []
for row in table_a:
for same_row in table_a:
if row[id] == same_row[id]:
if row[time] < same_row[time]:
final_set.append(row)This can be especially useful in situations where data is hierarchical (eg. org charts), or you're trying to find relationships between rows (i.e. users with the same title but different pay).
Joining on id and using some time window filter is a pretty common pattern.
Range joins are joins that involve a range of values (>=, <= etc.) instead of just strict equality. They can be useful for funnels.
SELECT
e.employee_id
, e.name
, e.salary
, s.bracket_name
FROM
employees e
JOIN
salary_brackets s
ON
e.salary BETWEEN s.min_salary AND s.max_salaryThe BETWEEN predicate is inclusive (so both the bounds count as part of the range). The order of arguments is always lower_bound and upper_bound.
Equivalent to a row bind. UNION ALL binds all rows, keeping duplicates. UNION doesn't keep duplicates. We usually prefer UNION ALL.
select
user_id
from table_for_one_day
union all
select
user_id
from table_for_another_day
What they sounds like. You can use this in the traditional set operation way - to difference two full queries - but they seem particularly powerful when the queries are pulling from two different tables.
Window functions let you do aggregations in place. They look something like this:
function_name OVER (PARTITION BY col_name1, col_name2, ...
ORDER BY col_name3
frame_clause)The partition specifies the window over which the query is aggregating. The order specifies the order. The frame clause specifies a subset of the window (the set of rows relative to the current row used in computation).
I prefer keeping window functions in separate CTEs, not mixing them with group-by and summarize aggregations. You don't need to group by window functions, but the grain of the result can be confusing. You can also ORDER BY an aggregation, but that needs to be grouped.
See also:
I had to use this to concatenate two rows to make my own unique ID per row once.
Can save some CTEs. HAVING filters on aggregations in place, but still requires grouping! It's late in the order of execution.
-- no having
with no_having as (
select
user_id
, count(*) num
from examples
where 1=1
group by 1
)
select
user_id
from no_having
where 1=1
and num > 5
-- equivalent to
select
user_id
from examples
where 1=1
group by 1
having count(*) > 5 Can save some CTEs. QUALIFY filers on window functions in place. Commonly used to de-dupe.
-- deduping with qualify. filtering just for the most recent row
select
user_id
, ts
from examples
where 1=1
qualify rank() over(partition by user_id order by ts desc) = 1 I prefer to use a window function to explicitly rank and then select from ranks, but OFFSET can be used in conjunction with LIMIT too.
SELECT
EmployeeID
, FirstName
, LastName
FROM Employees
ORDER BY EmployeeID
LIMIT 5
OFFSET 5This gives us employees 6-10.