- analyzing supermarket's business performance after new packaging program launched
- using MS. SQL SERVER STUDIO
SQL project/case study from : https://8weeksqlchallenge.com/case-study-5/
steps:
-
import data_mart_weekly_sales.csv into SQL SERVER
-
data cleaning -> create new table: clean_weekly_sales
clean_weekly_sales
select region, platform, segment, customer_type, transactions, sales, SUBSTRING(week_date,2, CHARINDEX('/', week_date, 1)-2) as day, SUBSTRING(week_date, CHARINDEX('/', week_date, 1)+1, 1) as month_number, concat('20', SUBSTRING(week_date, CHARINDEX('/', week_date, CHARINDEX('/', week_date)+1)+1, 2)) as calendar_year, DATEFROMPARTS(concat('20', SUBSTRING(week_date, CHARINDEX('/', week_date, CHARINDEX('/', week_date)+1)+1, 2)), SUBSTRING(week_date, CHARINDEX('/', week_date, 1)+1, 1), SUBSTRING(week_date,2, CHARINDEX('/', week_date, 1)-2)) as date_full, case when SUBSTRING(week_date,2, CHARINDEX('/', week_date, 1)-2) between 1 and 7 then 1 when SUBSTRING(week_date,2, CHARINDEX('/', week_date, 1)-2) between 8 and 14 then 2 when SUBSTRING(week_date,2, CHARINDEX('/', week_date, 1)-2) between 15 and 21 then 3 else 4 end as week_number, case when SUBSTRING(segment, 2, 1) = 'C' then 'Couples' when SUBSTRING(segment, 2, 1) = 'F' then 'Families' else 'unknown' end as demographic, case when SUBSTRING(segment, 3, 1) = '1' then 'Young Adults' when SUBSTRING(segment, 3, 1) = '2' then 'Middle Aged' when SUBSTRING(segment, 3, 1) = '3' or SUBSTRING(segment, 3, 1) = '4' then 'Retirees' else 'unknown' end as age_band, round(sales*100.0/transactions, 2) as avg_transaction into clean_weekly_sales from data_mart_weekly_sales;
-
data exploration:
-
What day of the week is used for each week_date value?
-
What range of week numbers are missing from the dataset?
-
How many total transactions were there for each year in the dataset?
-
What is the total sales for each region for each month?
total sales for each region, each month
with satu as (select * from (select region, month_number, CAST(sales as bigint) as sales from clean_weekly_sales) s pivot( sum(sales) for region in (['AFRICA'], ['ASIA'], ['CANADA'], ['EUROPE'], ['OCEANIA'], ['SOUTH AMERICA'], ['USA']) ) pivot_table)select * from satu order by month_number;
-
What is the total count of transactions for each platform?
-
What is the percentage of sales for Retail vs Shopify for each month?
percentage of sales for Retail vs Shopify for each month
with satu as (select calendar_year, month_number, platform, sum(CAST(sales as bigint)) as total_sales from clean_weekly_sales group by calendar_year, month_number, platform ),
dua as (select *, SUM(total_sales) over(partition by calendar_year) as total_sales_all, round(total_sales * 100.0 / SUM(total_sales) over(partition by calendar_year, month_number), 2) as pct_sales_per_month from satu)
select * from (select calendar_year, platform, month_number, pct_sales_per_month from dua) s pivot ( max(pct_sales_per_month) for calendar_year in ([2018], [2019], [2020]) ) pvt
-
What is the percentage of sales by demographic for each year in the dataset?
percentage of sales by demographic for each year
with satu as (select calendar_year, demographic, sum(CAST(sales as bigint)) as total_sales from clean_weekly_sales group by calendar_year, demographic),
dua as (select *, SUM(total_sales) over(partition by calendar_year) as total_sales_all, round(total_sales * 100.0 / SUM(total_sales) over(partition by calendar_year), 2) as pct_demographic_per_year from satu)
select * from (select demographic, calendar_year, pct_demographic_per_year from dua) s pivot( max(pct_demographic_per_year) for calendar_year in ([2018], [2019], [2020]) ) pvt
-
Which age_band and demographic values contribute the most to Retail sales?
age_band and demographic values contribute the most to Retail sales
retirees-families and retirees-couples
select platform, age_band, demographic, sum(CAST(sales as bigint)) as total_sales, RANK() over(order by sum(CAST(sales as bigint)) desc) as ranking from clean_weekly_sales where platform = '''Retail''' and age_band <> 'unknown' group by platform, age_band, demographic;
-
Can we use the avg_transaction column to find the average transaction size for each year for Retail vs Shopify? If not - how would you calculate it instead?
-
-
analyzing data:
-
What is the total sales for the 4 weeks before and after 2020-06-15? What is the growth or reduction rate in actual values and percentage of sales?
total sales for the 4 weeks before and after 2020-06-15
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_4_weeks_before as( select sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -4, '2020-06-15') and '2020-06-15'),
total_sales_4_weeks_after as( select sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 4, '2020-06-15'))
select *, case when total_sales_4_weeks_before.total_sales_before > total_sales_4_weeks_after.total_sales_after then 'before is more trx' else 'after is more trx' end as status, total_sales_4_weeks_before.total_sales_before - total_sales_4_weeks_after.total_sales_after as diff from total_sales_4_weeks_before, total_sales_4_weeks_after;
more trx before new program launched, in timeline 4 weeks before and 4 weeks after
the difference: 10.973.134
-
What about the entire 12 weeks before and after?
total sales for the 12 weeks before and after 2020-06-15
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_12_weeks_before as( select sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -12, '2020-06-15') and '2020-06-15'),
total_sales_12_weeks_after as( select sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 12, '2020-06-15'))
select *, case when total_sales_12_weeks_before.total_sales_before > total_sales_12_weeks_after.total_sales_after then 'before is more trx' else 'after is more trx' end as status, total_sales_12_weeks_before.total_sales_before - total_sales_12_weeks_after.total_sales_after as diff from total_sales_12_weeks_before, total_sales_12_weeks_after
more trx before new program launched, in timeline 12 weeks before and 12 weeks after
the difference got more bigger for 'before' status: 722.350.742
-
How do the sale metrics for these 2 periods before and after compare with the previous years in 2018 and 2019?
compare with the previous years in 2018 and 2019
select calendar_year, sum(CAST(sales as bigint)) as total_sales from clean_weekly_sales where calendar_year in (2018, 2019) group by calendar_year;
with total_sales before new program:
- 2018 around 13 million - 2019 around 14 million - 4 weeks before and - 12 weeks beforeall show bigger total_sales compared to total_sales after new program launched
-
-
more question: Which areas of the business have the highest negative impact in sales metrics performance in 2020 for the 12 week before and after period?
-
region
Details
region with highest negative impact
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_12_weeks_before as( select region, sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -12, '2020-06-15') and '2020-06-15' group by region),
total_sales_12_weeks_after as( select region, sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 12, '2020-06-15') group by region)
select b.region, b. total_sales_before, a.total_sales_after, (b.total_sales_before - a.total_sales_after) as diff, round((b.total_sales_before - a.total_sales_after) *100.0 / b. total_sales_before, 2) as pct_impact from total_sales_12_weeks_before b join total_sales_12_weeks_after a on b.region = a.region order by (b.total_sales_before - a.total_sales_after) *100.0 / b. total_sales_before desc;
the region with highest negative impact is asia
-
platform
Details
platform with highest negative impact
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_12_weeks_before as( select platform, sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -12, '2020-06-15') and '2020-06-15' group by platform),
total_sales_12_weeks_after as( select platform, sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 12, '2020-06-15') group by platform)
select b.platform, b. total_sales_before, a.total_sales_after, (b.total_sales_before - a.total_sales_after) as diff, round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) as pct_impact from total_sales_12_weeks_before b join total_sales_12_weeks_after a on b.platform = a.platform order by round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) desc;
the platform with highest negative impact is retail
-
age_band
Details
age_band with highest negative impact
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_12_weeks_before as( select age_band, sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -12, '2020-06-15') and '2020-06-15' group by age_band),
total_sales_12_weeks_after as( select age_band, sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 12, '2020-06-15') group by age_band)
select b.age_band, b. total_sales_before, a.total_sales_after, (b.total_sales_before - a.total_sales_after) as diff, round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) as pct_impact from total_sales_12_weeks_before b join total_sales_12_weeks_after a on b.age_band = a.age_band where b.age_band <> 'unknown' order by round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) desc;
the age_band with highest negative impact is middle aged
-
demographic
Details
demographic with highest negative impact
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_12_weeks_before as( select demographic, sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -12, '2020-06-15') and '2020-06-15' group by demographic),
total_sales_12_weeks_after as( select demographic, sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 12, '2020-06-15') group by demographic)
select b.demographic, b. total_sales_before, a.total_sales_after, (b.total_sales_before - a.total_sales_after) as diff, round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) as pct_impact from total_sales_12_weeks_before b join total_sales_12_weeks_after a on b.demographic = a.demographic where b.demographic <> 'unknown' order by round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) desc;
the demographic with highest negative impact is families
-
customer_type
Details
customer type with highest negative impact
with satu as (select *, case when date_full >= '2020-06-15' then 'after' else 'before' end as new_packaging_date from clean_weekly_sales),
total_sales_12_weeks_before as( select customer_type, sum(CAST(sales as bigint)) as total_sales_before from satu where date_full between DATEADD(week, -12, '2020-06-15') and '2020-06-15' group by customer_type),
total_sales_12_weeks_after as( select customer_type, sum(CAST(sales as bigint)) as total_sales_after from satu where date_full between '2020-06-15' and DATEADD(week, 12, '2020-06-15') group by customer_type)
select b.customer_type, b. total_sales_before, a.total_sales_after, (b.total_sales_before - a.total_sales_after) as diff, round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) as pct_impact from total_sales_12_weeks_before b join total_sales_12_weeks_after a on b.customer_type = a.customer_type order by round((b.total_sales_before - a.total_sales_after) *100.0/ b. total_sales_before, 2) desc;
the customer type with highest negative impact is guest
-




