- SQL Queries for data Retrieval
- Detecting outliers with(IQR)
- How has net worth changed over the years?
- Are there any trends in terms of business industries that tend to have more self-made individuals?
- Which countries have the highest average net worth among their citizens?
- What percentage of individuals in the dataset are classified as self-made?
- Is there a difference in net worth between genders within the same age group?
An SQL query is a statement built by putting together various SQL commands. These SQL commands together perform a specific task to access, manage, modify, update, control, and organize your data stored in a database and managed via a DBMS In this project several SQL queries has been initiated to retrieve tha data from the data database which is creating on a Dataset
IQR is one of the common and effective techniques to identify the outlier and to cap and remove the outlier so to reduce the skewness of data. Outliers deeply effect the overall population when mean is taken
select
case
when net_worth is null then null
when net_worth < q1 - 1.5 * iqr or net_worth > q3 + 1.5 *iqr then 'outliers'
else 'not outlier' end
from ( select *,
PERCENTILE_CONT(0.25) within group(order by net_worth) over () as q1,
PERCENTILE_CONT(0.75) within group (order by net_worth) over () as q3,
PERCENTILE_CONT(0.75) within group (order by net_worth) over ()-
PERCENTILE_CONT(0.25) within group(order by net_worth) over () as iqr
from FB) as data_with_percentilesIn this query the IQR formula has put to a use which is a follows:
IQR = Q3-Q1
- The inner subquery calculates three important statistics for the net worth values in a table called "FB":
- q1: The first quartile, which is the 25th percentile (25% of the data is below this value).
- q3: The third quartile, which is the 75th percentile (75% of the data is below this value).
- iqr: The interquartile range, which is the difference between q3 and q1. It represents the middle 50% of the data.
- The outer query uses a CASE statement to evaluate each net worth value
- If a net worth value is NULL, it returns NULL, meaning there's no classification for it.
- If a net worth value is less than q1 - 1.5 times the interquartile range (iqr) or greater than q3 + 1.5 times iqr, it is classified as an "outlier." -Otherwise, if the net worth value falls within the range defined by 1.5 times iqr around the interquartile range, it is classified as "not an outlier.
To answer this question the following query has been used:
select top 5
year ,full_name,
avg(net_worth) as average_net_worth
from FB
where outlier_status = 'not outlier'
group by year,full_name
order by average_net_worth descIn this query the average of Net worth is taken it is grouped by year to check the change at individual year. I have excluded that outliers from the data set to get more scalable results
Resulting when outliers are included are as follows:
| Year | Full name | Net_worth(BN) |
|---|---|---|
| 2021 | Alexander Abramov | 7.6 |
| 2018 | Alexei Kuzmichev | 7.6 |
| 2006 | Alexey Mordashov | 7.6 |
| 2017 | Aloys Wobben | 7.6 |
| 2010 | Ananda Krishnan | 7.6 |
Resulting when outliers are excluded are as follows:
| year | full_name | average_net_worth(BN) |
|---|---|---|
| 2022 | Elon Musk | 219 |
| 2023 | Bernard Arnault & family | 211 |
| 2023 | Elon Musk | 180 |
| 2021 | Jeff Bezos | 177 |
| 2022 | Jeff Bezos | 171 |
By looking at this change we clearly see that outlier has significant impact on the data set and impact on our analysis
In order to get the answer of this question the following query has been used:
select business_industries,
count(*) as total_individuals,
SUM(CASE WHEN self_made = '1' THEN 1 ELSE 0 END) AS self_made_count,
(SUM(CASE WHEN self_made ='1' THEN 1 ELSE 0 END) * 100.0) / COUNT(*) AS self_made_percentage
from FB
where outlier_status = 'not outlier'
group by business_industries
order by self_made_percentage descThis SQL query calculates the total number of individuals, counts the number of self-made individuals, and calculates the percentage of self-made individuals within each business industry group, but it only considers individuals who are not classified as outliers. The results are then sorted to show industries with the highest percentages of self-made individuals first.
| country_of_citizenship | avg_networth(BN) |
|---|---|
| Eswatini (Swaziland) | 4.8 |
| Georgia | 4.6 |
| Denmark | 3.94347826086956 |
| Swaziland | 3.92857142857143 |
| Algeria | 3.89 |
| South Africa | 3.56044776119403 |
| Colombia | 3.45 |
| Nigeria | 3.28571428571429 |
| Portugal | 3.25961538461538 |
| New Zealand | 3.25416666666667 |
This is the out which we receive when we run
select top 10
country_of_citizenship,
avg(net_worth) as avg_networth
from FB
where outlier_status = 'not outlier'
group by country_of_citizenship
order by avg_networth descthis query
This is a very important question and answer this first Let me show you that output.
| total_individuals | self_made_count | inherited_percentage | self_made_percentage |
|---|---|---|---|
| 23734 | 15320 | 64.548748630656 | 35.451251369343 |
The following Results has been ahcieved with the help of following query:
SELECT
COUNT(*) AS total_individuals,
SUM(CASE WHEN self_made = '1' THEN 1 ELSE 0 END) AS self_made_count,
(SUM(CASE WHEN self_made = '1' THEN 1 ELSE 0 END) * 100.0) / COUNT(*) AS self_made_percentage,
(SUM(CASE WHEN self_made = '0' THEN 1 ELSE 0 END) * 100.0) / COUNT(*) AS inherited_percentage
FROM
FB
WHERE
outlier_status = 'not outlier'This SQL query calculates the total number of individuals in the dataset, counts the number of self-made individuals, and calculates both the percentage of self-made individuals and the percentage of those who inherited wealth. It only considers individuals who are not classified as outliers.
The question reqires some uniqe kind of query to get the answer .First let me show you the answer.
| gender | age_group | avg_networth(BN) |
|---|---|---|
| Female | 18-29 | 1.48695652173913 |
| Male | 18-29 | 2.31153846153846 |
| Female | 30-29 | 2.37264957264957 |
| Male | 30-29 | 2.32085020242915 |
| Female | 40-49 | 2.21849315068493 |
| Male | 40-49 | 2.19327782290821 |
| Female | 50+ | 2.68313253012049 |
| Male | 50+ | 2.5268571585419 |
This table gives a good summary about the age group and within particular age group we can see average net worth(BN)
select
gender, age_group,
avg(net_worth) as avg_networth
from (
select gender ,
case when age >=18 and age<30 then '18-29'
when age >=30 and age< 40 then '30-29'
when age >= 40 and age<50 then '40-49'
else '50+' end as age_group,net_worth
from FB
where outlier_status = 'not outlier'
) as age_gender_data
group by gender,age_group
order by age_group,gender This query is explaining its self. Different age group is being created with the case statement and and grouping the result
- The inner subquery starts by selecting two columns: "gender" and a derived column "age_group." The "age_group" column is calculated based on the "age" column's values using a CASE statement. It categorizes individuals into different age groups: '18-29' for ages 18 to 29, '30-39' for ages 30 to 39, '40-49' for ages 40 to 49, and '50+' for ages 50 and above. It also selects the "net_worth" column.
this SQL query groups individuals by their gender and age groups, calculates the average net worth for each group, and presents the results in an ordered format. It only considers individuals who are not classified as outliers in the "FB" dataset.