This repository contains the project for data cleaning, exploratory data analysis (EDA), and data visualization on layoffs data.
This project focuses on cleaning and analyzing a dataset containing information about company layoffs. The primary goals are to ensure the dataset is clean and standardized, perform exploratory data analysis to uncover insights, and visualize the results for better understanding.
- Project Overview
- Dataset
- Steps for Data Cleaning
- Exploratory Data Analysis (EDA)
- Data Visualization
- Usage
- Contributing
- License
The dataset used in this project contains information on company layoffs, including company names, locations, industries, total laid off, percentage laid off, dates, stages, countries, and funds raised.
The dataset used in this project can be found on Kaggle.
The full SQL code for the data cleaning process can be found in the data-cleaning-layoffs.sql file in this repository.
The first step is to identify and remove duplicate records from the dataset. Duplicates can skew analysis and lead to incorrect insights.
-- Identifying duplicates
WITH duplicate_cte AS
(
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY company, location,
industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions
) AS row_num
FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;
-- Removing duplicates
DELETE
FROM layoffs_staging2
WHERE row_num > 1;
Standardization ensures consistency in data entries. This involves trimming whitespace, standardizing industry names, handling location data, and formatting date fields.
Remove any leading or trailing whitespace from company names to ensure consistency.
-- Trimming company names
UPDATE layoffs_staging2
SET company = TRIM(company);
-- Verifying the change
SELECT DISTINCT company
FROM layoffs_staging2;Unify the industry names for consistency. For example, if there are variations in the way an industry is named, standardize them to a single format.
-- Identifying industry names
SELECT DISTINCT industry
FROM layoffs_staging2
ORDER BY 1;
-- Standardizing 'Crypto' industry names
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';
-- Verifying the change
SELECT DISTINCT industry
FROM layoffs_staging2
ORDER BY 1;Ensure consistency in location data by standardizing country names.
-- Identifying distinct countries
SELECT DISTINCT country
FROM layoffs_staging2
ORDER BY 1;
-- Standardizing country names by trimming trailing periods
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
-- Verifying the change
SELECT DISTINCT country
FROM layoffs_staging2
ORDER BY 1;Convert the date fields to a standard date format.
-- Converting date format
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');
-- Verifying the change
SELECT `date`
FROM layoffs_staging2;
-- Changing date column type to DATE
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;Handling missing values is crucial for maintaining the integrity of the dataset. This includes filling in missing values and removing records with insufficient information.
-- Removing records with insufficient layoff information
DELETE
FROM layoffs_staging2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;
-- Filling missing industry data
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
ON t1.company = t2.company AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
AND t2.industry IS NOT NULL;Remove columns that are no longer needed for analysis to keep the dataset clean and manageable.
-- Dropping the 'row_num' column
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;EDA involves examining the dataset to summarize its main characteristics, often using visual methods. This step includes generating descriptive statistics, identifying patterns, and visualizing distributions and relationships.
TBA
To use this project, follow these steps:
- Clone the repository:
git clone https://github.com/AngelicaDolor/layoffs-data-analysis.git- Navigate to the project directory:
cd layoffs-data-analysis- Run the SQL scripts to clean the data and perform analysis.
Contributions are welcome! Please fork the repository and submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License - see the LICENSE file for details.