This project demonstrates how I used SQL to clean and prepare a dataset on world layoffs for further analysis.
The dataset includes company details, industry, location, funding, and layoff counts.
The goal was to transform the raw, messy data into a clean, reliable format. Specifically, I set out to:
- Identify and remove duplicate records
- Standardize inconsistent values (company, industry, country)
- Convert dates into the correct format
- Handle missing or blank values
- Deliver a cleaned dataset ready for analysis or visualization
Removing duplicates (sketched below):

- Created a staging table to protect the raw data.
- Applied `ROW_NUMBER()` with `PARTITION BY` across key fields to detect duplicates.
- Deleted the duplicate rows.
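A minimal sketch of this step, assuming MySQL 8.0+ for window functions. The table names (`layoffs`, `layoffs_staging`, `layoffs_staging2`) and the columns in `PARTITION BY` are illustrative assumptions, not taken from the actual script:

```sql
-- Stage the raw data so the source table is never modified.
-- (Table and column names here are assumptions for illustration.)
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging
SELECT * FROM layoffs;

-- A CTE makes it easy to inspect duplicates before deleting anything.
-- Backticks on `row_number` because ROW_NUMBER is reserved in MySQL 8.0.
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry,
                            total_laid_off, percentage_laid_off, `date`
           ) AS `row_number`
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE `row_number` > 1;

-- MySQL cannot DELETE through a CTE, so materialize the row numbers
-- in a second staging table and delete from that instead.
CREATE TABLE layoffs_staging2 LIKE layoffs_staging;
ALTER TABLE layoffs_staging2 ADD COLUMN `row_number` INT;

INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry,
                        total_laid_off, percentage_laid_off, `date`
       )
FROM layoffs_staging;

DELETE FROM layoffs_staging2
WHERE `row_number` > 1;
```

Materializing the row number in a real column is also why a helper column has to be dropped in the final cleanup step.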
Standardizing values (sketched below):

- Trimmed whitespace from company names.
- Grouped similar industry names (e.g., all variations of Crypto → `Crypto`).
- Fixed country names (e.g., removed the trailing period in `United States.`).
- Converted `date` from text (MM/DD/YYYY) into a proper `DATE`.
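A sketch of these fixes; the `'Crypto%'` pattern and the `'United States%'` filter are assumptions about how the inconsistent raw values looked:

```sql
-- Trim stray whitespace from company names.
UPDATE layoffs_staging2
SET company = TRIM(company);

-- Collapse every Crypto variant into a single label.
-- (The 'Crypto%' pattern is an assumption about the raw values.)
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Strip the trailing period from entries like 'United States.'.
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';

-- Parse the MM/DD/YYYY text, then change the column type itself.
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```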
Handling missing values (sketched below):

- Replaced blank industry values with `NULL`.
- Filled missing industry values using self-joins on company and location.
- Removed rows with no layoff information (`total_laid_off` and `percentage_laid_off` both missing).
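A sketch of the null handling, again assuming the `layoffs_staging2` table; the self-join copies a known industry from another row with the same company and location:

```sql
-- Convert empty strings to true NULLs so they can be back-filled.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Self-join: borrow the industry from another row with the same
-- company and location where the value is known.
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
    ON t1.company = t2.company
   AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

-- Rows with neither a count nor a percentage carry no layoff signal.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```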
Final cleanup (sketched below):

- Dropped helper columns such as `row_number`.
- Ensured only clean, consistent fields remain.
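The helper column added during deduplication is no longer needed once the duplicates are gone:

```sql
-- Drop the deduplication helper; backticks because ROW_NUMBER is
-- a reserved word in MySQL 8.0.
ALTER TABLE layoffs_staging2
DROP COLUMN `row_number`;
```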
Key SQL concepts used:

- Window functions (`ROW_NUMBER()` with `OVER`)
- Common Table Expressions (CTEs)
- String functions (`TRIM()`, `TRAILING`)
- Date functions (`STR_TO_DATE()`)
- Joins and updates for data filling
- Conditional deletes
Project files:

- `layoffs_cleaning.sql` → Main SQL script with all cleaning steps
- `README.md` → Project documentation
Tools:

- Database: MySQL
- Editor: MySQL Workbench
With the dataset cleaned, the next phase would be:
- Performing exploratory data analysis (EDA) to uncover layoff trends
- Building visual dashboards to present insights