Configurable ETL system

Overview

This project implements a configurable ETL (Extract, Transform, Load) system for transforming CSV files using Python's built-in libraries. The transformations are specified in a configuration file (config/transformations.json), allowing for flexible and customizable data processing. The output is a CSV file.

Features

Rename columns
Concatenate columns to form dates
Apply proper case to text fields
Add constant values to columns
Order the data based on a specific column
Select specific columns for the output

Installation

Clone the repository
Ensure you have Python 3 installed
No additional dependencies are required

How to Run

Run the main script by using following command in the terminal:

python main.py

If you need to generate new sample data, run the following script

python .\data\generate_sample.py

Transformations

The transformations are specified in a configuration file config/transformations.json . The available transformations and the parameters they take are further described below.

Rename Column

Renames a column , optionally changes its type and optionally converts the text in the column to proper case

{
    "action": "rename_column",
    "source_column": "Order Number",
    "target_column": "OrderID",
    "type": "Integer",
    "proper_case" : false
}

source_column: The original column name
target_column: The new column name
type: (Optional) The data type to cast the column to. Supported types: Integer, BigDecimal , String . If ommited then it defaults to String.
proper_case (Optional): set to true if you want to convert the text in the column to proper case (title case)

Concatenate Date

Concatenates multiple columns that represent year, month and day into a single column where the format can be specified

{
    "action": "concatenate_date",
    "source_columns": ["Year", "Month", "Day"],
    "target_column": "OrderDate",
    "format": "%Y%m%d"
}

source_columns: List of columns to concatenate
target_column: The new column name
format: The format of the output, use strftimeand strptime compatible formats, like %Y%m%d or %Y-%m-%d

Add Constant Column

Adds a new column with a constant value

{
    "action": "add_constant",
    "target_column": "Unit",
    "value": "kg",
    "type": "String"
}

target_column: The new column name.
value: The constant value to add
type (Optional): The data type of the constant value. If ommited then it defaults to String .

Sort Data

Sorts the data based on a column. Default is ascending, unless otherwise specified.

{
    "action": "sort_data",
    "sort_column": "OrderDate",
    "sort_order": "desc"
}

sort_column: the column used for sorting
sort_order (Optional): Default is ascending. Valid values asc (ascending) and desc (descending)

Select Output Columns

Selects specific columns for the final output

{
    "action": "select_output_columns",
    "columns": ["OrderID", "OrderDate", "ProductId", "ProductName", "Quantity", "Unit"]
}

columns: List of columns to include in the final output

Unit Tests

To run the tests, run the following line in the terminal. This runs all the unit tests found in the tests directory using the unittest Python built-in module.

python -m unittest discover tests

Performance Tests

To evaluate the performance of the ETL system, I conducted tests by varying the size of the sample.csv file.

Here is a summary of the performance test results:

Sample.csv Number of Rows	Execution Time (sec)
1,000	0.3
100,000	2
1,000,000	20
10,000,000	300
10,000,000	250 (etl_chunk)

And here are the specs of the computer running the tests:

Processor: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
RAM: 32 GB
Storage: 474 GB SSD
Operating System: Windows 11 23H2
Python Version: 3.12.4

The last test is run by modifying the main.py to use etl_chunk.py file instead. There are some performance improvements, but the handling of the column transformations do mean that all data needs to be loaded into memory for part of the transformation.

Further tests would need to be conducted.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Configurable ETL system

Overview

Features

Installation

How to Run

Transformations

Rename Column

Concatenate Date

Add Constant Column

Sort Data

Select Output Columns

Unit Tests

Performance Tests

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
config		config
data		data
etl		etl
tests		tests
.gitignore		.gitignore
README.md		README.md
main.py		main.py

Liszet/Python_etl_config_project

Folders and files

Latest commit

History

Repository files navigation

Configurable ETL system

Overview

Features

Installation

How to Run

Transformations

Rename Column

Concatenate Date

Add Constant Column

Sort Data

Select Output Columns

Unit Tests

Performance Tests

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages