ETL homework task completed #38

Open
NicolasMuras wants to merge 1 commit into HedgeApple:master from NicolasMuras:master

Conversation

@NicolasMuras

ETL Service

Overview

This Python project implements an Extract, Transform, Load (ETL) service that processes data from CSV files. It reads data from an input CSV file, applies transformations based on specified rules, and then writes the transformed data to an output CSV file.

Features

  • Reads CSV files containing raw data.
  • Transforms the data according to predefined rules.
  • Writes the transformed data to a new CSV file.
  • Supports various transformations such as date format conversion, currency rounding, and unit conversion.
  • Code coverage is at 93%. You can run the unit tests with pytest and coverage: coverage run -m pytest .

How to Use

Usage

  1. Prepare your input CSV file containing the raw data to be processed.

  2. Define a columns mapping file in JSON format. This file specifies how each column in the input file should be transformed. An example columns mapping file might look like this:

{
    "system creation date": "date",
    "wholesale ($)": "wholesale_price",
    "item width (cm)": "width_inches",
    "item length (feet)": "length_inches",
    "item weight (kg)": "weight_pounds",
    "upc": "upc_code"
}
  3. Run the ETL service using the following command:
python main.py input.csv output.csv columns_mapping.json

Replace input.csv with the path to your input CSV file, output.csv with the desired path for the output CSV file, and columns_mapping.json with the path to your columns mapping file.

  4. Once the process completes, you will find the transformed data written to the specified output CSV file.
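The steps above can be sketched as follows. This is a minimal illustration, not the PR's actual main.py: the function names parse_args and load_columns_mapping are hypothetical, and the sketch assumes the three arguments are positional, as in the command shown.

```python
import argparse
import json
from typing import Dict, List, Optional


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the three positional paths used by the command above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="CSV ETL service")
    parser.add_argument("input_csv", help="path to the raw input CSV file")
    parser.add_argument("output_csv", help="path for the transformed CSV file")
    parser.add_argument("columns_mapping", help="path to the JSON columns mapping")
    return parser.parse_args(argv)


def load_columns_mapping(path: str) -> Dict[str, Optional[str]]:
    """Load the input-column to output-column mapping from a JSON file."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)
```

A JSON null in the mapping loads as None, which matches the `is None` check visible in the reviewed code further down.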

Example

Let's say we have an input CSV file data.csv containing the following data:

| system creation date | wholesale ($) | item width (cm) | item length (feet) | item weight (kg) | upc |
|---|---|---|---|---|---|
| 7/7/15 | $10.50 | 20 | 2 | 2.1 | 123456789012 |

And a columns mapping file columns_mapping.json as shown above.

Running the ETL service with the following command:

python main.py data.csv transformed_data.csv columns_mapping.json

Will result in a new CSV file transformed_data.csv with the following data:

| date | wholesale_price | width_inches | length_inches | weight_pounds | upc_code |
|---|---|---|---|---|---|
| 2015-07-07 | 10.50 | 7.87 | 24 | 4.62 | 123456789012 |
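Three of the transformations in this example can be reproduced with the standard library. The sketch below is illustrative only: the function names are hypothetical, the date is assumed to be in month/day/two-digit-year order, and the PR's own implementation may differ.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP


def convert_date(value: str) -> str:
    """Convert a date such as '7/7/15' to ISO format (assumes month/day/year order)."""
    return datetime.strptime(value, "%m/%d/%y").strftime("%Y-%m-%d")


def round_currency(value: str) -> str:
    """Strip the currency symbol and round to two decimal places."""
    amount = Decimal(value.replace("$", "").replace(",", ""))
    return str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def cm_to_inches(value: str) -> str:
    """Convert centimetres to inches (1 inch = 2.54 cm), rounded to two decimals."""
    return str(round(float(value) / 2.54, 2))


print(convert_date("7/7/15"))     # 2015-07-07
print(round_currency("$10.50"))   # 10.50
print(cm_to_inches("20"))         # 7.87
```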

with open(filename, 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        data.append(row)
Contributor

This could be a generator and just yield row to avoid building out a potentially large in-memory data structure.
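A minimal sketch of the generator approach suggested here. The standalone function name read_csv_rows is hypothetical; in the PR this logic lives in a method that returns a list.

```python
import csv
from typing import Iterator, List


def read_csv_rows(filename: str) -> Iterator[List[str]]:
    """Yield rows one at a time instead of accumulating them in a list."""
    with open(filename, "r", newline="") as file:
        # The file stays open only while the caller is iterating.
        yield from csv.reader(file)
```

Callers that need random access can still write `list(read_csv_rows(path))`, but a streaming pipeline never holds more than one row in memory.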


return data

def write_csv(self, filename: str, data: List[List[Any]]) -> None:
Contributor

You can use Iterable[list[object]] here: Iterable is covariant and every list is an Iterable, so the function would accept lists as well as generators.
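A sketch of a writer with a looser parameter type. It is written as a standalone function rather than the PR's method, and it uses Iterable[Sequence[object]] instead of the suggested Iterable[list[object]]; that substitution is an assumption of this sketch, made because list is invariant in its element type while Sequence is covariant, so a type checker then also accepts rows typed as list[str].

```python
import csv
from typing import Iterable, Sequence


def write_csv(filename: str, data: Iterable[Sequence[object]]) -> None:
    """Write rows from any iterable (list, tuple, generator) to a CSV file."""
    with open(filename, "w", newline="") as file:
        # writerows consumes the iterable lazily, one row at a time.
        csv.writer(file).writerows(data)
```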

for column_name in self.input_file_headers:

    # Get the input column name corresponding to the output column name
    input_column_index = list(self.columns_mapping.keys()).index(column_name)
Contributor

Move this list() call out of the loop and assign the result to a temporary variable, so the list is not rebuilt on every iteration of a tight loop.

    # Get the input column name corresponding to the output column name
    input_column_index = list(self.columns_mapping.keys()).index(column_name)

    if list(self.columns_mapping.values())[input_column_index] is None:
Contributor

Same here: the list() result should be stored in a variable and reused rather than rebuilt in the hot path.
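A sketch of the hoisting these two comments describe. The function mapped_headers is hypothetical, and skipping columns whose mapping is None is an assumption about what the surrounding code does with that branch.

```python
from typing import Dict, List, Optional


def mapped_headers(
    input_headers: List[str], columns_mapping: Dict[str, Optional[str]]
) -> List[str]:
    """Return the output name for each input header, skipping unmapped columns."""
    # Build both lists once, outside the loop, instead of on every iteration.
    mapping_keys = list(columns_mapping.keys())
    mapping_values = list(columns_mapping.values())

    output_names = []
    for column_name in input_headers:
        input_column_index = mapping_keys.index(column_name)
        output_name = mapping_values[input_column_index]
        if output_name is None:
            continue  # assumption: a None mapping means the column is dropped
        output_names.append(output_name)
    return output_names
```

Since keys() and values() iterate in the same order, the index lookup is equivalent to `columns_mapping[column_name]`, which avoids both lists and the linear search entirely.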


transformed_value = self.transform_column_value(column_name, value)
transformed_row.append(transformed_value)
transformed_data.append(transformed_row)
Contributor

Again, this could be a generator using yield to avoid building out a large intermediate data structure
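A sketch of the transform step as a generator. The names transform_rows and transform_value are hypothetical stand-ins for the PR's method and its transform_column_value helper.

```python
from typing import Callable, Iterable, Iterator, List


def transform_rows(
    rows: Iterable[List[str]],
    headers: List[str],
    transform_value: Callable[[str, str], str],
) -> Iterator[List[str]]:
    """Yield each transformed row as soon as it is ready."""
    for row in rows:
        yield [transform_value(name, value) for name, value in zip(headers, row)]
```

Combined with a generator-based reader and a writer that accepts any iterable, rows stream from input to output without an intermediate list.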

Comment on lines +151 to +154
elif unit == 'feet':
    # 1 foot equals 12 inches
    rounded_amount = round(float(value) * 12, 2)
    return str(rounded_amount)
Contributor

There's an unhandled case: if unit is neither "cm" nor "feet", this returns None without raising an error.

- The value of the measurement converted to inches.
"""
if value:
    if unit == 'cm':
Contributor

Comparison is case sensitive.

If unit == 'CM' (uppercase), this will return None without an error
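One way to address both this comment and the unhandled-unit comment is to normalize the unit and fail loudly on anything unknown. This is a sketch only: the function name, the lookup table, and returning an empty string for an empty value are all assumptions, not the PR's code.

```python
# Inches per one unit of each supported length unit.
INCHES_PER_UNIT = {
    "cm": 1 / 2.54,  # 1 inch = 2.54 cm
    "feet": 12.0,    # 1 foot = 12 inches
}


def convert_to_inches(value: str, unit: str) -> str:
    """Convert a length to inches, rejecting unknown units explicitly."""
    if not value:
        return ""  # assumption: empty cells pass through unchanged

    normalized_unit = unit.strip().lower()  # 'CM' and ' Feet ' are accepted
    if normalized_unit not in INCHES_PER_UNIT:
        raise ValueError(f"Unsupported length unit: {unit!r}")

    return str(round(float(value) * INCHES_PER_UNIT[normalized_unit], 2))
```

The same pattern applies to the weight conversion: a table of factors keyed by a normalized unit, with a ValueError for anything missing.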

Returns:
- The value of the measurement converted to pounds.
"""
if unit == 'kg':
Contributor

Same issues as the dimensions conversion: unknown units return None silently, and the comparison is case sensitive.
