ETL homework task completed by NicolasMuras · Pull Request #38 · HedgeApple/etl_homework

NicolasMuras · 2024-05-05T16:52:42Z

ETL Service

Overview

This Python project implements an Extract, Transform, Load (ETL) service that processes data from CSV files. It reads data from an input CSV file, applies transformations based on specified rules, and then writes the transformed data to an output CSV file.

Features

Reads CSV files containing raw data.
Transforms the data according to predefined rules.
Writes the transformed data to a new CSV file.
Supports various transformations such as date format conversion, currency rounding, and unit conversion.
Code coverage is at 93%. You can run the unit tests with pytest and coverage: coverage run -m pytest .

How to Use

Usage

Prepare your input CSV file containing the raw data to be processed.
Define a columns mapping file in JSON format. This file specifies how each column in the input file should be transformed. An example columns mapping file might look like this:

{
    "system creation date": "date",
    "wholesale ($)": "wholesale_price",
    "item width (cm)": "width_inches",
    "item length (feet)": "length_inches",
    "item weight (kg)": "weight_pounds",
    "upc": "upc_code"
}

Run the ETL service using the following command:

python main.py input.csv output.csv columns_mapping.json

Replace input.csv with the path to your input CSV file, output.csv with the desired path for the output CSV file, and columns_mapping.json with the path to your columns mapping file.

Once the process completes, you will find the transformed data written to the specified output CSV file.
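As a rough illustration of the command-line contract described above, the entry point could parse its three positional arguments and load the mapping file roughly as follows. This is only a sketch: parse_args and load_mapping are hypothetical names, and the actual main.py in this PR may be organized differently.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the three positional arguments shown in the usage command.

    Hypothetical helper; the argument names are assumptions.
    """
    parser = argparse.ArgumentParser(description="Run the ETL service on a CSV file.")
    parser.add_argument("input_csv", help="path to the raw input CSV")
    parser.add_argument("output_csv", help="path for the transformed CSV")
    parser.add_argument("columns_mapping", help="path to the JSON columns mapping")
    return parser.parse_args(argv)


def load_mapping(path):
    """Load the input-column -> output-column mapping from a JSON file."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```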

Example

Let's say we have an input CSV file data.csv containing the following data:

system creation date	wholesale ($)	item width (cm)	item length (feet)	item weight (kg)	upc
7/7/15	$10.50	20	2	2.1	123456789012

And a columns mapping file columns_mapping.json as shown above.

Running the ETL service with the following command:

python main.py data.csv transformed_data.csv columns_mapping.json

Will result in a new CSV file transformed_data.csv with the following data:

date	wholesale_price	width_inches	length_inches	weight_pounds	upc_code
2015-07-07	10.50	7.87	24	4.62	123456789012
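To make the example concrete, here is a minimal sketch of three of the transformations it relies on: date reformatting, price cleanup, and centimeters to inches. The function names are hypothetical and are not taken from etl.py; the input date format (month/day/two-digit year) is an assumption inferred from the sample row.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

CM_PER_INCH = 2.54


def to_iso_date(value: str) -> str:
    # Assumes month/day/two-digit-year input, e.g. "7/7/15".
    return datetime.strptime(value, "%m/%d/%y").date().isoformat()


def to_price(value: str) -> str:
    # Strip the currency symbol and thousands separators, then round to cents.
    amount = Decimal(value.replace("$", "").replace(",", ""))
    return str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def cm_to_inches(value: str) -> str:
    # Convert centimeters to inches, rounded to two decimal places.
    return str(round(float(value) / CM_PER_INCH, 2))


print(to_iso_date("7/7/15"))   # 2015-07-07
print(to_price("$10.50"))      # 10.50
print(cm_to_inches("20"))      # 7.87
```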

john-parton · 2024-05-09T14:28:37Z

etl.py

+            with open(filename, 'r', newline='') as file:
+                reader = csv.reader(file)
+                for row in reader:
+                    data.append(row)


This could be a generator and just yield row to avoid building out a potentially large in-memory data structure.
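A minimal sketch of the generator approach suggested here, assuming a standalone function (iter_csv_rows is a hypothetical name; in the PR this logic lives in a method on the ETL class):

```python
import csv
from typing import Iterator, List


def iter_csv_rows(filename: str) -> Iterator[List[str]]:
    """Yield rows one at a time instead of collecting them in a list."""
    with open(filename, "r", newline="") as file:
        # The file stays open only while the generator is being consumed.
        yield from csv.reader(file)
```

Callers that really need every row at once can still write list(iter_csv_rows(path)); callers that stream never hold more than one row in memory.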

john-parton · 2024-05-09T14:30:00Z

etl.py

+
+        return data
+
+    def write_csv(self, filename: str, data: List[List[Any]]) -> None:


You can use Iterable[list[object]] here: every list is also an Iterable, so the wider annotation still accepts lists while also allowing generators.
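A sketch of what the looser signature might look like, written as a free function for brevity (the PR defines write_csv as a method, and the body below is an assumption). The list[object] syntax requires Python 3.9 or later.

```python
import csv
from typing import Iterable


def write_csv(filename: str, data: Iterable[list[object]]) -> None:
    """Write rows to a CSV file; accepts any iterable of rows, including generators."""
    with open(filename, "w", newline="") as file:
        # writerows consumes the iterable lazily, one row at a time.
        csv.writer(file).writerows(data)
```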

john-parton · 2024-05-09T14:30:46Z

etl.py

+            for column_name in self.input_file_headers:
+
+                # Get the input column name corresponding to the output column name
+                input_column_index = list(self.columns_mapping.keys()).index(column_name)


Move this list() call out of the loop and assign the result to a temporary variable, so the list is not rebuilt on every iteration of a tight loop.

john-parton · 2024-05-09T14:31:16Z

etl.py

+                # Get the input column name corresponding to the output column name
+                input_column_index = list(self.columns_mapping.keys()).index(column_name)
+
+                if list(self.columns_mapping.values())[input_column_index] is None:


Same here. Build the list() once and reuse the variable rather than calling list() in the hot path.
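The two comments above could be addressed along these lines. This is a sketch under assumptions: output_names is a hypothetical function, and skipping columns whose mapping is None is a guess at what the surrounding code does.

```python
from typing import Dict, List, Optional


def output_names(headers: List[str], columns_mapping: Dict[str, Optional[str]]) -> List[str]:
    # Build both lists once, outside the loop, instead of on every iteration.
    mapping_keys = list(columns_mapping.keys())
    mapping_values = list(columns_mapping.values())

    names = []
    for column_name in headers:
        input_column_index = mapping_keys.index(column_name)
        output_name = mapping_values[input_column_index]
        if output_name is None:
            # Assumption: unmapped columns are skipped.
            continue
        names.append(output_name)
    return names
```

Since columns_mapping is a dict, a direct lookup such as columns_mapping[column_name] would avoid both lists and the linear index() search altogether.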

john-parton · 2024-05-09T14:31:44Z

etl.py

+
+                transformed_value = self.transform_column_value(column_name, value)
+                transformed_row.append(transformed_value)
+            transformed_data.append(transformed_row)


Again, this could be a generator using yield to avoid building out a large intermediate data structure
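A sketch of the same idea for the transform step, assuming the per-value transformation is passed in as a callable (transform_rows and its parameters are hypothetical names, not the PR's actual method):

```python
from typing import Callable, Iterable, Iterator, List


def transform_rows(
    rows: Iterable[List[str]],
    headers: List[str],
    transform_value: Callable[[str, str], str],
) -> Iterator[List[str]]:
    """Yield each transformed row as soon as it is ready."""
    for row in rows:
        # Pair every value with its column name and transform it.
        yield [transform_value(name, value) for name, value in zip(headers, row)]
```

Chained with a generator-based reader and a writer that accepts any iterable, the whole pipeline would process one row at a time.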

john-parton · 2024-05-09T14:36:16Z

etl.py

+            elif unit == 'feet':
+                # 1 feet equal to 12 inches
+                rounded_amount = round(float(value) * 12, 2)
+                return str(rounded_amount)


There's a dangling condition: if unit != "cm" and unit != "feet", this returns None without raising an error.
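One way to close that gap is to raise explicitly for unrecognized units. The sketch below uses a hypothetical to_inches function and omits the empty-value guard from the PR for brevity:

```python
def to_inches(value: str, unit: str) -> str:
    """Convert a measurement to inches, failing loudly on unknown units."""
    if unit == "cm":
        return str(round(float(value) / 2.54, 2))
    if unit == "feet":
        # 1 foot equals 12 inches
        return str(round(float(value) * 12, 2))
    # No silent fall-through: an unexpected unit is reported to the caller.
    raise ValueError(f"Unsupported unit: {unit!r}")
```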

john-parton · 2024-05-09T14:36:58Z

etl.py

+        - The value of the measurement converted to inches.
+        """
+        if value:
+            if unit == 'cm':


Comparison is case sensitive.

If unit == 'CM' (uppercase), this will return None without an error
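A possible fix is to normalize the unit before comparing. The sketch below combines that with a lookup table; the names INCHES_PER_UNIT, normalize_unit, and convert_to_inches are hypothetical.

```python
# Conversion factors to inches, keyed by lowercase unit name.
INCHES_PER_UNIT = {"cm": 1 / 2.54, "feet": 12.0}


def normalize_unit(unit: str) -> str:
    # Make the comparison insensitive to case and surrounding whitespace.
    return unit.strip().lower()


def convert_to_inches(value: str, unit: str) -> str:
    key = normalize_unit(unit)
    if key not in INCHES_PER_UNIT:
        raise ValueError(f"Unsupported unit: {unit!r}")
    return str(round(float(value) * INCHES_PER_UNIT[key], 2))
```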

john-parton · 2024-05-09T14:37:35Z

etl.py

+        Returns:
+        - The value of the measurement converted to pounds.
+        """
+        if unit == 'kg':


Same issues as dimensions

task was fun

3c3829a

john-parton reviewed May 9, 2024

NicolasMuras wants to merge 1 commit into HedgeApple:master from NicolasMuras:master
