ETL homework task completed #38
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,19 +1,53 @@ | ||
| # Task | ||
| <h1>ETL Service</h1> | ||
| <h2>Overview</h2> | ||
| <p>This Python project implements an Extract, Transform, Load (ETL) service that processes data from CSV files. It reads data from an input CSV file, applies transformations based on specified rules, and then writes the transformed data to an output CSV file.</p> | ||
| <h2>Features</h2> | ||
| <ul> | ||
| <li>Reads CSV files containing raw data.</li> | ||
| <li>Transforms the data according to predefined rules.</li> | ||
| <li>Writes the transformed data to a new CSV file.</li> | ||
| <li>Supports various transformations such as date format conversion, currency rounding, and unit conversion.</li> | ||
| <li>Code coverage is at 93%. You can run the unit tests with pytest and coverage: <code>coverage run -m pytest .</code></li> | ||
| </ul> | ||
| <h2>How to Use</h2> | ||
| <h3>Usage</h3> | ||
| <ol> | ||
| <li><p>Prepare your input CSV file containing the raw data to be processed.</p></li> | ||
| <li><p>Define a columns mapping file in JSON format. This file specifies how each column in the input file should be transformed. An example columns mapping file might look like this:</p></li> | ||
| </ol> | ||
|
|
||
| 1. Fork this project | ||
| 2. Create a python script that reads all of the rows from `homework.csv` and outputs them to a new file `formatted.csv` using the headers from `example.csv` as a guideline. (See `Transformations` below for more details.) | ||
| 3. You may use any libraries you wish, but you must include a `requirements.txt` if you import anything outside of the standard library. | ||
| 4. There is no time limit for this assignment. | ||
| 5. You may ask any clarifying questions via email. | ||
| 6. Create a pull request against this repository with an English description of how your code works when you are complete | ||
| ```json | ||
| { | ||
| "system creation date": "date", | ||
| "wholesale ($)": "wholesale_price", | ||
| "item width (cm)": "width_inches", | ||
| "item length (feet)": "length_inches", | ||
| "item weight (kg)": "weight_pounds", | ||
| "upc": "upc_code" | ||
| } | ||
| ``` | ||
| <ol start="3"><li>Run the ETL service using the following command:</li></ol> | ||
| <pre><code>python main.py input.csv output.csv columns_mapping.json | ||
| </code></pre> | ||
|
|
||
| ## Transformations | ||
| <p>Replace <code>input.csv</code> with the path to your input CSV file, <code>output.csv</code> with the desired path for the output CSV file, and <code>columns_mapping.json</code> with the path to your columns mapping file.</p> | ||
|
|
||
| Follow industry standards for each data type when deciding on the final format for cells. | ||
| <ol start="4"><li>Once the process completes, you will find the transformed data written to the specified output CSV file.</li></ol> | ||
| <h2>Example</h2> | ||
| <p>Let's say we have an input CSV file <code>data.csv</code> containing the following data:</p> | ||
|
|
||
| | system creation date | wholesale ($) | item width (cm) | item length (feet) | item weight (kg) | upc | | ||
| |----------------------|---------------|------------------|---------------------|-------------------|-------------| | ||
| | 7/7/15 | $10.50 | 20 | 2 | 2.1 | 123456789012| | ||
|
|
||
|
|
||
| <p>And a columns mapping file <code>columns_mapping.json</code> as shown above.</p> | ||
| <p>Running the ETL service with the following command:</p> | ||
| <pre><code>python main.py data.csv transformed_data.csv columns_mapping.json | ||
| </code></pre> | ||
| <p>Will result in a new CSV file <code>transformed_data.csv</code> with the following data:</p> | ||
|
|
||
| | date | wholesale_price | width_inches | length_inches | weight_pounds | upc_code | | ||
| |------------|-----------------|--------------|---------------|---------------|-------------| | ||
| | 2015-07-07 | 10.50 | 7.87 | 24.0 | 4.63 | 123456789012| | ||
|
|
||
| * Dates should use ISO 8601 | ||
| * Currency should be rounded to unit of accounting. Assume USD for currency and round to cents. | ||
| * For dimensions without units, assume inches. Convert anything which isn't in inches to inches. | ||
| * For weights without units, assume pounds. Convert anything which isn't in pounds to pounds. | ||
| * UPC / Gtin / EAN should be handled as strings | ||
| * Floating point and decimal numbers should preserve as much precision as possible |
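The transformation rules listed above could be sketched roughly as follows. This is an illustrative sketch, not the code in this pull request: the helper names `to_iso_date`, `round_usd`, and `cm_to_inches` are hypothetical, the input date format `%m/%d/%y` is assumed from the README example (`7/7/15`), and `Decimal` is used on the assumption that avoiding binary floating point is the preferred way to round to cents and preserve precision.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP


def to_iso_date(raw: str) -> str:
    """Parse an assumed M/D/YY date and return it as ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw.strip(), "%m/%d/%y").date().isoformat()


def round_usd(raw: str) -> str:
    """Strip '$' and ',' and round to cents using decimal arithmetic."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    cents = Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(cents)


def cm_to_inches(raw: str) -> str:
    """Convert centimeters to inches (1 inch is exactly 2.54 cm), unrounded."""
    return str(Decimal(raw.strip()) / Decimal("2.54"))


print(to_iso_date("7/7/15"))   # 2015-07-07
print(round_usd("$10.50"))     # 10.50
```

Leaving `cm_to_inches` unrounded is one reading of the "preserve as much precision as possible" rule; whether dimensions should be rounded at all is a question for the maintainers.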
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,148 @@ | ||
| { | ||
| "item number": "manufacturer_sku", | ||
| "upc": "ean13", | ||
| "new item": null, | ||
| "system creation date": null, | ||
| "min order qty": null, | ||
| "sales uom": null, | ||
| "wholesale ($)": "cost_price", | ||
| "map ($)": "min_price", | ||
| "msrp ($)": null, | ||
| "description": "product__title", | ||
| "long description": "product__description", | ||
| "brand": "product__brand__name", | ||
| "item category": "product__product_class__name", | ||
| "item type": null, | ||
| "outdoor": null, | ||
| "item width (inches)": "width", | ||
| "item depth (inches)": "depth", | ||
| "item height (inches)": "height", | ||
| "item diameter (inches)": null, | ||
| "item weight (pounds)": "weight", | ||
| "multi-piece dimension 1 (inches)": null, | ||
| "multi-piece dimension 2 (inches)": null, | ||
| "multi-piece dimension 3 (inches)": null, | ||
| "multi-piece dimension 4 (inches)": null, | ||
| "item materials": "attrib__material", | ||
| "primary color family": "attrib__color", | ||
| "item finish": "attrib__finish", | ||
| "item finish 1": null, | ||
| "item finish 2": null, | ||
| "item finish 3": null, | ||
| "primary image filename": null, | ||
| "url primary image": null, | ||
| "url alternate image 1": null, | ||
| "url alternate image 2": null, | ||
| "url alternate image 3": null, | ||
| "url alternate image 4": null, | ||
| "url alternate image 5": null, | ||
| "url alternate image 6": null, | ||
| "url alternate image 7": null, | ||
| "url alternate image 8": null, | ||
| "url room setting image 1": null, | ||
| "url room setting image 2": null, | ||
| "url room setting image 3": null, | ||
| "url drawing": null, | ||
| "url interactive 360 image": null, | ||
| "url animated gif": null, | ||
| "url product sheet": null, | ||
| "url instruction sheet": null, | ||
| "url marketing sheet 1": null, | ||
| "url california label (jpg)": null, | ||
| "url california label (pdf)": null, | ||
| "item style": null, | ||
| "item substyle": null, | ||
| "item substyle 2": null, | ||
| "item collection": null, | ||
| "licensed by": null, | ||
| "carton count": null, | ||
| "truck only": null, | ||
| "carton 1 width (inches)": "boxes__0__width", | ||
| "carton 1 length (inches)": "boxes__0__length", | ||
| "carton 1 height (inches)": "boxes__0__height", | ||
| "carton 1 weight (pounds)": "boxes__0__weight", | ||
| "carton 1 volume (cubic feet)": null, | ||
| "carton 2 width (inches)": "boxes__1__width", | ||
| "carton 2 length (inches)": "boxes__1__length", | ||
| "carton 2 height (inches)": "boxes__1__height", | ||
| "carton 2 weight (pounds)": "boxes__1__weight", | ||
| "carton2volumecubicfeet": null, | ||
| "carton 3 width (inches)": "boxes__2__width", | ||
| "carton 3 length (inches)": "boxes__2__length", | ||
| "carton 3 height (inches)": "boxes__2__height", | ||
| "carton 3 weight (pounds)": "boxes__2__weight", | ||
| "carton 3 volume (cubic feet)": null, | ||
| "ada compliant": null, | ||
| "available with eef": null, | ||
| "conversion kit option": null, | ||
| "title 24 compliant": null, | ||
| "safety rating": null, | ||
| "certified damp/wet": null, | ||
| "bulb 1 count": null, | ||
| "bulb 1 wattage": null, | ||
| "bulb 1 type": null, | ||
| "bulb 1 base": null, | ||
| "bulb 1 included": null, | ||
| "bulb 2 count": null, | ||
| "bulb 2 wattage": null, | ||
| "bulb 2 type": null, | ||
| "bulb 2 base": null, | ||
| "bulb 2 included": null, | ||
| "led": null, | ||
| "total lumens": null, | ||
| "color temperature": null, | ||
| "cri": null, | ||
| "voltage": null, | ||
| "switch type": null, | ||
| "dimmable": null, | ||
| "lamp base dimensions (inches)": null, | ||
| "backplate/canopy dimensions (inches)": null, | ||
| "extension rods (inches)": null, | ||
| "min overall height (inches)": null, | ||
| "max overall height (inches)": null, | ||
| "min extension (inches)": null, | ||
| "max extension (inches)": null, | ||
| "hcwo (inches)": null, | ||
| "shade/glass description": null, | ||
| "shade/glass materials": null, | ||
| "shade/glass finish": null, | ||
| "shade/glass width": null, | ||
| "shade/glass width at top (inches)": null, | ||
| "shade/glass width at bottom (inches)": null, | ||
| "shade/glass height (inches)": null, | ||
| "shade shape": null, | ||
| "harp/spider": null, | ||
| "cord color": null, | ||
| "cord length (inches)": null, | ||
| "chain length (inches)": null, | ||
| "chain price ($)": null, | ||
| "replacement glass price ($)": null, | ||
| "replacement crystal price ($)": null, | ||
| "mirror width (inches)": null, | ||
| "mirror height (inches)": null, | ||
| "drawer count": null, | ||
| "drawer 1 interior dimensions (inches)": null, | ||
| "drawer 2 interior dimensions (inches)": null, | ||
| "drawer 3 interior dimensions (inches)": null, | ||
| "furniture arm height (inches)": null, | ||
| "furniture seat height (inches)": null, | ||
| "furniture seat dimensions (inches)": null, | ||
| "furniture weight capacity (pounds)": null, | ||
| "country of origin": "product__country_of_origin__alpha_3", | ||
| "primary catalog": null, | ||
| "primary catalog page": null, | ||
| "related items": null, | ||
| "brand bio": null, | ||
| "helpful tips": null, | ||
| "selling point 1": "product__bullets__1", | ||
| "selling point 2": "product__bullets__2", | ||
| "selling point 3": "product__bullets__3", | ||
| "selling point 4": "product__bullets__4", | ||
| "selling point 5": "product__bullets__5", | ||
| "selling point 6": "product__bullets__6", | ||
| "selling point 7": null, | ||
| "selling point 8": null, | ||
| "selling point 9": null, | ||
| "selling point 10": null, | ||
| "record status": null | ||
| } |
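One way a mapping like the one above might be consumed is to resolve it once into (input index, input name, output name) triples before iterating over rows, rather than searching the mapping for every cell. The sketch below is illustrative only: `resolve_columns` is a hypothetical helper, and it assumes that `null` entries mean "drop this column" and that input headers are matched by name rather than by position.

```python
import json
from typing import Dict, List, Optional, Tuple


def resolve_columns(
    input_headers: List[str],
    mapping: Dict[str, Optional[str]],
) -> List[Tuple[int, str, str]]:
    """Return (input index, input name, output name) for every mapped column."""
    resolved = []
    for index, name in enumerate(input_headers):
        output_name = mapping.get(name)  # None if unmapped or mapped to null
        if output_name is not None:
            resolved.append((index, name, output_name))
    return resolved


# Small inline mapping standing in for the JSON file above.
mapping = json.loads(
    '{"item number": "manufacturer_sku", "upc": "ean13", "new item": null}'
)
columns = resolve_columns(["upc", "new item", "item number"], mapping)
output_headers = [output_name for _, _, output_name in columns]
print(output_headers)  # ['ean13', 'manufacturer_sku']
```

Because the lookup happens once, each row only needs `row[index]` for the resolved columns, and the output header order follows the input file rather than the key order of the JSON.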
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| import csv | ||
| from datetime import datetime | ||
| import json | ||
| from typing import List, Any, Optional | ||
|
|
||
|
|
||
| class ETLService: | ||
|
|
||
| def __init__(self): | ||
| self.columns_mapping = None | ||
| self.input_file_headers = None | ||
| self.output_file_headers = None | ||
|
|
||
| def run(self, input_file: str, output_file: str, columns_mapping_file: str) -> None: | ||
| """Performs the complete ETL process.""" | ||
|
|
||
| input_data = self.read_csv(input_file) | ||
|
|
||
| input_data = self.extract_headers(input_data) | ||
|
|
||
| with open(columns_mapping_file, 'r') as f: | ||
| self.columns_mapping = json.load(f) | ||
|
|
||
| transformed_data = self.transform_data(input_data) | ||
|
|
||
| self.output_file_headers = [column for column in self.columns_mapping.values() if column is not None] | ||
|
|
||
| self.write_csv(output_file, transformed_data) | ||
|
|
||
| def extract_headers(self, input_data: List[List[Any]]): | ||
| """Save the headers before removing the first row and remove headers from data.""" | ||
| self.input_file_headers = input_data[0] if input_data else [] | ||
| if input_data: | ||
| input_data_without_headers = input_data[1:] | ||
| return input_data_without_headers | ||
| else: | ||
| return input_data | ||
|
|
||
| def read_csv(self, filename: str) -> List[List[Any]]: | ||
| """Reads a CSV file and returns the data as a list of rows.""" | ||
| data: List[List[Any]] = [] | ||
| try: | ||
| with open(filename, 'r', newline='') as file: | ||
| reader = csv.reader(file) | ||
| for row in reader: | ||
| data.append(row) | ||
| except FileNotFoundError: | ||
| print(f"Error: Could not find the file '{filename}'") | ||
| except Exception as e: | ||
| print(f"Error while reading the file '{filename}': {e}") | ||
|
|
||
| return data | ||
|
|
||
| def write_csv(self, filename: str, data: List[List[Any]]) -> None: | ||
|
Contributor
You can use |
||
| """Writes the data to a CSV file with the provided headers.""" | ||
| try: | ||
| with open(filename, 'w', newline='') as file: | ||
| writer = csv.writer(file) | ||
| writer.writerow(self.output_file_headers) | ||
| writer.writerows(data) | ||
| print(f"The formatted data has been written to '{filename}'") | ||
| except Exception as e: | ||
| print(f"Error writing to the file '{filename}': {e}") | ||
|
|
||
| def transform_data( | ||
| self, | ||
| input_data: List[List[Any]], | ||
| ) -> List[List[Any]]: | ||
| """Transforms the data according to the example headers.""" | ||
| transformed_data: List[List[Any]] = [] | ||
|
|
||
| for input_row in input_data: | ||
| transformed_row = [] | ||
| for column_name in self.input_file_headers: | ||
|
|
||
| # Get the position of this input column within the columns mapping | ||
| input_column_index = list(self.columns_mapping.keys()).index(column_name) | ||
|
Contributor
Move out this |
||
|
|
||
| if list(self.columns_mapping.values())[input_column_index] is None: | ||
|
Contributor
Same here. |
||
| continue | ||
|
|
||
| # Get the value of the input column | ||
| value = input_row[input_column_index] | ||
|
|
||
| transformed_value = self.transform_column_value(column_name, value) | ||
| transformed_row.append(transformed_value) | ||
| transformed_data.append(transformed_row) | ||
|
Contributor
Again, this could be a generator using |
||
| return transformed_data | ||
|
|
||
| def transform_column_value(self, column_name: str, value: Any) -> Any: | ||
| """Transforms the value of a column based on its name.""" | ||
| if column_name == 'system creation date': | ||
| return self.date_iso_format_transform(value) | ||
| elif '($)' in column_name: | ||
| return self.round_currency(value) | ||
| elif 'cm' in column_name: | ||
| return self.convert_to_inches(value, 'cm') | ||
| elif 'feet' in column_name: | ||
| return self.convert_to_inches(value, 'feet') | ||
| elif 'kg' in column_name: | ||
| return self.convert_to_pounds(value, 'kg') | ||
| elif 'upc' in column_name or 'gtin' in column_name or 'ean' in column_name: | ||
| return str(value) | ||
| else: | ||
| return value | ||
|
|
||
| def date_iso_format_transform(self, date: str) -> Optional[str]: | ||
| """Transforms a date from the format '7/7/15' to ISO 8601 (YYYY-MM-DD).""" | ||
| try: | ||
| original_date = datetime.strptime(date, '%m/%d/%y') | ||
| iso_date = original_date.strftime('%Y-%m-%d') | ||
| return iso_date | ||
| except ValueError: | ||
| print(f"Error: The date '{date}' does not have the expected format.") | ||
| return date | ||
|
|
||
| def round_currency(self, amount: str) -> str: | ||
| """Rounds the currency amount to the nearest cent.""" | ||
| try: | ||
| # Remove the $ sign and thousands separators if present. | ||
| amount = amount.replace('$', '') | ||
| amount = amount.replace(',', '') | ||
|
|
||
| # Convert the amount to float and round to the nearest cent. | ||
| rounded_amount = round(float(amount), 2) | ||
|
|
||
| # Format the amount as a string with two decimal places. | ||
| formatted_amount = '{:.2f}'.format(rounded_amount) | ||
|
|
||
| return str(formatted_amount) | ||
| except ValueError: | ||
| print(f"Error: The value '{amount}' is not a valid amount.") | ||
| return amount | ||
|
|
||
| def convert_to_inches(self, value: str, unit: str) -> str: | ||
| """ | ||
| Converts a given measurement from different length units to inches. | ||
|
|
||
| Args: | ||
| - value: The numeric value of the measurement. | ||
| - unit: The unit of measurement (can be 'cm' for centimeters or 'feet' for feet). | ||
|
|
||
| Returns: | ||
| - The value of the measurement converted to inches. | ||
| """ | ||
| if value: | ||
| if unit == 'cm': | ||
|
Contributor
Comparison is case sensitive. If unit == 'CM' (uppercase), this will return |
||
| # 1 inch equals 2.54 centimeters | ||
| rounded_amount = round(float(value) / 2.54, 2) | ||
| return str(rounded_amount) | ||
| elif unit == 'feet': | ||
| # 1 foot equals 12 inches | ||
| rounded_amount = round(float(value) * 12, 2) | ||
| return str(rounded_amount) | ||
|
Comment on lines +151 to +154
Contributor
There's a dangling condition if unit != "cm" and unit != "feet", this returns |
||
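A sketch of how the two review comments above might be addressed: normalize the unit before comparing, and fail loudly on an unknown unit instead of falling through. This is not the author's code. The `INCHES_PER_UNIT` table and the standalone `convert_to_inches` below are hypothetical, raising `ValueError` is only one possible policy, and the two-decimal rounding simply mirrors what the PR does.

```python
from fractions import Fraction

# Exact conversion factors to inches, keyed by lowercase unit name.
INCHES_PER_UNIT = {
    "inches": Fraction(1),
    "cm": Fraction(100, 254),  # 1 inch is exactly 2.54 cm
    "feet": Fraction(12),
}


def convert_to_inches(value: str, unit: str) -> str:
    """Convert a length to inches, treating the unit case-insensitively."""
    if not value.strip():
        return value  # keep empty cells empty
    key = unit.strip().lower()
    if key not in INCHES_PER_UNIT:
        # Every path either returns a string or raises; nothing falls through.
        raise ValueError(f"Unsupported length unit: {unit!r}")
    inches = Fraction(value.strip()) * INCHES_PER_UNIT[key]
    return str(round(float(inches), 2))


print(convert_to_inches("20", "CM"))   # 7.87
print(convert_to_inches("2", "feet"))  # 24.0
```

The same lookup-table shape would carry over to weights, with a pounds-per-unit table in place of the inches one.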
| else: | ||
| return value | ||
|
|
||
| def convert_to_pounds(self, value: str, unit: str) -> str: | ||
| """ | ||
| Converts a given weight measurement from different units to pounds. | ||
|
|
||
| Args: | ||
| - value: The numeric value of the measurement. | ||
| - unit: The unit of measurement (can be 'kg' for kilograms). | ||
|
|
||
| Returns: | ||
| - The value of the measurement converted to pounds. | ||
| """ | ||
| if unit == 'kg': | ||
|
Contributor
Same issues as dimensions |
||
| # 1 kg equals 2.20462 pounds | ||
| return str(round(float(value) * 2.20462, 2)) | ||
This could be a generator and just `yield` each row to avoid building out a potentially large in-memory data structure.
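The generator suggestion above might look roughly like this. It is a sketch under assumptions, not the PR's implementation: `iter_rows`, `transform_rows`, and `write_rows` are hypothetical names, and the per-row transform is passed in as a plain callable so the pipeline stays independent of the `ETLService` class.

```python
import csv
from typing import Callable, Iterable, Iterator, List

Row = List[str]


def iter_rows(filename: str) -> Iterator[Row]:
    """Yield rows one at a time instead of collecting them in a list."""
    with open(filename, "r", newline="") as file:
        yield from csv.reader(file)


def transform_rows(rows: Iterable[Row], transform: Callable[[Row], Row]) -> Iterator[Row]:
    """Lazily apply a per-row transform."""
    for row in rows:
        yield transform(row)


def write_rows(filename: str, headers: Row, rows: Iterable[Row]) -> None:
    """Write the header, then stream rows to disk as they are produced."""
    with open(filename, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(headers)
        writer.writerows(rows)  # pulls from the generator one row at a time
```

A caller would take the header with `next(rows)` from `iter_rows(...)` and pass the remaining iterator through `transform_rows` into `write_rows`, so only one row is held in memory at a time.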