HedgeApple · NicolasMuras · May 5, 2024 · john-parton · May 9, 2024 · john-parton
diff --git a/README.md b/README.md
@@ -1,19 +1,53 @@
-# Task
+<h1>ETL Service</h1>
+<h2>Overview</h2>
+<p>This Python project implements an Extract, Transform, Load (ETL) service that processes data from CSV files. It reads data from an input CSV file, applies transformations based on specified rules, and then writes the transformed data to an output CSV file.</p>
+<h2>Features</h2>
+<ul>
+    <li>Reads CSV files containing raw data.</li>
+    <li>Transforms the data according to predefined rules.</li>
+    <li>Writes the transformed data to a new CSV file.</li>
+    <li>Supports various transformations such as date format conversion, currency rounding, and unit conversion.</li>
+    <li>Code coverage at 93% you can run unit tests with pytest and coverage: <code>coverage run -m pytest .</code>.</li>
+</ul>
+<h2>How to Use</h2>
+<h3>Usage</h3>
+<ol>
+    <li><p>Prepare your input CSV file containing the raw data to be processed.</p></li>
+    <li><p>Define a columns mapping file in JSON format. This file specifies how each column in the input file should be transformed. An example columns mapping file might look like this:</p></li>
+</ol>
 
-1. Fork this project
-2. Create a python script that reads all of the rows from `homework.csv` and outputs them to a new file `formatted.csv` using the headers from `example.csv` as a guideline.  (See `Transformations` below for more details.)
-3. You may you any libraries you wish, but you must include a `requirements.txt` if you import anything outside of the standard library.
-4. There is no time limit for this assignment.
-5. You may ask any clarifying questions via email.
-6. Create a pull request against this repository with an English description of how your code works when you are complete
+```json
+{
+    "system creation date": "date",
+    "wholesale ($)": "wholesale_price",
+    "item width (cm)": "width_inches",
+    "item length (feet)": "length_inches",
+    "item weight (kg)": "weight_pounds",
+    "upc": "upc_code"
+}
+```
+<ol start="3"><li>Run the ETL service using the following command:</li></ol>
+<pre><code>python main.py input.csv output.csv columns_mapping.json
+</code></pre>
 
-## Transformations
+<p>Replace <code>input.csv</code> with the path to your input CSV file, <code>output.csv</code> with the desired path for the output CSV file, and <code>columns_mapping.json</code> with the path to your columns mapping file.</p>
 
-Follow industry standards for each data type when decided on the final format for cells.
+<ol start="4"><li>Once the process completes, you will find the transformed data written to the specified output CSV file.</li></ol>
+<h2>Example</h2>
+<p>Let's say we have an input CSV file <code>data.csv</code> containing the following data:</p>
+
+| system creation date | wholesale ($) | item width (cm) | item length (feet) | item weight (kg) | upc         |
+|----------------------|---------------|------------------|---------------------|-------------------|-------------|
+| 7/7/15               | $10.50        | 20               | 2                   | 2.1               | 123456789012|
+
+
+<p>And a columns mapping file <code>columns_mapping.json</code> as shown above.</p>
+<p>Running the ETL service with the following command:</p>
+<pre><code>python main.py data.csv transformed_data.csv columns_mapping.json
+</code></pre>
+<p>Will result in a new CSV file <code>transformed_data.csv</code> with the following data:</p>
+
+| date       | wholesale_price | width_inches | length_inches | weight_pounds | upc_code    |
+|------------|-----------------|--------------|---------------|---------------|-------------|
+| 2015-07-07 | 10.50           | 7.87         | 24            | 4.62          | 123456789012|
 
-* Dates should use ISO 8601
-* Currency should be rounded to unit of accounting. Assume USD for currency and round to cents.
-* For dimensions without units, assume inches. Convert anything which isn't in inches to inches.
-* For weights without units, assume pounds. Convert anything which isn't in pounds to pounds.
-* UPC / Gtin / EAN should be handled as strings
-* Floating point and decimal numbers should preserve as much precision as possible
diff --git a/columns_mapping.json b/columns_mapping.json
@@ -0,0 +1,148 @@
+{
+    "item number": "manufacturer_sku",
+    "upc": "ean13",
+    "new item": null,
+    "system creation date": null,
+    "min order qty": null,
+    "sales uom": null,
+    "wholesale ($)": "cost_price",
+    "map ($)": "min_price",
+    "msrp ($)": null,
+    "description": "product__title",
+    "long description": "product__description",
+    "brand": "product__brand__name",
+    "item category": "product__product_class__name",
+    "item type": null,
+    "outdoor": null,
+    "item width (inches)": "width",
+    "item depth (inches)": "depth",
+    "item height (inches)": "height",
+    "item diameter (inches)": null,
+    "item weight (pounds)": "weight",
+    "multi-piece dimension 1 (inches)": null,
+    "multi-piece dimension 2 (inches)": null,
+    "multi-piece dimension 3 (inches)": null,
+    "multi-piece dimension 4 (inches)": null,
+    "item materials": "attrib__material",
+    "primary color family": "attrib__color",
+    "item finish": "attrib__finish",
+    "item finish 1": null,
+    "item finish 2": null,
+    "item finish 3": null,
+    "primary image filename": null,
+    "url primary image": null,
+    "url alternate image 1": null,
+    "url alternate image 2": null,
+    "url alternate image 3": null,
+    "url alternate image 4": null,
+    "url alternate image 5": null,
+    "url alternate image 6": null,
+    "url alternate image 7": null,
+    "url alternate image 8": null,
+    "url room setting image 1": null,
+    "url room setting image 2": null,
+    "url room setting image 3": null,
+    "url drawing": null,
+    "url interactive 360 image": null,
+    "url animated gif": null,
+    "url product sheet": null,
+    "url instruction sheet": null,
+    "url marketing sheet 1": null,
+    "url california label (jpg)": null,
+    "url california label (pdf)": null,
+    "item style": null,
+    "item substyle": null,
+    "item substyle 2": null,
+    "item collection": null,
+    "licensed by": null,
+    "carton count": null,
+    "truck only": null,
+    "carton 1 width (inches)": "boxes__0__width",
+    "carton 1 length (inches)": "boxes__0__length",
+    "carton 1 height (inches)": "boxes__0__height",
+    "carton 1 weight (pounds)": "boxes__0__weight",
+    "carton 1 volume (cubic feet)": null,
+    "carton 2 width (inches)": "boxes__1__width",
+    "carton 2 length (inches)": "boxes__1__length",
+    "carton 2 height (inches)": "boxes__1__height",
+    "carton 2 weight (pounds)": "boxes__1__weight",
+    "carton2volumecubicfeet": null,
+    "carton 3 width (inches)": "boxes__2__width",
+    "carton 3 length (inches)": "boxes__2__length",
+    "carton 3 height (inches)": "boxes__2__height",
+    "carton 3 weight (pounds)": "boxes__2__weight",
+    "carton 3 volume (cubic feet)": null,
+    "ada compliant": null,
+    "available with eef": null,
+    "conversion kit option": null,
+    "title 24 compliant": null,
+    "safety rating": null,
+    "certified damp/wet": null,
+    "bulb 1 count": null,
+    "bulb 1 wattage": null,
+    "bulb 1 type": null,
+    "bulb 1 base": null,
+    "bulb 1 included": null,
+    "bulb 2 count": null,
+    "bulb 2 wattage": null,
+    "bulb 2 type": null,
+    "bulb 2 base": null,
+    "bulb 2 included": null,
+    "led": null,
+    "total lumens": null,
+    "color temperature": null,
+    "cri": null,
+    "voltage": null,
+    "switch type": null,
+    "dimmable": null,
+    "lamp base dimensions (inches)": null,
+    "backplate/canopy dimensions (inches)": null,
+    "extension rods (inches)": null,
+    "min overall height (inches)": null,
+    "max overall height (inches)": null,
+    "min extension (inches)": null,
+    "max extension (inches)": null,
+    "hcwo (inches)": null,
+    "shade/glass description": null,
+    "shade/glass materials": null,
+    "shade/glass finish": null,
+    "shade/glass width": null,
+    "shade/glass width at top (inches)": null,
+    "shade/glass width at bottom (inches)": null,
+    "shade/glass height (inches)": null,
+    "shade shape": null,
+    "harp/spider": null,
+    "cord color": null,
+    "cord length (inches)": null,
+    "chain length (inches)": null,
+    "chain price ($)": null,
+    "replacement glass price ($)": null,
+    "replacement crystal price ($)": null,
+    "mirror width (inches)": null,
+    "mirror height (inches)": null,
+    "drawer count": null,
+    "drawer 1 interior dimensions (inches)": null,
+    "drawer 2 interior dimensions (inches)": null,
+    "drawer 3 interior dimensions (inches)": null,
+    "furniture arm height (inches)": null,
+    "furniture seat height (inches)": null,
+    "furniture seat dimensions (inches)": null,
+    "furniture weight capacity (pounds)": null,
+    "country of origin": "product__country_of_origin__alpha_3",
+    "primary catalog": null,
+    "primary catalog page": null,
+    "related items": null,
+    "brand bio": null,
+    "helpful tips": null,
+    "selling point 1": "product__bullets__1",
+    "selling point 2": "product__bullets__2",
+    "selling point 3": "product__bullets__3",
+    "selling point 4": "product__bullets__4",
+    "selling point 5": "product__bullets__5",
+    "selling point 6": "product__bullets__6",
+    "selling point 7": null,
+    "selling point 8": null,
+    "selling point 9": null,
+    "selling point 10": null,
+    "record status": null
+}
diff --git a/etl.py b/etl.py
@@ -0,0 +1,171 @@
+import csv
+from datetime import datetime
+import json
+from typing import List, Any, Optional
+
+
+class ETLService:
+
+    def __init__(self):
+        self.columns_mapping = None
+        self.input_file_headers = None
+        self.output_file_headers = None
+
+    def run(self, input_file: str, output_file: str, columns_mapping_file: str) -> None:
+        """Performs the complete ETL process."""
+
+        input_data = self.read_csv(input_file)
+
+        input_data = self.extract_headers(input_data)
+
+        with open(columns_mapping_file, 'r') as f:
+            self.columns_mapping = json.load(f)
+
+        transformed_data = self.transform_data(input_data)
+
+        self.output_file_headers = [column for column in self.columns_mapping.values() if column is not None]
+
+        self.write_csv(output_file, transformed_data)
+
+    def extract_headers(self, input_data: List[List[Any]]):
+        """Save the headers before removing the first row and remove headers from data."""
+        self.input_file_headers = input_data[0] if input_data else []
+        if input_data:
+            input_data_without_headers = input_data[1:]
+            return input_data_without_headers
+        else:
+            return input_data
+
+    def read_csv(self, filename: str) -> List[List[Any]]:
+        """Reads a CSV file and returns the data as a list of rows."""
+        data: List[List[Any]] = []
+        try:
+            with open(filename, 'r', newline='') as file:
+                reader = csv.reader(file)
+                for row in reader:
+                    data.append(row)
+        except FileNotFoundError:
+            print(f"Error: Could not find the file '{filename}'")
+        except Exception as e:
+            print(f"Error while reading the file '{filename}': {e}")
+
+        return data
+
+    def write_csv(self, filename: str, data: List[List[Any]]) -> None:
+        """Writes the data to a CSV file with the provided headers."""
+        try:
+            with open(filename, 'w', newline='') as file:
+                writer = csv.writer(file)
+                writer.writerow(self.output_file_headers)
+                writer.writerows(data)
+            print(f"The formatted data has been written to '{filename}'")
+        except Exception as e:
+            print(f"Error writing to the file '{filename}': {e}")
+
+    def transform_data(
+            self, 
+            input_data: List[List[Any]], 
+        ) -> List[List[Any]]:
+        """Transforms the data according to the example headers."""
+        transformed_data: List[List[Any]] = []
+
+        for input_row in input_data:
+            transformed_row = []
+            for column_name in self.input_file_headers:
+
+                # Get the input column name corresponding to the output column name
+                input_column_index = list(self.columns_mapping.keys()).index(column_name)
+
+                if list(self.columns_mapping.values())[input_column_index] is None:
+                    continue
+
+                # Get the value of the input column
+                value = input_row[input_column_index]
+
+                transformed_value = self.transform_column_value(column_name, value)
+                transformed_row.append(transformed_value)
+            transformed_data.append(transformed_row)
+        return transformed_data
+
+    def transform_column_value(self, column_name: str, value: Any) -> Any:
+        """Transforms the value of a column based on its name."""
+        if column_name == 'system creation date':
+            return self.date_iso_format_transform(value)
+        elif '($)' in column_name:
+            return self.round_currency(value)
+        elif 'cm' in column_name:
+            return self.convert_to_inches(value, 'cm')
+        elif 'feet' in column_name:
+            return self.convert_to_inches(value, 'feet')
+        elif 'kg' in column_name:
+            return self.convert_to_pounds(value, 'kg')
+        elif 'upc' in column_name or 'gtin' in column_name or 'ean' in column_name:
+            return str(value)
+        else:
+            return value
+
+    def date_iso_format_transform(self, date: str) -> Optional[str]:
+        """Transforms a date from the format '7/7/15' to ISO 8601 (YYYY-MM-DD)."""
+        try:
+            original_date = datetime.strptime(date, '%m/%d/%y')
+            iso_date = original_date.strftime('%Y-%m-%d')
+            return iso_date
+        except ValueError:
+            print(f"Error: The date '{date}' does not have the expected format.")
+            return date
+
+    def round_currency(self, amount: str) -> str:
+        """Rounds the currency amount to the nearest cent."""
+        try:
+            # Remove the $ sign if present and ensure it has 2 decimal places.
+            amount = amount.replace('$', '')
+            amount = amount.replace(',', '')
+
+            # Convert the amount to float and round to the nearest cent.
+            rounded_amount = round(float(amount), 2)
+
+            # Format the amount as a string with two decimal places.
+            formatted_amount = '{:.2f}'.format(rounded_amount)
+
+            return str(formatted_amount)
+        except ValueError:
+            print(f"Error: The value '{amount}' is not a valid amount.")
+            return amount
+
+    def convert_to_inches(self, value: str, unit: str) -> str:
+        """
+        Converts a given measurement from different length units to inches.
+
+        Args:
+        - value: The numeric value of the measurement.
+        - unit: The unit of measurement (can be 'cm' for centimeters or 'feet' for feet).
+
+        Returns:
+        - The value of the measurement converted to inches.
+        """
+        if value:
+            if unit == 'cm':
+                # 1 pulgada equivale a 2.54 centímetros
+                rounded_amount = round(float(value) / 2.54, 2)
+                return str(rounded_amount)
+            elif unit == 'feet':
+                # 1 feet equal to 12 inches
+                rounded_amount = round(float(value) * 12, 2)
+                return str(rounded_amount)
+        else:
+            return value
+
+    def convert_to_pounds(self, value: str, unit: str) -> str:
+        """
+        Converts a given weight measurement from different units to pounds.
+
+        Args:
+        - value: The numeric value of the measurement.
+        - unit: The unit of measurement (can be 'kg' for kilograms).
+
+        Returns:
+        - The value of the measurement converted to pounds.
+        """
+        if unit == 'kg':
+            # 1 kg equal to 2.20462 pounds
+            return str(round(float(value) * 2.20462, 2))