64 changes: 49 additions & 15 deletions README.md
@@ -1,19 +1,53 @@
# Task
<h1>ETL Service</h1>
<h2>Overview</h2>
<p>This Python project implements an Extract, Transform, Load (ETL) service that processes data from CSV files. It reads data from an input CSV file, applies transformations based on specified rules, and then writes the transformed data to an output CSV file.</p>
<h2>Features</h2>
<ul>
<li>Reads CSV files containing raw data.</li>
<li>Transforms the data according to predefined rules.</li>
<li>Writes the transformed data to a new CSV file.</li>
<li>Supports various transformations such as date format conversion, currency rounding, and unit conversion.</li>
<li>Code coverage is at 93%. You can run the unit tests with pytest and coverage: <code>coverage run -m pytest .</code></li>
</ul>
<h2>How to Use</h2>
<h3>Usage</h3>
<ol>
<li><p>Prepare your input CSV file containing the raw data to be processed.</p></li>
<li><p>Define a columns mapping file in JSON format. This file specifies how each column in the input file should be transformed. An example columns mapping file might look like this:</p></li>
</ol>

1. Fork this project
2. Create a python script that reads all of the rows from `homework.csv` and outputs them to a new file `formatted.csv` using the headers from `example.csv` as a guideline. (See `Transformations` below for more details.)
3. You may use any libraries you wish, but you must include a `requirements.txt` if you import anything outside of the standard library.
4. There is no time limit for this assignment.
5. You may ask any clarifying questions via email.
6. Create a pull request against this repository with an English description of how your code works when you are complete.
```json
{
"system creation date": "date",
"wholesale ($)": "wholesale_price",
"item width (cm)": "width_inches",
"item length (feet)": "length_inches",
"item weight (kg)": "weight_pounds",
"upc": "upc_code"
}
```
<ol start="3"><li>Run the ETL service using the following command:</li></ol>
<pre><code>python main.py input.csv output.csv columns_mapping.json
</code></pre>

## Transformations
<p>Replace <code>input.csv</code> with the path to your input CSV file, <code>output.csv</code> with the desired path for the output CSV file, and <code>columns_mapping.json</code> with the path to your columns mapping file.</p>

Follow industry standards for each data type when deciding on the final format for cells.
<ol start="4"><li>Once the process completes, you will find the transformed data written to the specified output CSV file.</li></ol>
<h2>Example</h2>
<p>Let's say we have an input CSV file <code>data.csv</code> containing the following data:</p>

| system creation date | wholesale ($) | item width (cm) | item length (feet) | item weight (kg) | upc |
|----------------------|---------------|------------------|---------------------|-------------------|-------------|
| 7/7/15 | $10.50 | 20 | 2 | 2.1 | 123456789012|


<p>And a columns mapping file <code>columns_mapping.json</code> as shown above.</p>
<p>Running the ETL service with the following command:</p>
<pre><code>python main.py data.csv transformed_data.csv columns_mapping.json
</code></pre>
<p>Will result in a new CSV file <code>transformed_data.csv</code> with the following data:</p>

| date | wholesale_price | width_inches | length_inches | weight_pounds | upc_code |
|------------|-----------------|--------------|---------------|---------------|-------------|
| 2015-07-07 | 10.50 | 7.87 | 24.0 | 4.63 | 123456789012|

* Dates should use ISO 8601
* Currency should be rounded to unit of accounting. Assume USD for currency and round to cents.
* For dimensions without units, assume inches. Convert anything which isn't in inches to inches.
* For weights without units, assume pounds. Convert anything which isn't in pounds to pounds.
* UPC / GTIN / EAN should be handled as strings
* Floating point and decimal numbers should preserve as much precision as possible
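
The rules above can be sketched in Python with `decimal.Decimal`, which avoids binary floating-point rounding surprises. This is a minimal illustration, not the code in this PR: the function names are hypothetical, and half-up rounding for currency is an assumption.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Conversion constants (both are exact by definition).
CM_PER_INCH = Decimal("2.54")
KG_PER_POUND = Decimal("0.45359237")
INCHES_PER_FOOT = Decimal("12")


def to_iso_date(text: str) -> str:
    """Parse a US-style M/D/YY date and return ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, "%m/%d/%y").date().isoformat()


def to_cents(text: str) -> str:
    """Strip '$' and thousands separators, then round to cents (assumed half-up)."""
    amount = Decimal(text.replace("$", "").replace(",", ""))
    return str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def cm_to_inches(text: str) -> Decimal:
    """Convert centimeters to inches without intermediate rounding."""
    return Decimal(text) / CM_PER_INCH


def feet_to_inches(text: str) -> Decimal:
    """Convert feet to inches."""
    return Decimal(text) * INCHES_PER_FOOT


def kg_to_pounds(text: str) -> Decimal:
    """Convert kilograms to pounds without intermediate rounding."""
    return Decimal(text) / KG_PER_POUND
```

Keeping the converted values as `Decimal` and rounding only where a rule demands it (currency) is one way to satisfy the precision requirement; UPC-like codes would simply be passed through as strings.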
148 changes: 148 additions & 0 deletions columns_mapping.json
@@ -0,0 +1,148 @@
{
"item number": "manufacturer_sku",
"upc": "ean13",
"new item": null,
"system creation date": null,
"min order qty": null,
"sales uom": null,
"wholesale ($)": "cost_price",
"map ($)": "min_price",
"msrp ($)": null,
"description": "product__title",
"long description": "product__description",
"brand": "product__brand__name",
"item category": "product__product_class__name",
"item type": null,
"outdoor": null,
"item width (inches)": "width",
"item depth (inches)": "depth",
"item height (inches)": "height",
"item diameter (inches)": null,
"item weight (pounds)": "weight",
"multi-piece dimension 1 (inches)": null,
"multi-piece dimension 2 (inches)": null,
"multi-piece dimension 3 (inches)": null,
"multi-piece dimension 4 (inches)": null,
"item materials": "attrib__material",
"primary color family": "attrib__color",
"item finish": "attrib__finish",
"item finish 1": null,
"item finish 2": null,
"item finish 3": null,
"primary image filename": null,
"url primary image": null,
"url alternate image 1": null,
"url alternate image 2": null,
"url alternate image 3": null,
"url alternate image 4": null,
"url alternate image 5": null,
"url alternate image 6": null,
"url alternate image 7": null,
"url alternate image 8": null,
"url room setting image 1": null,
"url room setting image 2": null,
"url room setting image 3": null,
"url drawing": null,
"url interactive 360 image": null,
"url animated gif": null,
"url product sheet": null,
"url instruction sheet": null,
"url marketing sheet 1": null,
"url california label (jpg)": null,
"url california label (pdf)": null,
"item style": null,
"item substyle": null,
"item substyle 2": null,
"item collection": null,
"licensed by": null,
"carton count": null,
"truck only": null,
"carton 1 width (inches)": "boxes__0__width",
"carton 1 length (inches)": "boxes__0__length",
"carton 1 height (inches)": "boxes__0__height",
"carton 1 weight (pounds)": "boxes__0__weight",
"carton 1 volume (cubic feet)": null,
"carton 2 width (inches)": "boxes__1__width",
"carton 2 length (inches)": "boxes__1__length",
"carton 2 height (inches)": "boxes__1__height",
"carton 2 weight (pounds)": "boxes__1__weight",
"carton2volumecubicfeet": null,
"carton 3 width (inches)": "boxes__2__width",
"carton 3 length (inches)": "boxes__2__length",
"carton 3 height (inches)": "boxes__2__height",
"carton 3 weight (pounds)": "boxes__2__weight",
"carton 3 volume (cubic feet)": null,
"ada compliant": null,
"available with eef": null,
"conversion kit option": null,
"title 24 compliant": null,
"safety rating": null,
"certified damp/wet": null,
"bulb 1 count": null,
"bulb 1 wattage": null,
"bulb 1 type": null,
"bulb 1 base": null,
"bulb 1 included": null,
"bulb 2 count": null,
"bulb 2 wattage": null,
"bulb 2 type": null,
"bulb 2 base": null,
"bulb 2 included": null,
"led": null,
"total lumens": null,
"color temperature": null,
"cri": null,
"voltage": null,
"switch type": null,
"dimmable": null,
"lamp base dimensions (inches)": null,
"backplate/canopy dimensions (inches)": null,
"extension rods (inches)": null,
"min overall height (inches)": null,
"max overall height (inches)": null,
"min extension (inches)": null,
"max extension (inches)": null,
"hcwo (inches)": null,
"shade/glass description": null,
"shade/glass materials": null,
"shade/glass finish": null,
"shade/glass width": null,
"shade/glass width at top (inches)": null,
"shade/glass width at bottom (inches)": null,
"shade/glass height (inches)": null,
"shade shape": null,
"harp/spider": null,
"cord color": null,
"cord length (inches)": null,
"chain length (inches)": null,
"chain price ($)": null,
"replacement glass price ($)": null,
"replacement crystal price ($)": null,
"mirror width (inches)": null,
"mirror height (inches)": null,
"drawer count": null,
"drawer 1 interior dimensions (inches)": null,
"drawer 2 interior dimensions (inches)": null,
"drawer 3 interior dimensions (inches)": null,
"furniture arm height (inches)": null,
"furniture seat height (inches)": null,
"furniture seat dimensions (inches)": null,
"furniture weight capacity (pounds)": null,
"country of origin": "product__country_of_origin__alpha_3",
"primary catalog": null,
"primary catalog page": null,
"related items": null,
"brand bio": null,
"helpful tips": null,
"selling point 1": "product__bullets__1",
"selling point 2": "product__bullets__2",
"selling point 3": "product__bullets__3",
"selling point 4": "product__bullets__4",
"selling point 5": "product__bullets__5",
"selling point 6": "product__bullets__6",
"selling point 7": null,
"selling point 8": null,
"selling point 9": null,
"selling point 10": null,
"record status": null
}
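
As a rough illustration of how a mapping like this is consumed: keys are input headers, string values are output headers, and `null` means the column is dropped. The snippet below is a sketch with a shortened, inline mapping; the variable names are illustrative only.

```python
import json

# A shortened stand-in for columns_mapping.json (three entries only).
mapping_text = '{"item number": "manufacturer_sku", "new item": null, "upc": "ean13"}'
columns_mapping = json.loads(mapping_text)

# Input columns mapped to null are dropped; the rest are renamed.
kept = {src: dst for src, dst in columns_mapping.items() if dst is not None}
output_headers = list(kept.values())
```

Since `json.loads` returns a regular `dict`, the key order of the file is preserved, so the output headers come out in the same order as the mapping entries.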
171 changes: 171 additions & 0 deletions etl.py
@@ -0,0 +1,171 @@
import csv
from datetime import datetime
import json
from typing import List, Any, Optional


class ETLService:

    def __init__(self):
        self.columns_mapping = None
        self.input_file_headers = None
        self.output_file_headers = None

    def run(self, input_file: str, output_file: str, columns_mapping_file: str) -> None:
        """Performs the complete ETL process."""

        input_data = self.read_csv(input_file)

        input_data = self.extract_headers(input_data)

        with open(columns_mapping_file, 'r') as f:
            self.columns_mapping = json.load(f)

        transformed_data = self.transform_data(input_data)

        self.output_file_headers = [column for column in self.columns_mapping.values() if column is not None]

        self.write_csv(output_file, transformed_data)

    def extract_headers(self, input_data: List[List[Any]]):
        """Save the headers before removing the first row and remove headers from data."""
        self.input_file_headers = input_data[0] if input_data else []
        if input_data:
            input_data_without_headers = input_data[1:]
            return input_data_without_headers
        else:
            return input_data

    def read_csv(self, filename: str) -> List[List[Any]]:
        """Reads a CSV file and returns the data as a list of rows."""
        data: List[List[Any]] = []
        try:
            with open(filename, 'r', newline='') as file:
                reader = csv.reader(file)
                for row in reader:
                    data.append(row)
Review comment (Contributor):

This could be a generator that simply yields each row, which avoids building a potentially large in-memory data structure.

        except FileNotFoundError:
            print(f"Error: Could not find the file '{filename}'")
        except Exception as e:
            print(f"Error while reading the file '{filename}': {e}")

        return data
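
One way to act on the generator suggestion for `read_csv` is shown below. This is a minimal sketch, not the PR's code: `iter_csv_rows` is a hypothetical name, and error handling is left to the caller for brevity.

```python
import csv
from typing import Iterator, List


def iter_csv_rows(filename: str) -> Iterator[List[str]]:
    """Yield CSV rows one at a time instead of collecting them in a list."""
    with open(filename, "r", newline="") as file:
        # The file stays open only while the caller is iterating.
        yield from csv.reader(file)
```

Because rows are produced lazily, the header can be taken with `next(rows, None)` before the remaining rows are streamed on to the transformation step.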

    def write_csv(self, filename: str, data: List[List[Any]]) -> None:
Review comment (Contributor):

You can use Iterable[list[object]] here: the method only iterates over data, and every list is also an Iterable, so the broader type accepts lists and generators alike.

        """Writes the data to a CSV file with the provided headers."""
        try:
            with open(filename, 'w', newline='') as file:
                writer = csv.writer(file)
                writer.writerow(self.output_file_headers)
                writer.writerows(data)
            print(f"The formatted data has been written to '{filename}'")
        except Exception as e:
            print(f"Error writing to the file '{filename}': {e}")
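
The type-hint suggestion for `write_csv` can be illustrated as follows. This is a sketch under assumptions: `write_rows` is a hypothetical free function, and it takes the headers as a parameter instead of reading them from the instance.

```python
import csv
from typing import Iterable, Sequence


def write_rows(filename: str, headers: Sequence[str], rows: Iterable[Sequence[object]]) -> None:
    """Write a header row followed by any iterable of rows (list or generator)."""
    with open(filename, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(headers)
        # writerows only needs to iterate, so a generator works as well as a list.
        writer.writerows(rows)
```

Accepting `Iterable` rather than `List` lets the writer consume rows as they are produced, which pairs naturally with a generator-based reader or transformer.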

    def transform_data(
        self,
        input_data: List[List[Any]],
    ) -> List[List[Any]]:
        """Transforms the data according to the example headers."""
        transformed_data: List[List[Any]] = []

        for input_row in input_data:
            transformed_row = []
            for column_name in self.input_file_headers:

                # Look up this column's position in the columns mapping
                input_column_index = list(self.columns_mapping.keys()).index(column_name)
Review comment (Contributor):

Hoist this list() call out of the loop and assign it to a temporary variable, so the list is not rebuilt on every iteration of a tight loop.


                if list(self.columns_mapping.values())[input_column_index] is None:
Review comment (Contributor):

Same here. The list() result should be reused rather than rebuilt in the hot path.

                    continue

                # Get the value of the input column
                value = input_row[input_column_index]

                transformed_value = self.transform_column_value(column_name, value)
                transformed_row.append(transformed_value)
            transformed_data.append(transformed_row)
Review comment (Contributor):

Again, this could be a generator using yield to avoid building a large intermediate data structure.

        return transformed_data
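
The remarks about `list()` calls in the hot path, together with the generator suggestion, could be addressed along these lines. This is an illustrative sketch rather than the PR's implementation: `transform_rows` and the `transform` callback are hypothetical names. Note one deliberate difference, stated as an assumption: the sketch locates each value by its position in the input headers, whereas the method above uses the column's position in the mapping.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def transform_rows(
    headers: List[str],
    rows: Iterable[List[str]],
    columns_mapping: Dict[str, Optional[str]],
    transform: Callable[[str, str], str],
) -> Iterator[List[str]]:
    """Yield transformed rows, keeping only columns mapped to a non-null name."""
    # Computed once, outside the row loop: (position in row, input column name).
    kept = [
        (index, name)
        for index, name in enumerate(headers)
        if columns_mapping.get(name) is not None
    ]
    for row in rows:
        yield [transform(name, row[index]) for index, name in kept]
```

With the kept columns resolved up front, the per-row work is a single list comprehension, and no intermediate list of all rows is held in memory.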

    def transform_column_value(self, column_name: str, value: Any) -> Any:
        """Transforms the value of a column based on its name."""
        if column_name == 'system creation date':
            return self.date_iso_format_transform(value)
        elif '($)' in column_name:
            return self.round_currency(value)
        elif 'cm' in column_name:
            return self.convert_to_inches(value, 'cm')
        elif 'feet' in column_name:
            return self.convert_to_inches(value, 'feet')
        elif 'kg' in column_name:
            return self.convert_to_pounds(value, 'kg')
        elif 'upc' in column_name or 'gtin' in column_name or 'ean' in column_name:
            return str(value)
        else:
            return value

    def date_iso_format_transform(self, date: str) -> Optional[str]:
        """Transforms a date from the format '7/7/15' to ISO 8601 (YYYY-MM-DD)."""
        try:
            original_date = datetime.strptime(date, '%m/%d/%y')
            iso_date = original_date.strftime('%Y-%m-%d')
            return iso_date
        except ValueError:
            print(f"Error: The date '{date}' does not have the expected format.")
            return date

    def round_currency(self, amount: str) -> str:
        """Rounds the currency amount to the nearest cent."""
        try:
            # Remove the $ sign and thousands separators if present.
            amount = amount.replace('$', '')
            amount = amount.replace(',', '')

            # Convert the amount to float and round to the nearest cent.
            rounded_amount = round(float(amount), 2)

            # Format the amount as a string with two decimal places.
            formatted_amount = '{:.2f}'.format(rounded_amount)

            return str(formatted_amount)
        except ValueError:
            print(f"Error: The value '{amount}' is not a valid amount.")
            return amount

    def convert_to_inches(self, value: str, unit: str) -> str:
        """
        Converts a given measurement from different length units to inches.

        Args:
        - value: The numeric value of the measurement.
        - unit: The unit of measurement (can be 'cm' for centimeters or 'feet' for feet).

        Returns:
        - The value of the measurement converted to inches.
        """
        if value:
            if unit == 'cm':
Review comment (Contributor):

The comparison is case-sensitive.

If unit == 'CM' (uppercase), this returns None without an error.

                # 1 inch equals 2.54 centimeters
                rounded_amount = round(float(value) / 2.54, 2)
                return str(rounded_amount)
            elif unit == 'feet':
                # 1 foot equals 12 inches
                rounded_amount = round(float(value) * 12, 2)
                return str(rounded_amount)
Review comment (Contributor) on lines +151 to +154:

There's a dangling condition: if unit != "cm" and unit != "feet", this returns None without an error.

        else:
            return value
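
Both remarks on `convert_to_inches` (case sensitivity and the silent `None`) can be handled with a lookup table and an explicit error. The sketch below is one possible shape, not the author's fix: `INCHES_PER_UNIT` and `to_inches` are hypothetical names, and raising `ValueError` for an unknown unit is an assumed policy.

```python
# Inches per one unit of each supported length unit (keys are lowercase).
INCHES_PER_UNIT = {"cm": 1 / 2.54, "feet": 12.0, "inches": 1.0}


def to_inches(value: str, unit: str) -> str:
    """Convert a length to inches, rounded to two decimals as in the code above."""
    if not value:
        return value  # leave empty cells untouched, as the original does
    key = unit.strip().lower()  # makes 'CM' and 'cm' equivalent
    if key not in INCHES_PER_UNIT:
        # Fail loudly instead of silently returning None.
        raise ValueError(f"Unsupported length unit: {unit!r}")
    return str(round(float(value) * INCHES_PER_UNIT[key], 2))
```

The same table-driven pattern would apply to the weight conversion, with a second dictionary keyed by weight units.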

    def convert_to_pounds(self, value: str, unit: str) -> str:
        """
        Converts a given weight measurement from different units to pounds.

        Args:
        - value: The numeric value of the measurement.
        - unit: The unit of measurement (can be 'kg' for kilograms).

        Returns:
        - The value of the measurement converted to pounds.
        """
        if unit == 'kg':
Review comment (Contributor):

Same issues as in the dimension conversion.

            # 1 kg equals 2.20462 pounds
            return str(round(float(value) * 2.20462, 2))