13 changes: 13 additions & 0 deletions Pipfile
@@ -0,0 +1,13 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pandas = "*"
pytest = "*"

[dev-packages]

[requires]
python_version = "3.10"
177 changes: 177 additions & 0 deletions Pipfile.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

55 changes: 54 additions & 1 deletion README.md
@@ -16,4 +16,57 @@ Follow industry standards for each data type when decided on the final format fo
* For dimensions without units, assume inches. Convert anything which isn't in inches to inches.
* For weights without units, assume pounds. Convert anything which isn't in pounds to pounds.
* UPC / Gtin / EAN should be handled as strings
* Floating point and decimal numbers should preserve as much precision as possible

# Solution
## Idea
The underlying idea behind this implementation is that every part of the code can be reused to process other files. This idea arose from the
various formats in which information can appear, given the multiple clients and sources of information.
* The mapper can be replaced entirely to map new files, or adjusted slightly through its methods.

* The `DataTransformer` class handles the transformations over the dataframe and keeps track of the change registered for each column.

* Each transformation is a function that takes a single argument, because the required change cannot be predicted in advance; this way any column can be transformed with a custom converter.

## Usage/Examples

The whole project uses the pandas library, and all data is read as strings.
A `requirements.txt` file was added to manage the dependencies (pandas pulls in several dependencies of its own).

The workflow, as seen in the project, is:
1. Create an instance of `DataTransformer`. It handles its parameters with default values.
2. Register the transformations on that instance.
3. Clean the dataframe and prepare it for transformation with the `filter_columns` method.
4. Apply the transformations with `apply_transformations`.
5. Dump the final dataframe to a .csv file.
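
The register-then-apply pattern behind these steps can be sketched without pandas. This is only an illustration under stated assumptions: `MiniTransformer`, `strip_currency` and the `wanted` argument are hypothetical names, rows are plain dictionaries instead of a dataframe, and the CSV dump step is left out. It is not the project's `DataTransformer`.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Converter = Callable[[Any], Any]


class MiniTransformer:
    """Hypothetical stand-in for `DataTransformer`; not the project's class."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.transformations: Dict[str, Converter] = {}

    def register_transformation(self, column: str, converter: Converter) -> None:
        # Remember which converter to run on which column.
        self.transformations[column] = converter

    def filter_columns(self, wanted: List[str]) -> None:
        # Keep only the wanted columns that are present in each row.
        self.rows = [{k: r[k] for k in wanted if k in r} for r in self.rows]

    def apply_transformations(self) -> None:
        # Run each registered converter over its column, value by value.
        for row in self.rows:
            for column, converter in self.transformations.items():
                if column in row:
                    row[column] = converter(row[column])


def strip_currency(value: str) -> str:
    # Assumed converter: drop "$" and thousands separators, keep the digits as text.
    return str(value).replace("$", "").replace(",", "")


transformer = MiniTransformer([{"cost_price": "$1,250.50", "sku": "A1", "note": "x"}])
transformer.register_transformation("cost_price", strip_currency)
transformer.filter_columns(["sku", "cost_price"])
transformer.apply_transformations()
print(transformer.rows)
```

Keeping the converters in a dictionary keyed by column name means a new column only needs one extra `register_transformation` call; the apply loop itself does not change.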

### Mapper class

The mapper class maps the required columns to the columns given in the input file. It can be swapped for a completely different set of columns to adapt to new files.
It implements methods to create, delete and update the fields in the mapper.
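
A minimal sketch of what such a mapper could look like follows. The class name `ColumnMapper`, the method names, the `rename_map` helper and the sample headers are all assumptions made for illustration; the sketch also assumes the mapping goes from required column to given header, which may differ from the project's actual mapper.

```python
from typing import Dict, Optional


class ColumnMapper:
    """Hypothetical mapper: required column name -> header given in the file."""

    def __init__(self, mapping: Optional[Dict[str, str]] = None) -> None:
        self._mapping: Dict[str, str] = dict(mapping or {})

    def create(self, required: str, given: str) -> None:
        # Refuse to overwrite an existing field silently.
        if required in self._mapping:
            raise KeyError(f"{required!r} is already mapped")
        self._mapping[required] = given

    def update(self, required: str, given: str) -> None:
        # Only existing fields can be updated.
        if required not in self._mapping:
            raise KeyError(f"{required!r} is not mapped yet")
        self._mapping[required] = given

    def delete(self, required: str) -> None:
        del self._mapping[required]

    def rename_map(self) -> Dict[str, str]:
        # Inverted view {given header: required column}, handy for renaming columns.
        return {given: required for required, given in self._mapping.items()}


mapper = ColumnMapper({"length": "length_in", "ean13": "upc"})
mapper.update("length", "length_cm")   # the file now reports centimeters
mapper.create("weight", "weight_lb")
mapper.delete("ean13")
print(mapper.rename_map())
```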

### Transformations

Given the particulars of this project, the following converters were created as functions:
* ean13_converter
* price_converter
* bool_true_converter
* bool_false_converter

No converters were created for dimensions, weights or datetimes. If the files change and a new column needs to be transformed (from centimeters, meters, kilometers, kilograms, tons, etc.), a converter can be created and passed to the `DataTransformer.register_transformation` method.

A self-discovery method to transform the data is not implemented, because the conversion is determined by the unit in the header, which is not clear until the file has been analyzed in detail.

For example, if a new column appears with the header `length_cm`, which can be mapped to `length`:
1. Update the mapper, using `update` to change the key:value pair.
2. Create a new converter with the proper unit conversion.
3. Register the converter for `length`.
4. Execute the script.
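
As an illustration of the converter step, a centimeters-to-inches converter might look like the sketch below. The name `cm_to_inches_converter` is hypothetical, and using `Decimal` is one possible way to honor the "preserve as much precision as possible" requirement, not necessarily the approach the project would take.

```python
from decimal import Decimal

# One inch is exactly 2.54 cm by definition.
CM_PER_INCH = Decimal("2.54")


def cm_to_inches_converter(value: str) -> Decimal:
    """Hypothetical converter: parse a centimeter value and return inches."""
    # Parsing from the string avoids binary floating-point rounding on input.
    centimeters = Decimal(str(value).strip())
    # Division is carried out at the decimal context precision (28 digits by default).
    return centimeters / CM_PER_INCH


# Registration would then follow the signature used in main.py, e.g.:
# etl_transformer.register_transformation('<target column>', cm_to_inches_converter)
print(cm_to_inches_converter("100"))
```

Because every value is read as a string, the converter parses the text directly into a `Decimal` instead of going through `float` first.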

## Running Tests

To run the tests, run the following command:

```bash
pytest
```
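
For reference, a test that `pytest` would pick up could look like the sketch below. The test names are hypothetical, and `ean13_converter` is copied inline from `main.py` only so the example runs on its own; a real test module would import it instead.

```python
def ean13_converter(value):
    # Copy of the converter from main.py so this sketch is self-contained.
    value = str(value)
    if len(value) == 12:
        return '0{}-{}-{}'.format(value[:2], value[2:11], value[11:])
    return value


def test_twelve_character_code_is_formatted():
    # 12 characters: a leading zero and two hyphens are added.
    assert ean13_converter("123456789012") == "012-345678901-2"


def test_other_lengths_pass_through_unchanged():
    # Any other length is returned as-is.
    assert ean13_converter("12345") == "12345"
```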
Empty file added __init__.py
Empty file.
78 changes: 78 additions & 0 deletions main.py
@@ -0,0 +1,78 @@
from utils import DataTransformer

def ean13_converter(value: str):
    """
    Converts a 12-character code to a zero-prefixed, hyphenated EAN-13 string.

    Args:
        value (str): The string to convert.

    Returns:
        str: The code formatted as '0XX-XXXXXXXXX-X', or the input unchanged.

    Notes:
        - If the input is not exactly 12 characters long, it is returned unchanged.
    """
    value = str(value)
    final_text = value
    if len(value) == 12:
        final_text = '0{}-{}-{}'.format(value[:2], value[2:11], value[11:])

    return final_text

def price_converter(value: str):
    """
    Converts a price string to a float rounded to two decimal places.

    Args:
        value (str): The price string to convert.

    Returns:
        float: The converted price value rounded to two decimal places.

    Notes:
        - This method assumes the input string is in the format "$X,XXX.XX"; only the "$" and "," characters are removed.
    """
    value = str(value).replace('$', '').replace(',', '')
    return round(float(value), 2)

def bool_true_converter(value: str):
    """
    Returns always True no matter the input.

    Args:
        value (str): The input string to convert. It will be ignored

    Returns:
        bool: Always returns True.

    Notes:
        - This method is a simple way to return True on each apply inside `DataTransformer.apply_transformations`.
    """
    return True

def bool_false_converter(value: str):
    """
    Returns always False no matter the input.

    Args:
        value (str): The input string to convert. It will be ignored

    Returns:
        bool: Always returns False.

    Notes:
        - This method is a simple way to return False on each apply inside `DataTransformer.apply_transformations`.
    """
    return False

if __name__ == '__main__':
    etl_transformer = DataTransformer()
    etl_transformer.register_transformation('ean13', ean13_converter)
    etl_transformer.register_transformation('cost_price', price_converter)
    etl_transformer.register_transformation('min_price', price_converter)
    etl_transformer.register_transformation('prop_65', bool_true_converter)
    etl_transformer.register_transformation('made_to_order', bool_false_converter)
    etl_transformer.filter_columns()
    etl_transformer.apply_transformations()
    etl_transformer.dump_frame(filename='formatted.csv')
13 changes: 13 additions & 0 deletions requirements.txt
@@ -0,0 +1,13 @@
-i https://pypi.org/simple
exceptiongroup==1.2.1; python_version < '3.11'
iniconfig==2.0.0; python_version >= '3.7'
numpy==1.26.4; python_version < '3.11'
packaging==24.0; python_version >= '3.7'
pandas==2.2.2
pluggy==1.5.0; python_version >= '3.8'
pytest==8.2.0
python-dateutil==2.9.0.post0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
pytz==2024.1
six==1.16.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
tomli==2.0.1; python_version < '3.11'
tzdata==2024.1; python_version >= '2'
Empty file added tests/__init__.py
Empty file.