Joelcio ETL Homework#39

Open
Joelciomatias wants to merge 4 commits into HedgeApple:master from Joelciomatias:master
@Joelciomatias Joelciomatias commented May 5, 2024

Solution

The solution starts with the run_etl.py file, which can receive the input csv filepath.

The run() method creates an instance of the ETL class and executes its process method,
which is the project's main method.
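
A minimal sketch of what such an entry point could look like, assuming the filepath is an optional positional argument. The `ETL` body, the `parse_args` helper, and the `input.csv` default are hypothetical and stand in for the real code in this PR:

```python
import argparse


class ETL:
    """Stand-in for the ETL class in elt_processor.py (hypothetical body)."""

    def __init__(self, input_path: str):
        self.input_path = input_path

    def process(self) -> str:
        # The real method reads, transforms, validates and writes the data;
        # this stub only reports which file it would process.
        return f"processed {self.input_path}"


def parse_args(argv=None):
    # "input.csv" is an assumed default, used when no filepath is given.
    parser = argparse.ArgumentParser(description="Run the ETL homework")
    parser.add_argument("input_csv", nargs="?", default="input.csv",
                        help="path to the input csv file")
    return parser.parse_args(argv)


def run(argv=None):
    # Build the ETL instance and execute its main method.
    args = parse_args(argv)
    return ETL(args.input_csv).process()
```

With this shape, `run(["data/homework.csv"])` processes the given file, while `run([])` falls back to the assumed default path.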

The etl_homework module was created, which contains the following files:

  • mapping.py: holds the column mapping, built from inferences and logic,
    along with the list of columns for the output csv.
  • schema.py: defines the ETL schema with the pandera lib.
  • csv_reader.py: reads the input csv with the pandas lib.
  • elt_processor.py: contains the ETL class and its process method.
    This method performs the following steps:
    • Only the mapped columns of the input csv are read into a pandas dataframe;
    • Columns are renamed according to the mapping;
    • The transformations are applied:
      • columns with yes/no text are converted to boolean:
        attrib__outdoor_safe, attrib__kit, attrib__bulb_included
      • the country_converter lib produces the country acronyms in the product__country_of_origin__alpha_3 column
      • data types are converted in the columns attrib__number_bulbs, product__multipack_quantity,
        ean13, cost_price
    • The schema is validated according to the mapping.
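
The read, rename and transform steps can be sketched roughly as follows with pandas. The input header names, the `COLUMN_MAPPING` dict and the `transform` function are assumptions made for illustration, not the mapping from this PR. The country conversion and the pandera validation are left out so the sketch depends only on pandas:

```python
import io

import pandas as pd

# Hypothetical mapping from input headers to output column names;
# the real mapping lives in mapping.py.
COLUMN_MAPPING = {
    "outdoor safe": "attrib__outdoor_safe",
    "bulb included": "attrib__bulb_included",
    "number bulbs": "attrib__number_bulbs",
    "wholesale ($)": "cost_price",
}
BOOLEAN_COLUMNS = ["attrib__outdoor_safe", "attrib__bulb_included"]


def transform(csv_source) -> pd.DataFrame:
    # Read only the mapped columns, as text, then rename them.
    df = pd.read_csv(csv_source, usecols=list(COLUMN_MAPPING), dtype=str)
    df = df.rename(columns=COLUMN_MAPPING)

    # yes/no text -> boolean (values other than yes/no become missing).
    for column in BOOLEAN_COLUMNS:
        normalized = df[column].str.strip().str.lower()
        df[column] = normalized.map({"yes": True, "no": False})

    # Data type conversions: nullable integer and float.
    bulbs = pd.to_numeric(df["attrib__number_bulbs"], errors="coerce")
    df["attrib__number_bulbs"] = bulbs.astype("Int64")
    prices = df["cost_price"].str.replace(r"[$,]", "", regex=True)
    df["cost_price"] = pd.to_numeric(prices, errors="coerce")
    return df


# Made-up sample rows, only to exercise the function.
sample = io.StringIO(
    "item number,outdoor safe,bulb included,number bulbs,wholesale ($)\n"
    "A1,Yes,no,2,$10.50\n"
    "B2,No,Yes,,\n"
)
result = transform(sample)
```

Reading every column as text first keeps values such as EAN codes from losing leading zeros before the explicit conversions run.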

Finally, the output.csv file is generated with the other columns necessary for the desired output.
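
One way to produce such an output is to reindex the dataframe against the list of output columns, so that columns without mapped data are added as empty ones. This is a sketch under that assumption; `OUTPUT_COLUMNS` here is a hypothetical subset and `write_output` is an illustrative helper, not a function from this PR:

```python
import io

import pandas as pd

# Hypothetical subset of the output column list kept in mapping.py.
OUTPUT_COLUMNS = [
    "ean13",
    "cost_price",
    "attrib__kit",
    "product__country_of_origin__alpha_3",
]


def write_output(df: pd.DataFrame, destination) -> None:
    # reindex enforces the target column order and adds any column
    # missing from the dataframe, filled with empty values.
    df.reindex(columns=OUTPUT_COLUMNS).to_csv(destination, index=False)


# Made-up row, written to an in-memory buffer instead of output.csv.
transformed = pd.DataFrame({"cost_price": [10.5], "ean13": ["0123456789012"]})
buffer = io.StringIO()
write_output(transformed, buffer)
```

Passing a file path such as `"output.csv"` as `destination` writes the same content to disk.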

The run instructions are at the end of the README.
