ETL Assignment Submission #23

Open
gunjanmimo wants to merge 3 commits into HedgeApple:master from gunjanmimo:master

@gunjanmimo

This code defines an ETLPipeline class. It applies data transformations to a main DataFrame, using a reference DataFrame for the target columns, and outputs a formatted DataFrame.

Here's a breakdown of the code:

  1. The code imports the pandas library, which is used for working with DataFrames.

  2. The ETLPipeline class is defined. Its constructor takes two arguments, main_dataframe and reference_dataframe, both of which are pandas DataFrames.

  3. The constructor initializes the class by assigning the main_dataframe and reference_dataframe arguments to the corresponding instance variables.

  4. The get_column_name_mapping method returns a dictionary that maps column names in the main DataFrame to column names in the reference DataFrame. The transform method uses it to rename columns.

  5. The get_country_code method returns a dictionary that maps country names to their respective country codes. It is used for transforming the "country of origin" column in the DataFrame.

  6. The upc_to_ean13_transform method is a helper function that takes a UPC value as input and converts it to the EAN13 format. It returns the converted value.

  7. The price_value_transform method is a helper function that takes a price value as input and converts it to a formatted price value. It removes any currency symbols and commas, rounds the value to 2 decimal places, and adds a dollar sign. It returns the formatted price value.

  8. The prop65_transform method takes a row from the DataFrame as input and checks whether the "url california label (jpg)" and "url california label (pdf)" columns are NaN. If either column is NaN, it returns False; otherwise, it returns True. This method is used to determine whether a row indicates Prop65 compliance.

  9. The transform method is the main transformation function. It creates a new DataFrame called flattened_dataframe with the same columns as the reference DataFrame. It iterates over the column name mapping and assigns values from the main DataFrame to the corresponding columns in the flattened_dataframe.

  10. The method performs several transformations on specific columns:

  • It applies the upc_to_ean13_transform function to convert the "ean13" column values to the EAN13 format.

  • It applies the price_value_transform function to format the "cost_price" and "min_price" columns.

  • It applies the prop65_transform function to determine the "prop_65" column values.

  • It replaces country names in the "product__country_of_origin__alpha_3" column with their respective country codes using the get_country_code dictionary.

  • It converts "yes" and "no" values in the "attrib__bulb_included" and "attrib__outdoor_safe" columns to True and False, respectively.

  • It checks if the string "ul" is present in the "attrib__ul_certified" column and converts it to True if present; otherwise, it converts it to False.

  11. The __call__ method is implemented to make the class callable. It calls the transform method to obtain the formatted DataFrame and exports it to a CSV file named "formatted.csv". Finally, it returns True to indicate a successful ETL pipeline execution.

  12. The __name__ == "__main__" block is used to execute the code when the script is run directly. It reads two CSV files, "homework.csv" and "example.csv", into DataFrames.
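
For reviewers who want a quick feel for the helper transformations described in items 6 to 10, here is a minimal sketch. It is not the code from this PR: the function names (upc_to_ean13, format_price, has_prop65_labels, apply_column_transforms), the zero-padding rule for EAN13, the case-insensitive matching, and the two-entry country dictionary are all assumptions made for illustration. Only the column names are taken from the description above.

```python
import pandas as pd


def upc_to_ean13(upc) -> str:
    """Assumed rule: keep the digits and left-pad with zeros to 13 characters.

    Expects a string or an integer, not a missing value.
    """
    digits = "".join(ch for ch in str(upc) if ch.isdigit())
    return digits.zfill(13)


def format_price(value) -> str:
    """Strip "$" and commas, round to 2 decimal places, prefix a dollar sign."""
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    return f"${float(cleaned):.2f}"


def has_prop65_labels(row: pd.Series) -> bool:
    """Return False if either California label URL is missing, else True."""
    jpg = row["url california label (jpg)"]
    pdf = row["url california label (pdf)"]
    return not (pd.isna(jpg) or pd.isna(pdf))


def apply_column_transforms(df: pd.DataFrame) -> pd.DataFrame:
    """Column-level steps from item 10, applied to a copy of df."""
    out = df.copy()

    # Hypothetical subset of the country-name to alpha-3 mapping.
    country_codes = {"China": "CHN", "India": "IND"}
    col = "product__country_of_origin__alpha_3"
    out[col] = out[col].replace(country_codes)

    # "yes"/"no" become True/False (case-insensitive here, an assumption).
    yes_no = {"yes": True, "no": False}
    for name in ("attrib__bulb_included", "attrib__outdoor_safe"):
        out[name] = out[name].str.lower().map(yes_no)

    # True when the text contains "ul"; missing values count as False.
    ul = "attrib__ul_certified"
    out[ul] = out[ul].str.lower().str.contains("ul", na=False)
    return out


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "product__country_of_origin__alpha_3": ["China", "India"],
            "attrib__bulb_included": ["yes", "no"],
            "attrib__outdoor_safe": ["No", "Yes"],
            "attrib__ul_certified": ["UL Listed", "None"],
        }
    )
    print(apply_column_transforms(sample))
    print(upc_to_ean13("012345678905"), format_price("$1,234.5"))
```

Note that a plain substring check for "ul" also matches unrelated words that happen to contain those letters, so a stricter pattern may be worth considering if the source column holds free text.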
