This code defines a class called ETLPipeline that represents an ETL pipeline. It performs data transformation operations on a main DataFrame using a reference DataFrame and outputs a formatted DataFrame. Here's a breakdown of the code:
The code imports the Pandas library, which is used for working with DataFrames.
The ETLPipeline class is defined. Its constructor takes two arguments, main_dataframe and reference_dataframe, both pandas DataFrame instances, and assigns them to the corresponding instance variables.
The get_column_name_mapping method returns a dictionary that maps column names in the main DataFrame to column names in the reference DataFrame. It is used to rename columns during the transformation.
The get_country_code method returns a dictionary that maps country names to their respective country codes. It is used to transform the "country of origin" column in the DataFrame.
The upc_to_ean13_transform method is a helper function that takes a UPC value as input, converts it to the EAN13 format, and returns the converted value.
The price_value_transform method is a helper function that takes a price value as input and returns it as a formatted price. It removes any currency symbols and commas, rounds the value to 2 decimal places, and adds a dollar sign.
The prop65_transform method takes a row from the DataFrame as input and checks whether the "url california label (jpg)" and "url california label (pdf)" columns contain NaN values. If either column is NaN, it returns False; otherwise, it returns True. This method is used to determine whether a row indicates Prop65 compliance.
The transform method is the main transformation function. It creates a new DataFrame called flattened_dataframe with the same columns as the reference DataFrame, then iterates over the column name mapping and assigns values from the main DataFrame to the corresponding columns in flattened_dataframe. The method performs several transformations on specific columns:
It applies the upc_to_ean13_transform function to convert the "ean13" column values to the EAN13 format.
It applies the price_value_transform function to format the "cost_price" and "min_price" columns.
It applies the prop65_transform function to determine the "prop_65" column values.
It replaces country names in the "product__country_of_origin__alpha_3" column with their respective country codes using the get_country_code dictionary.
It converts "yes" and "no" values in the "attrib__bulb_included" and "attrib__outdoor_safe" columns to True and False, respectively.
It checks if the string "ul" is present in the "attrib__ul_certified" column and converts it to True if present; otherwise, it converts it to False.
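The column transformations listed above could look roughly like the sketch below. The function names follow the description, but their bodies are assumptions, since the original implementation is not shown here: the EAN13 conversion is assumed to be left-padding with zeros to 13 digits, the "ul" check is assumed to be case-insensitive, and the sample rows and the two-entry country dictionary are hypothetical. For brevity, the source and target columns live in a single DataFrame.

```python
import pandas as pd


def upc_to_ean13_transform(upc) -> str:
    # Assumption: a UPC becomes EAN13 by left-padding with zeros to 13 digits.
    if isinstance(upc, float):
        upc = int(upc)  # pandas may have parsed the column as floats
    return str(upc).strip().zfill(13)


def price_value_transform(value) -> str:
    # Drop the dollar sign and thousands separators, then format to 2 decimals.
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    return f"${float(cleaned):.2f}"


def prop65_transform(row: pd.Series) -> bool:
    # False if either label URL is missing, True otherwise.
    label_columns = ["url california label (jpg)", "url california label (pdf)"]
    return not row[label_columns].isna().any()


# Hypothetical sample data, for illustration only.
frame = pd.DataFrame({
    "ean13": ["036000291452", "012345678905"],
    "cost_price": ["$1,234.5", "19.999"],
    "url california label (jpg)": ["http://example.com/a.jpg", None],
    "url california label (pdf)": ["http://example.com/a.pdf", "http://example.com/b.pdf"],
    "product__country_of_origin__alpha_3": ["China", "India"],
    "attrib__bulb_included": ["yes", "no"],
    "attrib__ul_certified": ["UL listed", "none"],
})
country_codes = {"China": "CHN", "India": "IND"}  # hypothetical subset

frame["ean13"] = frame["ean13"].apply(upc_to_ean13_transform)
frame["cost_price"] = frame["cost_price"].apply(price_value_transform)
frame["prop_65"] = frame.apply(prop65_transform, axis=1)
frame["product__country_of_origin__alpha_3"] = (
    frame["product__country_of_origin__alpha_3"].replace(country_codes)
)
frame["attrib__bulb_included"] = frame["attrib__bulb_included"].map(
    {"yes": True, "no": False}
)
# Assumption: the match ignores case; missing values count as not certified.
frame["attrib__ul_certified"] = frame["attrib__ul_certified"].str.contains(
    "ul", case=False, na=False
)
```

Note that map turns any value other than "yes" or "no" into NaN, so unexpected inputs surface as missing values rather than being silently coerced.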
The __call__ method is implemented to make the class callable. It calls the transform method to obtain the formatted DataFrame and exports it to a CSV file named "formatted.csv". Finally, it returns True to indicate a successful ETL pipeline execution.
The __name__ == "__main__" block is used to execute the code when the script is run directly. It reads two CSV files, "homework.csv" and "example.csv", into DataFrames
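The overall flow described above, from the column mapping through the CSV export, might be sketched as follows. This is a minimal sketch, not the original code: the mapping entry, the sample DataFrames, and the output_path parameter are hypothetical (the description says the file name is fixed as "formatted.csv"), and the per-column transformations are omitted.

```python
import os
import tempfile

import pandas as pd


class ETLPipeline:
    def __init__(self, main_dataframe: pd.DataFrame, reference_dataframe: pd.DataFrame):
        self.main_dataframe = main_dataframe
        self.reference_dataframe = reference_dataframe

    def get_column_name_mapping(self) -> dict:
        # Hypothetical entry: main-DataFrame name -> reference-DataFrame name.
        return {"item number": "manufacturer_sku"}

    def transform(self) -> pd.DataFrame:
        # Start from an empty frame that has the reference columns and
        # one row per row of the main DataFrame.
        flattened_dataframe = pd.DataFrame(
            index=self.main_dataframe.index,
            columns=self.reference_dataframe.columns,
        )
        # Copy each mapped column across under its new name.
        for main_name, reference_name in self.get_column_name_mapping().items():
            flattened_dataframe[reference_name] = self.main_dataframe[main_name]
        return flattened_dataframe

    def __call__(self, output_path: str = "formatted.csv") -> bool:
        # output_path is an assumption added so the sketch can write elsewhere.
        self.transform().to_csv(output_path, index=False)
        return True


# Hypothetical in-memory stand-ins for the two CSV files.
main = pd.DataFrame({"item number": ["A1", "B2"], "unused": [1, 2]})
reference = pd.DataFrame(columns=["manufacturer_sku", "ean13"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "formatted.csv")
    ok = ETLPipeline(main, reference)(path)
    written = pd.read_csv(path)
```

Because the class defines __call__, an instance can be invoked like a function, which is what the last line of the with block does.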