Script created to format homework.csv by Infandous · Pull Request #46 · HedgeApple/etl_homework

Infandous · 2024-05-19T22:12:18Z

The script defines a class called Processor. An instance of it, named entity, is created to format the homework.csv file. The program requires one command-line argument: the name of the file to process (homework.csv in this case). The script is therefore run with "python3 processor.py homework.csv". When the program runs, the Processor object is created with the command-line argument supplied as a parameter. Next, the driver() method is called, and finally the save_dataframe() method is called with the output file name "formatted" supplied as a parameter.

When a Processor object is created, it is supplied with the name of a CSV file to open. If opening the file results in an error, the program displays a message to the user and terminates.
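As a rough illustration of that behaviour, here is a minimal sketch of loading the file and exiting with a message on failure. The function name load_csv and the exact set of exceptions caught are assumptions made for this example, not necessarily what processor.py does.

```python
import sys

import pandas as pd


def load_csv(path):
    """Return the CSV at `path` as a DataFrame, or exit with a message.

    Hypothetical helper: the exceptions handled here are an assumption.
    """
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError, pd.errors.ParserError) as err:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Could not open '{path}': {err}")
```

Calling sys.exit with a string keeps the message and the non-zero exit status in one place, which is convenient for a small command-line script.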

The driver() method is the main method. It first transforms the data into the format described in the README, and then converts the column names into the format of example.csv (that is, words are separated by _, and pricing column names have '$' removed and end with the word 'price').
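The column renaming rule can be sketched roughly as follows. The function name format_column_name, the lowercasing step, and the sample column names are assumptions for illustration only; the actual names come from homework.csv and example.csv.

```python
import re


def format_column_name(name):
    """Join words with '_' and turn '$' columns into '..._price' columns.

    Hypothetical helper; lowercasing is an assumption about example.csv.
    """
    is_price = "$" in name
    # Keep only alphanumeric words; this also drops '$' and parentheses.
    words = re.findall(r"[A-Za-z0-9]+", name.lower())
    if is_price and (not words or words[-1] != "price"):
        words.append("price")
    return "_".join(words)


# Hypothetical column names, for illustration only.
print(format_column_name("wholesale ($)"))  # wholesale_price
print(format_column_name("item number"))    # item_number
```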

The pandas library allows .apply(function_name) to perform a series of changes on a specific column, so it is used for every column that is changed.
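For readers unfamiliar with the pattern, a minimal sketch of applying one function to one column looks like this. The column name, sample values, and the transformation itself are made up for the example.

```python
import pandas as pd


def strip_and_upper(val):
    """Hypothetical per-value transformation: trim whitespace, uppercase."""
    return str(val).strip().upper()


# Made-up data standing in for a single column of homework.csv.
df = pd.DataFrame({"country": [" usa", "china "]})

# Series.apply calls the function once per value and returns a new Series.
df["country"] = df["country"].apply(strip_and_upper)
```

Because each transformation is an ordinary function of a single value, it can be tested on its own without building a DataFrame.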

One item of note is that given there is no EAN column, there is no code for making transformations to such a column.

For every series of transformations, a separate method is created; however, certain similarities exist among them all. First and foremost is the use of the re library to apply regular expressions that break values apart into individual tokens. The regular expressions differ depending on the transformation at hand, but some columns use the same expression, so a method called __extract_tokens(self,val) was created to split values using the following expression: "\d+.\d+|\d+|[a-zA-Z]+". This splits a value into tokens that are each a floating-point number OR an integer OR a word composed of letters of the alphabet.
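A standalone sketch of this tokenization is shown below. It assumes re.findall is used to collect the matches, which may differ from the actual method body. One thing worth noting: the "." in the first alternative is unescaped, so it matches any character, not only a decimal point. The second pattern, with "\." escaped, is a suggested variant and not part of the submitted script.

```python
import re

# Expression quoted in the description above.
TOKEN_PATTERN = r"\d+.\d+|\d+|[a-zA-Z]+"
# Suggested variant (an assumption): escape the dot so it only matches '.'.
STRICT_TOKEN_PATTERN = r"\d+\.\d+|\d+|[a-zA-Z]+"


def extract_tokens(val, pattern=TOKEN_PATTERN):
    """Return the float, integer, and word tokens found in `val`."""
    return re.findall(pattern, str(val))


print(extract_tokens("12.5 lbs"))                     # ['12.5', 'lbs']
print(extract_tokens("10x20"))                        # ['10x20']
print(extract_tokens("10x20", STRICT_TOKEN_PATTERN))  # ['10', 'x', '20']
```

For values such as "12.5 lbs" both patterns agree; they differ only when two runs of digits are separated by a single non-dot character.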

For number formatting, "{:,.2f}".format(float(num)) was used to round to two decimal places and add thousands separators where necessary.
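A short example of what that format specification produces; the wrapper function format_number is a hypothetical name used only here.

```python
def format_number(num):
    """Format `num` with thousands separators and two decimal places."""
    return "{:,.2f}".format(float(num))


print(format_number("1234.5"))     # 1,234.50
print(format_number(1234567.891))  # 1,234,567.89
print(format_number(7))            # 7.00
```

Note that the result is a string, so any numeric comparison or arithmetic has to happen before this step.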

Aside from the aforementioned __extract_tokens method, there are several other helper methods: __is_num, which checks whether a supplied value is a number; __extract_nums, which works like __extract_tokens but keeps only the values that are floats or ints (it is unused in the current version); print(self), which prints the column names and snippets of their values; and print_col(self,col), which prints the name and value snippet for a specific column.
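The two numeric helpers could look roughly like the sketch below. These are standalone functions written for illustration, assuming a float() conversion is the numeric check; the actual private methods may be implemented differently.

```python
import re

# Expression quoted earlier in this description.
TOKEN_PATTERN = r"\d+.\d+|\d+|[a-zA-Z]+"


def is_num(val):
    """Return True if `val` can be converted to a float.

    Assumption: float() is the test, so strings like 'nan' also pass.
    """
    try:
        float(val)
        return True
    except (TypeError, ValueError):
        return False


def extract_nums(val):
    """Like token extraction, but keep only the numeric tokens."""
    return [tok for tok in re.findall(TOKEN_PATTERN, str(val)) if is_num(tok)]


print(is_num("12.5"))               # True
print(is_num("abc"))                # False
print(extract_nums("12.5 x 3 in"))  # ['12.5', '3']
```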

If there are any changes you would like made, please let me know and I can have them completed pronto!

Infandous added 2 commits May 19, 2024 16:36

script created

758c05f

added contact info

c429fcb

Infandous wants to merge 2 commits into HedgeApple:master from Infandous:master