Script created to format homework.csv #46
Open
Infandous wants to merge 2 commits into HedgeApple:master
The script defines a class called Processor. An instance of it, named entity, is created to format the homework.csv file. The program requires one command-line argument: the name of the file to process (here, homework.csv), so the script is run with "python3 processor.py homework.csv". When the program runs, the Processor object is created with the command-line argument supplied as a parameter. Next the driver() method is called, and finally the save_dataframe() method is called with the output file name "formatted" supplied as a parameter.
When a Processor object is created, it is given the name of a CSV file to open. If opening the file fails, the program displays a message to the user and terminates.
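As a rough illustration of the "open the file or exit with a message" behaviour, here is a minimal sketch. It is not the code in this PR: load_rows is a hypothetical name, and the standard-library csv module stands in for pandas so the example has no third-party dependency.

```python
import csv
import sys


def load_rows(path):
    """Read a CSV file into a list of dicts, or exit with a message on failure.

    Hypothetical helper: the PR itself loads the file with pandas inside
    the Processor constructor.
    """
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    except OSError as err:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Could not open {path!r}: {err}")
```

Called as load_rows(sys.argv[1]), this would fail fast with a readable message when the supplied file name is wrong.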
The driver() method is the main method. It first transforms the data into the format described in the README, and then converts the column names into the format used in example.csv: words are separated by _, and pricing column names have '$' removed and end with the word 'price'.
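The column renaming rule can be sketched as follows. This is an illustration under assumptions, not the PR's code: normalize_column_name is a hypothetical name, the sample headers are made up, and lower-casing the result is an assumption about what example.csv looks like.

```python
import re


def normalize_column_name(name):
    """Join a header's words with '_' and mark '$' columns as prices."""
    is_price = "$" in name
    # Keep only runs of letters and digits; this drops '$', parentheses and spaces.
    words = re.findall(r"[A-Za-z0-9]+", name.lower())
    # Append 'price' to pricing columns unless the name already ends with it.
    if is_price and (not words or words[-1] != "price"):
        words.append("price")
    return "_".join(words)


print(normalize_column_name("item name"))      # item_name
print(normalize_column_name("wholesale ($)"))  # wholesale_price
```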
The pandas library lets one call .apply(function_name) to perform a series of changes on a specific column, so it is used for every column we change.
One item of note: because there is no EAN column, there is no code for transforming such a column.
A different method is created for every series of transformations, but they share certain similarities. First and foremost is the use of the re library, whose regular expressions break values apart into individual tokens. The regular expressions differ depending on the transformation at hand, but some columns use the same expression, so a method called __extract_tokens(self,val) was created to split values using the following expression: "\d+.\d+|\d+|[a-zA-Z]+". This splits a value into tokens that are each a floating-point number OR an integer OR a word composed of letters of the alphabet.
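A small standalone sketch of this tokenizing step is shown below. It assumes re.findall is how the tokens are collected and uses a module-level function rather than the private method. Note one deliberate difference: in the expression as quoted, the dot is unescaped and would match any character between two digit runs (so "10x20" would come back as a single token), whereas the sketch escapes it, which is an assumption about the intent.

```python
import re

# Float OR integer OR alphabetic word; the dot is escaped here (assumption).
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\d+|[a-zA-Z]+")


def extract_tokens(val):
    """Return the numeric and alphabetic tokens found in a value, in order."""
    return TOKEN_PATTERN.findall(str(val))


print(extract_tokens("12.5in x 3 cm"))  # ['12.5', 'in', 'x', '3', 'cm']
print(extract_tokens("10x20"))          # ['10', 'x', '20']
```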
For number formatting, "{:,.2f}".format(float(num)) was used to round to two decimal places and add commas where necessary.
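For reference, this is how that format string behaves on a couple of sample inputs. The wrapper name format_number and the sample values are made up for illustration.

```python
def format_number(num):
    """Format a numeric value with thousands separators and two decimals."""
    return "{:,.2f}".format(float(num))


print(format_number("1234567.891"))  # 1,234,567.89
print(format_number(5))              # 5.00
```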
Aside from the aforementioned __extract_tokens method, there are several other helper methods: __is_num, which checks whether a supplied value is a number; __extract_nums, which works like __extract_tokens but keeps only the values that are floats or ints (unused in the current version); print(self), which prints the column names and snippets of their values; and print_col(self,col), which prints the name and a value snippet for a specific column.
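One plausible shape for the two numeric helpers is sketched below as module-level functions. The float()-based check and the escaped dot in the pattern are assumptions; the methods in the PR may be implemented differently.

```python
import re

# Same token pattern as described above, with the dot escaped (assumption).
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\d+|[a-zA-Z]+")


def is_num(val):
    """Return True if val can be parsed as a float, False otherwise."""
    try:
        float(val)
    except (TypeError, ValueError):
        return False
    return True


def extract_nums(val):
    """Tokenize val and keep only the tokens that parse as numbers."""
    return [tok for tok in TOKEN_PATTERN.findall(str(val)) if is_num(tok)]


print(extract_nums("2 x 3.5 kg"))  # ['2', '3.5']
```

A float()-based check also accepts strings such as "nan" and "inf", which is worth keeping in mind if such values can appear in the data.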
If there are any changes you would like made, please let me know and I can have them completed pronto!