29 code review #80
base: 29_before_ML
Conversation
This script takes in the human labels, the wavs, and an output dir. It parses the human labels, identifies the segments in each audio file where a human labeled a burrowing owl call, and creates a 3-second segment wav file for each. It still needs to do the same for background noise with no bird detections, so we can build a balanced dataset with two classes: burrowing owl call and no burrowing owl call. It could also be made to create segments for specific call types rather than any burrowing owl detection.
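For illustration, a minimal sketch of that slicing step, assuming soundfile is used for I/O and that the label gives a start time in seconds (the helper name and arguments here are hypothetical, not the repo's actual API):

```python
import soundfile as sf

def cut_segment(wav_path, start_s, out_path, duration_s=3.0):
    """Write a fixed-length segment starting at start_s seconds to out_path."""
    info = sf.info(wav_path)
    start = int(start_s * info.samplerate)
    stop = min(start + int(duration_s * info.samplerate), info.frames)
    audio, sr = sf.read(wav_path, start=start, stop=stop)
    sf.write(out_path, audio, sr)
```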
Now it will create a folder of 3-second bird detection segments, as well as a folder of no-detection segments cut from the same wavs.
This file is based off segment_2017_labeled_data.py, but the difference is that this one creates segments that are exactly the length of the labeled duration, with optional padding on the front and end. The labeled durations seem extremely tight, making the sound files very short and sometimes cutting off milliseconds at the beginning and end of a call, so the padding helps. Overall this makes a tighter dataset.
Removed print statements; there is no longer a need to use librosa to get the sample rate and audio file duration.
Starting the dataset-creation script over to apply everything we've learned so far. Two scripts: one to run, and one holding all the functions it needs. create_dataset.py would be run and would call the functions in walk_buow_labels.py (which could be named better). It creates/appends to one folder and one csv: all the human-labeled detections at varied lengths, plus an equal number of fixed-duration noise samples from the same audio files. Now that I'm writing this, it might be better to aggregate the noise samples after all the detections have been aggregated, so we can also randomly pull noise from wavs that had NO detections. But it's a good start.
Intermediary commit to save progress. This works to walk the 2017 folders and ensure there are no duplicate labels; it was able to catch an error in the 2017 labels where some of the same wav files had been added into another folder. Success so far.
walk_buow_labels didn't make much sense as a name, and I also wanted to separate out the functions that filter the labels, since those are the ones that vary by label file, whereas ideally the other functions can stay the same across data and label files. This will make it easier to add parsers for other label-file formats in the future.
I didn't like having my old scripts mixed in with the new and improved functions. params and segment can both parse the 2017 and 2018 labeled data: segment creates 3-second segments by retroactively labeling 3-second windows, while params creates a segment equal to (or slightly longer than, depending on your params) the duration of the label overall.
The main function has been debugged, but it needs another close look to ensure it's not making any grave mistakes. It doesn't crash while running through all the audio, and I believe I found a way to ensure that, in the event of a duplicate file name, the label gets dropped from the all-labels file. Next is to begin the segment-making part.
It currently creates the segments based on the filtered labels for that audio file and adds the lines of interest to a dataframe with the uuid of the segment, the path to the segment file, the call type label, the file path to the original file, and the duration of the sample; it then returns that dataframe. Need to run it over all the audio to make sure any remaining errors get caught, write the functions that combine the data from all the wavs into one csv, and create the function that makes the noise samples. But creating uuid-named segments and their associated metadata currently seems successful.
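A minimal sketch of how one of those metadata rows could be built (the column names are illustrative, not a confirmed schema):

```python
import uuid
import pandas as pd

def make_metadata_row(segment_path, call_type, original_wav, duration_s):
    """Build one metadata row for a newly written segment."""
    return {
        "uuid": str(uuid.uuid4()),
        "segment_path": segment_path,
        "call_type": call_type,
        "original_wav": original_wav,
        "duration": duration_s,
    }

rows = [make_metadata_row("out/1234.wav", "buow", "2017/example.wav", 3.0)]
metadata = pd.DataFrame(rows)
```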
Able to create an array with an entry for each second of the audio file, and a mask over the spots that have an owl detection plus a buffer, leaving the unmasked portions free rein for randomly generating noise segments. Now I need to add something that randomly peruses the unmasked portion, generating as many 3 s samples as there are burrowing owl detections, and builds a dataframe with the same values as the create_segments function. Most of that code can be borrowed from create_segments, except for the random selection and checking.
This is the messiest function I've ever written, but it currently works and I want to save progress. I need to ensure it's choosing viable start times and that there isn't some weird rounding logic that doesn't make sense; we're dealing with some values that I convert to ints, but given the buffer added between positive segments I think this is fine. It currently makes the noise segments and adds them to the other df, making an equal number to the detection segments in this group.
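For illustration, a minimal numpy sketch of the per-second mask plus random start-time selection described above (the buffer length, function name, and the shape of `detections` are assumptions, not the repo's actual code):

```python
import numpy as np

def pick_noise_starts(total_s, detections, n_needed, seg_len=3, buffer_s=2, seed=0):
    """detections: list of (start_s, end_s) owl labels. Returns noise start seconds."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(int(total_s), dtype=bool)           # one slot per second of audio
    for start_s, end_s in detections:
        lo = max(0, int(start_s) - buffer_s)
        hi = min(len(mask), int(np.ceil(end_s)) + buffer_s)
        mask[lo:hi] = True                               # masked = detection + buffer
    # a start is viable only if the whole seg_len window stays unmasked and in bounds
    viable = [t for t in range(len(mask) - seg_len + 1) if not mask[t:t + seg_len].any()]
    if not viable:
        return []
    picks = rng.choice(viable, size=min(n_needed, len(viable)), replace=False)
    return sorted(int(t) for t in picks)
```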
Added the beginnings of a script to create 5 folds of cleanly distributed data, stratified across all classes while not splitting up clips from the same wav file, to minimize the potential for leakage. Also was able to get all the data together into one dataframe, but I still need to check the relative start time column because something seems off about some of the decimals. I also need to handle a NoneType error in the noise segment creation, since some empty dataframes are still being passed to it, so the earlier steps need to check before passing None. I also need to see how it will handle adding the 2017 and 2018 data together, since they need to be run separately; making the folded dataset is a separate step because of this. I suppose it could iterate over a list and do them both at the same time, but that seems a little excessive, and it's the only way I could do the stratify step otherwise. I think keeping it separate will be more useful, since I feel we'd want to do this in different ways and with other data.
One of the subset label file names didn't match what the all-labels file said it was, even though it was the same file. So when the script comes across that, it now searches for the actual file name of that sub-label file.
Fixed an out-of-bounds error when creating a random no_buow segment, and fixed an error print statement that referenced something that didn't exist when the error happened. Also dealt with a None value being passed by a previous function by assigning it a value that ends the chain of passing None.
Not really working code, but beginning to think about how to keep the groups together while properly creating stratified datasets. Found a tutorial that deals with the same problem.
Using someone else's optimization-based group-splitting implementation. Running strat_k_folds.py uses these functions when given the metadata file that's generated after detection and noise segment creation. It currently doesn't append the fold number to the metadata file, but it takes the class distributions of each "group" (in our case, groups are at the wav-file level) and calculates the optimal split of the groups so that each fold's class distribution is as close as possible to the original class distribution. It outputs an array with the fold number for each group, as well as some other metrics like the actual class distribution for each fold, because it won't be perfect given that each group is unbalanced.
This can now add the fold-number assignment to the metadata, but it does create a new csv; it could probably append to the old one, but this is what it does. There is something funky happening with the indexing, but that is a known issue with the creation of metadata.csv, which would be dealt with in the create_dataset pipeline and is not the fault of this script. It should also create a results file with the metrics of the actual class distributions per fold, so that can go in the readme for the dataset.
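As a point of comparison (the repo uses a borrowed optimization-based splitter, not this), scikit-learn's StratifiedGroupKFold does roughly the same job and shows how a fold column could be attached to the metadata; the column names below are assumptions:

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

meta = pd.read_csv("metadata.csv")        # assumed columns: call_type, original_wav
meta["fold"] = -1

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(
    sgkf.split(meta, y=meta["call_type"], groups=meta["original_wav"])
):
    meta.loc[meta.index[test_idx], "fold"] = fold

# index=False avoids writing an extra index column into the new csv
meta.to_csv("metadata_with_folds.csv", index=False)
```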
There was a weird indexing issue that caused an extra row of indexes in the final metadata file; that has been fixed, and some helpful print statements were added during the debug stage to help catch issues.
The 2017 data, when passed through, was missing the chick begging detections because the class list was being parsed wrong. There were also some wav files that made it through the checks as legit but were not actually labeled, so that had to be handled.
There was one labeled wav file that was literally 1 s long, and 2/3 of that second was a labeled detection. It was legit and not from another sound file. I decided to keep this detection and generate a segment from it, but it cannot generate a complementary no_buow segment. We can change this, but it seemed better to just keep the extra detection, unless it starts happening all the time (it won't).
Fixed the chick begging underline error and added some clearer variable names.
Based on review feedback
Tbh I have no idea if this is the best way to do this. Might look it up later but decided to just do a quick fix
pip does not like underscores...
19 Make and prepare perch embeddings
Old code had weird naming conventions caused by iterating during the development process. Current code generalizes the base_preprocessor to allow for other preprocessors and model_inputs.
Originally thought pylintrc controlled docstrings, but it's actually flake8. Reverted the changes in pylintrc that attempted to control docstring style.
CometML has weird behavior with strings, where it displays docstrings; going to email their support line and see what that is about. The current solution is to turn the docstring into a comment.
* just making this into a branch: copies the buowset extractor into a panda one and updates the init
* adding current scripts for esc50 instead of pandas: was working on this on my laptop, but need to commit to be able to train on a GPU or just a more powerful CPU, because my process keeps getting killed, seemingly by lack of RAM
* lowered steps and increased epochs: the model was never evaluating because the default buowset steps (100) were being used, when esc50 is much smaller and only had 38 steps. So it appeared to be training without crashing, but it would never get evaluated. It also didn't seem to learn fast enough with only 2 epochs (more classes than buowset?), so epochs were increased to 10, which allowed the model to actually start learning
* fix some pylint, still snake-case stuff to fix
* fixed pylint and removed git tracking from train.py: train.py is always meant to just be copied and modified by each user, so it didn't need to be tracked for esc50. The only differences were the data extractor and the steps and epochs. Pylint done on the extractor
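For illustration, a hedged sketch of the kind of Hugging Face TrainingArguments change described in the bullets above, so evaluation keys off epochs rather than a fixed step count that a small dataset like esc50 never reaches (the values and output_dir are placeholders, not the repo's actual config):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="runs/esc50",        # placeholder path
    num_train_epochs=10,            # 2 epochs learned too slowly on esc50
    eval_strategy="epoch",          # called evaluation_strategy in older transformers
    save_strategy="epoch",
    per_device_train_batch_size=16,
)
```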
40 Add storage for comet-ml-panels
34 Added Dev deps (Style Guidelines)
…eline 29 model training pipeline
Overall I think a lot of this code is very nice. It should be very helpful when working with the burrowing owl dataset and hopefully more datasets. I think the READMEs could do with some more work explicitly tying things together: detailing how the different parts of the script interact, and clarifying which files/functions are dataset specific and which are more general. There are some minor magic numbers, like one thousand, across the code in a number of places. If we are hoping to use this training script for many different models across many different datasets, it might make sense to use a framework like PyTorch Lightning, which could allow more freedom if it's ever needed; the current script looks completely sufficient for training any given model on a single dataset. Using PyTorch Lightning might be a waste of effort, though, if the Hugging Face Trainer and the written extension really do handle the overwhelming majority of uses. If you consider all of the datasets we could want to use, like BirdSet and custom ones, and can't think of a situation where the given trainer would not work but PyTorch Lightning would, we're probably OK. I don't know how the given code supports things like learning rate scheduling and other modern training tricks.
I think it would be a very good idea to add dataset-specific testing somewhere, to make sure that the extractor written for a dataset actually works. I also think some unit and integration tests for the datasets would be an extremely good idea, as dataset problems can be very sneaky and hard to resolve.
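On the learning-rate-scheduling question above, TrainingArguments does expose the common knobs, so a hedged sketch of what that could look like in train.py (the values are placeholders):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="runs/example",     # placeholder
    learning_rate=5e-4,
    lr_scheduler_type="cosine",    # e.g. linear, cosine, constant_with_warmup
    warmup_ratio=0.1,              # warm up over the first 10% of training steps
    weight_decay=0.01,
)
```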
    labels = pd.read_csv(labels)
    use_2017 = None
    # iterate through each individual original wav
    if "2017" in labels['DATE'].iloc[0]:
it's a little unclear what this is checking for and what the downstream side effect is
If we were iterating over all of the wav files, this would need to be inside of the loop, with my current understanding. But since it is outside of the loop, I'm guessing the 2017 and 2018 datasets are processed separately and then glued together somewhere else.
    output_dir,
    class_list)
    # create same number of noise segments from the same wav file randomly
    all_buow_rows = create_noise_segments(wav,
Since the dataset is so strongly labelled, with every five-second interval having either a positive or negative label, shouldn't we extract all negative labels and leave it as a downstream decision to balance the classes? Even if every batch is balanced, there is some benefit to having a larger sample pool of negative examples to draw from.
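A minimal pandas sketch of what that downstream balancing could look like if every negative segment is kept in the metadata (the `call_type` column and `no_buow` value are assumptions):

```python
import pandas as pd

meta = pd.read_csv("metadata.csv")
positives = meta[meta["call_type"] != "no_buow"]
negatives = meta[meta["call_type"] == "no_buow"]

# keep every negative on disk; subsample a fresh balanced set per experiment
balanced = pd.concat(
    [positives, negatives.sample(n=len(positives), random_state=0)]
).sample(frac=1, random_state=0)   # shuffle the balanced subset
```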
    PARSER.add_argument('-class_list', type=str,
                        help='Path to txt file of list of labeled classes')
    ARGS = PARSER.parse_args()
    main(ARGS.labels, ARGS.wav_dir, ARGS.output_dir, ARGS.class_list)
I don't know if this is the correct place for it, but since this file already has so much specific knowledge about the burrowing owl dataset, it might also be a good idea to include some sanity checks and basic tests here as well. If not here, it might be a good idea to put that somewhere nearby
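For illustration, a few of the cheap sanity checks that could live here, assuming a pandas labels dataframe with the DATE column seen above plus hypothetical START and END columns:

```python
def sanity_check_labels(labels):
    """Cheap assertions that catch common label-file problems early."""
    assert not labels.empty, "label file parsed to an empty dataframe"
    for col in ("DATE", "START", "END"):
        assert col in labels.columns, f"missing expected column: {col}"
    assert (labels["START"] <= labels["END"]).all(), "a label ends before it starts"
    dupes = labels.duplicated().sum()
    assert dupes == 0, f"{dupes} duplicate label rows found"
```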
    return wavs_file_paths

    def create_segments(wav, filtered_labels, out_path, class_list):
is this function dataset specific? What are the assumptions placed on the input?
        label file- will ignore everything that is misspelled or
        unknown labels.

        Returns:
What does this metadata look like? Could be obvious and answered elsewhere and I'm missing something
The most challenging issue with machine learning is the dataset. This training repo intends to make it easy to modularize parts of the training pipeline, and integrate them together, ideally regardless of the dataset.

The pipeline works in 5 parts:
- Extractors: Extractors take in raw data and reformat it into `AudioDatasets`, apache-arrow data structures implemented via HuggingFace with common columns between any dataset. Every label is one_hot_encoded and treated as multilabel regardless of the problem. Audio filepaths are cast into [Audio columns](https://huggingface.co/docs/datasets/v3.6.0/en/package_reference/main_classes#datasets.Audio). Extractors are *unique for each dataset* but *uniform in the AudioDataset*.
One-hot encoded versus multi-label: is this intentional, or is multi-class the correct description?
Where is the versus here? One-hot encoding applies to both multi-label and multi-class ([1, 1, 0] vs [1, 0, 0]).
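For illustration, a minimal sketch of the extractor output convention being discussed, using HuggingFace `datasets` (the label names and path are placeholders; `Dataset`, `Audio`, and `cast_column` are the real library API):

```python
from datasets import Dataset, Audio

class_names = ["buow", "chick_begging", "no_buow"]      # placeholder label set

def one_hot(label):
    """Multilabel-style one-hot vector, even when only one class is present."""
    return [1.0 if name == label else 0.0 for name in class_names]

ds = Dataset.from_dict({
    "audio": ["segments/example.wav"],                  # placeholder path
    "labels": [one_hot("buow")],
})
ds = ds.cast_column("audio", Audio(sampling_rate=32_000))
```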
    )

    # COMMON OPTIONAL ARGS
    training_args.num_train_epochs = 2
It's great to have these, but they should probably be put somewhere else, maybe as default values?
Maybe we debated it.
Train.py, in my mind, is intended to be a more flexible file. A user edits it, and that becomes the file that outlines an experiment (note this file gets saved to comet-ml every run).
    @@ -0,0 +1,163 @@
    """Trains a Multiclass Model with Pytorch and Huggingface.
What is the reason for this being separated from the other file train.py?
See my explanation above for the intention of train.py
Basically, a different experiment, and an example to show how train.py can be used as an experiment config.
Perhaps the meta-commentary here is that this intent is not well documented.
    phrase_one = "The column `"
    phrase_two = "` is missing from dataset split `"
    phrase_three = "`. Required by system"
    state = (
Not a real problem at all, but this f-string could be made a lot more pythonic.
I wish the linter complained about it
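For what it's worth, a sketch of the more pythonic f-string being suggested, with hypothetical `column` and `split_name` variables standing in for whatever the surrounding code reports:

```python
column = "labels"        # hypothetical values for illustration
split_name = "train"

state = (
    f"The column `{column}` is missing from dataset split "
    f"`{split_name}`. Required by system"
)
```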
        TimmInputs, duration=3
    )

    preprocessor = MelModelInputPreprocessor(
needs better variable name
Intended for jay to do code review over. DO NOT ACTUALLY MERGE THIS!