
Conversation

@Sean1572
Collaborator

Intended for jay to do code review over. DO NOT ACTUALLY MERGE THIS!

kgarwoodsdzwa and others added 30 commits March 12, 2025 14:45
this script takes in the human labels, the wavs, and an output
dir, parses through the human labels, identifies the segments in
the audio file where a human labeled a burrowing owl call, and
creates a 3 second segment wav file for each. Still needs to do
the same for creating samples of the background noise with no
bird detections, to build a balanced dataset with two classes:
burrowing owl call and no burrowing owl call. Can also be made to
create segments for specific call types rather than any
burrowing owl detection
now it will create a folder of 3 second bird detection segments,
as well as a folder of segments from the same wavs with no bird
detections
this file is based on the segment_2017_labeled_data.py file,
but the difference is that this one creates segments that are
exactly the length of the labeled duration, with optional
padding on the front and end. The labeled durations seem
extremely tight, making the sound files very short and slightly
imperfect, sometimes cutting off milliseconds at the beginning
and end of a call. This makes a tighter dataset
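For illustration, a minimal sketch of the padded-segment idea described
above (the function name, label arguments, padding default, and the use of
soundfile are assumptions, not the script's actual interface):

```python
# Hypothetical sketch: cut a clip spanning the labeled duration plus
# optional padding, clamped to the bounds of the file.
import soundfile as sf

def extract_padded_segment(wav_path, start_s, end_s, out_path, pad_s=0.1):
    audio, sr = sf.read(wav_path)
    start = max(0, int((start_s - pad_s) * sr))
    end = min(len(audio), int((end_s + pad_s) * sr))
    sf.write(out_path, audio[start:end], sr)
```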
removed print statements and there is no longer a need to
use librosa to get the sample rate and audio file duration
starting over on the dataset creation script to apply everything
we've learned so far. Two scripts: one to run, and one holding all
the functions needed to run. create_dataset.py would be run and
would call the functions in walk_buow_labels.py (could be named
better). It would create/append to 1 folder and 1 csv all the
human labeled detections at varied lengths, and an equal number of
fixed-duration noise samples from the same audio files. Now that I'm
writing this, it might be better for the noise samples to be aggregated
after all the detections have been aggregated, so we can also randomly
get noise samples from wavs that had NO detections. But
it's a good start.
intermediate commit to save progress; this works to walk the
2017 folders and ensure there are no duplicate labels. Was able to
catch an error in the 2017 labels where some of the same wav files
had been added into another folder. Success so far
the name walk_buow_labels didn't make much sense, and I also wanted
to separate out the functions for filtering the labels, as those are
the functions that will vary based on the label file, whereas
ideally the other functions will be able to stay the same across
data and label files. This will make it easier in the future for us
to add other parsers for different formats of label files
I didn't like having my old scripts mixed in with the new and improved
functions. Both params and segment can parse the 2017 and 2018 labeled
data: segment creates 3 second segments by retroactively labeling
3 second windows, and params creates a segment whose length is equal
to, or slightly longer than (depending on your params), the duration
of the label overall
the main function has been debugged, but needs another close look
through to ensure that it's not making any grave mistakes. It doesn't
crash while running through all the audio, and I believe I found
a way to ensure that, in the event of a duplicate file name, the
entry in the all-labels file is dropped from the labels. Next is
to begin making the segment part.
it currently creates the segments based on the filtered labels
for that audio file and adds the rows of interest to a dataframe
with the uuid of the clip, the path to the clip, the call type
label, the file path to the original file, and the duration of
the sample, then returns that dataframe. Need to check it with
all the audio and ensure that any remaining errors get caught,
and write the functions that put the data from all the wavs
together in 1 csv. Also need to create the function that makes
the noise samples. But currently creating uuid segments and the
associated metadata seems successful.
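As a rough sketch, the per-clip metadata rows described above might look
something like the following (the exact column names are assumptions
based on this description, not the script's real schema):

```python
# Hypothetical helper: build one metadata row per written segment.
import uuid
import pandas as pd

def segment_row(segment_path, original_wav, call_type, duration_s):
    return {
        "uuid": str(uuid.uuid4()),      # id that doubles as the clip file name
        "segment_path": segment_path,   # path to the written clip
        "call_type": call_type,         # label taken from the filtered label rows
        "original_file": original_wav,  # wav the clip was cut from
        "duration": duration_s,         # length of the sample in seconds
    }

# rows collected per wav, then concatenated into one dataframe
rows = [segment_row("clips/abc.wav", "2017/site1.wav", "buow", 3.0)]
metadata = pd.DataFrame(rows)
```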
able to create an array with one entry per second of the audio file,
and a mask in the spots that have an owl detection plus a buffer,
leaving the unmasked portions as free rein for randomly generating
noise segments. Now need to add something to randomly peruse the
unmasked portion, generating as many 3s samples from it as there are
burrowing owl detections, and creating a dataframe with the same
values as the create_segments function. Most of the code can be
borrowed from that, except for the random selection and the checking
this is the messiest function I've ever written, but it currently
works and I want to save progress. I need to ensure it's choosing
viable start times, and that there isn't some weird rounding logic
that doesn't make sense. We're dealing with some values that I convert
to ints, but given the buffer added between positive segments I
think this is fine. It currently makes the segments, adds them to
the other df, and makes a number equal to the detection segments in
this group.
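A minimal sketch of the mask-and-random-selection idea described in the
last two messages (the buffer size, segment length, and function name are
illustrative, not the actual implementation):

```python
import numpy as np

def pick_noise_starts(total_s, detections, n_segments, seg_len=3, buffer_s=1, rng=None):
    """Return random start times (seconds) for noise clips that avoid detections."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(int(total_s), dtype=bool)          # one flag per second of audio
    for start, end in detections:                      # owl seconds + buffer are off-limits
        lo = max(0, int(start - buffer_s))
        hi = min(len(mask), int(np.ceil(end + buffer_s)))
        mask[lo:hi] = True
    # a start is viable only if the whole window is unmasked
    viable = [t for t in range(len(mask) - seg_len + 1) if not mask[t:t + seg_len].any()]
    if not viable:
        return []
    return rng.choice(viable, size=min(n_segments, len(viable)), replace=False)
```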
added the beginnings of a script to deal with the creation of 5
folds of cleanly distributed data, stratified across all classes
while not splitting up clips from the same wav file, to minimize
the potential for leakage. Also was able to get the data to all come
together in one dataframe, but still need to check the relative
start time column because something seems off about some of the
decimals. Also need to handle a NoneType error in the noise
segment creation, because some empty dataframes are still being
passed to it, so need to check for None being passed in previous
steps. Also need to see how it will handle adding the 2017 and
2018 data together, since they need to be run separately. Making the
folded dataset is a separate step because I need to run 2017 and
2018 separately, though I suppose it could iterate over a list and
do them both at the same time; that seems a little excessive, but
it's the only way I could do the stratify step otherwise. I think
it will be more useful to keep it separate, as I feel we'd want to
do this in different ways and with other data.
one of the subset label file names didn't match what the all-labels
file said it was, even though it was the same file. So when it
comes across that, it searches for the existence of the actual
file name of that sub-label file
fixed an out of bounds error when creating random no_buow segment
fixed an error print statement that was referencing something that
didn't exist when the error happened. Dealt with a None value being
passed by a previous function by assigning it a value that ends
the chain of passing None
not really working code, but beginning to think about how to keep
the groups together while properly creating stratified datasets
found a tutorial that deals with the same problem
using someone else's optimization-based group splitting
implementation; running strat_k_folds.py uses these functions
when given the metadata file that's generated after detection and noise
segment creation. It currently doesn't append the fold number to
the metadata file, but it takes the class distributions of each
'group' (in our case, groups are at the wav file level). It then
calculates the optimal split of the groups to get each fold's class
distribution as close as possible to the original class distribution.
It outputs an array with the fold number for each group, as well
as some other metrics like the actual class distribution for each fold,
because it won't be perfect given that each group is unbalanced.
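For orientation, here is a greedy sketch of the group-level stratification
idea described above; the actual script borrows an optimization-based
implementation, so the column names and assignment strategy below only
illustrate the goal, they are not the code under review:

```python
import numpy as np
import pandas as pd

def assign_group_folds(metadata, n_folds=5, group_col="original_file", label_col="call_type"):
    """Greedily place each wav-level group into the fold whose class counts
    stay closest to 1/n_folds of the overall class distribution."""
    counts = pd.crosstab(metadata[group_col], metadata[label_col])
    target = counts.sum(axis=0).values / n_folds          # ideal per-fold class counts
    fold_counts = np.zeros((n_folds, counts.shape[1]))
    assignment = {}
    # place the biggest groups first so smaller ones can fill remaining gaps
    for group in counts.sum(axis=1).sort_values(ascending=False).index:
        row = counts.loc[group].values
        best = int(np.argmin(
            [np.abs(fold_counts[f] + row - target).sum() for f in range(n_folds)]
        ))
        fold_counts[best] += row
        assignment[group] = best
    return metadata[group_col].map(assignment)
```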
this can now add the fold number assignment to the metadata, but
it creates a new csv. Could probably append to the old one, but
this is what it does. There is something funky happening with the
indexing, but that is a known issue with the creation of the
metadata.csv which would be dealt with in the create_dataset pipeline
and is not the fault of this script. Also should create a results file
with the metrics of the actual class distributions per fold so
it can go in the readme for the dataset
there was a weird indexing issue that caused an extra row of
indexes in the final metadata file; that has been fixed, and
some helpful print statements were added during the debug stage to
help catch issues
the 2017 data, when passed through, was missing the chick begging
detections because the class list was being parsed wrong. And there
were some wav files that made it through the checks as legit but
were not actually labeled, so that had to be handled
there was one labeled wav file that was literally 1s long and 2/3
of that second was a labeled detection. It was legit and not from
another sound file. I decided to have it keep this detection and
generate a segment, but it cannot generate a complementary
no_buow segment. We can change this, but it seemed better to just
keep the extra detection, unless it starts happening all the time
(it won't)
fixed the chick begging underscore error and added some clearer
variable names
Sean1572 and others added 25 commits July 4, 2025 14:38
Based on review feedback
Tbh I have no idea if this is the best way to do this. Might look it up later but decided to just do a quick fix
Old code had weird naming conventions caused by iterating during the development process. Current code generalizes the base_preprocessor to allow for other preprocessors and model_inputs
Originally thought pylintrc controlled docstrings, but it's actually flake8. Reverted the changes in pylintrc that attempted to control docstring style
CometML has weird behavior with strings, where it displays docstrings; going to email their support line and see what that is about.

The current solution is to turn it into a comment
* just making this into a branch

copies the buowset extractor into a pandas one and updates the init

* adding current scripts for esc50 instead of pandas

was working on it on my laptop, but need to commit to be able
to train on a gpu or just a more powerful cpu, because
my process keeps getting killed, seemingly due to lack of ram

* lowered steps and increased epochs

the model was never evaluating because the default buowset
step count (100) was being used, while esc50 is much smaller and only
had 38 steps. So it appeared to not be crashing and to be doing
the training, but would never get evaluated. It also did not
seem to learn fast enough with only 2 epochs (more classes than
buowset?), so epochs were increased to 10, which allowed the model
to actually start learning

* fix some pylint, still snake case stuff to fix

* fixed pylint and removed git tracking from the train.py

train.py is always meant to just be copied and modified by
each user, so it didn't need to be tracked for esc50. The
only differences were the data extractor and the steps and epochs.
Pylint done on the extractor
34 Added Dev deps (Style Guidelines)
@Sean1572 Sean1572 marked this pull request as draft August 27, 2025 23:44
@Sean1572 Sean1572 changed the base branch from main to 29_before_ML August 27, 2025 23:46
@Sean1572 Sean1572 changed the base branch from 29_before_ML to 20_pre_ML August 27, 2025 23:47
@Sean1572 Sean1572 changed the base branch from 20_pre_ML to 29_before_ML August 27, 2025 23:48

@jaradfor jaradfor left a comment


Overall I think a lot of this code is very nice. It should be very helpful when working with the burrowing owl dataset and hopefully more datasets. I think the readmes could do with some more work explicitly tying things together, detailing how the different parts of the script interact, and clarifying which files/functions are dataset specific and which ones are more general. There are some minor magic numbers, like one thousand, in a number of places across the code. If we are hoping to use this training script for many different models across many different datasets, it might make a lot of sense to use a framework like PyTorch Lightning, which could allow more freedom if it's ever needed; the current script looks completely sufficient for training any given model on a single dataset. Using PyTorch Lightning might be a waste of effort, though, if the Hugging Face trainer and the written extension really do handle the overwhelming majority of uses. I think if you consider all of the datasets we could want to use, like BirdSet and custom ones, and can't think of a situation where the given trainer would not work but PyTorch Lightning would, we're probably ok. I don't know how the given code supports things like learning rate scheduling and other modern training tricks.

I think it would be a very good idea to add dataset-specific testing somewhere, to make sure that the extractor that was written works. I also think some unit and integration tests for the datasets would be an extremely good idea, as dataset problems can be very sneaky and hard to resolve.
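One possible shape for such a dataset-level test (the on-disk path and the required column names below are placeholders, not the repo's real API):

```python
# Hedged sketch of a pytest-style check on an extractor's output dataset.
from datasets import load_from_disk

REQUIRED_COLUMNS = {"audio", "labels"}   # assumed common AudioDataset columns

def test_extracted_dataset_has_required_columns():
    dataset = load_from_disk("data/buowset_extracted")   # hypothetical output path
    missing = REQUIRED_COLUMNS - set(dataset.column_names)
    assert not missing, f"extractor output is missing columns: {missing}"

def test_label_vectors_have_uniform_width():
    dataset = load_from_disk("data/buowset_extracted")
    widths = {len(row["labels"]) for row in dataset.select(range(min(10, len(dataset))))}
    assert len(widths) == 1, "one-hot label vectors should all have the same width"
```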

labels = pd.read_csv(labels)
use_2017 = None
# iterate through each individual original wav
if "2017" in labels['DATE'].iloc[0]:


it's a little unclear what this is checking for and what the downstream side effect is


if we were iterating over all of the wav files, this would need to be inside of the loop, with my current understanding. But since this is outside of the loop, I'm guessing the 2017 and 2018 datasets are processed separately and then glued together somewhere else

output_dir,
class_list)
# create same number of noise segments from the same wav file randomly
all_buow_rows = create_noise_segments(wav,


since the dataset is so strongly labelled, with every five second interval having either a positive or negative label, shouldn't we extract all negative labels and leave balancing the classes as a decision downstream? Even if every batch is balanced, there is some benefit from having a larger sample pool of negative examples to draw from
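If all negatives were kept, one downstream way to balance batches (a sketch, not something in this PR) would be to weight samples inversely to class frequency:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(labels):
    """labels: one integer class id per clip (e.g. 0 = no_buow, 1 = buow)."""
    counts = torch.bincount(torch.tensor(labels))
    weights = 1.0 / counts[torch.tensor(labels)].float()   # rarer class -> higher weight
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```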

PARSER.add_argument('-class_list', type=str,
help='Path to txt file of list of labeled classes')
ARGS = PARSER.parse_args()
main(ARGS.labels, ARGS.wav_dir, ARGS.output_dir, ARGS.class_list)


I don't know if this is the correct place for it, but since this file already has so much specific knowledge about the burrowing owl dataset, it might also be a good idea to include some sanity checks and basic tests here as well. If not here, it might be a good idea to put that somewhere nearby
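The sort of inline sanity check being suggested might look roughly like this (the column names and the no_buow class come from this PR's description; the specific checks are only examples):

```python
def sanity_check(metadata_df, class_list):
    """Assumed helper: fail fast if the generated metadata looks wrong."""
    assert not metadata_df.empty, "no segments were created"
    unknown = set(metadata_df["call_type"]) - set(class_list) - {"no_buow"}
    assert not unknown, f"labels slipped past filtering: {unknown}"
    assert (metadata_df["duration"] > 0).all(), "zero or negative duration segment"
    assert not metadata_df["uuid"].duplicated().any(), "duplicate segment uuid"
```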

return wavs_file_paths


def create_segments(wav, filtered_labels, out_path, class_list):


is this function dataset specific? What are the assumptions placed on the input?

label file- will ignore everything that is misspelled or
unknown labels.

Returns:


What does this metadata look like? Could be obvious and answered elsewhere and I'm missing something

The most challenging issue with machine learning is the dataset. This training repo intends to make it easy to modularize parts of the training pipeline, and integrate them together, ideally regardless of the dataset.

The pipeline works in 5 parts:
- Extractors: Extractors take in raw data and reformat it into `AudioDatasets`, apache-arrow data structures implemented via HuggingFace with common columns between any dataset. Every label is one_hot_encoded and treated as multilabel regardless of the problem. Audio filepaths are cast into [Audio columns](https://huggingface.co/docs/datasets/v3.6.0/en/package_reference/main_classes#datasets.Audio). Extractors are *unique for each dataset* but *uniform in the AudioDataset*.


One-hot encoded versus multi-label: is this intentional, or is multi-class the correct description?

Collaborator Author


Where is the versus here? One-hot encoding applies to both multi-label and multi-class ([110] vs [100])
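A tiny illustration of that point (the class names are just examples): the same multi-hot layout covers both a multi-label row and a single-class row.

```python
classes = ["buow", "chick_begging", "no_buow"]   # example class list

def encode(labels_present):
    """Multi-hot encode whichever labels apply to a clip."""
    return [1 if c in labels_present else 0 for c in classes]

encode({"buow", "chick_begging"})  # multi-label row  -> [1, 1, 0]
encode({"buow"})                   # single-class row -> [1, 0, 0]
```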

)

# COMMON OPTIONAL ARGS
training_args.num_train_epochs = 2


It's great to have these, but they should probably be put somewhere else, maybe as default values?

Collaborator Author


Maybe we debated it.

Train.py, in my mind, is intended to be a more flexible file. A user edits it, and that becomes the file that outlines an experiment (note this file gets saved to comet-ml every run).
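So an edited train.py ends up reading as a record of the experiment, roughly like the following (only num_train_epochs appears in the quoted diff; the other fields and values are illustrative HuggingFace TrainingArguments, not this PR's settings):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="runs/example_experiment")  # hypothetical path

# COMMON OPTIONAL ARGS (placeholder values for illustration)
training_args.num_train_epochs = 2
training_args.per_device_train_batch_size = 16
training_args.learning_rate = 1e-4
```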

@@ -0,0 +1,163 @@
"""Trains a Mutliclass Model with Pytorch and Huggingface.


What is the reason for this being separated from the other file train.py?

Collaborator Author


See my explanation above for the intention of train.py

Basically, a different experiment, and an example to show how train.py can be used as an experiment config.

Perhaps the meta-commentary here is that this intent is not well documented

phrase_one = "The column `"
phrase_two = "` is missing from dataset split `"
phrase_three = "`. Required by system"
state = (


Not a real problem at all, but this f-string could be made a lot more Pythonic
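For reference, the single f-string version being suggested (the variable names `column` and `split` are assumed, not the PR's actual ones):

```python
def missing_column_message(column: str, split: str) -> str:
    # one f-string instead of three concatenated phrase variables
    return (
        f"The column `{column}` is missing from dataset split `{split}`. "
        "Required by system"
    )
```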

Collaborator Author


I wish the linter complained about it

TimmInputs, duration=3
)

preprocessor = MelModelInputPreprocessor(


Needs a better variable name
