
Conversation

@Sean1572
Collaborator

Intended for jay to do code review over. DO NOT ACTUALLY MERGE THIS!

kgarwoodsdzwa and others added 30 commits March 12, 2025 14:45
this script takes in the human labels, the wavs, and an output
dir, parses through the human labels, identifies the segments in
the audio file where a human labeled a burrowing owl call, and
creates a 3 second segment wav file for each. Still needs to do
the same for creating samples of the background noise with no
bird detections, to build a balanced dataset with two classes:
burrowing owl call and no burrowing owl call. Can also be made to
create segments for specific call types rather than any
burrowing owl detection
now it will create a folder of 3 second bird detection segments,
as well as a folder of segments from the same wavs with no bird
detections
this file is based on the segment_2017_labeled_data.py file,
but the difference is that this one creates segments that are
exactly the length of the labeled duration, with optional
padding on the front and end. The labeled durations seem
extremely tight, making the sound files very short and slightly
imperfect, sometimes cutting off milliseconds at the beginning
and end of a call. This makes a tighter dataset
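For illustration, a minimal sketch of the padded-segment idea described
above (the function name, label arguments, padding default, and the use of
soundfile are assumptions, not the script's actual interface):

```python
# Hypothetical sketch: cut a clip spanning the labeled duration plus
# optional padding, clamped to the bounds of the file.
import soundfile as sf

def extract_padded_segment(wav_path, start_s, end_s, out_path, pad_s=0.1):
    audio, sr = sf.read(wav_path)
    start = max(0, int((start_s - pad_s) * sr))
    end = min(len(audio), int((end_s + pad_s) * sr))
    sf.write(out_path, audio[start:end], sr)
```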
removed print statements and there is no longer a need to
use librosa to get the sample rate and audio file duration
starting over on the dataset creation script to apply everything
we've learned so far. Two scripts: one to run, and one holding all
the functions needed to run. create_dataset.py would be run and
would call the functions in walk_buow_labels.py (could be named
better). It would create/append to 1 folder and 1 csv all the
human labeled detections at varied lengths, and an equal number of
fixed-duration noise samples from the same audio files. Now that I'm
writing this, it might be better for the noise samples to be aggregated
after all the detections have been aggregated, so we can also randomly
get noise samples from wavs that had NO detections. But
it's a good start.
intermediate commit to save progress; this works to walk the
2017 folders and ensure there are no duplicate labels. Was able to
catch an error in the 2017 labels where some of the same wav files
had been added into another folder. Success so far
the name walk_buow_labels didn't make much sense, and I also wanted
to separate out the functions for filtering the labels, as those are
the functions that will vary based on the label file, whereas
ideally the other functions will be able to stay the same across
data and label files. This will make it easier in the future for us
to add other parsers for different formats of label files
I didn't like having my old scripts mixed in with the new and improved
functions. Both params and segment can parse the 2017 and 2018 labeled
data: segment creates 3 second segments by retroactively labeling
3 second windows, and params creates a segment whose length is equal
to, or slightly longer than (depending on your params), the duration
of the label overall
the main function has been debugged, but needs another close look
through to ensure that it's not making any grave mistakes. It doesn't
crash while running through all the audio, and I believe I found
a way to ensure that, in the event of a duplicate file name, the
entry in the all-labels file is dropped from the labels. Next is
to begin making the segment part.
it currently creates the segments based on the filtered labels
for that audio file and adds the rows of interest to a dataframe
with the uuid of the clip, the path to the clip, the call type
label, the file path to the original file, and the duration of
the sample, then returns that dataframe. Need to check it with
all the audio and ensure that any remaining errors get caught,
and write the functions that put the data from all the wavs
together in 1 csv. Also need to create the function that makes
the noise samples. But currently creating uuid segments and the
associated metadata seems successful.
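As a rough sketch, the per-clip metadata rows described above might look
something like the following (the exact column names are assumptions
based on this description, not the script's real schema):

```python
# Hypothetical helper: build one metadata row per written segment.
import uuid
import pandas as pd

def segment_row(segment_path, original_wav, call_type, duration_s):
    return {
        "uuid": str(uuid.uuid4()),      # id that doubles as the clip file name
        "segment_path": segment_path,   # path to the written clip
        "call_type": call_type,         # label taken from the filtered label rows
        "original_file": original_wav,  # wav the clip was cut from
        "duration": duration_s,         # length of the sample in seconds
    }

# rows collected per wav, then concatenated into one dataframe
rows = [segment_row("clips/abc.wav", "2017/site1.wav", "buow", 3.0)]
metadata = pd.DataFrame(rows)
```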
able to create an array with one entry per second of the audio file,
and a mask in the spots that have an owl detection plus a buffer,
leaving the unmasked portions as free rein for randomly generating
noise segments. Now need to add something to randomly peruse the
unmasked portion, generating as many 3s samples from it as there are
burrowing owl detections, and creating a dataframe with the same
values as the create_segments function. Most of the code can be
borrowed from that, except for the random selection and the checking
this is the messiest function I've ever written, but it currently
works and I want to save progress. I need to ensure it's choosing
viable start times, and that there isn't some weird rounding logic
that doesn't make sense. We're dealing with some values that I convert
to ints, but given the buffer added between positive segments I
think this is fine. It currently makes the segments, adds them to
the other df, and makes a number equal to the detection segments in
this group.
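A minimal sketch of the mask-and-random-selection idea described in the
last two messages (the buffer size, segment length, and function name are
illustrative, not the actual implementation):

```python
import numpy as np

def pick_noise_starts(total_s, detections, n_segments, seg_len=3, buffer_s=1, rng=None):
    """Return random start times (seconds) for noise clips that avoid detections."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(int(total_s), dtype=bool)          # one flag per second of audio
    for start, end in detections:                      # owl seconds + buffer are off-limits
        lo = max(0, int(start - buffer_s))
        hi = min(len(mask), int(np.ceil(end + buffer_s)))
        mask[lo:hi] = True
    # a start is viable only if the whole window is unmasked
    viable = [t for t in range(len(mask) - seg_len + 1) if not mask[t:t + seg_len].any()]
    if not viable:
        return []
    return rng.choice(viable, size=min(n_segments, len(viable)), replace=False)
```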
added the beginnings of a script to deal with the creation of 5
folds of cleanly distributed data, stratified across all classes
while not splitting up clips from the same wav file, to minimize
the potential for leakage. Also was able to get the data to all come
together in one dataframe, but still need to check the relative
start time column because something seems off about some of the
decimals. Also need to handle a NoneType error in the noise
segment creation, because some empty dataframes are still being
passed to it, so need to check for None being passed in previous
steps. Also need to see how it will handle adding the 2017 and
2018 data together, since they need to be run separately. Making the
folded dataset is a separate step because I need to run 2017 and
2018 separately, though I suppose it could iterate over a list and
do them both at the same time; that seems a little excessive, but
it's the only way I could do the stratify step otherwise. I think
it will be more useful to keep it separate, as I feel we'd want to
do this in different ways and with other data.
one of the subset label file names didn't match what the all-labels
file said it was, even though it was the same file. So when it
comes across that, it searches for the existence of the actual
file name of that sub-label file
fixed an out of bounds error when creating random no_buow segment
fixed an error print statement that was referencing something that
didn't exist when the error happened. Dealt with a None value being
passed by a previous function by assigning it a value that ends
the chain of passing None
not really working code, but beginning to think about how to keep
the groups together while properly creating stratified datasets
found a tutorial that deals with the same problem
using someone else's optimization-based group splitting
implementation; running strat_k_folds.py uses these functions
when given the metadata file that's generated after detection and noise
segment creation. It currently doesn't append the fold number to
the metadata file, but it takes the class distributions of each
'group' (in our case, groups are at the wav file level). It then
calculates the optimal split of the groups to get each fold's class
distribution as close as possible to the original class distribution.
It outputs an array with the fold number for each group, as well
as some other metrics like the actual class distribution for each fold,
because it won't be perfect given that each group is unbalanced.
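For orientation, here is a greedy sketch of the group-level stratification
idea described above; the actual script borrows an optimization-based
implementation, so the column names and assignment strategy below only
illustrate the goal, they are not the code under review:

```python
import numpy as np
import pandas as pd

def assign_group_folds(metadata, n_folds=5, group_col="original_file", label_col="call_type"):
    """Greedily place each wav-level group into the fold whose class counts
    stay closest to 1/n_folds of the overall class distribution."""
    counts = pd.crosstab(metadata[group_col], metadata[label_col])
    target = counts.sum(axis=0).values / n_folds          # ideal per-fold class counts
    fold_counts = np.zeros((n_folds, counts.shape[1]))
    assignment = {}
    # place the biggest groups first so smaller ones can fill remaining gaps
    for group in counts.sum(axis=1).sort_values(ascending=False).index:
        row = counts.loc[group].values
        best = int(np.argmin(
            [np.abs(fold_counts[f] + row - target).sum() for f in range(n_folds)]
        ))
        fold_counts[best] += row
        assignment[group] = best
    return metadata[group_col].map(assignment)
```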
this can now add the fold number assignment to the metadata, but
it creates a new csv. Could probably append to the old one, but
this is what it does. There is something funky happening with the
indexing, but that is a known issue with the creation of the
metadata.csv which would be dealt with in the create_dataset pipeline
and is not the fault of this script. Also should create a results file
with the metrics of the actual class distributions per fold so
it can go in the readme for the dataset
there was a weird indexing issue that caused an extra row of
indexes in the final metadata file; that has been fixed, and
some helpful print statements were added during the debug stage to
help catch issues
the 2017 data, when passed through, was missing the chick begging
detections because the class list was being parsed wrong. And there
were some wav files that made it through the checks as legit but
were not actually labeled, so that had to be handled
there was one labeled wav file that was literally 1s long and 2/3
of that second was a labeled detection. It was legit and not from
another sound file. I decided to have it keep this detection and
generate a segment, but it cannot generate a complementary
no_buow segment. We can change this, but it seemed better to just
keep the extra detection, unless it starts happening all the time
(it won't)
fixed the chick begging underscore error and added some clearer
variable names
Sean1572 and others added 25 commits July 4, 2025 14:38
Based on review feedback
Tbh I have no idea if this is the best way to do this. Might look it up later but decided to just do a quick fix
Old code had weird naming conventions caused by iterating during the development process. Current code generalizes the base_preprocessor to allow for other preprocessors and model_inputs
Originally thought pylintrc controlled docstrings, but it's actually flake8. Reverted the changes in pylintrc that attempted to control docstring style
CometML has weird behavior with strings, where it displays docstrings; going to email their support line and see what that is about.

The current solution is to turn it into a comment
* just making this into a branch

copies the buowset extractor into a pandas one and updates the init

* adding current scripts for esc50 instead of pandas

was working on it on my laptop, but need to commit to be able
to train on a gpu or just a more powerful cpu, because
my process keeps getting killed, seemingly due to lack of ram

* lowered steps and increased epochs

the model was never evaluating because the default buowset
step count (100) was being used, while esc50 is much smaller and only
had 38 steps. So it appeared to not be crashing and to be doing
the training, but would never get evaluated. It also did not
seem to learn fast enough with only 2 epochs (more classes than
buowset?), so epochs were increased to 10, which allowed the model
to actually start learning

* fix some pylint, still snake case stuff to fix

* fixed pylint and removed git tracking from the train.py

train.py is always meant to just be copied and modified by
each user, so it didn't need to be tracked for esc50. The
only differences were the data extractor and the steps and epochs.
Pylint done on the extractor
34 Added Dev deps (Style Guidelines)
@Sean1572 Sean1572 marked this pull request as draft August 27, 2025 23:44
@Sean1572 Sean1572 changed the base branch from main to 29_before_ML August 27, 2025 23:46
@Sean1572 Sean1572 changed the base branch from 29_before_ML to 20_pre_ML August 27, 2025 23:47
@Sean1572 Sean1572 changed the base branch from 20_pre_ML to 29_before_ML August 27, 2025 23:48

@jaradfor jaradfor left a comment


Overall I think a lot of this code is very nice. It should be very helpful when working with the burrowing owl dataset and hopefully more datasets. I think the readmes could do with some more work explicitly tying things together, detailing how the different parts of the script interact, and clarifying which files/functions are dataset specific and which ones are more general. There are some minor magic numbers, like one thousand, in a number of places across the code. If we are hoping to use this training script for many different models across many different datasets, it might make a lot of sense to use a framework like PyTorch Lightning, which could allow more freedom if it's ever needed; the current script looks completely sufficient for training any given model on a single dataset. Using PyTorch Lightning might be a waste of effort, though, if the Hugging Face trainer and the written extension really do handle the overwhelming majority of uses. I think if you consider all of the datasets we could want to use, like BirdSet and custom ones, and can't think of a situation where the given trainer would not work but PyTorch Lightning would, we're probably ok. I don't know how the given code supports things like learning rate scheduling and other modern training tricks.

I think it would be a very good idea to add dataset-specific testing somewhere, to make sure that the extractor that was written works. I also think some unit and integration tests for the datasets would be an extremely good idea, as dataset problems can be very sneaky and hard to resolve.
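One possible shape for such a dataset-level test (the on-disk path and the required column names below are placeholders, not the repo's real API):

```python
# Hedged sketch of a pytest-style check on an extractor's output dataset.
from datasets import load_from_disk

REQUIRED_COLUMNS = {"audio", "labels"}   # assumed common AudioDataset columns

def test_extracted_dataset_has_required_columns():
    dataset = load_from_disk("data/buowset_extracted")   # hypothetical output path
    missing = REQUIRED_COLUMNS - set(dataset.column_names)
    assert not missing, f"extractor output is missing columns: {missing}"

def test_label_vectors_have_uniform_width():
    dataset = load_from_disk("data/buowset_extracted")
    widths = {len(row["labels"]) for row in dataset.select(range(min(10, len(dataset))))}
    assert len(widths) == 1, "one-hot label vectors should all have the same width"
```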

labels = pd.read_csv(labels)
use_2017 = None
# iterate through each individual original wav
if "2017" in labels['DATE'].iloc[0]:


it's a little unclear what this is checking for and what the downstream side effect is


if we were iterating over all of the wav files, this would need to be inside of the loop, with my current understanding. But since this is outside of the loop, I'm guessing the 2017 and 2018 datasets are processed separately and then glued together somewhere else

output_dir,
class_list)
# create same number of noise segments from the same wav file randomly
all_buow_rows = create_noise_segments(wav,


since the dataset is so strongly labelled, with every five second interval having either a positive or negative label, shouldn't we extract all negative labels and leave balancing the classes as a decision downstream? Even if every batch is balanced, there is some benefit from having a larger sample pool of negative examples to draw from
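If all negatives were kept, one downstream way to balance batches (a sketch, not something in this PR) would be to weight samples inversely to class frequency:

```python
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(labels):
    """labels: one integer class id per clip (e.g. 0 = no_buow, 1 = buow)."""
    counts = torch.bincount(torch.tensor(labels))
    weights = 1.0 / counts[torch.tensor(labels)].float()   # rarer class -> higher weight
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```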

PARSER.add_argument('-class_list', type=str,
help='Path to txt file of list of labeled classes')
ARGS = PARSER.parse_args()
main(ARGS.labels, ARGS.wav_dir, ARGS.output_dir, ARGS.class_list)


I don't know if this is the correct place for it, but since this file already has so much specific knowledge about the burrowing owl dataset, it might also be a good idea to include some sanity checks and basic tests here as well. If not here, it might be a good idea to put that somewhere nearby
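The sort of inline sanity check being suggested might look roughly like this (the column names and the no_buow class come from this PR's description; the specific checks are only examples):

```python
def sanity_check(metadata_df, class_list):
    """Assumed helper: fail fast if the generated metadata looks wrong."""
    assert not metadata_df.empty, "no segments were created"
    unknown = set(metadata_df["call_type"]) - set(class_list) - {"no_buow"}
    assert not unknown, f"labels slipped past filtering: {unknown}"
    assert (metadata_df["duration"] > 0).all(), "zero or negative duration segment"
    assert not metadata_df["uuid"].duplicated().any(), "duplicate segment uuid"
```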

return wavs_file_paths


def create_segments(wav, filtered_labels, out_path, class_list):


is this function dataset specific? What are the assumptions placed on the input?

label file- will ignore everything that is misspelled or
unknown labels.

Returns:


What does this metadata look like? Could be obvious and answered elsewhere and I'm missing something

The most challenging issue with machine learning is the dataset. This training repo intends to make it easy to modularize parts of the training pipeline, and integrate them together, ideally regardless of the dataset.

The pipeline works in 5 parts:
- Extractors: Extractors take in raw data and reformat it into `AudioDatasets`, apache-arrow data structures implemented via HuggingFace with common columns between any dataset. Every label is one_hot_encoded and treated as multilabel regardless of the problem. Audio filepaths are cast into [Audio columns](https://huggingface.co/docs/datasets/v3.6.0/en/package_reference/main_classes#datasets.Audio). Extractors are *unique for each dataset* but *uniform in the AudioDataset*.


One-hot encoded versus multi-label: is this intentional, or is multi-class the correct description?

Collaborator Author


Where is the versus here? One-hot encoding applies to both multi-label and multi-class ([110] vs [100])
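A tiny illustration of that point (the class names are just examples): the same multi-hot layout covers both a multi-label row and a single-class row.

```python
classes = ["buow", "chick_begging", "no_buow"]   # example class list

def encode(labels_present):
    """Multi-hot encode whichever labels apply to a clip."""
    return [1 if c in labels_present else 0 for c in classes]

encode({"buow", "chick_begging"})  # multi-label row  -> [1, 1, 0]
encode({"buow"})                   # single-class row -> [1, 0, 0]
```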

)

# COMMON OPTIONAL ARGS
training_args.num_train_epochs = 2


It's great to have these, but they should probably be put somewhere else, maybe as default values?

Collaborator Author


Maybe we debated it.

Train.py, in my mind, is intended to be a more flexible file. A user edits it, and that becomes the file that outlines an experiment (note this file gets saved to comet-ml every run).
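So an edited train.py ends up reading as a record of the experiment, roughly like the following (only num_train_epochs appears in the quoted diff; the other fields and values are illustrative HuggingFace TrainingArguments, not this PR's settings):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="runs/example_experiment")  # hypothetical path

# COMMON OPTIONAL ARGS (placeholder values for illustration)
training_args.num_train_epochs = 2
training_args.per_device_train_batch_size = 16
training_args.learning_rate = 1e-4
```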

@@ -0,0 +1,163 @@
"""Trains a Mutliclass Model with Pytorch and Huggingface.


What is the reason for this being separated from the other file train.py?

Collaborator Author


See my explanation above for the intention of train.py

Basically, a different experiment, and an example to show how train.py can be used as an experiment config.

Perhaps the meta-commentary here is that this intent is not well documented

phrase_one = "The column `"
phrase_two = "` is missing from dataset split `"
phrase_three = "`. Required by system"
state = (


Not a real problem at all, but this f-string could be made a lot more Pythonic
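For reference, the single f-string version being suggested (the variable names `column` and `split` are assumed, not the PR's actual ones):

```python
def missing_column_message(column: str, split: str) -> str:
    # one f-string instead of three concatenated phrase variables
    return (
        f"The column `{column}` is missing from dataset split `{split}`. "
        "Required by system"
    )
```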

Collaborator Author


I wish the linter complained about it

TimmInputs, duration=3
)

preprocessor = MelModelInputPreprocessor(


Needs a better variable name
