Draft
Commits
35 commits
5ce0852
Add dvc and uv projects
KamiCreed Feb 24, 2026
6e88548
Add raw JSON data from API
KamiCreed Feb 26, 2026
2ece96f
Fix filepaths of raw data
KamiCreed Feb 26, 2026
c40ab5c
Update raw data from S3
KamiCreed Feb 26, 2026
b3ef3ce
Add update JSON pipeline step
KamiCreed Feb 26, 2026
c98908e
Organize data more
KamiCreed Feb 26, 2026
078bda1
Add conditions and counts data
KamiCreed Feb 26, 2026
8db310b
Start adding yolo converter
KamiCreed Feb 26, 2026
4dd1d37
Track raw data changes before downloading
KamiCreed Feb 26, 2026
eef10ff
Add yolo conversion to pipeline
KamiCreed Mar 2, 2026
22e1504
Shard dataset for better uploading
KamiCreed Mar 2, 2026
57caa37
Skip every 3rd frame for each video
KamiCreed Mar 2, 2026
4cc372a
Add unpacking step
KamiCreed Mar 2, 2026
3e57d9b
Fix frame sampling
KamiCreed Mar 7, 2026
c6017e6
Move written labels counter generally
KamiCreed Mar 7, 2026
0593afc
Split data based on day, class, area, density, and time of day
KamiCreed Mar 12, 2026
00cfe13
Add negatives from removed annotation vids
KamiCreed Mar 18, 2026
db34694
Remove mutable default
KamiCreed Mar 18, 2026
33d4189
Refactor yolo converter into a module for testing
KamiCreed Mar 18, 2026
ef756a9
Fix Tuple
KamiCreed Mar 18, 2026
2968160
Refactor splitter into modules and add tests
KamiCreed Mar 19, 2026
fe65c4a
Fix tests
KamiCreed Mar 19, 2026
eb7f4d9
Move safe_float to own utils module
KamiCreed Mar 19, 2026
a9ef148
Repro with utils change
KamiCreed Mar 19, 2026
502f8dd
Add smoke tests
KamiCreed Mar 19, 2026
2bf0c81
Aggregate all stats
KamiCreed Mar 19, 2026
a71bbba
Add negatives with balanced water conditions
KamiCreed Mar 19, 2026
9062b37
Add tests
KamiCreed Mar 20, 2026
3881107
Fix tests
KamiCreed Mar 20, 2026
4422edb
Update dvc
KamiCreed Mar 20, 2026
17b082d
Add log copy docker service
KamiCreed Mar 24, 2026
6518081
Add frame extractor and video metadata
KamiCreed Mar 30, 2026
d2dd830
Output video metadata for negatives
KamiCreed Mar 31, 2026
eecd99c
Parse multiple video metadata CSVs
KamiCreed Apr 1, 2026
6ee5fcc
Pack data into shards instead
KamiCreed Apr 2, 2026
3 changes: 3 additions & 0 deletions training/object-detection/.dvc/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
4 changes: 4 additions & 0 deletions training/object-detection/.dvc/config
@@ -0,0 +1,4 @@
[core]
    remote = storage
['remote "storage"']
    url = s3://salmonvision-dvc/rgb_object_detection
3 changes: 3 additions & 0 deletions training/object-detection/.dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
Empty file.
1 change: 1 addition & 0 deletions training/object-detection/.python-version
@@ -0,0 +1 @@
3.8
35 changes: 35 additions & 0 deletions training/object-detection/README.md
@@ -0,0 +1,35 @@
# Object Detection

Training pipeline for the SalmonVision object detection model.

Install uv:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Install DVC:
```bash
uv tool install dvc
```

Install the module:
```bash
uv pip install -e .
```

See `dvc.yaml` for the full pipeline definition.

To run a specific stage of the pipeline:
```bash
dvc repro stage_name
```

For example, to build the model input annotations:
```bash
dvc repro build_model_input
```

Run the tests:
```bash
uv run pytest
```
28 changes: 28 additions & 0 deletions training/object-detection/config/salmon_yolo.yaml
@@ -0,0 +1,28 @@
# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
# Classes updated on 2026-02-12
path: /training/export_combined_bear_kitwanga_yolo # dataset root dir
train: train.txt
val: val.txt
test: test.txt

# Classes
names:
  0: Coho
  1: Bull
  2: Rainbow
  3: Sockeye
  4: Pink
  5: Whitefish
  6: Chinook
  7: Shiner
  8: Pikeminnow
  9: Chum
  10: Steelhead
  11: Lamprey
  12: Cutthroat
  13: Stickleback
  14: Sculpin
  15: Jack_Coho
  16: Jack_Chinook
  17: Otter
  18: Sucker
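To illustrate how the class ids in this config are consumed, here is a minimal sketch that parses one line of a YOLO label file (`class_id x_center y_center width height`, with coordinates normalized to the image size) and resolves the class name. The `CLASS_NAMES` table is copied from the config above; `YoloBox` and `parse_label_line` are hypothetical names for illustration and are not taken from the converter in this PR.

```python
from typing import Dict, NamedTuple

# Class ids and names copied from config/salmon_yolo.yaml.
CLASS_NAMES: Dict[int, str] = {
    0: "Coho", 1: "Bull", 2: "Rainbow", 3: "Sockeye", 4: "Pink",
    5: "Whitefish", 6: "Chinook", 7: "Shiner", 8: "Pikeminnow", 9: "Chum",
    10: "Steelhead", 11: "Lamprey", 12: "Cutthroat", 13: "Stickleback",
    14: "Sculpin", 15: "Jack_Coho", 16: "Jack_Chinook", 17: "Otter",
    18: "Sucker",
}


class YoloBox(NamedTuple):
    """One bounding box in normalized YOLO coordinates (hypothetical type)."""
    class_name: str
    x_center: float
    y_center: float
    width: float
    height: float


def parse_label_line(line: str) -> YoloBox:
    """Parse one 'class_id x_center y_center width height' label line."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    if class_id not in CLASS_NAMES:
        raise ValueError(f"unknown class id {class_id}")
    x, y, w, h = (float(p) for p in parts[1:])
    # YOLO labels store coordinates as fractions of the image size.
    for value in (x, y, w, h):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"coordinate {value} is not normalized to [0, 1]")
    return YoloBox(CLASS_NAMES[class_id], x, y, w, h)
```

A check like this can catch labels that reference a class id missing from the config before training starts.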
4 changes: 4 additions & 0 deletions training/object-detection/data/01_raw/.gitignore
@@ -0,0 +1,4 @@
/labelstudio_annos
/salmon_vid_counts.csv
/salmon_vid_counts_summary.csv
/sv_water_conditions
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
outs:
- md5: b6136eee30a358d7473a160197bc91be
  size: 6465051
  hash: md5
  path: salmon_vid_counts.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
outs:
- md5: c65c2ba8214d26905ccb9608e9071b93
  size: 1031
  hash: md5
  path: salmon_vid_counts_summary.csv
6 changes: 6 additions & 0 deletions training/object-detection/data/01_raw/sv_water_conditions.dvc
@@ -0,0 +1,6 @@
outs:
- md5: 717ade5c2fd57a763609ab55d66bf351.dir
  size: 199437
  nfiles: 10
  hash: md5
  path: sv_water_conditions
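The single-file `.dvc` entries above record an `md5` and a `size` for each tracked CSV. As a rough sketch of how a local copy could be checked against those values, assuming that `hash: md5` means a plain MD5 of the file bytes, the helper below streams the file through `hashlib`. The function names are hypothetical, and directory entries (hashes ending in `.dir`) are computed differently and are not covered here.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path: Path, recorded_md5: str, recorded_size: int) -> bool:
    """Compare a local file with the md5 and size recorded in a .dvc file.

    Assumption: the recorded md5 is a plain MD5 of the file content.
    """
    # The size check is cheap, so do it before hashing.
    if path.stat().st_size != recorded_size:
        return False
    return file_md5(path) == recorded_md5
```

For example, `matches_recorded(Path("data/01_raw/salmon_vid_counts_summary.csv"), "c65c2ba8214d26905ccb9608e9071b93", 1031)` would compare a local copy with the values recorded above.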
3 changes: 3 additions & 0 deletions training/object-detection/data/02_interim/.gitignore
@@ -0,0 +1,3 @@
/yolo_annos
/yolo_annos_unpacked
/yolo_condition_negatives
1 change: 1 addition & 0 deletions training/object-detection/data/03_processed/.gitignore
@@ -0,0 +1 @@
/splits_baseline
143 changes: 143 additions & 0 deletions training/object-detection/dvc.lock
@@ -0,0 +1,143 @@
schema: '2.0'
stages:
  update_raw:
    cmd: rclone sync -P --filter "- /salm_dataset*/**" --filter "+ *.json"
      --filter "- *.zip" aws:salmonvision-ml-datasets/rgb/raw/
      data/01_raw/labelstudio_annos
    deps:
    - path: s3://salmonvision-ml-datasets/rgb/raw
      hash: md5
      md5: 9aab20758f57f39971a38c9a676dad27.dir
      size: 720238505
      nfiles: 65
    outs:
    - path: data/01_raw/labelstudio_annos
      hash: md5
      md5: 188b71c31be326fe36d209cc52c23337.dir
      size: 379352848
      nfiles: 50
  split_data:
    cmd: rm -rf data/02_interim/yolo_annos_unpacked && scripts/unpack_annos.sh
      data/02_interim/yolo_annos data/02_interim/yolo_annos_unpacked &&
      scripts/unpack_annos.sh data/02_interim/yolo_condition_negatives
      data/02_interim/yolo_annos_unpacked && scripts/make_splits.py
      --labels-root data/02_interim/yolo_annos_unpacked --out-dir
      data/03_processed/splits_baseline --sites tankeeah kitwanga bear --seed 42
      --train-frac 0.8 --val-frac 0.1 --test-frac 0.1
    deps:
    - path: data/01_raw/salmon_vid_counts.csv
      hash: md5
      md5: b6136eee30a358d7473a160197bc91be
      size: 6465051
    - path: data/01_raw/salmon_vid_counts_summary.csv
      hash: md5
      md5: c65c2ba8214d26905ccb9608e9071b93
      size: 1031
    - path: data/02_interim/yolo_annos
      hash: md5
      md5: 88551957fa5e19f676175f29606a613a.dir
      size: 308350474
      nfiles: 5
    - path: data/02_interim/yolo_condition_negatives
      hash: md5
      md5: c3505d079516f38874bdb42d2182cb46.dir
      size: 338928
      nfiles: 3
    - path: scripts/make_splits.py
      hash: md5
      md5: 3921729fec8a5aa6c1eb25c54196a658
      size: 650
    - path: src/object_detection/splits
      hash: md5
      md5: 8ca8d1d4f2962a0e27f6fd50da90a8d2.dir
      size: 35847
      nfiles: 6
    - path: src/object_detection/utils
      hash: md5
      md5: 5828afbf8a25f11a0f8e5bbc6c0065bf.dir
      size: 762
      nfiles: 4
    outs:
    - path: data/03_processed/splits_baseline
      hash: md5
      md5: e821f72b0bf42a09afafb688ce927ac4.dir
      size: 16966114
      nfiles: 5
  build_model_input:
    cmd: scripts/yolo_converter_ls_video.py data/01_raw/labelstudio_annos
      --data-yaml config/salmon_yolo.yaml --out data/02_interim/yolo_annos_fs
      --out-shards data/02_interim/yolo_annos --empty-list
      data/02_interim/yolo_annos/empty_vids.txt --shard-size 100000 --pattern
      '**/*.json' --include-sites tankeeah kitwanga bear --frame-stride 3
      --frame-offset-mode video_hash --include-negatives --negative-ratio 0.10
      --negatives-per-video 11
    deps:
    - path: config/salmon_yolo.yaml
      hash: md5
      md5: f453f5dc54f1743eaedcc3ab117d269e
      size: 547
    - path: data/01_raw/labelstudio_annos
      hash: md5
      md5: 188b71c31be326fe36d209cc52c23337.dir
      size: 379352848
      nfiles: 50
    - path: scripts/yolo_converter_ls_video.py
      hash: md5
      md5: 95696514ec77c20c6491c5af2ccb46a5
      size: 118
    - path: src/object_detection/utils
      hash: md5
      md5: 5828afbf8a25f11a0f8e5bbc6c0065bf.dir
      size: 762
      nfiles: 4
    - path: src/object_detection/yolo_ls
      hash: md5
      md5: 80ca7160d1912861bbc2a013a45ac5d5.dir
      size: 52757
      nfiles: 10
    outs:
    - path: data/02_interim/yolo_annos
      hash: md5
      md5: 88551957fa5e19f676175f29606a613a.dir
      size: 308350474
      nfiles: 5
  unpack_annos:
    cmd: rm -r data/02_interim/yolo_annos_unpacked || scripts/unpack_annos.sh
      data/02_interim/yolo_annos data/02_interim/yolo_annos_unpacked
    deps:
    - path: data/02_interim/yolo_annos
      hash: md5
      md5: 473a8eedf5fd4cc5d5b8b6d1f80681e7.dir
      size: 293713920
      nfiles: 3
  build_condition_negatives:
    cmd: scripts/create_condition_negatives.py --conditions-csv
      data/01_raw/sv_water_conditions/SV_conditions_tracking_tankeeah_2025.csv
      data/01_raw/sv_water_conditions/SV_conditions_tracking_bear_2025.csv
      data/01_raw/sv_water_conditions/SV_conditions_tracking_kitwanga_2025.csv
      --out-dir data/02_interim/yolo_condition_negatives --frames-per-video 20
      --frame-stride 3 --frame-offset-mode video_hash --shard-size 100000
    deps:
    - path: data/01_raw/sv_water_conditions
      hash: md5
      md5: 717ade5c2fd57a763609ab55d66bf351.dir
      size: 199437
      nfiles: 10
    - path: src/object_detection/negatives/cli.py
      hash: md5
      md5: d04fcc8b316529ebefeec2830b288197
      size: 2142
    - path: src/object_detection/negatives/conditions.py
      hash: md5
      md5: f037be2028fead4a5a9e88f95799254a
      size: 18950
    - path: src/object_detection/yolo_ls/shards.py
      hash: md5
      md5: 9a1e97420b95c978e10ec236c2645337
      size: 1266
    outs:
    - path: data/02_interim/yolo_condition_negatives
      hash: md5
      md5: c3505d079516f38874bdb42d2182cb46.dir
      size: 338928
      nfiles: 3
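The `split_data` command recorded above passes `--seed 42 --train-frac 0.8 --val-frac 0.1 --test-frac 0.1`. As a simplified illustration of what a seeded, video-level split with those fractions can look like, the sketch below shuffles video ids with a fixed seed and cuts the list into three parts. This is an assumption-laden sketch, not the logic of `make_splits.py`: the commit history indicates the real splitter also balances by day, class, area, density, and time of day, which is not reproduced here. `split_videos` is a hypothetical name.

```python
import random
from typing import Dict, List, Sequence


def split_videos(
    video_ids: Sequence[str],
    seed: int = 42,
    train_frac: float = 0.8,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
) -> Dict[str, List[str]]:
    """Assign whole videos to train/val/test with a repeatable shuffle."""
    if abs(train_frac + val_frac + test_frac - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # Sort first so the result does not depend on the input order.
    ordered = sorted(set(video_ids))
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(ordered)
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        # The remainder goes to test, so no video is dropped by rounding.
        "test": ordered[n_train + n_val:],
    }
```

Splitting by video rather than by frame keeps near-duplicate frames from the same clip out of different splits, which would otherwise inflate validation scores.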
142 changes: 142 additions & 0 deletions training/object-detection/dvc.yaml
@@ -0,0 +1,142 @@
stages:
  update_raw:
    cmd: >-
      rclone sync
      -P
      --filter "- /salm_dataset*/**"
      --filter "+ *.json"
      --filter "- *.zip"
      aws:salmonvision-ml-datasets/rgb/raw/
      data/01_raw/labelstudio_annos
    deps:
      - s3://salmonvision-ml-datasets/rgb/raw
    outs:
      - data/01_raw/labelstudio_annos
    frozen: true
  build_model_input:
    cmd: >-
      scripts/yolo_converter_ls_video.py
      data/01_raw/labelstudio_annos
      --data-yaml config/salmon_yolo.yaml
      --out data/02_interim/yolo_annos_fs
      --out-shards data/02_interim/yolo_annos
      --empty-list data/02_interim/yolo_annos/empty_vids.txt
      --shard-size 100000
      --pattern '**/*.json'
      --include-sites tankeeah kitwanga bear
      --frame-stride 3
      --frame-offset-mode video_hash
      --include-negatives
      --negative-ratio 0.10
      --negatives-per-video 11
    deps:
      - scripts/yolo_converter_ls_video.py
      - src/object_detection/yolo_ls
      - src/object_detection/utils
      - data/01_raw/labelstudio_annos
      - config/salmon_yolo.yaml
    outs:
      - data/02_interim/yolo_annos
  build_condition_negatives:
    cmd: >-
      scripts/create_condition_negatives.py
      --conditions-csv data/01_raw/sv_water_conditions/SV_conditions_tracking_tankeeah_2025.csv
      data/01_raw/sv_water_conditions/SV_conditions_tracking_bear_2025.csv
      data/01_raw/sv_water_conditions/SV_conditions_tracking_kitwanga_2025.csv
      --out-dir data/02_interim/yolo_condition_negatives
      --frames-per-video 20
      --frame-stride 3
      --frame-offset-mode video_hash
      --shard-size 100000
    deps:
      - src/object_detection/negatives/conditions.py
      - src/object_detection/negatives/cli.py
      - src/object_detection/yolo_ls/shards.py
      - data/01_raw/sv_water_conditions
    outs:
      - data/02_interim/yolo_condition_negatives
  split_data:
    cmd: >-
      rm -rf data/02_interim/yolo_annos_unpacked &&
      scripts/unpack_annos.sh
      data/02_interim/yolo_annos
      data/02_interim/yolo_annos_unpacked &&
      scripts/unpack_annos.sh
      data/02_interim/yolo_condition_negatives
      data/02_interim/yolo_annos_unpacked &&
      scripts/make_splits.py
      --labels-root data/02_interim/yolo_annos_unpacked
      --out-dir data/03_processed/splits_baseline
      --sites tankeeah kitwanga bear
      --seed 42
      --train-frac 0.8 --val-frac 0.1 --test-frac 0.1
    deps:
      - scripts/make_splits.py
      - src/object_detection/splits
      - src/object_detection/utils
      - data/02_interim/yolo_annos
      - data/02_interim/yolo_condition_negatives
      - data/01_raw/salmon_vid_counts.csv
      - data/01_raw/salmon_vid_counts_summary.csv
    outs:
      - data/03_processed/splits_baseline
  build_video_metadata_index:
    cmd: >-
      scripts/build_video_metadata_index.py
      --json-dir data/01_raw/labelstudio_annos
      --out-csv data/02_interim/video_metadata_index.csv
    deps:
      - src/object_detection/metadata/index.py
      - src/object_detection/metadata/cli.py
      - data/01_raw/labelstudio_annos
    outs:
      - data/02_interim/video_metadata_index.csv
  pack_split_dataset:
    cmd: >-
      scripts/pack_split_dataset.py
      --splits-dir data/03_processed/splits_baseline
      --labels-root data/02_interim/yolo_annos_unpacked
      --shards-root ${config.drive}/salmon_dataset/dataset_sharded/shards
      --manifests-root ${config.drive}/salmon_dataset/dataset_sharded/manifests
      --temp-video-dir ${config.drive}/salmon_dataset/tmp_videos
      --metadata-csv data/02_interim/video_metadata_index.csv
      data/02_interim/yolo_condition_negatives/condition_negative_video_metadata.csv
      --data-yaml config/salmon_yolo.yaml
      --bucket prod-salmonvision-edge-assets-labelstudio-source
      --image-ext .jpg
      --manifest-csv data/03_processed/packed_dataset_manifest.csv
      --splits train val test
      --shard-size 100000
    deps:
      - src/object_detection/frames/parsing.py
      - src/object_detection/frames/extractor.py
      - src/object_detection/frames/cli.py
      - src/object_detection/yolo_ls/shards.py
      - data/03_processed/splits_baseline
      - data/02_interim/yolo_annos_unpacked
      - data/02_interim/video_metadata_index.csv
      - data/02_interim/yolo_condition_negatives/condition_negative_video_metadata.csv
      - config/salmon_yolo.yaml
    params:
      - config.drive
    outs:
      - ${config.drive}/salmon_dataset/dataset_sharded/shards
      - ${config.drive}/salmon_dataset/dataset_sharded/manifests
      - data/03_processed/packed_dataset_manifest.csv
  unpack_split_dataset:
    cmd: >-
      rm -rf ${config.drive}/salmon_dataset/yolo_workdir &&
      mkdir -p ${config.drive}/salmon_dataset/yolo_workdir &&
      scripts/unpack_annos.sh
      ${config.drive}/salmon_dataset/dataset_sharded/shards
      ${config.drive}/salmon_dataset/yolo_workdir &&
      cp ${config.drive}/salmon_dataset/dataset_sharded/manifests/train.txt ${config.drive}/salmon_dataset/yolo_workdir/train.txt &&
      cp ${config.drive}/salmon_dataset/dataset_sharded/manifests/val.txt ${config.drive}/salmon_dataset/yolo_workdir/val.txt &&
      cp ${config.drive}/salmon_dataset/dataset_sharded/manifests/test.txt ${config.drive}/salmon_dataset/yolo_workdir/test.txt &&
      cp ${config.drive}/salmon_dataset/dataset_sharded/manifests/data.yaml ${config.drive}/salmon_dataset/yolo_workdir/data.yaml
    deps:
      - ${config.drive}/salmon_dataset/dataset_sharded/shards
      - ${config.drive}/salmon_dataset/dataset_sharded/manifests
    params:
      - config.drive
    frozen: true
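Two stages in this pipeline pass `--frame-stride 3 --frame-offset-mode video_hash`. The converter's implementation is not shown in this part of the diff, so the following is only a sketch of one plausible reading: keep every third frame, and derive the starting offset from a hash of the video identifier so that different videos do not all start at frame 0. The function names, the choice of MD5, and the use of the video file name as the identifier are all assumptions.

```python
import hashlib
from typing import List


def frame_offset(video_id: str, stride: int) -> int:
    """Derive a stable starting offset in [0, stride) from the video id.

    hashlib is used instead of the built-in hash(), whose output for
    strings changes between interpreter runs.
    """
    digest = hashlib.md5(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % stride


def sampled_frames(video_id: str, n_frames: int, stride: int = 3) -> List[int]:
    """Return the frame indices kept for one video (hypothetical helper)."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(frame_offset(video_id, stride), n_frames, stride))
```

Because the offset depends only on the video id, rerunning a stage selects the same frames, which keeps the stage outputs reproducible under `dvc repro`.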