```
vision/
|-- data/
|-- experiments/
|-- models/
|   |-- cnn.py
|   |-- lstm.py
|   `-- pipeline.py
|-- utils/
|   |-- dataset.py
|   |-- transforms.py
|   `-- metrics.py
|-- configs/
|   `-- default.yaml
|-- train.py
|-- eval.py
|-- requirements.txt
`-- README.md
```
- Data Acquisition
  - Input: video clips or frame sequences.
  - Each sample = `[T, C, H, W]`.
  - Label = target coordinates `(x, y)` or `(x, y, z)`.
- Preprocessing
  - Resize -> Normalize (ImageNet stats).
  - Augmentations: crop, blur, etc.
- Dataset & DataLoader
  - Dataset returns `(frames, label, length)`.
  - DataLoader batches -> `[B, T, C, H, W]`, `[B, output_dim]`.
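A minimal sketch of a clip dataset with the `(frames, label, length)` contract; the in-memory random data is a stand-in for a real loader that would read frames and coordinate labels from disk:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ClipDataset(Dataset):
    def __init__(self, num_clips=8, seq_len=16, output_dim=2):
        # Stand-in data; a real loader would decode video frames from disk.
        self.frames = torch.randn(num_clips, seq_len, 3, 224, 224)  # [N, T, C, H, W]
        self.labels = torch.randn(num_clips, output_dim)            # [N, 2] for (x, y)

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        clip = self.frames[idx]                       # [T, C, H, W]
        return clip, self.labels[idx], clip.shape[0]  # (frames, label, length)

loader = DataLoader(ClipDataset(), batch_size=4)
frames, labels, lengths = next(iter(loader))
# frames: [4, 16, 3, 224, 224], labels: [4, 2], lengths: [4]
```

With fixed-length clips the `length` field is redundant, but returning it keeps the interface ready for variable-length sequences and packed LSTM input.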
- CNN Feature Extraction
  - ResNet-50 extracts features per frame.
  - Options:
    - With global pooling: `[B, T, 2048]`.
    - Without global pooling: `[B, T, 100352]` (spatial info kept).
- Temporal Modeling (LSTM)
  - Input: `[B, T, D]`.
  - Output: last hidden state `[B, H]`.
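The temporal stage could look like the following: an LSTM over the `[B, T, D]` feature sequence, with the final hidden state used as the clip summary. The hidden size is an illustrative assumption:

```python
import torch

B, T, D, H = 2, 16, 2048, 256
lstm = torch.nn.LSTM(input_size=D, hidden_size=H, batch_first=True)

features = torch.randn(B, T, D)       # [B, T, D] sequence of CNN features
output, (h_n, c_n) = lstm(features)   # output: [B, T, H]; h_n: [num_layers, B, H]
last_hidden = h_n[-1]                 # [B, H] summary of the whole sequence
```

`h_n[-1]` is equivalent to `output[:, -1]` for a single-layer, unidirectional LSTM.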
- Regression Head
  - Linear layer -> `[B, output_dim]`. `output_dim = 2` for (x, y), or `3` for (x, y, z).
  - Loss: `MSELoss` or `SmoothL1Loss`.
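The head itself is a single linear layer on the LSTM summary. A sketch with assumed sizes and random targets in place of real labels:

```python
import torch

B, H, output_dim = 4, 256, 2          # output_dim = 2 for (x, y), 3 for (x, y, z)
head = torch.nn.Linear(H, output_dim)
criterion = torch.nn.SmoothL1Loss()   # or torch.nn.MSELoss()

hidden = torch.randn(B, H)            # last LSTM hidden state
pred = head(hidden)                   # [B, output_dim] predicted coordinates
loss = criterion(pred, torch.randn(B, output_dim))
```

`SmoothL1Loss` is the more robust default when coordinate labels can be noisy; `MSELoss` penalizes outliers more heavily.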
| Stage | Shape |
|---|---|
| Dataset sample | [T, C, H, W], target [2/3] |
| Batch (DataLoader) | [B, T, C, H, W], [B, 2/3] |
| CNN (per frame) | [B*T, F, Hf, Wf] |
| Flatten (no pooling) | [B*T, F*Hf*Wf] |
| Sequence reshape | [B, T, D] |
| LSTM output (final) | [B, H] |
| Regression output | [B, 2] or [B, 3] |
(For ResNet-50: F=2048, Hf=Wf=7 -> D=100,352 without pooling)
- Install requirements: `pip install -r requirements.txt`
- Train: `python train.py`
- Evaluate: `python eval.py --checkpoint experiments/latest.pth`
- Swap CNN backbone -> edit `models/cnn.py`.
- Swap sequence model (e.g., ConvLSTM, Transformer) -> `models/temporal.py`.
- Configure `output_dim` (2D or 3D coords) in `configs/default.yaml`.
- Implement real dataset loaders with coordinate labels.
- Add evaluation metrics (MAE, RMSE).
- Experiment with ConvLSTM for spatiotemporal features.
- Test real-time inference with rolling window.
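Rolling-window inference could be sketched with a fixed-size deque of the most recent T frames, running the model on the window as each new frame arrives. The `model` argument here is hypothetical, standing in for the trained CNN+LSTM pipeline:

```python
from collections import deque

import torch

T = 16
window = deque(maxlen=T)   # keeps only the most recent T frames

def on_new_frame(frame, model):
    """frame: [C, H, W] tensor; returns a prediction once the window is full."""
    window.append(frame)
    if len(window) < T:
        return None                       # not enough temporal context yet
    clip = torch.stack(list(window))      # [T, C, H, W]
    with torch.no_grad():
        return model(clip.unsqueeze(0))   # [1, T, C, H, W] -> [1, output_dim]
```

Re-running the LSTM over the full window every frame is simple but redundant; carrying the LSTM hidden state across frames instead would avoid the repeated computation.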