Ocelot

Pan-tilt face-tracking robot (Raspberry Pi 5), plus a simulated environment for training. The default tracker is a classical CV baseline (Haar cascade); a VLA model can be enabled with USE_VLA=true.

Hardware

Component	Detail
Compute	Raspberry Pi 5
Camera	Pi Camera V2 (CSI)
Servo driver	PCA9685 at I2C 0x40
Servos	SG90 — pan ch0 (0–180°), tilt ch1 (90–180°)
OS	Pi OS Bookworm (Python 3.11)

Quickstart

First time

# 1. Build Docker image
docker compose build

# 2. Launch — venv setup and ROS package build happen automatically on first run
docker compose up

Why the venv? capture_worker.py runs under the container's python3.11 (deadsnakes) so that picamera2/libcamera — compiled for Pi OS Bookworm's Python 3.11 — work correctly. The venv is created automatically on first docker compose up and persists in the project directory. See Troubleshooting if you need to force-recreate it.

Launch

docker compose up                          # classical tracker (Haar cascade)
VISUALIZE=true docker compose up           # classical tracker + annotated stream
USE_VLA=true docker compose up             # VLA model (INT8)
USE_VLA=true VISUALIZE=true docker compose up
USE_VLA=true VLA_COMMAND="look at the person" docker compose up

USE_VLA=true loads models/active.onnx — a symlink to the currently active INT8 model.

Switching models

To deploy a new trained model to the Pi, run make use-model from the project root. It quantizes best.onnx to INT8 (skipped if best_int8.onnx already exists), then updates the models/active.onnx symlink:

make use-model RUN=runs/v0.1.1-single-face
docker compose up

Quantization runs inside the robot container (~30 s, once per checkpoint). Always deploy INT8 — FP32 models are too slow for real-time tracking on Pi 5 CPU.
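
The symlink update can be sketched in Python as below. This is only an illustration, not the Makefile's actual recipe: the helper name point_active_at is hypothetical, and creating the link under a temporary name before renaming it is an assumption, chosen so that models/active.onnx never dangles while a launch is reading it.

```python
import os
from pathlib import Path


def point_active_at(model_path: str, link_path: str = "models/active.onnx") -> str:
    """Repoint link_path at model_path and return the new link target.

    Hypothetical helper; the real `make use-model` recipe may differ.
    """
    target = Path(model_path).resolve()
    if not target.is_file():
        raise FileNotFoundError(f"model not found: {target}")

    link = Path(link_path)
    link.parent.mkdir(parents=True, exist_ok=True)

    # Build the new symlink under a temporary name, then rename it over the
    # old one. os.replace is atomic on POSIX, so readers always see a valid link.
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp)
    os.replace(tmp, link)
    return os.readlink(link)
```

Called as point_active_at("runs/v0.1.1-single-face/best_int8.onnx"), it leaves models/active.onnx pointing at that checkpoint's INT8 file.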

To override the model at launch time without changing the symlink:

USE_VLA=true VLA_CHECKPOINT=/ws/src/ocelot/runs/v0.1.1-single-face/best_int8.onnx docker compose up

Editing Python source files does not require a rebuild (symlinks are live). Rebuilding is only needed when setup.py entry points change.

View streams

Stream	URL
Raw	`http://<pi-ip>:8080/stream?topic=/camera/image_raw`
Annotated	`http://<pi-ip>:8080/stream?topic=/camera/image_annotated`

Sim (dev machine)

make sim-build   # build the sim image (once, or after Dockerfile changes)
make sim         # headless — no GUI, fast, works on any machine
make sim-gui     # Gazebo GUI — software rendering (no GPU required)
make sim-gpu     # Gazebo GUI — GPU accelerated (requires NVIDIA runtime)
make sim-vla VLA_ONNX=runs/v0.1/best.onnx   # run trained VLA model in sim (GPU, requires NVIDIA runtime)
make sim-vla-eval  # eval VLA against N training-distribution scenarios (see below)
make sim-xauth   # one-time X11 auth setup (re-run if display session changes)
make sim-shell   # interactive shell in a fresh sim container

The colcon build is fast on repeat runs — named volumes (sim_build, sim_install) cache artifacts between container invocations.

After ~15 seconds the sim is fully up: the face billboard starts oscillating in both pan (Y) and tilt (Z), and the tracker follows it automatically. No manual steps needed.

Verify tracking is working from a second shell in the container:

ros2 topic echo /joint_states --field position   # pan/tilt positions should change

Episode runner (scenario generator)

The episode runner generates randomized scenarios — face textures, background, lighting, motion patterns, language labels — and drives them in a live Gazebo session. Use it to smoke-test the scenario generator before running full data collection.

Prerequisites (assets must exist before running):

# Face description JSONs (git-tracked — present after clone if committed)
ls sim/scenario_generator/face_descriptions*.json

# Face PNGs in sim/assets/faces/ and background PNGs in sim/assets/backgrounds/
# Pull from DVC if available:
dvc pull
# Or regenerate locally (backgrounds take seconds; faces require an AI image API):
make backgrounds                          # generates 6 plain-color PNGs; no API required
# make faces                             # generates descriptions + calls image API

Run a single episode inside the sim container:

make sim-shell   # open an interactive shell in a fresh sim container

# Inside the container — build, start sim in background, then run one episode
colcon build --symlink-install --packages-select ocelot --event-handlers console_direct-
source /ws/install/setup.bash
ros2 launch ocelot sim_launch.py world:=scenario_world headless:=true use_oracle:=true &
sleep 15   # wait for Gazebo + ros2_control to finish starting

python3 /ws/src/ocelot/sim/scenario_generator/run_one_episode.py --seed 42 --duration 10

Exit code 0 means the episode completed without error. The script prints the full scenario config, face positions every second, and final positions at teardown.

Run 10 sequential episodes (entity leak check):

for i in $(seq 0 9); do
    python3 /ws/src/ocelot/sim/scenario_generator/run_one_episode.py --seed $i --duration 5
done
# After all episodes: verify no leaked entities
gz model --list   # should show only: ground_plane, ocelot
gz light --list   # should be empty

Data collection

collect_parallel.sh generates episodes, runs them, and saves output. Output always goes to /ws/src/ocelot/sim/dataset (bind-mounted to sim/dataset/ on the host).

bash sim/data_gen/collect_parallel.sh --shards 7 --episodes 700

The script auto-detects the next unused shard index from the output directory, so re-running never overwrites existing data. Override with --start-shard N if needed.
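
A minimal sketch of how such shard auto-detection could work. The function name next_shard_index is hypothetical, and picking one past the highest existing index (rather than the first gap) is an assumption; collect_parallel.sh may use a different rule.

```python
import re
from pathlib import Path

# Directory names of the form shard_0, shard_1, ...
_SHARD_RE = re.compile(r"^shard_(\d+)$")


def next_shard_index(dataset_dir: str) -> int:
    """Return one past the highest existing shard_N index, or 0 if none exist."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return 0
    indices = [
        int(m.group(1))
        for p in root.iterdir()
        if p.is_dir() and (m := _SHARD_RE.match(p.name))
    ]
    return max(indices, default=-1) + 1
```

Directories that do not match shard_N (for example merged/) are ignored, so a merged dataset sitting in the same parent does not shift the index.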

After collection, verify a shard:

docker exec -e ROS_DOMAIN_ID=1 ocelot-sim-0 \
  python3 /ws/src/ocelot/sim/data_gen/check_dataset.py --dataset /ws/src/ocelot/sim/dataset/shard_0

Then merge all shards into one dataset on the host (not inside a container — merge only needs h5py from .venv):

source .venv/bin/activate
python3 sim/data_gen/merge_shards.py \
    --parent sim/dataset \
    --output sim/dataset/merged

collect_parallel.sh runs this automatically at the end of a full run. If containers were killed early, run it manually. The merger auto-discovers all shard_N/ directories, deduplicates episode IDs across shards, regenerates train/val/test splits, and writes sim/dataset/merged/.
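
The deduplication and split steps can be pictured roughly as follows. This is a sketch under stated assumptions, not merge_shards.py itself: it assumes colliding episode IDs are resolved by renumbering (not by dropping episodes), and the ep_NNNNNN ID format, the 80/10/10 split fractions, and both function names are hypothetical.

```python
import random


def shard_number(name: str) -> int:
    """Extract N from a directory name of the form shard_N."""
    return int(name.rsplit("_", 1)[1])


def assign_global_ids(shards: dict) -> dict:
    """Map each (shard_name, local_id) pair to a unique merged episode ID."""
    mapping = {}
    # Sort numerically so shard_10 comes after shard_2.
    for shard_name in sorted(shards, key=shard_number):
        for local_id in shards[shard_name]:
            key = (shard_name, local_id)
            if key not in mapping:
                mapping[key] = f"ep_{len(mapping):06d}"
    return mapping


def make_splits(ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle deterministically and cut into train/val/test lists."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
```

Regenerating the splits from the merged ID list, instead of concatenating per-shard splits, keeps the fractions stable no matter how many shards were collected.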

Training

Install training dependencies:

pip install -r requirements-train.txt

Pull the dataset with dvc pull (more than 75 GB of data).

Sweep

Hyperparameter sweep over lr × n_fusion_layers:

SWEEP=sweep-v0.1   # increment to avoid overwriting previous sweep checkpoints
for lr in 1e-4 3e-4 1e-3; do
  for layers in 1 2 4; do
    python3 train/train.py \
        --dataset_dir sim/dataset/ \
        --output_dir  runs/$SWEEP/lr${lr}_l${layers}/ \
        --epochs 3 \
        --lr $lr \
        --n_fusion_layers $layers \
        --batch_size 64 \
        --max_episodes 1500 \
        --amp \
        --experiment ocelot-sweep
  done
done

Each combo writes a separate checkpoint under runs/$SWEEP/ and a separate MLflow run under the ocelot-sweep experiment. View results sorted by val_loss:

mlflow ui    # http://localhost:5000 → experiment "ocelot-sweep"

Full train

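Full training run with AMP. Use the best lr, n_fusion_layers, and batch size found in the sweep; the command below does not set --lr or --n_fusion_layers explicitly, so add them as needed: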

python3 train/train.py \
    --dataset_dir sim/dataset/ \
    --output_dir  runs/v0.1.0/ \
    --epochs 20 \
    --batch_size 64 \
    --num_workers 12 \
    --amp \
    --experiment ocelot-v0.1.0

Track-only train

Train on single-face tracking episodes only (label_key=track), filtering out multi-face/attribute commands. Useful when the deployment only needs face tracking:

python3 train/train.py \
    --dataset_dir sim/dataset/ \
    --output_dir  runs/v0.2-track-only/ \
    --epochs 20 \
    --batch_size 64 \
    --num_workers 12 \
    --amp \
    --label_keys track \
    --experiment ocelot-v0.2-track-only

Inspect metrics:

mlflow ui    # open http://localhost:5000

The val_mse_<label_key> columns show the per-label breakdown (e.g. basic_track, multi_left). A good model reaches an RMSE below 0.015 rad/s per axis, which is less than 10% of the typical oracle signal.
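
MLflow shows MSE columns while the target is stated as RMSE, so a quick conversion helps when reading the table. The sketch below assumes the logged val_mse values are per-axis mean squared errors in (rad/s)^2; the function names are hypothetical. Under that assumption the 0.015 rad/s target corresponds to an MSE of 0.015^2 = 2.25e-4, and the "10% of the typical oracle signal" remark implies a typical signal of about 0.15 rad/s.

```python
import math

RMSE_TARGET = 0.015  # rad/s per axis, the target quoted in this README


def rmse_from_mse(val_mse: float) -> float:
    """Convert a mean squared error in (rad/s)^2 to an RMSE in rad/s."""
    return math.sqrt(val_mse)


def meets_target(val_mse: float, target: float = RMSE_TARGET) -> bool:
    """True when the RMSE implied by val_mse is below the target."""
    return rmse_from_mse(val_mse) < target
```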

Evaluate a checkpoint

source .venv/bin/activate

# Text report (RMSE, Pearson r, sign agreement, per-label breakdown):
python3 train/eval.py \
    --checkpoint runs/v0.0-smoke/best.pt \
    --dataset_dir sim/dataset/

# With scatter plot + 4 episode time-series overlays:
python3 train/eval.py \
    --checkpoint runs/v0.0-smoke/best.pt \
    --dataset_dir sim/dataset/ \
    --plot --episodes 4
# → runs/v0.0-smoke/scatter.png, runs/v0.0-smoke/episodes.png
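
For reference, the three headline metrics in the text report can be computed along these lines. This is a plain-Python sketch of the standard definitions, not the code in train/eval.py; in particular, skipping near-zero targets in the sign-agreement metric (the eps parameter) is an assumption.

```python
import math


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def pearson_r(pred, target):
    """Pearson correlation coefficient; NaN if either sequence is constant."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        return float("nan")
    return cov / math.sqrt(var_p * var_t)


def sign_agreement(pred, target, eps=1e-3):
    """Fraction of samples where pred and target share a sign.

    Samples with |target| < eps are skipped, since the sign of a
    near-zero command carries no information (an assumption).
    """
    pairs = [(p, t) for p, t in zip(pred, target) if abs(t) >= eps]
    if not pairs:
        return float("nan")
    return sum(1 for p, t in pairs if p * t > 0) / len(pairs)
```

Sign agreement is a useful sanity check for a velocity controller: a model with mediocre RMSE but high sign agreement still turns toward the face, just at the wrong speed.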

VLA sim validation

Run the trained model inside Gazebo in closed loop: the model sees each live camera frame and its output drives the pan-tilt joints. The face billboard oscillates automatically so there is always something to track.

Step 1 — Export to ONNX (host, one-time per checkpoint):

source .venv/bin/activate
python3 train/export_onnx.py \
    --checkpoint runs/v0.0-smoke/best.pt \
    --output     runs/v0.0-smoke/best.onnx \
    --verify
# → best.onnx + best_tokens.json alongside the checkpoint

Step 2 — Rebuild sim image (needed once after the Dockerfile changed to onnxruntime-gpu):

make sim-build

Step 3 — Launch sim with VLA node (CPU):

docker compose -f deploy/docker/docker-compose.sim.yml run --rm sim bash -c "
  source /opt/ros/jazzy/setup.bash && cd /ws &&
  colcon build --symlink-install --packages-select ocelot --event-handlers console_direct- &&
  source /ws/install/setup.bash &&
  ros2 launch ocelot sim_launch.py use_vla:=true headless:=true
"

Step 3 (GPU) — Launch sim with VLA node on NVIDIA GPU (requires NVIDIA container runtime):

# Run VLA in sim
make sim-vla VLA_ONNX=runs/sweep-v0.0.2-1500-ep/lr1e-4_l2/best.onnx

# Evaluate against N reproducible scenarios (optional: override seed and count)
make sim-vla-eval VLA_ONNX=runs/sweep-v0.0.2-1500-ep/lr1e-4_l2/best.onnx SCENARIO_SEED=0 N_SCENARIOS=5

The vla_node logs which ONNX provider it is using on startup:

ONNX session ready (provider: CUDAExecutionProvider)   # GPU
ONNX session ready (provider: CPUExecutionProvider)    # CPU fallback

The default checkpoint path inside the container is /ws/src/ocelot/runs/sweep-v0.0.2-1500-ep/lr1e-4_l2/best.onnx (the runs/ directory is bind-mounted from the host). Override checkpoint or command:

ros2 launch ocelot sim_launch.py use_vla:=true headless:=true \
    vla_checkpoint:=/ws/src/ocelot/runs/v0.1/best.onnx \
    vla_command:="track the face"

Monitor from a second shell in the same container:

# Joint positions should change as the face oscillates
ros2 topic echo /joint_states --field position

# VLA velocity commands
ros2 topic echo /cmd_vel

The vla_node logs pan=+0.xxx tilt=+0.xxx rad/s per frame. If the joints track the face motion, behavioral cloning is working. If output is near-zero or static, the model needs more training data or epochs.

VLA live evaluation (training-distribution scenarios)

sim-vla-eval tests the VLA against N reproducible scenarios drawn from the same distribution used for data collection — varied face textures, backgrounds, lighting, motion patterns, and distractors. It measures FK angular error while the model drives the robot and prints a pass/fail table.

# 5 scenarios from seed 0 (default)
make sim-vla-eval VLA_ONNX=runs/v0.1/best.onnx

# More scenarios, different seed range
make sim-vla-eval VLA_ONNX=runs/v0.1/best.onnx SCENARIO_SEED=50 N_SCENARIOS=10

The script waits up to 90 s for Gazebo and the VLA node to publish before starting — no manual sleep needed. Each scenario runs a 4 s warmup (VLA convergence) then a 10 s measurement window. Output:

Scenario 1/5  seed=0  motion=sinusoidal  label=basic_track
  mean=3.2°  max=8.7°  n=100  [PASS]
...
--- Summary ---
  #    seed  motion            label           mean°   max°  pass
  1       0  sinusoidal        basic_track       3.2     8.7  Y
  ...
Overall: mean=4.1°  pass_rate=80%  (threshold=10.0°)

Pass threshold: mean FK angular error < 10°. A well-trained model should achieve < 5° mean and > 80% pass rate.
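
The pass/fail bookkeeping behind that table can be sketched as follows. The names are hypothetical, and treating the overall mean as the mean of per-scenario means is an assumption; the eval script may instead pool all samples.

```python
from statistics import mean

PASS_THRESHOLD_DEG = 10.0  # threshold quoted in this README


def score_scenario(errors_deg, threshold=PASS_THRESHOLD_DEG):
    """Summarize one measurement window of FK angular errors (degrees)."""
    m = mean(errors_deg)
    return {
        "mean": m,
        "max": max(errors_deg),
        "n": len(errors_deg),
        "passed": m < threshold,
    }


def summarize(results):
    """Return (overall mean error, pass rate) across scored scenarios."""
    overall = mean(r["mean"] for r in results)
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return overall, pass_rate
```

A scenario passes on its mean error, so a brief spike (a large max) does not fail a run on its own.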

Rosbag

Bags are stored in ./bags/ (bind-mounted to /ws/bags/ in the container).

Record

With the stack running (docker compose up), open a second terminal and record:

docker compose exec ocelot bash -i -c "
  ros2 bag record \
    --storage mcap \
    --compression-mode file \
    --compression-format zstd \
    -o /ws/bags/my_session \
    /camera/image_raw /cmd_vel
"

Press Ctrl+C to stop recording cleanly. The bag lands in ./bags/my_session/ on the host.

Playback

Stop the main stack first (avoids topic conflicts with camera_node), then:

docker compose run --rm ocelot bash -i -c "
  ros2 run web_video_server web_video_server &
  ros2 run ocelot visualizer_node &
  ros2 bag play /ws/bags/my_session --loop
"

Stream	URL
Raw	`http://<pi-ip>:8080/stream?topic=/camera/image_raw`
Annotated	`http://<pi-ip>:8080/stream?topic=/camera/image_annotated`

The annotated stream shows face bounding box, center crosshair, error vector, deadband circle, and cmd_vel values.

Inspect

docker compose exec ocelot bash -i -c "ros2 bag info /ws/bags/my_session"

Architecture

Classical tracker:

camera_node ──/camera/image_raw──▶ tracker_node ──/cmd_vel──▶ servo_node
 (picamera2)   sensor_msgs/Image    (Haar cascade)  Twist       (PCA9685)
                      │
                      └──▶ visualizer_node ──/camera/image_annotated──▶ web_video_server
                              (optional)

VLA mode (USE_VLA=true):

camera_node ──/camera/image_raw──▶ vla_node ──/cmd_vel──▶ servo_node
 (picamera2)   sensor_msgs/Image    (ONNX INT8)  Twist       (PCA9685)

tracker_node and vla_node are mutually exclusive — only one runs at a time.

Nodes

camera_node — Captures 640×480 RGB frames from Pi Camera V2. Because libcamera's Python bindings are compiled for Python 3.11 (Pi OS Bookworm) and the ROS container uses Python 3.12, capture runs in a python3.11 subprocess (capture_worker.py) communicating frames to the node via a length-prefixed pipe.
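
Length-prefixed framing of this kind can be sketched as below. The 4-byte big-endian header and the function names are assumptions for illustration; the actual header layout used between capture_worker.py and camera_node is not documented here.

```python
import struct

# Assumed header: one unsigned 32-bit big-endian payload length.
HEADER = struct.Struct(">I")


def write_frame(stream, payload: bytes) -> None:
    """Write one frame: length header followed by the raw bytes."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)
    stream.flush()


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes; pipes may return short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("pipe closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def read_frame(stream) -> bytes:
    """Read one frame written by write_frame."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length)
```

The length prefix lets the reader know where one 640×480 RGB frame (921,600 bytes) ends and the next begins without scanning for a delimiter that could also occur in pixel data.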

servo_node — Subscribes to /cmd_vel (geometry_msgs/Twist). Integrates angular.z (pan) and angular.y (tilt) velocity at 30 Hz into servo positions via adafruit-circuitpython-servokit. Centers on shutdown.
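
The velocity-to-position integration can be pictured with the sketch below. It assumes velocities are in rad/s and servo angles in degrees, uses the angle limits from the Hardware table, and ignores any sign flips the real node may apply; the function names are hypothetical.

```python
import math

RATE_HZ = 30.0               # integration rate stated above
PAN_LIMITS = (0.0, 180.0)    # degrees, from the Hardware table
TILT_LIMITS = (90.0, 180.0)  # degrees, from the Hardware table


def integrate_axis(angle_deg, vel_rad_s, limits, dt=1.0 / RATE_HZ):
    """Advance one servo angle by one timestep and clamp to its limits."""
    lo, hi = limits
    return max(lo, min(hi, angle_deg + math.degrees(vel_rad_s) * dt))


def step(pan_deg, tilt_deg, angular_z, angular_y):
    """One 30 Hz tick: angular.z drives pan, angular.y drives tilt."""
    return (
        integrate_axis(pan_deg, angular_z, PAN_LIMITS),
        integrate_axis(tilt_deg, angular_y, TILT_LIMITS),
    )
```

At 1.0 rad/s, one tick moves a servo by about 1.9°, so a full 180° pan sweep takes roughly three seconds.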

tracker_node — Subscribes to /camera/image_raw, runs Haar cascade face detection, publishes velocity commands to /cmd_vel and bounding box to /tracking/face_roi. Key params: kp_pan, kp_tilt, deadband, min_neighbors, min_face_size. Disabled when USE_VLA=true.
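
A proportional controller with a deadband, of the kind these parameters suggest, might look like the sketch below. The gain and deadband values, the normalization of the pixel error to [-1, 1], the radial deadband (suggested by the "deadband circle" in the annotated stream), and the sign convention are all assumptions, not values from config/tracker_params.yaml.

```python
import math


def face_to_cmd(face_cx, face_cy, width=640, height=480,
                kp_pan=0.5, kp_tilt=0.5, deadband=0.1):
    """Map a detected face center (pixels) to (pan, tilt) velocity commands."""
    # Normalized error: 0 at the image center, +/-1 at the edges.
    err_x = (face_cx - width / 2) / (width / 2)
    err_y = (face_cy - height / 2) / (height / 2)

    # Inside the deadband circle: hold still to avoid jitter.
    if math.hypot(err_x, err_y) < deadband:
        return 0.0, 0.0

    # Proportional control; the negative sign is an assumed convention.
    return -kp_pan * err_x, -kp_tilt * err_y
```

The deadband trades a small steady-state offset for stillness: without it, detection noise around the image center would keep the servos twitching.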

vla_node — Subscribes to /camera/image_raw, runs the trained ONNX model (DINOv2-small + CLIP text encoder + action head), publishes /cmd_vel. Enabled when USE_VLA=true. Key params: checkpoint, token_cache, command, max_vel, max_accel.
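
One plausible reading of max_vel and max_accel is a clamp followed by a rate limit on each axis, sketched below. How vla_node actually applies these parameters is not documented here, so treat the function and its semantics as assumptions.

```python
def limit_command(target, previous, max_vel, max_accel, dt):
    """Clamp a velocity command and limit its change per timestep.

    target and previous are in rad/s, max_accel in rad/s^2, dt in seconds.
    """
    # Velocity clamp.
    target = max(-max_vel, min(max_vel, target))
    # Acceleration limit: allow at most max_accel * dt change per step.
    max_step = max_accel * dt
    delta = max(-max_step, min(max_step, target - previous))
    return previous + delta
```

Rate limiting smooths out frame-to-frame jumps in the raw model output, which matters for small hobby servos such as the SG90.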

visualizer_node — Subscribes to /camera/image_raw, /tracking/face_roi, and /cmd_vel; publishes annotated frames to /camera/image_annotated. Optional — launch with visualize:=true.

Validate

# I2C: the scan grid should show 40 (the PCA9685 at address 0x40)
i2cdetect -y 1

# Manual servo via ROS topic
ros2 topic pub --once /cmd_vel geometry_msgs/Twist \
  "{angular: {z: 1.0, y: 0.0}}"

# Confirm topics, publish rates, and velocity commands
ros2 topic list
ros2 topic hz /camera/image_raw    # expect ~15 Hz
ros2 topic echo /cmd_vel --no-arr

Project Structure

ocelot/
├── ocelot/
│   ├── camera_node.py       # ROS node (py3.12), spawns capture_worker
│   ├── capture_worker.py    # picamera2 capture (py3.11 subprocess)
│   ├── servo_node.py        # PCA9685 servo control
│   ├── tracker_node.py      # Haar cascade proportional controller
│   ├── oracle_node.py       # Privileged ground-truth FK tracker (sim only)
│   ├── oracle_validator.py  # Pixel-error measurement for oracle validation
│   ├── vla_node.py          # ONNX inference node for sim validation (Phase 3)
│   └── visualizer_node.py   # Annotated image publisher (optional)
├── launch/tracker_launch.py
├── config/tracker_params.yaml
├── urdf/pan_tilt.urdf
├── bags/                    # rosbag recordings (gitignored)
├── scripts/                 # bare-metal validation (no ROS needed)
│   ├── test_servos.py
│   └── test_tracking_manual.py
├── deploy/docker/
│   ├── Dockerfile.robot     # robot deployment image (Pi 5)
│   └── docker-compose.yml   # compose config (relative to deploy/docker/)
├── docker-compose.yml       # convenience wrapper — includes deploy/docker/
├── package.xml              # ament_python
├── setup.py
└── setup.cfg

Troubleshooting

`haarcascade_frontalface_default.xml not found`

The apt python3-opencv package does not bundle cascade data files. opencv-data must also be installed — it provides the cascade XMLs at /usr/share/opencv4/haarcascades/. This is already in Dockerfile.sim. If you see this error after a rebuild, check that both python3-opencv and opencv-data are present in the apt install section.

`ImportError: libturbojpeg.so.0: cannot open shared object file`

simplejpeg (required by picamera2's JPEG encoder) needs libturbojpeg from the host. Check that deploy/docker/docker-compose.yml bind-mounts /usr/lib/aarch64-linux-gnu/libturbojpeg.so.0 from the host.

`ModuleNotFoundError: No module named 'v4l2'`

picamera2 imports v4l2 for sensor mode enumeration. The file lives at /usr/lib/python3/dist-packages/v4l2.py on the host and must be bind-mounted into the container. Check deploy/docker/docker-compose.yml.

Stale or incompatible `.venv`

If the .venv was created outside the container (host Pi OS Python 3.11 has a different ABI from deadsnakes), or if numpy/simplejpeg compatibility breaks after a Pi OS update, delete and recreate it inside the container:

docker compose run --rm ocelot bash -i -c "
  rm -rf /ws/src/ocelot/.venv &&
  python3.11 -m venv --without-pip /ws/src/ocelot/.venv &&
  /ws/src/ocelot/.venv/bin/python3.11 -c 'import urllib.request; exec(urllib.request.urlopen(\"https://bootstrap.pypa.io/get-pip.py\").read())' &&
  /ws/src/ocelot/.venv/bin/pip install -r /ws/src/ocelot/requirements-worker.txt
"

Then run docker compose up as normal; the recreated venv is picked up on the next run.

`pip dependency resolver` warning about `pyyaml` / `launch-ros`

Harmless. pip can see ROS packages in the environment and warns about missing deps for them. The worker venv doesn't need them — ignore it.

Annotated stream blank / `visualizer_node` missing from `ros2 node list`

If VISUALIZE=true docker compose up starts only 4 nodes (no visualizer_node), the colcon install directory is stale. Run a rebuild inside the container then restart:

docker compose run --rm ocelot bash -i -c "cd /ws && colcon build --packages-select ocelot --symlink-install"
VISUALIZE=true docker compose up

This is needed whenever a new entry point is added to setup.py.

Sim (Gazebo) — `docker-compose.sim.yml`

Gazebo window appears but freezes / not responding

Root cause: Gazebo transport tries multicast peer discovery on all interfaces when GZ_IP is unset. The GUI event loop blocks waiting for the server handshake — the window frame appears (Qt init succeeds) but hangs before the scene loads.

Fix (already in docker-compose.sim.yml):

environment:
  - GZ_IP=127.0.0.1

This binds Gazebo transport to loopback only, so server↔GUI discovery resolves instantly.

Gazebo window is black / empty world

Root cause: Docker's default /dev/shm is 64 MB — too small for Gazebo's OGRE renderer, which transfers render buffers between server and GUI via shared memory.

Fix (already in docker-compose.sim.yml):

shm_size: '2g'
ipc: host
environment:
  - QT_X11_NO_MITSHM=1

X11 auth: container (root) refused by X server

Run sudo make sim-xauth once (re-run if the display session changes). The compose file mounts /tmp/.docker.xauth and sets XAUTHORITY=/tmp/.docker.xauth.

`MESA: error: ZINK: vkCreateInstance failed` / software rendering

The jazzy-simulation base image doesn't include Vulkan ICDs, so OGRE logs this and falls back to software OpenGL (llvmpipe). This is expected and harmless when running without the GPU overlay — the sim works but renders on CPU.

To switch to GPU-accelerated rendering (NVIDIA), run make sim-gpu as listed in the Sim section above; it requires the NVIDIA container runtime.

Phase Roadmap

Phase	Weeks	Goal
1	1–4	Classical face tracker (Haar cascade) — complete
2	5–8	Gazebo sim + synthetic data engine — complete
3	9–13	VLA model (DINOv2 + CLIP + action head) — complete
4	14–18	Edge deployment + MLOps loop — current
5	19–20	Polish + portfolio
