Pan-tilt face tracking robot (Raspberry Pi 5), plus a simulated environment for training. Classical CV baseline with Haar cascade for now; VLA model coming soon.
| Component | Detail |
|---|---|
| Compute | Raspberry Pi 5 |
| Camera | Pi Camera V2 (CSI) |
| Servo driver | PCA9685 at I2C 0x40 |
| Servos | SG90 — pan ch0 (0–180°), tilt ch1 (90–180°) |
| OS | Pi OS Bookworm (Python 3.11) |
# 1. Build Docker image
docker compose build
# 2. Launch — venv setup and ROS package build happen automatically on first run
docker compose up

Why the venv?

capture_worker.py runs under the container's python3.11 (deadsnakes) so that picamera2/libcamera — compiled for Pi OS Bookworm's Python 3.11 — work correctly. The venv is created automatically on first docker compose up and persists in the project directory. See Troubleshooting if you need to force-recreate it.
docker compose up # classical tracker (Haar cascade)
VISUALIZE=true docker compose up # classical tracker + annotated stream
USE_VLA=true docker compose up # VLA model (INT8)
USE_VLA=true VISUALIZE=true docker compose up
USE_VLA=true VLA_COMMAND="look at the person" docker compose up

USE_VLA=true loads models/active.onnx — a symlink to the currently active INT8 model.
To deploy a new trained model to the Pi, run make use-model from the project root. It
quantizes best.onnx to INT8 (skipped if best_int8.onnx already exists), then updates
the models/active.onnx symlink:
make use-model RUN=runs/v0.1.1-single-face
docker compose up

Quantization runs inside the robot container (~30 s, once per checkpoint). Always deploy INT8 — FP32 models are too slow for real-time tracking on the Pi 5 CPU.
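The deploy flow described above (quantize once, then repoint the symlink) can be sketched roughly as follows. This is a minimal illustration, not the actual Makefile recipe: `use_model` and the injected `quantize` callable are hypothetical names, and swapping the link through a temporary name is an assumption about how one might keep the update atomic.

```python
import os
from pathlib import Path
from typing import Callable


def use_model(run_dir: Path, models_dir: Path,
              quantize: Callable[[Path, Path], None]) -> Path:
    """Quantize best.onnx once per run, then repoint models/active.onnx.

    `quantize(src, dst)` is a hypothetical stand-in for the real INT8
    quantization step; it must write the INT8 model to `dst`.
    """
    fp32 = run_dir / "best.onnx"
    int8 = run_dir / "best_int8.onnx"
    if not int8.exists():  # skipped when the INT8 file already exists
        quantize(fp32, int8)

    models_dir.mkdir(parents=True, exist_ok=True)
    active = models_dir / "active.onnx"
    tmp = models_dir / "active.onnx.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(int8.resolve())
    os.replace(tmp, active)  # rename over the old link in one step
    return active
```

Because the old link is replaced by a rename rather than deleted first, a reader of models/active.onnx should never observe a missing file during the switch.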
To override the model at launch time without changing the symlink:
USE_VLA=true VLA_CHECKPOINT=/ws/src/ocelot/runs/v0.1.1-single-face/best_int8.onnx docker compose up

Editing Python source files does not require a rebuild (symlinks are live). Rebuilding
is only needed when setup.py entry points change.
| Stream | URL |
|---|---|
| Raw | http://<pi-ip>:8080/stream?topic=/camera/image_raw |
| Annotated | http://<pi-ip>:8080/stream?topic=/camera/image_annotated |
make sim-build # build the sim image (once, or after Dockerfile changes)
make sim # headless — no GUI, fast, works on any machine
make sim-gui # Gazebo GUI — software rendering (no GPU required)
make sim-gpu # Gazebo GUI — GPU accelerated (requires NVIDIA runtime)
make sim-vla VLA_ONNX=runs/v0.1/best.onnx # run trained VLA model in sim (GPU, requires NVIDIA runtime)
make sim-vla-eval # eval VLA against N training-distribution scenarios (see below)
make sim-xauth # one-time X11 auth setup (re-run if display session changes)
make sim-shell # interactive shell in a fresh sim container

The colcon build is fast on repeat runs — named volumes (sim_build, sim_install) cache artifacts between container invocations.
After ~15 seconds the sim is fully up: the face billboard starts oscillating in both pan (Y) and tilt (Z), and the tracker follows it automatically. No manual steps needed.
Verify tracking is working from a second shell in the container:
ros2 topic echo /joint_states --field position # pan/tilt positions should change

The episode runner generates randomized scenarios — face textures, background, lighting, motion patterns, language labels — and drives them in a live Gazebo session. Use it to smoke-test the scenario generator before running full data collection.
Prerequisites (assets must exist before running):
# Face description JSONs (git-tracked — present after clone if committed)
ls sim/scenario_generator/face_descriptions*.json
# Face PNGs in sim/assets/faces/ and background PNGs in sim/assets/backgrounds/
# Pull from DVC if available:
dvc pull
# Or regenerate locally (backgrounds take seconds; faces require an AI image API):
make backgrounds # generates 6 plain-color PNGs; no API required
# make faces # generates descriptions + calls image API

Run a single episode inside the sim container:
make sim-shell # open an interactive shell in a fresh sim container
# Inside the container — build, start sim in background, then run one episode
colcon build --symlink-install --packages-select ocelot --event-handlers console_direct-
source /ws/install/setup.bash
ros2 launch ocelot sim_launch.py world:=scenario_world headless:=true use_oracle:=true &
sleep 15 # wait for Gazebo + ros2_control to finish starting
python3 /ws/src/ocelot/sim/scenario_generator/run_one_episode.py --seed 42 --duration 10

Exit code 0 means the episode completed without error. The script prints the full scenario config, face positions every second, and final positions at teardown.
Run 10 sequential episodes (entity leak check):
for i in $(seq 0 9); do
python3 /ws/src/ocelot/sim/scenario_generator/run_one_episode.py --seed $i --duration 5
done
# After all episodes: verify no leaked entities
gz model --list # should show only: ground_plane, ocelot
gz light --list # should be empty

collect_parallel.sh generates episodes, runs them, and saves output. Output always goes to
/ws/src/ocelot/sim/dataset (bind-mounted to sim/dataset/ on the host).
bash sim/data_gen/collect_parallel.sh --shards 7 --episodes 700

The script auto-detects the next unused shard index from the output directory, so re-running
never overwrites existing data. Override with --start-shard N if needed.
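One way the shard auto-detection could work is sketched below. It assumes shards live in `shard_N/` directories directly under the output directory (the naming used elsewhere in this README) and that "next unused" means one past the highest existing index; `next_shard_index` is a hypothetical helper, not the script's actual code.

```python
import re
from pathlib import Path

_SHARD_RE = re.compile(r"shard_(\d+)")


def next_shard_index(dataset_dir: Path) -> int:
    """Return one past the highest existing shard_N directory index."""
    if not dataset_dir.is_dir():
        return 0
    highest = -1
    for entry in dataset_dir.iterdir():
        match = _SHARD_RE.fullmatch(entry.name)
        # Only directories named exactly shard_<digits> count as shards.
        if match and entry.is_dir():
            highest = max(highest, int(match.group(1)))
    return highest + 1
```

Taking the maximum rather than the first gap means a deleted shard in the middle is never refilled, which keeps later shards from being mistaken for earlier ones.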
After collection, verify a shard:
docker exec -e ROS_DOMAIN_ID=1 ocelot-sim-0 \
python3 /ws/src/ocelot/sim/data_gen/check_dataset.py --dataset /ws/src/ocelot/sim/dataset/shard_0

Then merge all shards into one dataset on the host (not inside a container — merge only needs h5py from .venv):
source .venv/bin/activate
python3 sim/data_gen/merge_shards.py \
--parent sim/dataset \
--output sim/dataset/merged

collect_parallel.sh runs this automatically at the end of a full run. If containers were killed early, run it manually. The merger auto-discovers all shard_N/ directories, deduplicates episode IDs across shards, regenerates train/val/test splits, and writes sim/dataset/merged/.
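The dedupe-and-resplit logic can be illustrated with plain episode ID lists. This is only a sketch under stated assumptions: deduplication is taken here to mean "keep the first occurrence" (the real merger may renumber colliding IDs instead), the 10%/10% validation/test fractions and the seed are made up, and both function names are hypothetical.

```python
import random
from typing import Dict, Iterable, List


def merge_episode_ids(shards: Dict[str, List[str]]) -> List[str]:
    """Concatenate shard episode IDs in shard-name order, dropping repeats."""
    seen = set()
    merged: List[str] = []
    for shard_name in sorted(shards):
        for episode_id in shards[shard_name]:
            if episode_id not in seen:
                seen.add(episode_id)
                merged.append(episode_id)
    return merged


def make_splits(episode_ids: Iterable[str], val_frac: float = 0.1,
                test_frac: float = 0.1, seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle deterministically and cut into train/val/test (assumed ratios)."""
    ids = sorted(episode_ids)          # sort first so the seed fully decides order
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }
```

Regenerating the splits from the merged ID list, rather than concatenating per-shard splits, keeps an episode from landing in two splits at once.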
Install training dependencies:
pip install -r requirements-train.txt

Pull dataset using DVC (>75 GB data) with dvc pull.
Hyperparameter sweep over lr × n_fusion_layers:
SWEEP=sweep-v0.1 # increment to avoid overwriting previous sweep checkpoints
for lr in 1e-4 3e-4 1e-3; do
for layers in 1 2 4; do
python3 train/train.py \
--dataset_dir sim/dataset/ \
--output_dir runs/$SWEEP/lr${lr}_l${layers}/ \
--epochs 3 \
--lr $lr \
--n_fusion_layers $layers \
--batch_size 64 \
--max_episodes 1500 \
--amp \
--experiment ocelot-sweep
done
done

Each combo writes a separate checkpoint under runs/$SWEEP/ and a separate MLflow run under the ocelot-sweep experiment. View results sorted by val_loss:
mlflow ui # http://localhost:5000 → experiment "ocelot-sweep"

Full training run with AMP (use best lr/layers/bs from sweep):
python3 train/train.py \
--dataset_dir sim/dataset/ \
--output_dir runs/v0.1.0/ \
--epochs 20 \
--batch_size 64 \
--num_workers 12 \
--amp \
--experiment ocelot-v0.1.0

Train on single-face tracking episodes only (label_key=track), filtering out
multi-face/attribute commands. Useful when the deployment only needs face tracking:
python3 train/train.py \
--dataset_dir sim/dataset/ \
--output_dir runs/v0.2-track-only/ \
--epochs 20 \
--batch_size 64 \
--num_workers 12 \
--amp \
--label_keys track \
--experiment ocelot-v0.2-track-only

Inspect metrics:
mlflow ui # open http://localhost:5000

val_mse_<label_key> columns show per-label breakdown (e.g. basic_track, multi_left).
A good model reaches RMSE < 0.015 rad/s per axis (< 10% of the typical oracle signal).
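To make the 0.015 rad/s target concrete, here is a small sketch of a per-axis RMSE check over (pan, tilt) velocity pairs. The function names and the tuple layout are illustrative assumptions; train/eval.py may compute and report the metric differently.

```python
import math
from typing import Sequence, Tuple

RMSE_TARGET = 0.015  # rad/s per axis, the target quoted above

Velocity = Tuple[float, float]  # (pan, tilt) in rad/s; layout is an assumption


def per_axis_rmse(pred: Sequence[Velocity],
                  target: Sequence[Velocity]) -> Tuple[float, float]:
    """Root-mean-square error computed separately for pan and tilt."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equal length")
    n = len(pred)
    sq_pan = sum((p[0] - t[0]) ** 2 for p, t in zip(pred, target))
    sq_tilt = sum((p[1] - t[1]) ** 2 for p, t in zip(pred, target))
    return math.sqrt(sq_pan / n), math.sqrt(sq_tilt / n)


def meets_target(pred: Sequence[Velocity], target: Sequence[Velocity],
                 threshold: float = RMSE_TARGET) -> bool:
    """True only if both axes are under the threshold."""
    return all(rmse < threshold for rmse in per_axis_rmse(pred, target))
```

Checking each axis separately matters because a model can track pan well while drifting in tilt, and a combined error would hide that.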
source .venv/bin/activate
# Text report (RMSE, Pearson r, sign agreement, per-label breakdown):
python3 train/eval.py \
--checkpoint runs/v0.0-smoke/best.pt \
--dataset_dir sim/dataset/
# With scatter plot + 4 episode time-series overlays:
python3 train/eval.py \
--checkpoint runs/v0.0-smoke/best.pt \
--dataset_dir sim/dataset/ \
--plot --episodes 4
# → runs/v0.0-smoke/scatter.png, runs/v0.0-smoke/episodes.png

Run the trained model inside Gazebo in closed-loop: the model sees each live camera frame and its output drives the pan-tilt joints. The face billboard oscillates automatically so there is always something to track.
Step 1 — Export to ONNX (host, one-time per checkpoint):
source .venv/bin/activate
python3 train/export_onnx.py \
--checkpoint runs/v0.0-smoke/best.pt \
--output runs/v0.0-smoke/best.onnx \
--verify
# → best.onnx + best_tokens.json alongside the checkpoint

Step 2 — Rebuild sim image (needed once after the Dockerfile changed to onnxruntime-gpu):
make sim-build

Step 3 — Launch sim with VLA node (CPU):
docker compose -f deploy/docker/docker-compose.sim.yml run --rm sim bash -c "
source /opt/ros/jazzy/setup.bash && cd /ws &&
colcon build --symlink-install --packages-select ocelot --event-handlers console_direct- &&
source /ws/install/setup.bash &&
ros2 launch ocelot sim_launch.py use_vla:=true headless:=true
"

Step 3 (GPU) — Launch sim with VLA node on NVIDIA GPU (requires NVIDIA container runtime):
# Run VLA in sim
make sim-vla VLA_ONNX=runs/sweep-v0.0.2-1500-ep/lr1e-4_l2/best.onnx
# Evaluate against N reproducible scenarios (optional: override seed and count)
make sim-vla-eval VLA_ONNX=runs/sweep-v0.0.2-1500-ep/lr1e-4_l2/best.onnx SCENARIO_SEED=0 N_SCENARIOS=5

The vla_node logs which ONNX provider it is using on startup:
ONNX session ready (provider: CUDAExecutionProvider) # GPU
ONNX session ready (provider: CPUExecutionProvider) # CPU fallback
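The GPU-with-CPU-fallback behaviour behind those log lines can be sketched as a simple preference list. `pick_providers` is a hypothetical helper, not vla_node's actual code; the commented lines show how it would plug into onnxruntime, whose `get_available_providers()` and `InferenceSession(..., providers=...)` calls are real API.

```python
from typing import List, Sequence

# Preference order: try CUDA first, fall back to CPU.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Sequence[str]) -> List[str]:
    """Return the preferred providers that this onnxruntime build offers."""
    chosen = [name for name in PREFERRED if name in available]
    if not chosen:
        raise RuntimeError(f"no usable ONNX provider in {list(available)}")
    return chosen


# Usage with onnxruntime (not executed here; needs the package and a model):
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("best.onnx", providers=providers)
#   print(f"ONNX session ready (provider: {session.get_providers()[0]})")
```

Passing the full ordered list lets onnxruntime itself fall back to CPU if CUDA is listed but fails to initialise.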
The default checkpoint path inside the container is /ws/src/ocelot/runs/sweep-v0.0.2-1500-ep/lr1e-4_l2/best.onnx
(the runs/ directory is bind-mounted from the host). Override checkpoint or command:
ros2 launch ocelot sim_launch.py use_vla:=true headless:=true \
vla_checkpoint:=/ws/src/ocelot/runs/v0.1/best.onnx \
vla_command:="track the face"

Monitor from a second shell in the same container:
# Joint positions should change as the face oscillates
ros2 topic echo /joint_states --field position
# VLA velocity commands
ros2 topic echo /cmd_vel

The vla_node logs pan=+0.xxx tilt=+0.xxx rad/s per frame. If the joints track
the face motion, behavioral cloning is working. If output is near-zero or static,
the model needs more training data or epochs.
sim-vla-eval tests the VLA against N reproducible scenarios drawn from the same
distribution used for data collection — varied face textures, backgrounds, lighting,
motion patterns, and distractors. It measures FK angular error while the model drives
the robot and prints a pass/fail table.
# 5 scenarios from seed 0 (default)
make sim-vla-eval VLA_ONNX=runs/v0.1/best.onnx
# More scenarios, different seed range
make sim-vla-eval VLA_ONNX=runs/v0.1/best.onnx SCENARIO_SEED=50 N_SCENARIOS=10

The script waits up to 90 s for Gazebo and the VLA node to publish before starting — no manual sleep needed. Each scenario runs a 4 s warmup (VLA convergence) then a 10 s measurement window. Output:
Scenario 1/5 seed=0 motion=sinusoidal label=basic_track
mean=3.2° max=8.7° n=100 [PASS]
...
--- Summary ---
# seed motion label mean° max° pass
1 0 sinusoidal basic_track 3.2 8.7 Y
...
Overall: mean=4.1° pass_rate=80% (threshold=10.0°)
Pass threshold: mean FK angular error < 10°. A well-trained model should achieve < 5° mean and > 80% pass rate.
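The pass/fail arithmetic behind that table can be sketched as below. The 10° threshold comes from the text; everything else is an assumption for illustration, in particular the `ScenarioResult` class and the choice to report the overall mean as the average of per-scenario means.

```python
from dataclasses import dataclass
from typing import List, Tuple

PASS_THRESHOLD_DEG = 10.0  # mean FK angular error threshold from the text


@dataclass
class ScenarioResult:
    seed: int
    errors_deg: List[float]  # angular error samples from the measurement window

    @property
    def mean_deg(self) -> float:
        return sum(self.errors_deg) / len(self.errors_deg)

    @property
    def max_deg(self) -> float:
        return max(self.errors_deg)

    @property
    def passed(self) -> bool:
        return self.mean_deg < PASS_THRESHOLD_DEG


def summarize(results: List[ScenarioResult]) -> Tuple[float, float]:
    """Return (overall mean error in degrees, pass rate in [0, 1])."""
    overall_mean = sum(r.mean_deg for r in results) / len(results)
    pass_rate = sum(r.passed for r in results) / len(results)
    return overall_mean, pass_rate
```

Note that a scenario passes on its mean, so a brief spike in the max column does not fail a run on its own.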
Bags are stored in ./bags/ (bind-mounted to /ws/bags/ in the container).
With the stack running (docker compose up), open a second terminal and record:
docker compose exec ocelot bash -i -c "
ros2 bag record \
--storage mcap \
--compression-mode file \
--compression-format zstd \
-o /ws/bags/my_session \
/camera/image_raw /cmd_vel
"

Ctrl+C to stop recording cleanly. The bag lands in ./bags/my_session/ on the host.
Stop the main stack first (avoids topic conflicts with camera_node), then:
docker compose run --rm ocelot bash -i -c "
ros2 run web_video_server web_video_server &
ros2 run ocelot visualizer_node &
ros2 bag play /ws/bags/my_session --loop
"

| Stream | URL |
|---|---|
| Raw | http://<pi-ip>:8080/stream?topic=/camera/image_raw |
| Annotated | http://<pi-ip>:8080/stream?topic=/camera/image_annotated |
The annotated stream shows face bounding box, center crosshair, error vector, deadband circle, and cmd_vel values.
docker compose exec ocelot bash -i -c "ros2 bag info /ws/bags/my_session"

Classical tracker:
camera_node ──/camera/image_raw──▶ tracker_node ──/cmd_vel──▶ servo_node
(picamera2) sensor_msgs/Image (Haar cascade) Twist (PCA9685)
│
└──▶ visualizer_node ──/camera/image_annotated──▶ web_video_server
(optional)
VLA mode (USE_VLA=true):
camera_node ──/camera/image_raw──▶ vla_node ──/cmd_vel──▶ servo_node
(picamera2) sensor_msgs/Image (ONNX INT8) Twist (PCA9685)
tracker_node and vla_node are mutually exclusive — only one runs at a time.
camera_node — Captures 640×480 RGB frames from Pi Camera V2. Because libcamera's Python bindings are compiled for Python 3.11 (Pi OS Bookworm) and the ROS container uses Python 3.12, capture runs in a python3.11 subprocess (capture_worker.py) communicating frames to the node via a length-prefixed pipe.
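A length-prefixed pipe of the kind described for camera_node can be sketched as follows. The 4-byte big-endian header is an assumption about the wire format, and `write_frame`/`read_frame` are hypothetical names; capture_worker.py may frame its data differently.

```python
import struct
from typing import BinaryIO, Optional

_HEADER = struct.Struct(">I")  # assumed: 4-byte big-endian payload length


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Worker side: emit the length header followed by the frame bytes."""
    stream.write(_HEADER.pack(len(payload)))
    stream.write(payload)
    stream.flush()


def _read_exact(stream: BinaryIO, n: int) -> Optional[bytes]:
    """Read exactly n bytes, or return None if the stream ends first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            return None
        buf.extend(chunk)
    return bytes(buf)


def read_frame(stream: BinaryIO) -> Optional[bytes]:
    """Node side: return the next frame, or None on EOF or truncation."""
    header = _read_exact(stream, _HEADER.size)
    if header is None:
        return None
    (length,) = _HEADER.unpack(header)
    return _read_exact(stream, length)
```

The read loop is the important part: a pipe read may return fewer bytes than requested, so a single `read(length)` call is not enough to recover a whole 640×480 frame reliably.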
servo_node — Subscribes to /cmd_vel (geometry_msgs/Twist). Integrates angular.z (pan) and angular.y (tilt) velocity at 30 Hz into servo positions via adafruit-circuitpython-servokit. Centers on shutdown.
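The velocity integration in servo_node can be illustrated with a small state object. This is a sketch under assumptions: /cmd_vel is taken to be in rad/s, positive velocity is taken to increase the servo angle, the start angles are guesses at "centre", and `PanTiltState` is a hypothetical name. The angle limits come from the hardware table at the top of this README.

```python
import math
from dataclasses import dataclass

RATE_HZ = 30.0                # integration rate stated above
PAN_LIMITS = (0.0, 180.0)     # degrees, pan ch0
TILT_LIMITS = (90.0, 180.0)   # degrees, tilt ch1


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


@dataclass
class PanTiltState:
    pan_deg: float = 90.0     # assumed centre of the pan range
    tilt_deg: float = 135.0   # assumed centre of the tilt range

    def step(self, angular_z: float, angular_y: float,
             dt: float = 1.0 / RATE_HZ) -> None:
        """Advance one tick: angle += velocity * dt, clamped to servo limits."""
        self.pan_deg = _clamp(
            self.pan_deg + math.degrees(angular_z) * dt, *PAN_LIMITS)
        self.tilt_deg = _clamp(
            self.tilt_deg + math.degrees(angular_y) * dt, *TILT_LIMITS)
```

In a real node the resulting angles would then be written to the servo driver each tick; clamping before the write keeps a sustained command from winding the target past the mechanical range.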
tracker_node — Subscribes to /camera/image_raw, runs Haar cascade face detection, publishes velocity commands to /cmd_vel and bounding box to /tracking/face_roi. Key params: kp_pan, kp_tilt, deadband, min_neighbors, min_face_size. Disabled when USE_VLA=true.
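The proportional control law with a deadband can be sketched as below. The parameter names kp_pan, kp_tilt and deadband come from the node description; the normalised pixel error, the circular deadband (suggested by the deadband circle in the annotated stream), the default values, and the sign convention are all assumptions for illustration.

```python
import math
from typing import Optional, Tuple


def p_control(face_center: Optional[Tuple[float, float]],
              frame_size: Tuple[int, int] = (640, 480),
              kp_pan: float = 1.0, kp_tilt: float = 1.0,
              deadband: float = 0.05) -> Tuple[float, float]:
    """Map a detected face centre (pixels) to (pan, tilt) velocity commands."""
    if face_center is None:
        return 0.0, 0.0  # no detection: hold still

    cx, cy = face_center
    width, height = frame_size
    # Error normalised to [-1, 1], zero at the image centre.
    err_x = (cx - width / 2) / (width / 2)
    err_y = (cy - height / 2) / (height / 2)

    # Inside the deadband circle: suppress jitter around the centre.
    if math.hypot(err_x, err_y) < deadband:
        return 0.0, 0.0

    # Assumed sign convention: turn towards the face to shrink the error.
    return -kp_pan * err_x, -kp_tilt * err_y
```

The deadband trades a small steady-state offset for stability: without it, detection noise of a few pixels would keep the servos twitching even when the face is centred.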
vla_node — Subscribes to /camera/image_raw, runs the trained ONNX model (DINOv2-small + CLIP text encoder + action head), publishes /cmd_vel. Enabled when USE_VLA=true. Key params: checkpoint, token_cache, command, max_vel, max_accel.
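The max_vel and max_accel parameters suggest that the raw model output is limited before it is published. The README does not spell out how, so the following is only a guess at a typical scheme (clamp the magnitude, then limit the change per tick); `limit_command` is a hypothetical helper.

```python
def limit_command(target: float, previous: float,
                  max_vel: float, max_accel: float, dt: float) -> float:
    """Clamp a velocity command to +/-max_vel and limit its rate of change.

    target   : raw model output for one axis (rad/s)
    previous : command published on the previous tick (rad/s)
    dt       : time since the previous tick (s)
    """
    # 1. Bound the requested velocity itself.
    target = max(-max_vel, min(max_vel, target))
    # 2. Bound how far the command may move in one tick.
    max_step = max_accel * dt
    delta = max(-max_step, min(max_step, target - previous))
    return previous + delta
```

Applied per axis on every frame, a limiter like this turns a sudden jump in model output into a ramp, which is gentler on small hobby servos.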
visualizer_node — Subscribes to /camera/image_raw, /tracking/face_roi, and /cmd_vel; publishes annotated frames to /camera/image_annotated. Optional — launch with visualize:=true.
# I2C — should show 0x40
i2cdetect -y 1
# Manual servo via ROS topic
ros2 topic pub --once /cmd_vel geometry_msgs/Twist \
"{angular: {z: 1.0, y: 0.0}}"
# Confirm publish rates
ros2 topic list
ros2 topic hz /camera/image_raw # expect ~15 Hz
ros2 topic echo /cmd_vel --no-arr

ocelot/
├── ocelot/
│ ├── camera_node.py # ROS node (py3.12), spawns capture_worker
│ ├── capture_worker.py # picamera2 capture (py3.11 subprocess)
│ ├── servo_node.py # PCA9685 servo control
│ ├── tracker_node.py # Haar cascade proportional controller
│ ├── oracle_node.py # Privileged ground-truth FK tracker (sim only)
│ ├── oracle_validator.py # Pixel-error measurement for oracle validation
│ ├── vla_node.py # ONNX inference node for sim validation (Phase 3)
│ └── visualizer_node.py # Annotated image publisher (optional)
├── launch/tracker_launch.py
├── config/tracker_params.yaml
├── urdf/pan_tilt.urdf
├── bags/ # rosbag recordings (gitignored)
├── scripts/ # bare-metal validation (no ROS needed)
│ ├── test_servos.py
│ └── test_tracking_manual.py
├── deploy/docker/
│ ├── Dockerfile.robot # robot deployment image (Pi 5)
│ └── docker-compose.yml # compose config (relative to deploy/docker/)
├── docker-compose.yml # convenience wrapper — includes deploy/docker/
├── package.xml # ament_python
├── setup.py
└── setup.cfg
The apt python3-opencv package does not bundle cascade data files. opencv-data must also be installed — it provides the cascade XMLs at /usr/share/opencv4/haarcascades/. This is already in Dockerfile.sim. If you see this error after a rebuild, check that both python3-opencv and opencv-data are present in the apt install section.
simplejpeg (required by picamera2's JPEG encoder) needs libturbojpeg from the host. Check that deploy/docker/docker-compose.yml bind-mounts /usr/lib/aarch64-linux-gnu/libturbojpeg.so.0 from the host.
picamera2 imports v4l2 for sensor mode enumeration. The file lives at /usr/lib/python3/dist-packages/v4l2.py on the host and must be bind-mounted into the container. Check deploy/docker/docker-compose.yml.
If the .venv was created outside the container (host Pi OS Python 3.11 has a different ABI from deadsnakes), or if numpy/simplejpeg compatibility breaks after a Pi OS update, delete and recreate it inside the container:
docker compose run --rm ocelot bash -i -c "
rm -rf /ws/src/ocelot/.venv &&
python3.11 -m venv --without-pip /ws/src/ocelot/.venv &&
/ws/src/ocelot/.venv/bin/python3.11 -c 'import urllib.request; exec(urllib.request.urlopen(\"https://bootstrap.pypa.io/get-pip.py\").read())' &&
/ws/src/ocelot/.venv/bin/pip install -r /ws/src/ocelot/requirements-worker.txt
"

Then docker compose up as normal — the venv will be picked up on the next run.
Harmless. pip can see ROS packages in the environment and warns about missing deps for them. The worker venv doesn't need them — ignore it.
If VISUALIZE=true docker compose up starts only 4 nodes (no visualizer_node), the colcon install directory is stale. Run a rebuild inside the container then restart:
docker compose run --rm ocelot bash -i -c "cd /ws && colcon build --packages-select ocelot --symlink-install"
VISUALIZE=true docker compose up

This is needed whenever a new entry point is added to setup.py.
Root cause: Gazebo transport tries multicast peer discovery on all interfaces when GZ_IP is unset. The GUI event loop blocks waiting for the server handshake — the window frame appears (Qt init succeeds) but hangs before the scene loads.
Fix (already in docker-compose.sim.yml):
environment:
- GZ_IP=127.0.0.1

This binds Gazebo transport to loopback only, so server↔GUI discovery resolves instantly.
Root cause: Docker's default /dev/shm is 64 MB — too small for Gazebo's OGRE renderer, which transfers render buffers between server and GUI via shared memory.
Fix (already in docker-compose.sim.yml):
shm_size: '2g'
ipc: host
environment:
- QT_X11_NO_MITSHM=1

Run sudo make sim-xauth once (re-run if the display session changes). The compose file mounts /tmp/.docker.xauth and sets XAUTHORITY=/tmp/.docker.xauth.
The jazzy-simulation base image doesn't include Vulkan ICDs, so OGRE logs this and falls back to software OpenGL (llvmpipe). This is expected and harmless when running without the GPU overlay — the sim works but renders on CPU.
To switch to GPU-accelerated rendering (NVIDIA), use the GPU compose overlay as described in the Sim section above.
| Phase | Weeks | Goal |
|---|---|---|
| 1 | 1–4 | Classical face tracker (Haar cascade) — complete |
| 2 | 5–8 | Gazebo sim + synthetic data engine — complete |
| 3 | 9–13 | VLA model (DINOv2 + CLIP + action head) — complete |
| 4 | 14–18 | Edge deployment + MLOps loop — current |
| 5 | 19–20 | Polish + portfolio |