# Isaac ROS Image Segmentation (+ SAM3)

Fork of NVIDIA-ISAAC-ROS/isaac_ros_image_segmentation. Adds SAM3 (Meta, 2025) as a ROS 2 node with a PyTorch-native backend.

## SAM3 on ROS 2

SAM3 is Meta's open-vocabulary segmentation model, released in 2025, which performs grounded segmentation from free-text prompts. This fork integrates it into the Isaac ROS image segmentation ecosystem as a Python-based ROS 2 node.

Key characteristics:

- **Open-vocabulary** — segment by text prompt ("robot arm", "person", etc.)
- **Full SAM3** — ViT-H backbone, FPN neck, grounding decoder (not a distilled variant)
- **PyTorch-native** — no ONNX or Triton; runs directly with `torch.inference_mode()`

## Why PyTorch-native?

The natural path in Isaac ROS is to export models to ONNX → TensorRT for optimized inference. That didn't work here.

`torch.export` (required by `torch_tensorrt`) fails on the SAM3 decoder because `pin_memory` is not supported in export mode, and ONNX export hits the same wall. TensorRT does work for the vision encoder on its own, which is why the optional TRT vision path exists, but the decoder stays in PyTorch.

Instead, the optimizations applied are:

- **BF16 autocast** (`pytorch_amp_bf16`, default `True`) — wraps the full pipeline; ~2× vision-encoder speedup with no quality loss
- **`torch.compile` decoder** (`pytorch_compile_decoder`, default `True`) — ~3× decoder speedup via inductor
- **`torch.compile` vision encoder** (`pytorch_compile_vision`, optional) — additional speedup if TRT is not used
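As a minimal sketch, the three optimizations compose like this. Toy modules stand in for SAM3's ViT-H encoder and grounding decoder; the module structure and attribute names here are illustrative, not the fork's actual API:

```python
import torch

# Toy stand-ins for SAM3's vision encoder and grounding decoder;
# shapes are illustrative only.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = torch.nn.Linear(16, 16).to(device)
decoder = torch.nn.Linear(16, 4).to(device)

# pytorch_compile_decoder=True: compile the decoder (default inductor
# backend); pytorch_compile_vision would do the same to the encoder.
decoder = torch.compile(decoder)

@torch.inference_mode()  # no autograd state, as in the node
def infer(x):
    # pytorch_amp_bf16=True: BF16 autocast wraps the full pipeline;
    # weights stay FP32, matmuls run in bfloat16.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        feats = encoder(x)
        return decoder(feats)

out = infer(torch.randn(1, 16, device=device))
print(out.shape)  # torch.Size([1, 4])
```

Because autocast is a context manager rather than a weight conversion, the checkpoint stays in FP32 and can be run at full precision by simply disabling the flag.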

## Performance (RTX 4090, r2b_robotarm dataset)

| Stage | Time |
| --- | --- |
| Vision Encoder (BF16) | ~56 ms |
| Text Encoder (cached) | ~0 ms |
| Decoder (compile + BF16) | ~34 ms |
| Preprocess + postprocess | ~30 ms |
| ROS 2 E2E | ~130 ms (~7.5 fps) |
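A quick sanity check of the budget (numbers from the table above): the per-stage times roughly account for the end-to-end figure, with the remainder attributable to ROS 2 transport and node overhead.

```python
# Latency budget from the table above (milliseconds, RTX 4090)
stages = {
    "vision_encoder": 56,
    "text_encoder": 0,    # cached after the first prompt
    "decoder": 34,
    "pre_post": 30,
}
compute_ms = sum(stages.values())
e2e_ms = 130
print(compute_ms)               # 120, leaving ~10 ms for ROS 2 overhead
print(round(1000 / e2e_ms, 1))  # ~7.7 fps at the measured E2E latency
```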

## Quickstart

### 1. Clone & Download Checkpoint

```bash
git clone https://github.com/eclipse0922/isaac_ros_image_segmentation.git
cd isaac_ros_image_segmentation

# Log in to Hugging Face and download sam3.pt (~3.3 GB)
pip install huggingface_hub
huggingface-cli login
mkdir -p models/sam3
huggingface-cli download facebook/sam3 sam3.pt --local-dir models/sam3
```

### 2. Build Docker Image

```bash
docker build \
  -f isaac_ros_segment_anything3/docker/Dockerfile.pytorch \
  -t sam3_pytorch:latest .
```

Base: NGC PyTorch 26.01 (Ubuntu 24.04, CUDA 13.1, PyTorch 2.10, TensorRT 10.14) + ROS 2 Jazzy.

### 3. Run Foxglove Demo (with robot arm video)

```bash
# On host — streams the segmentation overlay to Foxglove Studio on port 8765
./run_foxglove_demo.sh \
  --bag datasets/r2bdataset2024_v1/r2b_robotarm \
  --topic /camera_1/color/image_raw \
  --prompt "robot arm"
```

Open Foxglove Studio → connect to `ws://localhost:8765` → subscribe to `/sam3/overlay`.

The r2b_robotarm dataset (~1.4 GB MCAP) can be downloaded via NGC:

```bash
./download_dataset_ngc.sh
```

### Standalone Test (no ROS 2)

```bash
docker run --runtime=nvidia --rm \
  -v $(pwd):/ws -w /ws \
  sam3_pytorch:latest \
  python3 isaac_ros_segment_anything3/scripts/test_sam3_pytorch.py \
    --checkpoint models/sam3/sam3.pt \
    --image /path/to/image.jpg \
    --prompt "cat" \
    --precision bf16
```

## ROS 2 Node

Package: `isaac_ros_segment_anything3` — node: `sam3_node.py`

### Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `model_checkpoint_path` | `/tmp/models/sam3.pt` | Path to `sam3.pt` |
| `text_prompts` | `""` | Comma-separated prompts (also settable at runtime via service) |
| `confidence_threshold` | `0.3` | Detection score threshold |
| `pytorch_amp_bf16` | `True` | BF16 autocast for the full pipeline |
| `pytorch_compile_decoder` | `True` | `torch.compile` on the decoder |
| `pytorch_compile_vision` | `False` | `torch.compile` on the vision encoder |
| `image_size` | `1008` | Input resolution (SAM3 native: 1008×1008) |

### Runtime Prompt Update

```bash
ros2 service call /sam3/set_text_prompt \
  isaac_ros_segment_anything3_interfaces/srv/SetTextPrompt \
  "{text_prompts: ['robot arm', 'person']}"
```

### Topics

| Topic | Type | Description |
| --- | --- | --- |
| `/image_raw` (sub) | `sensor_msgs/Image` | Input RGB image |
| `/sam3/raw_segmentation_mask` (pub) | `sensor_msgs/Image` | Binary mask |
| `/sam3/overlay` (pub) | `sensor_msgs/Image` | Visualization overlay |
| `/sam3/timing` (pub) | `Sam3Timing` | Per-stage latency |

## Architecture Notes

SAM3 uses a ViT-H vision backbone with an FPN neck (4 feature scales) and a grounding decoder that takes text embeddings as queries. Unlike SAM1/2, there is no point/box prompt — everything goes through text.

Key implementation details:

- Image resolution: 1008×1008 (stretch resize, no letterbox)
- Normalization: `mean = std = [0.5, 0.5, 0.5]`
- Text encoding via the built-in CLIP tokenizer (`model.backbone.forward_text()`)
- The decoder outputs 200 query slots with normalized `[cx, cy, w, h]` boxes
- Scoring: `sigmoid(pred_logits) × sigmoid(presence_logits) > threshold`
- The geometry encoder's CLS token requires cross-attention with image features before the main decoder
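A NumPy sketch of the normalization and scoring rules above (the real node operates on PyTorch tensors; the scalar presence logit and array shapes here are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalize(img):
    """Apply mean = std = 0.5 to an HxWx3 float image in [0, 1] -> [-1, 1]."""
    return (img - 0.5) / 0.5

def select_queries(pred_logits, presence_logit, threshold=0.3):
    """Score the 200 query slots and keep those above the threshold.

    Final score = sigmoid(per-query logit) * sigmoid(presence logit),
    as in the scoring rule above.
    """
    scores = sigmoid(pred_logits) * sigmoid(presence_logit)
    return np.nonzero(scores > threshold)[0]

# 200 query slots as in the decoder output; two confident detections
pred = np.full(200, -4.0)
pred[[3, 17]] = 3.0
keep = select_queries(pred, presence_logit=2.0)
print(keep)  # [ 3 17]
```

Multiplying by the presence score means that when the prompt is absent from the image (a large negative presence logit), every query is suppressed regardless of its individual logit.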

## Future Work

- **C++ NITROS zero-copy**: The vision encoder is already exported as a TRT engine (`.pt2`). A C++ `ComposableNode` wrapping it with NITROS image transport would enable zero-copy GPU input — the natural Isaac ROS path. The decoder currently cannot be TRT-exported (`pin_memory` is not supported in `torch.export`), so it would remain in Python, but the image-to-features path could be fully GPU-resident.
- **Decoder TRT export**: If the upstream SAM3 decoder drops the `pin_memory` dependency, full TRT deployment via `torch_tensorrt` becomes straightforward.
- **Dynamic batch / multi-camera**: The current node is single-image, single-camera. Batching across multiple camera streams would improve GPU utilization.

## Original NVIDIA Packages

The packages below are from the original NVIDIA Isaac ROS release and remain unchanged in this fork. Refer to the upstream documentation for usage.

| Package | Model | Description |
| --- | --- | --- |
| `isaac_ros_unet` | U-Net | Semantic segmentation via TensorRT |
| `isaac_ros_segformer` | Segformer | Transformer-based segmentation |
| `isaac_ros_segment_anything` | SAM | Prompt-based segmentation (SAM1) |
| `isaac_ros_segment_anything2` | SAM2 | Video object segmentation |

## License

Apache 2.0 — see `LICENSE`. SAM3 model weights are subject to Meta's model license.
