66 changes: 66 additions & 0 deletions nemo_curator_semantic_dedup/Dockerfile
@@ -0,0 +1,66 @@
# NeMo Curator Image Deduplication Example
# Uses CUDA 12.8 for GPU-accelerated processing
FROM anyscale/ray:2.52.0-slim-py312-cu128

# Note: Cache busting for git clone is done via CURATOR_CACHE_BUST arg below

# Install system dependencies
RUN sudo apt-get update && \
    sudo apt-get install -y --no-install-recommends \
        build-essential \
        unzip \
        wget \
        curl \
        git && \
    sudo apt-get clean && \
    sudo rm -rf /var/lib/apt/lists/*

# Install uv for fast package management
RUN curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python dependencies
# Upgrade pip tooling first; subsequent installs use `uv pip install --system`
# so packages land in the base anaconda environment shared by all Ray workers
# (not just the driver)
RUN python -m pip install --upgrade pip setuptools wheel

# IMPORTANT: Uninstall any pre-existing RAPIDS/cuML packages from the base image
# The base image may have incompatible versions that conflict with scikit-learn
RUN python -m pip uninstall -y cuml-cu12 cudf-cu12 cugraph-cu12 pylibraft-cu12 raft-dask-cu12 rmm-cu12 || true && \
    echo "Cleaned up pre-existing RAPIDS packages"

# Clone NeMo-Curator from fork and install in editable mode
# This ensures all Ray workers have the same code with your local edits
ARG CURATOR_REPO=https://github.com/avigyabb/Curator.git
ARG CURATOR_REF=avi-test
# ARG CURATOR_REF=main
# Cache bust for git clone - change this value to force re-clone after pushing to branch
ARG CURATOR_CACHE_BUST=2025-12-29-v3
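# Example rebuild that forces a fresh clone (hypothetical invocation; adjust
# the image tag and build context to your setup):
#   docker build --build-arg CURATOR_CACHE_BUST=$(date +%s) -t nemo-curator-dedup .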
RUN echo "Cache bust: ${CURATOR_CACHE_BUST}" && \
git clone --depth 1 -b ${CURATOR_REF} ${CURATOR_REPO} /home/ray/NeMo-Curator && \
uv pip install --system -e /home/ray/NeMo-Curator[image_cuda12]

# Re-upgrade scikit-learn AFTER nemo-curator in case it was downgraded
# cuML 25.6.* needs sklearn >= 1.5 (has _get_default_requests)
RUN uv pip install --system "scikit-learn>=1.5,<1.6" && \
python -c "import sklearn; print(f'Final scikit-learn version: {sklearn.__version__}')"

# Additional dependencies for image downloading and processing
RUN uv pip install --system \
    loguru \
    Pillow \
    aiohttp \
    tqdm \
    pandas \
    pyarrow \
    huggingface_hub \
    transformers

# Set environment variable for model directory
ENV MODEL_DIR=/home/ray/model_weights

# Create output directories
RUN mkdir -p /home/ray/data/webdataset \
    /home/ray/data/results \
    /home/ray/data/embeddings \
    /home/ray/data/removal_ids

57 changes: 57 additions & 0 deletions nemo_curator_semantic_dedup/README.md
@@ -0,0 +1,57 @@
# Image Semantic Deduplication with NeMo Curator

This example uses [NVIDIA NeMo Curator](https://github.com/NVIDIA-NeMo/Curator) to perform GPU-accelerated semantic deduplication on image datasets.

NeMo Curator is a scalable data curation library that leverages NVIDIA RAPIDS™ for GPU acceleration. This example downloads images from a parquet file, generates CLIP embeddings, and removes near-duplicate images based on semantic similarity.

## Install the Anyscale CLI

```bash
pip install -U anyscale
anyscale login
```

## Run the job

Clone the example from GitHub.

```bash
git clone https://github.com/anyscale/examples.git
cd examples/nemo_curator_semantic_dedup
```

Submit the job.

```bash
anyscale job submit -f job.yaml
```

## Understanding the example

- The [Dockerfile](./Dockerfile) builds a custom image that clones NeMo Curator and installs it in editable mode with its CUDA extras (`nemo-curator[image_cuda12]`), adds the Python packages used for image downloading and processing (Pillow, aiohttp, transformers, and others), and creates the output directories the pipeline writes to.

- The entrypoint defined in [job.yaml](./job.yaml) runs `image_dedup_example.py`, which executes a 3-step pipeline (illustrative sketches for steps 2 and 3 follow this list):
  1. **Download WebDataset**: Fetches images from URLs in the parquet file and saves them as WebDataset tar files to `/mnt/cluster_storage/nemo_curator/webdataset`.
  2. **Generate CLIP embeddings**: Uses OpenAI's CLIP ViT-L/14 model to create a 768-dimensional embedding for each image.
  3. **Semantic deduplication**: Clusters embeddings with k-means and removes near-duplicates based on cosine similarity.
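
  A minimal sketch of step 2, assuming the Hugging Face `transformers` CLIP API installed by the Dockerfile (the actual embedding code lives in `image_dedup_example.py` and NeMo Curator, and processes batches from WebDataset shards):

  ```python
  # Hedged sketch: embed one image with CLIP ViT-L/14 via transformers.
  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

  image = Image.open("example.jpg")  # placeholder path
  inputs = processor(images=image, return_tensors="pt")
  with torch.no_grad():
      embedding = model.get_image_features(**inputs)  # shape (1, 768)
  # Unit-normalize so dot products can serve as cosine similarities.
  embedding = embedding / embedding.norm(dim=-1, keepdim=True)
  ```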
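
  And a hedged CPU sketch of the idea behind step 3 (NeMo Curator's implementation is GPU-accelerated via RAPIDS; the cluster count and similarity threshold below are made-up illustrative values):

  ```python
  # Illustrative cluster-then-compare deduplication over CLIP embeddings.
  import numpy as np
  from sklearn.cluster import KMeans

  def find_duplicate_ids(embeddings: np.ndarray, ids: list[str],
                         n_clusters: int = 100, threshold: float = 0.95) -> set[str]:
      # Unit-normalize so dot products are cosine similarities.
      normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
      labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(normed)

      duplicates: set[str] = set()
      for c in range(n_clusters):
          members = np.where(labels == c)[0]
          sims = normed[members] @ normed[members].T
          for i in range(len(members)):
              for j in range(i + 1, len(members)):
                  if sims[i, j] >= threshold:
                      duplicates.add(ids[members[j]])  # keep the earlier item, drop the later
      return duplicates
  ```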

- The `/mnt/cluster_storage/` directory is an ephemeral shared filesystem attached to the cluster for the duration of the job. All outputs (embeddings, duplicate IDs, and deduplicated images) are saved here.

- To use your own data, prepare a parquet file with `URL` and `TEXT` columns, upload it to cluster storage, and override the `INPUT_PARQUET` environment variable:
```bash
anyscale job submit -f job.yaml \
--env INPUT_PARQUET=/mnt/cluster_storage/your_data.parquet \
--env OUTPUT_DIR=/mnt/cluster_storage/your_results
```
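
  For reference, a minimal sketch of producing such a parquet file with pandas (the rows below are placeholders; only the `URL` and `TEXT` column names matter to the pipeline):

  ```python
  # Hypothetical example rows; replace with your own image URLs and captions.
  import pandas as pd

  df = pd.DataFrame({
      "URL": ["https://example.com/images/0001.jpg", "https://example.com/images/0002.jpg"],
      "TEXT": ["a photo of a cat", "a photo of a dog"],
  })
  df.to_parquet("your_data.parquet", index=False)  # then upload to /mnt/cluster_storage/
  ```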

- The [helper.py](./helper.py) module provides utilities for downloading images in parallel and converting them to [WebDataset](https://github.com/webdataset/webdataset) format, which is optimized for streaming large-scale image datasets.
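
  A hedged, sequential sketch of that conversion (the real helper downloads in parallel; this only illustrates the WebDataset layout, where files sharing a zero-padded key differ by extension):

  ```python
  # Pack (image URL, caption) pairs into a WebDataset-style tar shard.
  import io
  import tarfile
  import urllib.request

  def write_shard(rows: list[tuple[str, str]], shard_path: str) -> None:
      with tarfile.open(shard_path, "w") as tar:
          for i, (url, text) in enumerate(rows):
              key = f"{i:09d}"  # shared sample key, WebDataset convention
              with urllib.request.urlopen(url, timeout=10) as resp:
                  img_bytes = resp.read()
              for suffix, payload in ((".jpg", img_bytes), (".txt", text.encode())):
                  info = tarfile.TarInfo(name=key + suffix)
                  info.size = len(payload)
                  tar.addfile(info, io.BytesIO(payload))
  ```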

## View the job

View the job in the [jobs tab](https://console.anyscale.com/jobs) of the Anyscale console.

## Learn more

- [NeMo Curator Documentation](https://docs.nvidia.com/nemo/curator/latest/)
- [NeMo Curator Image Tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image/getting-started)
- [Anyscale Jobs Documentation](https://docs.anyscale.com/platform/jobs/)