64 changes: 64 additions & 0 deletions nemo_curator_semantic_dedup/Dockerfile
@@ -0,0 +1,64 @@
# NeMo Curator Image Deduplication Example
# Uses CUDA 12.8 for GPU-accelerated processing
FROM anyscale/ray:2.52.0-slim-py312-cu128

# Install system dependencies
RUN sudo apt-get update && \
sudo apt-get install -y --no-install-recommends \
build-essential \
unzip \
wget \
curl && \
sudo apt-get clean && \
sudo rm -rf /var/lib/apt/lists/*

# Install uv for fast package management
RUN curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python dependencies
# NeMo Curator with CUDA 12 support for image processing
RUN uv pip install --system "nemo-curator[image_cuda12]"

# Additional dependencies for image downloading and processing
RUN uv pip install --system \
loguru \
Pillow \
aiohttp \
tqdm \
pandas \
pyarrow \
huggingface_hub \
transformers

# Pre-download CLIP model weights to avoid runtime downloads
# This makes job startup faster and more reliable
RUN python -c "\
from huggingface_hub import snapshot_download; \
import os; \
model_dir = '/home/ray/model_weights/openai/clip-vit-large-patch14'; \
os.makedirs(model_dir, exist_ok=True); \
snapshot_download('openai/clip-vit-large-patch14', local_dir=model_dir)"

# Set environment variable for model directory
ENV MODEL_DIR=/home/ray/model_weights

# Download and prepare the example dataset from HuggingFace
# Downloads MS COCO parquet, deduplicates URLs, and truncates to 100k rows
RUN mkdir -p /home/ray/data && \
curl -L https://huggingface.co/datasets/ChristophSchuhmann/MS_COCO_2017_URL_TEXT/resolve/main/mscoco.parquet \
-o /home/ray/data/mscoco.parquet && \
python -c "\
import pandas as pd; \
df = pd.read_parquet('/home/ray/data/mscoco.parquet'); \
deduped = df[~df['URL'].duplicated()]; \
truncated = deduped[:100000]; \
truncated.to_parquet('/home/ray/data/truncated_100k_mscoco.parquet'); \
print(f'Created truncated dataset with {len(truncated)} rows')" && \
rm /home/ray/data/mscoco.parquet

# Create output directories
RUN mkdir -p /home/ray/data/webdataset \
/home/ray/data/results \
/home/ray/data/embeddings \
/home/ray/data/removal_ids

57 changes: 57 additions & 0 deletions nemo_curator_semantic_dedup/README.md
@@ -0,0 +1,57 @@
# Image Semantic Deduplication with NeMo Curator

This example uses [NVIDIA NeMo Curator](https://github.com/NVIDIA-NeMo/Curator) to perform GPU-accelerated semantic deduplication on image datasets.

NeMo Curator is a scalable data curation library that leverages NVIDIA RAPIDS™ for GPU acceleration. This example downloads images from a parquet file, generates CLIP embeddings, and removes near-duplicate images based on semantic similarity.

## Install the Anyscale CLI

```bash
pip install -U anyscale
anyscale login
```

## Run the job

Clone the example from GitHub.

```bash
git clone https://github.com/anyscale/examples.git
cd examples/nemo_curator_semantic_dedup
```

Submit the job.

```bash
anyscale job submit -f job.yaml
```

## Understanding the example

- The [Dockerfile](./Dockerfile) builds a custom image with NeMo Curator CUDA dependencies (`nemo-curator[image_cuda12]`), downloads the MS COCO sample dataset from HuggingFace, and pre-downloads the CLIP model weights to speed up job startup.

- The entrypoint defined in [job.yaml](./job.yaml) runs `image_dedup_example.py`, which executes a 3-step pipeline (a simplified sketch of the two GPU stages follows the list):
1. **Download WebDataset**: Fetches images from URLs in the parquet file and saves them as WebDataset tar files to `/mnt/cluster_storage/nemo_curator/webdataset`
2. **Generate CLIP embeddings**: Uses OpenAI's CLIP ViT-L/14 model to create 768-dimensional embeddings for each image
3. **Semantic deduplication**: Clusters embeddings with k-means and removes near-duplicates based on cosine similarity
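
  The sketch below shows what steps 2 and 3 boil down to. It uses `transformers` and `scikit-learn` directly rather than the NeMo Curator API, so treat it as an illustration of the technique only: the image directory, cluster count, and similarity threshold are placeholders, the model name matches the weights pre-downloaded in the Dockerfile, and `scikit-learn` is not installed by the Dockerfile, so add it if you run this standalone.

```python
# Illustrative sketch only -- not the NeMo Curator implementation.
# Assumed placeholders: the image directory, n_clusters, and the 0.95 threshold.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Step 2: embed each image with CLIP ViT-L/14 (768-dimensional vectors).
paths = sorted(Path("/mnt/cluster_storage/nemo_curator/images").glob("*.jpg"))
embeddings = []
with torch.no_grad():
    for path in paths:
        inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt").to(device)
        embeddings.append(model.get_image_features(**inputs).squeeze(0).cpu().numpy())
embeddings = np.stack(embeddings)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit vectors: dot product == cosine similarity

# Step 3: cluster the embeddings, then drop near-duplicates within each cluster.
labels = KMeans(n_clusters=16, random_state=0).fit_predict(embeddings)
threshold = 0.95  # pairs above this cosine similarity are treated as duplicates
keep = []
for cluster_id in np.unique(labels):
    members = np.where(labels == cluster_id)[0]
    kept = []
    for i in members:
        if all(embeddings[i] @ embeddings[j] < threshold for j in kept):
            kept.append(i)
    keep.extend(kept)
print(f"Keeping {len(keep)} of {len(paths)} images")
```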

- The `/mnt/cluster_storage/` directory is an ephemeral shared filesystem attached to the cluster for the duration of the job. All outputs (embeddings, duplicate IDs, and deduplicated images) are saved here.

- To use your own data, prepare a parquet file with `URL` and `TEXT` columns (a minimal sketch of such a file follows the command), upload it to cluster storage, and override the `INPUT_PARQUET` environment variable:
```bash
anyscale job submit -f job.yaml \
--env INPUT_PARQUET=/mnt/cluster_storage/your_data.parquet \
--env OUTPUT_DIR=/mnt/cluster_storage/your_results
```
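
  A minimal sketch of building a compatible input file with pandas is shown below; the URLs and captions are placeholders for your own data:

```python
# Hypothetical example input -- replace the placeholder rows with your own data.
import pandas as pd

df = pd.DataFrame(
    {
        "URL": [
            "https://example.com/images/0001.jpg",
            "https://example.com/images/0002.jpg",
        ],
        "TEXT": [
            "A cat sleeping on a windowsill",
            "A red bicycle leaning against a wall",
        ],
    }
)
df.to_parquet("your_data.parquet", index=False)
```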

- The [helper.py](./helper.py) module provides utilities for downloading images in parallel and converting them to [WebDataset](https://github.com/webdataset/webdataset) format, which is optimized for streaming large-scale image datasets.
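
  To make the WebDataset conversion concrete, here is a rough, standard-library sketch of the same idea: download each `URL`, then write the image bytes and the `TEXT` caption into a tar shard under a shared key. This is an illustration, not the actual `helper.py` code; the shard path and worker count are placeholders.

```python
# Illustrative sketch only -- not the actual helper.py implementation.
import io
import tarfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

df = pd.read_parquet("/home/ray/data/truncated_100k_mscoco.parquet").head(100)

def fetch(row):
    index, url, caption = row
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return index, resp.read(), caption
    except Exception:
        return index, None, caption  # skip samples that fail to download

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(fetch, zip(df.index, df["URL"], df["TEXT"])))

# In WebDataset format, each sample is a group of files in a tar archive that
# share a key prefix ("<key>.jpg" for the image, "<key>.txt" for the caption).
with tarfile.open("/home/ray/data/webdataset/shard-000000.tar", "w") as tar:
    for index, image_bytes, caption in results:
        if image_bytes is None:
            continue
        key = f"{index:09d}"
        for name, payload in ((f"{key}.jpg", image_bytes), (f"{key}.txt", caption.encode("utf-8"))):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```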

## View the job

View the job in the [jobs tab](https://console.anyscale.com/jobs) of the Anyscale console.

## Learn more

- [NeMo Curator Documentation](https://docs.nvidia.com/nemo/curator/latest/)
- [NeMo Curator Image Tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image/getting-started)
- [Anyscale Jobs Documentation](https://docs.anyscale.com/platform/jobs/)