Conversation

@xyuzh (Contributor) commented on Dec 24, 2025

Summary

This PR adds an Ask-LLM data curation example, demonstrating LLM-based data curation with the Ask-LLM methodology from the DCLM paper.

What it does

  • Uses Qwen2.5-3B-Instruct via vLLM to score text quality
  • Implements the Ask-LLM approach: prompts the LLM with "Is this text suitable for training a language model?"
  • Uses the softmax probability of "Yes" (P(Yes)) as the quality score (see the sketch after this list)
  • Filters the FineWeb-edu dataset against a quality threshold
  • Writes the curated data to Parquet
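
A minimal sketch of the scoring step, assuming vLLM's offline `LLM` API. The question wording comes from this PR, but the prompt wrapper, top-k candidate handling, and lack of Yes/No normalization here are illustrative assumptions rather than the exact logic in `main.py`:

```python
import math

from vllm import LLM, SamplingParams

# Hypothetical prompt wrapper around the Ask-LLM question from the PR.
PROMPT_TEMPLATE = (
    "###\n{text}\n###\n"
    "Is this text suitable for training a language model? Answer Yes or No."
)

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)


def ask_llm_score(text: str) -> float:
    """Return P(Yes) for one document, taken from the first generated token."""
    out = llm.generate([PROMPT_TEMPLATE.format(text=text)], params)[0]
    # logprobs[0] maps candidate token ids to Logprob objects for the first
    # generated token; sum the probability mass on "Yes" variants.
    candidates = out.outputs[0].logprobs[0]
    return sum(
        math.exp(lp.logprob)
        for lp in candidates.values()
        if lp.decoded_token and lp.decoded_token.strip().lower() == "yes"
    )
```

Summing the probability mass over the "Yes" variants among the top-k logprobs of a single generated token gives P(Yes) without a second forward pass; some implementations additionally normalize by P(Yes) + P(No).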

Files

  • main.py: Main pipeline with preprocessing, LLM inference, and postprocessing
  • job.yaml: Anyscale job configuration (8x g5.xlarge GPU instances)
  • Dockerfile: Container with vLLM and dependencies
  • README.md: Documentation and usage instructions

References

This example demonstrates LLM-based data curation using the Ask-LLM methodology from the DCLM paper. It uses Qwen2.5-3B-Instruct via vLLM to score text quality and filter the FineWeb-edu dataset.

Features:
- Ask-LLM prompting to judge whether text is suitable for LLM training
- Uses the softmax P(Yes) probability as the quality score
- Scalable Ray Data pipeline with vLLM inference (see the pipeline sketch below)
- Configurable quality-threshold filtering
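
A sketch of how such a pipeline could be assembled with Ray Data and vLLM. The paths, threshold, batch size, actor count, and the `AskLLMScorer` class are illustrative assumptions (and assume a Ray version whose `map_batches` accepts `concurrency`), not the exact contents of `main.py`:

```python
import math

import numpy as np
import ray
from vllm import LLM, SamplingParams

# Illustrative placeholders; the real paths and threshold live in main.py / job.yaml.
INPUT_PATH = "s3://example-bucket/fineweb-edu/"
OUTPUT_PATH = "s3://example-bucket/fineweb-edu-curated/"
QUALITY_THRESHOLD = 0.5

PROMPT_TEMPLATE = (
    "###\n{text}\n###\n"
    "Is this text suitable for training a language model? Answer Yes or No."
)


class AskLLMScorer:
    """Stateful map_batches worker: one vLLM engine per GPU actor."""

    def __init__(self):
        self.llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
        self.params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)

    def __call__(self, batch):
        prompts = [PROMPT_TEMPLATE.format(text=t) for t in batch["text"]]
        outputs = self.llm.generate(prompts, self.params)
        scores = []
        for out in outputs:
            # Probability mass on "Yes" among the first generated token's
            # top-k candidates, i.e. the softmax P(Yes) quality score.
            top = out.outputs[0].logprobs[0]
            scores.append(sum(
                math.exp(lp.logprob)
                for lp in top.values()
                if lp.decoded_token and lp.decoded_token.strip().lower() == "yes"
            ))
        batch["quality_score"] = np.array(scores)
        return batch


ds = (
    ray.data.read_parquet(INPUT_PATH)
    .map_batches(AskLLMScorer, concurrency=8, num_gpus=1, batch_size=64)
    .filter(lambda row: row["quality_score"] >= QUALITY_THRESHOLD)
)
ds.write_parquet(OUTPUT_PATH)
```

A stateful class with `concurrency=8` and `num_gpus=1` lines up with the job configuration's eight single-GPU g5.xlarge workers: each actor loads the model once and reuses it across batches before the filtered rows are written to Parquet.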
@xyuzh force-pushed the ask-llm-data-curation branch from cddcd4e to f4838b6 on Dec 24, 2025 at 02:49.