Conversation

@xyuzh (Contributor) commented on Dec 24, 2025

Summary

This PR adds an Ask-LLM data curation example, demonstrating LLM-based data curation with the Ask-LLM methodology from the DCLM paper.

What it does

  • Uses Qwen2.5-3B-Instruct via vLLM to score text quality
  • Implements the Ask-LLM approach: prompts the LLM with "Is this text suitable for training a language model?"
  • Uses the softmax probability of "Yes" (P(Yes)) as the quality score (see the sketch after this list)
  • Filters the FineWeb-edu dataset against a quality threshold
  • Writes the curated data to Parquet
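
A minimal sketch of the scoring step, assuming vLLM's offline `LLM` API. The question wording comes from this PR, but the prompt wrapper, top-k candidate handling, and lack of Yes/No normalization here are illustrative assumptions rather than the exact logic in `main.py`:

```python
import math

from vllm import LLM, SamplingParams

# Hypothetical prompt wrapper around the Ask-LLM question from the PR.
PROMPT_TEMPLATE = (
    "###\n{text}\n###\n"
    "Is this text suitable for training a language model? Answer Yes or No."
)

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)


def ask_llm_score(text: str) -> float:
    """Return P(Yes) for one document, taken from the first generated token."""
    out = llm.generate([PROMPT_TEMPLATE.format(text=text)], params)[0]
    # logprobs[0] maps candidate token ids to Logprob objects for the first
    # generated token; sum the probability mass on "Yes" variants.
    candidates = out.outputs[0].logprobs[0]
    return sum(
        math.exp(lp.logprob)
        for lp in candidates.values()
        if lp.decoded_token and lp.decoded_token.strip().lower() == "yes"
    )
```

Summing the probability mass over the "Yes" variants among the top-k logprobs of a single generated token gives P(Yes) without a second forward pass; some implementations additionally normalize by P(Yes) + P(No).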

Files

  • main.py: Main pipeline with preprocessing, LLM inference, and postprocessing
  • job.yaml: Anyscale job configuration (8x g5.xlarge GPU instances)
  • Dockerfile: Container with vLLM and dependencies
  • README.md: Documentation and usage instructions

References

This example demonstrates LLM-based data curation using the Ask-LLM methodology from the DCLM paper. It uses Qwen2.5-3B-Instruct via vLLM to score text quality and filter the FineWeb-edu dataset.

Features:
- Ask-LLM prompting to judge whether text is suitable for LLM training
- Uses the softmax P(Yes) probability as the quality score
- Scalable Ray Data pipeline with vLLM inference (see the pipeline sketch below)
- Configurable quality-threshold filtering
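
A sketch of how such a pipeline could be assembled with Ray Data and vLLM. The paths, threshold, batch size, actor count, and the `AskLLMScorer` class are illustrative assumptions (and assume a Ray version whose `map_batches` accepts `concurrency`), not the exact contents of `main.py`:

```python
import math

import numpy as np
import ray
from vllm import LLM, SamplingParams

# Illustrative placeholders; the real paths and threshold live in main.py / job.yaml.
INPUT_PATH = "s3://example-bucket/fineweb-edu/"
OUTPUT_PATH = "s3://example-bucket/fineweb-edu-curated/"
QUALITY_THRESHOLD = 0.5

PROMPT_TEMPLATE = (
    "###\n{text}\n###\n"
    "Is this text suitable for training a language model? Answer Yes or No."
)


class AskLLMScorer:
    """Stateful map_batches worker: one vLLM engine per GPU actor."""

    def __init__(self):
        self.llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
        self.params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)

    def __call__(self, batch):
        prompts = [PROMPT_TEMPLATE.format(text=t) for t in batch["text"]]
        outputs = self.llm.generate(prompts, self.params)
        scores = []
        for out in outputs:
            # Probability mass on "Yes" among the first generated token's
            # top-k candidates, i.e. the softmax P(Yes) quality score.
            top = out.outputs[0].logprobs[0]
            scores.append(sum(
                math.exp(lp.logprob)
                for lp in top.values()
                if lp.decoded_token and lp.decoded_token.strip().lower() == "yes"
            ))
        batch["quality_score"] = np.array(scores)
        return batch


ds = (
    ray.data.read_parquet(INPUT_PATH)
    .map_batches(AskLLMScorer, concurrency=8, num_gpus=1, batch_size=64)
    .filter(lambda row: row["quality_score"] >= QUALITY_THRESHOLD)
)
ds.write_parquet(OUTPUT_PATH)
```

A stateful class with `concurrency=8` and `num_gpus=1` lines up with the job configuration's eight single-GPU g5.xlarge workers: each actor loads the model once and reuses it across batches before the filtered rows are written to Parquet.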
@xyuzh force-pushed the ask-llm-data-curation branch from cddcd4e to f4838b6 on Dec 24, 2025 at 02:49.