feat(engine): async generators and task-queue dataset builder #346

@andreatgretel

Priority Level

High

Task Summary

Transform the dataset builder from sequential column-by-column processing into an async task queue with dependency-aware scheduling. Generators become async-first, and the builder dispatches individual cell/batch tasks as soon as their upstream dependencies are satisfied — enabling pipeline parallelism across columns and rows.
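To make the scheduling idea concrete, here is a minimal sketch using plain `asyncio`. It is not the builder's actual implementation: `DEPS`, `fake_generate`, and `build` are hypothetical names, and the dependency map is hard-coded rather than derived from column configs. Each cell waits only for the upstream cells in its own row, and a semaphore bounds the number of in-flight generator calls.

```python
import asyncio

# Hypothetical dependency map: column -> upstream columns it needs per row.
DEPS = {"topic": [], "question": ["topic"], "answer": ["topic", "question"]}


async def fake_generate(column: str, row: int, inputs: dict) -> str:
    """Stand-in for an async generator call (e.g. an LLM request)."""
    await asyncio.sleep(0.01)
    upstream = ",".join(f"{k}={v}" for k, v in sorted(inputs.items()))
    return f"{column}[{row}]({upstream})"


async def build(deps: dict, num_rows: int, max_concurrency: int = 4) -> dict:
    sem = asyncio.Semaphore(max_concurrency)
    results: dict = {}
    # One event per cell; it is set once the cell's value is available.
    done = {(c, r): asyncio.Event() for c in deps for r in range(num_rows)}

    async def run_cell(column: str, row: int) -> None:
        # Wait only for this row's upstream cells, not for whole columns.
        for up in deps[column]:
            await done[(up, row)].wait()
        inputs = {up: results[(up, row)] for up in deps[column]}
        async with sem:  # bound the number of in-flight generator calls
            results[(column, row)] = await fake_generate(column, row, inputs)
        done[(column, row)].set()

    await asyncio.gather(*(run_cell(c, r) for c in deps for r in range(num_rows)))
    return results


if __name__ == "__main__":
    cells = asyncio.run(build(DEPS, num_rows=3))
    print(cells[("answer", 0)])
```

Because the semaphore is acquired only after a cell's dependencies are satisfied, waiting cells never hold a concurrency slot, so the sketch cannot deadlock on a valid (acyclic) dependency map.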

Technical Details & Implementation Plan

  • Dependency map: built from each column config's required_columns property (Jinja2 template introspection) — no config schema changes needed
  • Completion tracker: lightweight columns × rows matrix that determines task readiness
  • Async task scheduler: replaces the sequential _run_batch loop; dispatches tasks as dependencies are met, bounded by semaphore
  • Generator async migration: all generator types get async-capable agenerate methods (LLM generators already have native async from #280, "feat(engine): env-var switch for async-first models experiment")
  • Row group checkpointing: parquet written when a row group fully completes
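
As a rough illustration of the completion tracker and the row-group checkpoint condition listed above, the sketch below keeps a set of finished (column, row) cells and answers two questions: which cells are ready to run, and whether a row group is fully complete. The class name, fields, and method names are assumptions made for this example, not the planned API, and the dependency map is passed in directly instead of being read from `required_columns`.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionTracker:
    """Tracks which (column, row) cells are done; all names are illustrative."""

    columns: list
    num_rows: int
    row_group_size: int
    deps: dict  # column -> list of required upstream columns
    done: set = field(default_factory=set)

    def mark_done(self, column: str, row: int) -> None:
        self.done.add((column, row))

    def is_ready(self, column: str, row: int) -> bool:
        """Ready means: not done yet, and every upstream cell in the same row is done."""
        if (column, row) in self.done:
            return False
        return all((up, row) in self.done for up in self.deps.get(column, []))

    def ready_tasks(self) -> list:
        """All cells that could be dispatched right now, in row-major order."""
        return [
            (c, r)
            for r in range(self.num_rows)
            for c in self.columns
            if self.is_ready(c, r)
        ]

    def row_group_complete(self, group: int) -> bool:
        """True once every cell in the row group is done (the checkpoint condition)."""
        start = group * self.row_group_size
        stop = min(start + self.row_group_size, self.num_rows)
        return all((c, r) in self.done for c in self.columns for r in range(start, stop))
```

A real scheduler would also need to exclude tasks that are already in flight from `ready_tasks()`, and would trigger the parquet write for a group when `row_group_complete(group)` first becomes true.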

Dependencies

Part of #260. Builds on #280 (merged). Related: #269, #344.

Labels

enhancement (New feature or request)
