
Split streams for parallel flows #2

@thearchitector


A fairly obvious limitation of the current IO design is that it forces a concurrent-but-single-stream processing model. If my stream is 20,000 items long and my flow has 3 processors, I can still only process 3 items at a time; the last item in the stream will not start processing until all 19,999 items before it have started.

It's a different type of scalability problem, and the current design does not handle it.

It would be better if, in addition to the concurrent processing approach, datagraph supported parallel processing via stream partitions. Instead of 1 flow processing 3 items of a stream concurrently, the stream could be split into 3 partitions so that 3 flows process 9 items concurrently.

There are a couple of ways we could do this: in-process or distributed.

  1. In-process is simpler and avoids having to integrate with the parent scheduling framework. When a flow is started, its IO streams are partitioned and each underlying stream is prefixed with a partition_id. Processors would then gather N runners in a task group, where each runner is responsible for one partition.
  2. Out-of-process / distributed would involve coordinating partition handling across several replicas of the actual worker service, so that different replicas could run different partitions. I'm not sure how this would impact the Executor interface, but it seems likely that we'd need to alter it so that each runner is actually a group of runners (each with its own partition).
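A rough sketch of the in-process option, to make the idea concrete. It uses plain `asyncio` and is not based on datagraph's actual API: `partition_key`, `run_partition`, and `run_partitioned` are hypothetical names, the `"{partition_id}:{stream}"` prefix format is an assumption, and the stream is assumed to be a fully known list so it can be sliced up front.

```python
import asyncio
from collections import defaultdict


def partition_key(stream: str, partition_id: int) -> str:
    # Hypothetical naming scheme: prefix the stream name with its partition id.
    return f"{partition_id}:{stream}"


async def run_partition(key, items, process, results):
    # One runner: handles the items of a single partition, in order.
    for item in items:
        results[key].append(await process(item))


async def run_partitioned(stream, items, process, num_partitions):
    # Assumes the whole stream is known up front so it can be sliced round-robin.
    partitions = [items[i::num_partitions] for i in range(num_partitions)]
    results = defaultdict(list)
    # Gather one runner per partition; the runners execute concurrently.
    await asyncio.gather(
        *(
            run_partition(partition_key(stream, pid), part, process, results)
            for pid, part in enumerate(partitions)
        )
    )
    return dict(results)


async def double(x):
    # Stand-in for a real processor.
    await asyncio.sleep(0)
    return x * 2


if __name__ == "__main__":
    print(asyncio.run(run_partitioned("numbers", list(range(9)), double, 3)))
```

On Python 3.11+, `asyncio.TaskGroup` could replace `asyncio.gather` here to get structured cancellation when one runner fails.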

The tricky thing with both of these is figuring out how to partition the streams. If the final size of each IO stream is known up front, partitioning is easier. Is that guaranteed, though?
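If the final size is not guaranteed, one possibility is to assign each item to a partition as it arrives instead of slicing the stream ahead of time. The sketch below shows two common strategies, round-robin and hash-based assignment. Both helpers are hypothetical and not part of datagraph; hashing `repr(item)` with CRC32 is only an illustrative choice of stable key.

```python
import zlib
from itertools import count


def round_robin_partitioner(num_partitions):
    # Assigns items to partitions 0, 1, ..., N-1, 0, 1, ... in arrival order.
    counter = count()

    def assign(_item):
        return next(counter) % num_partitions

    return assign


def hash_partitioner(num_partitions):
    # Assigns an item based on a stable hash of its repr, so equal items
    # always land in the same partition regardless of arrival order.
    def assign(item):
        return zlib.crc32(repr(item).encode("utf-8")) % num_partitions

    return assign


if __name__ == "__main__":
    assign = round_robin_partitioner(3)
    for item in ["a", "b", "c", "d"]:
        print(item, "-> partition", assign(item))
```

Round-robin keeps partitions balanced without needing the total size; hash-based assignment trades some balance for keeping related items together.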

Metadata

Labels

enhancement: New feature or request
