A fairly obvious limitation of the current IO design is that it forces a concurrent-but-single-stream processing model. If my stream is 20,000 items long and my flow has 3 processors, I can still only process 3 items at a time; the last item in the stream will not be processed until all 19,999 before it have started.
It's a different type of scalability problem, and the current design does not handle it.
It would be better if, in addition to the concurrent processing approach, datagraph supported parallel processing via stream partitions. So instead of 1 flow processing 3 items of a stream concurrently, the stream could be partitioned so that 3 flows could process 9 items concurrently.
There are a couple ways we could do this: in-process or distributed.
- In-process is simpler and avoids having to integrate with the parent scheduling framework. When a flow is started, IO streams are partitioned and the underlying stream is prefixed with a `partition_id`. Processors would then gather N runners in a task group, where each runner is responsible for some partition.
- Out-of-process / distributed would involve coordinating partition handling across several replicas of the actual worker service. It would be possible for different replicas to run different partitions. I'm not sure how this would impact the `Executor` interface, but it seems likely that we'd need to alter it so that each runner was actually a group of runners (each with their own partition).
The tricky thing with both of these is figuring out how to partition the streams. If the final stream size for each IO is known initially, that's easier. Is that guaranteed, though?
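One way to sidestep the size question is to assign a partition per item (e.g. by hashing a stable key) rather than pre-splitting, so no upfront count is needed. A sketch, with `assign_partition` as a hypothetical helper:

```python
import zlib

def assign_partition(key: str, n_partitions: int) -> int:
    """Assign a partition from a stable per-item key, so no upfront
    stream size is needed; crc32 keeps it deterministic across runs
    (unlike built-in hash(), which is salted per process)."""
    return zlib.crc32(key.encode()) % n_partitions
```

The trade-off is that hashed partitions can end up unevenly sized, whereas pre-splitting a known-length stream can guarantee balance.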