RFC: materialize the result to handle non-deterministic functions#24
Is there any interaction between UDFs and watermarks?
I think we should give users a way to configure this on the UDF, such as
This proposal makes me realize that fixing a bug in one of our built-in expressions in a version update also breaks determinism, even if the expression itself is mathematically pure. 😄 If we are serious about consistency and determinism, we would even have to version-tag the built-in expressions and keep the buggy version around when fixing it.
I just thought of a problem: IIRC, a UDF is currently just an expression, which can be used in places besides Project, e.g., Filter. Should we restrict UDFs to Project? Besides the determinism problem, there are also optimizations like risingwavelabs/risingwave#8703 to consider.
BTW, https://github.com/risingwavelabs/rfcs/pull/12/files#diff-135315ddd7b7929c8daf439f35043ba3c168ce620ab72c56740a6055d15e8f9fR141 decided against adding a new operator, but it seems we will eventually need to add some 😅
First, I need to clarify that "UDF" and "non-deterministic expression" are independent concepts. PROC_TIME() and RAND() are built-in expressions, but they are non-deterministic. A user-defined function can also be deterministic if the user makes a contract with us.
Also, the materialized Project is only needed for non-deterministic expressions on an "updatable stream". On an append-only stream, a normal Project can be used for non-deterministic expressions too.
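To make the two axes concrete, here is a minimal sketch of how a planner might pick between a plain Project and a materialized one. It is not RisingWave code: `ExprInfo`, `choose_project`, and the returned operator names are hypothetical, and the only rule it encodes is the one stated in this thread (materialize only when a non-deterministic expression runs over an updatable stream).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ExprInfo:
    """Hypothetical descriptor: determinism is independent of being a UDF."""
    name: str
    is_udf: bool
    deterministic: bool


def choose_project(exprs: Iterable[ExprInfo], input_append_only: bool) -> str:
    """Return the (hypothetical) operator name for a projection over `exprs`."""
    has_non_deterministic = any(not e.deterministic for e in exprs)
    # An append-only input never retracts rows, so old results never need
    # to be reproduced and nothing has to be materialized.
    if has_non_deterministic and not input_append_only:
        return "MaterializeProject"
    return "Project"


# Built-in but non-deterministic, and a UDF declared deterministic by contract.
proc_time = ExprInfo("PROC_TIME", is_udf=False, deterministic=False)
pure_udf = ExprInfo("my_pure_udf", is_udf=True, deterministic=True)
```

Note that `is_udf` never influences the decision, which is the point of keeping the two concepts separate.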
I totally agree with refusing to add a new operator for UDFs or async expressions, but not for non-deterministic expressions. In other words, if we had added a "UDFProject", we might now have to accept StreamProject, StreamAsyncProject, StreamMaterializeProject, StreamAsyncMaterializeProject 😅 🥵
Hmmm, after thinking for a while, I feel that although they are conceptually different, in practice they cannot be so cleanly separated 😅:
Since we want to "always materialize UDFs" (except in rare cases, used with caution), MaterializeProject would become the main executor for UDFs. Optimizations for UDFs in Project, like risingwavelabs/risingwave#8703, would then be in vain. On the other hand, do you think we should optimize MaterializeProject for async execution?
Oh, I'm wrong again 😇😇😇. I forgot about append-only streams. But the points might still apply.
I think all the optimization for async execution should be done behind the expression framework 🤔 Well, there are also some executor-level optimizations. For example, some users might not care about the order of records on an append-only stream, so we could reorder them.
> all the optimization for async execution should be done behind the expression framework
Is that enough? If so, we can only tweak the performance of an expression within one chunk, but not among chunks, like buffered in https://github.com/risingwavelabs/risingwave/pull/8703/files
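As an illustration of what "among chunks" could mean at the executor level, the sketch below keeps several chunk evaluations in flight at once while still emitting results in input order. It is not the code from the linked PR; `eval_buffered`, `eval_chunk`, and `max_in_flight` are hypothetical names, and a chunk is modeled as a plain list.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Iterable, List

Chunk = List[int]  # stand-in for a real data chunk


async def eval_buffered(
    chunks: Iterable[Chunk],
    eval_chunk: Callable[[Chunk], Awaitable[Chunk]],
    max_in_flight: int,
) -> List[Chunk]:
    """Evaluate chunks concurrently, at most `max_in_flight` at a time,
    and return the results in the original chunk order."""
    in_flight: deque = deque()  # FIFO of running tasks
    results: List[Chunk] = []
    for chunk in chunks:
        in_flight.append(asyncio.ensure_future(eval_chunk(chunk)))
        if len(in_flight) >= max_in_flight:
            # Always wait for the oldest task, so output order is preserved
            # even if a later chunk finishes earlier.
            results.append(await in_flight.popleft())
    while in_flight:
        results.append(await in_flight.popleft())
    return results
```

An optimization hidden entirely behind the expression framework would only see one chunk per call, so this kind of overlap has to live in the executor (or in whatever drives the expression).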
Well, actually I don’t mind adding many different executor variants in the future, just like the TopN variants 😅
Let’s conclude the discussion.
## Design
We will introduce a `MaterializeProject`, which will materialize some **partial** columns of the result with the stream key of the input as the primary key. When an `Insert` operation comes, it will compute the result and materialize some columns. When an `Update` or `Delete` comes, it will look up its state to replace the old value of the operation.
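The behavior described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the proposed executor: the class layout, the method names, the use of `row[0]` as the stream key, and the op labels in the output are all hypothetical, and the state is a plain dict rather than a state table.

```python
import itertools
from typing import Callable, Dict, Hashable, List, Tuple

Row = Tuple  # assumption: row[0] is the stream key of the input
Output = List[Tuple[str, Row]]


class MaterializeProject:
    """Toy model: append one computed column and remember it per stream key."""

    def __init__(self, expr: Callable[[Row], object]) -> None:
        self.expr = expr  # possibly non-deterministic expression
        self.state: Dict[Hashable, object] = {}  # stream key -> materialized value

    def insert(self, row: Row) -> Output:
        value = self.expr(row)  # computed exactly once per inserted row
        self.state[row[0]] = value
        return [("Insert", row + (value,))]

    def delete(self, row: Row) -> Output:
        # Do not re-evaluate: replay the value that was emitted on insert.
        old_value = self.state.pop(row[0])
        return [("Delete", row + (old_value,))]

    def update(self, old_row: Row, new_row: Row) -> Output:
        old_value = self.state.pop(old_row[0])  # old side comes from state
        new_value = self.expr(new_row)  # new side is freshly computed
        self.state[new_row[0]] = new_value
        return [
            ("UpdateDelete", old_row + (old_value,)),
            ("UpdateInsert", new_row + (new_value,)),
        ]


def make_non_deterministic_expr(start: int = 100) -> Callable[[Row], object]:
    """Stand-in for RAND()/PROC_TIME(): returns a different value on every call."""
    counter = itertools.count(start)
    return lambda _row: next(counter)
```

Because the retraction replays the stored value instead of calling the expression again, downstream operators see a `Delete` that matches the earlier `Insert` exactly, which is what a plain Project cannot guarantee for a non-deterministic expression.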
Found a problem: not all expressions can be easily extracted into a Project, e.g., those in a FILTER clause. 🤡
This is mentioned in #12 as a reason against a dedicated operator for UDF.