[CORE] Refactor task error logging into dedicated utility class#11718
[CORE] Refactor task error logging into dedicated utility class#11718zhztheplayer merged 2 commits intoapache:mainfrom
Conversation
|
Run Gluten Clickhouse CI on x86 |
af1dfe8 to
2764e11
Compare
|
Run Gluten Clickhouse CI on x86 |
|
@zhztheplayer Could you please review? |
zhztheplayer
left a comment
There was a problem hiding this comment.
Hi @pratham76, thanks for helping out.
The intention of the TODO could mean we need to rely on Spark's scheduler layer to log the error. But we had to keep the TaskFailureListener here because Gluten may cause crash in CompletionListener so Spark's scheduler code will never have a chance to log the error.
So I think the discussion here is not about how to log the error. We may need to consider open PR in upstream Spark to make logging work even CompletionListener crashed.
I might not recall everything so let me know if you have different findings here.
zhztheplayer
left a comment
There was a problem hiding this comment.
Apart from the above comment, PR change itself LGTM.
|
Run Gluten Clickhouse CI on x86 |
|
Hi @zhztheplayer, Thanks for the clarification, I do get that the proper solution for the above mentioned TODO would be make spark's error logging more resilient when CompletionListener crashes. In this PR, I have tried to refactor the logging logic into separate utility class, to handle both scenarios of task failures and task recompute/retries, just to make the code a bit cleaner. I'm happy to explore the upstream Spark changes for this TODO (which I have updated here to give better context). For now, could we proceed with this refactoring as an incremental improvement? |
693fd91 to
9ef29f5
Compare
|
Run Gluten Clickhouse CI on x86 |
Sounds reasonable. Thanks for the patch. |
What changes are proposed in this pull request?
Extract task error logging from TaskResources into new
TaskErrorLoggerutility, resolving a TODO here. Improves separation of concerns while maintaining identical functionality for filtering speculative execution errors.How was this patch tested?
Existing UTs
Was this patch authored or co-authored using generative AI tooling?
No