Dev mainloop integration 1 #949

Open
kaya-david wants to merge 26 commits into poc-mainloop from dev-mainloop-integration-1
Conversation

Collaborator

@kaya-david kaya-david commented Mar 26, 2026


The rendered docs for this PR can be found here.

@kaya-david kaya-david force-pushed the dev-mainloop-integration-1 branch from 9c6b0c4 to 8c36e95 Compare March 26, 2026 06:38
@kaya-david kaya-david marked this pull request as draft March 26, 2026 07:31
@kaya-david kaya-david self-assigned this Mar 26, 2026
@kaya-david kaya-david marked this pull request as ready for review March 30, 2026 08:08
@kaya-david kaya-david requested a review from mhoff March 30, 2026 08:08
Collaborator

@mhoff mhoff left a comment

Hey @kaya-david, I did a first review pass. Happy to discuss.

@kaya-david kaya-david force-pushed the dev-mainloop-integration-1 branch from 9db4ab9 to 04234fd Compare March 31, 2026 09:12
Collaborator

@mhoff mhoff left a comment


I just have one comment for now. I have not analyzed the Kafka or OpenSearch output any further, as @Pablu23 is already working on those.

@kaya-david kaya-david requested a review from Pablu23 April 2, 2026 04:18
@kaya-david kaya-david force-pushed the dev-mainloop-integration-1 branch from af8deb4 to 5dbb963 Compare April 2, 2026 04:32
@kaya-david kaya-david requested a review from mhoff April 2, 2026 04:57
Collaborator

@mhoff mhoff left a comment


Hello @kaya-david, I have added some comments

Comment on lines +359 to +363
    async def shut_down(self):
        """Raises Uvicorn HTTP Server internal stop flag and waits to join"""
        if self.http_server:
            self.http_server.shut_down()
-       return super()._shut_down()
+       await super().shut_down()
Collaborator

Same here

Collaborator Author


As explained above, we no longer introduce separate "clean-up" paths, but instead rely on a single, well-defined tear-down path that implies the instance must not be used afterwards.

 tasks_but_current = [t for t in self._worker_tasks if t is not current_task]

-logger.debug("waiting for termination of %d tasks", len(tasks_but_current))
+logger.debug(f"waiting for termination of {len(tasks_but_current)} tasks")
Collaborator


I am a bit puzzled by this change. I thought using %d would be the proper way to avoid string interpolation if the log level is not activated.

Collaborator Author


Good catch, thanks! I’ve reverted this change.
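The point behind %-style logging can be made concrete with a small, self-contained sketch (the `Expensive` class below is illustrative only): with `%s` plus arguments, the interpolation is deferred until a handler actually emits the record, while an f-string is formatted eagerly regardless of the log level.

```python
import logging

calls = []

class Expensive:
    """Object whose string conversion records that it was formatted."""
    def __str__(self):
        calls.append("formatted")
        return "expensive"

logger = logging.getLogger("demo")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.INFO)  # DEBUG records are discarded

obj = Expensive()

# %-style: the argument is stored on the record and only formatted
# when a handler actually emits the message -- here, never.
logger.debug("value: %s", obj)
assert calls == []

# f-string: interpolation happens eagerly, before the level check.
logger.debug(f"value: {obj}")
assert calls == ["formatted"]
```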

Collaborator

@Pablu23 Pablu23 left a comment


Thanks for your work! I left a few comments on things I noticed regarding the inputs and outputs.

consumer = await self.get_consumer()

if consumer is not None:
    await consumer.unsubscribe()
Collaborator


I believe unsubscribe is unnecessary here, but it's probably not doing any harm either.

Collaborator Author


Yes, you are right; unsubscribe is only needed for dynamic topic switching during runtime.

In our case, shut_down is not designed for that. It follows RAII-like semantics: calling it implies a full teardown of the instance and all associated resources. After that, a fresh instance is expected to be created via setup (as the counterpart to shut_down).

Continuing to operate on the same instance after a partial cleanup (e.g. via unsubscribe) is explicitly not part of the intended lifecycle.
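The lifecycle described above can be sketched in a few lines. This is a hypothetical illustration, not logprep code: `setup` acquires all resources, `shut_down` is a full teardown, and any use of the instance afterwards is an error by design.

```python
import asyncio

class Connector:
    """Sketch of the setup/shut_down lifecycle: setup() acquires all
    resources, shut_down() releases them completely. After shut_down(),
    the instance must not be reused -- create a fresh one instead."""

    def __init__(self):
        self._consumer = None
        self._closed = False

    async def setup(self):
        self._consumer = object()  # stand-in for a real consumer/client

    async def shut_down(self):
        # Full teardown: release everything, then mark the instance dead.
        self._consumer = None
        self._closed = True

    async def consume(self):
        if self._closed or self._consumer is None:
            raise RuntimeError("instance was shut down; create a new one")
        return "event"

async def main():
    connector = Connector()
    await connector.setup()
    assert await connector.consume() == "event"
    await connector.shut_down()
    try:
        await connector.consume()
    except RuntimeError:
        return "rejected after shut_down"

result = asyncio.run(main())
print(result)
```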

Comment on lines +327 to +343

except BufferError:
    # block program until buffer is empty or timeout is reached
    self._producer.flush(timeout=self.config.flush_timeout)
    logger.debug("Buffer full, flushing")

    try:
        self._producer.produce(
            topic=target,
            value=self._encoder.encode(document),
            on_delivery=partial(self.on_delivery, event),
        )
    except BufferError as err:
        event.state.current_state = EventStateType.FAILED
        event.errors.append(err)
        logger.error("Message delivery failed after retry: %s", err)
        self.metrics.number_of_errors += 1
        return
Collaborator


This is new logic: now we don't try to flush every time, we only flush when we get a BufferError. That is fine if we only want to flush on a full buffer, but I don't think we do. I also don't like nesting try/except like this, though I don't have a better solution for now. Maybe recursively call this same function and increment an optional depth argument, and once it reaches the configured retry count we can error out again.
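The bounded-retry idea suggested above could look roughly like this. Everything here is an illustrative sketch, not logprep code: `FakeProducer` simulates a full queue that clears after two flushes, and `produce_with_retry` carries the depth argument mentioned in the comment.

```python
# Sketch of the suggested bounded retry: retry the produce call after
# flushing, and give up once max depth is reached.

MAX_RETRIES = 3

class FakeProducer:
    """Raises BufferError until flush() has been called twice."""
    def __init__(self):
        self.flushes = 0

    def produce(self, value):
        if self.flushes < 2:
            raise BufferError("queue full")
        return "queued"

    def flush(self):
        self.flushes += 1

def produce_with_retry(producer, value, depth=0):
    try:
        return producer.produce(value)
    except BufferError:
        if depth >= MAX_RETRIES:
            raise  # give up: surface the error to the caller
        producer.flush()  # make room, then retry with incremented depth
        return produce_with_retry(producer, value, depth + 1)

result = produce_with_retry(FakeProducer(), b"payload")
print(result)  # -> queued
```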

Collaborator Author


This applied to the old sync producer where we had to manually coordinate produce/poll/flush. I’ve now migrated the producer to the async AIOProducer, where delivery is handled via awaitable futures and internal batching, so this control flow (and related concerns) no longer applies.

Could you please cherry-pick the relevant parts into your output ticket and take another look there? If needed, feel free to implement your own async variant of store_custom / producer handling.

For now I’d keep the current implementation as is and suggest we clean this up together in a separate PR to properly align on the async approach.
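The "delivery handled via awaitable futures" pattern mentioned above can be shown generically. This is a hedged sketch of the callback-to-future bridge, not confluent-kafka's actual AIOProducer; `CallbackProducer` and `produce_awaitable` are illustrative names.

```python
import asyncio

class CallbackProducer:
    """Stand-in for a sync, callback-based producer (illustrative only)."""
    def produce(self, value, on_delivery):
        on_delivery(None, f"delivered:{value}")  # (err, msg)

async def produce_awaitable(producer, value):
    # Bridge the delivery callback into an asyncio Future so callers
    # can simply `await` the delivery result instead of wiring callbacks.
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    def on_delivery(err, msg):
        # call_soon_threadsafe makes this safe even if the real
        # producer invokes the callback from a background thread.
        if err is not None:
            loop.call_soon_threadsafe(fut.set_exception, Exception(err))
        else:
            loop.call_soon_threadsafe(fut.set_result, msg)

    producer.produce(value, on_delivery=on_delivery)
    return await fut

result = asyncio.run(produce_awaitable(CallbackProducer(), "e1"))
print(result)  # -> delivered:e1
```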

logger.error("Message delivery failed: %s", err)
self.metrics.number_of_errors += 1
return

Collaborator


This part is also odd: why do we have to handle a Kafka exception twice, first in the try/except of the store_custom function and again here? This callback is run from the context of the store_custom function.

Collaborator Author


see comment above

Comment on lines +413 to +414
if "_producer" in self.__dict__:
await self.flush()
Collaborator


Why do we do this? Shouldn't we just always flush? I mean, shouldn't flush be agnostic to whether there is a producer or not? I also don't like this if; isn't there another way to check whether we have a producer?

Collaborator Author


_producer is a cached property and is only initialized on first access. A shut_down could technically occur before it was ever used (i.e. before the producer exists), which would cause a crash during flush. This check is therefore a precaution.
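The subtlety is that merely reading a `cached_property` creates it as a side effect, so an attribute access during shutdown would initialize the producer just to flush it. Checking `__dict__` only inspects the cache. A minimal sketch (the class and method names are illustrative, not logprep code):

```python
import asyncio
from functools import cached_property

class Output:
    @cached_property
    def _producer(self):
        # Creating the real producer is expensive (and could fail during
        # shutdown), so it only happens on first attribute access.
        return object()

    async def flush_if_initialized(self):
        # Reading self._producer here would *create* the producer as a
        # side effect; checking __dict__ only looks at the cache.
        if "_producer" in self.__dict__:
            return "flushed"
        return "nothing to flush"

out = Output()
first = asyncio.run(out.flush_if_initialized())
print(first)  # -> nothing to flush (producer was never created)

out._producer  # first access populates the cache
second = asyncio.run(out.flush_if_initialized())
print(second)  # -> flushed
```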

Comment on lines +371 to +373
search_context = self.__dict__.get("_search_context")
if search_context is not None:
    await search_context.close()
Collaborator


Suggested change
-search_context = self.__dict__.get("_search_context")
-if search_context is not None:
-    await search_context.close()
+await self._search_context.close()

Collaborator Author

@kaya-david kaya-david Apr 7, 2026


Same as above, I added a guard here as well.

PS: I removed the override decorator - as long as mypy does not complain (which is currently the case), overrides don’t add much value. Also, since we have many overloads and don’t consistently use override elsewhere, I prefer to omit it here for consistency.

…cle across components

- unify component lifecycle by introducing async setup/shut_down across NG components
- remove legacy _shut_down pattern and simplify base Component shutdown logic
- align Connector/Input/Output/Processor lifecycle interfaces
- fix kafka output delivery semantics by setting DELIVERED only via on_delivery callback
- improve kafka error handling (BufferError retry, KafkaException -> FAILED)
- ensure proper resource cleanup (consumer unsubscribe/close, producer flush, opensearch context close)
- improve worker shutdown by cancelling only unfinished tasks

# Conflicts:
#	logprep/ng/connector/opensearch/output.py
- remove docker compose teardown from SIGINT handler to avoid interfering with active OpenSearch requests
- introduce coordinated shutdown via _shutdown_requested flag
- add shutdown checkpoints to abort benchmark flow safely
- ensure compose teardown happens only in controlled finally blocks
- fix intermittent 503 errors during OpenSearch _count caused by concurrent shutdown
@kaya-david kaya-david force-pushed the dev-mainloop-integration-1 branch from b037231 to 99cd7ec Compare April 7, 2026 04:43
… (unsubscribe only needed for dynamic topic switching during runtime)
@kaya-david kaya-david force-pushed the dev-mainloop-integration-1 branch from 5e28118 to 6d8bb81 Compare April 7, 2026 09:16