Conversation

@pwnage101 (Member):

DENG-????

# Iterate over processing segments in chronological order, handling each one
# one at a time, and completely freeing the memory of the previous one before
# fetching the next.
for processing_segment in processing_segments:
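To make the code comment concrete, here is a minimal, hypothetical sketch of the pattern it describes: fetch one segment at a time and drop the only reference to it before the next fetch, so peak memory is bounded by a single segment. The names `fetch_segment` and `process_segments` are illustrative, not from the PR.

```python
import gc

def fetch_segment(segment_id):
    """Hypothetical fetch: the real loader would pull one processing
    segment's events from the upstream store."""
    return [{"segment": segment_id, "event": i} for i in range(3)]

def process_segments(segment_ids):
    """Handle segments strictly one at a time, freeing each batch
    before fetching the next."""
    handled = []
    for segment_id in segment_ids:
        events = fetch_segment(segment_id)
        handled.append((segment_id, len(events)))
        # Drop the only reference to this segment's events so the memory
        # can be reclaimed before the next (potentially large) fetch.
        del events
        gc.collect()  # optional: force collection between large batches
    return handled
```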
@pwnage101 (Member, author):
Each iteration of this loop could become a task, as long as the tasks can be made to run sequentially. That eliminates a big reason to use tasks, which in my mind is automatic concurrency and dependency resolution. However, it may still be useful to make these into tasks for better visibility into the current progress, and for the ability to mark a task run as "skipped" or "failed" if that's ever a possible scenario, or "retried" if we want retries to manifest as task retries instead of backoff retries hidden in the logs.
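The bookkeeping the comment asks for (per-segment "skipped"/"failed"/"retried" status, while still running strictly in order) can be sketched in plain Python, independent of any task framework. Everything here is illustrative: `run_sequentially`, `RunStatus`, and the retry policy are assumptions, not the PR's actual design.

```python
from enum import Enum

class RunStatus(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"

def run_sequentially(segments, handle, max_retries=2):
    """Process segments strictly in chronological order, recording a
    status per segment.

    `handle` is a hypothetical callable that raises on failure. A segment
    that exhausts its retries is marked FAILED and processing stops, which
    preserves the ordering guarantee; untouched segments become SKIPPED.
    """
    statuses = {}
    for segment in segments:
        for attempt in range(max_retries + 1):
            try:
                handle(segment)
                statuses[segment] = RunStatus.SUCCESS
                break
            except Exception:
                if attempt == max_retries:
                    statuses[segment] = RunStatus.FAILED
        if statuses[segment] is RunStatus.FAILED:
            # Later segments depend on earlier ones landing first.
            break
    for segment in segments:
        statuses.setdefault(segment, RunStatus.SKIPPED)
    return statuses
```

In a task framework the same outcome would be visible in the UI instead of a returned dict, which is the visibility benefit the comment is weighing.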

Contributor:
But if we are going to mark tasks as "skipped" or "failed", then we are also not running them sequentially.

@brianhw (Contributor) left a comment:
Looks good! Great progress! Just a few nits, really, nothing major.

However, the worst-case scenario is that somehow a large proportion of the events in this processing segment
correspond to a single user, in which case the memory consumption would equal that of loading the entire processing
segment into memory (multiple gigabytes). We could address memory consumption by adding even more complexity (and
possibly bugs) to the code, but this is an exceedingly rare scenario that isn't worth optimizing for.
Contributor:
It might be fun to calculate a statistic about what the maximum number of events per user actually is. But the case where we're most likely to hit this is if we were to do a backfill job. That first time, we'll presumably have all events for a user over the last two years. After the first load, we expect to run in smaller increments, and should be fine.
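The statistic suggested above is cheap to compute if the events are streamed through a counter. This is a hypothetical sketch, assuming each event carries a `user_id` key; nothing here is from the PR's actual schema.

```python
from collections import Counter

def max_events_per_user(events):
    """Return (user_id, count) for the user with the most events, or
    (None, 0) if there are no events -- a quick way to gauge how skewed
    the per-user worst case actually is."""
    counts = Counter(event["user_id"] for event in events)
    if not counts:
        return None, 0
    # most_common(1) yields the single highest-count (user, count) pair.
    user, count = counts.most_common(1)[0]
    return user, count
```

Run once over a backfill window, this would tell you whether the single-user worst case is a real risk or, as expected, negligible for incremental loads.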

@pwnage101 pwnage101 force-pushed the pwnage101/amplitude_loader_logic branch from 0338b2b to 2f57f68 Compare November 18, 2021 19:43