feat: Add logic to enable loading prepped events into Amplitude #52
base: master
Conversation
```python
# Iterate over processing segments in chronological order, handling each
# one at a time, and completely freeing the memory of the previous one
# before fetching the next.
for processing_segment in processing_segments:
```
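The one-segment-at-a-time pattern quoted above can be sketched with a generator that fetches a segment only when the previous batch has been dropped, so at most one segment's events are referenced at any moment. The names below (`iter_segment_events`, `fetch_events`) are hypothetical stand-ins, not the PR's actual API:

```python
from typing import Dict, Iterator, List


def fetch_events(segment_id: str) -> List[Dict]:
    # Hypothetical fetch; in the real job this would read the segment's
    # events from storage.
    return [{"segment": segment_id, "event": i} for i in range(3)]


def iter_segment_events(segment_ids: List[str]) -> Iterator[List[Dict]]:
    """Yield the events of one processing segment at a time.

    Because each batch is yielded and then discarded before the next
    fetch, the previous segment's events become garbage-collectible
    before new memory is allocated.
    """
    for segment_id in sorted(segment_ids):  # chronological order
        yield fetch_events(segment_id)


# Each loop iteration drops its reference to the previous batch.
for events in iter_segment_events(["2024-01", "2024-02"]):
    batch_size = len(events)  # placeholder for the real load step
```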
Each iteration of this loop could become a task, as long as the tasks can be made to run sequentially. That eliminates a big reason to use tasks in the first place, which in my mind is automatic concurrency and dependency resolution. However, it may still be useful to make these into tasks for better visibility into current progress, and for the ability to mark a task run as "skipped" or "failed" if those are ever possible scenarios, or "retried" if we want retries to manifest as task retries instead of backoff retries hidden in the logs.
But if we are going to mark tasks as "skipped" or "failed", then we are also not really running them sequentially.
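The trade-off discussed above can be illustrated framework-agnostically: one task per segment, executed strictly in order, with per-task status recorded so that a failure marks the remaining tasks as skipped. This is only a sketch of the idea, not the orchestrator the project actually uses:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskRun:
    name: str
    fn: Callable[[], None]
    status: str = "pending"  # becomes "succeeded", "failed", or "skipped"


def run_sequentially(tasks: List[TaskRun], stop_on_failure: bool = True) -> None:
    """Run one task per segment in order, recording per-task status.

    Sequential execution with visible per-task outcomes: a real
    orchestrator would model this with a linear task dependency chain,
    but the bookkeeping is the same.
    """
    failed = False
    for task in tasks:
        if failed and stop_on_failure:
            task.status = "skipped"
            continue
        try:
            task.fn()
            task.status = "succeeded"
        except Exception:
            task.status = "failed"
            failed = True
```

A downstream observer then sees exactly which segment broke and which segments were never attempted, instead of digging through retry logs.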
brianhw left a comment
Looks good! Great progress! Just a few nits, really, nothing major.
```python
# However, the worst-case scenario is that somehow a large proportion of the
# events in this processing segment corresponds to a single user, in which
# case the memory consumption would equal that of loading the entire
# processing segment into memory (multiple gigabytes). We could address
# memory consumption by adding even more complexity (and possibly bugs) to
# the code, but this is an exceedingly rare scenario that isn't worth
# optimizing for.
```
It might be fun to calculate a statistic about what the maximum number of events per user actually is. But the case where we're most likely to hit this is if we were to do a backfill job. That first time, we'll presumably have all events for a user over the last two years. After the first load, we expect to run in smaller increments, and should be fine.
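The statistic suggested above is cheap to compute with a counter over user IDs. A minimal sketch, assuming events are dicts carrying a `"user_id"` key (the real schema may differ):

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def max_events_per_user(events: Iterable[Dict]) -> Optional[Tuple[str, int]]:
    """Return (user_id, count) for the user with the most events.

    This estimates the worst-case per-user memory footprint: if the
    max share is small relative to the segment, the single-user
    blow-up scenario is not a practical concern.
    """
    counts = Counter(e["user_id"] for e in events)
    if not counts:
        return None
    return counts.most_common(1)[0]
```

Running this once over a backfill-sized window would show whether the first load (all events per user over two years) comes anywhere near the multi-gigabyte worst case.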
Force-pushed from 0338b2b to 2f57f68
DENG-????