Enovotny
left a comment
Overall looks good. I just have some additions to the tests. Also, did you do any performance testing to see what the optimal chunk size would be? Is it 2 weeks, or would a larger size be more efficient?
# Generate 15-minute interval timestamps
dt = pd.date_range(
    start=START_DATE_CHUNK_MULTI, end=END_DATE_CHUNK_MULTI, freq="15T", tz="UTC"
)
Change `15T` to `15min` based on the warning from the test.
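For reference, a minimal sketch of the suggested change. The two date constants below are hypothetical stand-ins, since the real values of `START_DATE_CHUNK_MULTI` and `END_DATE_CHUNK_MULTI` are not shown in this thread.

```python
import pandas as pd

# Hypothetical stand-ins for the test constants (real values not shown in the PR).
START_DATE_CHUNK_MULTI = "2025-01-01"
END_DATE_CHUNK_MULTI = "2025-01-31"

# Generate 15-minute interval timestamps using the "15min" alias,
# which avoids the deprecation warning raised for "15T".
dt = pd.date_range(
    start=START_DATE_CHUNK_MULTI, end=END_DATE_CHUNK_MULTI, freq="15min", tz="UTC"
)
```

The resulting index should be the same as before; only the frequency alias changes.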
I did some testing last week on reads for chunk size and number of max_workers (I assume stores are similar). Generally, 14-day chunks are good for queries up to about a year, and beyond that you are better off using about 30-day chunks. For max_workers threads, 20 is good for query lengths below a year, and beyond that you are better off bumping to 30. We could code the defaults to scale with the query length, but I think that would just complicate things. The user can increase the chunk size and max_workers if they are getting POR data.
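For illustration only, here is roughly what scaling the defaults by query length could look like. `suggest_chunk_settings` is a hypothetical helper, not something in this PR, and the 365-day cutoff is an assumption standing in for "about a year".

```python
from datetime import datetime, timedelta


def suggest_chunk_settings(begin, end):
    """Hypothetical helper: return (chunk_size_days, max_workers) for a query window."""
    # Assumed cutoff of 365 days, approximating "about a year" from the testing above.
    if end - begin <= timedelta(days=365):
        return 14, 20  # shorter pulls: 14-day chunks, 20 worker threads
    return 30, 30  # longer pulls: 30-day chunks, 30 worker threads


short_pull = suggest_chunk_settings(datetime(2024, 1, 1), datetime(2024, 3, 1))
long_pull = suggest_chunk_settings(datetime(2015, 1, 1), datetime(2024, 1, 1))
```

It is a small amount of code, but it does add a second behavior to document and test, which is the complication mentioned above.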
How does the 30-day size run with less than a year of data? If it is comparable to 14, I would make 30 the default. Same for workers. If 14 is faster than 30 for smaller pulls, I agree with leaving it. Maybe add a note to the parameter that lets the user know this. I also have additional asserts to add: one to make sure that the number of values returned is the same as the number stored, and one to make sure no null values are returned.
Yeah, for smaller pulls, 14 is faster. Same with max_workers (20 is better than 30 for small pulls). I added a note to the documentation. Probably worth revisiting once CDA gets performance improvements.
Added those asserts.
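A rough sketch of the two requested checks, assuming the stored and returned data are pandas DataFrames. The column names and the `returned = stored.copy()` stand-in for the store/read round trip are hypothetical, not the actual test code.

```python
import numpy as np
import pandas as pd

# Hypothetical stored data: one value per 15-minute timestamp.
dt = pd.date_range(start="2025-01-01", end="2025-01-31", freq="15min", tz="UTC")
stored = pd.DataFrame({"date-time": dt, "value": np.arange(len(dt), dtype=float)})

# Stand-in for the real store-then-read round trip (assumption for this sketch).
returned = stored.copy()

# The number of values returned matches the number stored.
assert len(returned) == len(stored)

# No null values were returned.
assert returned["value"].notna().all()
```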
Updated tests to compare entire df read/write with



Hi Guys,
I refactored my first attempt at chunked, multi-threaded timeseries requests to make it easier to follow and maintain (hopefully). Chunked multi-threading is on by default. For example, a query longer than 2 weeks spawns one thread per 2-week chunk until it hits the max_workers parameter. A store longer than 2 weeks does the same until it hits max_workers.
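A minimal sketch of the general idea, not the code in this PR: split the window into fixed-size chunks and hand them to a thread pool capped at `max_workers`. The names `split_into_chunks`, `fetch_chunk`, and `fetch_all` are hypothetical, and `fetch_chunk` is a placeholder for the real per-chunk request.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def split_into_chunks(begin, end, chunk_days=14):
    """Split [begin, end) into consecutive windows of at most chunk_days."""
    step = timedelta(days=chunk_days)
    chunks = []
    chunk_start = begin
    while chunk_start < end:
        chunk_end = min(chunk_start + step, end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end
    return chunks


def fetch_chunk(window):
    """Placeholder for the real per-chunk request; here it just echoes the window."""
    return window


def fetch_all(begin, end, chunk_days=14, max_workers=20):
    """Fetch every chunk, using one thread per chunk up to max_workers."""
    chunks = split_into_chunks(begin, end, chunk_days)
    # Never ask for more threads than chunks, and always at least one.
    workers = max(1, min(len(chunks), max_workers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(fetch_chunk, chunks))


# A 28-day window with 14-day chunks yields two chunks, so two threads.
results = fetch_all(datetime(2025, 1, 1), datetime(2025, 1, 29))
```

Because `pool.map` preserves input order, the per-chunk results can be concatenated directly without re-sorting.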
I added a store and read test of roughly 1 month, which checks to make sure only 2 threads are used.
This PR replaces #210