
Multithread timeseries Take 2#219

Merged
msweier merged 15 commits into main from multithread_timeseries2
Oct 6, 2025

Conversation

@msweier
Collaborator

@msweier msweier commented Sep 30, 2025

Hi Guys,

I refactored my first attempt at chunked multi-threading for timeseries to make it easier to follow/maintain (hopefully). By default, chunked timeseries multi-threading is on. For example, a query longer than 2 weeks will spawn a thread for each 2-week chunk until it hits the max_worker parameter. A store longer than 2 weeks does the same until it hits the max worker.

**Multi-threading to get_timeseries**

The function defaults to multi-threading, uses up to a max_worker number of threads (~20), and chunks data into windows of 14 days by default (assuming 15-minute data). It uses a helper function get_ts_extents from the catalog to check the extents of the timeseries, so that data is not requested for times outside the extents when the request starts before 2014.
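The chunked-read pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `fetch_one` stands in for whatever single-chunk request the library makes, and the chunking/threading parameters mirror the defaults described in the description.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def chunk_range(begin, end, days=14):
    """Split [begin, end] into consecutive windows of at most `days` days."""
    chunks = []
    start = begin
    while start < end:
        stop = min(start + timedelta(days=days), end)
        chunks.append((start, stop))
        start = stop
    return chunks


def fetch_chunked(fetch_one, begin, end, days=14, max_workers=20):
    """Fetch each window on its own thread; pool.map keeps results in
    window order, so the concatenated output stays chronological."""
    windows = chunk_range(begin, end, days)
    workers = min(max_workers, max(len(windows), 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda w: fetch_one(*w), windows))
    out = []
    for r in results:
        out.extend(r)
    return out
```

Because the thread count is capped at `min(max_workers, number_of_chunks)`, a 1-month query spawns only as many threads as it has 2-week chunks.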

**Multi-threading to store_timeseries**

The function defaults to multi-threading, uses up to a max_worker number of threads (~20), and chunks data into batches of roughly 700 values by default (~14 days of 15-minute data).
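The store side chunks by value count rather than by date span. A minimal sketch, assuming a `store_one` callable that posts a single batch (a placeholder, not the library's actual API):

```python
from concurrent.futures import ThreadPoolExecutor


def store_chunked(store_one, values, chunk_size=700, max_workers=20):
    """Split `values` into fixed-size batches (~700, i.e. ~14 days of
    15-minute data) and store each batch on its own thread."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    workers = min(max_workers, max(len(chunks), 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so any exception from a worker propagates.
        list(pool.map(store_one, chunks))
    return len(chunks)
```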

I added a store and read test of roughly 1 month, which checks to make sure only 2 threads are used.

This PR replaces #210

@msweier msweier requested review from Enovotny and krowvin September 30, 2025 15:23
@msweier msweier mentioned this pull request Sep 30, 2025
Collaborator

@Enovotny Enovotny left a comment


Overall looks good. I just have some additions to the tests. Also, did you do any performance testing to see what the optimal chunk size would be? Is it 2 weeks, or would a larger size be more efficient?

# Generate 15-minute interval timestamps
dt = pd.date_range(
    start=START_DATE_CHUNK_MULTI, end=END_DATE_CHUNK_MULTI, freq="15T", tz="UTC"
)

Change `15T` to `15min`, based on the deprecation warning from the test.
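For reference, recent pandas versions deprecate the `T` alias in favor of `min`; the corrected call would look like this (dates are placeholders for the test constants):

```python
import pandas as pd

# "15min" is the current frequency alias; "15T" emits a FutureWarning
# on recent pandas versions.
dt = pd.date_range(
    start="2025-02-01", end="2025-02-02", freq="15min", tz="UTC"
)
```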

@msweier
Collaborator Author

msweier commented Oct 6, 2025

Overall looks good. I just have some additions to the tests. Also, did you do any performance testing to see what the optimal chunk size would be? Is it 2 weeks, or would a larger size be more efficient?
api_read_performance_result.csv

I did some testing last week on reads for chunk size and number of max_workers (I assume stores behave similarly). Generally, 14-day chunks are good for queries up to about a year, and beyond that you are better off using roughly 30-day chunks. For max_worker threads, 20 is good for lengths below a year; beyond that, bumping to 30 helps. We could code this default to scale with the query length, but I think that would just complicate things. The user can raise the chunk size and max_workers if they are pulling POR (period-of-record) data.

@Enovotny
Collaborator

Enovotny commented Oct 6, 2025

How does the 30-day size run with data less than a year? If it is comparable to 14, I would put it at 30 as the default. Same for workers. If 14 is faster than 30 for smaller pulls, I would agree and leave it; maybe add a note to the parameter that lets the user know this.

Also, I have additional asserts to add: one to make sure the number of values returned is the same as the number stored, and one to make sure no null values are returned.

@msweier
Collaborator Author

msweier commented Oct 6, 2025

How does the 30-day size run with data less than a year? If it is comparable to 14, I would put it at 30 as the default. Same for workers. If 14 is faster than 30 for smaller pulls, I would agree and leave it; maybe add a note to the parameter that lets the user know this.

Yeah for smaller pulls, 14 is faster. Same with max_workers (20 is better than 30 for small pulls). I added a note to the documentation. Probably worth revisiting once CDA gets performance improvements.

Also, I have additional asserts to add: one to make sure the number of values returned is the same as the number stored, and one to make sure no null values are returned.

Added those asserts.
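As a sketch, the two requested asserts could look like this (the frame and column names are placeholders, not the actual test fixtures):

```python
import pandas as pd


def check_roundtrip(stored_df: pd.DataFrame, returned_df: pd.DataFrame) -> None:
    """Asserts requested in review: same number of values come back as
    were stored, and no nulls in the returned values."""
    assert len(returned_df) == len(stored_df), "row count mismatch"
    assert not returned_df["value"].isnull().any(), "null values returned"
```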

@sonarqubecloud

sonarqubecloud bot commented Oct 6, 2025

@msweier
Collaborator Author

msweier commented Oct 6, 2025

Updated tests to compare the entire read/written DataFrame with assert_frame_equal.
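`pandas.testing.assert_frame_equal` checks shape, dtypes, index, and values in one call, subsuming the count and null asserts. A minimal usage sketch with placeholder frames standing in for the stored and read-back timeseries:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Placeholder frames; the real test compares the stored DataFrame against
# the one read back through the chunked, multi-threaded path.
written = pd.DataFrame(
    {
        "date-time": pd.date_range("2025-02-01", periods=4, freq="15min", tz="UTC"),
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
read_back = written.copy()

# Raises AssertionError on any mismatch in shape, dtypes, index, or values.
assert_frame_equal(written, read_back)
```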

@msweier msweier marked this pull request as ready for review October 6, 2025 19:59
@msweier msweier merged commit 1769a73 into main Oct 6, 2025
9 checks passed
@Enovotny Enovotny deleted the multithread_timeseries2 branch October 15, 2025 21:12