start using polars #56
base: migrate-to-polars
1b875a5 to
547f4f4
Compare
examples/smarteole_example.ipynb
Outdated
| "filter_all_test_wtgs_together SMV5 set 400 rows [2.9%] to NA\n", | ||
| "filter_all_test_wtgs_together SMV5 set 3284 rows [23.4%] to NA\n", |
hmmm, these look rather different, have you looked into that?
Good spot. I've addressed this in 8675dad, and the cell 14 output in smarteole_example.ipynb confirms it is fixed. I've also added a test case for the function _filter_turbine_df_by_other_turbine_dfs, which uses an expected value calculated from the current main branch and the Hill of Towie open data (because that is faster than the smarteole pytest fixture).
The error was because there are differences between pandas' `isna()` and Polars' `is_nan()`:
- Pandas `isna()` detects both `NaN` and `None`/`NULL` values. In pandas, these are often treated interchangeably as "missing data."
- Polars `is_nan()` detects only `NaN` values (floating-point NaN). To check for null values in Polars, you need to use `is_null()` instead.
aclerc
left a comment
thanks for this, I think polars could really help!
As per my inline comment, I am not sure the logic is definitely the same as before. Could you add a test for _filter_turbine_df_by_other_turbine_dfs, ideally using the same input and results from a known case like smarteole?
9106c17 to
a04080b
Compare
2e5c89c to
3c81936
Compare
The main change is that `polars` does not have the same concept of named/custom indexes as `pandas`; therefore the `TimeStamp_StartFormat` and `TurbineName` indexes are treated as columns.

The intention of this PR is to start the transition of replacing `pandas` with `polars` throughout the wind-up calculations, with the longer-term goal of speeding up wind-up runs.

Benchmark Results

Note
Benchmark tests can be run using `poe test-benchmark`.

The polars implementation in this PR, specifically in the function `_filter_turbine_df_by_other_turbine_dfs`, shows significant performance improvements: per the full benchmark output, the mean run time is 14.03 ms against 37.85 ms for the pandas version, roughly 2.7x faster.

Key findings:
Full benchmark output
```
------------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------------
Name (time in ms)                     Min                Max               Mean            StdDev             Median               IQR  Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_polars_version     13.3168 (1.0)      17.6729 (1.0)      14.0306 (1.0)      0.9006 (1.0)      13.7265 (1.0)      0.3789 (1.0)         6;7  71.2730 (1.0)          53           1
test_benchmark_pandas_version     36.2638 (2.72)     41.9831 (2.38)     37.8548 (2.70)     1.1473 (1.27)     37.5767 (2.74)     1.2970 (3.42)        3;1  26.4167 (0.37)         25           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```