@V1NAY8
Contributor

@V1NAY8 V1NAY8 commented Sep 16, 2021

`_es_results` duplicates the logic of `search_yield_pandas_dataframes`.

In `_es_results` we dump all of the ES data directly into a `List[...]` and only then convert it into a `pd.DataFrame`. This is slow on larger datasets because every row must be collected before any processing can start.

If we instead iterate over the results internally, convert each batch into a DataFrame as it arrives, and concatenate the DataFrames at the end, performance improves.
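As a rough illustration of the idea above (not the actual eland code), here is a minimal sketch contrasting the two strategies. The function names `iter_batches`, `to_pandas_all_at_once`, and `to_pandas_incremental`, the `batch_size` parameter, and the shape of the `hits` records are all hypothetical and chosen only for this example.

```python
from typing import Any, Dict, Iterable, Iterator, List

import pandas as pd


def iter_batches(
    hits: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group an iterable of records into lists of at most batch_size items."""
    batch: List[Dict[str, Any]] = []
    for hit in hits:
        batch.append(hit)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def to_pandas_all_at_once(hits: Iterable[Dict[str, Any]]) -> pd.DataFrame:
    """Old style: collect every record first, then build one DataFrame."""
    rows = list(hits)
    return pd.DataFrame(rows)


def to_pandas_incremental(
    hits: Iterable[Dict[str, Any]], batch_size: int = 10_000
) -> pd.DataFrame:
    """New style: build one DataFrame per batch, concatenate at the end."""
    frames = [pd.DataFrame(batch) for batch in iter_batches(hits, batch_size)]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical records standing in for Elasticsearch hits.
sample_hits = [{"id": i, "name": f"restaurant-{i}"} for i in range(25)]
df = to_pandas_incremental(iter(sample_hits), batch_size=10)
```

Both functions should produce the same DataFrame; the incremental version simply avoids holding one large intermediate list before conversion begins.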

Performance should improve further once both this PR and #389 are merged.

Right now, the metrics are as follows for nyc-restaurants:

| Before | After |
| --- | --- |
| 2021-09-16 21:55:37.504203: read 10000 rows | 2021-09-16 21:54:04.917145: read 10000 rows |
| 2021-09-16 21:55:37.683338: read 20000 rows | 2021-09-16 21:54:05.391358: read 20000 rows |
| 2021-09-16 21:55:37.852315: read 30000 rows | 2021-09-16 21:54:05.860558: read 30000 rows |
| 2021-09-16 21:55:38.009345: read 40000 rows | 2021-09-16 21:54:06.390333: read 40000 rows |
| 2021-09-16 21:55:38.169318: read 50000 rows | 2021-09-16 21:54:06.869681: read 50000 rows |
| 2021-09-16 21:55:38.331315: read 60000 rows | 2021-09-16 21:54:07.409141: read 60000 rows |
| 2021-09-16 21:55:38.488319: read 70000 rows | 2021-09-16 21:54:07.880103: read 70000 rows |
| 2021-09-16 21:55:38.639349: read 80000 rows | 2021-09-16 21:54:08.339117: read 80000 rows |
| 2021-09-16 21:55:38.804760: read 90000 rows | 2021-09-16 21:54:08.854142: read 90000 rows |
| 2021-09-16 21:55:38.954770: read 100000 rows | 2021-09-16 21:54:09.324386: read 100000 rows |
| 2021-09-16 21:55:39.097755: read 110000 rows | 2021-09-16 21:54:09.783561: read 110000 rows |
| 2021-09-16 21:55:39.245798: read 120000 rows | 2021-09-16 21:54:10.288510: read 120000 rows |
| 2021-09-16 21:55:39.393770: read 130000 rows | 2021-09-16 21:54:10.796494: read 130000 rows |
| 2021-09-16 21:55:39.539786: read 140000 rows | 2021-09-16 21:54:11.306445: read 140000 rows |
| 2021-09-16 21:55:39.685956: read 150000 rows | 2021-09-16 21:54:11.761043: read 150000 rows |
| 2021-09-16 21:55:39.830952: read 160000 rows | 2021-09-16 21:54:12.222031: read 160000 rows |
| 2021-09-16 21:55:39.975965: read 170000 rows | 2021-09-16 21:54:12.744994: read 170000 rows |
| 2021-09-16 21:55:40.122981: read 180000 rows | 2021-09-16 21:54:13.220023: read 180000 rows |
| 2021-09-16 21:55:40.268981: read 190000 rows | 2021-09-16 21:54:13.682519: read 190000 rows |
| 2021-09-16 21:55:40.978960: read 191636 rows | 2021-09-16 21:54:13.834509: read 191636 rows |
| Time Elapsed: 0:00:10.633314 | Time Elapsed: 0:00:09.929368 |

@sethmlarson Take a look :)

@elasticmachine

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@sethmlarson
Contributor

jenkins test this please

@sethmlarson
Contributor

jenkins test this please

Contributor

@sethmlarson sethmlarson left a comment

LGTM, thanks for this!

@V1NAY8
Contributor Author

V1NAY8 commented Oct 13, 2021

Good change, though! I totally missed it. Lint is failing as well; could you run the formatter on the code before merging?

@sethmlarson
Contributor

Okay, now our formatter and Mypy should be happy :)

@sethmlarson
Contributor

jenkins test this please

@sethmlarson sethmlarson merged commit 704c898 into elastic:main Oct 13, 2021
@V1NAY8 V1NAY8 deleted the optimize-to-pandas branch October 14, 2021 01:14