Optimize to_pandas() internally to improve performance #390

V1NAY8 · 2021-09-16T16:33:50Z

_es_results is a duplicate of search_yield_pandas_dataframes.

In _es_results we dump the ES data directly into a List[...] and then convert it into a pd.DataFrame, which is slow on larger datasets because all the data must be loaded before any processing can start.

So if we iterate over the results internally, convert each batch into a DataFrame as it arrives, and concatenate them at the end, performance improves.
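
A minimal sketch of the batch-and-concatenate idea described above, using plain pandas. The helper names `iter_batches` and `batches_to_dataframe` are hypothetical and are not eland's internal API; the in-memory list of dicts stands in for documents returned by Elasticsearch. The empty-input guard is there because `pandas.concat()` raises `ValueError` when given no objects.

```python
from typing import Any, Dict, Iterable, Iterator, List

import pandas as pd


def iter_batches(
    rows: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Yield rows in lists of at most batch_size (hypothetical helper)."""
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def batches_to_dataframe(
    batches: Iterable[List[Dict[str, Any]]]
) -> pd.DataFrame:
    """Convert each batch to a DataFrame as it arrives, then concatenate once."""
    frames = [pd.DataFrame(batch) for batch in batches]
    if not frames:
        # pandas.concat() raises ValueError on an empty list, so return
        # an empty DataFrame when there were no batches at all.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Stand-in data; in practice the rows would come from a search iterator.
rows = [{"id": i, "name": f"r{i}"} for i in range(25)]
df = batches_to_dataframe(iter_batches(rows, batch_size=10))
```

Concatenating once at the end, rather than appending to a growing DataFrame inside the loop, avoids repeatedly copying the accumulated data.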

Performance will improve further once this PR and #389 are merged.

Right now, the metrics are as follows for nyc-restaurants:

Before	After
2021-09-16 21:55:37.504203: read 10000 rows	2021-09-16 21:54:04.917145: read 10000 rows
2021-09-16 21:55:37.683338: read 20000 rows	2021-09-16 21:54:05.391358: read 20000 rows
2021-09-16 21:55:37.852315: read 30000 rows	2021-09-16 21:54:05.860558: read 30000 rows
2021-09-16 21:55:38.009345: read 40000 rows	2021-09-16 21:54:06.390333: read 40000 rows
2021-09-16 21:55:38.169318: read 50000 rows	2021-09-16 21:54:06.869681: read 50000 rows
2021-09-16 21:55:38.331315: read 60000 rows	2021-09-16 21:54:07.409141: read 60000 rows
2021-09-16 21:55:38.488319: read 70000 rows	2021-09-16 21:54:07.880103: read 70000 rows
2021-09-16 21:55:38.639349: read 80000 rows	2021-09-16 21:54:08.339117: read 80000 rows
2021-09-16 21:55:38.804760: read 90000 rows	2021-09-16 21:54:08.854142: read 90000 rows
2021-09-16 21:55:38.954770: read 100000 rows	2021-09-16 21:54:09.324386: read 100000 rows
2021-09-16 21:55:39.097755: read 110000 rows	2021-09-16 21:54:09.783561: read 110000 rows
2021-09-16 21:55:39.245798: read 120000 rows	2021-09-16 21:54:10.288510: read 120000 rows
2021-09-16 21:55:39.393770: read 130000 rows	2021-09-16 21:54:10.796494: read 130000 rows
2021-09-16 21:55:39.539786: read 140000 rows	2021-09-16 21:54:11.306445: read 140000 rows
2021-09-16 21:55:39.685956: read 150000 rows	2021-09-16 21:54:11.761043: read 150000 rows
2021-09-16 21:55:39.830952: read 160000 rows	2021-09-16 21:54:12.222031: read 160000 rows
2021-09-16 21:55:39.975965: read 170000 rows	2021-09-16 21:54:12.744994: read 170000 rows
2021-09-16 21:55:40.122981: read 180000 rows	2021-09-16 21:54:13.220023: read 180000 rows
2021-09-16 21:55:40.268981: read 190000 rows	2021-09-16 21:54:13.682519: read 190000 rows
2021-09-16 21:55:40.978960: read 191636 rows	2021-09-16 21:54:13.834509: read 191636 rows
Time Elapsed: 0:00:10.633314	Time Elapsed: 0:00:09.929368

@sethmlarson Take a look :)

elasticmachine · 2021-09-16T16:33:52Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

sethmlarson · 2021-10-13T13:51:20Z

jenkins test this please

sethmlarson · 2021-10-13T13:52:43Z

jenkins test this please

sethmlarson

LGTM, thanks for this!

V1NAY8 · 2021-10-13T14:54:31Z

Good change, though! I totally missed it. Lint is also failing; could you format the code once before merging?

sethmlarson · 2021-10-13T17:25:59Z

Okay, now our formatter and Mypy should be happy :)

sethmlarson · 2021-10-13T17:49:10Z

jenkins test this please

Optimize to_pandas() internally

bfbe2ac

Update operations.py

098f65f

sethmlarson approved these changes Oct 13, 2021

sethmlarson added 3 commits October 13, 2021 12:13

Handle the n=0 case for pandas.concat()

bb8672c

Update operations.py

d7ce4a7

Update query_compiler.py

596462d

sethmlarson approved these changes Oct 13, 2021

sethmlarson merged commit 704c898 into elastic:main Oct 13, 2021

V1NAY8 deleted the optimize-to-pandas branch October 14, 2021 01:14
