Add dask-expr benchmarking post #174

Open

phofl wants to merge 2 commits into dask:gh-pages from phofl:dask-expr-benchmarks

Contributor

phofl commented Oct 11, 2023

No description provided.


          Add dask-expr benchmarking post

516901f

netlify bot commented Oct 11, 2023 (edited)

✅ Deploy Preview for dask-blog ready!

Name	Link
🔨 Latest commit	`2ff62c5`
🔍 Latest deploy log	https://app.netlify.com/sites/dask-blog/deploys/65273f778c571800087d64c3
😎 Deploy Preview	https://deploy-preview-174--dask-blog.netlify.app/2023/10/05/dask-expr-tpch-dask


          Fix markdown heading levels in dask-expr benchmarking post

2ff62c5

GenevieveBuckley reviewed

Collaborator

GenevieveBuckley left a comment

This is an interesting post, I enjoyed reading it.
I've updated the markdown formatting so the CI checks pass, and left some comments inline. My only major comment is that it appears the confidence threshold figure is not referred to or explained anywhere in the results. Otherwise it looks good!

_posts/2023-10-05-dask-expr-tpch-dask.md

+                   alt=""></a>
+              [Dask-expr](https://github.com/dask-contrib/dask-expr) is an ongoing effort to add a
+              [logical query optimization layer](https://medium.com/coiled-hq/high-level-query-optimization-in-dask-995640564ed7) to Dask DataFrames.

Collaborator

GenevieveBuckley Oct 12, 2023

I thought this was the canonical link for that post: https://blog.coiled.io/blog/dask-expr-introduction.html

_posts/2023-10-05-dask-expr-tpch-dask.md

+              We now have the first benchmark results to share that were run against the current DataFrame
+              implementation.
+              Dask-expr is up to 3 times faster and more memory efficient in its current state than the status quo.

Collaborator

GenevieveBuckley Oct 12, 2023

"status quo" is ambiguous here. Can we update this sentence to be more specific?

The first paragraph in the results section suggests it should be "compared to conventional dask dataframes" (instead of "compared to the previous version of dask-expr" or "compared to other software tools designed for this task", etc.). Is that correct?

_posts/2023-10-05-dask-expr-tpch-dask.md

+              ## Results
+              We are comparing Dask 2023.09.02 with the main branch of Dask-expr. Both implementations
+              will use the [P2P shuffling algorithm](https://medium.com/coiled-hq/shuffling-large-data-at-constant-memory-in-dask-bb683e92d70b). The results were produced on 100GB of data, e.g. scale

Collaborator

GenevieveBuckley Oct 12, 2023

It's probably best to link to articles directly at https://blog.coiled.io, rather than the Medium mirror.
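The link comments above could be checked mechanically before merging. The following is a minimal sketch, assuming the post uses inline markdown links; the helper name `find_medium_links` and the regular expressions are hypothetical and not part of this PR or of any Dask tooling.

```python
import re

# Hypothetical helper for illustration only; not part of the PR.
# Matches inline markdown links of the form [text](http(s)://url).
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def find_medium_links(markdown_text):
    """Return (link text, URL) pairs whose URL points at medium.com."""
    return [
        (text, url)
        for text, url in MARKDOWN_LINK.findall(markdown_text)
        if re.match(r"https?://(www\.)?medium\.com/", url)
    ]


# Sample text taken from the diff lines quoted in this review.
sample = (
    "[Dask-expr](https://github.com/dask-contrib/dask-expr) is an ongoing effort to add a "
    "[logical query optimization layer]"
    "(https://medium.com/coiled-hq/high-level-query-optimization-in-dask-995640564ed7) "
    "to Dask DataFrames."
)

for text, url in find_medium_links(sample):
    print(f"{text}: {url}")
```

Each reported link would still need its replacement URL on https://blog.coiled.io looked up by hand; the sketch only flags candidates.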

_posts/2023-10-05-dask-expr-tpch-dask.md

+              ```python
+              pip install dask-expr
+              ```

Collaborator

GenevieveBuckley Oct 12, 2023

It'd be good to repeat the link to Dask-expr here, where users can find the documentation to help them get started.

_posts/2023-10-05-dask-expr-tpch-dask.md

+                   width="70%"
+                   alt="Runtime on a per-query basis of Dask DataFrame and Dask-expr"></a>
+              We can see that Dask-expr performs better on every single query, up to a 3-times improvement

Collaborator

GenevieveBuckley Oct 12, 2023

All of your figures show impressive improvements, very clear and convincing.

_posts/2023-10-05-dask-expr-tpch-dask.md

+              - Getting Dask-expr to compute the results successfully was very easy. 20 workers very sufficient to
+                compute the results, which is the same number of workers as we used for the 100GB benchmarks.
+              - The original version needed a bigger cluster with 50 workers and with 32GB memory each.

Collaborator

GenevieveBuckley Oct 12, 2023

Suggested change

-              - The original version needed a bigger cluster with 50 workers and with 32GB memory each.
+              - The original version using conventional dask dataframes needed a bigger cluster with 50 workers and with 32GB memory each.

_posts/2023-10-05-dask-expr-tpch-dask.md

+              Summarizing, getting these queries to complete was significantly easier with Dask-expr.
+              These results show us that we are on the right course and motivates
+              us to improve the performance of Dask-expr further. There is still a lot of untapped potential.

Collaborator

GenevieveBuckley Oct 12, 2023

Should "Dask-expr" be capitalized like this in the middle of a sentence?

_posts/2023-10-05-dask-expr-tpch-dask.md

+              # TPC-H Benchmarks for Query Optimization with Dask Expressions
+              <a href="/images/dask_expr/tpch-comparison-dask-confidence.png">
+              <img src="/images/dask_expr/tpch-comparison-dask-confidence.png"

Collaborator

GenevieveBuckley Oct 12, 2023

This figure is never referred to in the text. Did I miss that?
I think there should be a short paragraph or caption explaining what the figure shows. Of the three figures, this one was the least intuitive for me, so it is especially important to explain. For example, what does the x-axis show: a confidence threshold of what? Of the wall clock time? That is the tab that is selected, but the graph's axis is not labelled with what was actually measured.
This figure and its explanation really belong in the results section. I think it is here because you want readers to see something pretty when they click on the blog post. It is fine to keep a copy of a pretty figure as an illustration at the top, but the figure and its text still need to appear in the results section below.
