
Conversation

@rocha-paula
Contributor

Brief overview of changes

Added a comprehensive vignette and supporting benchmarking datasets for the write_df_to_delta() function.

Why are these changes being made?

To provide DfE analysts with a clear technical guide and empirical performance evidence for write_df_to_delta(). These changes ensure the new functionality is reproducible, documented to DfE standards, and proven to offer significant time savings for large-scale data ingestion.

Detailed description of changes

  • Vignette: Created vignettes/write_df_to_delta.Rmd.
  • Benchmarking Data: Added write_df_to_delta_benchmarks.rda and write_df_to_delta_stress_test.rda to the data/ folder.
  • Data Generation Scripts: Added data-raw/write_df_to_delta_benchmarks.R and data-raw/write_df_to_delta_stress_test.R.
  • Documentation: Updated R/datasets_documentation.R with documentation for the new datasets.
  • Dependencies: Updated DESCRIPTION to include ggplot2, purrr, and scales under Suggests.
  • Spelling: Updated inst/WORDLIST to resolve technical jargon flags.

Essential Checklist

  • I have read the contributing guidelines
  • The code follows the package style and naming conventions
  • All new and existing tests pass (devtools::test())
  • I have updated the documentation using devtools::document()
  • I have checked that my changes do not break existing functionality

Consider (as applicable)

  • I have added or updated documentation (function documentation, vignettes, readme, etc.)
  • I have updated the NEWS.md file with a summary of my changes
  • I have considered if a version bump is required in DESCRIPTION
  • I have added examples or usage where relevant
  • I have resolved styling (formatted using Air, or styler::style_pkg()) and lintr issues (lintr::lint_package())

Additional context

Note on test failures: During devtools::check(), a failure was noted in test-air_formatter.R. This is a pre-existing issue related to the local air.exe binary path on Windows and is unrelated to the changes in this PR.

Issue ticket number/s and link

#126


codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.19%. Comparing base (378f30e) to head (472f1ea).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #138   +/-   ##
=======================================
  Coverage   57.19%   57.19%           
=======================================
  Files          18       18           
  Lines        1509     1509           
=======================================
  Hits          863      863           
  Misses        646      646           


@mzayeddfe mzayeddfe self-requested a review January 28, 2026 11:04
Contributor

@mzayeddfe mzayeddfe left a comment


Thank you for your work on this @rocha-paula! I wasn't able to render the vignette, but I did go through it in an unrendered format, combined with the one you sent on Teams. The vignette is well organised, the structure is very clear, and I love the comparative content. It really drives the point home about how effective this function is!

I think the vignette fails to render and the pkgdown test fails here because we need either:

  • to document the write_df_to_delta_benchmarks.rda and write_df_to_delta_stress_test.rda datasets in the pkgdown file, or
  • to save those data files as internal. You can do this by setting internal = TRUE in usethis::use_data() for those datasets. Then amend the vignette so the wording doesn't imply users can access the datasets directly; you can still showcase their structure there if needed.

My suggestion would be to make the data internal, because most end users would not use it directly. The vignette already walks through the methodology, and the generation code is publicly available here. However, I'm interested to hear your thoughts on this.
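For reference, the internal-data option might look something like this at the end of the data-raw scripts (a sketch only; the object names are assumed from the .rda file names in this PR):

```r
# At the end of the data-raw/ scripts, after the benchmark objects are built.
# internal = TRUE saves the objects to R/sysdata.rda instead of data/, so
# package code and vignettes can use them but they are not exported to users.
usethis::use_data(
  write_df_to_delta_benchmarks,
  write_df_to_delta_stress_test,
  internal = TRUE,
  overwrite = TRUE
)
```

With the data internal, the two datasets would also no longer need entries in the pkgdown reference index or in R/datasets_documentation.R.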

I made one tiny suggestion below to link the `check_databricks_odbc()` function :)

- `DATABRICKS_HOST`: The URL of your workspace.
- `DATABRICKS_TOKEN`: Your personal access token (PAT).

**Tip:** Before you start, use `check_databricks_odbc()` to verify that your connection and environment variables are correctly configured.


A tiny suggestion here :)

Suggested change
**Tip:** Before you start, use `check_databricks_odbc()` to verify that your connection and environment variables are correctly configured.
**Tip:** Before you start, use [`check_databricks_odbc()`](https://dfe-analytical-services.github.io/dfeR/reference/check_databricks_odbc.html) to verify that your connection and environment variables are correctly configured.
