@jm-rivera
Collaborator

This pull request adds bulk downloading and reading of the DAC2A dataset to the oda_reader package, auto-detects file types and delimiters in bulk downloads, and improves the developer experience. The most significant changes are the new bulk_download_dac2a() function, automatic file type and delimiter detection, and the deprecation of the is_txt parameter. Testing support also improves, with new development dependencies and unit tests.

DAC2A Bulk Download Improvements

  • Added the bulk_download_dac2a() function to enable bulk downloading of the full DAC2A dataset, with support for saving to disk or streaming as an iterator. (src/oda_reader/dac2a.py, src/oda_reader/__init__.py, [1] [2] [3] [4])
  • Implemented the get_full_dac2a_parquet_id() helper to retrieve the correct file ID for the DAC2A bulk download. (src/oda_reader/dac2a.py, R1-R35)
  • Added comprehensive unit tests for DAC2A bulk download logic using pytest and pytest-mock. (tests/datasets/dac2a/unit/test_dac2a_bulk.py, R1-R98)
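
As an illustration of the "save to disk or stream as an iterator" behavior described above, here is a minimal sketch. It is not the package's implementation: the function name `save_or_stream`, the `save_to_path` parameter, and the use of raw byte chunks are all assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional


def save_or_stream(
    chunks: Iterable[bytes], save_to_path: Optional[Path] = None
) -> Optional[Iterator[bytes]]:
    """Hypothetical helper: write chunks to disk, or hand them back lazily.

    If `save_to_path` is given, every chunk is written to that file and
    None is returned. Otherwise an iterator over the chunks is returned so
    the caller can process them one at a time.
    """
    if save_to_path is not None:
        with open(save_to_path, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
        return None
    return iter(chunks)
```

Returning an iterator in the streaming case lets a caller process a large dataset piece by piece instead of holding all of it in memory, which is presumably the motivation for offering both modes.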

File Type and Delimiter Auto-Detection

  • Enhanced bulk download logic to auto-detect file types (parquet vs txt/csv) in zip archives, removing the need for the is_txt parameter and supporting both formats transparently. (src/oda_reader/download/download_tools.py, [1] [2] [3] [4] [5] [6] [7])
  • Added _detect_delimiter() utility to automatically detect CSV delimiters (comma, pipe, tab, semicolon) when reading txt files from bulk downloads. (src/oda_reader/download/download_tools.py, R55-R84)
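
The two detection steps above could look roughly like the sketch below. The function names `detect_member_type` and `detect_delimiter`, the extension-based check, and the "most frequent candidate in the header line" heuristic are assumptions for illustration; the actual `_detect_delimiter()` in download_tools.py may work differently.

```python
import io
import zipfile
from typing import Optional

# Candidate delimiters named in the PR description: comma, pipe, tab, semicolon.
CANDIDATE_DELIMITERS = (",", "|", "\t", ";")


def detect_member_type(archive: zipfile.ZipFile) -> Optional[str]:
    """Hypothetical: classify an archive by the extensions of its members."""
    names = [name.lower() for name in archive.namelist()]
    if any(name.endswith(".parquet") for name in names):
        return "parquet"
    if any(name.endswith((".txt", ".csv")) for name in names):
        return "text"
    return None


def detect_delimiter(sample: str) -> str:
    """Hypothetical: pick the candidate that occurs most often in the header line.

    Falls back to a comma when no candidate appears at all.
    """
    lines = sample.splitlines()
    header = lines[0] if lines else ""
    counts = {delim: header.count(delim) for delim in CANDIDATE_DELIMITERS}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else ","
```

Counting only the header line keeps the check cheap, but it can be fooled by a header whose column names contain another candidate character. The standard library's `csv.Sniffer` is an alternative worth considering for messier files.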

API Changes and Deprecations

  • Deprecated the is_txt parameter in bulk_download_parquet(), emitting a warning when used and updating documentation to reflect auto-detection. (src/oda_reader/download/download_tools.py, L400-R526)
  • Updated usages of bulk_download_parquet() in other modules to remove the is_txt argument. (src/oda_reader/crs.py, [1]; src/oda_reader/multisystem.py, [2])
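
A common way to deprecate a parameter like is_txt is sketched below. The function `bulk_download_sketch` is a hypothetical stand-in, not the real bulk_download_parquet() signature, and the choice of `DeprecationWarning` and the message text are assumptions; the PR description only says that a warning is emitted.

```python
import warnings
from typing import Optional


def bulk_download_sketch(file_id: str, is_txt: Optional[bool] = None) -> str:
    """Hypothetical stand-in showing a deprecated, ignored keyword argument."""
    if is_txt is not None:
        # Warn only when the caller explicitly passes the old argument.
        warnings.warn(
            "`is_txt` is deprecated and ignored; the file type is detected "
            "automatically.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    # A real implementation would download and read the file here.
    return file_id
```

Defaulting the parameter to `None` rather than `False` makes it possible to tell "not passed" apart from "explicitly passed as False", so existing callers that still supply the argument see the warning while updated callers do not.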

Developer Experience and Testing

  • Added pytest and pytest-mock to development dependencies for improved testing support. (pyproject.toml, R45-R46)
  • Updated the changelog and project version to 1.4.0 to reflect new features and changes. (CHANGELOG.md, [1]; pyproject.toml, [2])

References:
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18]

@jm-rivera jm-rivera merged commit dc96c69 into main Dec 19, 2025
9 checks passed
@jm-rivera jm-rivera deleted the bulk_donload_2a branch December 19, 2025 16:21