
[DRAFT] FEAT: LeMatRho data fetch, transform. Downsample rho during fetch.#49

Open
speckhard wants to merge 5 commits into main from feat/LeMatRho

Conversation


@speckhard speckhard commented Feb 25, 2026

This is a draft, only meant for sharing, not yet for review

High-Level Overview

We want to add a charge density HF dataset using charge densities from an S3 bucket. We build on this code base so we don't need to write something from scratch, and because the most recent Materials Project fetcher is similar in that it also pulls from an AWS S3 bucket. The rest of this summary is AI generated.

This PR adds a new data source — LeMatRho — to lematerial-fetcher. LeMatRho provides charge density data from resimulations of the Alexandria database using Materials Project settings (PBE functional). The raw data lives in an authenticated S3 bucket (lemat-rho) containing ~700k material folders (possibly fewer; I haven't counted myself), each with VASP outputs: CHGCAR, AECCAR0/1/2, and vasprun.xml.

The PR includes two pipeline architectures:

  1. Traditional pipeline (lematrho fetch + lematrho transform): Follows the existing BaseFetcher/BaseTransformer pattern with PostgreSQL as intermediate storage. Included for consistency with the project's existing architecture.

  2. Direct pipeline (lematrho run): A new single-pass S3→Parquet architecture that bypasses PostgreSQL entirely. This is the recommended pipeline for production use, as it eliminates double S3 downloads, removes PostgreSQL operational overhead, and writes Parquet files ready for HuggingFace upload.


Architecture: Direct Pipeline (lematrho run)

S3 bucket (lemat-rho)
    │
    ▼
[List material folders — paginated S3 listing]
    │
    ▼  ProcessPoolExecutor (4 workers, work-stealing)
Per material, in a single pass:
  1. Download vasprun.xml.gz → parse relaxed structure
  2. Download CHGCAR/AECCAR0/AECCAR1/AECCAR2 → compress via pyrho
  3. Run Bader analysis (if tools available, reuses same raw bytes)
  4. Run DDEC6 analysis (if tools available, reuses same raw bytes)
  5. Build OptimadeStructure (Pydantic validation)
  6. gc.collect() → return flat dict
    │
    ▼  Main process
Buffer rows → write Parquet chunk every N rows (atomic: .tmp → rename)
Append material_id to checkpoint file after each success
    │
    ▼  Optional
Load Parquet dir as HF Dataset → push_to_hub()
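The main-process stage of the diagram (buffer rows, flush atomic Parquet chunks, checkpoint) can be sketched as follows. This is a minimal illustration, not the PR's actual code: `flush_chunk`, `run`, and `CHUNK_SIZE` are hypothetical names, and plain bytes stand in for the PyArrow table the real pipeline writes.

```python
import os

CHUNK_SIZE = 2  # the real pipeline uses --parquet-chunk-size (e.g. 1000)

def flush_chunk(rows, out_dir, chunk_idx, checkpoint_path):
    """Write one chunk atomically, then checkpoint the material IDs."""
    final_path = os.path.join(out_dir, f"chunk_{chunk_idx:05d}.parquet")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        # The real pipeline writes a PyArrow table here; bytes stand in.
        f.write(repr(rows).encode())
    os.rename(tmp_path, final_path)  # POSIX-atomic: no partial files visible
    with open(checkpoint_path, "a") as ckpt:
        for row in rows:
            ckpt.write(row["material_id"] + "\n")

def run(materials, out_dir):
    """Buffer incoming rows and flush a chunk every CHUNK_SIZE rows."""
    checkpoint_path = os.path.join(out_dir, ".checkpoint.txt")
    buffer, chunk_idx = [], 0
    for mat in materials:
        buffer.append(mat)
        if len(buffer) >= CHUNK_SIZE:
            flush_chunk(buffer, out_dir, chunk_idx, checkpoint_path)
            buffer, chunk_idx = [], chunk_idx + 1
    if buffer:  # flush the final partial chunk
        flush_chunk(buffer, out_dir, chunk_idx, checkpoint_path)
```

Checkpointing after the flush (rather than per material) matches the batch-checkpoint fix described in the commit messages below.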

Key design decisions

  • Single-pass: Each S3 file is downloaded exactly once. The traditional pipeline downloads files twice (fetch downloads for pyrho compression; transform re-downloads for Bader/DDEC6). The direct pipeline keeps raw decompressed bytes in memory for both operations.

  • No PostgreSQL: The end target is Parquet files on HuggingFace. PostgreSQL adds schema migration, connection management, and disk space overhead with no benefit for this pipeline. The DirectPipelineConfig dataclass has zero database fields.

  • Crash-safe checkpointing: A text file (.checkpoint.txt) records each successfully processed material ID. On restart, the pipeline skips already-processed materials and resumes writing Parquet chunks from the next index.

  • Atomic Parquet writes: Chunks are written to a .tmp file first, then renamed via os.rename() (POSIX atomic). Stale .tmp files from crashed runs are ignored on resume.

  • Work-stealing parallelism: Uses concurrent.futures.wait(FIRST_COMPLETED) with bounded submission (2× num_workers futures in flight) to avoid creating 700k Future objects in memory.

  • Memory management: Sequential file processing within each worker, explicit del of raw bytes after use, gc.collect() after each material. Conservative default of 4 workers (each CHGCAR can be hundreds of MB).

  • Graceful tool degradation: External tools (bader, chargemol, perl+chgsum.pl) are validated at init. If any tool is missing, the corresponding fields are set to None rather than failing. Bader and DDEC6 are independent — one can fail while the other succeeds.

  • Consolidated constants: All shared constants (S3 paths, timeouts, grid key mappings) and the write_potcar() helper live in a single utils.py module — no duplication across fetch, transform, and pipeline.
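The bounded work-stealing submission described above can be sketched like this. The real pipeline uses a ProcessPoolExecutor with 4 workers; a ThreadPoolExecutor is used here only so the snippet is self-contained, and `process()` stands in for the per-material processing function.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def process(material_id):
    # Placeholder for the real per-material work (download, compress, analyze).
    return {"material_id": material_id}

def run_bounded(material_ids, num_workers=4):
    results, pending = [], set()
    it = iter(material_ids)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while True:
            # Keep at most 2 * num_workers futures in flight so we never
            # materialize ~700k Future objects in memory at once.
            while len(pending) < 2 * num_workers:
                try:
                    pending.add(pool.submit(process, next(it)))
                except StopIteration:
                    break
            if not pending:
                break  # iterator exhausted and all work drained
            # Work-stealing: resume submission as soon as any future finishes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
    return results
```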


Data Model Extensions

OptimadeStructure (Pydantic model)

8 new optional fields added after bawl_fingerprint:

| Field | Type | Description |
| --- | --- | --- |
| compressed_charge_density | Optional[list] | 3D nested float list from pyrho lossy compression of CHGCAR |
| compressed_aeccar0 | Optional[list] | Same for AECCAR0 (all-electron core charge density) |
| compressed_aeccar1 | Optional[list] | Same for AECCAR1 (pseudo valence charge density) |
| compressed_aeccar2 | Optional[list] | Same for AECCAR2 (pseudo core charge density) |
| charge_density_grid_shape | Optional[list[int]] | Shape of compressed grid, e.g. [15, 15, 15] |
| bader_charges | Optional[list[float]] | Bader charges per site (validated against nsites) |
| bader_atomic_volume | Optional[list[float]] | Bader atomic volumes per site |
| ddec6_charges | Optional[list[float]] | DDEC6 net atomic charges per site |

All fields default to None, so existing data sources (MP, Alexandria, OQMD) are unaffected.
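Conceptually, the nsites validation on the per-site fields amounts to the check below. This is a hypothetical standalone helper for illustration; in the PR the check is a Pydantic validator on OptimadeStructure.

```python
from typing import Optional

def validate_per_site(values: Optional[list], nsites: int, field: str) -> Optional[list]:
    """Per-site arrays may be None (tool unavailable) or must match nsites."""
    if values is None:
        return None  # graceful degradation: missing tools leave the field None
    if len(values) != nsites:
        raise ValueError(
            f"{field} has {len(values)} entries but structure has {nsites} sites"
        )
    return values
```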

PostgreSQL Schema

Matching columns added to OptimadeDatabase.columns():

  • JSONB for compressed grids (4 columns)
  • INTEGER[] for grid shape
  • FLOAT[] for per-site charges/volumes (3 columns)

Migration SQL provided in migrations/001_add_charge_density_columns.sql.

Parquet Schema

33-column PyArrow schema defined in pipeline.py matching HuggingFace Features:

  • Compressed grids stored as JSON-serialized strings (pa.string())
  • Per-site arrays as pa.list_(pa.float64())
  • Grid shape as pa.list_(pa.int32())

HuggingFace Push

Charge density fields added to _get_optimade_features() in push.py. Removed from trajectories features (LeMatRho-specific). JSON serialization added for compressed grid columns during dataset loading.


Source Enum & S3 Authentication

  • Source.LEMATRHO = "lematrho" added to the Source enum
  • get_authenticated_aws_client() added to utils/aws.py: uses default boto3 credential chain (no UNSIGNED config), with adaptive retry (max_attempts=3, mode='adaptive')
  • Separate from the existing anonymous MP S3 client

Charge Density Processing

pyrho Compression

For each of CHGCAR, AECCAR0, AECCAR1, AECCAR2:

  1. Parse with pymatgen.io.vasp.Chgcar
  2. Convert to pyrho ChargeDensity via ChargeDensity.from_pmg()
  3. Apply lossy smooth compression: pgrids["total"].lossy_smooth_compression(grid_shape)
  4. Default grid shape: (15, 15, 15), configurable via --grid-shape
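pyrho's lossy smooth compression downsamples the grid by discarding high-frequency Fourier components. The numpy sketch below reproduces that idea for a real-valued 3D grid — it is a conceptual analogue, not pyrho's exact algorithm or normalization.

```python
import numpy as np

def fourier_downsample(grid, new_shape):
    """Lossy downsample: keep only the low-frequency center of the spectrum."""
    f = np.fft.fftshift(np.fft.fftn(grid))
    # Crop the centered spectrum to the target shape along each axis.
    slices = tuple(
        slice((n_old - n_new) // 2, (n_old - n_new) // 2 + n_new)
        for n_old, n_new in zip(grid.shape, new_shape)
    )
    out = np.fft.ifftn(np.fft.ifftshift(f[slices])).real
    # Rescale so the mean density survives the change in grid size.
    return out * (np.prod(new_shape) / np.prod(grid.shape))
```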

Bader Analysis (optional)

Requires: bader executable, perl, chgsum.pl script, PMG_VASP_PSP_DIR env var.

  1. Write raw CHGCAR, AECCAR0, AECCAR2 to temp directory
  2. Generate POTCAR via MatPESStaticSet(structure).potcar
  3. Run perl chgsum.pl AECCAR0 AECCAR2 (timeout: 300s)
  4. Run bader CHGCAR -ref CHGCAR_sum (timeout: 600s)
  5. Parse ACF.dat for per-site charges and atomic volumes

DDEC6 Analysis (optional)

Requires: chargemol executable, atomic densities directory, PMG_VASP_PSP_DIR env var.

  1. Write raw CHGCAR to temp directory
  2. Generate POTCAR, write chargemol config with atomic densities path
  3. Run chargemol (timeout: 600s)
  4. Parse DDEC6 net atomic charges from output

Both analyses use subprocess.run() with explicit timeouts and check=True. On failure, fields are set to None and a warning is logged.
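The graceful-degradation pattern for both analyses can be sketched as below. The wrapper name and timeout value are illustrative; the key points from the description are the explicit timeout, check=True, and failure translating to None plus a warning rather than a crash.

```python
import subprocess

def run_tool(cmd, cwd=None, timeout=600):
    """Run an external tool; return captured stdout, or None on any failure."""
    try:
        result = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True,
            timeout=timeout, check=True,
        )
        return result.stdout
    except (subprocess.TimeoutExpired,
            subprocess.CalledProcessError,
            FileNotFoundError) as exc:
        # Tool missing, crashed, or hung: log and degrade to None fields.
        print(f"warning: {cmd[0]} failed ({exc}); setting fields to None")
        return None
```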


Code Organization

All LeMatRho-specific constants, helpers, and timeout values are consolidated in a single utils.py module. The three pipeline variants (fetch.py, transform.py, pipeline.py) import from utils.py — no duplication:

  • Constants: STATIC_CALC_TYPE, RELAX_CALC_TYPE, STATIC_FILES, VALID_PREFIXES, GRID_KEY_MAP, DEFAULT_MAX_WORKERS, BADER_TIMEOUT, CHGSUM_TIMEOUT, CHARGEMOL_TIMEOUT
  • Helpers: download_gz_file_from_s3(), parse_vasprun_structure(), compress_chgcar(), build_raw_structure(), write_potcar()

All docstrings follow Google style with Args/Returns/Raises sections.


Files Changed/Added

New files (source)

| File | Lines | Description |
| --- | --- | --- |
| src/.../lematrho/__init__.py | 1 | Package marker |
| src/.../lematrho/fetch.py | 196 | LeMatRhoFetcher(BaseFetcher) — traditional S3→PostgreSQL fetch |
| src/.../lematrho/transform.py | 493 | LeMatRhoTransformer(BaseTransformer) — PostgreSQL raw→OPTIMADE transform |
| src/.../lematrho/utils.py | 150 | Shared constants, helpers: download_gz_file_from_s3, compress_chgcar, write_potcar |
| src/.../lematrho/pipeline.py | 800 | LeMatRhoDirectPipeline — single-pass S3→Parquet pipeline |

New files (tests)

| File | Lines | Tests |
| --- | --- | --- |
| tests/fetcher/lematrho/test_lematrho_fetch.py | 616 | 22 tests |
| tests/fetcher/lematrho/test_lematrho_transform.py | 728 | 26 tests |
| tests/fetcher/lematrho/test_lematrho_pipeline.py | 1406 | 37 tests (+2 integration scaffold) |

Modified files

| File | Changes |
| --- | --- |
| src/.../models/utils/enums.py | Added Source.LEMATRHO |
| src/.../models/optimade.py | Added 8 charge density fields + nsites validation |
| src/.../database/postgres.py | Added 8 columns to schema + all 4 insert tuple definitions |
| src/.../utils/aws.py | Added get_authenticated_aws_client() |
| src/.../utils/config.py | Added DirectPipelineConfig, load_direct_pipeline_config(), extended FetcherConfig and TransformerConfig |
| src/.../utils/cli.py | Added add_lematrho_fetch_options, add_lematrho_transform_options, add_lematrho_direct_options |
| src/.../cli.py | Added lematrho command group with fetch, transform, run subcommands |
| src/.../push.py | Added charge density fields to HuggingFace Features + JSON serialization |
| pyproject.toml | Added mp-pyrho>=0.3.1 dependency, integration pytest marker |
| .gitignore | Added .env.* with !.env.example exception |
| .env.example | Added LeMatRho, AWS, HuggingFace variable templates |
| tests/test_cli.py | Added 15 lematrho CLI tests (fetch, transform, run) |
| tests/utils/test_aws.py | Added authenticated client tests |
| tests/utils/test_config.py | Added DirectPipelineConfig tests |
| tests/models/test_optimade_model.py | Added charge density field validation tests |

CLI Usage

# Direct pipeline (recommended) — no PostgreSQL needed
lematerial_fetcher lematrho run \
    --output-dir ./lematrho_output \
    --grid-shape 15 15 15 \
    --parquet-chunk-size 1000 \
    --num-workers 4 \
    --bader-path /path/to/bader \
    --chargemol-path /path/to/chargemol \
    --chgsum-script-path /path/to/chgsum.pl \
    --atomic-densities-path /path/to/atomic_densities \
    --hf-repo-id entalpic/lemat-rho \
    --hf-token $HF_TOKEN

# With --debug flag for sequential processing (useful for debugging)
lematerial_fetcher --debug lematrho run --output-dir ./output

# Traditional pipeline (fetch → transform with PostgreSQL)
lematerial_fetcher lematrho fetch --db-user user --table-name raw_lematrho
lematerial_fetcher lematrho transform --db-user user --table-name raw_lematrho --dest-table-name optimade_lematrho

External Dependencies

| Dependency | Required for | Install |
| --- | --- | --- |
| mp-pyrho>=0.3.1 | Charge density compression | pip install (added to pyproject.toml) |
| bader executable | Bader charge analysis | External binary |
| perl + chgsum.pl | Bader charge summation | System package + script |
| chargemol executable | DDEC6 charge analysis | External binary |
| Atomic densities directory | DDEC6 input | Data files for chargemol |
| PMG_VASP_PSP_DIR env var | POTCAR generation | Points to VASP pseudopotentials |
| AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY | S3 authentication | Environment variables |

All external tools are optional — missing tools result in None fields rather than pipeline failure.


Test Coverage

  • 177 total tests (+ 2 integration scaffold), all passing
  • 85 LeMatRho-specific tests across 3 test files (22 fetch + 26 transform + 37 pipeline)
  • 15 CLI tests covering all lematrho subcommands (fetch, transform, run)
  • Key coverage areas:
    • S3 listing with prefix filtering and checkpoint exclusion
    • Happy-path processing with and without external tools
    • Graceful failure modes (missing files, subprocess timeouts, tool unavailability)
    • _validate_tools() with all tool combos (all present, none present, partial, missing PMG_VASP_PSP_DIR)
    • DDEC6 happy path, timeout, and subprocess failure
    • Bader happy path, timeout, and subprocess failure
    • _structure_to_row with full fields, None fields, and partial fields
    • _push_to_huggingface mock (called when configured, skipped when not)
    • Checkpoint load/save/resume with blank line handling
    • Atomic Parquet write (tmp→rename), chunk index resume, stale tmp cleanup
    • Full integration tests (debug mode pipeline end-to-end, resume after crash)
    • Cross-compatibility logic (Yb exclusion)
    • Structure-to-row serialization (JSON fields, enum values, ISO dates)
  • Integration test scaffold: @pytest.mark.integration tests that connect to real S3 (deselected by default, requires .env.integration with AWS credentials)

Next Steps for Testing

  1. Smoke test on real S3 data: lematerial_fetcher --debug lematrho run --output-dir ./test_output --num-workers 1 --parquet-chunk-size 5 with valid AWS credentials to verify end-to-end with real data (small batch first).
  2. Memory profiling: Monitor RSS during processing with --num-workers 1, then scale up. Each CHGCAR can be hundreds of MB.
  3. External tool integration: Test with actual bader/chargemol binaries on a machine with VASP pseudopotentials.
  4. HuggingFace push: Test --hf-repo-id with a private test repo before pushing to production.
  5. Resume after crash: Kill the pipeline mid-run and verify checkpoint-based resume produces correct output.

Open issues we need to fix

Right now the code is set up as if LeMatRho were just another data source to push into LeMatBulk; we want it to be its own data source, so we need to modify the push-to-HuggingFace script. We also need to verify that the HuggingFace dataset schema includes the new columns and that the compressed fields serialize correctly as JSON strings.

@speckhard speckhard self-assigned this Feb 25, 2026
@speckhard speckhard added the enhancement New feature or request label Feb 25, 2026
Add `lematrho run` CLI command that downloads charge density data from
S3, compresses via pyrho, optionally runs Bader/DDEC6 analysis, and
writes Parquet files directly. Crash-safe via checkpoint file, atomic
Parquet writes, work-stealing parallelism.
- Move shared constants (STATIC_CALC_TYPE, GRID_KEY_MAP, timeouts) and
  write_potcar() into utils.py as single source of truth
- Convert all lematrho docstrings to Google style
- Add 11 new pipeline tests: _validate_tools, DDEC6 happy path,
  _structure_to_row with None fields, _push_to_huggingface, DDEC6 in
  _process_material
- Add @pytest.mark.integration scaffold with .env.integration loading
- Remove unused --force flag from lematrho transform (was silently ignored)
- Fix batch: Any -> batch: str type annotation in fetch.py
- Add .env.* to .gitignore with !.env.example exception
- Add LeMatRho/AWS/HuggingFace variables to .env.example
- Add pytest integration marker config to pyproject.toml
- Vasprun and Chgcar.from_file require filesystem paths, not BytesIO
  objects. Write bytes to temp files before parsing.
- Add --limit CLI option to cap number of materials processed (useful
  for smoke testing without processing all ~76k materials).
- Verified with real S3 smoke test: 2 materials processed end-to-end
  with compressed charge densities at 10x10x10 grid shape.
- Batch-checkpoint after Parquet flush instead of per-material to prevent
  desync on crash (up to chunk_size materials could be lost)
- Track failed materials in .failures.txt, skip on resume
- Close S3 StreamingBody and free compressed buffer in download_gz_file_from_s3
- Replace NamedTemporaryFile(delete=True) with TemporaryDirectory for
  reliable temp file lifetime in parse_vasprun_structure and compress_chgcar
- Fix _list_materials docstring ("Sorted" claim was incorrect)
- Document memory trade-off for raw_files kept for Bader/DDEC6
- Add 5 new tests (batch checkpoint, failure load/append/resume)