
[DRAFT] FEAT: LeMatRho data fetch, transform. Downsample rho during fetch.#49

Open
speckhard wants to merge 5 commits into main from feat/LeMatRho

Conversation


@speckhard speckhard commented Feb 25, 2026

This is a draft, only meant for sharing, not yet for review

High-Level Overview

We want to add a charge density HF dataset using charge densities from an S3 bucket. We build on this code base so we don't need to write something from scratch, and because the most recent Materials Project fetcher is similar in that it also pulls from an AWS S3 bucket. The rest of this summary is AI generated.

This PR adds a new data source — LeMatRho — to lematerial-fetcher. LeMatRho provides charge density data from resimulations of the Alexandria database using Materials Project settings (PBE functional). The raw data lives in an authenticated S3 bucket (lemat-rho) containing ~700k material folders (possibly fewer; I haven't counted myself), each with VASP outputs: CHGCAR, AECCAR0/1/2, and vasprun.xml.

The PR includes two pipeline architectures:

  1. Traditional pipeline (lematrho fetch + lematrho transform): Follows the existing BaseFetcher/BaseTransformer pattern with PostgreSQL as intermediate storage. Included for consistency with the project's existing architecture.

  2. Direct pipeline (lematrho run): A new single-pass S3→Parquet architecture that bypasses PostgreSQL entirely. This is the recommended pipeline for production use, as it eliminates double S3 downloads, removes PostgreSQL operational overhead, and writes Parquet files ready for HuggingFace upload.


Architecture: Direct Pipeline (lematrho run)

S3 bucket (lemat-rho)
    │
    ▼
[List material folders — paginated S3 listing]
    │
    ▼  ProcessPoolExecutor (4 workers, work-stealing)
Per material, in a single pass:
  1. Download vasprun.xml.gz → parse relaxed structure
  2. Download CHGCAR/AECCAR0/AECCAR1/AECCAR2 → compress via pyrho
  3. Run Bader analysis (if tools available, reuses same raw bytes)
  4. Run DDEC6 analysis (if tools available, reuses same raw bytes)
  5. Build OptimadeStructure (Pydantic validation)
  6. gc.collect() → return flat dict
    │
    ▼  Main process
Buffer rows → write Parquet chunk every N rows (atomic: .tmp → rename)
Append material_id to checkpoint file after each success
    │
    ▼  Optional
Load Parquet dir as HF Dataset → push_to_hub()
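The main-process stage of the diagram (buffer rows, flush atomic Parquet chunks, checkpoint) can be sketched as follows. This is a minimal illustration, not the PR's actual code: `flush_chunk`, `run`, and `CHUNK_SIZE` are hypothetical names, and plain bytes stand in for the PyArrow table the real pipeline writes.

```python
import os

CHUNK_SIZE = 2  # the real pipeline uses --parquet-chunk-size (e.g. 1000)

def flush_chunk(rows, out_dir, chunk_idx, checkpoint_path):
    """Write one chunk atomically, then checkpoint the material IDs."""
    final_path = os.path.join(out_dir, f"chunk_{chunk_idx:05d}.parquet")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        # The real pipeline writes a PyArrow table here; bytes stand in.
        f.write(repr(rows).encode())
    os.rename(tmp_path, final_path)  # POSIX-atomic: no partial files visible
    with open(checkpoint_path, "a") as ckpt:
        for row in rows:
            ckpt.write(row["material_id"] + "\n")

def run(materials, out_dir):
    """Buffer incoming rows and flush a chunk every CHUNK_SIZE rows."""
    checkpoint_path = os.path.join(out_dir, ".checkpoint.txt")
    buffer, chunk_idx = [], 0
    for mat in materials:
        buffer.append(mat)
        if len(buffer) >= CHUNK_SIZE:
            flush_chunk(buffer, out_dir, chunk_idx, checkpoint_path)
            buffer, chunk_idx = [], chunk_idx + 1
    if buffer:  # flush the final partial chunk
        flush_chunk(buffer, out_dir, chunk_idx, checkpoint_path)
```

Checkpointing after the flush (rather than per material) matches the batch-checkpoint fix described in the commit messages below.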

Key design decisions

  • Single-pass: Each S3 file is downloaded exactly once. The traditional pipeline downloads files twice (fetch downloads for pyrho compression; transform re-downloads for Bader/DDEC6). The direct pipeline keeps raw decompressed bytes in memory for both operations.

  • No PostgreSQL: The end target is Parquet files on HuggingFace. PostgreSQL adds schema migration, connection management, and disk space overhead with no benefit for this pipeline. The DirectPipelineConfig dataclass has zero database fields.

  • Crash-safe checkpointing: A text file (.checkpoint.txt) records each successfully processed material ID. On restart, the pipeline skips already-processed materials and resumes writing Parquet chunks from the next index.

  • Atomic Parquet writes: Chunks are written to a .tmp file first, then renamed via os.rename() (POSIX atomic). Stale .tmp files from crashed runs are ignored on resume.

  • Work-stealing parallelism: Uses concurrent.futures.wait(FIRST_COMPLETED) with bounded submission (2× num_workers futures in flight) to avoid creating 700k Future objects in memory.

  • Memory management: Sequential file processing within each worker, explicit del of raw bytes after use, gc.collect() after each material. Conservative default of 4 workers (each CHGCAR can be hundreds of MB).

  • Graceful tool degradation: External tools (bader, chargemol, perl+chgsum.pl) are validated at init. If any tool is missing, the corresponding fields are set to None rather than failing. Bader and DDEC6 are independent — one can fail while the other succeeds.

  • Consolidated constants: All shared constants (S3 paths, timeouts, grid key mappings) and the write_potcar() helper live in a single utils.py module — no duplication across fetch, transform, and pipeline.
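The bounded work-stealing submission described above can be sketched like this. The real pipeline uses a ProcessPoolExecutor with 4 workers; a ThreadPoolExecutor is used here only so the snippet is self-contained, and `process()` stands in for the per-material processing function.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def process(material_id):
    # Placeholder for the real per-material work (download, compress, analyze).
    return {"material_id": material_id}

def run_bounded(material_ids, num_workers=4):
    results, pending = [], set()
    it = iter(material_ids)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while True:
            # Keep at most 2 * num_workers futures in flight so we never
            # materialize ~700k Future objects in memory at once.
            while len(pending) < 2 * num_workers:
                try:
                    pending.add(pool.submit(process, next(it)))
                except StopIteration:
                    break
            if not pending:
                break  # iterator exhausted and all work drained
            # Work-stealing: resume submission as soon as any future finishes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
    return results
```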


Data Model Extensions

OptimadeStructure (Pydantic model)

8 new optional fields added after bawl_fingerprint:

| Field | Type | Description |
| --- | --- | --- |
| compressed_charge_density | Optional[list] | 3D nested float list from pyrho lossy compression of CHGCAR |
| compressed_aeccar0 | Optional[list] | Same for AECCAR0 (all-electron core charge density) |
| compressed_aeccar1 | Optional[list] | Same for AECCAR1 (pseudo valence charge density) |
| compressed_aeccar2 | Optional[list] | Same for AECCAR2 (pseudo core charge density) |
| charge_density_grid_shape | Optional[list[int]] | Shape of compressed grid, e.g. [15, 15, 15] |
| bader_charges | Optional[list[float]] | Bader charges per site (validated against nsites) |
| bader_atomic_volume | Optional[list[float]] | Bader atomic volumes per site |
| ddec6_charges | Optional[list[float]] | DDEC6 net atomic charges per site |

All fields default to None, so existing data sources (MP, Alexandria, OQMD) are unaffected.
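Conceptually, the nsites validation on the per-site fields amounts to the check below. This is a hypothetical standalone helper for illustration; in the PR the check is a Pydantic validator on OptimadeStructure.

```python
from typing import Optional

def validate_per_site(values: Optional[list], nsites: int, field: str) -> Optional[list]:
    """Per-site arrays may be None (tool unavailable) or must match nsites."""
    if values is None:
        return None  # graceful degradation: missing tools leave the field None
    if len(values) != nsites:
        raise ValueError(
            f"{field} has {len(values)} entries but structure has {nsites} sites"
        )
    return values
```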

PostgreSQL Schema

Matching columns added to OptimadeDatabase.columns():

  • JSONB for compressed grids (4 columns)
  • INTEGER[] for grid shape
  • FLOAT[] for per-site charges/volumes (3 columns)

Migration SQL provided in migrations/001_add_charge_density_columns.sql.

Parquet Schema

33-column PyArrow schema defined in pipeline.py matching HuggingFace Features:

  • Compressed grids stored as JSON-serialized strings (pa.string())
  • Per-site arrays as pa.list_(pa.float64())
  • Grid shape as pa.list_(pa.int32())

HuggingFace Push

Charge density fields added to _get_optimade_features() in push.py. Removed from trajectories features (LeMatRho-specific). JSON serialization added for compressed grid columns during dataset loading.


Source Enum & S3 Authentication

  • Source.LEMATRHO = "lematrho" added to the Source enum
  • get_authenticated_aws_client() added to utils/aws.py: uses default boto3 credential chain (no UNSIGNED config), with adaptive retry (max_attempts=3, mode='adaptive')
  • Separate from the existing anonymous MP S3 client

Charge Density Processing

pyrho Compression

For each of CHGCAR, AECCAR0, AECCAR1, AECCAR2:

  1. Parse with pymatgen.io.vasp.Chgcar
  2. Convert to pyrho ChargeDensity via ChargeDensity.from_pmg()
  3. Apply lossy smooth compression: pgrids["total"].lossy_smooth_compression(grid_shape)
  4. Default grid shape: (15, 15, 15), configurable via --grid-shape
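pyrho's lossy smooth compression downsamples the grid by discarding high-frequency Fourier components. The numpy sketch below reproduces that idea for a real-valued 3D grid — it is a conceptual analogue, not pyrho's exact algorithm or normalization.

```python
import numpy as np

def fourier_downsample(grid, new_shape):
    """Lossy downsample: keep only the low-frequency center of the spectrum."""
    f = np.fft.fftshift(np.fft.fftn(grid))
    # Crop the centered spectrum to the target shape along each axis.
    slices = tuple(
        slice((n_old - n_new) // 2, (n_old - n_new) // 2 + n_new)
        for n_old, n_new in zip(grid.shape, new_shape)
    )
    out = np.fft.ifftn(np.fft.ifftshift(f[slices])).real
    # Rescale so the mean density survives the change in grid size.
    return out * (np.prod(new_shape) / np.prod(grid.shape))
```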

Bader Analysis (optional)

Requires: bader executable, perl, chgsum.pl script, PMG_VASP_PSP_DIR env var.

  1. Write raw CHGCAR, AECCAR0, AECCAR2 to temp directory
  2. Generate POTCAR via MatPESStaticSet(structure).potcar
  3. Run perl chgsum.pl AECCAR0 AECCAR2 (timeout: 300s)
  4. Run bader CHGCAR -ref CHGCAR_sum (timeout: 600s)
  5. Parse ACF.dat for per-site charges and atomic volumes

DDEC6 Analysis (optional)

Requires: chargemol executable, atomic densities directory, PMG_VASP_PSP_DIR env var.

  1. Write raw CHGCAR to temp directory
  2. Generate POTCAR, write chargemol config with atomic densities path
  3. Run chargemol (timeout: 600s)
  4. Parse DDEC6 net atomic charges from output

Both analyses use subprocess.run() with explicit timeouts and check=True. On failure, fields are set to None and a warning is logged.
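The graceful-degradation pattern for both analyses can be sketched as below. The wrapper name and timeout value are illustrative; the key points from the description are the explicit timeout, check=True, and failure translating to None plus a warning rather than a crash.

```python
import subprocess

def run_tool(cmd, cwd=None, timeout=600):
    """Run an external tool; return captured stdout, or None on any failure."""
    try:
        result = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True,
            timeout=timeout, check=True,
        )
        return result.stdout
    except (subprocess.TimeoutExpired,
            subprocess.CalledProcessError,
            FileNotFoundError) as exc:
        # Tool missing, crashed, or hung: log and degrade to None fields.
        print(f"warning: {cmd[0]} failed ({exc}); setting fields to None")
        return None
```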


Code Organization

All LeMatRho-specific constants, helpers, and timeout values are consolidated in a single utils.py module. The three pipeline variants (fetch.py, transform.py, pipeline.py) import from utils.py — no duplication:

  • Constants: STATIC_CALC_TYPE, RELAX_CALC_TYPE, STATIC_FILES, VALID_PREFIXES, GRID_KEY_MAP, DEFAULT_MAX_WORKERS, BADER_TIMEOUT, CHGSUM_TIMEOUT, CHARGEMOL_TIMEOUT
  • Helpers: download_gz_file_from_s3(), parse_vasprun_structure(), compress_chgcar(), build_raw_structure(), write_potcar()

All docstrings follow Google style with Args/Returns/Raises sections.


Files Changed/Added

New files (source)

| File | Lines | Description |
| --- | --- | --- |
| src/.../lematrho/__init__.py | 1 | Package marker |
| src/.../lematrho/fetch.py | 196 | LeMatRhoFetcher(BaseFetcher) — traditional S3→PostgreSQL fetch |
| src/.../lematrho/transform.py | 493 | LeMatRhoTransformer(BaseTransformer) — PostgreSQL raw→OPTIMADE transform |
| src/.../lematrho/utils.py | 150 | Shared constants, helpers: download_gz_file_from_s3, compress_chgcar, write_potcar |
| src/.../lematrho/pipeline.py | 800 | LeMatRhoDirectPipeline — single-pass S3→Parquet pipeline |

New files (tests)

| File | Lines | Tests |
| --- | --- | --- |
| tests/fetcher/lematrho/test_lematrho_fetch.py | 616 | 22 tests |
| tests/fetcher/lematrho/test_lematrho_transform.py | 728 | 26 tests |
| tests/fetcher/lematrho/test_lematrho_pipeline.py | 1406 | 37 tests (+2 integration scaffold) |

Modified files

| File | Changes |
| --- | --- |
| src/.../models/utils/enums.py | Added Source.LEMATRHO |
| src/.../models/optimade.py | Added 8 charge density fields + nsites validation |
| src/.../database/postgres.py | Added 8 columns to schema + all 4 insert tuple definitions |
| src/.../utils/aws.py | Added get_authenticated_aws_client() |
| src/.../utils/config.py | Added DirectPipelineConfig, load_direct_pipeline_config(), extended FetcherConfig and TransformerConfig |
| src/.../utils/cli.py | Added add_lematrho_fetch_options, add_lematrho_transform_options, add_lematrho_direct_options |
| src/.../cli.py | Added lematrho command group with fetch, transform, run subcommands |
| src/.../push.py | Added charge density fields to HuggingFace Features + JSON serialization |
| pyproject.toml | Added mp-pyrho>=0.3.1 dependency, integration pytest marker |
| .gitignore | Added .env.* with !.env.example exception |
| .env.example | Added LeMatRho, AWS, HuggingFace variable templates |
| tests/test_cli.py | Added 15 lematrho CLI tests (fetch, transform, run) |
| tests/utils/test_aws.py | Added authenticated client tests |
| tests/utils/test_config.py | Added DirectPipelineConfig tests |
| tests/models/test_optimade_model.py | Added charge density field validation tests |

CLI Usage

# Direct pipeline (recommended) — no PostgreSQL needed
lematerial_fetcher lematrho run \
    --output-dir ./lematrho_output \
    --grid-shape 15 15 15 \
    --parquet-chunk-size 1000 \
    --num-workers 4 \
    --bader-path /path/to/bader \
    --chargemol-path /path/to/chargemol \
    --chgsum-script-path /path/to/chgsum.pl \
    --atomic-densities-path /path/to/atomic_densities \
    --hf-repo-id entalpic/lemat-rho \
    --hf-token $HF_TOKEN

# With --debug flag for sequential processing (useful for debugging)
lematerial_fetcher --debug lematrho run --output-dir ./output

# Traditional pipeline (fetch → transform with PostgreSQL)
lematerial_fetcher lematrho fetch --db-user user --table-name raw_lematrho
lematerial_fetcher lematrho transform --db-user user --table-name raw_lematrho --dest-table-name optimade_lematrho

External Dependencies

| Dependency | Required for | Install |
| --- | --- | --- |
| mp-pyrho>=0.3.1 | Charge density compression | pip install (added to pyproject.toml) |
| bader executable | Bader charge analysis | External binary |
| perl + chgsum.pl | Bader charge summation | System package + script |
| chargemol executable | DDEC6 charge analysis | External binary |
| Atomic densities directory | DDEC6 input | Data files for chargemol |
| PMG_VASP_PSP_DIR env var | POTCAR generation | Points to VASP pseudopotentials |
| AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY | S3 authentication | Environment variables |

All external tools are optional — missing tools result in None fields rather than pipeline failure.


Test Coverage

  • 177 total tests (+ 2 integration scaffold), all passing
  • 85 LeMatRho-specific tests across 3 test files (22 fetch + 26 transform + 37 pipeline)
  • 15 CLI tests covering all lematrho subcommands (fetch, transform, run)
  • Key coverage areas:
    • S3 listing with prefix filtering and checkpoint exclusion
    • Happy-path processing with and without external tools
    • Graceful failure modes (missing files, subprocess timeouts, tool unavailability)
    • _validate_tools() with all tool combos (all present, none present, partial, missing PMG_VASP_PSP_DIR)
    • DDEC6 happy path, timeout, and subprocess failure
    • Bader happy path, timeout, and subprocess failure
    • _structure_to_row with full fields, None fields, and partial fields
    • _push_to_huggingface mock (called when configured, skipped when not)
    • Checkpoint load/save/resume with blank line handling
    • Atomic Parquet write (tmp→rename), chunk index resume, stale tmp cleanup
    • Full integration tests (debug mode pipeline end-to-end, resume after crash)
    • Cross-compatibility logic (Yb exclusion)
    • Structure-to-row serialization (JSON fields, enum values, ISO dates)
  • Integration test scaffold: @pytest.mark.integration tests that connect to real S3 (deselected by default, requires .env.integration with AWS credentials)

Next Steps for Testing

  1. Smoke test on real S3 data: lematerial_fetcher --debug lematrho run --output-dir ./test_output --num-workers 1 --parquet-chunk-size 5 with valid AWS credentials to verify end-to-end with real data (small batch first).
  2. Memory profiling: Monitor RSS during processing with --num-workers 1, then scale up. Each CHGCAR can be hundreds of MB.
  3. External tool integration: Test with actual bader/chargemol binaries on a machine with VASP pseudopotentials.
  4. HuggingFace push: Test --hf-repo-id with a private test repo before pushing to production.
  5. Resume after crash: Kill the pipeline mid-run and verify checkpoint-based resume produces correct output.

Open issues we need to fix

Right now the code is set up as if LeMatRho were just another data source to push into LeMatBulk; we want it to be its own data source, so we need to modify the push-to-HuggingFace script. We also need to verify that the HuggingFace dataset schema includes the new columns and that the compressed fields serialize correctly as JSON strings.

@speckhard speckhard self-assigned this Feb 25, 2026
@speckhard speckhard added the enhancement New feature or request label Feb 25, 2026
Add `lematrho run` CLI command that downloads charge density data from
S3, compresses via pyrho, optionally runs Bader/DDEC6 analysis, and
writes Parquet files directly. Crash-safe via checkpoint file, atomic
Parquet writes, work-stealing parallelism.
- Move shared constants (STATIC_CALC_TYPE, GRID_KEY_MAP, timeouts) and
  write_potcar() into utils.py as single source of truth
- Convert all lematrho docstrings to Google style
- Add 11 new pipeline tests: _validate_tools, DDEC6 happy path,
  _structure_to_row with None fields, _push_to_huggingface, DDEC6 in
  _process_material
- Add @pytest.mark.integration scaffold with .env.integration loading
- Remove unused --force flag from lematrho transform (was silently ignored)
- Fix batch: Any -> batch: str type annotation in fetch.py
- Add .env.* to .gitignore with !.env.example exception
- Add LeMatRho/AWS/HuggingFace variables to .env.example
- Add pytest integration marker config to pyproject.toml
- Vasprun and Chgcar.from_file require filesystem paths, not BytesIO
  objects. Write bytes to temp files before parsing.
- Add --limit CLI option to cap number of materials processed (useful
  for smoke testing without processing all ~76k materials).
- Verified with real S3 smoke test: 2 materials processed end-to-end
  with compressed charge densities at 10x10x10 grid shape.
- Batch-checkpoint after Parquet flush instead of per-material to prevent
  desync on crash (up to chunk_size materials could be lost)
- Track failed materials in .failures.txt, skip on resume
- Close S3 StreamingBody and free compressed buffer in download_gz_file_from_s3
- Replace NamedTemporaryFile(delete=True) with TemporaryDirectory for
  reliable temp file lifetime in parse_vasprun_structure and compress_chgcar
- Fix _list_materials docstring ("Sorted" claim was incorrect)
- Document memory trade-off for raw_files kept for Bader/DDEC6
- Add 5 new tests (batch checkpoint, failure load/append/resume)