[DRAFT] FEAT: LeMatRho data fetch, transform. Downsample rho during fetch.#49
Add `lematrho run` CLI command that downloads charge density data from S3, compresses via pyrho, optionally runs Bader/DDEC6 analysis, and writes Parquet files directly. Crash-safe via checkpoint file, atomic Parquet writes, work-stealing parallelism.
- Move shared constants (`STATIC_CALC_TYPE`, `GRID_KEY_MAP`, timeouts) and `write_potcar()` into `utils.py` as the single source of truth
- Convert all lematrho docstrings to Google style
- Add 11 new pipeline tests: `_validate_tools`, DDEC6 happy path, `_structure_to_row` with `None` fields, `_push_to_huggingface`, DDEC6 in `_process_material`
- Add `@pytest.mark.integration` scaffold with `.env.integration` loading
- Remove unused `--force` flag from `lematrho transform` (was silently ignored)
- Fix `batch: Any` -> `batch: str` type annotation in `fetch.py`
- Add `.env.*` to `.gitignore` with `!.env.example` exception
- Add LeMatRho/AWS/HuggingFace variables to `.env.example`
- Add pytest `integration` marker config to `pyproject.toml`
- `Vasprun` and `Chgcar.from_file` require filesystem paths, not `BytesIO` objects. Write bytes to temp files before parsing.
- Add `--limit` CLI option to cap the number of materials processed (useful for smoke testing without processing all ~76k materials).
- Verified with a real S3 smoke test: 2 materials processed end-to-end with compressed charge densities at 10x10x10 grid shape.
- Batch-checkpoint after Parquet flush instead of per-material to prevent
desync on crash (up to chunk_size materials could be lost)
- Track failed materials in .failures.txt, skip on resume
- Close S3 StreamingBody and free compressed buffer in download_gz_file_from_s3
- Replace NamedTemporaryFile(delete=True) with TemporaryDirectory for
reliable temp file lifetime in parse_vasprun_structure and compress_chgcar
- Fix _list_materials docstring ("Sorted" claim was incorrect)
- Document memory trade-off for raw_files kept for Bader/DDEC6
- Add 5 new tests (batch checkpoint, failure load/append/resume)
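The bytes-to-temp-file fix and the `TemporaryDirectory` change above can be sketched together; `parse_from_bytes` is a hypothetical helper name, not the PR's actual API:

```python
import os
import tempfile


def parse_from_bytes(raw: bytes, filename: str, parse_fn):
    """pymatgen's Vasprun/Chgcar.from_file need real filesystem paths, so
    write the downloaded bytes into a TemporaryDirectory (reliable lifetime,
    unlike NamedTemporaryFile(delete=True)) and parse from that path."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, filename)
        with open(path, "wb") as f:
            f.write(raw)
        # The directory (and file) is removed once the context exits,
        # so parse_fn must fully consume the file before returning.
        return parse_fn(path)
```

In the real code, `parse_fn` would be something like `Vasprun` or `Chgcar.from_file`; here it is any callable taking a path.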
This is a draft, meant for sharing only; it is not yet ready for review.
## High-Level Overview

We want to add a charge density HF dataset using charge densities from an S3 bucket. We build on this code base so we don't need to write something from scratch, and because the most recent Materials Project fetcher is similar in that it pulls from an AWS S3 bucket. The rest of this summary is AI-generated.
This PR adds a new data source, LeMatRho, to `lematerial-fetcher`. LeMatRho provides charge density data from resimulations of the Alexandria database using Materials Project settings (PBE functional). The raw data lives in an authenticated S3 bucket (`lemat-rho`) containing ~700k material folders (possibly fewer; I haven't counted), each with VASP outputs: CHGCAR, AECCAR0/1/2, and vasprun.xml.

The PR includes two pipeline architectures:

1. **Traditional pipeline** (`lematrho fetch` + `lematrho transform`): follows the existing `BaseFetcher`/`BaseTransformer` pattern with PostgreSQL as intermediate storage. Included for consistency with the project's existing architecture.
2. **Direct pipeline** (`lematrho run`): a new single-pass S3-to-Parquet architecture that bypasses PostgreSQL entirely. This is the recommended pipeline for production use: it eliminates double S3 downloads, removes PostgreSQL operational overhead, and writes Parquet files ready for HuggingFace upload.

## Architecture: Direct Pipeline (`lematrho run`)

### Key design decisions
- **Single-pass**: Each S3 file is downloaded exactly once. The traditional pipeline downloads files twice (fetch downloads for pyrho compression; transform re-downloads for Bader/DDEC6). The direct pipeline keeps raw decompressed bytes in memory for both operations.
- **No PostgreSQL**: The end target is Parquet files on HuggingFace. PostgreSQL adds schema migration, connection management, and disk space overhead with no benefit for this pipeline. The `DirectPipelineConfig` dataclass has zero database fields.
- **Crash-safe checkpointing**: A text file (`.checkpoint.txt`) records each successfully processed material ID. On restart, the pipeline skips already-processed materials and resumes writing Parquet chunks from the next index.
- **Atomic Parquet writes**: Chunks are written to a `.tmp` file first, then renamed via `os.rename()` (POSIX atomic). Stale `.tmp` files from crashed runs are ignored on resume.
- **Work-stealing parallelism**: Uses `concurrent.futures.wait(FIRST_COMPLETED)` with bounded submission (2x `num_workers` futures in flight) to avoid creating 700k Future objects in memory.
- **Memory management**: Sequential file processing within each worker, explicit `del` of raw bytes after use, `gc.collect()` after each material. Conservative default of 4 workers (each CHGCAR can be hundreds of MB).
- **Graceful tool degradation**: External tools (`bader`, `chargemol`, `perl` + `chgsum.pl`) are validated at init. If any tool is missing, the corresponding fields are set to `None` rather than failing. Bader and DDEC6 are independent: one can fail while the other succeeds.
- **Consolidated constants**: All shared constants (S3 paths, timeouts, grid key mappings) and the `write_potcar()` helper live in a single `utils.py` module, with no duplication across fetch, transform, and pipeline.
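The checkpoint and atomic-write decisions above can be sketched as follows; the function names are illustrative, not the PR's actual API:

```python
import os


def write_chunk_atomically(data: bytes, final_path: str) -> None:
    """Write to a sibling .tmp file, fsync, then rename. os.rename() is
    atomic on POSIX when both paths are on the same filesystem, so a crash
    leaves either the old file or the new one, never a partial write."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp_path, final_path)


def append_checkpoint(checkpoint_path: str, material_ids: list) -> None:
    """Record processed IDs only after the chunk is durably on disk,
    so the checkpoint can never get ahead of the Parquet output."""
    with open(checkpoint_path, "a") as f:
        for mid in material_ids:
            f.write(mid + "\n")


def load_checkpoint(checkpoint_path: str) -> set:
    """On resume, skip everything already recorded."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return {line.strip() for line in f if line.strip()}
```

Batch-checkpointing after the flush (rather than per material) is what keeps the checkpoint and the Parquet files from desyncing on a crash, at the cost of reprocessing up to one chunk on restart.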
## Data Model Extensions

### OptimadeStructure (Pydantic model)

8 new optional fields added after `bawl_fingerprint`:

- `compressed_charge_density`: `Optional[list]`
- `compressed_aeccar0`: `Optional[list]`
- `compressed_aeccar1`: `Optional[list]`
- `compressed_aeccar2`: `Optional[list]`
- `charge_density_grid_shape`: `Optional[list[int]]` (e.g. `[15, 15, 15]`)
- `bader_charges`: `Optional[list[float]]`
- `bader_atomic_volume`: `Optional[list[float]]`
- `ddec6_charges`: `Optional[list[float]]`

All fields default to `None`, so existing data sources (MP, Alexandria, OQMD) are unaffected.
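A minimal sketch of how these fields might look on the model; the existing fields of the real `OptimadeStructure` are omitted here:

```python
from typing import Optional

from pydantic import BaseModel


class OptimadeStructure(BaseModel):
    """Sketch of the extension only; the real model's existing fields
    (ids, species, positions, ..., bawl_fingerprint) are omitted."""

    # 8 new optional charge-density fields; all default to None so
    # existing sources (MP, Alexandria, OQMD) are unaffected.
    compressed_charge_density: Optional[list] = None
    compressed_aeccar0: Optional[list] = None
    compressed_aeccar1: Optional[list] = None
    compressed_aeccar2: Optional[list] = None
    charge_density_grid_shape: Optional[list] = None  # e.g. [15, 15, 15]
    bader_charges: Optional[list] = None
    bader_atomic_volume: Optional[list] = None
    ddec6_charges: Optional[list] = None
```

Because every field defaults to `None`, rows from other sources validate unchanged.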
### PostgreSQL Schema

Matching columns added to `OptimadeDatabase.columns()`:

- `JSONB` for compressed grids (4 columns)
- `INTEGER[]` for grid shape
- `FLOAT[]` for per-site charges/volumes (3 columns)

Migration SQL provided in `migrations/001_add_charge_density_columns.sql`.

### Parquet Schema
33-column PyArrow schema defined in `pipeline.py`, matching the HuggingFace Features: string-valued columns as `pa.string()`, per-site float arrays as `pa.list_(pa.float64())`, and the grid shape as `pa.list_(pa.int32())`.
### HuggingFace Push

Charge density fields added to `_get_optimade_features()` in `push.py`. Removed from trajectories features (LeMatRho-specific). JSON serialization added for compressed grid columns during dataset loading.

### Source Enum & S3 Authentication
- `Source.LEMATRHO = "lematrho"` added to the `Source` enum
- `get_authenticated_aws_client()` added to `utils/aws.py`: uses the default boto3 credential chain (no `UNSIGNED` config), with adaptive retry (`max_attempts=3`, `mode='adaptive'`)
## Charge Density Processing

### pyrho Compression

For each of CHGCAR, AECCAR0, AECCAR1, AECCAR2:

1. Parse with `pymatgen.io.vasp.Chgcar`
2. Convert to a pyrho `ChargeDensity` via `ChargeDensity.from_pmg()`
3. Compress with `pgrids["total"].lossy_smooth_compression(grid_shape)`

Default grid shape is `(15, 15, 15)`, configurable via `--grid-shape`.
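For intuition only (this is not pyrho's code): lossy smooth compression of a periodic grid amounts to keeping only the low-frequency Fourier components. A self-contained numpy illustration of that idea:

```python
import numpy as np


def fourier_downsample(grid, shape):
    """Conceptual illustration of lossy smooth compression: downsample a
    periodic 3D grid by truncating its Fourier spectrum to `shape`.
    Not pyrho's actual implementation."""
    # Shift the zero-frequency component to the center of the spectrum.
    f = np.fft.fftshift(np.fft.fftn(grid))
    # Keep the central (low-frequency) block of the requested shape.
    slices = tuple(
        slice((n - m) // 2, (n - m) // 2 + m) for n, m in zip(grid.shape, shape)
    )
    f_small = f[slices]
    # Rescale so grid values (not totals) are preserved.
    scale = np.prod(shape) / np.prod(grid.shape)
    return np.real(np.fft.ifftn(np.fft.ifftshift(f_small))) * scale
```

Smooth (band-limited) densities survive this well; sharp features are lost, which is why the compression is called lossy.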
### Bader Analysis (optional)

Requires: `bader` executable, `perl`, the `chgsum.pl` script, and the `PMG_VASP_PSP_DIR` env var.

1. Write POTCAR via `MatPESStaticSet(structure).potcar`
2. `perl chgsum.pl AECCAR0 AECCAR2` (timeout: 300s)
3. `bader CHGCAR -ref CHGCAR_sum` (timeout: 600s)
4. Parse `ACF.dat` for per-site charges and atomic volumes

### DDEC6 Analysis (optional)
Requires: `chargemol` executable, an atomic densities directory, and the `PMG_VASP_PSP_DIR` env var. Runs `chargemol` (timeout: 600s).

Both analyses use `subprocess.run()` with explicit timeouts and `check=True`. On failure, fields are set to `None` and a warning is logged.
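The subprocess-with-graceful-degradation pattern can be sketched like this; `run_tool` is a hypothetical helper mirroring the behavior described above, not the PR's actual API:

```python
import logging
import subprocess
from typing import Optional

logger = logging.getLogger(__name__)


def run_tool(cmd, timeout, cwd: Optional[str] = None):
    """Run an external analysis tool (bader, chargemol, perl chgsum.pl).
    Returns the CompletedProcess on success, or None on timeout, non-zero
    exit, or missing binary, so a missing tool degrades to None fields
    instead of crashing the pipeline."""
    try:
        return subprocess.run(
            cmd, cwd=cwd, timeout=timeout, check=True, capture_output=True
        )
    except (
        subprocess.TimeoutExpired,
        subprocess.CalledProcessError,
        FileNotFoundError,
    ) as exc:
        logger.warning("external tool %s failed: %s", cmd[0], exc)
        return None
```

Because Bader and DDEC6 each go through such a guard independently, one can fail while the other still produces results.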
## Code Organization

All LeMatRho-specific constants, helpers, and timeout values are consolidated in a single `utils.py` module. The three pipeline variants (`fetch.py`, `transform.py`, `pipeline.py`) import from `utils.py`, with no duplication:

- Constants: `STATIC_CALC_TYPE`, `RELAX_CALC_TYPE`, `STATIC_FILES`, `VALID_PREFIXES`, `GRID_KEY_MAP`, `DEFAULT_MAX_WORKERS`, `BADER_TIMEOUT`, `CHGSUM_TIMEOUT`, `CHARGEMOL_TIMEOUT`
- Helpers: `download_gz_file_from_s3()`, `parse_vasprun_structure()`, `compress_chgcar()`, `build_raw_structure()`, `write_potcar()`

All docstrings follow Google style with Args/Returns/Raises sections.
## Files Changed/Added

### New files (source)

- `src/.../lematrho/__init__.py`
- `src/.../lematrho/fetch.py`: `LeMatRhoFetcher(BaseFetcher)`, traditional S3-to-PostgreSQL fetch
- `src/.../lematrho/transform.py`: `LeMatRhoTransformer(BaseTransformer)`, PostgreSQL raw-to-OPTIMADE transform
- `src/.../lematrho/utils.py`: `download_gz_file_from_s3`, `compress_chgcar`, `write_potcar`
- `src/.../lematrho/pipeline.py`: `LeMatRhoDirectPipeline`, single-pass S3-to-Parquet pipeline

### New files (tests)

- `tests/fetcher/lematrho/test_lematrho_fetch.py`
- `tests/fetcher/lematrho/test_lematrho_transform.py`
- `tests/fetcher/lematrho/test_lematrho_pipeline.py`

### Modified files

- `src/.../models/utils/enums.py`: `Source.LEMATRHO`
- `src/.../models/optimade.py`
- `src/.../database/postgres.py`
- `src/.../utils/aws.py`: `get_authenticated_aws_client()`
- `src/.../utils/config.py`: `DirectPipelineConfig`, `load_direct_pipeline_config()`, extended `FetcherConfig` and `TransformerConfig`
- `src/.../utils/cli.py`: `add_lematrho_fetch_options`, `add_lematrho_transform_options`, `add_lematrho_direct_options`
- `src/.../cli.py`: `lematrho` command group with `fetch`, `transform`, `run` subcommands
- `src/.../push.py`
- `pyproject.toml`: `mp-pyrho>=0.3.1` dependency, `integration` pytest marker
- `.gitignore`: `.env.*` with `!.env.example` exception
- `.env.example`
- `tests/test_cli.py`
- `tests/utils/test_aws.py`
- `tests/utils/test_config.py`: `DirectPipelineConfig` tests
- `tests/models/test_optimade_model.py`

## CLI Usage
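A hedged sketch of typical invocations, assembled from the commands and flags mentioned elsewhere in this PR (exact flag syntax, e.g. for `--grid-shape`, and the `my-org/lematrho-test` repo ID are assumptions):

```shell
# Traditional two-phase pipeline (S3 -> PostgreSQL -> OPTIMADE)
lematerial_fetcher lematrho fetch
lematerial_fetcher lematrho transform

# Direct single-pass pipeline (S3 -> Parquet), small smoke test first
lematerial_fetcher --debug lematrho run \
    --output-dir ./test_output \
    --num-workers 1 \
    --parquet-chunk-size 5 \
    --limit 2

# Scale up, then push to a private test repo (hypothetical repo ID)
lematerial_fetcher lematrho run \
    --output-dir ./output \
    --num-workers 4 \
    --hf-repo-id my-org/lematrho-test
```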
## External Dependencies

- `mp-pyrho>=0.3.1`: installed via pip (added to `pyproject.toml`)
- `bader` executable
- `perl` + `chgsum.pl`
- `chargemol` executable
- `PMG_VASP_PSP_DIR` env var
- `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`

All external tools are optional; missing tools result in `None` fields rather than pipeline failure.

## Test Coverage
- `_validate_tools()` with all tool combos (all present, none present, partial, missing `PMG_VASP_PSP_DIR`)
- `_structure_to_row` with full fields, `None` fields, and partial fields
- `_push_to_huggingface` mock (called when configured, skipped when not)
- `@pytest.mark.integration` tests that connect to real S3 (deselected by default; requires `.env.integration` with AWS credentials)
## Next Steps for Testing

- Run `lematerial_fetcher --debug lematrho run --output-dir ./test_output --num-workers 1 --parquet-chunk-size 5` with valid AWS credentials to verify end-to-end with real data (small batch first).
- Start with `--num-workers 1`, then scale up; each CHGCAR can be hundreds of MB.
- Test `--hf-repo-id` with a private test repo before pushing to production.

### Open issues we need to fix

Right now the code is set up as if LeMatRho is another data source that we will push into LeMatBulk. We want it to be its own data source, so we need to modify the push-to-HF script. We also need to verify that the HuggingFace dataset schema includes the new columns and that compressed fields serialize correctly as JSON strings.