Proposal: post-generation structural analysis, screening pipeline, and compositional guided sampling #13
Description
Context
We've been using CrystalFormer for generative design of layered mineral membranes (Fe-Ni-S sulfides for proton transport). Over the past few weeks we generated more than 7,000 structures using both targeted and wild sampling, and built an ad hoc Voronoi analysis on top to filter candidates by layeredness and void size.
This worked well — 91% of our targeted generations were layered, 87% had voids > 1 Å — but the analysis code is messy, single-purpose, and not reusable. We think there's value in building proper post-generation analysis and screening tools directly into CrystalFormer, and wanted to discuss the idea before investing time in a PR.
The gap
CrystalFormer does an excellent job at generation, but the workflow after generation is entirely on the user:
CrystalFormer generates structures → ??? → user somehow evaluates them
For many use cases (ionic conductors, membranes, porous materials, MOFs, zeolites), the key question isn't just "is this a valid crystal?" but "does this crystal have the structural features I need?" — channels, voids, layered morphology, specific compositions, etc.
Proposal: three modules
We'd like to contribute three chemistry-agnostic modules. None of these are tied to our specific Fe-S use case — they work for any crystal system.
1. Structural analysis (crystalformer/analysis/)
Voronoi-based characterization of generated structures:
Void analysis (voronoi.py):
- max_void_radius: largest inscribed sphere radius (Å)
- void_fraction: free volume fraction
- layeredness_score: anisotropy of Voronoi vertices (PCA-based, 0 = isotropic, 1 = perfectly layered)
- interlayer_spacing: layer separation for layered structures (Å)
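The proposal does not spell out the formula behind layeredness_score, so the following is only a sketch of one possible PCA-style definition: take the eigenvalues of the covariance matrix of the void-center coordinates and report 1 - λ_min / λ_max. The function names and the formula itself are assumptions made for illustration, not the analysis code described above.

```python
import math


def covariance_eigenvalues(points):
    """Eigenvalues (descending) of the 3x3 covariance matrix of 3D points.

    Uses the closed-form trigonometric solution for symmetric 3x3 matrices,
    so no external linear-algebra library is needed.
    """
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    c = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
          for j in range(3)] for i in range(3)]
    off = c[0][1] ** 2 + c[0][2] ** 2 + c[1][2] ** 2
    if off == 0.0:
        # Already diagonal: the eigenvalues are the diagonal entries.
        return sorted((c[0][0], c[1][1], c[2][2]), reverse=True)
    q = (c[0][0] + c[1][1] + c[2][2]) / 3.0
    p = math.sqrt((sum((c[i][i] - q) ** 2 for i in range(3)) + 2.0 * off) / 6.0)
    b = [[(c[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [largest, 3.0 * q - largest - smallest, smallest]


def layeredness_score(points):
    """Hypothetical score: 0 for an isotropic cloud, 1 for a flat one."""
    largest, _, smallest = covariance_eigenvalues(points)
    if largest <= 0.0:
        return 0.0
    return 1.0 - max(smallest, 0.0) / largest


# Toy inputs: an octahedron (isotropic) and four points in a tilted plane.
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
tilted_plane = [(1, 1, 0), (-1, -1, 0), (0, 0, 1), (0, 0, -1)]
print(layeredness_score(octahedron), layeredness_score(tilted_plane))
```

Note that this particular ratio also returns 1 for points lying on a line, so a real implementation would likely need the middle eigenvalue as well to separate layers from 1D channels.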
Channel/percolation analysis (percolation.py):
- Build Voronoi network graph (nodes = Voronoi vertices, edges = faces between cells)
- Filter edges by bottleneck > r_probe (configurable, default 0.4 Å for H⁺)
- Check percolation through periodic boundaries in a/b/c directions via BFS
- Report: percolation_dimensionality (0D/1D/2D/3D), min_bottleneck along best channel path
- This answers the question "can an ion of radius r travel through this structure?", which is useful for any ionic conductor screening
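The steps above can be sketched on a toy periodic graph. This is a minimal illustration under stated assumptions, not the proposed PercolationAnalyzer: the edge format `(u, v, shift, bottleneck)`, where `shift` is the integer lattice translation crossed when going from `u` to `v`, and all function names are hypothetical. The idea is to run BFS while tracking which periodic image of each node was reached; arriving at an already visited node in a different image reveals a channel that wraps around the cell, and the rank of those wrap-around vectors gives the dimensionality.

```python
from collections import deque
from fractions import Fraction


def winding_vectors_by_component(n_nodes, edges, r_probe):
    """Return, per connected component, the set of lattice vectors its loops wind by."""
    adj = [[] for _ in range(n_nodes)]
    for u, v, shift, bottleneck in edges:
        if bottleneck > r_probe:  # keep only edges the probe can pass through
            adj[u].append((v, tuple(shift)))
            adj[v].append((u, tuple(-s for s in shift)))
    offset = [None] * n_nodes  # periodic image in which each node was first reached
    components = []
    for start in range(n_nodes):
        if offset[start] is not None:
            continue
        offset[start] = (0, 0, 0)
        loops = set()
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, shift in adj[u]:
                reached = tuple(o + s for o, s in zip(offset[u], shift))
                if offset[v] is None:
                    offset[v] = reached
                    queue.append(v)
                elif reached != offset[v]:
                    # Same node, different image: the path wraps around the cell.
                    loops.add(tuple(a - b for a, b in zip(reached, offset[v])))
        components.append(loops)
    return components


def rank(vectors):
    """Rank of a set of integer 3-vectors via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    for col in range(3):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def percolation_dimensionality(n_nodes, edges, r_probe):
    """Largest channel dimensionality (0 to 3) over all connected components."""
    comps = winding_vectors_by_component(n_nodes, edges, r_probe)
    return max((rank(loops) for loops in comps), default=0)


# Toy network: a wide channel along a (bottleneck 0.8 Å) and a narrower one along b (0.5 Å).
toy_edges = [
    (0, 1, (0, 0, 0), 0.8),
    (1, 0, (1, 0, 0), 0.8),
    (0, 0, (0, 1, 0), 0.5),
]
for probe in (0.4, 0.6, 0.9):
    print(probe, percolation_dimensionality(2, toy_edges, probe))
```

Taking the maximum over components, rather than pooling all loops, avoids reporting two unconnected 1D channels as a single 2D network.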
CLI:
python -m crystalformer.analysis structures.csv --r-probe 0.4 --check-percolation --output results.csv
Python API:
from crystalformer.analysis import VoronoiAnalyzer, PercolationAnalyzer
va = VoronoiAnalyzer()
result = va.analyze(structure) # pymatgen Structure
# result.max_void_radius → 1.82 Å
# result.layeredness_score → 0.91
pa = PercolationAnalyzer(r_probe=0.4)
result = pa.analyze(structure)
# result.percolation_dimensionality → 2 (layered conductor)
# result.min_bottleneck → 0.62 Å
# result.percolates_c → True
2. Compositional guided sampling
Bias atom-type logits during autoregressive decoding to steer generation toward desired compositions — without retraining or fine-tuning.
Mechanism: At each atom-type sampling step, add a bias vector to the logits before softmax. Positive bias encourages an element, negative suppresses it. Zero = unchanged (default behavior preserved).
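As a rough illustration of this mechanism, the sketch below applies an additive bias to toy atom-type logits in plain Python. The element vocabulary, the `biased_probs` helper and the logit values are all made up for the example; CrystalFormer's actual sampling code is not shown here. Adding a bias b to a logit multiplies that element's unnormalized probability by e^b, which is why a zero bias leaves the distribution untouched.

```python
import math
import random


def biased_probs(logits, elements, composition_bias):
    """Softmax over logits after adding a per-element bias (0.0 if unspecified)."""
    biased = [l + composition_bias.get(e, 0.0) for l, e in zip(logits, elements)]
    m = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical three-element vocabulary with uniform model logits.
elements = ["Li", "O", "Fe"]
logits = [0.0, 0.0, 0.0]

unbiased = biased_probs(logits, elements, {})
nudged = biased_probs(logits, elements, {"Li": 2.0})
print(unbiased)
print(nudged)

# Drawing one atom type from the nudged distribution.
rng = random.Random(0)
print(rng.choices(elements, weights=nudged, k=1)[0])
```

Because the bias is applied before normalization, suppressed elements remain possible unless the bias is very negative, which matches the "soft nudge" framing.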
# "I want more structures with Li and O"
structures = sample_crystal(
    params, n_samples=1000,
    composition_bias={"Li": 2.0, "O": 1.5, "P": 1.0}
)
python -m crystalformer.sample --checkpoint model.pkl \
    --n-samples 1000 \
    --composition-bias "Li:2.0,O:1.5,P:1.0"
This is deliberately simple: not a hard constraint, just a soft nudge. It turns "generate random structures and hope for the right composition" into "generate structures enriched in desired elements." For us this increased Fe-S hit rate from ~12% to ~60% without any model changes.
Scope: Only modifies the sampling function — no changes to model architecture, training, or weights.
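The `--composition-bias` string shown above would need to be turned into the dictionary accepted by `composition_bias`. One way this could look is sketched below; `parse_composition_bias` is a hypothetical helper name, and the error handling is an assumption rather than part of the proposal.

```python
def parse_composition_bias(spec):
    """Parse 'Li:2.0,O:1.5' into {'Li': 2.0, 'O': 1.5}."""
    bias = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and empty strings
        symbol, sep, value = item.partition(":")
        if not sep or not symbol.strip():
            raise ValueError(f"expected 'Element:bias', got {item!r}")
        bias[symbol.strip()] = float(value)
    return bias


print(parse_composition_bias("Li:2.0,O:1.5,P:1.0"))
```

An empty string parses to an empty dictionary, which corresponds to the unbiased default.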
3. Screening pipeline (crystalformer/screen/)
A pluggable filter chain for batch screening of generated structures.
Built-in filters:
| Filter | Criterion | Default |
|---|---|---|
| validity | Valid structure (atoms placed, no overlaps < 0.5 Å) | ON |
| voronoi | max_void_radius > threshold | r > 1.0 Å |
| percolation | percolation_dimensionality >= threshold | dim >= 1 |
| composition | Contains specified elements | OFF |
| density | Density within range | OFF |
Custom filters (user-extensible):
from crystalformer.screen import ScreeningPipeline, Filter
class StabilityFilter(Filter):
    """Example: user plugs in their own ML potential."""
    def __call__(self, structure) -> bool:
        energy = my_ml_model.predict(structure)
        return energy < self.threshold
pipeline = ScreeningPipeline([
    "validity",
    "voronoi:r_min=0.4",
    "percolation:dim=2",
    StabilityFilter(threshold=0.05),
])
results = pipeline.run("generated.csv", output="screened.csv")
# 5000 → 4200 valid → 1890 voronoi → 340 percolating → 52 stable
CLI:
python -m crystalformer.screen generated/ \
    --filters validity,voronoi,percolation \
    --r-probe 0.4 \
    --output screened.csv
The pipeline prints a summary table showing how many structures pass each stage, which makes it easy to see where the funnel narrows.
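To make the funnel idea concrete, here is a small self-contained sketch of a filter chain that parses specs such as "voronoi:r_min=0.4" and records how many candidates survive each stage. It operates on plain dictionaries with precomputed properties rather than real structures, and `parse_filter_spec`, `BUILDERS` and `run_funnel` are hypothetical names, not the proposed ScreeningPipeline API.

```python
def parse_filter_spec(spec):
    """Split 'voronoi:r_min=0.4' into ('voronoi', {'r_min': 0.4})."""
    name, _, arg_text = spec.partition(":")
    kwargs = {}
    for pair in filter(None, arg_text.split(",")):
        key, _, value = pair.partition("=")
        kwargs[key.strip()] = float(value)
    return name.strip(), kwargs


# Toy built-in filters; defaults follow the table above.
BUILDERS = {
    "validity": lambda: (lambda rec: rec["valid"]),
    "voronoi": lambda r_min=1.0: (lambda rec: rec["max_void_radius"] > r_min),
    "percolation": lambda dim=1: (lambda rec: rec["percolation_dimensionality"] >= dim),
}


def run_funnel(records, specs):
    """Apply filters in order; return survivors and (stage, count) pairs."""
    survivors = list(records)
    summary = [("input", len(survivors))]
    for spec in specs:
        name, kwargs = parse_filter_spec(spec)
        keep = BUILDERS[name](**kwargs)
        survivors = [rec for rec in survivors if keep(rec)]
        summary.append((name, len(survivors)))
    return survivors, summary


# Made-up candidates with precomputed analysis results.
records = [
    {"valid": True, "max_void_radius": 1.8, "percolation_dimensionality": 2},
    {"valid": True, "max_void_radius": 1.2, "percolation_dimensionality": 0},
    {"valid": True, "max_void_radius": 0.3, "percolation_dimensionality": 2},
    {"valid": False, "max_void_radius": 2.0, "percolation_dimensionality": 3},
]
survivors, summary = run_funnel(records, ["validity", "voronoi:r_min=0.4", "percolation:dim=2"])
for stage, count in summary:
    print(f"{stage:12s} {count}")
```

Ordering cheap filters first, as in this example, keeps expensive checks such as an ML potential to the smallest possible candidate set.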
Dependencies
Only pymatgen (already used for I/O) and scipy (for scipy.spatial.Voronoi), both as optional dependencies:
[project.optional-dependencies]
analysis = ["pymatgen>=2024.1.1", "scipy>=1.10"]
No heavy ML dependencies. Core CrystalFormer stays lightweight.
File structure (all new files, minimal changes to existing code)
crystalformer/
├── analysis/
│ ├── __init__.py
│ ├── voronoi.py # VoronoiAnalyzer
│ ├── percolation.py # PercolationAnalyzer
│ └── __main__.py # CLI
├── screen/
│ ├── __init__.py
│ ├── pipeline.py # ScreeningPipeline, Filter base
│ ├── filters.py # Built-in filters
│ └── __main__.py # CLI
└── src/
└── sample.py # + optional composition_bias param
Questions before we start
- Do you already have internal analysis/screening tools? We don't want to duplicate existing work. If you have something similar internally, we'd be happy to build on it instead.
- Is this in scope for the main repo? If you'd prefer this to live as a separate package (e.g., crystalformer-analysis), that works too.
- Single PR or incremental? We can do one PR with everything, or split into 2-3 (analysis → sampling → screening).
- Any naming/API conventions you'd like us to follow?
Happy to discuss any aspect of this. We've been enjoying working with CrystalFormer and would like to contribute back to the project.