Proposal: post-generation structural analysis, screening pipeline, and compositional guided sampling #13

@exopoiesis

Context

We've been using CrystalFormer for generative design of layered mineral membranes (Fe-Ni-S sulfides for proton transport). Over the past few weeks we generated more than 7,000 structures using both targeted and wild sampling, and built an ad-hoc Voronoi analysis on top to filter candidates by layeredness and void size.

This worked well — 91% of our targeted generations were layered, 87% had voids > 1 Å — but the analysis code is messy, single-purpose, and not reusable. We think there's value in building proper post-generation analysis and screening tools directly into CrystalFormer, and wanted to discuss the idea before investing time in a PR.

The gap

CrystalFormer does an excellent job at generation, but the workflow after generation is entirely on the user:

CrystalFormer generates structures → ??? → user somehow evaluates them

For many use cases (ionic conductors, membranes, porous materials, MOFs, zeolites), the key question isn't just "is this a valid crystal?" but "does this crystal have the structural features I need?" — channels, voids, layered morphology, specific compositions, etc.

Proposal: three modules

We'd like to contribute three chemistry-agnostic modules. None of these are tied to our specific Fe-S use case — they work for any crystal system.

1. Structural analysis (crystalformer/analysis/)

Voronoi-based characterization of generated structures:

Void analysis (voronoi.py):

  • max_void_radius — largest inscribed sphere radius (Å)
  • void_fraction — free volume fraction
  • layeredness_score — anisotropy of Voronoi vertices (PCA-based, 0 = isotropic, 1 = perfectly layered)
  • interlayer_spacing — layer separation for layered structures (Å)
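
To make the first two descriptors concrete, here is a minimal sketch that estimates them on a regular grid instead of from Voronoi vertices. This is a swapped-in, cruder technique used only for illustration, not the proposed implementation. The function names, the single `atom_radius` for all atoms, and the restriction to the 27 nearest periodic images (adequate only for cells that are not strongly skewed) are all assumptions.

```python
import itertools
import math


def frac_to_cart(frac, lattice):
    """Convert fractional coordinates to Cartesian (lattice rows are a, b, c)."""
    return tuple(
        sum(frac[i] * lattice[i][k] for i in range(3)) for k in range(3)
    )


def min_distance_to_atoms(point_frac, atoms_frac, lattice):
    """Shortest distance from a point to any atom centre, over 27 periodic images."""
    best = math.inf
    for atom in atoms_frac:
        for shift in itertools.product((-1, 0, 1), repeat=3):
            delta = [atom[i] + shift[i] - point_frac[i] for i in range(3)]
            dist = math.dist((0.0, 0.0, 0.0), frac_to_cart(delta, lattice))
            best = min(best, dist)
    return best


def grid_void_estimate(atoms_frac, lattice, atom_radius=0.0, n_grid=8):
    """Return (max_void_radius, void_fraction) estimated on an n_grid^3 mesh.

    Assumption: every atom is a hard sphere of the same radius.
    """
    max_radius = 0.0
    free_points = 0
    for idx in itertools.product(range(n_grid), repeat=3):
        point = [i / n_grid for i in idx]
        clearance = min_distance_to_atoms(point, atoms_frac, lattice) - atom_radius
        max_radius = max(max_radius, clearance)
        if clearance > 0.0:
            free_points += 1
    return max_radius, free_points / n_grid**3
```

For a simple cubic cell with a = 4 Å, one atom at the origin and a 1 Å hard-sphere radius, the largest void sits at the body centre, so the estimate should approach 2√3 − 1 Å. A Voronoi-based version finds the same point exactly, because the body centre is a Voronoi vertex.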

Channel/percolation analysis (percolation.py):

  • Build Voronoi network graph (nodes = Voronoi vertices, edges = Voronoi edges connecting adjacent vertices)
  • Filter edges by bottleneck > r_probe (configurable, default 0.4 Å for H⁺)
  • Check percolation through periodic boundaries in a/b/c directions via BFS
  • Report: percolation_dimensionality (0D/1D/2D/3D), min_bottleneck along best channel path
  • This answers the question "can an ion of radius r travel through this structure?" — useful for any ionic conductor screening
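
The percolation check above can be sketched as follows, assuming the Voronoi network has already been reduced to nodes inside one unit cell and edges tagged with the periodic image they cross. The edge layout `(u, v, image_shift, bottleneck)` and both function names are hypothetical. The idea is that a BFS assigns each node a cell offset. Reaching an already-visited node with a different offset means the open network wraps around the cell, and the rank of the collected wrap-around vectors gives the dimensionality.

```python
from collections import deque
from fractions import Fraction


def rank(vectors):
    """Rank of a list of integer 3-vectors via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    for col in range(3):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def percolation_dimensionality(n_nodes, edges, r_probe):
    """Return 0-3: highest dimensionality of any connected open channel network.

    edges: iterable of (u, v, image_shift, bottleneck), where image_shift is the
    integer cell offset of v relative to u (hypothetical layout).
    """
    adjacency = {i: [] for i in range(n_nodes)}
    for u, v, shift, bottleneck in edges:
        if bottleneck > r_probe:  # keep only edges the probe fits through
            adjacency[u].append((v, tuple(shift)))
            adjacency[v].append((u, tuple(-s for s in shift)))

    best = 0
    offset = {}
    for start in range(n_nodes):
        if start in offset:
            continue
        offset[start] = (0, 0, 0)
        translations = []  # lattice vectors spanned by this component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, shift in adjacency[u]:
                reached = tuple(offset[u][k] + shift[k] for k in range(3))
                if v not in offset:
                    offset[v] = reached
                    queue.append(v)
                elif reached != offset[v]:
                    translations.append(
                        tuple(reached[k] - offset[v][k] for k in range(3))
                    )
        best = max(best, rank(translations))
    return best
```

Using the rank rather than three independent a/b/c checks also handles channels that run diagonally through the cell. Per-axis flags such as `percolates_c` could be derived from the same translation vectors.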

CLI:

python -m crystalformer.analysis structures.csv --r-probe 0.4 --check-percolation --output results.csv

Python API:

from crystalformer.analysis import VoronoiAnalyzer, PercolationAnalyzer

va = VoronoiAnalyzer()
result = va.analyze(structure)  # pymatgen Structure
# result.max_void_radius → 1.82 Å
# result.layeredness_score → 0.91

pa = PercolationAnalyzer(r_probe=0.4)
result = pa.analyze(structure)
# result.percolation_dimensionality → 2 (layered conductor)
# result.min_bottleneck → 0.62 Å
# result.percolates_c → True

2. Compositional guided sampling

Bias atom-type logits during autoregressive decoding to steer generation toward desired compositions — without retraining or fine-tuning.

Mechanism: At each atom-type sampling step, add a bias vector to the logits before softmax. Positive bias encourages an element, negative suppresses it. Zero = unchanged (default behavior preserved).

Python API:

# "I want more structures with Li and O"
structures = sample_crystal(
    params, n_samples=1000,
    composition_bias={"Li": 2.0, "O": 1.5, "P": 1.0}
)

CLI:

python -m crystalformer.sample --checkpoint model.pkl \
    --n-samples 1000 \
    --composition-bias "Li:2.0,O:1.5,P:1.0"

This is deliberately simple — not a hard constraint, just a soft nudge. It turns "generate random structures and hope for the right composition" into "generate structures enriched in desired elements." For us this increased Fe-S hit rate from ~12% to ~60% without any model changes.

Scope: Only modifies the sampling function — no changes to model architecture, training, or weights.
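
A minimal, framework-free sketch of the biasing step, to show what the numbers in `composition_bias` mean. It uses plain Python rather than the JAX sampling code, and the helper names `parse_bias` and `biased_probs` are hypothetical. The only operation is adding a per-element constant to the logits before softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def parse_bias(spec):
    """Parse a CLI-style string such as 'Li:2.0,O:1.5' into a dict."""
    bias = {}
    for item in spec.split(","):
        element, value = item.split(":")
        bias[element.strip()] = float(value)
    return bias


def biased_probs(logits, elements, composition_bias):
    """Add the per-element bias to the logits, then normalise.

    Elements missing from composition_bias get a bias of 0.0, so an empty
    dict reproduces the unbiased distribution.
    """
    shifted = [
        x + composition_bias.get(el, 0.0) for x, el in zip(logits, elements)
    ]
    return softmax(shifted)
```

Because softmax exponentiates the logits, a bias of b multiplies an element's odds relative to unbiased elements by e^b. A bias of 2.0 makes Li about 7.4 times more likely relative to an unbiased element than it was before, which is why small values already shift the composition noticeably.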

3. Screening pipeline (crystalformer/screen/)

A pluggable filter chain for batch screening of generated structures.

Built-in filters:

Filter        Criterion                                              Default
validity      Valid structure (atoms placed, no overlaps < 0.5 Å)    ON
voronoi       max_void_radius > threshold                            r > 1.0 Å
percolation   percolation_dimensionality >= threshold                dim >= 1
composition   Contains specified elements                            OFF
density       Density within range                                   OFF

Custom filters (user-extensible):

from crystalformer.screen import ScreeningPipeline, Filter

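class StabilityFilter(Filter):
    """Example: user plugs in their own ML potential."""
    def __init__(self, threshold):
        self.threshold = threshold  # maximum acceptable predicted energy

    def __call__(self, structure) -> bool:
        # my_ml_model is a placeholder for the user's own predictor
        energy = my_ml_model.predict(structure)
        return energy < self.threshold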

pipeline = ScreeningPipeline([
    "validity",
    "voronoi:r_min=0.4",
    "percolation:dim=2",
    StabilityFilter(threshold=0.05),
])

results = pipeline.run("generated.csv", output="screened.csv")
# 5000 → 4200 valid → 1890 voronoi → 340 percolating → 52 stable

CLI:

python -m crystalformer.screen generated/ \
    --filters validity,voronoi,percolation \
    --r-probe 0.4 \
    --output screened.csv

The pipeline prints a summary table showing how many structures pass each stage — makes it easy to see where the funnel narrows.
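
The funnel bookkeeping could look roughly like the sketch below. It is a simplification under stated assumptions: structures are represented as plain dicts of precomputed descriptors rather than pymatgen objects, and `PredicateFilter` and `run_funnel` are hypothetical names, not the proposed `ScreeningPipeline` API.

```python
class Filter:
    """Base class: a named predicate over one structure."""

    name = "filter"

    def __call__(self, structure) -> bool:
        raise NotImplementedError


class PredicateFilter(Filter):
    """Wrap any callable returning True/False as a named filter."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate

    def __call__(self, structure) -> bool:
        return bool(self.predicate(structure))


def run_funnel(structures, filters):
    """Apply filters in order; return survivors and per-stage counts."""
    survivors = list(structures)
    summary = [("input", len(survivors))]
    for flt in filters:
        survivors = [s for s in survivors if flt(s)]
        summary.append((flt.name, len(survivors)))
    return survivors, summary


def format_summary(summary):
    """Render the counts as a one-line funnel, e.g. '4 input -> 3 validity'."""
    return " -> ".join(f"{count} {name}" for name, count in summary)
```

Running the filters sequentially means later, more expensive stages (percolation, ML potentials) only see structures that passed the cheap ones, so ordering filters from cheapest to most expensive keeps batch screening fast.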

Dependencies

Only pymatgen (already used for I/O) and scipy (for scipy.spatial.Voronoi) are needed, and both would be declared as optional dependencies:

[project.optional-dependencies]
analysis = ["pymatgen>=2024.1.1", "scipy>=1.10"]

No heavy ML dependencies. Core CrystalFormer stays lightweight.
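
One way to keep the core import path free of these packages is to import them lazily inside the analysis modules and fail with an actionable message. The sketch below assumes a hypothetical helper named `require` and assumes the package is installable as `crystalformer` with an `analysis` extra. Neither is confirmed by the current repo.

```python
import importlib


def require(module_name, extra="analysis"):
    """Import an optional dependency, or explain how to install it.

    The distribution name 'crystalformer' in the message is an assumption.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install 'crystalformer[{extra}]'"
        ) from exc
```

Calling `require("scipy.spatial")` at the top of `VoronoiAnalyzer.analyze` rather than at module import time would let users who only generate structures avoid installing scipy at all.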

File structure (all new files, minimal changes to existing code)

crystalformer/
├── analysis/
│   ├── __init__.py
│   ├── voronoi.py          # VoronoiAnalyzer
│   ├── percolation.py      # PercolationAnalyzer
│   └── __main__.py         # CLI
├── screen/
│   ├── __init__.py
│   ├── pipeline.py         # ScreeningPipeline, Filter base
│   ├── filters.py          # Built-in filters
│   └── __main__.py         # CLI
└── src/
    └── sample.py           # + optional composition_bias param

Questions before we start

  1. Do you already have internal analysis/screening tools? We don't want to duplicate existing work. If you have something similar internally, we'd be happy to build on it instead.
  2. Is this in scope for the main repo? If you'd prefer this to live as a separate package (e.g., crystalformer-analysis), that works too.
  3. Single PR or incremental? We can do one PR with everything, or split into 2-3 (analysis → sampling → screening).
  4. Any naming/API conventions you'd like us to follow?

Happy to discuss any aspect of this. We've been enjoying working with CrystalFormer and would like to contribute back to the project.
