Proposal: post-generation structural analysis, screening pipeline, and compositional guided sampling

## Context

We've been using CrystalFormer for generative design of layered mineral membranes (Fe-Ni-S sulfides for proton transport). Over the past few weeks we generated more than 7,000 structures using both targeted and wild sampling, and built an ad-hoc Voronoi analysis on top to filter candidates by layeredness and void size.

This worked well — 91% of our targeted generations were layered, 87% had voids > 1 Å — but the analysis code is messy, single-purpose, and not reusable. We think there's value in building proper post-generation analysis and screening tools directly into CrystalFormer, and wanted to discuss the idea before investing time in a PR.

## The gap

CrystalFormer does an excellent job at generation, but the workflow after generation is entirely on the user:

```
CrystalFormer generates structures → ??? → user somehow evaluates them
```

For many use cases (ionic conductors, membranes, porous materials, MOFs, zeolites), the key question isn't just "is this a valid crystal?" but "does this crystal have the structural features I need?" — channels, voids, layered morphology, specific compositions, etc.

## Proposal: three modules

We'd like to contribute three chemistry-agnostic modules. None of these are tied to our specific Fe-S use case — they work for any crystal system.

### 1. Structural analysis (`crystalformer/analysis/`)

Voronoi-based characterization of generated structures:

**Void analysis (`voronoi.py`):**
- `max_void_radius` — largest inscribed sphere radius (Å)
- `void_fraction` — free volume fraction
- `layeredness_score` — anisotropy of Voronoi vertices (PCA-based, 0 = isotropic, 1 = perfectly layered)
- `interlayer_spacing` — layer separation for layered structures (Å)
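
As a rough illustration of the `max_void_radius` metric above, the sketch below finds the largest empty sphere by evaluating the nearest-atom distance at each Voronoi vertex. It is a minimal sketch and not the proposed implementation: the function name and signature are hypothetical, atoms are treated as points (no atomic radii are subtracted), and a single shell of periodic images is assumed to be enough, which may not hold for strongly skewed cells.

```python
import itertools

import numpy as np
from scipy.spatial import Voronoi, cKDTree


def max_void_radius(lattice, frac_coords):
    """Largest empty-sphere radius (Å), treating atoms as points.

    lattice: (3, 3) array whose rows are the cell vectors a, b, c in Å.
    frac_coords: (n, 3) array of fractional atomic coordinates.
    Hypothetical helper, not part of CrystalFormer.
    """
    lattice = np.asarray(lattice, dtype=float)
    frac = np.asarray(frac_coords, dtype=float) % 1.0

    # Replicate atoms into the 26 neighbouring cells so the Voronoi cells
    # of the central image are bounded by their periodic neighbours.
    shifts = np.array(list(itertools.product((-1, 0, 1), repeat=3)), dtype=float)
    images = (frac[None, :, :] + shifts[:, None, :]).reshape(-1, 3)
    cart = images @ lattice

    vor = Voronoi(cart)

    # Keep only Voronoi vertices that fall inside the central unit cell.
    vert_frac = vor.vertices @ np.linalg.inv(lattice)
    inside = np.all((vert_frac >= -1e-9) & (vert_frac < 1.0 - 1e-9), axis=1)
    vertices = vor.vertices[inside]
    if len(vertices) == 0:
        return 0.0

    # The nearest-atom distance at a vertex is the radius of the
    # largest empty sphere centred there.
    dist, _ = cKDTree(cart).query(vertices)
    return float(dist.max())
```

For a simple cubic cell with a = 4 Å and one atom at the origin, the largest empty sphere sits at the body center, so the expected value is a√3/2 ≈ 3.46 Å. Subtracting an element-dependent atomic radius from each distance would be a natural refinement.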

**Channel/percolation analysis (`percolation.py`):**
- Build Voronoi network graph (nodes = Voronoi vertices, edges = Voronoi edges connecting adjacent vertices)
- Filter edges by bottleneck > `r_probe` (configurable, default 0.4 Å for H⁺)
- Check percolation through periodic boundaries in a/b/c directions via BFS
- Report: `percolation_dimensionality` (0D/1D/2D/3D), `min_bottleneck` along best channel path
- This answers the question "can an ion of radius r travel through this structure?" — useful for any ionic conductor screening
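
The percolation check described in the list above can be sketched on an abstract periodic graph. This is an illustrative sketch under stated assumptions, not the proposed `PercolationAnalyzer`: the function name and the edge format `(u, v, (da, db, dc), bottleneck)` are hypothetical, where `(da, db, dc)` is the unit-cell offset of node `v` relative to node `u`. Building the graph from a real Voronoi decomposition, and reporting `min_bottleneck` or per-axis flags, are left out.

```python
from collections import deque
from fractions import Fraction


def _rank(vectors):
    """Rank of a list of integer 3-vectors, via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors if any(v)]
    rank = 0
    for col in range(3):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def percolation_dimensionality(n_nodes, edges, r_probe):
    """Return 0, 1, 2 or 3 for the best-connected channel network.

    edges: iterable of (u, v, (da, db, dc), bottleneck_radius).
    """
    # Keep only edges wide enough for the probe; store both directions.
    adj = [[] for _ in range(n_nodes)]
    for u, v, shift, radius in edges:
        if radius > r_probe:
            adj[u].append((v, tuple(shift)))
            adj[v].append((u, tuple(-s for s in shift)))

    best = 0
    offset = [None] * n_nodes  # cell offset at which each node was first reached
    for start in range(n_nodes):
        if offset[start] is not None:
            continue
        offset[start] = (0, 0, 0)
        queue = deque([start])
        translations = []  # lattice vectors linking a node to its own image
        while queue:
            u = queue.popleft()
            for v, shift in adj[u]:
                image = tuple(o + s for o, s in zip(offset[u], shift))
                if offset[v] is None:
                    offset[v] = image
                    queue.append(v)
                elif offset[v] != image:
                    # Same node reached in a different cell: the path percolates.
                    translations.append(tuple(a - b for a, b in zip(image, offset[v])))
        best = max(best, _rank(translations))
    return best
```

The BFS records the cell in which each node is first reached. Reaching the same node again in a different cell means a path connects the node to one of its periodic images, and the rank of those connecting lattice vectors gives the dimensionality of the channel system.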

**CLI:**
```bash
python -m crystalformer.analysis structures.csv --r-probe 0.4 --check-percolation --output results.csv
```

**Python API:**
```python
from crystalformer.analysis import VoronoiAnalyzer, PercolationAnalyzer

va = VoronoiAnalyzer()
result = va.analyze(structure)  # pymatgen Structure
# result.max_void_radius → 1.82 Å
# result.layeredness_score → 0.91

pa = PercolationAnalyzer(r_probe=0.4)
result = pa.analyze(structure)
# result.percolation_dimensionality → 2 (layered conductor)
# result.min_bottleneck → 0.62 Å
# result.percolates_c → True
```

### 2. Compositional guided sampling

Bias atom-type logits during autoregressive decoding to steer generation toward desired compositions — without retraining or fine-tuning.

**Mechanism:** At each atom-type sampling step, add a bias vector to the logits before softmax. Positive bias encourages an element, negative suppresses it. Zero = unchanged (default behavior preserved).

```python
# "I want more structures with Li and O"
structures = sample_crystal(
    params, n_samples=1000,
    composition_bias={"Li": 2.0, "O": 1.5, "P": 1.0}
)
```

```bash
python -m crystalformer.sample --checkpoint model.pkl \
    --n-samples 1000 \
    --composition-bias "Li:2.0,O:1.5,P:1.0"
```
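
One way the `--composition-bias` string could be turned into the dictionary form used by the Python API is sketched below. The helper name `parse_composition_bias` is hypothetical, and the `ELEMENT:BIAS` comma-separated format is simply inferred from the example above.

```python
def parse_composition_bias(spec):
    """Parse 'Li:2.0,O:1.5' into {'Li': 2.0, 'O': 1.5}.

    Hypothetical helper; the format is inferred from the CLI example.
    """
    bias = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and trailing commas
        element, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"expected ELEMENT:BIAS, got {item!r}")
        bias[element.strip()] = float(value)
    return bias
```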

This is deliberately simple — not a hard constraint, just a soft nudge. It turns "generate random structures and hope for the right composition" into "generate structures enriched in desired elements." For us this increased Fe-S hit rate from ~12% to ~60% without any model changes.

**Scope:** Only modifies the sampling function — no changes to model architecture, training, or weights.
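
To make the biasing step concrete, here is a small stand-alone sketch in plain Python. It is not CrystalFormer's sampling code (which is written in JAX): the five-element vocabulary and the function name `biased_probs` are assumptions made for illustration only.

```python
import math

# Hypothetical atom-type vocabulary, for illustration only.
ELEMENTS = ["Li", "O", "P", "Fe", "S"]


def biased_probs(logits, composition_bias):
    """Add a per-element bias to the logits, then apply a stable softmax."""
    # Elements missing from the bias dict get 0.0, i.e. they are unchanged.
    biased = [
        logit + composition_bias.get(element, 0.0)
        for logit, element in zip(logits, ELEMENTS)
    ]
    top = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the bias is added before the softmax, a bias of `b` multiplies an element's odds relative to every unbiased element by `exp(b)`. An empty bias dictionary leaves the distribution untouched, which matches the "default behavior preserved" claim.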

### 3. Screening pipeline (`crystalformer/screen/`)

A pluggable filter chain for batch screening of generated structures.

**Built-in filters:**

| Filter | Criterion | Default |
|--------|-----------|---------|
| `validity` | Valid structure (atoms placed, no overlaps < 0.5 Å) | ON |
| `voronoi` | max_void_radius > threshold | r > 1.0 Å |
| `percolation` | percolation_dimensionality >= threshold | dim >= 1 |
| `composition` | Contains specified elements | OFF |
| `density` | Density within range | OFF |

**Custom filters (user-extensible):**
```python
from crystalformer.screen import ScreeningPipeline, Filter

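class StabilityFilter(Filter):
    """Example: user plugs in their own ML potential."""
    def __init__(self, threshold: float):
        self.threshold = threshold  # maximum accepted predicted energy

    def __call__(self, structure) -> bool:
        # my_ml_model is a placeholder for a user-supplied predictor
        energy = my_ml_model.predict(structure)
        return energy < self.threshold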

pipeline = ScreeningPipeline([
    "validity",
    "voronoi:r_min=0.4",
    "percolation:dim=2",
    StabilityFilter(threshold=0.05),
])

results = pipeline.run("generated.csv", output="screened.csv")
# 5000 → 4200 valid → 1890 voronoi → 340 percolating → 52 stable
```

**CLI:**
```bash
python -m crystalformer.screen generated/ \
    --filters validity,voronoi,percolation \
    --r-probe 0.4 \
    --output screened.csv
```

The pipeline prints a summary table showing how many structures pass each stage — makes it easy to see where the funnel narrows.
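
The stage-by-stage counting could work roughly as follows. This is a minimal sketch rather than the proposed `ScreeningPipeline`: the names `run_funnel` and `format_summary` are hypothetical, filters are assumed to be `(name, predicate)` pairs, and plain dictionaries stand in for analyzed structures.

```python
def run_funnel(records, filters):
    """Apply (name, predicate) filters in order.

    Returns the surviving records and a list of (stage, count) pairs.
    """
    counts = [("input", len(records))]
    survivors = list(records)
    for name, keep in filters:
        # Each stage only sees what the previous stage let through.
        survivors = [r for r in survivors if keep(r)]
        counts.append((name, len(survivors)))
    return survivors, counts


def format_summary(counts):
    """Render the funnel as a single line, one entry per stage."""
    return " → ".join(f"{n} {name}" for name, n in counts)
```

Ordering cheap filters first means expensive ones, such as a user-supplied ML potential, only run on structures that survived the earlier stages.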

## Dependencies

Only `pymatgen` (already used for I/O) and `scipy` (for `scipy.spatial.Voronoi`) are needed, both declared as optional dependencies:

```toml
[project.optional-dependencies]
analysis = ["pymatgen>=2024.1.1", "scipy>=1.10"]
```

No heavy ML dependencies. Core CrystalFormer stays lightweight.

## File structure (all new files, minimal changes to existing code)

```
crystalformer/
├── analysis/
│   ├── __init__.py
│   ├── voronoi.py          # VoronoiAnalyzer
│   ├── percolation.py      # PercolationAnalyzer
│   └── __main__.py         # CLI
├── screen/
│   ├── __init__.py
│   ├── pipeline.py         # ScreeningPipeline, Filter base
│   ├── filters.py          # Built-in filters
│   └── __main__.py         # CLI
└── src/
    └── sample.py           # + optional composition_bias param
```

## Questions before we start

1. **Do you already have internal analysis/screening tools?** We don't want to duplicate existing work. If you have something similar internally, we'd be happy to build on it instead.
2. **Is this in scope for the main repo?** If you'd prefer this to live as a separate package (e.g., `crystalformer-analysis`), that works too.
3. **Single PR or incremental?** We can do one PR with everything, or split into 2-3 (analysis → sampling → screening).
4. **Any naming/API conventions** you'd like us to follow?

Happy to discuss any aspect of this. We've been enjoying working with CrystalFormer and would like to contribute back to the project.
