Commit 23ec3a9

add new feature
1 parent 291f039 commit 23ec3a9

1 file changed

+178
-86
lines changed

README.md

Lines changed: 178 additions & 86 deletions
@@ -1,116 +1,168 @@
1-
# pyXenium
1+
pyXenium
2+
========
23

3-
**pyXenium** is a Python library for loading and analyzing 10x Genomics Xenium *in situ* exports.
4-
It supports robust partial loading of incomplete exports and provides utilities for multi‑modal
5-
Xenium runs that include both RNA and protein measurements.
4+
pyXenium is a Python library for loading and analyzing **10x Genomics Xenium** in‑situ outputs.
5+
It supports **robust partial loading** of incomplete exports and provides utilities for **multi‑modal (RNA + Protein)** runs.
66

7-
> If you are already familiar with Xenium outputs, jump to:
8-
> - [Partial loading (incomplete exports)](#partial-loading-incomplete-exports)
9-
> - [RNA + Protein loader](#rna--protein-loader)
10-
> - [Gene–protein correlation](#gene–protein-correlation)
11-
> - [RNA/protein joint analysis](#rnaprotein-joint-analysis)
7+
Version: 0.1.1
128

13-
## Features
9+
---
1410

15-
- **Partial loading of incomplete exports** — Load what is available even when the
16-
`cell_feature_matrix` MEX is missing/partial; optional attachment of clusters and spatial centroids.
17-
- **RNA + Protein support** — Read combined cell-feature matrices, split features by type
18-
(Gene Expression vs Protein Expression), and return matched cell × gene/protein matrices.
19-
- **Protein–gene correlation** — Compute Pearson/Spearman correlations between gene expression
20-
and protein intensities across cells.
11+
Features
12+
--------
13+
- **Partial loading of incomplete exports** — assemble an `AnnData` even when some Xenium artifacts
14+
are missing; opportunistically attaches clusters (`analysis.zarr[.zip]`) and spatial centroids (`cells.zarr[.zip]`).
15+
- **RNA + Protein support** — read combined cell‑feature matrices from Zarr/HDF5/MEX, split features by type,
16+
and return matched cell × gene/protein data.
17+
- **Protein–gene spatial correlation** — compute correlations between protein intensity and gene transcript
18+
density across spatial bins; export plots and CSV summaries.
19+
- **Toy dataset included** — a minimal Xenium‑like dataset (`toy_slide`) to get started quickly.
2120

22-
## Installation
21+
Installation
22+
------------
23+
The package is organized as a standard `src/` layout. Until a PyPI release is available, install from source or Git:
2324

2425
```bash
25-
# From PyPI (if available)
26-
pip install pyXenium
27-
28-
# Or install the latest from GitHub
29-
pip install git+https://github.com/hutaobo/pyXenium.git
26+
# From GitHub (source)
27+
pip install "git+https://github.com/hutaobo/pyXenium.git"
3028
```
3129

32-
Python ≥3.9 is recommended.
30+
Requirements (typical): Python 3.9+; `anndata`, `numpy`, `pandas`, `scipy`, `zarr`, `fsspec`, `matplotlib`, `scikit-learn`, `click`.
31+
(Exact dependencies follow the project configuration and imports.)
3332

34-
## Quick start
33+
Quick Start
34+
-----------
3535

36-
### Partial loading (incomplete exports)
36+
### 1) Partial loading (incomplete exports)
3737

38-
`pyXenium.io.partial_xenium_loader.load_anndata_from_partial` tries to assemble an `AnnData`
39-
object from a Xenium export directory or HTTP(S) base. It attaches optional results when present
40-
(e.g. `analysis.zarr[.zip]` and `cells.zarr[.zip]`).
38+
Use `pyXenium.io.partial_xenium_loader.load_anndata_from_partial(...)` to assemble an `AnnData` from any available pieces.
4139

40+
**Local files example:**
4241
```python
4342
from pyXenium.io.partial_xenium_loader import load_anndata_from_partial
4443

45-
# Local export directory
4644
adata = load_anndata_from_partial(
47-
base_dir="/path/to/xenium_export",
48-
analysis_name="analysis.zarr", # optional
49-
cells_name="cells.zarr", # optional
45+
mex_dir="/path/to/xenium_export/cell_feature_matrix", # MEX triplet folder
46+
analysis_name="/path/to/xenium_export/analysis.zarr.zip", # optional
47+
cells_name="/path/to/xenium_export/cells.zarr.zip", # optional
48+
# transcripts_name="/path/to/xenium_export/transcripts.zarr.zip", # optional
5049
)
51-
52-
# Or remote base (files hosted under <BASE>/)
53-
# adata = load_anndata_from_partial(
54-
# base_url="https://example.org/xenium_run",
55-
# analysis_name="analysis.zarr.zip",
56-
# cells_name="cells.zarr.zip",
57-
# )
58-
print(adata) # cells × genes AnnData
50+
print(adata)
5951
```
6052

61-
**What gets loaded** (when available):
62-
- Counts from `cell_feature_matrix/{matrix.mtx.gz, features.tsv.gz, barcodes.tsv.gz}`
63-
- Clusters from `analysis.zarr[.zip]`
64-
- Spatial centroids from `cells.zarr[.zip]`
53+
**Remote base example:**
54+
```python
55+
adata = load_anndata_from_partial(
56+
base_url="https://example.org/xenium_run", # artifacts live under <base_url>/
57+
analysis_name="analysis.zarr.zip",
58+
cells_name="cells.zarr.zip",
59+
)
60+
```
6561

66-
If the MEX triplet is missing, the function still returns a valid `AnnData` (empty genes)
67-
and attaches clusters/spatial if found — useful for inspecting partial/early exports.
62+
Behavior:
63+
- If the MEX triplet is unavailable, the function still returns a valid `AnnData` (empty genes) and attaches
64+
clusters/spatial information when possible.
65+
- Zarr roots are auto‑detected inside `*.zarr.zip` even when the root metadata sits in a subfolder.
66+
67+
**Signature (summary):**
68+
```text
69+
load_anndata_from_partial(
70+
base_url: str | None = None,
71+
analysis_name: str | None = None,
72+
cells_name: str | None = None,
73+
transcripts_name: str | None = None,
74+
mex_dir: str | None = None,
75+
mex_matrix_name: str = "matrix.mtx.gz",
76+
mex_features_name: str = "features.tsv.gz",
77+
mex_barcodes_name: str = "barcodes.tsv.gz",
78+
build_counts_if_missing: bool = True,
79+
) -> anndata.AnnData
80+
```
6881

69-
### RNA + Protein loader
82+
### 2) RNA + Protein loader
7083

71-
Use the dedicated loader in `pyXenium.io.xenium_gene_protein_loader` to read a Xenium export
72-
that includes protein measurements. It separates features by type and aligns cells across modalities.
84+
Use `pyXenium.io.xenium_gene_protein_loader.load_xenium_gene_protein(...)` to load Xenium exports with protein measurements.
7385

7486
```python
7587
from pyXenium.io.xenium_gene_protein_loader import load_xenium_gene_protein
7688

7789
adata = load_xenium_gene_protein(
78-
base_path="/mnt/taobo.hu/long/10X_datasets/Xenium/Xenium_Kidney/Xenium_V1_Human_Kidney_FFPE_Protein"
90+
base_path="/path/to/xenium_export",
91+
prefer="auto", # auto | zarr | h5 | mex
7992
)
93+
# adata.X: RNA counts (CSR); adata.layers["rna"] may hold RNA counts explicitly
94+
# adata.obsm["protein"]: DataFrame of protein intensities
95+
# adata.obsm["spatial"]: cell centroids when available
8096
```
8197

8298
Notes:
83-
- The loader expects a combined MEX under `cell_feature_matrix/` where the 3rd column in
84-
`features.tsv.gz` indicates the feature type (e.g., `"Gene Expression"`, `"Protein Expression"`).
85-
- Invalid/control entries (e.g., blank/unassigned codewords) are filtered by default.
86-
- Both matrices share **identical cell order**, enabling 1:1 comparisons across modalities.
99+
- Supported matrix formats: Zarr (`cell_feature_matrix.zarr/` or `cell_feature_matrix/`), HDF5 (`cell_feature_matrix.h5`), or MEX (`matrix.mtx.gz` triplet).
100+
- Feature types are split using the 3rd column of `features.tsv.gz` (e.g., "Gene Expression", "Protein Expression").
101+
- Optionally attaches centroids/boundaries into `adata.obsm["spatial"]` and `adata.uns`.
102+
- If present, clustering results at `analysis/clustering/gene_expression_graphclust/clusters.csv` are merged into `adata.obs["cluster"]` by default.
103+
104+
**Signature (summary):**
105+
```text
106+
load_xenium_gene_protein(
107+
base_path: str,
108+
*,
109+
prefer: str = "auto", # "auto" | "zarr" | "h5" | "mex"
110+
mex_dirname: str = "cell_feature_matrix",
111+
mex_matrix_name: str = "matrix.mtx.gz",
112+
mex_features_name: str = "features.tsv.gz",
113+
mex_barcodes_name: str = "barcodes.tsv.gz",
114+
cells_csv: str = "cells.csv.gz",
115+
cells_parquet: str | None = None,
116+
read_morphology: bool = False,
117+
attach_boundaries: bool = True,
118+
clusters_relpath: str | None = "analysis/clustering/gene_expression_graphclust/clusters.csv",
119+
cluster_column_name: str = "cluster",
120+
) -> anndata.AnnData
121+
```
87122

88-
### Gene–protein correlation
123+
### 3) Protein–gene spatial correlation
89124

90-
Compute correlations between gene and protein across cells.
125+
`pyXenium.analysis.protein_gene_correlation.protein_gene_correlation(...)` computes Pearson correlations between
126+
**protein average intensity** and **gene transcript density** across spatial bins; it saves per‑pair figures and CSVs,
127+
plus a summary CSV.
91128

92129
```python
93-
BASE = "/mnt/taobo.hu/long/10X_datasets/Xenium/Xenium_Kidney/Xenium_V1_Human_Kidney_FFPE_Protein"
94-
pairs = [("CD3E", "CD3E"), ("E-Cadherin", "CDH1")] # (protein, gene)
95-
96130
from pyXenium.analysis.protein_gene_correlation import protein_gene_correlation
131+
132+
pairs = [("CD3E", "CD3E"), ("E-Cadherin", "CDH1")] # (protein, gene)
97133
summary = protein_gene_correlation(
98134
adata=adata,
99-
transcripts_zarr_path=BASE + "/transcripts.zarr.zip",
135+
transcripts_zarr_path="/path/to/transcripts.zarr.zip",
100136
pairs=pairs,
101137
output_dir="./protein_gene_corr",
102-
grid_size=(50, 50), # customizable grid
103-
pixel_size_um=0.2125, # common Xenium pixel size
138+
grid_size=(50, 50), # μm per bin (used if grid_counts is None)
139+
pixel_size_um=0.2125,
104140
qv_threshold=20,
105-
overwrite=False
141+
overwrite=False,
142+
auto_detect_cell_units=True,
106143
)
107-
print(summary)
144+
print(summary.head())
108145
```
109146

110-
### RNA/protein joint analysis
147+
**Signature (summary):**
148+
```text
149+
protein_gene_correlation(
150+
adata,
151+
transcripts_zarr_path,
152+
pairs,
153+
output_dir,
154+
grid_size=(50, 50),
155+
grid_counts=(50, 50),
156+
pixel_size_um=0.2125,
157+
qv_threshold=20,
158+
overwrite=False,
159+
auto_detect_cell_units=True,
160+
) -> pandas.DataFrame
161+
```
162+
163+
### 4) RNA/protein joint analysis
111164

112-
Cluster cells using RNA expression, then explain within-cluster protein
113-
heterogeneity by training neural network classifiers on the RNA latent space.
165+
Train small classifiers on the RNA latent space to explain within‑cluster protein heterogeneity:
114166

115167
```python
116168
from pyXenium.analysis import rna_protein_cluster_analysis
@@ -123,33 +175,73 @@ summary, models = rna_protein_cluster_analysis(
123175
min_cells_per_group=30,
124176
hidden_layer_sizes=(128, 64),
125177
)
126-
127-
# Inspect metrics for the first few cluster × protein combinations
128178
print(summary.head())
179+
```
129180

130-
# Retrieve the fitted model for a specific cluster and protein
131-
podocin_model = models["cluster_3"]["Podocin"]
132-
print(podocin_model.test_accuracy)
181+
**Signature (summary):**
182+
```text
183+
rna_protein_cluster_analysis(
184+
adata: anndata.AnnData,
185+
*,
186+
n_clusters: int = 12,
187+
n_pcs: int = 30,
188+
cluster_key: str = "rna_cluster",
189+
random_state: int | None = 0,
190+
target_sum: float = 1e4,
191+
min_cells_per_cluster: int = 50,
192+
min_cells_per_group: int = 20,
193+
protein_split_method: str = "median",
194+
protein_quantile: float = 0.75,
195+
test_size: float = 0.2,
196+
hidden_layer_sizes: tuple[int, ...] = (64, 32),
197+
max_iter: int = 200,
198+
early_stopping: bool = True,
199+
) -> tuple[pandas.DataFrame, dict]
133200
```
134201

135-
## Data format expectations
202+
Command‑line
203+
------------
136204

137-
- **Cell-feature matrix (MEX)** under `cell_feature_matrix/`:
138-
- `matrix.mtx.gz`: sparse counts/intensities
139-
- `features.tsv.gz`: 3 columns: `id`, `name`, `feature_type`
140-
- `barcodes.tsv.gz`: cell barcodes (one per row)
141-
- **Optional**: `analysis.zarr[.zip]` (clusters), `cells.zarr[.zip]` (spatial centroids)
205+
A small CLI is provided via `python -m pyXenium` (requires `click`).
142206

143-
## API reference (summary)
207+
```bash
208+
# Print a quick sanity check on the toy dataset
209+
python -m pyXenium demo
144210

145-
- `pyXenium.io.partial_xenium_loader.load_anndata_from_partial(base_dir=None, base_url=None, mex_dir=None, analysis_name=None, cells_name=None)`
146-
- `pyXenium.io.xenium_gene_protein_loader.load_gene_protein(base_dir, mex_dir=None, drop_controls=True)`
147-
- `pyXenium.analysis.protein_gene_correlation.compute(gene_expr, protein_expr, method='pearson')`
211+
# Fetch a toy dataset to a cache directory
212+
python -m pyXenium datasets --name toy_slide --dest ~/.cache/pyXenium
213+
```
214+
215+
Data layout expectations
216+
------------------------
217+
- **cell_feature_matrix/**
218+
`matrix.mtx.gz`, `features.tsv.gz` (≥3 columns: id, name, feature_type), `barcodes.tsv.gz`
219+
- Optional: `analysis.zarr[.zip]` (clusters), `cells.zarr[.zip]` (spatial centroids)
220+
- `transcripts.zarr[.zip]` for spatial transcript coordinates used in correlation analyses.
221+
222+
Minimal API reference (index)
223+
-----------------------------
224+
- `pyXenium.io.partial_xenium_loader.load_anndata_from_partial(...)`
225+
- `pyXenium.io.xenium_gene_protein_loader.load_xenium_gene_protein(...)`
226+
- `pyXenium.analysis.protein_gene_correlation.protein_gene_correlation(...)`
227+
- `pyXenium.analysis.rna_protein_cluster_analysis.rna_protein_cluster_analysis(...)`
148228

149-
## Contributing
229+
Example data
230+
------------
231+
The package ships with a tiny Xenium‑like toy dataset. Programmatic access:
150232

151-
Issues and pull requests are welcome. Please include minimal examples and tests where possible.
233+
```python
234+
from pyXenium.io.io import load_toy
235+
z = load_toy()
236+
cells = z["cells"] # zarr group
237+
transcripts = z["transcripts"]
238+
analysis = z["analysis"]
239+
```
152240

153-
## License
241+
Citations
242+
---------
243+
If this toolkit helps your work, please cite the project and the 10x Genomics Xenium platform as appropriate.
154244

155-
MIT. See `LICENSE`.
245+
License
246+
-------
247+
All rights reserved by the author.
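
Note on the "Protein–gene spatial correlation" section added in this commit: the README describes the computation only in words (Pearson correlation between protein average intensity and gene transcript density across spatial bins). The block below is a minimal NumPy sketch of that idea, not pyXenium's implementation. The function name `binned_pearson` and all of its parameters are hypothetical, and options exposed by the real function (`qv_threshold`, `pixel_size_um`, `auto_detect_cell_units`, file output) are not modeled here.

```python
import numpy as np


def binned_pearson(cell_xy, protein, tx_xy, extent, bins=(50, 50)):
    """Pearson r between per-bin mean protein intensity and per-bin transcript count.

    Hypothetical helper for illustration only (not part of pyXenium).
    cell_xy: (n_cells, 2) cell centroids; protein: (n_cells,) intensities;
    tx_xy: (n_transcripts, 2) transcript coordinates for one gene;
    extent: ((xmin, xmax), (ymin, ymax)) in the same units as the coordinates.
    """
    (xmin, xmax), (ymin, ymax) = extent
    rng = [[xmin, xmax], [ymin, ymax]]

    # Transcript count per bin; with equal-area bins this is proportional to density.
    tx_count, _, _ = np.histogram2d(tx_xy[:, 0], tx_xy[:, 1], bins=bins, range=rng)

    # Summed protein intensity and number of cells per bin.
    prot_sum, _, _ = np.histogram2d(
        cell_xy[:, 0], cell_xy[:, 1], bins=bins, range=rng, weights=protein
    )
    n_cells, _, _ = np.histogram2d(cell_xy[:, 0], cell_xy[:, 1], bins=bins, range=rng)

    # Only bins that contain at least one cell have a defined mean intensity.
    has_cells = n_cells > 0
    prot_mean = prot_sum[has_cells] / n_cells[has_cells]
    density = tx_count[has_cells]

    # Pearson r is undefined for fewer than two bins or a constant vector.
    if prot_mean.size < 2 or prot_mean.std() == 0 or density.std() == 0:
        return float("nan")
    return float(np.corrcoef(prot_mean, density)[0, 1])


# Tiny synthetic example: a 2 x 2 grid with one cell per bin.
cell_xy = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]])
protein = np.array([1.0, 2.0, 3.0, 4.0])
# Place 1, 2, 3 and 4 transcripts in the same bins as the four cells.
tx_xy = np.repeat(cell_xy, [1, 2, 3, 4], axis=0)

r = binned_pearson(cell_xy, protein, tx_xy, extent=((0, 2), (0, 2)), bins=(2, 2))
print(r)
```

Bins without cells are dropped rather than treated as zero intensity. That is a design choice of this sketch, and the library may handle empty bins differently.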
