hutaobo
diff --git a/manuscript/availability_and_implementation_paragraph.md b/manuscript/availability_and_implementation_paragraph.md
Lines changed: 1 addition & 0 deletions
diff --git a/manuscript/bioinformatics_application_note_draft.md b/manuscript/bioinformatics_application_note_draft.md
Lines changed: 75 additions & 0 deletions
diff --git a/manuscript/cover_note.md b/manuscript/cover_note.md
Lines changed: 71 additions & 0 deletions
diff --git a/manuscript/evidence/loader_auto_structure.json b/manuscript/evidence/loader_auto_structure.json
Lines changed: 57 additions & 0 deletions
diff --git a/manuscript/evidence/partial_loader_mex_only.json b/manuscript/evidence/partial_loader_mex_only.json
Lines changed: 29 additions & 0 deletions
diff --git a/manuscript/evidence/pytest_q.txt b/manuscript/evidence/pytest_q.txt
Lines changed: 8 additions & 0 deletions
diff --git a/manuscript/evidence/smoke_auto/largest_clusters.csv b/manuscript/evidence/smoke_auto/largest_clusters.csv
Lines changed: 6 additions & 0 deletions
diff --git a/manuscript/evidence/smoke_auto/report.md b/manuscript/evidence/smoke_auto/report.md
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1 @@
+pyXenium is implemented in Python and uses `anndata`, `numpy`, `pandas`, `scipy`, `scikit-learn`, `zarr`, `fsspec`, `requests`, `aiohttp` and `click` according to `pyproject.toml`. The current repository version is `0.1.0`, and the declared Python requirement is `>=3.8`. Source code is available at `https://github.com/hutaobo/pyXenium` and may also be distributed through [PyPI URL placeholder]. Documentation source files are present under `docs/`; the deployed documentation URL should be inserted here as [documentation URL placeholder]. The current license is `LicenseRef-Proprietary-NonCommercial`, which permits non-commercial source use; if a different release license is chosen before submission, this sentence should be updated accordingly. Operating-system support should be stated explicitly at submission as [validated platform statement placeholder].
@@ -0,0 +1,75 @@
+# pyXenium: robust loading and multimodal analysis of 10x Xenium outputs
+
+[Author 1]^1, [Author 2]^1,* and [Author 3]^2
+
+^1[Affiliation placeholder]
+
+^2[Affiliation placeholder]
+
+*To whom correspondence should be addressed.
+
+## Abstract
+
+**Summary:** 10x Genomics Xenium outputs combine sparse count matrices, cell tables, clustering results and optional boundary files, and practical analyses often begin with exports that are incomplete or stored in different matrix backends. pyXenium is a Python package for loading these outputs into `AnnData` while preserving modality separation and spatial annotations. In a validated smoke test on the public Xenium FFPE human renal cell carcinoma RNA+Protein dataset, pyXenium recovered 465,545 cells, 405 RNA features and 27 protein markers, reproduced the reported detected-cell count, and attached spatial coordinates and cluster labels under both automatic and explicit HDF5 loading. A separate partial loader supports counts-first recovery from incomplete exports and degrades to structured metadata rather than immediate failure when key artifacts are absent.
+
+**Availability and implementation:** Implemented in Python. Source code: `https://github.com/hutaobo/pyXenium`. Package index: [PyPI URL placeholder if published]. Documentation: [documentation URL placeholder]. Current repository version: `0.1.0`. License: `LicenseRef-Proprietary-NonCommercial`.
+
+**Contact:** [corresponding.author@institution.edu]
+
+**Supplementary information:** Figure-generation code and validation outputs for this draft are stored under `manuscript/` in the repository. Additional supplementary information: [supplementary materials placeholder].
+
+## 1 Introduction
+
+10x Genomics Xenium experiments produce multiple output components rather than a single analysis-ready table. A typical run may include a cell-feature matrix in Zarr, HDF5 or MEX form, a per-cell table, clustering results, spatial centroid information and optional boundary files. Downstream single-cell and spatial analysis workflows, however, usually begin from a single `AnnData`-like object. In practice, loading Xenium data is therefore not only a file-parsing step but also an object-reconstruction step.
+
+Two practical problems motivated pyXenium. First, Xenium RNA+Protein experiments need explicit separation of RNA counts from protein measurements while keeping both modalities aligned at the cell level. Second, users often work with incomplete exports, copied subsets of a run, or archives in which only part of the expected directory structure is available. Under those conditions, a loader that assumes one exact file layout can fail before any analysis begins.
+
+pyXenium addresses these problems as an engineering and reproducibility contribution rather than as a new statistical method. The package provides a multimodal Xenium loader, a second loader for partial exports, a small command-line interface, a bundled toy dataset and optional downstream modules for protein-gene spatial correlation and RNA/protein joint analysis. This application note focuses on the loading and validation layers of the repository, because those are the components directly supported by real-data smoke testing and by the current test suite (Fig. 1).
+
+## 2 Implementation
+
+The public API exposed in `src/pyXenium` centers on two loader functions. `load_xenium_gene_protein` is designed for Xenium RNA+Protein outputs. It searches for a usable `cell_feature_matrix` in Zarr, HDF5 or MEX format, with `prefer="auto"` trying Zarr first, then HDF5, then MEX. The loader reads the matrix, inspects the `feature_type` annotation and splits the resulting features into RNA and protein modalities. RNA counts are stored in `adata.X` and mirrored in `adata.layers["rna"]`; protein measurements are stored as a per-cell `DataFrame` in `adata.obsm["protein"]`.
+
+The same loader then enriches the object with run-level metadata. It reindexes the cell table to Xenium barcodes, reads clustering assignments from `analysis/clustering/gene_expression_graphclust/clusters.csv` when available, stores them in `adata.obs["cluster"]`, and adds centroid coordinates to `adata.obsm["spatial"]` when centroid columns are present in the cell table. If `cell_boundaries.csv.gz` or `nucleus_boundaries.csv.gz` exists, the corresponding raw boundary table is attached in `adata.uns`. The loader also records modality metadata in `adata.uns["modality"]`, including the protein value type `"scaled_mean_intensity"`.
+
+`load_anndata_from_partial` addresses incomplete exports. This entry point can combine any available MEX triplet with optional `analysis.zarr[.zip]`, `cells.zarr[.zip]` and `transcripts.zarr[.zip]` inputs. The implementation includes a ZIP-aware Zarr opener that detects nested Zarr roots inside archives, supports both local and remote paths, and assembles an `AnnData` object even when optional attachments are missing. When a MEX triplet is available, the function returns a counts-bearing object with feature metadata. When the MEX triplet is absent and `build_counts_if_missing=True`, it returns an empty `AnnData` together with parsed `analysis` and `cells` summaries in `adata.uns`, allowing callers to inspect what was available rather than receiving an immediate hard failure.
+
+The repository includes additional reproducibility infrastructure around these loaders. `pyXenium.validation.renal_ffpe_protein` implements a smoke-test workflow for a public 10x Genomics Xenium FFPE human renal cell carcinoma RNA+Protein dataset. The smoke test can be called directly from `examples/smoke_test_10x_renal_ffpe_protein.py` or through the CLI command `pyxenium validate-renal-ffpe-protein` (equivalently `python -m pyXenium validate-renal-ffpe-protein`). Both routes produce machine-readable JSON and optional Markdown/CSV summaries. The package also ships a tiny bundled `toy_slide` dataset and CLI commands for demonstration and dataset copying.
+
+## 3 Validation and use case
+
+We validated the main multimodal loader on the public 10x Genomics dataset `Xenium In Situ Gene and Protein Expression data for FFPE Human Renal Cell Carcinoma`, using the repository smoke-test workflow. The smoke test was executed locally against a downloaded copy of the dataset with `prefer="auto"` and again with `prefer="h5"`. Both runs produced the same summary: 465,545 cells, 405 RNA features, 27 protein markers and 16,454,170 non-zero RNA matrix entries. In both runs, `adata.obsm["spatial"]` and `adata.obs["cluster"]` were present, `metrics_summary.csv` reported `num_cells_detected=465545`, and the validation payload contained no issues. This agreement matters because the default path and the explicit HDF5 path exercise different loader branches while arriving at the same cell, feature and metadata totals.
+
+Direct inspection of the loaded object confirmed the structure expected by downstream workflows. On the validated renal dataset, `load_xenium_gene_protein` returned an `AnnData` object of shape 465,545 x 405, with `adata.layers["rna"]`, `adata.obsm["protein"]` and `adata.obsm["spatial"]` present. The spatial matrix had shape 465,545 x 2. `adata.obs["cluster"]` was categorical, and both `cell_boundaries` and `nucleus_boundaries` were attached under `adata.uns`. The loader therefore preserved not only the RNA matrix but also protein measurements, spatial centroids, clustering assignments and raw boundary tables in one aligned object.
+
+We also evaluated the partial-loading path in two conditions. First, when `load_anndata_from_partial` was given only the real dataset `cell_feature_matrix` MEX directory and no `cells` or `analysis` attachments, it still returned an `AnnData` object of shape 465,545 x 543 with a `counts` layer and feature annotations. The recovered features comprised 405 gene-expression rows, 27 protein-expression rows and 111 control or unassigned rows, showing that a counts-first workflow can proceed even when optional spatial or clustering files are unavailable. Second, on the bundled toy dataset with no MEX triplet, the same function returned an empty `AnnData` but still populated `adata.uns["analysis"]` and `adata.uns["cells"]`, demonstrating the intended metadata-preserving fallback behavior for severely partial exports.
+
+Repository-level checks support these data-level validations. In the current local repository state used for this manuscript draft, `pytest -q` collected and passed six tests. These tests cover the demo CLI, the toy dataset loader, bundled dataset copying, the public dataset catalog, the smoke-report rendering helper and the CLI wrapper for the renal FFPE validation command. Taken together, the real-data smoke test, the partial-loader checks and the passing test suite support a conservative claim: pyXenium provides a reproducible way to turn Xenium outputs into analysis-ready Python objects, with explicit support for RNA+Protein data and for incomplete-export recovery.
+
+## 4 Availability and implementation
+
+pyXenium is implemented in Python and uses `anndata`, `numpy`, `pandas`, `scipy`, `scikit-learn`, `zarr`, `fsspec`, `requests`, `aiohttp` and `click` according to `pyproject.toml`. The current repository version is `0.1.0`, and the declared Python requirement is `>=3.8`. Source code is available at `https://github.com/hutaobo/pyXenium` and may also be distributed through [PyPI URL placeholder]. Documentation source files are present under `docs/`; the deployed documentation URL should be inserted here as [documentation URL placeholder]. The current license is `LicenseRef-Proprietary-NonCommercial`, which permits non-commercial source use; if a different release license is chosen before submission, this sentence should be updated accordingly. Operating-system support should be stated explicitly at submission as [validated platform statement placeholder].
+
+## Figure Legends
+
+**Figure 1. Evidence-backed summary of pyXenium loading and validation.**  
+**(A)** Real-data smoke-test summary for the public 10x Genomics FFPE human renal cell carcinoma Xenium RNA+Protein dataset. The automatic loader path and explicit HDF5 path produced identical summaries, recovering 465,545 cells, 405 RNA features and 27 protein markers, while preserving spatial coordinates and cluster labels and returning no validation issues.  
+**(B)** Top five RNA features by total counts in the validated `prefer="auto"` smoke test. Bars show total counts in millions, and text annotations report the number of cells with non-zero counts for each feature.  
+**(C)** Top five protein markers by mean signal in the validated `prefer="auto"` smoke test. Bars show mean protein signal, and text annotations report the number of positive cells for each marker.  
+**(D)** Additional evidence from the same repository validation workflow. Left: sizes of the five largest graph-based clusters recovered from the validated renal dataset. Right: feature-type composition of the real MEX-only partial load, which returned a 465,545 x 543 counts object containing gene-expression, protein-expression and control features without requiring optional spatial or clustering attachments.
+
+## Funding
+
+This work was supported by [Funding information placeholder].
+
+## Conflict of Interest
+
+Conflict of Interest: [Authors to complete. If none, use "none declared."]
+
+## References
+
+1. 10x Genomics. Xenium In Situ Gene and Protein Expression data for FFPE Human Renal Cell Carcinoma. Available at: https://www.10xgenomics.com/datasets/xenium-protein-ffpe-human-renal-carcinoma
+2. [Xenium platform and file-format reference placeholder]
+3. [AnnData citation placeholder]
+4. [scikit-learn citation placeholder if retained]
+5. [Additional spatial transcriptomics context reference placeholder]
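Review note on Section 2 of the draft: the `prefer="auto"` search order (Zarr, then HDF5, then MEX) may be easier to follow with a small sketch. The function name `resolve_backend`, the candidate file names and the `"zarr"`/`"mex"` option strings are assumptions made for illustration; only `prefer="auto"` and `prefer="h5"` appear in the draft, and this is not the pyXenium implementation.

```python
from pathlib import Path
import tempfile

# Hypothetical candidate names per backend, listed in the "auto" order the
# draft describes (Zarr, then HDF5, then MEX). The file names are assumptions.
CANDIDATES = {
    "zarr": ["cell_feature_matrix.zarr", "cell_feature_matrix.zarr.zip"],
    "h5": ["cell_feature_matrix.h5"],
    "mex": ["cell_feature_matrix"],
}


def resolve_backend(run_dir, prefer="auto"):
    """Return (backend, path) for the first matrix found, or raise."""
    order = list(CANDIDATES) if prefer == "auto" else [prefer]
    for backend in order:
        for name in CANDIDATES[backend]:
            path = Path(run_dir) / name
            # A MEX export is a directory holding the matrix triplet.
            if backend == "mex" and not path.is_dir():
                continue
            if path.exists():
                return backend, path
    raise FileNotFoundError(f"no cell_feature_matrix found for prefer={prefer!r}")


# Usage: a run directory that holds only the HDF5 and MEX exports.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cell_feature_matrix.h5").touch()
    (Path(tmp) / "cell_feature_matrix").mkdir()
    auto_choice = resolve_backend(tmp)[0]  # Zarr is absent, so HDF5 is picked
    mex_choice = resolve_backend(tmp, prefer="mex")[0]
```

Keeping the order in one mapping means the automatic path and an explicit preference share the same lookup code, which is one way to make the two smoke-test runs comparable.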
@@ -0,0 +1,71 @@
+# Cover Note
+
+## Evidence used
+
+- Repository inspection:
+  - `README.md`
+  - `pyproject.toml`
+  - `src/pyXenium/io/partial_xenium_loader.py`
+  - `src/pyXenium/io/xenium_gene_protein_loader.py`
+  - `src/pyXenium/validation/renal_ffpe_protein.py`
+  - `src/pyXenium/__main__.py`
+  - `src/pyXenium/datasets/catalog.py`
+  - `tests/`
+  - `docs/`
+- Real-data runs performed locally on the public 10x FFPE renal carcinoma Xenium RNA+Protein dataset:
+  - smoke test with `prefer="auto"`
+  - smoke test with `prefer="h5"`
+  - direct object inspection with `load_xenium_gene_protein`
+  - `load_anndata_from_partial` on the real `cell_feature_matrix` MEX directory
+- Additional code-behavior check:
+  - `load_anndata_from_partial` on the bundled toy dataset without MEX
+- Reproducibility checks:
+  - `pytest --collect-only -q` collected 6 tests
+  - `pytest -q` passed locally with 6 tests
+
+## Claims strongly supported
+
+- pyXenium can load the validated renal FFPE Xenium RNA+Protein dataset into `AnnData`.
+- The `auto` and explicit `h5` loader paths recovered the same validated summary:
+  - `465545` cells
+  - `405` RNA features
+  - `27` protein markers
+  - `16454170` RNA non-zero entries
+  - spatial coordinates present
+  - cluster labels present
+  - no smoke-test issues
+- The validated loaded object contains:
+  - `adata.layers["rna"]`
+  - `adata.obsm["protein"]`
+  - `adata.obsm["spatial"]`
+  - categorical `adata.obs["cluster"]`
+  - `cell_boundaries` and `nucleus_boundaries` in `adata.uns`
+- `load_anndata_from_partial` can recover a counts object from the real MEX directory alone:
+  - shape `465545 x 543`
+  - counts layer present
+  - feature metadata preserved
+  - optional spatial and clustering attachments not required
+- The repository currently includes a bundled toy dataset, a validation module, a validation CLI command and a passing local pytest run with 6 tests.
+
+## Claims intentionally avoided
+
+- I did not claim algorithmic novelty beyond software robustness and data handling, because the strongest direct evidence is for loading, validation and reproducibility rather than for a new statistical method.
+- I did not claim runtime or memory advantages, because no benchmark suite for those quantities was generated here.
+- I did not claim biological findings from the renal dataset beyond the observed loaded counts, top-feature summaries and presence of metadata, because this manuscript draft is positioned as a software note.
+- I did not describe the current license as open source, because the repository metadata states `LicenseRef-Proprietary-NonCommercial`.
+
+## Metadata to verify before submission
+
+- Author list, affiliations and corresponding author email
+- Final GitHub, PyPI and documentation URLs
+- Whether to archive the release on Zenodo and insert a DOI
+- Final package version to cite in the manuscript
+- Funding statement
+- Conflict-of-interest statement
+- Preferred software/data references
+- Operating-system support statement for the availability paragraph
+
+## Submission optics to consider
+
+- `pyproject.toml` currently describes pyXenium as `"A toy Python package for analyzing 10x Xenium data."` This wording is not suitable for manuscript or release metadata and should be revised before submission.
+- The current non-commercial source-available license is compatible with the manuscript's factual description, but it may read less favorably than a standard open-source license in reviewer and editor assessment. If the license changes, update the manuscript availability paragraph accordingly.
@@ -0,0 +1,57 @@
+{
+  "shape": [
+    465545,
+    405
+  ],
+  "layers": [
+    "rna"
+  ],
+  "obsm_keys": [
+    "protein",
+    "spatial"
+  ],
+  "uns_keys_subset": [
+    "cell_boundaries",
+    "modality",
+    "nucleus_boundaries"
+  ],
+  "obs_columns_sample": [
+    "x_centroid",
+    "y_centroid",
+    "transcript_counts",
+    "control_probe_counts",
+    "genomic_control_counts",
+    "control_codeword_counts",
+    "unassigned_codeword_counts",
+    "deprecated_codeword_counts",
+    "total_counts",
+    "cell_area",
+    "nucleus_area",
+    "nucleus_count"
+  ],
+  "has_cluster": true,
+  "cluster_dtype": "category",
+  "protein_columns_sample": [
+    "PD-1",
+    "VISTA",
+    "PD-L1",
+    "LAG-3",
+    "CD16",
+    "GranzymeB",
+    "CD163",
+    "CD4"
+  ],
+  "spatial_shape": [
+    465545,
+    2
+  ],
+  "modality_uns": {
+    "rna": {
+      "feature_type": "Gene Expression"
+    },
+    "protein": {
+      "feature_type": "Protein Expression",
+      "value": "scaled_mean_intensity"
+    }
+  }
+}
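Review note: the structural claims in the draft (spatial matrix of shape n_cells x 2, an `rna` layer, `protein` and `spatial` in `obsm`, categorical clusters) can be checked mechanically against this evidence file. The sketch below uses only the standard library; `check_structure` is a hypothetical helper, and the embedded JSON is a subset of the file shown above.

```python
import json

# Subset of manuscript/evidence/loader_auto_structure.json, copied from the
# hunk above so the example is self-contained.
EVIDENCE = json.loads("""
{
  "shape": [465545, 405],
  "layers": ["rna"],
  "obsm_keys": ["protein", "spatial"],
  "has_cluster": true,
  "cluster_dtype": "category",
  "spatial_shape": [465545, 2]
}
""")


def check_structure(payload):
    """Return a list of problems; an empty list means the payload is consistent."""
    problems = []
    n_cells = payload["shape"][0]
    if payload["spatial_shape"] != [n_cells, 2]:
        problems.append("spatial matrix is not n_cells x 2")
    if "rna" not in payload["layers"]:
        problems.append("missing rna layer")
    for key in ("protein", "spatial"):
        if key not in payload["obsm_keys"]:
            problems.append(f"missing obsm key: {key}")
    if not payload["has_cluster"] or payload["cluster_dtype"] != "category":
        problems.append("cluster labels missing or not categorical")
    return problems


problems = check_structure(EVIDENCE)
```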
@@ -0,0 +1,29 @@
+{
+  "shape": [
+    465545,
+    543
+  ],
+  "layers": [
+    "counts"
+  ],
+  "has_spatial": false,
+  "has_cluster": false,
+  "uns_keys": [
+    "io"
+  ],
+  "var_columns": [
+    "feature_id",
+    "feature_name",
+    "feature_type",
+    "feature_types",
+    "total_counts"
+  ],
+  "feature_type_counts": {
+    "Gene Expression": 405,
+    "Negative Control Codeword": 41,
+    "Unassigned Codeword": 35,
+    "Protein Expression": 27,
+    "Negative Control Probe": 20,
+    "Genomic Control": 15
+  }
+}
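Review note: the draft states that the MEX-only load comprised 405 gene-expression rows, 27 protein-expression rows and 111 control or unassigned rows. A minimal sketch of splitting features by `feature_type`, using the counts recorded in this evidence file, reproduces those totals. `partition_features` is a hypothetical helper and does not mirror pyXenium internals.

```python
# feature_type_counts from manuscript/evidence/partial_loader_mex_only.json.
FEATURE_TYPE_COUNTS = {
    "Gene Expression": 405,
    "Negative Control Codeword": 41,
    "Unassigned Codeword": 35,
    "Protein Expression": 27,
    "Negative Control Probe": 20,
    "Genomic Control": 15,
}


def partition_features(feature_types):
    """Group row indices into RNA, protein and everything else."""
    groups = {"rna": [], "protein": [], "control_or_unassigned": []}
    for index, feature_type in enumerate(feature_types):
        if feature_type == "Gene Expression":
            groups["rna"].append(index)
        elif feature_type == "Protein Expression":
            groups["protein"].append(index)
        else:
            groups["control_or_unassigned"].append(index)
    return groups


# Expand the counts into one feature_type label per row, as in a var table.
labels = [ft for ft, n in FEATURE_TYPE_COUNTS.items() for _ in range(n)]
sizes = {name: len(rows) for name, rows in partition_features(labels).items()}
```

Here `sizes` holds 405 RNA rows, 27 protein rows and 111 other rows, which sum to the 543 columns reported for the partial load.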
@@ -0,0 +1,8 @@
+......                                                                   [100%]
+============================== warnings summary ===============================
+C:\Users\taobo.hu\AppData\Local\miniconda3\Lib\site-packages\anndata\_settings.py:16
+  C:\Users\taobo.hu\AppData\Local\miniconda3\Lib\site-packages\anndata\_settings.py:16: DeprecationWarning: anndata will no longer support zarr v2 in the near future. Please prepare to upgrade to zarr>=3.
+    from .compat import is_zarr_v2, old_positionals
+
+-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
+6 passed, 1 warning in 5.77s
@@ -0,0 +1,6 @@
+cluster,n_cells
+1,87757
+2,67261
+3,59896
+4,53975
+5,35331
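Review note: a reader of Figure 1D may want to know how much of the dataset the five largest clusters cover. The sketch below parses the CSV above with the standard library and relates the sum to the 465,545 cells reported in the smoke test; the variable names are illustrative only.

```python
import csv
import io

# Contents of manuscript/evidence/smoke_auto/largest_clusters.csv.
CSV_TEXT = """cluster,n_cells
1,87757
2,67261
3,59896
4,53975
5,35331
"""

TOTAL_CELLS = 465545  # cell count reported by the smoke test

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
top5_cells = sum(int(row["n_cells"]) for row in rows)
top5_share = top5_cells / TOTAL_CELLS  # fraction of all cells in these clusters
```

The five clusters hold 304,220 cells, or roughly 65% of the dataset.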
@@ -0,0 +1,50 @@
+# pyXenium Smoke Test Report
+
+Dataset: Xenium In Situ Gene and Protein Expression data for FFPE Human Renal Cell Carcinoma
+Source: https://www.10xgenomics.com/datasets/xenium-protein-ffpe-human-renal-carcinoma
+Local path: `Y:/long/10X_datasets/Xenium/Xenium_Renal/Xenium_V1_Human_Kidney_FFPE_Protein`
+Backend preference: `auto`
+
+## Core Results
+
+- Cells: `465545`
+- RNA features: `405`
+- Protein markers: `27`
+- Sparse matrix nnz: `16454170`
+- Spatial coordinates present: `True`
+- Cluster labels present: `True`
+- metrics_summary.csv detected cells: `465545`
+
+## Validated Reference
+
+- Expected cells: `465545`
+- Expected RNA features: `405`
+- Expected protein markers: `27`
+
+## Largest Clusters
+
+- `1`: `87757` cells
+- `2`: `67261` cells
+- `3`: `59896` cells
+- `4`: `53975` cells
+- `5`: `35331` cells
+
+## Top RNA Features by Total Counts
+
+- `VIM`: total counts `9627261`, detected cells `438708`
+- `HLA-DRA`: total counts `3621954`, detected cells `382120`
+- `HLA-DRB1`: total counts `980650`, detected cells `278995`
+- `CXCL6`: total counts `963037`, detected cells `128336`
+- `FCGR3A`: total counts `943799`, detected cells `213534`
+
+## Top Protein Markers by Mean Signal
+
+- `Vimentin`: mean signal `234.7700`, positive cells `455851`
+- `CD45`: mean signal `206.9365`, positive cells `446921`
+- `PTEN`: mean signal `149.3354`, positive cells `464946`
+- `CD3E`: mean signal `142.4783`, positive cells `285619`
+- `CD68`: mean signal `120.7367`, positive cells `244801`
+
+## Issues
+
+- No issues detected.
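Review note: the "Issues" section of this report is the result of comparing observed values against a validated reference. A minimal sketch of such a comparison follows, assuming a flat dictionary payload; the key names and the `find_issues` helper are hypothetical and are not taken from `pyXenium.validation.renal_ffpe_protein`.

```python
# Values copied from the report above; the key names are assumptions.
EXPECTED = {"cells": 465545, "rna_features": 405, "protein_markers": 27}
OBSERVED = {
    "cells": 465545,
    "rna_features": 405,
    "protein_markers": 27,
    "nnz": 16454170,
    "has_spatial": True,
    "has_cluster": True,
    "metrics_detected_cells": 465545,
}


def find_issues(observed, expected):
    """Return human-readable issues; an empty list means the run validated."""
    issues = []
    for key, want in expected.items():
        got = observed.get(key)
        if got != want:
            issues.append(f"{key}: expected {want}, got {got}")
    if observed.get("metrics_detected_cells") != observed.get("cells"):
        issues.append("cell count disagrees with metrics_summary.csv")
    for flag in ("has_spatial", "has_cluster"):
        if not observed.get(flag):
            issues.append(f"{flag} is False")
    return issues


issues = find_issues(OBSERVED, EXPECTED)
```

Running the same function on the `prefer="auto"` and `prefer="h5"` payloads would give a direct way to confirm that both loader branches agree with the reference.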