Improve packaging and document validated 10x dataset

hutaobo · hutaobo · commit 588a1241b54d · 2026-04-09T16:22:56.000+02:00
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -8,5 +8,7 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: "3.11"
-      - run: pip install -U "pyXenium>=0.1.0" pytest
-      - run: pytest
+      - run: python -m pip install -U pip
+      - run: pip install -e ".[dev]"
+      - run: pytest -q
+      - run: python -m build
diff --git a/LICENSE b/LICENSE
@@ -1,3 +1,43 @@
-All rights reserved by the authors.
+pyXenium Non-Commercial License
 
-Copyright (c) 2025
+Copyright (c) 2025 Taobo Hu. All rights reserved.
+
+This software and associated documentation files (the "Software") are
+proprietary and are licensed, not sold.
+
+Permission is granted to use, reproduce, modify, and redistribute the Software
+solely for non-commercial purposes, subject to the following conditions:
+
+1. You must retain this license text, copyright notice, and all existing
+   attribution notices in any copy of the Software or substantial portion of
+   the Software.
+2. Any modified version that you share must be clearly marked as modified.
+3. You may not use the Software, or any derivative work of the Software, for
+   any commercial purpose without prior written permission from Taobo Hu.
+4. You may not sublicense or impose terms that expand the permissions granted
+   by this license.
+5. No trademark, patent, or other intellectual property rights are granted
+   except for the limited copyright license expressly stated here.
+
+For purposes of this license, "commercial purpose" includes any use that is
+primarily intended for or directed toward commercial advantage or monetary
+compensation. Commercial purpose includes, without limitation:
+
+- selling, licensing, sublicensing, or distributing the Software for a fee;
+- providing the Software, or a service substantially based on the Software, to
+  third parties for a fee or other commercial benefit;
+- using the Software to operate, support, or develop a product or service that
+  is sold, licensed, hosted, or otherwise commercialized;
+- internal use by or for a for-profit entity in connection with revenue-
+  generating activity, client work, consulting, or managed services.
+
+If you need commercial rights, you must obtain prior written permission from
+the copyright holder.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND NON-INFRINGEMENT. IN NO EVENT
+SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY CLAIM, DAMAGES,
+OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE,
+ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
@@ -4,8 +4,6 @@ pyXenium
 pyXenium is a Python library for loading and analyzing **10x Genomics Xenium** in‑situ outputs.
 It supports **robust partial loading** of incomplete exports and provides utilities for **multi‑modal (RNA + Protein)** runs.
 
-Version: 0.1.1
-
 ---
 
 Features
@@ -20,16 +18,39 @@ Features
 
 Installation
 ------------
-The package is organized as a standard `src/` layout. Until a PyPI release is available, install from source or Git:
+Install from PyPI or directly from GitHub:
 
 ```bash
+# From PyPI
+pip install pyXenium
+
 # From GitHub (source)
 pip install "git+https://github.com/hutaobo/pyXenium.git"
 ```
 
 Requirements (typical): Python 3.9+; `anndata`, `numpy`, `pandas`, `scipy`, `zarr`, `fsspec`, `matplotlib`, `scikit-learn`, `click`.
 (Exact dependencies follow the project configuration and imports.)
 
+Validated Public Dataset
+------------------------
+pyXenium has been smoke-tested against the official 10x Genomics dataset
+`Xenium In Situ Gene and Protein Expression data for FFPE Human Renal Cell Carcinoma`:
+
+- Source page: https://www.10xgenomics.com/datasets/xenium-protein-ffpe-human-renal-carcinoma
+- Provider: 10x Genomics
+- Modality: Xenium RNA + Protein
+- Software: Xenium Onboard Analysis 4.0.0
+- Upstream data license: CC BY 4.0
+
+Validation summary from a local download of the public bundle:
+
+- `load_xenium_gene_protein(..., prefer="auto")` loaded the Zarr-backed dataset successfully.
+- `load_xenium_gene_protein(..., prefer="h5")` loaded the HDF5-backed dataset successfully.
+- The validated bundle produced an `AnnData` with `465545` cells, `405` RNA features,
+  `27` protein markers, spatial centroids in `adata.obsm["spatial"]`, and merged cluster labels in `adata.obs["cluster"]`.
+- In the downloaded bundle used for validation, `metrics_summary.csv` reports `num_cells_detected=465545`,
+  and pyXenium reproduced that value from both supported matrix backends.
+
 Quick Start
 -----------
 
@@ -202,14 +223,17 @@ rna_protein_cluster_analysis(
 Command‑line
 ------------
 
-A small CLI is provided via `python -m pyXenium` (requires `click`).
+A small CLI is provided via `python -m pyXenium` or the installed `pyxenium` command.
 
 ```bash
 # Print a quick sanity check on the toy dataset
 python -m pyXenium demo
 
 # Fetch a toy dataset to a cache directory
 python -m pyXenium datasets --name toy_slide --dest ~/.cache/pyXenium
+
+# Equivalent console script
+pyxenium demo
 ```
 
 Data layout expectations
@@ -244,4 +268,9 @@ If this toolkit helps your work, please cite the project and the 10x Genomics Xe
 
 License
 -------
-All rights reserved by the author.
+Copyright (c) 2025 Taobo Hu. All rights reserved.
+
+This project is source-available, not open source. You may use, modify, and
+redistribute it only for non-commercial purposes under the terms of the
+[LICENSE](LICENSE) file. Commercial use requires prior written permission from
+the copyright holder.
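
Note on the README's validation summary: the cell-count check it describes (comparing the loaded cell count with `num_cells_detected` in `metrics_summary.csv`) could be scripted roughly as follows. This is a minimal sketch, not code from the commit. It assumes the CSV has one header row and one data row. The helper names `cells_detected` and `check_cell_count` and the `region_name` column are hypothetical, and a synthetic CSV string stands in for the real file.

```python
import csv
import io


def cells_detected(metrics_csv_text: str) -> int:
    """Read num_cells_detected from the text of a metrics_summary.csv file.

    Assumes a header row followed by exactly one data row.
    """
    rows = list(csv.DictReader(io.StringIO(metrics_csv_text)))
    if len(rows) != 1:
        raise ValueError(f"expected one data row, found {len(rows)}")
    return int(rows[0]["num_cells_detected"])


def check_cell_count(n_obs: int, metrics_csv_text: str) -> None:
    """Raise ValueError if the loaded cell count disagrees with the metrics file."""
    expected = cells_detected(metrics_csv_text)
    if n_obs != expected:
        raise ValueError(f"loaded {n_obs} cells, metrics report {expected}")


# Synthetic stand-in for the real file. Only the num_cells_detected column
# name and its value come from the README; region_name is a made-up column.
example_csv = "region_name,num_cells_detected\nexample_region,465545\n"

# With a real bundle, n_obs would be adata.n_obs from the loaded AnnData.
check_cell_count(465545, example_csv)
```

In practice the same check would be run once per matrix backend (`prefer="auto"` and `prefer="h5"`), since the README reports that both reproduce the value.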
diff --git a/pyproject.toml b/pyproject.toml
@@ -8,7 +8,8 @@ version = "0.1.0"
 description = "A toy Python package for analyzing 10x Xenium data."
 readme = "README.md"
 requires-python = ">=3.8"
-license = { text = "MIT" }
+license = "LicenseRef-Proprietary-NonCommercial"
+license-files = ["LICENSE"]
 authors = [{ name = "Taobo Hu" }]
 
 # Runtime dependencies (fill in according to what the project actually needs; example)
@@ -22,8 +23,12 @@ dependencies = [
   "fsspec>=2024.6.0",
   "requests>=2.31",
   "aiohttp",
+  "click>=8.1",
 ]
 
+[project.scripts]
+pyxenium = "pyXenium.__main__:main"
+
 [project.optional-dependencies]
 # ★ Added: the .[dev] extra used in CI (pytest, build/publish, docs, and other common development dependencies)
 dev = [
@@ -38,6 +43,13 @@ dev = [
 [tool.setuptools]
 # ★ Explicitly declare the src layout
 package-dir = {"" = "src"}
+include-package-data = true
+
+[tool.setuptools.package-data]
+pyXenium = [
+  "config/*.yaml",
+  "datasets/toy_slide/*.zip",
+]
 
 [tool.setuptools.packages.find]
 where = ["src"]
diff --git a/src/pyXenium/__init__.py b/src/pyXenium/__init__.py
@@ -1,12 +1,13 @@
 from ._version import __version__
+from .analysis import protein_gene_correlation
+from .datasets import PUBLIC_DATASET_SOURCES, get_public_dataset_sources
 from .io.partial_xenium_loader import load_anndata_from_partial
 from .io.xenium_gene_protein_loader import load_xenium_gene_protein
-from .analysis import protein_gene_correlation
 
-# src/pyXenium/__init__.py
 __all__ = [
-    *globals().get("__all__", []),
     "__version__",
+    "PUBLIC_DATASET_SOURCES",
+    "get_public_dataset_sources",
     "load_xenium_gene_protein",
     "load_anndata_from_partial",
     "protein_gene_correlation",
diff --git a/src/pyXenium/__main__.py b/src/pyXenium/__main__.py
@@ -1,36 +1,46 @@
-import shutil
 from pathlib import Path
+
 import click
-from src.pyXenium.io.io import load_toy
+
+from .io.io import copy_bundled_dataset, load_toy
+
 
 @click.group()
 def app():
     """pyXenium: Xenium toolkit (toy data included)"""
 
+
 @app.command()
 def demo():
     ds = load_toy()
     click.echo(f"Loaded groups: {list(ds)}")
 
+
 @app.command()
 @click.option("--name", default="toy_slide", show_default=True)
 @click.option("--url", default=None, help="Optional URL to download a dataset archive")
-@click.option("--dest", default=str(Path.home()/".cache"/"pyXenium"), show_default=True)
+@click.option("--dest", default=str(Path.home() / ".cache" / "pyXenium"), show_default=True)
 def datasets(name, url, dest):
     """Fetch example datasets to a local cache."""
-    cache = Path(dest); cache.mkdir(parents=True, exist_ok=True)
-    target = cache / name
+    cache = Path(dest)
+    cache.mkdir(parents=True, exist_ok=True)
     if url:
         import urllib.request
+
+        target = cache / name
         urllib.request.urlretrieve(url, str(target))
         click.echo(f"Downloaded to {target}")
     else:
-        from importlib import resources
-        base = resources.files("pyXenium.datasets.toy_slide")
-        target.mkdir(parents=True, exist_ok=True)
-        for fn in ["cells.zarr.zip", "transcripts.zarr.zip", "analysis.zarr.zip"]:
-            shutil.copyfile(base/fn, target/fn)
+        try:
+            target = copy_bundled_dataset(name=name, dest=cache)
+        except FileNotFoundError as exc:
+            raise click.ClickException(str(exc)) from exc
         click.echo(f"Copied bundled toy dataset to {target}")
 
-if __name__ == "__main__":
+
+def main():
     app()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/src/pyXenium/analysis/protein_microenvironment.py b/src/pyXenium/analysis/protein_microenvironment.py
@@ -24,7 +24,7 @@
   hundreds of thousands of cells. The intra-cluster adjacency (CSR) is only used for Moran's I on a subset.
 
 Author: (c) 2025
-License: All rights reserved.
+License: Proprietary; non-commercial use only.
 """
 
 from __future__ import annotations
diff --git a/src/pyXenium/datasets/__init__.py b/src/pyXenium/datasets/__init__.py
@@ -0,0 +1,15 @@
+"""Bundled example datasets and curated public source metadata shipped with pyXenium."""
+
+from .catalog import (
+    PUBLIC_DATASET_SOURCES,
+    RENAL_FFPE_PROTEIN_10X_DATASET,
+    PublicDatasetSource,
+    get_public_dataset_sources,
+)
+
+__all__ = [
+    "PublicDatasetSource",
+    "RENAL_FFPE_PROTEIN_10X_DATASET",
+    "PUBLIC_DATASET_SOURCES",
+    "get_public_dataset_sources",
+]
diff --git a/src/pyXenium/datasets/catalog.py b/src/pyXenium/datasets/catalog.py
@@ -0,0 +1,56 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class PublicDatasetSource:
+    slug: str
+    title: str
+    provider: str
+    url: str
+    modality: str
+    software: str
+    species: str
+    tissue: str
+    preservation_method: str
+    disease_state: str
+    upstream_data_license: str
+    first_published: str
+    current_release_date: str
+    release_notes: str
+    local_validation_summary: str
+
+
+RENAL_FFPE_PROTEIN_10X_DATASET = PublicDatasetSource(
+    slug="xenium-protein-ffpe-human-renal-carcinoma",
+    title="Xenium In Situ Gene and Protein Expression data for FFPE Human Renal Cell Carcinoma",
+    provider="10x Genomics",
+    url="https://www.10xgenomics.com/datasets/xenium-protein-ffpe-human-renal-carcinoma",
+    modality="RNA + Protein",
+    software="Xenium Onboard Analysis 4.0.0",
+    species="Human",
+    tissue="Kidney",
+    preservation_method="FFPE",
+    disease_state="Renal cell carcinoma",
+    upstream_data_license="CC BY 4.0",
+    first_published="2025-07-17",
+    current_release_date="2025-09-25",
+    release_notes=(
+        "10x Genomics states that the dataset was first published on July 17, 2025, "
+        "reanalyzed with the final Xenium Onboard Analysis v4.0 pipeline on August 27, 2025, "
+        "and replaced again on September 25, 2025 to fix a bug with no changes to the biological results."
+    ),
+    local_validation_summary=(
+        "pyXenium successfully loaded a local copy of the public bundle through both the Zarr-backed "
+        "and HDF5-backed cell_feature_matrix inputs, producing an AnnData object with 465545 cells, "
+        "405 RNA features, 27 protein markers, spatial centroids, and merged cluster labels."
+    ),
+)
+
+PUBLIC_DATASET_SOURCES = (RENAL_FFPE_PROTEIN_10X_DATASET,)
+
+
+def get_public_dataset_sources() -> tuple[PublicDatasetSource, ...]:
+    """Return curated public dataset sources used to validate pyXenium."""
+    return PUBLIC_DATASET_SOURCES
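
Note on `catalog.py`: the catalog is a tuple of frozen dataclass instances, so callers would typically look entries up by `slug`. The sketch below illustrates that pattern with a reduced stand-in class so it runs without pyXenium installed. `DatasetSource` and `find_source` are hypothetical names that are not part of the commit, and the stand-in carries only three of the fields `PublicDatasetSource` defines.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSource:
    """Reduced stand-in for PublicDatasetSource (three of its fields)."""

    slug: str
    provider: str
    url: str


SOURCES = (
    DatasetSource(
        slug="xenium-protein-ffpe-human-renal-carcinoma",
        provider="10x Genomics",
        url="https://www.10xgenomics.com/datasets/xenium-protein-ffpe-human-renal-carcinoma",
    ),
)


def find_source(slug: str, sources: tuple[DatasetSource, ...] = SOURCES) -> DatasetSource:
    """Hypothetical helper: return the entry whose slug matches, else raise KeyError."""
    for source in sources:
        if source.slug == slug:
            return source
    raise KeyError(f"No dataset source with slug {slug!r}")


entry = find_source("xenium-protein-ffpe-human-renal-carcinoma")
print(entry.provider)
```

Because the dataclass is frozen, entries cannot be mutated after construction, which suits read-only provenance metadata of this kind.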
diff --git a/src/pyXenium/datasets/toy_slide/__init__.py b/src/pyXenium/datasets/toy_slide/__init__.py
@@ -0,0 +1 @@
+"""Toy Xenium-like dataset used for smoke tests and demos."""
diff --git a/src/pyXenium/datasets/toy_slide/analysis.zarr.zip b/src/pyXenium/datasets/toy_slide/analysis.zarr.zip
diff --git a/src/pyXenium/datasets/toy_slide/cells.zarr.zip b/src/pyXenium/datasets/toy_slide/cells.zarr.zip
diff --git a/src/pyXenium/datasets/toy_slide/transcripts.zarr.zip b/src/pyXenium/datasets/toy_slide/transcripts.zarr.zip
diff --git a/src/pyXenium/io/io.py b/src/pyXenium/io/io.py
@@ -1,14 +1,46 @@
+from __future__ import annotations
+
+import shutil
 from importlib import resources
+from pathlib import Path
+
 import zarr
 
+_BUNDLED_DATASETS = {
+    "toy_slide": ("pyXenium.datasets.toy_slide", ("cells.zarr.zip", "transcripts.zarr.zip", "analysis.zarr.zip")),
+}
+
+
+def _bundled_dataset_resource(name: str):
+    try:
+        package, _ = _BUNDLED_DATASETS[name]
+    except KeyError as exc:
+        available = ", ".join(sorted(_BUNDLED_DATASETS))
+        raise FileNotFoundError(f"Unknown bundled dataset '{name}'. Available datasets: {available}.") from exc
+    return resources.files(package)
+
+
 def open_zarr_zip(zip_path):
     store = zarr.storage.ZipStore(str(zip_path), mode="r")
-    return zarr.group(store=store)
+    return zarr.open_group(store=store, mode="r")
+
 
 def load_toy():
-    base = resources.files("pyXenium.datasets.toy_slide")
+    base = _bundled_dataset_resource("toy_slide")
     return {
         "cells": open_zarr_zip(base / "cells.zarr.zip"),
         "transcripts": open_zarr_zip(base / "transcripts.zarr.zip"),
         "analysis": open_zarr_zip(base / "analysis.zarr.zip"),
     }
+
+
+def copy_bundled_dataset(name: str, dest: str | Path) -> Path:
+    base = _bundled_dataset_resource(name)
+    _, filenames = _BUNDLED_DATASETS[name]
+
+    target = Path(dest) / name
+    target.mkdir(parents=True, exist_ok=True)
+    for filename in filenames:
+        with resources.as_file(base / filename) as source_path:
+            shutil.copyfile(source_path, target / filename)
+    return target
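
Note on `io.py`: `copy_bundled_dataset` combines a name registry with `importlib.resources.as_file`, which yields a real filesystem path even when the package is not installed as plain files. The sketch below shows the same pattern in a form that runs without pyXenium. It assumes Python 3.9+ for `resources.files`. The registry entry is a stand-in that points at the standard-library `json` package instead of a data package, and `copy_registered` is a hypothetical name.

```python
from __future__ import annotations

import shutil
import tempfile
from importlib import resources
from pathlib import Path

# Stand-in registry: the stdlib "json" package plays the role of a data package.
_REGISTRY = {"demo": ("json", ("__init__.py",))}


def copy_registered(name: str, dest: str | Path) -> Path:
    """Copy the files registered under `name` into dest/name and return that path."""
    try:
        package, filenames = _REGISTRY[name]
    except KeyError as exc:
        # Unknown names surface as FileNotFoundError, as in the commit.
        available = ", ".join(sorted(_REGISTRY))
        raise FileNotFoundError(f"Unknown dataset '{name}'. Available: {available}.") from exc

    base = resources.files(package)
    target = Path(dest) / name
    target.mkdir(parents=True, exist_ok=True)
    for filename in filenames:
        # as_file materializes the resource as a concrete path for copyfile.
        with resources.as_file(base / filename) as source_path:
            shutil.copyfile(source_path, target / filename)
    return target


with tempfile.TemporaryDirectory() as tmp:
    copied = copy_registered("demo", tmp)
    copied_names = sorted(p.name for p in copied.iterdir())
    try:
        copy_registered("missing", tmp)
    except FileNotFoundError as exc:
        error_message = str(exc)
```

Raising `FileNotFoundError` for unknown names is what lets the CLI in `__main__.py` convert the failure into a `click.ClickException` with a readable message.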
diff --git a/tests/conftest.py b/tests/conftest.py
@@ -0,0 +1,7 @@
+from pathlib import Path
+import sys
+
+SRC = Path(__file__).resolve().parents[1] / "src"
+
+if str(SRC) not in sys.path:
+    sys.path.insert(0, str(SRC))
diff --git a/tests/test_cli.py b/tests/test_cli.py
diff --git a/tests/test_dataset_catalog.py b/tests/test_dataset_catalog.py
