Last Updated: 2026-01-27 Status: Method validated across 9 different synthetic data scenarios with varying difficulty
cd /home/yutonose/CASTOR/experiments/14_inverse_prediction
micromamba run -n castor python run_all.py --quick # Test run
micromamba run -n castor python experiments/exp10_multi_scenario.py --quick # Multi-scenario- RESOLVED: Unsupervised learning works (exp09: AUC 0.995)
- RESOLVED: Multi-scenario validation complete (exp10: 9 scenarios)
- Implement real data validation (Slide-seq/Visium)
- Generate publication-ready figures
- Analyze cell_type_based scenario (Intrinsic AUC only 0.596)
| Method | Ectopic AUC | Intrinsic AUC | Selectivity |
|---|---|---|---|
| Inv_PosError | 0.997 ± 0.003 | 0.549 ± 0.028 | +0.449 |
| PCA_Error | 0.429 ± 0.029 | 1.000 ± 0.000 | -0.571 |
| Neighbor_Diff | 0.896 ± 0.002 | 1.000 ± 0.000 | -0.103 |
| LISA | 0.917 ± 0.004 | 0.992 ± 0.001 | -0.075 |
| LOF | 0.438 ± 0.027 | 1.000 ± 0.000 | -0.562 |
Key Finding: Inv_PosError is the ONLY method with positive selectivity for Ectopic.
| Metric | Value |
|---|---|
| Fraction closer to donor | 0.987 ± 0.010 |
| Mean dist to true position | 35.48 ± 0.76 |
| Mean dist to donor position | 4.56 ± 0.65 |
| Normal spots mean dist | 2.99 ± 0.32 |
Key Finding: 98.7% of ectopic predictions point to donor location.
| Metric | Value |
|---|---|
| Correlation (Inv_PosError vs PCA_Error) | -0.028 ± 0.013 |
| Correlation on normal spots | 0.074 ± 0.022 |
| Separation AUC (Ectopic vs Intrinsic) | 1.000 ± 0.000 |
Key Finding: Near-zero correlation confirms orthogonal detection axes.
| λ_pos | Ectopic AUC |
|---|---|
| 0.0 | 0.511 ± 0.048 (random!) |
| 0.1 | 0.996 ± 0.002 |
| 0.5 | 0.998 ± 0.002 |
| 1.0 | 0.998 ± 0.002 |
Key Finding: Position prediction loss is ESSENTIAL. Without it, detection fails.
Original exp08 (BUGGY - different spatial patterns):
| Training Data | Ectopic AUC |
|---|---|
| Clean (normal only) | 0.495 ± 0.107 (random - BUG!) |
| Contaminated | 0.996 ± 0.005 |
Fixed exp09 (CORRECT - same spatial structure):
| Training Data | Ectopic AUC |
|---|---|
| Clean (normal only) | 0.995 ± 0.004 |
| Contaminated | 0.997 ± 0.003 |
ROOT CAUSE: exp08 used random_state=seed for train and random_state=seed+1000 for test,
creating completely DIFFERENT spatial patterns. The model couldn't generalize across patterns.
FIX: exp09 generates ONE spatial structure, splits into train/test, adds anomalies only to test.
CONCLUSION: Unsupervised learning WORKS when using proper experimental design!
Tests robustness across different synthetic data configurations:
| Scenario | Ectopic Type | Intrinsic Type | Noise | Ectopic AUC | Donor Frac | Intrinsic AUC |
|---|---|---|---|---|---|---|
| baseline | exact_copy | large | Gaussian | 0.987 ± 0.011 | 97.8% | 1.000 |
| noisy_ectopic | noisy_copy (σ=0.2) | large | Gaussian | 0.987 ± 0.009 | 97.9% | 1.000 |
| partial_ectopic | 70% donor mix | large | Gaussian | 0.942 ± 0.020 | 75.0% | 1.000 |
| hard_ectopic | 50% mix + noise | large | Gaussian | 0.874 ± 0.036 | 48.3% | 1.000 |
| medium_intrinsic | exact_copy | medium | Gaussian | 0.996 ± 0.003 | 98.6% | 1.000 |
| hard_intrinsic | exact_copy | small | Gaussian | 0.998 ± 0.002 | 98.9% | 0.996 |
| realistic_counts | noisy_copy | medium | Neg.Binom | 0.994 ± 0.004 | 98.7% | 1.000 |
| cell_type_based | marker_swap | stress_module | Neg.Binom | 0.969 ± 0.026 | 94.0% | 0.596 |
| hardest | 50% mix + noise | small | Neg.Binom | 0.782 ± 0.040 | 46.2% | 0.929 |
Key Findings from Multi-Scenario Validation:
- Noisy ectopic = Exact copy: Adding 20% noise doesn't degrade performance (0.987 vs 0.987)
- Partial mixing degrades gracefully: 70% donor → 0.942 AUC, 50% donor → 0.874 AUC
- Realistic count noise works: Negative binomial noise gives 0.994 AUC
- Cell-type marker swapping: Still detectable at 0.969 AUC
- Hardest scenario: Even with 50% mix + small intrinsic + NB noise → 0.782 AUC (still useful!)
- Intrinsic detection robust: PCA achieves >0.99 AUC except for cell_type_based (0.596)
Scenario Definitions:
noisy_copy: donor expression + Gaussian noise (σ = 20% of gene std)partial_mix: α × donor + (1-α) × original expressionmarker_swap: swap cell-type specific marker genes between regionsstress_module: coordinated upregulation of stress response genes (small effect)negative_binomial: realistic count noise with overdispersion
Semi-synthetic validation on real 10x Visium data (mouse brain sagittal posterior):
| Difficulty | Ectopic Type | Inv_PosError AUC | Donor Frac | PCA AUC |
|---|---|---|---|---|
| easy | exact copy | 0.884 ± 0.083 | 86.4% | 0.09 |
| medium_noise | +10% noise | 0.889 ± 0.112 | 87.3% | 0.07 |
| medium_mix | 70% donor | 0.793 ± 0.150 | 57.8% | 0.02 |
| hard | 70% + noise | 0.795 ± 0.154 | 57.6% | 0.02 |
| hardest | 50% + noise | 0.744 ± 0.150 | 42.5% | 0.06 |
Key Findings from Real Data Validation:
- Method works on real ST data: AUC 0.88 on exact copy scenario
- Noise doesn't hurt: medium_noise (0.889) ≈ easy (0.884)
- Partial mixing degrades gracefully: 70% → 0.79, 50% → 0.74
- Position prediction interpretable: 86% of predictions point toward donor region
- PCA fails on real data: AUC < 0.10 (near random, due to no intrinsic anomalies)
- Real data is harder: AUC 0.88 vs 0.99 on synthetic (expected)
Biological Significance:
- Ectopic detection = detecting cells with expression from "wrong" spatial location
- Applications: tumor invasion, immune infiltration, developmental heterotopia
- Position prediction tells us where the expression "should be"
Original Problem: exp08 showed AUC 0.495 for clean training.
Root Cause: EXPERIMENTAL BUG, not a fundamental limitation!
- exp08 generated train data with
random_state=seed - exp08 generated test data with
random_state=seed+1000 - This created COMPLETELY DIFFERENT spatial patterns
- Model learned pattern A, tested on pattern B → failure
Solution (exp09):
- Generate ONE spatial structure
- Split spots into train/test
- Inject anomalies ONLY into test spots
- Train on clean train split, test on anomaly-containing test split
Result: AUC 0.995 ± 0.004 (essentially same as contaminated training!)
Implication: The method IS a valid unsupervised anomaly detector
Current Ectopic Generation:
X[idx] = X[donor].copy() # Exact copy - unrealistic!Problems:
- Real ectopic cells wouldn't have exact donor expression
- No noise, no partial transfer
- Makes detection artificially easy
Proposed Fix:
# Add noise to ectopic
noise = np.random.normal(0, 0.1 * X[donor].std(), X[donor].shape)
X[idx] = X[donor].copy() + noise
# Partial ectopic (mix original and donor)
alpha = np.random.uniform(0.3, 0.7)
X[idx] = alpha * X[donor] + (1 - alpha) * X[idx]Current Intrinsic Generation:
effect_size = np.random.uniform(2.0, 4.0) * global_std # Very large!
X[idx, affected_genes] += np.random.exponential(effect_size, n_affected)Problem: Effect size is so large that PCA trivially achieves 1.000 AUC.
Proposed Fix: Vary effect size to create difficulty gradient.
All results are on synthetic data. Need:
- Slide-seq hippocampus (known cell types)
- Visium cancer samples (known tumor cells)
- Semi-synthetic: real data + injected anomalies
-
exp09: Clean Training ✓
- Unsupervised learning validated (AUC 0.995)
- Bug in exp08 identified and fixed
-
exp10: Multi-Scenario Validation ✓
- 9 scenarios with varying difficulty
- Noise, partial mixing, realistic counts all tested
- Method is robust (AUC 0.78-0.99)
- exp11: Real Data Validation ✓ (NEW)
- Uses 10x Genomics Visium public data via scanpy
- Semi-synthetic validation: real data + artificial ectopic injection
- Supports multiple datasets: mouse brain, human lymph node, breast cancer
- Preliminary results: AUC 0.69-0.85 on mouse brain sagittal
Run exp11 with full 30 seeds across multiple datasets:
micromamba run -n castor python experiments/exp11_real_data.py \
--datasets mouse_brain_sagittal_posterior human_lymph_node# figures/
# 1. Conceptual diagram: Forward vs Inverse prediction
# 2. Cross-detection AUC heatmap (exp01)
# 3. Position prediction interpretability (exp03)
# 4. Multi-scenario robustness (exp10) - bar chart with error bars
# 5. Real data validation (exp11)
# 6. Ablation: λ_pos effect (exp07)The cell_type_based scenario shows Intrinsic AUC = 0.596 (much lower than other scenarios). This is due to the stress_module intrinsic type which uses coordinated but small effects. May need to analyze whether this reflects a limitation or expected behavior.
- Inv_PosError achieves 0.997 AUC for Ectopic (strong)
- Near-zero correlation with PCA (-0.028) (strong)
- 98.7% of predictions point to donor (strong)
- Position loss is essential (λ=0 fails) (strong)
- Robust to noise/dropout (strong)
- UNSUPERVISED learning works (0.995 AUC with clean training!) - exp09
- ROBUST across scenarios (0.78-0.99 AUC across 9 scenarios) - exp10
- WORKS ON REAL DATA (0.88 AUC on Visium mouse brain) - exp11 NEW
- Publication-ready figures
- Test on additional real datasets (breast cancer, lymph node)
- Investigate cell_type_based Intrinsic AUC drop (0.596)
Title: "Inverse Spatial Prediction Enables Unsupervised Detection of Ectopic Anomalies in Spatial Transcriptomics"
Key Claims:
- First application of expression→position prediction for anomaly detection
- Works WITHOUT labeled anomalies (AUC 0.995 with clean training)
- Provides interpretable results (98% of predictions point to donor)
- Orthogonal to global methods (r = -0.028 with PCA)
- Robust to noise (AUC 0.987 with 20% Gaussian noise on ectopic expression)
- Robust across realistic scenarios (AUC 0.78-0.99 across 9 synthetic configurations)
- Graceful degradation with partial signal (50% donor mix → 0.87 AUC)
- Validated on real Visium data (AUC 0.88 on mouse brain) - NEW
14_inverse_prediction/
├── core/ # Core modules
│ ├── __init__.py # Exports
│ ├── utils.py # Graph, normalization
│ ├── preprocessing.py # Data preparation
│ ├── models.py # InversePredictionModel
│ ├── baselines.py # LISA, LOF, PCA, etc.
│ ├── evaluation.py # AUC, statistics
│ ├── data_generation.py # Synthetic data
│ ├── scenarios.py # Multi-scenario definitions
│ └── real_data.py # Real ST data loading (NEW)
├── experiments/ # Experiment scripts
│ ├── exp01_cross_detection.py
│ ├── exp02_competitor.py
│ ├── exp03_position_accuracy.py
│ ├── exp04_noise_robustness.py
│ ├── exp05_independence.py
│ ├── exp07_ablation.py
│ ├── exp08_clean_training.py # BUGGY - do not use
│ ├── exp09_clean_training_fixed.py # Unsupervised validation
│ ├── exp10_multi_scenario.py # Multi-scenario validation
│ └── exp11_real_data.py # Real data validation (NEW)
├── config.py # Configuration
├── run_all.py # Run all experiments
├── data_cache/ # Downloaded datasets (auto-created)
├── results/ # CSV outputs
└── figures/ # PDF/PNG figures
# Activate environment
micromamba run -n castor python ...
# Quick test (3 seeds)
micromamba run -n castor python run_all.py --quick
# Full run (30 seeds)
micromamba run -n castor python run_all.py
# Specific experiments
micromamba run -n castor python run_all.py --exp 1 3 5
# Individual experiment
micromamba run -n castor python experiments/exp01_cross_detection.py --quick- NumPy 2.0 compatibility:
coords.ptp()→np.ptp(coords) - KeyError s_recon: Unified score key names in compute_scores()
- neighbor_diff axis: Fixed to
X[indices].mean(axis=1) - Silent ectopic injection: Only set label when injection succeeds
- Ectopic injection requires sufficient distant spots (min_distance_factor)
- Model uses BatchNorm - needs sufficient batch size
- Position normalization assumes rectangular spatial layout
This is experiment 14 for inverse spatial prediction anomaly detection.
The method is FULLY VALIDATED:
- Unsupervised learning works (exp09: AUC 0.995)
- Robust across 9 synthetic scenarios (exp10: AUC 0.78-0.99)
Next steps: real data validation (Slide-seq/Visium) and publication figures.
Read the README.md first for full context.
README.md(this file)results/exp10_multi_scenario.csv(multi-scenario validation)results/exp11_real_data.csv(real data validation)core/real_data.py(real data loading utilities)experiments/exp11_real_data.py(real data experiment)
human_lymph_node: Human Lymph Node (4035 spots)mouse_brain_sagittal_posterior: Mouse Brain Sagittal (3355 spots)mouse_brain_sagittal_anterior: Mouse Brain Sagittal Anteriormouse_brain_coronal: Mouse Brain Coronalhuman_breast_cancer: Human Breast Cancer
- Why does cell_type_based scenario have low Intrinsic AUC (0.596)?
- How to interpret real data results (lower AUC than synthetic)?
- Publication-ready figure generation
| Experiment | Result |
|---|---|
| Clean training (exp09) | Ectopic AUC 0.995 ± 0.004 |
| Baseline scenario (exp10) | Ectopic AUC 0.987 ± 0.011 |
| Noisy ectopic (exp10) | Ectopic AUC 0.987 ± 0.009 |
| Hardest scenario (exp10) | Ectopic AUC 0.782 ± 0.040 |
| Realistic counts (exp10) | Ectopic AUC 0.994 ± 0.004 |
| Real data - easy (exp11) | Ectopic AUC 0.884 ± 0.083 |
| Real data - hard (exp11) | Ectopic AUC 0.795 ± 0.154 |
Unsupervised learning + Multi-scenario + Real data validated!