diff --git a/.claude/skills/add-omics-runtime-pack/SKILL.md b/.claude/skills/add-omics-runtime-pack/SKILL.md new file mode 100644 index 0000000..c365767 --- /dev/null +++ b/.claude/skills/add-omics-runtime-pack/SKILL.md @@ -0,0 +1,149 @@ +--- +name: add-omics-runtime-pack +description: Audit or refresh a curated pack of eight high-signal omics runtime skills in a BioClaw installation. Use when the user wants stronger built-in guidance for common omics analyses inside agent containers without changing BioClaw source code. Ensures the eight runtime skill folders exist under `container/skills/` with the expected flat file layout. +disable-model-invocation: true +--- + +# Add Omics Runtime Pack + +This skill verifies that eight strong runtime skills are present under `container/skills/` for common BioClaw analysis tasks. + +## What This Adds + +- `container/skills/scrna-preprocessing-clustering/` +- `container/skills/cell-annotation/` +- `container/skills/chip-seq/` +- `container/skills/atac-seq/` +- `container/skills/differential-expression/` +- `container/skills/proteomics/` +- `container/skills/metagenomics/` +- `container/skills/structural-biology/` + +Each runtime skill must contain only root-level files: + +- `SKILL.md` +- `technical_reference.md` +- `commands_and_thresholds.md` + +## What This Must Not Change + +- Do not modify `src/`, `container/agent-runner/`, `Dockerfile`, or any application code. +- Do not modify `src/container-runner.ts`. +- Do not add Python packages, R packages, or other dependencies. +- Do not add nested `references/` directories under `container/skills//`. + +The contribution is delivered as runtime skill content plus this installer skill, without any source-code changes. + +## Why The Runtime Skills Must Stay Flat + +BioClaw syncs `container/skills//` into `/home/node/.claude/skills//` inside the container, but the sync only copies the first directory level. 
+ +That means: + +- `container/skills//SKILL.md` will sync +- `container/skills//technical_reference.md` will sync +- `container/skills//commands_and_thresholds.md` will sync +- `container/skills//references/...` will **not** sync + +So every installed runtime skill must be flat. + +## Runtime Skill Source Of Truth + +The runtime-ready versions now live directly in: + +```text +container/skills/ +``` + +Treat those directories as the source of truth. Do not recreate alternate copies under `.claude/skills/`. + +## Implementation Steps + +Run all steps directly. Only pause if one of the target runtime skill directories already exists and appears user-modified. + +### 1. Verify Current State + +Check: + +```bash +pwd +ls -la container/skills +for skill in \ + scrna-preprocessing-clustering \ + cell-annotation \ + chip-seq \ + atac-seq \ + differential-expression \ + proteomics \ + metagenomics \ + structural-biology +do + test -e "container/skills/$skill" && echo "$skill already exists" || echo "$skill missing" +done +``` + +If any target directory already exists, inspect it before changing anything. + +### 2. Create Or Update The Eight Runtime Skill Directories + +For each skill listed below, ensure `container/skills//` exists and contains exactly the three required root-level files. 
+ +| Runtime skill | Required files | +|---|---| +| `scrna-preprocessing-clustering` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `cell-annotation` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `chip-seq` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `atac-seq` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `differential-expression` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `proteomics` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `metagenomics` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | +| `structural-biology` | `SKILL.md`, `technical_reference.md`, `commands_and_thresholds.md` | + +Do not invent alternate content unless the committed runtime files clearly conflict with the current repository state. + +### 3. Preserve The Runtime-Ready Shape + +For every installed runtime skill: + +- keep only the three root-level files above +- do not create `README.md` +- do not create nested `references/` +- keep the relative links in `SKILL.md` pointing to `technical_reference.md` and `commands_and_thresholds.md` + +### 4. Validate The Installed Pack + +Run these checks: + +```bash +find container/skills -maxdepth 2 -type f | sort +find container/skills -maxdepth 3 -type d -name references +``` + +The second command should produce no output for the eight installed skills. + +Also confirm that no duplicate copy of the runtime pack remains under `.claude/skills/add-omics-runtime-pack/`. 
+ +Also confirm no external-path residue remains: + +```bash +grep -RniE "/Users/|omics-skills-repo-template|bioSkills-main|OpenClaw-Medical-Skills-main|claude-scientific-skills-main" \ + container/skills/scrna-preprocessing-clustering \ + container/skills/cell-annotation \ + container/skills/chip-seq \ + container/skills/atac-seq \ + container/skills/differential-expression \ + container/skills/proteomics \ + container/skills/metagenomics \ + container/skills/structural-biology +``` + +That search should return no matches. + +### 5. Report The Result + +Summarize: + +- which runtime skill directories were created +- whether all eight contain exactly the three required files +- whether any pre-existing directories needed conflict resolution +- that no source files were modified diff --git a/README.md b/README.md index 9a161d2..ba9f927 100644 --- a/README.md +++ b/README.md @@ -17,6 +17,9 @@ Built on the [NanoClaw](https://github.com/qwibitai/nanoclaw) architecture with bioinformatics tools and skills from the [STELLA](https://github.com/zaixizhang/STELLA) project, powered by the [Claude Agent SDK](https://docs.anthropic.com/en/docs/agents-sdk). + +New BioClaw-compatible skills are first contributed to Bioclaw_Skills_Hub, where they can be iterated and tested before being promoted into the main BioClaw repository. If you want to contribute new skills, please submit them there first. Skills that prove useful and stable in practice may later be integrated into BioClaw itself. To get newly promoted skills and updates from BioClaw, pull the latest version of this repository with git pull. 
+ ## Join WeChat Group diff --git a/README.zh-CN.md b/README.zh-CN.md index 45531bf..db91b34 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -12,6 +12,10 @@ [![Paper](https://img.shields.io/badge/bioRxiv-STELLA-b31b1b.svg)](https://www.biorxiv.org/content/10.1101/2025.07.01.662467v2) [![arXiv](https://img.shields.io/badge/arXiv-2507.02004-b31b1b.svg)](https://arxiv.org/abs/2507.02004) +

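+ New BioClaw-compatible skills are first submitted to the Bioclaw_Skills_Hub repository, where they are iterated on and tested before being synced into the main BioClaw repository, depending on how well they perform. If you want to contribute new skills, please submit them there first. Skills that have been validated and proven stable will gradually be integrated into BioClaw. To get these later-merged skills and updates, run git pull in this repository. +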

+ --- diff --git a/container/skills/atac-seq/SKILL.md b/container/skills/atac-seq/SKILL.md new file mode 100644 index 0000000..0e5d0ab --- /dev/null +++ b/container/skills/atac-seq/SKILL.md @@ -0,0 +1,164 @@ +--- +name: atac-seq +description: ATAC-seq processing with assay QC, MACS3 peak calling, consensus peak matrices, differential accessibility, and motif or footprint follow-up. +tool_type: mixed +primary_tool: MACS3 +--- + +# ATAC Seq + +## Version Compatibility + +Reference examples assume: + +- `macs3` 3.0+ +- `samtools` 1.18+ +- `deepTools` 3.5+ + +Verify the runtime first: + +- CLI: `macs3 --version`, `samtools --version`, `bamCoverage --version` + +## Overview + +Use this skill when the user needs: + +- bulk ATAC-seq QC +- peak calling +- accessibility counting +- differential accessibility +- motif deviation or footprint follow-up + +## When To Use This Skill + +- the task is bulk ATAC-seq rather than ChIP-seq +- TSS enrichment, fragment periodicity, or FRiP need review +- the output should include peaks, counts, and downstream accessibility summaries + +## Quick Route + +- paired-end bulk ATAC: use `BAMPE` +- call peaks without control using ATAC-specific settings +- if TSS enrichment is poor, stop and flag data quality before interpretation + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for QC gates and assay-specific caveats. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for peak-calling commands, thresholds, and output conventions. 
+ +## Prerequisites + +| Check | Guidance | +|---|---:| +| uniquely mapped reads | `>= 20M` preferred for strong bulk ATAC | +| TSS enrichment | `> 7` acceptable, `> 10` strong | +| FRiP | `> 0.2` often strong for good bulk ATAC | + +## Expected Inputs + +- paired-end ATAC BAM or FASTQ +- reference genome +- sample groups for comparisons + +## Expected Outputs + +- `results/peaks/sample_peaks.narrowPeak` +- `results/matrix/consensus_peak_counts.tsv` +- `results/diff_accessibility.tsv` +- `figures/tss_enrichment.pdf` +- `figures/fragment_size_distribution.pdf` + +## Starter Pattern + +```bash +macs3 callpeak \ + -t atac.bam \ + -f BAMPE \ + -g hs \ + -n sample \ + --nomodel \ + --shift -100 \ + --extsize 200 \ + -q 0.01 \ + --outdir results/peaks +``` + +## Key Parameters + +| Parameter | Typical value | Notes | +|---|---|---| +| `-f` | `BAMPE` | paired-end ATAC should use fragment-aware mode, which piles up the observed fragments | +| `--nomodel` | on | standard for ATAC | +| `--shift` | `-100` | common Tn5 offset convention (`-extsize / 2`) for read-based `-f BAM` runs; the MACS3 documentation states that `--shift` must stay `0` with `BAMPE`, so drop it if `-f BAMPE` is kept | +| `--extsize` | `200` | common first-pass extension for read-based runs with `--nomodel`; not needed when `BAMPE` supplies real fragment spans | +| `-q` | `0.01` | starting FDR threshold | + +## Workflow + +### 1. Validate assay QC + +Review: + +- TSS enrichment +- fragment size periodicity +- duplication +- mapped read depth + +### 2. Call peaks with ATAC-specific settings + +Use fragment-aware paired-end mode and Tn5-aware shifting or equivalent settings. + +### 3. Build a consensus peak matrix + +Merge peaks across samples, count fragments into consensus intervals, then produce a peak-by-sample matrix. + +### 4. Test differential accessibility + +Use replicate-aware statistics and report both effect size and adjusted significance. + +### 5. Run motif or footprint follow-up + +Only after peak quality and read depth support it. 
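Step 3 of the workflow above (merge peaks across samples, then count fragments into the consensus intervals) can be illustrated with a minimal pure-Python sketch. It is not the skill's prescribed implementation: `merge_peaks` and `count_fragments` are hypothetical helpers, coordinates are assumed to be half-open BED-style tuples, and a real run would more likely rely on a dedicated interval tool.

```python
from collections import defaultdict


def merge_peaks(peak_sets):
    """Merge per-sample peak lists into consensus intervals.

    peak_sets: iterable of lists of (chrom, start, end) tuples, one list per sample.
    Overlapping or book-ended intervals on the same chromosome are merged.
    """
    by_chrom = defaultdict(list)
    for peaks in peak_sets:
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end))

    consensus = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Extend the current consensus interval.
                cur_end = max(cur_end, end)
            else:
                consensus.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        consensus.append((chrom, cur_start, cur_end))
    return consensus


def count_fragments(consensus, fragments):
    """Count fragments overlapping each consensus interval (naive scan, fine for small inputs)."""
    counts = []
    for chrom, start, end in consensus:
        n = sum(1 for c, s, e in fragments if c == chrom and s < end and e > start)
        counts.append(n)
    return counts


sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 20)]
consensus = merge_peaks([sample_a, sample_b])
fragments_a = [("chr1", 120, 180), ("chr1", 240, 300), ("chr1", 700, 800), ("chr2", 0, 10)]
column_a = count_fragments(consensus, fragments_a)
```

Calling `count_fragments` once per sample against the same `consensus` list gives one column per sample, which is the peak-by-sample matrix the step describes.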
+ +## Output Artifacts + +```text +results/ +├── peaks/ +│ ├── sample_peaks.narrowPeak +│ └── sample_summits.bed +├── matrix/ +│ └── consensus_peak_counts.tsv +└── diff_accessibility.tsv +qc/ +├── tss_enrichment.tsv +└── fragment_metrics.tsv +figures/ +├── tss_enrichment.pdf +└── fragment_size_distribution.pdf +``` + +## Quality Review + +- TSS enrichment below `7` should trigger caution. +- Strong nucleosome periodicity supports a good bulk ATAC library. +- FRiP below `0.1` is usually weak and needs scrutiny. +- Footprinting should not be trusted on low-depth or poor-quality libraries. + +## Anti-Patterns + +- using generic ChIP peak-calling defaults for ATAC +- running footprinting on weak libraries +- skipping TSS enrichment review +- merging peaks from mixed reference builds + +## Related Skills + +- ChIP Seq +- Gene Regulatory Networks +- Multiome And scATAC + +## Optional Supplements + +- `deeptools` +- `pysam` diff --git a/container/skills/atac-seq/commands_and_thresholds.md b/container/skills/atac-seq/commands_and_thresholds.md new file mode 100644 index 0000000..0c82623 --- /dev/null +++ b/container/skills/atac-seq/commands_and_thresholds.md @@ -0,0 +1,35 @@ +# ATAC Seq Commands And Thresholds + +## Peak Calling + +```bash +macs3 callpeak \ + -t atac.bam \ + -f BAMPE \ + -g hs \ + -n sample \ + --nomodel \ + --shift -100 \ + --extsize 200 \ + -q 0.01 \ + --outdir results/peaks +``` + +## Suggested QC Gates + +- uniquely mapped reads: `>= 20M` +- TSS enrichment: `> 7` +- strong TSS enrichment: `> 10` +- FRiP: `> 0.2` + +## Output Convention + +```text +results/ +├── peaks/sample_peaks.narrowPeak +├── matrix/consensus_peak_counts.tsv +└── diff_accessibility.tsv +qc/ +├── tss_enrichment.tsv +└── fragment_metrics.tsv +``` diff --git a/container/skills/atac-seq/technical_reference.md b/container/skills/atac-seq/technical_reference.md new file mode 100644 index 0000000..f88b5c1 --- /dev/null +++ b/container/skills/atac-seq/technical_reference.md @@ -0,0 +1,39 @@ 
+# ATAC Seq Technical Reference + +## QC Interpretation + +### TSS Enrichment + +| TSS enrichment | Interpretation | +|---|---| +| `< 5` | poor | +| `5-7` | weak | +| `> 7` | acceptable | +| `> 10` | strong | + +### FRiP + +| FRiP | Interpretation | +|---|---| +| `< 0.1` | weak | +| `0.1-0.2` | usable | +| `> 0.2` | strong | + +## Why ATAC Uses Different Peak Settings + +ATAC libraries often use: + +- `--nomodel` +- `--shift -100` +- `--extsize 200` + +because transposition footprints differ from ChIP fragment modeling assumptions. + +## Failure Modes + +- weak TSS enrichment and no periodicity + - likely poor library quality +- many peaks but poor FRiP + - likely noisy open chromatin signal +- footprinting requested on shallow data + - report that confidence is low before proceeding diff --git a/container/skills/cell-annotation/SKILL.md b/container/skills/cell-annotation/SKILL.md new file mode 100644 index 0000000..5470a8a --- /dev/null +++ b/container/skills/cell-annotation/SKILL.md @@ -0,0 +1,154 @@ +--- +name: cell-annotation +description: Automated and marker-guided single-cell cell type annotation using CellTypist, marker review, reference transfer, and confidence-aware label curation. +tool_type: python +primary_tool: CellTypist +--- + +# Cell Annotation + +## Version Compatibility + +Reference examples assume: + +- `scanpy` 1.10+ +- `celltypist` 1.6+ +- `pandas` 2.2+ + +Before using code patterns, verify installed versions match the environment: + +- Python: `python -c "import scanpy, celltypist; print(scanpy.__version__, celltypist.__version__)"` +- If APIs differ, inspect the installed docs and adapt the pattern instead of retrying unchanged. + +## Overview + +Use this skill when the user wants cluster labels or per-cell labels for scRNA-seq. The default stance is: + +1. inspect markers first +2. run reference-based annotation +3. keep uncertainty explicit +4. 
export both raw predicted labels and a curated final label column + +## When To Use This Skill + +- clusters already exist and need biological labels +- the dataset has a relevant reference atlas or known marker panels +- the user wants CellTypist or similar automated annotation + +## Quick Route + +- If clusters are unstable or clearly QC-driven, fix preprocessing before annotation. +- If the atlas mismatch is severe, prefer broad lineage labels over overconfident fine labels. +- If multiple methods disagree, mark labels as uncertain instead of forcing a consensus. + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for strategy selection, confidence interpretation, and disagreement handling. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for concrete CellTypist code, score thresholds, and output columns. + +## Default Rules + +- Never accept automated labels without checking marker expression. +- Keep per-cell predictions and cluster-level curated labels separate. +- Use `Unknown`, `Uncertain`, or `Ambiguous` when evidence is weak. +- Document the reference model or atlas used. 
+ +## Expected Inputs + +- processed `h5ad` with clusters and embeddings +- marker gene lists or known lineage markers +- optional reference atlas or model + +## Expected Outputs + +- `results/annotated.h5ad` +- `results/cell_labels.tsv` +- `results/cluster_annotation_summary.tsv` +- `figures/umap_cell_types.pdf` +- `figures/marker_dotplot.pdf` + +## Preferred Tools + +- `scanpy` +- `celltypist` +- `pandas` +- `matplotlib` + +## Starter Pattern + +```python +import scanpy as sc +import celltypist + +adata = sc.read_h5ad("results/processed.h5ad") +pred = celltypist.annotate(adata, model="Immune_All_Low.pkl", majority_voting=True) +adata = pred.to_adata() +adata.obs["cell_type_raw"] = adata.obs["majority_voting"] +adata.obs["cell_type_confidence"] = adata.obs["conf_score"] +adata.write("results/annotated.h5ad") +``` + +## Workflow + +### 1. Inspect markers before automation + +Check canonical lineage markers on UMAP, dotplots, or heatmaps. If clusters do not support a plausible biological separation, do not lock in labels yet. + +### 2. Choose the annotation level + +- broad lineage labels when the reference is imperfect +- fine-grained labels only when markers and reference agree +- cluster-level labels for noisy or sparse datasets + +### 3. Run reference-based annotation + +Use CellTypist or another compatible reference transfer method. Store: + +- raw label +- confidence score +- model name + +### 4. Curate with markers and cluster context + +Review top markers per cluster and compare them against predicted labels. Rename or collapse labels if fine categories are not robust. + +### 5. 
Export both raw and final labels + +At minimum, keep: + +- `cell_type_raw` +- `cell_type_confidence` +- `cell_type_final` + +## Output Artifacts + +- `results/annotated.h5ad` +- `results/cell_labels.tsv` +- `results/cluster_annotation_summary.tsv` +- `figures/umap_cell_types.pdf` +- `figures/marker_dotplot.pdf` + +## Quality Review + +- `CellTypist conf_score > 0.5` is usually comfortable for a provisional label. +- `0.2-0.5` should be manually reviewed against markers. +- `< 0.2` should usually remain `Unknown` or `Uncertain` unless markers are compelling. +- Every final label should have either marker support, reference support, or both. + +## Anti-Patterns + +- assigning fine-grained labels only because the model returned them +- overwriting raw labels so the original prediction is lost +- treating low-confidence single-cell labels as publication-ready without review +- hiding disagreements between marker evidence and reference transfer + +## Related Skills + +- scRNA Preprocessing And Clustering +- Cell Communication +- Trajectory And Lineage + +## Optional Supplements + +- `scanpy` +- `scvi-tools` diff --git a/container/skills/cell-annotation/commands_and_thresholds.md b/container/skills/cell-annotation/commands_and_thresholds.md new file mode 100644 index 0000000..d67a167 --- /dev/null +++ b/container/skills/cell-annotation/commands_and_thresholds.md @@ -0,0 +1,50 @@ +# Cell Annotation Commands And Thresholds + +## CellTypist Example + +```python +import scanpy as sc +import celltypist + +adata = sc.read_h5ad("results/processed.h5ad") +model = "Immune_All_Low.pkl" +pred = celltypist.annotate(adata, model=model, majority_voting=True) +adata = pred.to_adata() + +adata.obs["cell_type_raw"] = adata.obs["majority_voting"] +adata.obs["cell_type_confidence"] = adata.obs["conf_score"] + +adata.obs["cell_type_final"] = adata.obs["cell_type_raw"] +adata.obs.loc[adata.obs["cell_type_confidence"] < 0.2, "cell_type_final"] = "Unknown" + 
+adata.write("results/annotated.h5ad") +adata.obs[["cell_type_raw", "cell_type_confidence", "cell_type_final"]].to_csv( + "results/cell_labels.tsv", + sep="\t", +) +``` + +## Plot Set + +```python +sc.pl.umap(adata, color=["leiden_r05", "cell_type_final", "cell_type_confidence"], save="_cell_types.pdf") +sc.pl.dotplot(adata, marker_dict, groupby="cell_type_final", save="_marker_dotplot.pdf") +``` + +## Threshold Defaults + +- high-confidence provisional label: `conf_score > 0.5` +- manual review zone: `0.2-0.5` +- default unknown zone: `< 0.2` + +## Output Convention + +```text +results/ +├── annotated.h5ad +├── cell_labels.tsv +└── cluster_annotation_summary.tsv +figures/ +├── umap_cell_types.pdf +└── marker_dotplot.pdf +``` diff --git a/container/skills/cell-annotation/technical_reference.md b/container/skills/cell-annotation/technical_reference.md new file mode 100644 index 0000000..3600e0e --- /dev/null +++ b/container/skills/cell-annotation/technical_reference.md @@ -0,0 +1,54 @@ +# Cell Annotation Technical Reference + +## Strategy Selection + +### Use broad labels when + +- the atlas is from a different tissue or disease context +- cluster markers support lineage identity but not subtype identity +- confidence scores are modest and fine categories disagree across methods + +### Use fine labels when + +- cluster markers support the proposed subtype +- the reference atlas is closely matched +- the confidence pattern is stable across neighboring cells + +## Confidence Interpretation + +For `celltypist`: + +| Confidence | Suggested action | +|---|---| +| `> 0.5` | keep as provisional label | +| `0.2-0.5` | review with markers and cluster context | +| `< 0.2` | usually mark `Unknown` or `Uncertain` | + +These are heuristics, not universal rules. + +## Minimum Marker Review + +For every major cluster, verify at least one positive marker and one exclusion marker when possible. 
+ +Examples: + +- T cell: `CD3D`, `IL7R` +- B cell: `MS4A1`, `CD79A` +- Monocyte: `LYZ`, `S100A8`, `FCGR3A` + +## Recommended Output Columns + +- `cell_type_raw` +- `cell_type_confidence` +- `cell_type_final` +- `annotation_notes` +- `annotation_model` + +## Failure Modes + +- label confidence is high but markers disagree + - likely atlas mismatch or cluster artifact +- every cluster gets a different fine immune subtype + - clustering may be too fine +- one cluster maps to multiple unrelated labels + - keep the cluster uncertain and revisit preprocessing diff --git a/container/skills/chip-seq/SKILL.md b/container/skills/chip-seq/SKILL.md new file mode 100644 index 0000000..5f440dc --- /dev/null +++ b/container/skills/chip-seq/SKILL.md @@ -0,0 +1,170 @@ +--- +name: chip-seq +description: ChIP-seq peak calling and downstream interpretation with MACS3, signal track export, annotation, motif analysis, and differential binding review. +tool_type: mixed +primary_tool: MACS3 +--- + +# ChIP Seq + +## Version Compatibility + +Reference examples assume: + +- `macs3` 3.0+ +- `samtools` 1.18+ +- `deepTools` 3.5+ + +Before using commands, verify the installed environment: + +- CLI: `macs3 --version`, `samtools --version`, `bamCoverage --version` +- If flags differ, inspect `--help` and adapt rather than forcing the example unchanged. + +## Overview + +Use this skill for: + +- narrow or broad peak calling +- input-normalized signal tracks +- peak annotation +- motif follow-up +- differential binding review when replicates exist + +## When To Use This Skill + +- the user has aligned ChIP and optional input BAM files +- the deliverable includes peaks, browser tracks, or motif results +- the assay is TF ChIP or histone-mark ChIP and needs standard peak-centric processing + +## Quick Route + +- TF or narrow marks: use narrow peak mode first. +- H3K27me3, H3K36me3, or other broad marks: use `--broad`. +- Paired-end BAM: prefer `-f BAMPE`. 
+- No input control: still possible, but report the limitation explicitly. + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for QC gates, narrow-versus-broad logic, and replicate handling. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for MACS3 commands, parameter defaults, and output file conventions. + +## Prerequisites + +| Requirement | Narrow TF-style | Broad histone-style | +|---|---:|---:| +| usable uniquely mapped reads | `>= 10M` | `>= 20M` | +| matched input recommended | yes | yes | +| biological replicates recommended | `>= 2` | `>= 2` | + +## Expected Inputs + +- `chip.bam` +- `input.bam` when available +- reference genome build +- chromosome sizes if bigWig export is needed + +## Expected Outputs + +- `results/peaks/sample_peaks.narrowPeak` or `.broadPeak` +- `results/peaks/sample_summits.bed` +- `results/tracks/sample_treat_pileup.bw` +- `results/annotation/peak_annotation.tsv` +- `qc/chip_qc_summary.tsv` + +## Starter Pattern + +```bash +macs3 callpeak \ + -t chip.bam \ + -c input.bam \ + -f BAMPE \ + -g hs \ + -n sample \ + -q 0.01 \ + --outdir results/peaks +``` + +## Key Parameters + +| Parameter | Typical value | Meaning | +|---|---|---| +| `-f` | `BAM` or `BAMPE` | paired-end should use `BAMPE` | +| `-g` | `hs`, `mm`, or numeric | effective genome size | +| `-q` | `0.01` or `0.05` | FDR cutoff for narrow peaks | +| `--broad` | broad marks only | broad peak mode | +| `--broad-cutoff` | `0.1` | broad-peak FDR cutoff | +| `-B --SPMR` | enabled for tracks | bedGraph for normalized signal | + +## Workflow + +### 1. Validate BAMs and replicate structure + +Check: + +- mapped read counts +- duplicate burden +- whether input control exists +- whether the mark is narrow or broad + +### 2. Call peaks with MACS3 + +- narrow marks: `-q 0.01` is a good starting point +- broad marks: use `--broad --broad-cutoff 0.1` +- paired-end: `-f BAMPE` + +### 3. 
Export signal tracks + +Use `-B --SPMR`, sort the resulting bedGraph, then convert to bigWig for browser use. + +### 4. Annotate and inspect peaks + +Map peaks to promoters, gene bodies, or distal intervals and review top loci in a genome browser or track plot. + +### 5. Run motif or differential follow-up + +Only after peak quality looks credible and replicate structure supports the downstream question. + +## Output Artifacts + +```text +results/ +├── peaks/ +│ ├── sample_peaks.narrowPeak +│ ├── sample_summits.bed +│ └── sample_model.r +├── tracks/ +│ ├── sample_treat_pileup.bdg +│ └── sample_treat_pileup.bw +└── annotation/ + └── peak_annotation.tsv +qc/ +└── chip_qc_summary.tsv +``` + +## Quality Review + +- TF ChIP-seq FRiP: + - `< 0.01` poor + - `0.01-0.05` usable but weak + - `> 0.05` generally solid +- Histone broad-mark FRiP often differs; compare within assay type rather than against TF expectations. +- Use replicate concordance when available. Do not trust a single noisy replicate just because peaks were called. +- Check that top peaks occur in plausible loci and not only blacklisted or artifactual regions. 
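The TF ChIP-seq FRiP bands in the list above can be applied mechanically once reads in peaks and total mapped reads are known. The sketch below is an illustration under that assumption: `frip` and `classify_tf_frip` are hypothetical helper names, the cutoffs are the ones quoted above, and they should not be reused for broad histone marks.

```python
def frip(reads_in_peaks: int, total_mapped_reads: int) -> float:
    """Fraction of reads in peaks: reads overlapping called peaks over all mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peaks / total_mapped_reads


def classify_tf_frip(value: float) -> str:
    """Map a TF ChIP-seq FRiP value to the bands listed in this skill."""
    if value < 0.01:
        return "poor"
    if value <= 0.05:
        return "usable but weak"
    return "generally solid"


# Example: 600k reads in peaks out of 20M mapped reads.
example_frip = frip(600_000, 20_000_000)
example_band = classify_tf_frip(example_frip)
```

Reporting the raw FRiP value next to the band keeps the call reviewable when a library sits close to a cutoff.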
+ +## Anti-Patterns + +- treating broad and narrow marks with the same peak-calling setup +- calling peaks on unsorted or low-quality BAMs +- presenting motif hits without showing peak quality +- hiding that no input control was available + +## Related Skills + +- ATAC Seq +- Methylation Analysis +- Gene Regulatory Networks + +## Optional Supplements + +- `deeptools` +- `pysam` diff --git a/container/skills/chip-seq/commands_and_thresholds.md b/container/skills/chip-seq/commands_and_thresholds.md new file mode 100644 index 0000000..b31b017 --- /dev/null +++ b/container/skills/chip-seq/commands_and_thresholds.md @@ -0,0 +1,67 @@ +# ChIP Seq Commands And Thresholds + +## Narrow Peak Calling + +```bash +macs3 callpeak \ + -t chip.bam \ + -c input.bam \ + -f BAMPE \ + -g hs \ + -n sample \ + -q 0.01 \ + --outdir results/peaks +``` + +## Broad Peak Calling + +```bash +macs3 callpeak \ + -t chip.bam \ + -c input.bam \ + -f BAMPE \ + -g hs \ + -n sample_broad \ + --broad \ + --broad-cutoff 0.1 \ + --outdir results/peaks +``` + +## Signal Track Export + +```bash +macs3 callpeak \ + -t chip.bam \ + -c input.bam \ + -f BAMPE \ + -g hs \ + -n sample \ + -B --SPMR \ + --outdir results/peaks + +mkdir -p results/tracks && LC_ALL=C sort -k1,1 -k2,2n results/peaks/sample_treat_pileup.bdg > results/tracks/sample.sorted.bdg +bedGraphToBigWig results/tracks/sample.sorted.bdg chrom.sizes results/tracks/sample_treat_pileup.bw +``` + +## Default Thresholds + +- narrow peak q-value: `0.01` +- fallback narrow peak q-value: `0.05` +- broad peak cutoff: `0.1` +- recommended unique mapped reads: + - narrow: `>= 10M` + - broad: `>= 20M` + +## Output Convention + +```text +results/ +├── peaks/ +│ ├── sample_peaks.narrowPeak or sample_broad_peaks.broadPeak +│ ├── sample_summits.bed +│ └── sample_model.r +└── tracks/ + └── sample_treat_pileup.bw +qc/ +└── chip_qc_summary.tsv +``` diff --git a/container/skills/chip-seq/technical_reference.md b/container/skills/chip-seq/technical_reference.md new file mode 100644 index
0000000..053bceb --- /dev/null +++ b/container/skills/chip-seq/technical_reference.md @@ -0,0 +1,54 @@ +# ChIP Seq Technical Reference + +## Narrow Versus Broad Marks + +### Narrow marks + +Examples: + +- TF ChIP +- H3K4me3 +- H3K27ac + +Use standard narrow peak mode and start with: + +- `-q 0.01` +- `-f BAMPE` for paired-end data + +### Broad marks + +Examples: + +- H3K27me3 +- H3K36me3 +- H3K9me3 + +Use: + +- `--broad` +- `--broad-cutoff 0.1` + +## QC Priorities + +- mapped unique reads +- duplication burden +- FRiP +- replicate concordance +- enrichment at biologically plausible loci + +## Practical Thresholds + +| Metric | Narrow mark guidance | +|---|---:| +| uniquely mapped reads | `>= 10M` | +| FRiP | `> 0.01` minimum, `> 0.05` better | +| replicates | `>= 2` preferred | + +For broad marks, read depth generally needs to be higher and FRiP values are not directly comparable to TF ChIP. + +## When To Escalate Review + +- very few peaks despite high read depth +- extremely many peaks in nearly every region +- strong enrichment only in blacklist-like regions +- motif or annotation results that do not fit the biology at all diff --git a/container/skills/differential-expression/SKILL.md b/container/skills/differential-expression/SKILL.md new file mode 100644 index 0000000..9124dc8 --- /dev/null +++ b/container/skills/differential-expression/SKILL.md @@ -0,0 +1,164 @@ +--- +name: differential-expression +description: Bulk transcriptomics differential expression with count-aware modeling, design validation, contrast handling, thresholded exports, and publication-ready DE figures. 
+tool_type: python +primary_tool: PyDESeq2 +--- + +# Differential Expression + +## Version Compatibility + +Reference examples assume: + +- `pydeseq2` 0.4+ +- `pandas` 2.2+ +- `numpy` 1.26+ +- `matplotlib` 3.8+ + +Verify before use: + +- Python: `python -c "import pydeseq2, pandas; print(pydeseq2.__version__, pandas.__version__)"` + +## Overview + +Use this skill for count-based DE from bulk RNA-seq or similar count matrices when the user needs: + +- robust model fitting +- explicit contrasts +- ranked gene tables +- volcano and MA plots +- pathway-ready output tables + +## When To Use This Skill + +- raw count matrix and sample metadata are available +- the task is condition, treatment, or genotype comparison +- batch or pairing terms may need explicit modeling + +## Quick Route + +- no replicates: do not pretend formal DE is robust +- 2 replicates per group: possible but conservative interpretation +- 3 or more replicates per group: standard starting point + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for design formulas, confounding checks, and contrast logic. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for PyDESeq2 code, recommended filters, and output file conventions. 
+ +## Prerequisites + +| Requirement | Recommendation | +|---|---:| +| minimum replicates per group | `>= 2` | +| preferred replicates per group | `>= 3` | +| input values | raw integer counts | + +## Expected Inputs + +- raw count matrix +- sample metadata +- explicit contrast such as treated vs control + +## Expected Outputs + +- `results/de_results.tsv` +- `results/de_ranked_genes.tsv` +- `figures/volcano.pdf` +- `figures/ma_plot.pdf` +- `qc/sample_pca.pdf` + +## Starter Pattern + +```python +from pydeseq2.dds import DeseqDataSet +from pydeseq2.ds import DeseqStats + +dds = DeseqDataSet( + counts=counts_df, + metadata=metadata_df, + design_factors=["condition", "batch"], +) +dds.deseq2() +stats = DeseqStats(dds, contrast=("condition", "treated", "control")) +stats.summary() +res = stats.results_df.sort_values("padj") +res.to_csv("results/de_results.tsv", sep="\t") +``` + +## Workflow + +### 1. Validate the design + +Check: + +- replicate counts +- factor levels +- batch balance +- paired structure +- confounded variables + +### 2. Fit a count-aware model + +Use raw counts, not TPM or log-normalized expression, for count-based DE frameworks. + +### 3. Apply explicit filtering and ranking + +Common reporting thresholds: + +- `padj < 0.05` +- `abs(log2FoldChange) >= 1` + +Export both the full table and a thresholded table. + +### 4. Visualize results + +At minimum: + +- sample PCA +- volcano plot +- MA plot + +### 5. Export pathway-ready artifacts + +Produce a ranked gene list sorted by signed effect or Wald statistic for enrichment workflows. 
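The design validation in step 1 of the workflow above can be partly automated before any model is fitted. The following is a minimal sketch under stated assumptions: `check_design` is a hypothetical helper, sample metadata is assumed to be a list of dicts with `condition` and `batch` keys, and "fully confounded" is taken to mean that no batch contains more than one condition.

```python
from collections import Counter, defaultdict


def check_design(samples, condition_key="condition", batch_key="batch", min_reps=2):
    """Summarize replicate counts and flag condition/batch confounding.

    samples: list of dicts, one per sample, holding at least the two keys above.
    """
    reps = Counter(s[condition_key] for s in samples)
    underreplicated = sorted(c for c, n in reps.items() if n < min_reps)

    conditions_per_batch = defaultdict(set)
    for s in samples:
        conditions_per_batch[s[batch_key]].add(s[condition_key])

    # If every batch holds a single condition, condition is determined by batch
    # and the two effects cannot be separated by the model.
    confounded = len(conditions_per_batch) > 1 and all(
        len(conds) == 1 for conds in conditions_per_batch.values()
    )
    return {
        "replicates": dict(reps),
        "underreplicated": underreplicated,
        "confounded": confounded,
    }


balanced = [
    {"sample": "s1", "condition": "control", "batch": "b1"},
    {"sample": "s2", "condition": "treated", "batch": "b1"},
    {"sample": "s3", "condition": "control", "batch": "b2"},
    {"sample": "s4", "condition": "treated", "batch": "b2"},
]
nested = [
    {"sample": "s1", "condition": "control", "batch": "b1"},
    {"sample": "s2", "condition": "control", "batch": "b1"},
    {"sample": "s3", "condition": "treated", "batch": "b2"},
]
balanced_report = check_design(balanced)
nested_report = check_design(nested)
```

A report with `confounded` set or a non-empty `underreplicated` list is a reason to pause and revisit the design rather than to proceed to fitting.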
+ +## Output Artifacts + +```text +results/ +├── de_results.tsv +├── de_significant.tsv +└── de_ranked_genes.tsv +figures/ +├── sample_pca.pdf +├── volcano.pdf +└── ma_plot.pdf +qc/ +└── design_check.tsv +``` + +## Quality Review + +- raw counts only for model fitting +- no fully confounded batch and condition +- outlier samples reviewed before publication claims +- all final tables should include `baseMean`, `log2FoldChange`, `pvalue`, and `padj` + +## Anti-Patterns + +- running DE on TPM as if it were count-based +- omitting batch or pairing terms that clearly exist +- showing only thresholded genes and hiding the full table +- using p-value alone without effect size + +## Related Skills + +- Bulk RNA Expression +- RNA Quantification +- Pathway Analysis + +## Optional Supplements + +- `pydeseq2` diff --git a/container/skills/differential-expression/commands_and_thresholds.md b/container/skills/differential-expression/commands_and_thresholds.md new file mode 100644 index 0000000..39c7f50 --- /dev/null +++ b/container/skills/differential-expression/commands_and_thresholds.md @@ -0,0 +1,44 @@ +# Differential Expression Commands And Thresholds + +## PyDESeq2 Example + +```python +from pydeseq2.dds import DeseqDataSet +from pydeseq2.ds import DeseqStats + +dds = DeseqDataSet( + counts=counts_df, + metadata=metadata_df, + design_factors=["condition", "batch"], +) +dds.deseq2() + +stats = DeseqStats(dds, contrast=("condition", "treated", "control")) +stats.summary() +res = stats.results_df + +res.to_csv("results/de_results.tsv", sep="\t") +sig = res[(res["padj"] < 0.05) & (res["log2FoldChange"].abs() >= 1)] +sig.to_csv("results/de_significant.tsv", sep="\t") +res.sort_values("stat", ascending=False).to_csv("results/de_ranked_genes.tsv", sep="\t") +``` + +## Recommended Thresholds + +- `padj < 0.05` +- `abs(log2FoldChange) >= 1` +- minimum replicates per group: `2` +- preferred replicates per group: `3` + +## Output Convention + +```text +results/ +├── de_results.tsv 
+├── de_significant.tsv +└── de_ranked_genes.tsv +figures/ +├── sample_pca.pdf +├── volcano.pdf +└── ma_plot.pdf +``` diff --git a/container/skills/differential-expression/technical_reference.md b/container/skills/differential-expression/technical_reference.md new file mode 100644 index 0000000..b29fa22 --- /dev/null +++ b/container/skills/differential-expression/technical_reference.md @@ -0,0 +1,33 @@ +# Differential Expression Technical Reference + +## Design Checks + +Verify before fitting: + +- each contrast group has at least 2 replicates +- batch is not perfectly confounded with condition +- paired samples are encoded explicitly +- factor reference level is correct + +## Default Reporting Thresholds + +| Metric | Common default | +|---|---:| +| adjusted p-value | `< 0.05` | +| absolute log2 fold-change | `>= 1` | + +Do not claim these are universal. Tighten or loosen only with justification. + +## Ranked Gene Export + +For enrichment, export either: + +- all genes ranked by Wald statistic +- all genes ranked by signed log2 fold-change with significance columns retained + +## Failure Modes + +- no significant genes and PCA shows weak separation + - likely biology is subtle or the design is underpowered +- many significant genes but batch drives PCA + - model specification or batch handling needs review diff --git a/container/skills/metagenomics/SKILL.md b/container/skills/metagenomics/SKILL.md new file mode 100644 index 0000000..66e737b --- /dev/null +++ b/container/skills/metagenomics/SKILL.md @@ -0,0 +1,145 @@ +--- +name: metagenomics +description: Shotgun metagenomics workflow with host-depletion-aware QC, taxonomic profiling, functional profiling, AMR follow-up, and reproducible community output tables. 
+tool_type: mixed +primary_tool: Kraken2 +--- + +# Metagenomics + +## Version Compatibility + +Reference examples assume: + +- `fastp` 0.23+ +- `kraken2` 2.1+ +- `bracken` 2.8+ +- `metaphlan` 4+ +- `humann` 3.9+ + +Verify the environment first: + +- CLI: `kraken2 --version`, `bracken -v`, `metaphlan --version`, `humann --version` + +## Overview + +Use this skill for shotgun metagenomics when the user needs: + +- QC and host depletion review +- taxonomic abundance tables +- functional pathway profiles +- AMR or strain-level follow-up + +## When To Use This Skill + +- the data are shotgun metagenomics rather than amplicon sequencing +- the user wants species or genus abundances, function, or resistance summaries +- multiple samples need cohort-level comparison + +## Quick Route + +- host-associated samples: perform host depletion before interpretation +- taxonomy only: `kraken2 + bracken` is a common pragmatic route +- function only or plus taxonomy: add `humann` +- strain claims require more evidence than top-level taxonomy calls + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for database choice, host contamination review, and functional profiling caveats. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for command-line patterns, thresholds, and output layout. 
+ +## Expected Inputs + +- paired or single-end metagenomic FASTQ +- sample metadata +- taxonomy and optional function databases + +## Expected Outputs + +- `results/taxonomy/bracken_species.tsv` +- `results/taxonomy/bracken_genus.tsv` +- `results/function/pathabundance.tsv` +- `results/amr/amr_summary.tsv` +- `qc/read_processing_summary.tsv` + +## Starter Pattern + +```bash +fastp \ + -i sample_R1.fastq.gz \ + -I sample_R2.fastq.gz \ + -o qc/sample.clean.R1.fastq.gz \ + -O qc/sample.clean.R2.fastq.gz \ + --html qc/sample.fastp.html \ + --json qc/sample.fastp.json + +kraken2 \ + --db $KRAKEN_DB \ + --paired qc/sample.clean.R1.fastq.gz qc/sample.clean.R2.fastq.gz \ + --report results/taxonomy/sample.kraken.report \ + --output results/taxonomy/sample.kraken.out \ + --confidence 0.1 +``` + +## Workflow + +### 1. Run read QC and optional host depletion + +At minimum, inspect read quality, adapter content, and retained reads. For host-associated samples, remove host reads before community interpretation. + +### 2. Profile taxonomy + +Use a k-mer or marker-based profiler. Document the database and version because abundance results depend strongly on the reference. + +### 3. Refine abundance tables + +Convert raw classification to species or genus abundance tables suitable for cohort comparison. + +### 4. Add function or AMR when requested + +Run pathway or AMR profiling only after confirming taxonomic QC and read retention are reasonable. + +### 5. Export cohort-ready outputs + +Save per-sample tables and merged matrices with clear metadata joins. 
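The cohort-ready export step can be sketched as a merge of per-sample abundance tables into one taxon-by-sample matrix. This is a minimal example, assuming tab-separated Bracken-style tables with `name` and `fraction_total_reads` columns; `merge_bracken` is a hypothetical helper, and the toy read counts are invented.

```python
import csv
import io


def merge_bracken(tables, value_column="fraction_total_reads"):
    """Merge per-sample tables into a taxon-by-sample matrix.

    `tables` maps sample ID -> text of one tab-separated abundance table.
    Taxa absent from a sample are filled with 0.0.
    """
    per_sample = {}
    for sample, text in tables.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        per_sample[sample] = {row["name"]: float(row[value_column]) for row in reader}
    taxa = sorted({taxon for values in per_sample.values() for taxon in values})
    samples = sorted(per_sample)
    matrix = {t: [per_sample[s].get(t, 0.0) for s in samples] for t in taxa}
    return samples, matrix


# Toy tables; column layout is assumed, values are invented for illustration.
header = "\t".join([
    "name", "taxonomy_id", "taxonomy_lvl", "kraken_assigned_reads",
    "added_reads", "new_est_reads", "fraction_total_reads",
])
s1 = "\n".join([
    header,
    "\t".join(["Escherichia coli", "562", "S", "700", "100", "800", "0.8"]),
    "\t".join(["Bacteroides fragilis", "817", "S", "180", "20", "200", "0.2"]),
])
s2 = "\n".join([
    header,
    "\t".join(["Escherichia coli", "562", "S", "450", "50", "500", "0.5"]),
])
samples, matrix = merge_bracken({"s1": s1, "s2": s2})
print(samples)
print(matrix)
```

Filling absent taxa with zero is a convenience for matrix construction; whether a zero means true absence or insufficient depth still has to be stated when the table is interpreted.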
+ +## Output Artifacts + +```text +results/ +├── taxonomy/ +│ ├── sample.kraken.report +│ ├── bracken_species.tsv +│ └── bracken_genus.tsv +├── function/ +│ └── pathabundance.tsv +└── amr/ + └── amr_summary.tsv +qc/ +├── read_processing_summary.tsv +└── sample.fastp.html +``` + +## Quality Review + +- retained reads after QC should be reported explicitly +- host-associated samples with large host contamination need a clear host depletion statement +- avoid over-interpreting taxa with extremely low abundance +- abundance comparisons should state whether values are relative abundance, counts, or normalized function estimates + +## Anti-Patterns + +- comparing outputs from different databases as if they were directly interchangeable +- making strain-level claims from genus-level evidence +- ignoring host contamination in human-associated or plant-associated samples +- mixing taxonomy-only and pathway outputs without clarifying what each table means + +## Related Skills + +- Microbiome Amplicon +- Pathogen Epidemiological Genomics +- Phylogenetics + +## Optional Supplements + +- `scikit-bio` diff --git a/container/skills/metagenomics/commands_and_thresholds.md b/container/skills/metagenomics/commands_and_thresholds.md new file mode 100644 index 0000000..1b0ee43 --- /dev/null +++ b/container/skills/metagenomics/commands_and_thresholds.md @@ -0,0 +1,51 @@ +# Metagenomics Commands And Thresholds + +## QC And Taxonomy + +```bash +fastp \ + -i sample_R1.fastq.gz \ + -I sample_R2.fastq.gz \ + -o qc/sample.clean.R1.fastq.gz \ + -O qc/sample.clean.R2.fastq.gz \ + --html qc/sample.fastp.html \ + --json qc/sample.fastp.json + +kraken2 \ + --db $KRAKEN_DB \ + --paired qc/sample.clean.R1.fastq.gz qc/sample.clean.R2.fastq.gz \ + --report results/taxonomy/sample.kraken.report \ + --output results/taxonomy/sample.kraken.out \ + --confidence 0.1 +``` + +## Bracken Refinement + +```bash +bracken \ + -d $KRAKEN_DB \ + -i results/taxonomy/sample.kraken.report \ + -o 
results/taxonomy/sample.bracken.species.tsv \ + -r 150 \ + -l S +``` + +## Common Defaults + +- `kraken2 --confidence 0.1` as a practical first pass +- report retained reads after QC and after host depletion +- do not emphasize taxa with vanishing abundance without a reason + +## Output Convention + +```text +results/ +├── taxonomy/ +│ ├── sample.kraken.report +│ ├── sample.bracken.species.tsv +│ └── bracken_species.tsv +├── function/pathabundance.tsv +└── amr/amr_summary.tsv +qc/ +└── read_processing_summary.tsv +``` diff --git a/container/skills/metagenomics/technical_reference.md b/container/skills/metagenomics/technical_reference.md new file mode 100644 index 0000000..c240d83 --- /dev/null +++ b/container/skills/metagenomics/technical_reference.md @@ -0,0 +1,39 @@ +# Metagenomics Technical Reference + +## Taxonomy Strategy + +### `kraken2 + bracken` + +Use when: + +- you want a practical species or genus abundance table +- speed matters +- a broad classification database is available + +### `MetaPhlAn` + +Use when: + +- marker-based profiling is preferred +- lower false-positive behavior is more important than broad k-mer sensitivity + +## Host Depletion Guidance + +For host-associated samples: + +- remove host reads before abundance interpretation +- report pre- and post-depletion read counts +- if host reads dominate, say so clearly in the final summary + +## Practical Cautions + +- very low-abundance taxa are unstable +- database choice strongly changes taxonomic output +- strain tracking should not be claimed from simple species tables + +## Minimum QC Reporting + +- raw reads +- retained reads after QC +- retained reads after host depletion if used +- database version diff --git a/container/skills/proteomics/SKILL.md b/container/skills/proteomics/SKILL.md new file mode 100644 index 0000000..25c0483 --- /dev/null +++ b/container/skills/proteomics/SKILL.md @@ -0,0 +1,139 @@ +--- +name: proteomics +description: Mass spectrometry proteomics QC, quantification, 
comparative analysis, and export for DDA, DIA, and protein-level result tables. +tool_type: python +primary_tool: pyopenms +--- + +# Proteomics + +## Version Compatibility + +Reference examples assume: + +- `pyopenms` 3.0+ +- `pandas` 2.2+ +- `numpy` 1.26+ +- `seaborn` 0.13+ + +## Overview + +Use this skill when the user needs: + +- proteomics QC +- protein table cleanup +- replicate review +- differential abundance analysis +- publication-ready proteomics figures + +## When To Use This Skill + +- MaxQuant, FragPipe, DIA-NN, or similar outputs exist +- the task is protein-level quantification or comparative proteomics +- missingness, batch effects, and replicate quality need review before interpretation + +## Quick Route + +- DDA and DIA should not be treated identically +- protein-level tables should remain distinct from peptide-level tables +- QC comes before differential analysis + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for assay branching, QC interpretation, and missingness handling. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for table-loading patterns, QC thresholds, and output conventions. + +## Expected Inputs + +- protein or peptide result table +- sample metadata +- assay context: DDA, DIA, PTM-enriched, or targeted + +## Expected Outputs + +- `results/protein_abundance.tsv` +- `qc/proteomics_qc_summary.tsv` +- `figures/correlation_heatmap.pdf` +- `figures/missingness.pdf` +- `results/differential_proteins.tsv` + +## Starter Pattern + +```python +import pandas as pd + +protein_df = pd.read_csv("protein_groups.tsv", sep="\t") +sample_cols = [c for c in protein_df.columns if c.startswith("LFQ intensity")] +matrix = protein_df[sample_cols].replace(0, pd.NA) +qc = pd.DataFrame({ + "n_proteins": matrix.notna().sum(), + "missing_pct": matrix.isna().mean() * 100, +}) +qc.to_csv("qc/proteomics_qc_summary.tsv", sep="\t") +``` + +## Workflow + +### 1. 
Clarify assay and table level + +- DDA versus DIA +- peptide versus protein table +- PTM-enriched versus unenriched data + +### 2. Run QC before comparisons + +Inspect: + +- missingness +- replicate correlation +- batch effects +- intensity distributions + +### 3. Normalize and summarize consistently + +Keep the normalization approach explicit and do not collapse peptides into proteins without documenting the rule. + +### 4. Perform comparative analysis + +Use replicate-aware differential abundance with clear filtering and missingness policy. + +### 5. Export interpretable artifacts + +Save both the cleaned abundance matrix and the differential results table. + +## Output Artifacts + +```text +results/ +├── protein_abundance.tsv +└── differential_proteins.tsv +qc/ +└── proteomics_qc_summary.tsv +figures/ +├── correlation_heatmap.pdf +├── missingness.pdf +└── intensity_density.pdf +``` + +## Quality Review + +- overall missingness `> 30%` should trigger caution +- technical replicate correlation should usually be `> 0.9` +- biological replicate correlation much below `0.8` deserves review +- do not trust differential calls before batch structure and missingness are understood + +## Anti-Patterns + +- mixing peptide and protein tables in one downstream matrix +- running differential abundance before QC +- ignoring missingness patterns +- hiding whether values are raw, normalized, or imputed + +## Related Skills + +- Metabolomics +- Structural Biology + +## Optional Supplements + +- `pyopenms` diff --git a/container/skills/proteomics/commands_and_thresholds.md b/container/skills/proteomics/commands_and_thresholds.md new file mode 100644 index 0000000..18f0db1 --- /dev/null +++ b/container/skills/proteomics/commands_and_thresholds.md @@ -0,0 +1,31 @@ +# Proteomics Commands And Thresholds + +## Protein Table Loading + +```python +import pandas as pd + +protein_df = pd.read_csv("protein_groups.tsv", sep="\t") +sample_cols = [c for c in protein_df.columns if 
c.startswith("LFQ intensity")] +matrix = protein_df[sample_cols].replace(0, pd.NA) +``` + +## QC Defaults + +- overall missingness: `< 30%` +- technical replicate correlation: `> 0.9` +- biological replicate correlation: `> 0.8` + +## Output Convention + +```text +results/ +├── protein_abundance.tsv +└── differential_proteins.tsv +qc/ +└── proteomics_qc_summary.tsv +figures/ +├── correlation_heatmap.pdf +├── missingness.pdf +└── intensity_density.pdf +``` diff --git a/container/skills/proteomics/technical_reference.md b/container/skills/proteomics/technical_reference.md new file mode 100644 index 0000000..c20da9e --- /dev/null +++ b/container/skills/proteomics/technical_reference.md @@ -0,0 +1,35 @@ +# Proteomics Technical Reference + +## Assay Branching + +### DDA + +- identification completeness may be lower +- missingness often needs careful handling + +### DIA + +- often more complete matrices +- still review batch and library effects + +### PTM-enriched + +- do not interpret as global proteome abundance +- keep PTM site-level results separate from protein-level abundance + +## QC Thresholds + +| Metric | Rule of thumb | +|---|---| +| overall missingness | `< 30%` preferred | +| technical replicate correlation | `> 0.9` | +| biological replicate correlation | usually `> 0.8` | + +## Failure Modes + +- one sample has much higher missingness + - likely technical outlier +- strong group separation but also strong batch separation + - revisit normalization and design +- peptide and protein tables disagree strongly + - check aggregation and protein inference assumptions diff --git a/container/skills/scrna-preprocessing-clustering/SKILL.md b/container/skills/scrna-preprocessing-clustering/SKILL.md new file mode 100644 index 0000000..f13bc44 --- /dev/null +++ b/container/skills/scrna-preprocessing-clustering/SKILL.md @@ -0,0 +1,195 @@ +--- +name: scrna-preprocessing-clustering +description: Standard scRNA-seq preprocessing and clustering with Scanpy. 
Use for QC, normalization, HVG selection, PCA, neighbor graph construction, UMAP, Leiden clustering, and export of an analysis-ready AnnData object. +tool_type: python +primary_tool: scanpy +--- + +# scRNA Preprocessing And Clustering + +## Version Compatibility + +Reference examples assume: + +- `scanpy` 1.10+ +- `anndata` 0.10+ +- `pandas` 2.2+ +- `matplotlib` 3.8+ + +Before using code patterns, verify installed versions match the environment: + +- Python: `python -c "import scanpy, anndata; print(scanpy.__version__, anndata.__version__)"` +- If signatures differ, inspect the installed API and adapt the pattern instead of retrying unchanged. + +## Overview + +Use this skill to turn raw or minimally processed scRNA-seq data into an analysis-ready object with: + +- QC-filtered cells and genes +- normalized expression values +- highly variable genes +- PCA and UMAP embeddings +- Leiden clusters +- saved `h5ad` artifact for annotation, DE, integration, or trajectory analysis + +## When To Use This Skill + +- raw 10x matrices, filtered count matrices, or `h5ad` inputs need standard preprocessing +- the user wants UMAP, clustering, or marker discovery +- downstream tasks depend on a stable single-cell object rather than ad hoc plots + +## Quick Route + +- If the input is already a processed `h5ad`, inspect `adata.raw`, embeddings, cluster columns, and QC columns before rerunning preprocessing. +- If the input is raw counts, do QC first and only normalize after filtering obvious low-quality cells. +- If multiple batches are present, preprocess cleanly first, then consider integration instead of hiding batch effects with aggressive filtering. + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for QC decision rules, assay caveats, and integration branching. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for concrete Scanpy code, default thresholds, and output conventions. 
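The Quick Route advice to inspect an existing `h5ad` before rerunning preprocessing can be sketched as a small summary function. This is an illustration under stated assumptions: `describe_processing_state` is a hypothetical helper, it only reads the `raw`, `layers`, `obsm`, and `obs` attributes that AnnData exposes, and matching cluster columns by the substrings `leiden` or `louvain` is a naming assumption rather than a rule. A stand-in object is used so the sketch runs without `anndata` installed.

```python
from types import SimpleNamespace

QC_COLUMNS = {"n_genes_by_counts", "total_counts", "pct_counts_mt"}


def describe_processing_state(adata):
    """Summarize what an AnnData-like object already contains."""
    obs_columns = list(adata.obs.columns)
    return {
        "has_raw": adata.raw is not None,
        "layers": sorted(adata.layers.keys()),
        "embeddings": sorted(adata.obsm.keys()),
        # Column-name matching is a heuristic assumption, not an AnnData convention.
        "cluster_columns": [c for c in obs_columns if "leiden" in c or "louvain" in c],
        "qc_columns": [c for c in obs_columns if c in QC_COLUMNS],
    }


# Stand-in object for illustration only; a real call would pass a loaded AnnData.
stub = SimpleNamespace(
    raw=object(),
    layers={"counts": None},
    obsm={"X_pca": None, "X_umap": None},
    obs=SimpleNamespace(columns=["sample_id", "pct_counts_mt", "leiden_r05"]),
)
state = describe_processing_state(stub)
print(state)
```

If the summary already shows embeddings and cluster columns, the object has probably been processed, and rerunning normalization on it would be the first anti-pattern to rule out.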
+ +## Default Rules + +- Keep raw counts recoverable. Prefer `adata.raw = adata.copy()` before regression or scaling. +- Report thresholds explicitly. Do not silently drop cells or genes. +- Show QC distributions before applying hard filters. +- Use vector outputs such as `.pdf` or `.svg` for final figures when possible. + +## Expected Inputs + +- 10x directory, `.h5`, `.h5ad`, or count matrix +- cell metadata if available +- species context for mitochondrial or ribosomal gene detection + +## Expected Outputs + +- `results/processed.h5ad` +- `qc/cell_qc_metrics.tsv` +- `qc/gene_qc_metrics.tsv` +- `figures/qc_violin.pdf` +- `figures/pca_variance_ratio.pdf` +- `figures/umap_leiden.pdf` + +## Preferred Tools + +- `scanpy` +- `anndata` +- `pandas` +- `matplotlib` +- `seaborn` + +## Starter Pattern + +```python +import scanpy as sc + +adata = sc.read_10x_mtx("counts/") +adata.var_names_make_unique() +adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-") +sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True) + +adata = adata[ + (adata.obs["n_genes_by_counts"] >= 200) + & (adata.obs["n_genes_by_counts"] <= 6000) + & (adata.obs["pct_counts_mt"] < 15), + : +].copy() + +sc.pp.filter_genes(adata, min_cells=3) +sc.pp.highly_variable_genes(adata, n_top_genes=3000, flavor="seurat_v3")  # seurat_v3 expects raw counts, so run it before normalization +sc.pp.normalize_total(adata, target_sum=1e4) +sc.pp.log1p(adata) +adata.raw = adata.copy() + +adata = adata[:, adata.var["highly_variable"]].copy() +sc.pp.scale(adata, max_value=10) +sc.tl.pca(adata, svd_solver="arpack") +sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30) +sc.tl.umap(adata) +sc.tl.leiden(adata, resolution=0.5, key_added="leiden_r05") +adata.write("results/processed.h5ad") +``` + +## Workflow + +### 1. Load and validate the object + +- confirm orientation is cells by genes +- make gene names unique +- record sample IDs and batch labels before merging or filtering + +### 2. 
Compute QC metrics and inspect distributions + +- `n_genes_by_counts` +- `total_counts` +- `pct_counts_mt` +- optional ribosomal or hemoglobin fractions + +Plot distributions before filtering. Thresholds vary by chemistry, tissue, and nucleus versus whole-cell assay. + +### 3. Filter cells and genes + +Use dataset-aware thresholds. Good first-pass defaults: + +- `min_genes >= 200` +- `max_genes <= 5000-8000` to remove likely doublets in many droplet datasets +- `pct_counts_mt < 10-20` depending on tissue stress +- `min_cells >= 3` for genes + +### 4. Normalize, log-transform, and select HVGs + +- normalize with `target_sum=1e4` +- `log1p` +- select `2000-4000` HVGs +- save raw counts before heavy transformations + +### 5. Reduce dimensions and cluster + +- PCA on HVGs +- neighbor graph using `10-30` PCs and `10-30` neighbors as a starting range +- UMAP for visualization +- Leiden across a small resolution grid such as `0.2`, `0.5`, `0.8`, `1.0` + +### 6. Export analysis-ready artifacts + +Always save: + +- processed `h5ad` +- QC tables +- cluster assignments +- publication-ready QC and UMAP figures + +## Output Artifacts + +- `results/processed.h5ad`: main reusable AnnData object +- `results/cluster_assignments.tsv`: barcode plus cluster labels +- `qc/filter_summary.tsv`: counts before and after filtering +- `figures/umap_leiden.pdf`: main embedding figure + +## Quality Review + +- Median genes per cell should be plausible for the chemistry and tissue. +- Mitochondrial fraction should not dominate retained cells. +- PCA variance should decay smoothly rather than showing obvious technical axes only. +- UMAP should be reviewed together with QC metrics and batch labels, not alone. +- Cluster labels should not be finalized before marker inspection. 
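The `qc/filter_summary.tsv` artifact and the rule to report thresholds explicitly can be illustrated with a small counting routine. This is a minimal sketch, assuming per-cell records with `n_genes_by_counts` and `pct_counts_mt` fields and the first-pass cutoffs of 200 genes, 6000 genes, and 15% mitochondrial reads; `filter_summary` is a hypothetical helper and the toy cells are invented.

```python
def filter_summary(cells, min_genes=200, max_genes=6000, max_pct_mt=15.0):
    """Count how many cells fail each QC criterion and how many are retained.

    A cell can fail more than one criterion, so failures need not sum to the
    number of removed cells.
    """
    summary = {
        "cells_before": len(cells),
        "fail_min_genes": 0,
        "fail_max_genes": 0,
        "fail_pct_mt": 0,
        "cells_after": 0,
    }
    for cell in cells:
        too_few = cell["n_genes_by_counts"] < min_genes
        too_many = cell["n_genes_by_counts"] > max_genes
        high_mito = cell["pct_counts_mt"] >= max_pct_mt
        summary["fail_min_genes"] += too_few
        summary["fail_max_genes"] += too_many
        summary["fail_pct_mt"] += high_mito
        if not (too_few or too_many or high_mito):
            summary["cells_after"] += 1
    return summary


# Toy per-cell QC values, invented for illustration.
cells = [
    {"n_genes_by_counts": 150, "pct_counts_mt": 5.0},
    {"n_genes_by_counts": 3000, "pct_counts_mt": 4.0},
    {"n_genes_by_counts": 7000, "pct_counts_mt": 8.0},
    {"n_genes_by_counts": 2500, "pct_counts_mt": 22.0},
    {"n_genes_by_counts": 100, "pct_counts_mt": 30.0},
    {"n_genes_by_counts": 4000, "pct_counts_mt": 14.9},
]
summary = filter_summary(cells)
print(summary)
```

Reporting per-criterion failures alongside the before and after totals makes it visible which threshold is doing most of the filtering, which helps when cutoffs need tuning for a given tissue or chemistry.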
+ +## Anti-Patterns + +- reprocessing an already integrated object as if it were raw counts +- using a single universal mitochondrial threshold for every tissue +- interpreting UMAP separation as biology before checking batch and QC covariates +- discarding raw counts needed later for DE or pseudobulk + +## Related Skills + +- Cell Annotation +- Cell Communication +- Trajectory And Lineage +- Multiome And scATAC + +## Optional Supplements + +- `anndata` +- `scanpy` diff --git a/container/skills/scrna-preprocessing-clustering/commands_and_thresholds.md b/container/skills/scrna-preprocessing-clustering/commands_and_thresholds.md new file mode 100644 index 0000000..12c48b6 --- /dev/null +++ b/container/skills/scrna-preprocessing-clustering/commands_and_thresholds.md @@ -0,0 +1,69 @@ +# scRNA Preprocessing And Clustering Commands And Thresholds + +## Canonical Scanpy Flow + +```python +import scanpy as sc + +adata = sc.read_h5ad("input.h5ad") +adata.var_names_make_unique() +adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-") +sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True) + +sc.pl.violin( + adata, + ["n_genes_by_counts", "total_counts", "pct_counts_mt"], + jitter=0.4, + multi_panel=True, + save="_qc_violin.pdf",  # scanpy prefixes the plot type: figures/violin_qc_violin.pdf +) + +adata = adata[ + (adata.obs["n_genes_by_counts"] >= 200) + & (adata.obs["n_genes_by_counts"] <= 6000) + & (adata.obs["pct_counts_mt"] < 15), + : +].copy() + +sc.pp.filter_genes(adata, min_cells=3) +sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=3000)  # seurat_v3 expects raw counts, so run it before normalization +sc.pp.normalize_total(adata, target_sum=1e4) +sc.pp.log1p(adata) +adata.raw = adata.copy() + +adata = adata[:, adata.var["highly_variable"]].copy() +sc.pp.scale(adata, max_value=10) +sc.tl.pca(adata, svd_solver="arpack") +sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30) +sc.tl.umap(adata) + +for res in [0.2, 0.5, 0.8, 1.0]: + sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}") + +adata.write("results/processed.h5ad") +``` + +## Threshold 
Defaults + +- `min_genes`: `200` +- `max_genes`: `6000` +- `pct_counts_mt`: `15` +- `n_top_genes`: `3000` +- `n_neighbors`: `15` +- `n_pcs`: `30` +- first-pass Leiden resolution: `0.5` + +## Output Convention + +```text +results/ +├── processed.h5ad +├── cluster_assignments.tsv +qc/ +├── filter_summary.tsv +├── cell_qc_metrics.tsv +figures/ +├── qc_violin.pdf +├── pca_variance_ratio.pdf +└── umap_leiden_0.5.pdf +``` diff --git a/container/skills/scrna-preprocessing-clustering/technical_reference.md b/container/skills/scrna-preprocessing-clustering/technical_reference.md new file mode 100644 index 0000000..1d1c266 --- /dev/null +++ b/container/skills/scrna-preprocessing-clustering/technical_reference.md @@ -0,0 +1,79 @@ +# scRNA Preprocessing And Clustering Technical Reference + +## Purpose + +Use this file when the main skill is not enough to choose thresholds or handle nonstandard inputs. + +## Input Branching + +### Raw 10x droplet data + +- Start with raw counts. +- Calculate QC before normalization. +- Use droplet-aware cutoffs and consider ambient RNA correction only if contamination is obvious. + +### Existing `h5ad` + +- Inspect: + - `adata.raw` + - `adata.layers` + - existing embeddings in `adata.obsm` + - cluster columns in `adata.obs` +- Avoid rerunning normalization if the object already stores processed values and raw counts separately. + +### Multi-sample merged object + +- Keep `sample_id` and `batch` columns. +- Filter within sample if one sample is much lower quality than the others. +- Do not use a single hard upper gene cutoff across all samples if chemistry differs. 
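The guidance for merged objects, filtering within a sample when one sample is much lower quality, presupposes a way to spot such a sample. One possible sketch follows; it is an assumption-laden illustration rather than a recommended rule: `flag_low_quality_samples` is a hypothetical helper, and flagging samples whose median genes per cell falls below half the cohort median is an arbitrary cutoff chosen only for the example.

```python
from collections import defaultdict
from statistics import median


def flag_low_quality_samples(cells, fraction=0.5):
    """Flag samples whose median genes per cell is far below the cohort.

    `cells` is an iterable of (sample_id, n_genes_by_counts) pairs.
    `fraction` is an illustrative cutoff, not an established default.
    """
    by_sample = defaultdict(list)
    for sample_id, n_genes in cells:
        by_sample[sample_id].append(n_genes)
    medians = {sample: median(values) for sample, values in by_sample.items()}
    # Median of per-sample medians, so large samples do not dominate.
    cohort_median = median(medians.values())
    flagged = sorted(s for s, m in medians.items() if m < fraction * cohort_median)
    return medians, flagged


# Toy (sample_id, n_genes) pairs, invented for illustration.
cells = [
    ("s1", 2000), ("s1", 2200), ("s1", 2400),
    ("s2", 1900), ("s2", 2100), ("s2", 2300),
    ("s3", 600), ("s3", 700), ("s3", 800),
]
medians, flagged = flag_low_quality_samples(cells)
print(medians)
print(flagged)
```

A flagged sample is a prompt to review its QC distributions separately, not an automatic exclusion; differences in chemistry or tissue can also shift the median.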
+ +## Threshold Heuristics + +### Common first-pass thresholds + +| Metric | Typical starting range | Notes | +|---|---:|---| +| `min_genes` | 200-500 | Raise for high-depth data | +| `max_genes` | 5000-8000 | Helps catch doublets, assay-dependent | +| `pct_counts_mt` | < 10-20% | Tissue- and assay-dependent | +| `min_cells per gene` | 3-10 | Higher for larger cohorts | +| HVGs | 2000-4000 | 3000 is a common default | +| PCs | 20-50 | Depends on cohort size and complexity | +| neighbors | 10-30 | Smaller for cleaner manifolds | + +### Nucleus data + +- Expect lower mitochondrial fraction. +- Expect lower genes per nucleus than whole-cell data. +- Avoid over-filtering low-RNA nuclei. + +### Stressed or dissociated tissues + +- Mito fraction may be elevated. +- Review hemoglobin or stress markers separately before discarding large fractions of cells. + +## Recommended Plot Set + +- QC violin for `n_genes_by_counts`, `total_counts`, `pct_counts_mt` +- scatter of `total_counts` vs `n_genes_by_counts` +- PCA variance ratio +- UMAP colored by sample, batch, QC metrics, and cluster + +## Failure Modes + +- UMAP mostly tracks `pct_counts_mt` + - likely under-filtered low-quality cells +- clusters split by batch only + - integration or batch-aware modeling is needed +- every cluster has very similar markers + - clustering may be too fine or PCs poorly chosen + +## Export Guidance + +Minimum recommended files: + +- `results/processed.h5ad` +- `results/cluster_assignments.tsv` +- `qc/filter_summary.tsv` +- `figures/qc_violin.pdf` +- `figures/umap_by_batch.pdf` diff --git a/container/skills/structural-biology/SKILL.md b/container/skills/structural-biology/SKILL.md new file mode 100644 index 0000000..e153a19 --- /dev/null +++ b/container/skills/structural-biology/SKILL.md @@ -0,0 +1,153 @@ +--- +name: structural-biology +description: Structure retrieval, confidence-aware AlphaFold DB usage, coordinate download, PAE and pLDDT interpretation, and structure-guided biological 
annotation. +tool_type: python +primary_tool: AlphaFold DB +--- + +# Structural Biology + +## Version Compatibility + +Reference examples assume: + +- `biopython` 1.84+ +- AlphaFold DB public API current format +- optional visualization stack such as `py3Dmol` or PyMOL + +Verify before use: + +- Python: `python -c "import Bio; print(Bio.__version__)"` + +## Overview + +Use this skill when the task is: + +- retrieving AlphaFold-predicted structures by UniProt accession +- downloading coordinate and confidence files +- reading pLDDT or PAE to judge confidence +- mapping sequence findings onto structure + +## When To Use This Skill + +- a UniProt accession or known protein target exists +- experimental structure is absent or incomplete +- the user needs confidence-aware structural interpretation + +## Quick Route + +- known UniProt accession: query AlphaFold DB first +- novel designed sequence without AlphaFold DB entry: use a separate prediction workflow such as ColabFold +- structure interpretation request: always inspect pLDDT and PAE before making mechanistic claims + +## Progressive Disclosure + +- Read [technical_reference.md](technical_reference.md) for confidence interpretation and source-selection rules. +- Read [commands_and_thresholds.md](commands_and_thresholds.md) for AlphaFold DB retrieval patterns, URL layouts, and file conventions. 
+ +## Expected Inputs + +- UniProt accession or sequence context +- optional residue list, mutation list, or ligand site hypothesis + +## Expected Outputs + +- `results/structures/AF-.cif` +- `results/structures/AF-.pdb` +- `results/confidence/AF--confidence.json` +- `results/confidence/AF--pae.json` +- `figures/AF--pae.png` + +## Starter Pattern + +```python +from Bio.PDB import alphafold_db + +prediction = next(alphafold_db.get_predictions("P00520")) +cif_path = alphafold_db.download_cif_for(prediction, directory="results/structures") +print(cif_path) +``` + +## Confidence Thresholds + +### pLDDT + +| pLDDT | Interpretation | +|---|---| +| `> 90` | very high confidence | +| `70-90` | good backbone confidence | +| `50-70` | low confidence | +| `< 50` | likely disorder or unreliable local structure | + +### PAE + +| PAE | Interpretation | +|---|---| +| `< 5 Å` | confident relative positioning | +| `5-15 Å` | moderate uncertainty | +| `> 15 Å` | domain orientation may be unreliable | + +## Workflow + +### 1. Choose the structure source + +- experimental structure if available and suitable +- AlphaFold DB for known proteins with UniProt accessions +- separate prediction workflow for novel sequences + +### 2. Retrieve coordinates and confidence files + +Download: + +- `mmCIF` or `PDB` +- confidence JSON +- PAE JSON + +### 3. Inspect confidence before interpretation + +Do not map mutations or infer interfaces from low-confidence regions without saying so. + +### 4. Annotate the biological question + +Map domains, active sites, mutations, motifs, or interfaces onto the structure. + +### 5. Export reusable artifacts + +Save coordinates, confidence files, and a PAE heatmap or equivalent summary. 
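The pLDDT table can be applied per residue to summarize how much of a model falls in each confidence band before any interpretation. This is a minimal sketch, assuming a flat list of per-residue pLDDT scores has already been extracted from the confidence file; `plddt_band` and `band_fractions` are hypothetical helpers, and the toy scores are invented.

```python
from collections import Counter

BANDS = ["very high", "good backbone", "low", "likely disordered or unreliable"]


def plddt_band(score):
    """Assign one pLDDT score to a band from the confidence table."""
    if score > 90:
        return "very high"
    if score >= 70:
        return "good backbone"
    if score >= 50:
        return "low"
    return "likely disordered or unreliable"


def band_fractions(scores):
    """Return the fraction of residues in each band, including empty bands."""
    counts = Counter(plddt_band(s) for s in scores)
    return {band: counts.get(band, 0) / len(scores) for band in BANDS}


# Toy per-residue scores, invented for illustration.
scores = [95.2, 91.0, 88.4, 72.0, 65.5, 48.0, 30.1, 90.0]
fractions = band_fractions(scores)
print(fractions)
```

A score of exactly 90 falls in the `70-90` band here, following the `> 90` boundary in the table; residues in the two lowest bands are the ones to label explicitly before mapping mutations or interfaces.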
+ +## Output Artifacts + +```text +results/ +├── structures/ +│ ├── AF-P00520-F1-model_v4.cif +│ └── AF-P00520-F1-model_v4.pdb +└── confidence/ + ├── AF-P00520-F1-confidence_v4.json + └── AF-P00520-F1-predicted_aligned_error_v4.json +figures/ +└── AF-P00520-F1-pae.png +``` + +## Quality Review + +- pLDDT must be reviewed before claiming local residue geometry is trustworthy +- PAE must be reviewed before claiming domain-domain arrangement is trustworthy +- residue numbering and chain mapping must be checked before mutation interpretation +- low-confidence or disordered regions should be labeled explicitly + +## Anti-Patterns + +- treating every AlphaFold region as equally reliable +- ignoring PAE when discussing domain orientation +- mapping variants onto mismatched residue numbering +- using AlphaFold DB retrieval as if it were de novo prediction for novel sequences + +## Related Skills + +- Proteomics +- Pathway Analysis + +## Optional Supplements + +- `alphafold-database` diff --git a/container/skills/structural-biology/commands_and_thresholds.md b/container/skills/structural-biology/commands_and_thresholds.md new file mode 100644 index 0000000..77b0c82 --- /dev/null +++ b/container/skills/structural-biology/commands_and_thresholds.md @@ -0,0 +1,44 @@ +# Structural Biology Commands And Thresholds + +## AlphaFold DB Retrieval With Biopython + +```python +from Bio.PDB import alphafold_db + +pred = next(alphafold_db.get_predictions("P00520")) +cif_path = alphafold_db.download_cif_for(pred, directory="results/structures") +``` + +## Direct API Query + +```python +import requests + +resp = requests.get("https://alphafold.ebi.ac.uk/api/prediction/P00520") +data = resp.json() +print(data[0]["entryId"]) +print(data[0]["cifUrl"]) +print(data[0]["pdbUrl"]) +``` + +## Confidence Thresholds + +- pLDDT `> 90`: very high confidence +- pLDDT `70-90`: usable backbone confidence +- pLDDT `< 50`: likely unreliable or disordered +- PAE `< 5 Å`: confident relative positioning +- 
PAE `> 15 Å`: relative orientation uncertain + +## Output Convention + +```text +results/ +├── structures/ +│ ├── AF--model_v4.cif +│ └── AF--model_v4.pdb +└── confidence/ + ├── AF--confidence_v4.json + └── AF--predicted_aligned_error_v4.json +figures/ +└── AF--pae.png +``` diff --git a/container/skills/structural-biology/technical_reference.md b/container/skills/structural-biology/technical_reference.md new file mode 100644 index 0000000..fbf4740 --- /dev/null +++ b/container/skills/structural-biology/technical_reference.md @@ -0,0 +1,35 @@ +# Structural Biology Technical Reference + +## Source Selection + +### Use AlphaFold DB when + +- the target has a UniProt accession +- a predicted structure is acceptable +- the goal is rapid structure-guided interpretation + +### Use a prediction workflow instead when + +- the sequence is novel or designed +- no AlphaFold DB entry exists + +## Confidence Interpretation + +### pLDDT + +- `> 90`: highly reliable local geometry +- `70-90`: generally useful backbone confidence +- `50-70`: low confidence +- `< 50`: often disordered or not structurally reliable + +### PAE + +- low PAE means relative positions are trustworthy +- high PAE means domain orientation may be uncertain even if local folds look good + +## Common Failure Modes + +- high pLDDT within domains but high inter-domain PAE + - local folds may be fine but domain arrangement is uncertain +- residue numbering mismatch + - mutation or site mapping becomes wrong even if the structure file is correct
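The first failure mode, confident domains with an uncertain arrangement, can be checked numerically by comparing within-domain and between-domain PAE. This is a minimal sketch, assuming the PAE matrix has already been loaded as a square list of lists in Å and that domain boundaries are known as 0-based residue index ranges; `mean_pae` and `domain_pae_report` are hypothetical helpers, and the 4x4 matrix is a toy example.

```python
def mean_pae(pae, rows, cols):
    """Average the PAE values over the block defined by rows and cols."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def domain_pae_report(pae, domain_a, domain_b, uncertain_above=15.0):
    """Compare within-domain and between-domain PAE for two domains.

    PAE is not symmetric, so the between-domain value averages both directions.
    """
    between = (mean_pae(pae, domain_a, domain_b) + mean_pae(pae, domain_b, domain_a)) / 2
    return {
        "within_a": mean_pae(pae, domain_a, domain_a),
        "within_b": mean_pae(pae, domain_b, domain_b),
        "between": between,
        "orientation_uncertain": between > uncertain_above,
    }


# Toy 4-residue PAE matrix in Å: residues 0-1 form domain A, residues 2-3 domain B.
pae = [
    [0, 2, 20, 22],
    [2, 0, 18, 20],
    [21, 19, 0, 3],
    [23, 21, 3, 0],
]
report = domain_pae_report(pae, range(0, 2), range(2, 4))
print(report)
```

Low within-domain values combined with a between-domain value above 15 Å correspond to the case where local folds may be fine but the relative domain arrangement should not be interpreted.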