Perturb-seq pipeline

The documentation below uses the K562 CRISPR screen configuration (analysis/config.yann_k562.yaml) as a concrete example.

Quick Start

# 1. Clone the repository
git clone [repository_url]
cd perturb_pipeline

# 2. Activate the existing kb conda environment
conda activate /home/groups/engreitz/Users/tony/anaconda3/envs/kb
# Note: The kb environment is pre-installed with all necessary dependencies
# If you need to set up your own environment, contact the maintainer for the environment.yml file

# 3. Copy and modify the example config
cp analysis/config.yann_k562.yaml analysis/config.my_experiment.yaml
# Edit analysis/config.my_experiment.yaml with your paths and parameters

# 4. Prepare your input files (see Required Input Files section)

# 5. Run the pipeline
cd analysis
# Dry run to check what will be executed
CONFIG=config.my_experiment.yaml ./submit.sh --dry-run
# Submit using slurm
CONFIG=config.my_experiment.yaml ./submit.sh

Required Input Files

Before running the pipeline, you need to prepare the following files.

Example files: See analysis/references_yann_k562/ for complete examples of all required input files from a real K562 CRISPR screen experiment.

1. Sample Information Excel File (`sample_info.xlsx`)

Purpose: Defines all samples, their properties, and relationships
Location: Referenced in config as sample_info_file

Required columns (must be present for all samples):

pool: Pool identifier (e.g., "1k_pilot", "pool1", "pool2")
sample: Sample name within the pool (e.g., "sample1_gex", "sample1_grna")
sample_id: Unique identifier combining pool and sample (e.g., "1k_pilot:sample1_gex")
sample_type: Either "gex" (gene expression) or "guide" (guide RNA)
fastq_dir: Directory containing FASTQ files for this sample
expected_cells: Expected number of cells for this library (e.g., 5000)
min_umi_threshold: Minimum UMI count threshold for cell calling (e.g., 100)
sample_to_well_mapping: Name of the plate mapping sheet (e.g., "plate1")
paired_guide_sample_id: Full sample_id of the paired guide library (fill in for GEX samples; leave empty for guide samples)
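
The column rules above can be checked before submitting a run. The following is a minimal sketch, not part of the pipeline: `check_sample_rows` is a hypothetical helper, and it assumes that sample_id is formed as "<pool>:<sample>" (inferred from the example "1k_pilot:sample1_gex") and that the rows have already been loaded as dictionaries. The fastq_dir value is a placeholder.

```python
REQUIRED_COLUMNS = [
    "pool", "sample", "sample_id", "sample_type", "fastq_dir",
    "expected_cells", "min_umi_threshold", "sample_to_well_mapping",
    "paired_guide_sample_id",
]


def check_sample_rows(rows):
    """Return a list of human-readable problems found in sample_info rows."""
    problems = []
    guide_ids = {r.get("sample_id") for r in rows if r.get("sample_type") == "guide"}
    seen = set()
    for i, row in enumerate(rows, start=1):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        sid = row["sample_id"]
        if sid in seen:
            problems.append(f"row {i}: duplicate sample_id {sid!r}")
        seen.add(sid)
        # Assumption: sample_id is "<pool>:<sample>", as in the README example.
        if sid != f"{row['pool']}:{row['sample']}":
            problems.append(f"row {i}: sample_id {sid!r} is not pool:sample")
        if row["sample_type"] not in ("gex", "guide"):
            problems.append(f"row {i}: sample_type must be 'gex' or 'guide'")
        paired = row["paired_guide_sample_id"]
        if row["sample_type"] == "gex" and paired and paired not in guide_ids:
            problems.append(f"row {i}: paired_guide_sample_id {paired!r} is not a guide sample")
        if row["sample_type"] == "guide" and paired:
            problems.append(f"row {i}: paired_guide_sample_id should be empty for guide samples")
    return problems


rows = [
    {"pool": "1k_pilot", "sample": "sample1_gex", "sample_id": "1k_pilot:sample1_gex",
     "sample_type": "gex", "fastq_dir": "/path/to/fastq", "expected_cells": 5000,
     "min_umi_threshold": 100, "sample_to_well_mapping": "plate1",
     "paired_guide_sample_id": "1k_pilot:sample1_grna"},
    {"pool": "1k_pilot", "sample": "sample1_grna", "sample_id": "1k_pilot:sample1_grna",
     "sample_type": "guide", "fastq_dir": "/path/to/fastq", "expected_cells": 5000,
     "min_umi_threshold": 100, "sample_to_well_mapping": "plate1",
     "paired_guide_sample_id": ""},
]
print(check_sample_rows(rows))  # prints []
```

If the sheet is loaded with pandas, rows could come from `pandas.read_excel(...).fillna("").to_dict("records")`; the `fillna` matters because empty cells are otherwise read as NaN rather than empty strings.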

Optional columns (required only for specific features):

For undetermined read recovery:

i7_barcode: i7 index sequence (8bp, e.g., "AAGGCTAT")
i5_barcode: i5 index sequence (8bp, e.g., "AAACATCG")
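
A quick format check for the two index columns might look like the sketch below. `is_valid_index` is a hypothetical helper, and restricting the alphabet to A, C, G, and T is an assumption; the pipeline's own validation, if any, is not documented here.

```python
import re

# Assumption: an index is exactly 8 bases drawn from A, C, G, T.
INDEX_RE = re.compile(r"[ACGT]{8}")


def is_valid_index(barcode):
    """Return True if the barcode is an 8bp A/C/G/T string (case-insensitive)."""
    return INDEX_RE.fullmatch(barcode.strip().upper()) is not None


print(is_valid_index("AAGGCTAT"))  # prints True
print(is_valid_index("AAGGCTA"))   # prints False (only 7bp)
```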

Columns not used by pipeline (can be included for documentation):

notes: Optional notes about the sample
read1_length: Length of Read 1 (typically 150)
read2_length: Length of Read 2 (typically 150)
paired_guide_pool: Pool of paired guide (derived from paired_guide_sample_id)
paired_guide_sample: Sample name of paired guide (not used)

2. Plate Mapping Excel File (`plate_map.xlsx`)

Purpose: Maps well positions to biological samples
Location: Referenced in config as plate_maps_file

Structure:

Each sheet represents one plate; the sheet name must match the sample_to_well_mapping value used in sample_info.xlsx
Required columns per sheet:
- Well Position: Well identifier (A1-H12 for 96-well plate)
- Sample: Biological sample name

Optional columns:

Rep: Replicate identifier (stored as metadata but not used by pipeline)
Any other columns will be stored as cell metadata in the AnnData object
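
As a rough illustration of the sheet structure, the sketch below validates well identifiers for a 96-well plate and builds a well-to-sample lookup. `parse_well` and `wells_to_samples` are hypothetical helpers, the sample names are made up, and rejecting zero-padded wells such as "A01" is an assumption based on the "A1-H12" range given above.

```python
import re

# Rows A-H and columns 1-12 of a 96-well plate; zero-padded forms are not accepted.
WELL_RE = re.compile(r"([A-H])(1[0-2]|[1-9])")


def parse_well(well):
    """Split a well identifier such as 'H12' into (row letter, column number)."""
    match = WELL_RE.fullmatch(well.strip().upper())
    if match is None:
        raise ValueError(f"not a 96-well position: {well!r}")
    return match.group(1), int(match.group(2))


def wells_to_samples(rows):
    """Build {well: sample} from rows with 'Well Position' and 'Sample' keys."""
    mapping = {}
    for row in rows:
        letter, column = parse_well(row["Well Position"])
        key = f"{letter}{column}"
        if key in mapping:
            raise ValueError(f"well {key} is listed more than once")
        mapping[key] = row["Sample"]
    return mapping


sheet = [
    {"Well Position": "A1", "Sample": "K562_rep1"},
    {"Well Position": "A2", "Sample": "K562_rep2"},
]
print(wells_to_samples(sheet))  # prints {'A1': 'K562_rep1', 'A2': 'K562_rep2'}
```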

3. Guide Library File (`guides.txt`)

Purpose: Defines all guide RNA sequences for kallisto indexing
Location: Referenced in config as guide_file

Format: Tab-separated text file with columns:

Column 1: Guide sequence (typically 20-21bp)
Column 2: Guide ID/name
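
The two-column layout can be sanity-checked with a few lines of Python. This is a sketch under stated assumptions: `read_guides` is a hypothetical helper, the file is assumed to have no header row, and the sequences and IDs in the example are invented.

```python
import csv
import io


def read_guides(handle):
    """Read 'sequence<TAB>guide ID' lines into a {guide_id: sequence} dict."""
    guides = {}
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        sequence, guide_id = fields[0].strip().upper(), fields[1].strip()
        if not sequence or set(sequence) - set("ACGT"):
            raise ValueError(f"line {line_no}: invalid guide sequence {fields[0]!r}")
        if guide_id in guides:
            raise ValueError(f"line {line_no}: duplicate guide ID {guide_id!r}")
        guides[guide_id] = sequence
    return guides


example = "ACGTACGTACGTACGTACGT\tguide_1\nTTTTGGGGCCCCAAAATTTT\tguide_2\n"
guides = read_guides(io.StringIO(example))
print(sorted(guides))  # prints ['guide_1', 'guide_2']
```

For a real file, the handle would come from `open(path, newline="")` with the path taken from guide_file in the config.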

4. Guide Reference File (`guides_qc_reference.txt`)

Purpose: Maps guide IDs to target genes for QC analysis
Location: Referenced in config as guide_reference

Required columns (tab-separated):

ID: Guide identifier (must match IDs in guides.txt)
gene: Target gene name(s); for multi-gene targets, join the gene names with the separator

Optional columns:

separator: Character to split multi-gene targets (defaults to 'AND')
Any other columns are ignored
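
To illustrate how the reference table relates to guides.txt, the sketch below reads the table, splits multi-gene targets, and reports IDs that appear in only one of the two files. `read_guide_reference` and `unmatched_ids` are hypothetical helpers; the sketch assumes the file has a header row naming its columns, and it uses a plain substring split, which may differ from how the pipeline itself splits targets. The guide IDs and gene pairing are example values only.

```python
import csv
import io


def read_guide_reference(handle):
    """Map guide ID -> list of target genes from a tab-separated table with a header."""
    targets = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Fall back to the documented default when the column is absent or empty.
        separator = row.get("separator") or "AND"
        # Caveat: a plain split would also cut a gene name that contains the separator.
        targets[row["ID"]] = [g.strip() for g in row["gene"].split(separator) if g.strip()]
    return targets


def unmatched_ids(reference_ids, library_ids):
    """Return (IDs only in the reference, IDs only in the guide library)."""
    reference, library = set(reference_ids), set(library_ids)
    return sorted(reference - library), sorted(library - reference)


table = "ID\tgene\nguide_1\tGATA1\nguide_2\tGATA1ANDTAL1\n"
targets = read_guide_reference(io.StringIO(table))
print(targets)  # prints {'guide_1': ['GATA1'], 'guide_2': ['GATA1', 'TAL1']}
print(unmatched_ids(targets, ["guide_1", "guide_3"]))  # prints (['guide_2'], ['guide_3'])
```

Both lists returned by `unmatched_ids` should be empty when the reference and the guide library agree.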
