Hi-C k-mer enrichment around interesting regions

This repository contains example scripts used to:

- Extract read mates from Hi-C BAM files in a given genomic interval (both mapped and unmapped mates).
- Scan a scaffold with sliding windows and extract unmapped mates whose partners map inside each window (negative controls / random regions).
- Count and inspect k-mer frequencies with Jellyfish.
- Plot k-mer histograms for interesting vs random regions in R.

The code is meant as a reproducible example rather than a general-purpose pipeline: you will need to adapt file paths, scaffold names, and coordinates to your own data. In practice, we often rely on visual assessment of the data in IGV, Juicebox, UGENE, and simple text editors, and on GNU core utilities (awk, sort, uniq, head, etc.) for data manipulation.

Dependencies

- samtools
- jellyfish
- GNU core utilities (awk, sort, uniq, head, etc.)
- R with packages: dplyr, purrr

Scripts

`scripts/00_region_mapped_mates_pipeline.sh`

Example pipeline for a single genomic interval:

- selects reads in a given coordinate range from a SAM file
- restores SAM header and converts to BAM
- filters paired reads by SAM flags
- sorts and indexes BAM
- (optionally) subsamples reads
- converts to FASTA
- runs Jellyfish k-mer counting and produces histogram and dump

This script was used to generate a positive control: mapped mates in the interesting region.
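
The steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository script itself: the function names `select_region_sam` and `region_pipeline`, all file names, the flag filter (`-f 1 -F 12`, i.e. paired with both read and mate mapped), the subsampling seed and fraction, and the Jellyfish options `-s 100M -C` are hypothetical choices.

```bash
#!/usr/bin/env bash
set -euo pipefail

# select_region_sam SCAFFOLD START END < in.sam
# Keeps SAM header lines (so the header survives for BAM conversion) and
# alignments whose leftmost position (column 4) lies in [START, END] on
# SCAFFOLD (column 3).
select_region_sam() {
  awk -F '\t' -v s="$1" -v a="$2" -v b="$3" \
    '/^@/ || ($3 == s && $4 + 0 >= a + 0 && $4 + 0 <= b + 0)'
}

# region_pipeline IN_SAM SCAFFOLD START END PREFIX [K]
# Hypothetical wrapper; requires samtools and jellyfish on PATH.
region_pipeline() {
  local in_sam=$1 scaffold=$2 start=$3 end=$4 prefix=$5 k=${6:-80}

  # Select the interval (header kept) and convert to BAM.
  select_region_sam "$scaffold" "$start" "$end" < "$in_sam" \
    | samtools view -b -o "$prefix.region.bam" -

  # Assumed filter: paired (0x1), read and mate both mapped (-F 12 = 0x4|0x8).
  samtools view -b -f 1 -F 12 -o "$prefix.paired.bam" "$prefix.region.bam"

  # Sort and index.
  samtools sort -o "$prefix.sorted.bam" "$prefix.paired.bam"
  samtools index "$prefix.sorted.bam"

  # Optional subsampling (assumed values): seed 42, keep about 10% of pairs.
  # samtools view -b -s 42.1 -o "$prefix.sub.bam" "$prefix.sorted.bam"

  # Convert to FASTA.
  samtools fasta "$prefix.sorted.bam" > "$prefix.fa"

  # Count k-mers, then write the histogram and the full dump.
  jellyfish count -m "$k" -s 100M -C -o "$prefix.jf" "$prefix.fa"
  jellyfish histo "$prefix.jf" > "$prefix.histo"
  jellyfish dump "$prefix.jf" > "$prefix.dump.fa"
}

# Example call (hypothetical file, scaffold, and coordinates):
#   region_pipeline hic.sam scaffold_1 100000 200000 int_reg 80
```

Keeping the header lines in the awk filter is one way to avoid a separate header-restoration step; if your selection drops the header, it has to be re-attached before `samtools view -b`.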

`scripts/01_window_unmapped_mates.sh`

Scans a scaffold with sliding windows and, for each window:

- Collects the names of reads located in the window from a pre-filtered BAM (mate_unmapped).
- From the full BAM, selects reads with these names whose mates are unmapped.
- Exports them as FASTA.

This script was used both for the target region and for multiple random regions (negative controls).
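
A rough sketch of the window scan, assuming both BAM files are coordinate-sorted and indexed. The function names `make_windows` and `window_unmapped_mates`, the window and step sizes, and the file names are hypothetical. The second step is interpreted here as selecting the records flagged read-unmapped (`-f 4`), that is, the unmapped mates themselves; adjust the flag if your script keeps the mapped partner instead.

```bash
#!/usr/bin/env bash
set -euo pipefail

# make_windows SCAFFOLD LENGTH WINDOW STEP
# Prints 1-based, inclusive regions ("scaffold:start-end") covering the
# scaffold; the last window is clipped to the scaffold length.
make_windows() {
  awk -v s="$1" -v len="$2" -v w="$3" -v st="$4" 'BEGIN {
    for (start = 1; start <= len; start += st) {
      end = start + w - 1
      if (end > len) end = len
      printf "%s:%d-%d\n", s, start, end
      if (end == len) break
    }
  }'
}

# window_unmapped_mates FULL_BAM MATE_UNMAPPED_BAM REGION OUT_FASTA
# Requires samtools (the -N read-name filter needs a reasonably recent version).
window_unmapped_mates() {
  local full_bam=$1 mu_bam=$2 region=$3 out_fa=$4
  local names
  names=$(mktemp)

  # Names of reads that map inside the window in the pre-filtered BAM.
  samtools view "$mu_bam" "$region" | cut -f 1 | sort -u > "$names"

  # Same names in the full BAM, restricted to unmapped records (assumption),
  # exported as FASTA.
  samtools view -b -N "$names" -f 4 "$full_bam" | samtools fasta - > "$out_fa"

  rm -f "$names"
}

# Example loop (hypothetical names and sizes):
#   make_windows scaffold_1 1000000 10000 5000 | while read -r region; do
#     window_unmapped_mates full.bam mate_unmapped.bam "$region" \
#       "unmapped_${region//[:-]/_}.fa"
#   done
```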

`scripts/02_kmer_count_and_dump.sh`

For every FASTA file in a directory:

- runs jellyfish count (k-mer length configurable, e.g. k=80)
- saves a k-mer histogram
- dumps all k-mers with counts
- dumps only high-copy k-mers with count ≥ threshold (e.g. ≥80)
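
The per-file loop might look like the sketch below. The function name `count_and_dump_dir`, the `.fa` input extension, the separate output directory, and the `-s 100M -C` options (initial hash size, canonical k-mers) are assumptions made for illustration; only `jellyfish count`, `histo`, and `dump -L` are taken from the description above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# count_and_dump_dir IN_DIR OUT_DIR [K] [MIN_COUNT]
# Runs Jellyfish on every *.fa file in IN_DIR and writes results to OUT_DIR,
# so that dumped FASTA files are not picked up as inputs on a rerun.
count_and_dump_dir() {
  local in_dir=$1 out_dir=$2 k=${3:-80} min_count=${4:-80}
  local fa prefix
  mkdir -p "$out_dir"
  for fa in "$in_dir"/*.fa; do
    [ -e "$fa" ] || continue                 # no *.fa files in IN_DIR
    prefix="$out_dir/$(basename "$fa" .fa)"
    jellyfish count -m "$k" -s 100M -C -o "$prefix.jf" "$fa"
    jellyfish histo "$prefix.jf" > "$prefix.histo"          # k-mer histogram
    jellyfish dump "$prefix.jf" > "$prefix.all.fa"          # all k-mers
    jellyfish dump -L "$min_count" "$prefix.jf" > "$prefix.highcopy.fa"
  done
}

# Example call (hypothetical directories):
#   count_and_dump_dir fasta/ kmers/ 80 80
```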

Used to:

- determine that the interesting region differs from negative controls around a specific k-mer frequency (~60 in our case),
- inspect sequences repeated ≥80 times.
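
For a quick look at one frequency bin before plotting, the histogram files can be compared with awk. This is a small sketch, assuming the two-column `jellyfish histo` output (frequency, number of distinct k-mers); the function name `histo_at` and the file names are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# histo_at FREQ FILE...
# For each histogram file, prints "<file><TAB><number of distinct k-mers that
# occur exactly FREQ times>"; a missing bin is reported as 0.
histo_at() {
  local freq=$1
  shift
  local f
  for f in "$@"; do
    printf '%s\t%s\n' "$f" \
      "$(awk -v q="$freq" '$1 == q { n = $2 } END { print n + 0 }' "$f")"
  done
}

# Example call (hypothetical files):
#   histo_at 60 int_reg_149.histo random_*.histo
```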

`analysis/01_kmer_histograms_IR1_IR2.R`

R script that:

- reads Jellyfish histograms for an interesting region, positive control (mapped mates), and multiple negative controls (random regions),
- computes median and ±1 SD envelopes across negative controls,
- plots log-scaled k-mer histograms:
  - black: unmapped mates in the interesting region,
  - blue: median and ±1 SD of random regions,
  - red: mapped mates in the interesting region.

Two example blocks are provided for two interesting regions (IR_1 and IR_2).
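
The envelope computation can also be checked outside R. The sketch below is not the R script: it is an awk re-implementation of the same summary, with a hypothetical function name `envelope_across_controls`. It assumes two-column `jellyfish histo` files, treats a frequency that is absent from a control as a count of 0, and uses the sample standard deviation (n - 1 denominator), which is what R's `sd()` computes.

```bash
#!/usr/bin/env bash
set -euo pipefail

# envelope_across_controls FILE...
# Prints, per k-mer frequency: frequency, median, median - SD, median + SD
# (tab-separated), computed across the given negative-control histograms.
envelope_across_controls() {
  awk '
    FNR == 1 { nfiles++ }
    NF >= 2  { vals[$1] = vals[$1] " " $2 }
    END {
      for (f in vals) {
        n = split(vals[f], raw, " ")
        # One value per control; controls lacking this bin contribute 0.
        for (i = 1; i <= nfiles; i++) v[i] = (i <= n) ? raw[i] + 0 : 0
        # Insertion sort, needed for the median.
        for (i = 2; i <= nfiles; i++) {
          key = v[i]
          for (j = i - 1; j >= 1 && v[j] > key; j--) v[j + 1] = v[j]
          v[j + 1] = key
        }
        if (nfiles % 2) median = v[(nfiles + 1) / 2]
        else median = (v[nfiles / 2] + v[nfiles / 2 + 1]) / 2
        sum = 0
        for (i = 1; i <= nfiles; i++) sum += v[i]
        mean = sum / nfiles
        ss = 0
        for (i = 1; i <= nfiles; i++) ss += (v[i] - mean) ^ 2
        sd = (nfiles > 1) ? sqrt(ss / (nfiles - 1)) : 0
        printf "%d\t%g\t%g\t%g\n", f, median, median - sd, median + sd
      }
    }
  ' "$@" | sort -n
}

# Example call (hypothetical files):
#   envelope_across_controls random_*.histo > random_envelope.tsv
```

On a log-scaled axis the lower bound needs care: `median - SD` can be zero or negative, and such values cannot be drawn on a log scale without clipping.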

Manual step: UGENE

To investigate highly repeated k-mers (e.g. ≥80 copies), we:

Dump high-copy k-mers from Jellyfish:

jellyfish dump -L 80 int_reg_149.jf > int_reg_149.highcopy.fa
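
Before opening the dump in UGENE, it can help to rank the k-mers by copy number. The sketch below assumes Jellyfish's default FASTA-style dump, where each header line is `>count` and the next line is the k-mer; the function name `rank_dump` and the output file name are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# rank_dump < dump.fa
# Converts ">count / k-mer" pairs to "count<TAB>k-mer" lines, sorted by
# descending count (ties broken by sequence).
rank_dump() {
  awk '/^>/ { count = substr($0, 2); next } { print count "\t" $0 }' \
    | sort -k1,1nr -k2,2
}

# Example call:
#   rank_dump < int_reg_149.highcopy.fa | head -n 20
```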
