ngalioto/arch3d

ARCH3D

A foundation model for global genome architecture

Download

The model can be downloaded from https://huggingface.co/ngalioto/ARCH3D.

Install

python -m pip install --upgrade pip
python -m pip install .

Scripts

toeplitz_normalize.py

Perform observed/expected normalization on a cooler file with toeplitz_normalize.py

Parameters:

  • file_name (str): Path to mcool file
  • save_dir (str): Directory to save the normalized matrix
  • save_name (str): Output filename
  • weights_dir (str, optional): Directory to save pertinent normalization values
  • resolution (int): Resolution at which to normalize (default: 5000)
  • balance (bool): If True, normalize balanced Hi-C; otherwise normalize raw Hi-C (default: True)

Outputs:

  • {save_dir}/{save_name}.npz (scipy.sparse.csr_matrix): The normalized matrix for chromosomes 1-22

The following files are saved only if the argument weights_dir is passed:

  • {weights_dir}/{save_name}_pixels.npy (numpy.ndarray): Array holding the sum of all pixels along each diagonal. The array index corresponds to the diagonal offset.
  • {weights_dir}/{save_name}_counts.npy (numpy.ndarray): Array holding the number of nonzero pixels along each diagonal. The array index corresponds to the diagonal offset.

Example:

python digitalcell/data/toeplitz_normalize.py \
    "file_name" \
    "save_dir" \
    --save_name "save_name" \
    --weights_dir "weights_dir" \
    --resolution 5000 \
    --balance True
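
For intuition, the observed/expected idea can be sketched on a small dense matrix. This is a minimal illustration, not the code in toeplitz_normalize.py: the function name `toeplitz_normalize_dense` is hypothetical, the input is assumed to be symmetric, and the expected value of each diagonal is assumed to be the pixel sum divided by the nonzero count, mirroring the two arrays described under Outputs.

```python
import numpy as np

def toeplitz_normalize_dense(matrix):
    """Divide each diagonal of a symmetric matrix by its mean nonzero value.

    Hypothetical helper for illustration only; the real script works on
    cooler files and sparse matrices.
    """
    n = matrix.shape[0]
    pixels = np.zeros(n)  # sum of all pixels along each diagonal offset
    counts = np.zeros(n)  # number of nonzero pixels along each diagonal offset
    normalized = np.zeros_like(matrix, dtype=float)
    for offset in range(n):
        diag = np.diagonal(matrix, offset=offset)
        pixels[offset] = diag.sum()
        counts[offset] = np.count_nonzero(diag)
        if counts[offset] > 0:
            # Assumption: expected contact = mean over nonzero pixels.
            expected = pixels[offset] / counts[offset]
            idx = np.arange(n - offset)
            normalized[idx, idx + offset] = diag / expected  # upper triangle
            normalized[idx + offset, idx] = diag / expected  # mirrored lower triangle
    return normalized, pixels, counts

contacts = np.array([[4.0, 2.0, 0.0],
                     [2.0, 8.0, 6.0],
                     [0.0, 6.0, 6.0]])
oe, pixels, counts = toeplitz_normalize_dense(contacts)
```

The expected matrix is constant along each diagonal, which is the Toeplitz structure the script name refers to.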

pretrain.py

Pre-train ARCH3D from scratch with pretrain.py

Parameters:

  • config (str): Path to the configuration file

Example:

python digitalcell/scripts/pretrain.py \
    --config "/path/to/config"

generate_embeddings.py

The steps to generate pre-trained embeddings from a Hi-C experiment are the following:

  1. Create an mcool file with the desired resolution
  2. Perform observed/expected normalization on the Hi-C experiment using toeplitz_normalize.py
  3. Generate the embeddings with generate_embeddings.py

Parameters:

  • ckpt_path (str): Path to the HiCT model checkpoint
  • data_file (str): Path to the normalized Hi-C data file (the .npz output of toeplitz_normalize.py)
  • resolution (int): Resolution for binning the data
  • save_dir (str): Directory for saving the embeddings
  • shuffle (bool, optional): Shuffle loci before generating embeddings (default: True)

Outputs:

  • {save_dir}/embeddings.pt (torch.Tensor): The generated embeddings
  • {save_dir}/mappable_idx.pt (torch.Tensor): Boolean mask of indices where the rows are not all zeros
  • {save_dir}/chromosomes.pt (torch.Tensor): Chromosome assignments (zero-indexed)
  • {save_dir}/start_bp.pt (torch.Tensor): Starting base pair positions (zero-indexed and in Mbp)

Example:

python digitalcell/scripts/generate_embeddings.py \
    --ckpt_path "path/to/checkpoint" \
    --data_file "path/to/data.npz" \
    --resolution 5000 \
    --save_dir "embeddings_output" \
    --shuffle True
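
The mappable_idx output is described as a mask over rows that are not all zeros. A minimal sketch of such a mask on a toy dense matrix is shown here; the variable names are illustrative and this is an assumption about the definition, not the code in generate_embeddings.py.

```python
import numpy as np

# Toy normalized contact matrix; the last locus has no contacts at all.
matrix = np.array([[0.0, 1.2, 0.0],
                   [1.2, 0.9, 0.0],
                   [0.0, 0.0, 0.0]])

# True where a row contains at least one nonzero entry.
mappable = (matrix != 0).any(axis=1)

# Rows kept for downstream use; unmappable loci are dropped.
mappable_rows = matrix[mappable]
```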

Downstream tasks

Resolution enhancement

downsample.py

Downsample Hi-C experiment with downsample.py

Parameters:

  • frac (float): Fraction of contacts to retain via binomial thinning
  • mcool_file (str): Path to mcool file at 5000 bp resolution
  • save_dir (str, optional): Directory to save downsampled cooler file (default: same directory as input)

Outputs:

  • {accession}_{int(100*frac)}pct.cool (cool file): Downsampled Hi-C contact matrix

Example:

python digitalcell/tasks/enhancement/downsample.py \
  --frac 0.1 \
  --mcool_file "path/to/accession.mcool" \
  --save_dir "path/to/output/directory"
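
Binomial thinning keeps each observed contact independently with probability frac, so a pixel with n contacts becomes a Binomial(n, frac) draw. The sketch below illustrates this on a plain count array; `binomial_thin` and its `seed` argument are hypothetical names, and the actual script operates on cooler pixel tables.

```python
import numpy as np

def binomial_thin(counts, frac, seed=0):
    """Retain each contact independently with probability `frac`.

    Hypothetical helper for illustration; not the function in downsample.py.
    """
    rng = np.random.default_rng(seed)
    return rng.binomial(counts, frac)

counts = np.array([0, 5, 100, 2500])      # raw contact counts per pixel
thinned = binomial_thin(counts, frac=0.1)  # roughly 10% of contacts survive
```

Thinned counts never exceed the originals, and empty pixels stay empty.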

After creating the low-coverage .cool file, use zoomify to create all desired resolutions, and then balance each resolution.

cooler zoomify path/to/accession_{int(100*frac)}pct.cool \
    -r 5000,10000,25000,50000,100000,250000,500000,1000000 \
    -o path/to/accession_{int(100*frac)}pct.mcool

for res in 5000 10000 25000 50000 100000 250000 500000 1000000; do
    cooler balance path/to/accession_{int(100*frac)}pct.mcool::/resolutions/$res
done

Lastly, create the inputs and targets for training. The inputs to ARCH3D are always at 5 kb.

python digitalcell/data/toeplitz_normalize.py \
  "path/to/accession_{int(100*frac)}pct.mcool" \
  "path/to/directory/holding/inputs" \
  --save_name "accession_{int(100*frac)}pct" \
  --resolution 5000

The targets should be normalized at the target resolution to prevent introducing errors from pooling.

python digitalcell/data/toeplitz_normalize.py \
  "path/to/accession_{int(100*frac)}pct.mcool" \
  "path/to/directory/holding/targets" \
  --save_name "accession_$res" \
  --resolution $res

enhance.py

Train a resolution enhancement model with enhance.py

Parameters:

  • config (str): Path to the enhancement configuration file (default: digitalcell/tasks/enhancement/res_enhancement.yaml)

Example:

python digitalcell/tasks/enhancement/enhance.py \
  --config "path/to/config.yaml"

Hyperedge prediction

The workflow for this task is as follows:

  1. Run process_clusters.py
  2. Set the parameters in generate_kmers.py and run
  3. Generate embeddings from virtual or real Hi-C with generate_embeddings.py
  4. Train the model with hyperedge.py
  5. Test the predictions with test_hyperedge.py

Each script is explained below:

process_clusters.py

Parameters:

  • parent_dir (str): Path to directory containing cluster files
  • parent_save_dir (str): Path to directory for saving processed files
  • resolution (int): Resolution for inference (default: 100000)

Outputs:

  • {parent_save_dir}/{parent_dir_name}/edge_list.npy (numpy.ndarray): Hyperedges for inference
  • {parent_save_dir}/{parent_dir_name}/matrix.txt (tab-separated): Contact matrix pixels
  • {parent_save_dir}/{parent_dir_name}/bins.bed (BED format): Genomic bins
  • {parent_save_dir}/{parent_dir_name}/output.cool (cool file): Cooler file at 1 kb resolution
  • {parent_save_dir}/{parent_dir_name}/output.mcool (mcool file): Multi-resolution Cooler file with balanced Hi-C

Example:

python digitalcell/tasks/hyperedge/process_clusters.py \
  --parent-dir "path/to/directory/containing/clusters/files" \
  --parent-save-dir "path/to/output" \
  --resolution 100000

generate_kmers.py

Generate k-mer hyperedges from edge_list.npy with generate_kmers.py

Parameters:

  • max_cluster_size (int): Maximum cluster size to consider
  • k_list (List[int]): List of k-mer sizes to generate
  • temp_dir (str): Directory containing edge_list.npy and for saving outputs
  • min_freq_cutoff (int): Minimum frequency threshold for k-mers (default: 2)
  • resolution (int): Resolution for inference (default: 100000)

Outputs:

  • {temp_dir}/all_{k}_counter.npy (numpy.ndarray): k-mer hyperedges for each k in k_list
  • {temp_dir}/all_{k}_freq_counter.npy (numpy.ndarray): Frequency counts for each k-mer

Example:

python digitalcell/tasks/hyperedge/generate_kmers.py \
  --max_cluster_size 25 \
  --k_list [3,4,5] \
  --temp_dir "path/to/temp" \
  --min_freq_cutoff 2 \
  --resolution 100000
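
One plausible reading of the parameters above is that every k-subset of loci within a cluster is counted, clusters larger than max_cluster_size are skipped, and only k-mers seen at least min_freq_cutoff times are kept. The sketch below illustrates that reading; `count_kmers` is a hypothetical function and the procedure is an assumption, not the logic confirmed in generate_kmers.py.

```python
from collections import Counter
from itertools import combinations

def count_kmers(clusters, k, max_cluster_size, min_freq_cutoff=2):
    """Count k-subsets of loci across clusters and apply a frequency cutoff.

    Hypothetical helper for illustration only.
    """
    counter = Counter()
    for cluster in clusters:
        bins = sorted(set(cluster))  # unique loci in a canonical order
        if len(bins) > max_cluster_size:
            continue                 # skip overly large clusters
        counter.update(combinations(bins, k))
    return {kmer: n for kmer, n in counter.items() if n >= min_freq_cutoff}

clusters = [[1, 2, 3], [1, 2, 3, 4], [2, 3, 4]]  # toy clusters of bin indices
kmers = count_kmers(clusters, k=3, max_cluster_size=25)
```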

Cite

Galioto, Nicholas, et al. "ARCH3D: A foundation model for global genome architecture." bioRxiv (2026): 2026-02.

@article{galioto2026arch3d,
  title={{ARCH3D}: A foundation model for global genome architecture},
  author={Galioto, Nicholas and Stansbury, Cooper and Gorodetsky, Alex Arkady and Rajapakse, Indika},
  journal={bioRxiv},
  pages={2026--02},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}