A foundation model for global genome architecture
The model can be downloaded from https://huggingface.co/ngalioto/ARCH3D.
python -m pip install --upgrade pip
python -m pip install .
toeplitz_normalize.py
Perform observed/expected normalization on a cooler file with toeplitz_normalize.py
Parameters:
- file_name (str): Path to mcool file
- save_dir (str): Directory to save the normalized matrix
- save_name (str): Output filename
- weights_dir (str, optional): Directory to save pertinent normalization values
- resolution (int): Resolution at which to normalize (default: 5000)
- balance (bool): If True, normalize balanced Hi-C; otherwise normalize raw Hi-C (default: True)
Outputs:
- {save_dir}/{save_name}.npz (scipy.sparse.csr_matrix): The normalized matrix with chromosomes 1-22
The remaining outputs are saved only if the argument weights_dir is passed:
- {weights_dir}/{save_name}_pixels.npy (numpy.ndarray): Array holding the sum of all pixels along each diagonal. The array index corresponds to the diagonal offset.
- {weights_dir}/{save_name}_counts.npy (numpy.ndarray): Array holding the number of nonzero pixels along each diagonal. The array index corresponds to the diagonal offset.
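To make the relationship between the normalized matrix and the two weight arrays concrete, here is a minimal sketch of observed/expected normalization on a small dense matrix. It assumes that the expected value at each diagonal offset is the mean of the nonzero pixels on that diagonal (sum of pixels divided by nonzero count); the function name `observed_over_expected` is hypothetical, and the actual script works on sparse cooler data and may differ in its details.

```python
import numpy as np


def observed_over_expected(matrix):
    """Divide each pixel by the mean of the nonzero pixels on its diagonal.

    Assumption: this mirrors the pixels/counts outputs described above;
    it is not taken from toeplitz_normalize.py itself.
    """
    n = matrix.shape[0]
    pixels = np.zeros(n)             # sum of all pixels at each diagonal offset
    counts = np.zeros(n, dtype=int)  # number of nonzero pixels at each offset
    for offset in range(n):
        diag = np.diagonal(matrix, offset=offset)
        pixels[offset] = diag.sum()
        counts[offset] = np.count_nonzero(diag)

    # Expected value per offset; offsets with no nonzero pixels stay at 1
    # so that the division below never hits 0/0.
    expected = np.ones(n)
    nonempty = counts > 0
    expected[nonempty] = pixels[nonempty] / counts[nonempty]

    # Toeplitz matrix of expected values: entry (i, j) depends only on |i - j|.
    idx = np.arange(n)
    toeplitz = expected[np.abs(idx[:, None] - idx[None, :])]
    return matrix / toeplitz, pixels, counts


observed = np.array([
    [4.0, 2.0, 0.0],
    [2.0, 8.0, 6.0],
    [0.0, 6.0, 6.0],
])
normalized, pixels, counts = observed_over_expected(observed)
```

In this toy case the main diagonal has an expected value of 6 and the first off-diagonal an expected value of 4, so a pixel equal to its diagonal mean maps to 1.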
Example:
python digitalcell/data/toeplitz_normalize.py \
"file_name" \
"save_dir" \
--save_name "save_name" \
--weights_dir "weights_dir" \
--resolution 5000 \
--balance True
pretrain.py
Pre-train ARCH3D from scratch with pretrain.py
Parameters:
- config (str): Path to the configuration file
Example:
python digitalcell/scripts/pretrain.py \
--config "/path/to/config"
generate_embeddings.py
The steps to generate pre-trained embeddings from a Hi-C experiment are the following:
- Create an mcool file with the desired resolution
- Perform observed/expected normalization on the Hi-C experiment using toeplitz_normalize.py
- Generate the embeddings with generate_embeddings.py
Parameters:
- ckpt_path (str): Path to the HiCT model checkpoint
- data_file (str): Path to the Hi-C data file
- resolution (int): Resolution for binning the data
- save_dir (str): Directory for saving the embeddings
- shuffle (bool, optional): Shuffle loci before generating embeddings (default: True)
Outputs:
- {save_dir}/embeddings.pt (torch.Tensor): The generated embeddings
- {save_dir}/mappable_idx.pt (torch.Tensor): Boolean mask of indices where the rows are not all zeros
- {save_dir}/chromosomes.pt (torch.Tensor): Chromosome assignments (zero-indexed)
- {save_dir}/start_bp.pt (torch.Tensor): Starting base pair positions (zero-indexed and in Mbp)
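The meaning of the mask and position outputs can be illustrated with a small NumPy sketch. The script itself saves torch tensors; this example only shows what "rows that are not all zeros" and "start positions in Mbp" look like on a toy matrix. It assumes that a bin's start position is its zero-based index times the resolution, which is not confirmed by the source.

```python
import numpy as np

resolution = 5000  # bin size in bp, matching the --resolution argument

# Toy contact matrix with 4 bins; bin 2 has no contacts at all.
matrix = np.array([
    [1.0, 0.5, 0.0, 0.2],
    [0.5, 1.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.0, 1.0],
])

# True where a row contains at least one nonzero entry.
mappable_idx = (matrix != 0).any(axis=1)

# Start of each bin in Mbp (assumed: zero-based bin index * resolution).
start_bp = np.arange(matrix.shape[0]) * resolution / 1e6
```

Here the mask marks bin 2 as unmappable, and the 5 kb bins start at 0, 0.005, 0.01, and 0.015 Mbp.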
Example:
python digitalcell/scripts/generate_embeddings.py \
--ckpt_path "path/to/checkpoint" \
--data_file "path/to/data.npz" \
--resolution 5000 \
--save_dir "embeddings_output" \
--shuffle True
downsample.py
Downsample a Hi-C experiment with downsample.py
Parameters:
- frac (float): Fraction of contacts to retain via binomial thinning
- mcool_file (str): Path to mcool file at 5000 bp resolution
- save_dir (str, optional): Directory to save downsampled cooler file (default: same directory as input)
Outputs:
- {accession}_{int(100*frac)}pct.cool (cool file): Downsampled Hi-C contact matrix
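Binomial thinning keeps each individual contact independently with probability frac, so a pixel with n contacts becomes a draw from Binomial(n, frac). The sketch below illustrates this on a plain array of pixel counts and shows how the output filename pattern evaluates. The function name `binomial_thin` and the fixed seed are illustrative assumptions; the actual script operates on a cooler file.

```python
import numpy as np


def binomial_thin(counts, frac, seed=0):
    """Keep each contact independently with probability `frac`."""
    rng = np.random.default_rng(seed)
    # One binomial draw per pixel: n = raw count, p = retained fraction.
    return rng.binomial(counts, frac)


counts = np.array([120, 35, 0, 7, 1], dtype=np.int64)  # toy raw pixel counts
frac = 0.1

thinned = binomial_thin(counts, frac)

# The output name follows the {accession}_{int(100*frac)}pct.cool pattern.
output_name = f"accession_{int(100 * frac)}pct.cool"
```

A thinned pixel can never exceed its raw count, and empty pixels stay empty, so the sparsity pattern of the downsampled matrix is a subset of the original.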
Example:
python digitalcell/tasks/enhancement/downsample.py \
--frac 0.1 \
--mcool_file "path/to/accession.mcool" \
--save_dir "path/to/output/directory"
After creating the low-coverage .cool file, use cooler zoomify to create all desired resolutions, and then balance each resolution.
cooler zoomify path/to/accession_{int(100*frac)}pct.cool \
-r 5000,10000,25000,50000,100000,250000,500000,1000000 \
-o path/to/accession_{int(100*frac)}pct.mcool
for res in 5000 10000 25000 50000 100000 250000 500000 1000000; do
cooler balance path/to/accession_{int(100*frac)}pct.mcool::/resolutions/$res
done
Lastly, create the inputs and targets for training. The inputs to ARCH3D are always at 5 kb.
python digitalcell/data/toeplitz_normalize.py \
"path/to/accession_{int(100*frac)}pct.mcool" \
"path/to/directory/holding/inputs" \
--save_name "accession_{int(100*frac)}pct" \
--resolution 5000
The targets should be normalized directly at the target resolution to avoid introducing errors from pooling.
python digitalcell/data/toeplitz_normalize.py \
"path/to/accession_{int(100*frac)}pct.mcool" \
"path/to/directory/holding/targets" \
--save_name "accession_$res" \
--resolution $res
enhancement.py
Train a resolution enhancement model with enhance.py
Parameters:
- config (str): Path to the enhancement configuration file (default: digitalcell/tasks/enhancement/res_enhancement.yaml)
Example:
python digitalcell/tasks/enhancement/enhance.py \
--config "path/to/config.yaml"
The workflow for the hyperedge task is as follows:
- Run process_clusters.py
- Set the parameters in generate_kmers.py and run it
- Generate embeddings from virtual or real Hi-C with generate_embeddings.py
- Train the model with hyperedge.py
- Test the predictions with test_hyperedge.py
Each script is explained below:
process_clusters.py
Parameters:
- parent_dir (str): Path to directory containing cluster files
- parent_save_dir (str): Path to directory for saving processed files
- resolution (int): Resolution for inference (default: 100000)
Outputs:
- {parent_save_dir}/{parent_dir_name}/edge_list.npy (numpy.ndarray): Hyperedges for inference
- {parent_save_dir}/{parent_dir_name}/matrix.txt (tab-separated): Contact matrix pixels
- {parent_save_dir}/{parent_dir_name}/bins.bed (BED format): Genomic bins
- {parent_save_dir}/{parent_dir_name}/output.cool (cool file): Cooler file at 1 kb resolution
- {parent_save_dir}/{parent_dir_name}/output.mcool (mcool file): Multi-resolution Cooler file with balanced Hi-C
Example:
python digitalcell/tasks/hyperedge/process_clusters.py \
--parent-dir "path/to/directory/containing/clusters/files" \
--parent-save-dir "path/to/output" \
--resolution 100000
generate_kmers.py
Generate k-mer hyperedges from edge_list.npy with generate_kmers.py
Parameters:
- max_cluster_size (int): Maximum cluster size to consider
- k_list (List[int]): List of k-mer sizes to generate
- temp_dir (str): Directory containing edge_list.npy and for saving outputs
- min_freq_cutoff (int): Minimum frequency threshold for k-mers (default: 2)
- resolution (int): Resolution for inference (default: 100000)
Outputs:
- {temp_dir}/all_{k}_counter.npy (numpy.ndarray): k-mer hyperedges for each k in k_list
- {temp_dir}/all_{k}_freq_counter.npy (numpy.ndarray): Frequency counts for each k-mer
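As a rough illustration of how the parameters above interact, the sketch below enumerates every k-sized subset of loci within each cluster, counts how often each subset occurs across clusters, and keeps those that meet the frequency cutoff. This is an assumption about the general technique, not a description of generate_kmers.py itself; the function name `count_kmers` and the toy clusters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_kmers(clusters, k, max_cluster_size, min_freq_cutoff=2):
    """Count k-sized subsets of loci that co-occur within clusters."""
    counter = Counter()
    for cluster in clusters:
        loci = sorted(set(cluster))  # canonical order, duplicates removed
        if len(loci) > max_cluster_size:
            continue  # skip clusters above the size limit
        # Clusters with fewer than k loci contribute no combinations.
        counter.update(combinations(loci, k))
    # Keep only k-mers observed at least min_freq_cutoff times.
    return {kmer: freq for kmer, freq in counter.items() if freq >= min_freq_cutoff}


# Toy clusters of bin indices.
clusters = [[1, 2, 3, 4], [2, 3, 4], [1, 2, 3], [5, 6]]
kmers = count_kmers(clusters, k=3, max_cluster_size=25)
```

With these toy clusters, only the triplets (1, 2, 3) and (2, 3, 4) appear in two clusters each, so they are the only ones that survive the default cutoff of 2.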
Example:
python digitalcell/tasks/hyperedge/generate_kmers.py \
--max_cluster_size 25 \
--k_list [3,4,5] \
--temp_dir "path/to/temp" \
--min_freq_cutoff 2 \
--resolution 100000
Galioto, Nicholas, et al. "ARCH3D: A foundation model for global genome architecture." bioRxiv (2026): 2026-02.
@article{galioto2026arch3d,
title={{ARCH3D}: A foundation model for global genome architecture},
author={Galioto, Nicholas and Stansbury, Cooper and Gorodetsky, Alex Arkady and Rajapakse, Indika},
journal={bioRxiv},
pages={2026--02},
year={2026},
publisher={Cold Spring Harbor Laboratory}
}