ARCH3D

A foundation model for global genome architecture

Download

The model can be downloaded from https://huggingface.co/ngalioto/ARCH3D.

Install

python -m pip install --upgrade pip
python -m pip install .

Scripts

toeplitz_normalize.py

Perform observed/expected normalization on a cooler file with toeplitz_normalize.py

Parameters:

file_name (str): Path to mcool file
save_dir (str): Directory to save the normalized matrix
save_name (str): Output filename
weights_dir (str, optional): Directory to save pertinent normalization values
resolution (int): Resolution at which to normalize (default: 5000)
balance (bool): If True, normalize balanced Hi-C; otherwise normalize raw Hi-C (default: True)

Outputs:

{save_dir}/{save_name}.npz (scipy.sparse.csr_matrix): The normalized matrix with chromosomes 1--22

The following files are saved only if the weights_dir argument is passed:

{weights_dir}/{save_name}_pixels.npy (numpy.ndarray): Array holding the sum of all pixels along each diagonal. The array index corresponds to the diagonal offset.
{weights_dir}/{save_name}_counts.npy (numpy.ndarray): Array holding the number of nonzero pixels along each diagonal. The array index corresponds to the diagonal offset.

Example:

python digitalcell/data/toeplitz_normalize.py \
    "file_name" \
    "save_dir" \
    --save_name "save_name" \
    --weights_dir "weights_dir" \
    --resolution 5000 \
    --balance True
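
The two weight arrays described above suggest how an expected value per genomic distance could be formed: dividing the per-diagonal pixel sum by the per-diagonal nonzero count gives a mean contact value for each diagonal offset. The following is a minimal sketch on a small dense matrix, assuming that this ratio is the expected value and that each pixel is divided by it. The function names diagonal_stats and observed_over_expected are hypothetical, and the actual script may handle chromosomes, balancing, and sparsity differently.

```python
import numpy as np


def diagonal_stats(mat):
    """Per-offset sum and nonzero count over the upper diagonals of a square matrix."""
    n = mat.shape[0]
    pixels = np.zeros(n)             # sum of pixels on each diagonal
    counts = np.zeros(n, dtype=int)  # number of nonzero pixels on each diagonal
    for offset in range(n):
        diag = np.diagonal(mat, offset=offset)
        pixels[offset] = diag.sum()
        counts[offset] = np.count_nonzero(diag)
    return pixels, counts


def observed_over_expected(mat):
    """Divide every pixel by the mean nonzero value of its diagonal (assumed definition)."""
    pixels, counts = diagonal_stats(mat)
    # Diagonals without any nonzero pixel keep an expected value of 0.
    expected = np.divide(pixels, counts, out=np.zeros_like(pixels), where=counts > 0)
    rows, cols = np.indices(mat.shape)
    expected_full = expected[np.abs(rows - cols)]
    # Pixels on diagonals with zero expectation are left at 0.
    return np.divide(mat, expected_full, out=np.zeros(mat.shape), where=expected_full > 0)


if __name__ == "__main__":
    toy = np.array([[4.0, 2.0, 1.0],
                    [2.0, 4.0, 2.0],
                    [1.0, 2.0, 4.0]])
    print(diagonal_stats(toy))
    print(observed_over_expected(toy))
```

In this toy case every diagonal is constant, so each pixel equals its diagonal mean and the normalized matrix consists of ones.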

pretrain.py

Pre-train ARCH3D from scratch with pretrain.py

Parameters:

config (str): Path to the configuration file

Example:

python digitalcell/scripts/pretrain.py \
    --config "/path/to/config"

generate_embeddings.py

The steps to generate pre-trained embeddings from a Hi-C experiment are the following:

1. Create an mcool file with the desired resolution
2. Perform observed/expected normalization on the Hi-C experiment using toeplitz_normalize.py
3. Generate the embeddings with generate_embeddings.py

Parameters:

ckpt_path (str): Path to the ARCH3D model checkpoint
data_file (str): Path to the Hi-C data file
resolution (int): Resolution for binning the data
save_dir (str): Directory for saving the embeddings
shuffle (bool, optional): Shuffle loci before generating embeddings (default: True)

Outputs:

{save_dir}/embeddings.pt (torch.Tensor): The generated embeddings
{save_dir}/mappable_idx.pt (torch.Tensor): Boolean mask of indices where the rows are not all zeros
{save_dir}/chromosomes.pt (torch.Tensor): Chromosome assignments (zero-indexed)
{save_dir}/start_bp.pt (torch.Tensor): Starting base pair positions (zero-indexed and in Mbp)

Example:

python digitalcell/scripts/generate_embeddings.py \
    --ckpt_path "path/to/checkpoint" \
    --data_file "path/to/data.npz" \
    --resolution 5000 \
    --save_dir "embeddings_output" \
    --shuffle True
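
To make the output conventions above concrete, here is a minimal sketch of how per-bin annotations of this kind could be built with NumPy. It assumes chromosomes are numbered from zero in the order given, that start positions are bin starts expressed in Mbp, and that a locus counts as mappable when its row contains at least one nonzero entry. The helpers bin_annotations and mappable_mask are hypothetical and are not part of the repository.

```python
import numpy as np


def bin_annotations(chrom_sizes_bp, resolution):
    """Return (chromosomes, start_mbp) for every bin of every chromosome."""
    chroms, starts = [], []
    for index, size in enumerate(chrom_sizes_bp):
        n_bins = -(-size // resolution)                       # ceiling division
        chroms.append(np.full(n_bins, index))                 # zero-indexed chromosome label
        starts.append(np.arange(n_bins) * resolution / 1e6)   # bin start in Mbp
    return np.concatenate(chroms), np.concatenate(starts)


def mappable_mask(mat):
    """Boolean mask that is True for rows that are not all zeros."""
    return np.any(mat != 0, axis=1)


if __name__ == "__main__":
    chromosomes, start_mbp = bin_annotations([12000, 5000], resolution=5000)
    print(chromosomes, start_mbp)
    print(mappable_mask(np.array([[1, 0, 0], [0, 0, 0], [0, 0, 2]])))
```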

Downstream tasks

Resolution enhancement

downsample.py

Downsample Hi-C experiment with downsample.py

Parameters:

frac (float): Fraction of contacts to retain via binomial thinning
mcool_file (str): Path to mcool file at 5000 bp resolution
save_dir (str, optional): Directory to save downsampled cooler file (default: same directory as input)

Outputs:

{accession}_{int(100*frac)}pct.cool (cool file): Downsampled Hi-C contact matrix

Example:

python digitalcell/tasks/enhancement/downsample.py \
  --frac 0.1 \
  --mcool_file "path/to/accession.mcool" \
  --save_dir "path/to/output/directory"
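
Binomial thinning keeps each individual contact independently with probability frac, so a pixel with n contacts becomes a draw from Binomial(n, frac). The sketch below illustrates the idea on a plain array of counts and reproduces the output naming pattern. It is an illustration under these assumptions only: binomial_thin, downsampled_name, and the seed argument are hypothetical, and the actual script operates on cooler files.

```python
import numpy as np


def binomial_thin(counts, frac, seed=0):
    """Keep each contact with probability frac (one binomial draw per pixel)."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    return rng.binomial(np.asarray(counts), frac)


def downsampled_name(accession, frac):
    """Build the output filename following the pattern {accession}_{int(100*frac)}pct.cool."""
    return f"{accession}_{int(100 * frac)}pct.cool"


if __name__ == "__main__":
    raw = np.array([0, 5, 100, 1000])
    print(binomial_thin(raw, 0.1))
    print(downsampled_name("accession", 0.1))
```

Because every pixel is thinned independently, the retained fraction of the total is only approximately frac, with the approximation improving as the number of contacts grows.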

After creating the low-coverage .cool file, use zoomify to create all desired resolutions, and then balance each resolution.

cooler zoomify path/to/accession_{int(100*frac)}pct.cool \
    -r 5000,10000,25000,50000,100000,250000,500000,1000000 \
    -o path/to/accession_{int(100*frac)}pct.mcool

for res in 5000 10000 25000 50000 100000 250000 500000 1000000; do
    cooler balance path/to/accession_{int(100*frac)}pct.mcool::/resolutions/$res
done
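
The {int(100*frac)} placeholder in the commands above is a Python expression rather than shell syntax, so it has to be filled in before the commands are run. One possible way to do that is sketched below: a small helper that builds the same cooler zoomify and cooler balance invocations as argument lists. The helper name, the directory layout, and the default resolution list are assumptions taken from the example, not part of the repository.

```python
import shlex

RESOLUTIONS = [5000, 10000, 25000, 50000, 100000, 250000, 500000, 1000000]


def zoomify_and_balance_commands(directory, accession, frac, resolutions=RESOLUTIONS):
    """Return the zoomify command followed by one balance command per resolution."""
    stem = f"{directory}/{accession}_{int(100 * frac)}pct"
    zoomify = [
        "cooler", "zoomify", f"{stem}.cool",
        "-r", ",".join(str(r) for r in resolutions),
        "-o", f"{stem}.mcool",
    ]
    balance = [
        ["cooler", "balance", f"{stem}.mcool::/resolutions/{r}"]
        for r in resolutions
    ]
    return [zoomify] + balance


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    for command in zoomify_and_balance_commands("path/to", "accession", 0.1):
        print(shlex.join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) once cooler is installed.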

Lastly, create the inputs and targets for training. The inputs to ARCH3D are always at 5 kb.

python digitalcell/data/toeplitz_normalize.py \
  "path/to/accession_{int(100*frac)}pct.mcool" \
  "path/to/directory/holding/inputs" \
  --save_name "accession_{int(100*frac)}pct" \
  --resolution 5000

The targets should be normalized at the target resolution to prevent introducing errors from pooling.

python digitalcell/data/toeplitz_normalize.py \
  "path/to/accession_{int(100*frac)}pct.mcool" \
  "path/to/directory/holding/targets" \
  --save_name "accession_$res" \
  --resolution $res

enhance.py

Train a resolution enhancement model with enhance.py

Parameters:

config (str): Path to the enhancement configuration file (default: digitalcell/tasks/enhancement/res_enhancement.yaml)

Example:

python digitalcell/tasks/enhancement/enhance.py \
  --config "path/to/config.yaml"

Hyperedge prediction

The workflow for this task is as follows:

1. Run process_clusters.py
2. Set the parameters in generate_kmers.py and run it
3. Generate embeddings from virtual or real Hi-C with generate_embeddings.py
4. Train the model with hyperedge.py
5. Test the predictions with test_hyperedge.py

Each script is explained below:

process_clusters.py

Parameters:

parent_dir (str): Path to directory containing cluster files
parent_save_dir (str): Path to directory for saving processed files
resolution (int): Resolution for inference (default: 100000)

Outputs:

{parent_save_dir}/{parent_dir_name}/edge_list.npy (numpy.ndarray): Hyperedges for inference
{parent_save_dir}/{parent_dir_name}/matrix.txt (tab-separated): Contact matrix pixels
{parent_save_dir}/{parent_dir_name}/bins.bed (BED format): Genomic bins
{parent_save_dir}/{parent_dir_name}/output.cool (cool file): Cooler file at 1 kb resolution
{parent_save_dir}/{parent_dir_name}/output.mcool (mcool file): Multi-resolution Cooler file with balanced Hi-C

Example:

python digitalcell/tasks/hyperedge/process_clusters.py \
  --parent-dir "path/to/directory/containing/clusters/files" \
  --parent-save-dir "path/to/output" \
  --resolution 100000
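
As a rough illustration of what turning a cluster into a hyperedge at a given resolution can look like, the sketch below bins the positions of one cluster and keeps each bin once. The input format, a list of (chromosome, position in bp) pairs, and the function name cluster_to_hyperedge are assumptions made for this example; the actual cluster file format and the processing done by process_clusters.py may differ.

```python
def cluster_to_hyperedge(reads, resolution=100000):
    """Map (chromosome, position_bp) pairs to a sorted list of unique (chromosome, bin) nodes."""
    return sorted({(chrom, position // resolution) for chrom, position in reads})


if __name__ == "__main__":
    # Hypothetical cluster with two reads in the same 100 kb bin of chr1.
    cluster = [("chr1", 150000), ("chr1", 199999), ("chr2", 50000)]
    print(cluster_to_hyperedge(cluster))
```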

generate_kmers.py

Generate k-mer hyperedges from edge_list.npy with generate_kmers.py

Parameters:

max_cluster_size (int): Maximum cluster size to consider
k_list (List[int]): List of k-mer sizes to generate
temp_dir (str): Directory containing edge_list.npy and for saving outputs
min_freq_cutoff (int): Minimum frequency threshold for k-mers (default: 2)
resolution (int): Resolution for inference (default: 100000)

Outputs:

{temp_dir}/all_{k}_counter.npy (numpy.ndarray): k-mer hyperedges for each k in k_list
{temp_dir}/all_{k}_freq_counter.npy (numpy.ndarray): Frequency counts for each k-mer

Example:

python digitalcell/tasks/hyperedge/generate_kmers.py \
  --max_cluster_size 25 \
  --k_list [3,4,5] \
  --temp_dir "path/to/temp" \
  --min_freq_cutoff 2 \
  --resolution 100000
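
The parameters above can be read as a counting procedure: skip clusters larger than max_cluster_size, enumerate every k-subset of the remaining clusters for each k in k_list, and keep the subsets seen at least min_freq_cutoff times. The following sketch implements that reading with the standard library. It is an assumption about the general approach, the name count_kmers is hypothetical, and the actual script may enumerate or filter differently.

```python
from collections import Counter
from itertools import combinations


def count_kmers(hyperedges, k_list, max_cluster_size, min_freq_cutoff=2):
    """Count k-subsets of each hyperedge and keep those meeting the frequency cutoff."""
    counters = {k: Counter() for k in k_list}
    for edge in hyperedges:
        nodes = sorted(set(edge))          # canonical order so equal subsets match
        if len(nodes) > max_cluster_size:
            continue                       # ignore clusters that are too large
        for k in k_list:
            counters[k].update(combinations(nodes, k))
    return {
        k: {kmer: freq for kmer, freq in counter.items() if freq >= min_freq_cutoff}
        for k, counter in counters.items()
    }


if __name__ == "__main__":
    edges = [[1, 2, 3], [1, 2, 3, 4], [2, 3, 4], [5, 6]]
    print(count_kmers(edges, k_list=[3], max_cluster_size=25))
```

The number of k-subsets of a cluster grows combinatorially with its size, which is one plausible reason for capping the cluster size before enumeration.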

Cite

Galioto, Nicholas, et al. "ARCH3D: A foundation model for global genome architecture." bioRxiv (2026): 2026-02.

@article{galioto2026arch3d,
  title={{ARCH3D}: A foundation model for global genome architecture},
  author={Galioto, Nicholas and Stansbury, Cooper and Gorodetsky, Alex Arkady and Rajapakse, Indika},
  journal={bioRxiv},
  pages={2026--02},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}
