Leviathan

Leviathan is a fast, memory-efficient, and scalable taxonomic and pathway profiler for next generation sequencing (genome-resolved) metagenomics and metatranscriptomics. Leviathan is powered by Salmon and Sylph in the backend.

License Notice

You may have noticed that I have switched the code-base from public/private a few times. NewAtlantis Labs is ending operations and IP is being absorbed by Ocean BioMetrics. I am coordinating with NewAtlantis Labs and Ocean BioMetrics legal teams to finalize license details for various assets including Leviathan.

Please feel free to use for any academic usage but the details for commercial usage have not been finalized yet so please hold off on any commercial usage. These details should be finalized in the near future but the timeline is out of my control. I am actively advocating for unrestricted open-source usage as this can be a useful resource for the community.

Apologies for any inconvenience.

For any questions, please feel free to contact me at jol.espinoz@gmail.com

Install

# Create environment with dependencies
mamba create -n leviathan -c conda-forge -c bioconda python salmon sylph samtools -y

# Activate environment
mamba activate leviathan

# Install Leviathan
pip install leviathan

Modules

Citation

Leviathan: A fast, memory-efficient, and scalable taxonomic and pathway profiler for (pan)genome-resolved metagenomics and metatranscriptomics. Josh L Espinoza. bioRxiv 2025.07.14.664802; doi: https://doi.org/10.1101/2025.07.14.664802

Usage:

End-to-End Walkthrough

Detailed explanation on how to run each module including downloading test data and interpreting output files.

Benchmarking

Benchmarking against 10, 100, 1000, and 10000 genomes

Benchmarking using trimmed SRR12042303 sample with 4 threads on ram16GB-cpu4 EC2 instance (ml.m5.4xlarge)

number_of_genomes	number_of_cds_with_features	preprocess	index	profile-taxonomy	profile-pathway
10	1928	0:03	0:09	0:41	2:09
100	18410	0:31	0:26	0:41	4:29
1000	191155	5:29	3:55	0:43	12:50
10000	1684876	46:00	39:10	0:48	18:14

Benchmarking against CAMI-I and CAMI-II using 16 threads

All benchmarking and analysis was performed using a virtual machine with the following specifications: Linux Ubuntu 22.04 64-bit (x86_64), 30 Intel Xeon Platinum 8358 CPU, 222 GB memory, and 1 NVIDIA A10 GPU. Benchmarking and analysis was performed using 16 threads running 2 jobs simultaneously for Leviathan and HUMAnN.

Computational Performance

		Leviathan		HUMAnN		Fold Improvement
		Duration (minutes)	Peak Memory (GB)	Duration (minutes)	Peak Memory (GB)	Duration	Memory
CAMI_high_toy	H_S001	14.61	2.34	1083.57	32.31	74.19	13.84
	H_S002	14.89	2.35	949.73	32.28	63.78	13.75
	H_S003	14.96	2.34	875.83	32.64	58.56	13.97
	H_S004	15.02	2.35	852.18	32.33	56.72	13.79
	H_S005	15.27	2.33	826.25	32.23	54.13	13.82
CAMI_medium_toy	M2_S001	5.78	1.62	219.25	15.95	37.96	9.83
	M2_S002	5.81	1.62	174.12	16.96	29.95	10.45
CAMI_low_toy	S_S001	3.29	1.27	76.90	10.00	23.40	7.87
Marine	sample_0	13.78	2.70	119.52	17.92	8.68	6.63
	sample_1	15.22	2.69	121.30	18.00	7.97	6.69
	sample_2	14.97	2.71	120.27	17.99	8.03	6.65
	sample_3	17.10	2.71	124.05	17.83	7.25	6.59
	sample_4	14.32	2.74	118.47	17.82	8.27	6.51
	sample_5	15.53	2.72	119.40	17.80	7.69	6.54
	sample_6	16.09	2.71	119.72	17.91	7.44	6.62
	sample_7	14.90	2.73	119.92	17.92	8.05	6.56
	sample_8	16.41	2.72	121.73	17.95	7.42	6.61
	sample_9	14.45	2.73	118.87	17.79	8.23	6.51

Accuracy Performance

Ranges from 0.0 - 1.0

	Accuracy	Leviathan		HUMAnN		Improvement
Dataset	SampleID	Genome	Pangenome	Genome	Pangenome	Genome	Pangenome
CAMI_high_toy	H_S001	0.9492	0.9970	0.9049	0.9610	0.0442	0.0360
	H_S002	0.9551	0.9899	0.8992	0.9591	0.0558	0.0308
	H_S003	0.9556	0.9888	0.9004	0.9598	0.0553	0.0290
	H_S004	0.9496	0.9872	0.8947	0.9588	0.0548	0.0284
	H_S005	0.9420	0.9877	0.8901	0.9573	0.0519	0.0304
CAMI_medium_toy	M2_S001	0.9692	0.9983	0.9101	0.9620	0.0591	0.0363
	M2_S002	0.9762	0.9988	0.9177	0.9650	0.0585	0.0338
CAMI_low_toy	S_S001	1.0000	1.0000	0.9845	0.9845	0.0155	0.0155
Marine	sample_0	0.9727	0.9933	0.8783	0.9538	0.0944	0.0396
	sample_1	0.9298	0.9922	0.8793	0.9554	0.0505	0.0367
	sample_2	0.9686	0.9817	0.8768	0.9393	0.0918	0.0424
	sample_3	0.9706	0.9842	0.8596	0.9517	0.1110	0.0325
	sample_4	0.9661	0.9880	0.8454	0.9389	0.1207	0.0491
	sample_5	0.9614	0.9856	0.8740	0.9612	0.0874	0.0244
	sample_6	0.9283	0.9869	0.8684	0.9574	0.0599	0.0295
	sample_7	0.9231	0.9942	0.8719	0.9466	0.0512	0.0476
	sample_8	0.9703	0.9889	0.8764	0.9488	0.0940	0.0401
	sample_9	0.9459	0.9859	0.8657	0.9548	0.0802	0.0311

Modules

`leviathan-preprocess`

Preprocesses data into form than can be used by leviathan-index

leviathan-preprocess.py \
    -i references/manifest.tsv \
    -a references/pykofamsearch.pathways.tsv.gz \
    -o references/

`leviathan-index`

Build, update, and validate leviathan database

leviathan-index.py \
    -f references/cds.fasta.gz \
    -m references/feature_mapping.tsv.gz \
    -g references/genomes.tsv.gz \
    -d references/index/ \
    -p=-1

`leviathan-info`

Report information about leviathan database

leviathan-info.py -d references/index/

`leviathan-profile-taxonomy`

Profile taxonomy using Sylph with leviathan database

leviathan-profile-taxonomy.py \
    -1 ../Fastq/SRR12042303_1.fastq.gz \
    -2 ../Fastq/SRR12042303_2.fastq.gz \
    -n SRR12042303 \
    -d references/index/ \
    -o leviathan_output/profiling/taxonomy/ \
    -p=-1

`leviathan-profile-pathway`

Profile pathways using Salmon with leviathan database

leviathan-profile-pathway.py \
    -1 ../Fastq/SRR12042303_1.fastq.gz \
    -2 ../Fastq/SRR12042303_2.fastq.gz \
    -n SRR12042303 \
    -d references/index/ \
    -o leviathan_output/profiling/pathway/ \
    -p=-1

`leviathan-merge`

Merge sample-specific taxonomic and/or pathway profiling

leviathan-merge.py \
    -t leviathan_output/profiling/taxonomy/ \
    -p leviathan_output/profiling/pathway/ \

Utility Scripts

compile-manifest-from-veba.py - Compiles manifest.tsv file for leviathan preprocess from VEBA binning output

compile-manifest-from-veba.py
-i path/to/veba_output/binning/
-t prokaryotic,eukaryotic
-o references/manifest.tsv

Output Description

Sample Specific

Taxonomy profiles

Examples:
- Genome = Metagenome-assembled genome (MAG)
- Genome cluster = ANI ??? 95% & Alignment Fraction ??? 50%

Taxonomic abundances - Relative abundance of a genome/genome-cluster within a sample

taxonomic_abundance.genome_clusters.[parquet|tsv.gz] - Genome-cluster-level taxonomic relative abundance profiles
taxonomic_abundance.genomes.[parquet|tsv.gz] - Genome-level taxonomic relative abundance profiles

Note: Sylph is run with --estimate-unknown so relative abundances do not sum to 100% and the remaining % represents the unassigned reads.

Functional profiles

Examples:
- Feature = KEGG ortholog
- Pathway = KEGG module

Feature abundances - The (normalized) abundance of a feature relative to a genome/genome-cluster

feature_abundances.genome_clusters.number_of_reads.[parquet|tsv.gz] - Feature abundances for each genome cluster (number of reads aligned)
feature_abundances.genome_clusters.tpm.[parquet|tsv.gz] - Feature abundances for each genome cluster (TPM normalized abundances)
feature_abundances.genomes.number_of_reads.[parquet|tsv.gz] - Feature abundances for each genome (number of reads aligned)
feature_abundances.genomes.tpm.[parquet|tsv.gz] - Feature abundances for each genome (TPM normalized abundances)

Feature prevalence - The number of genome/genome-clusters where a feature is detected

feature_prevalence-binary.genome_clusters.[parquet|tsv.gz] - Presence/absence of feature relative to genome clusters
feature_prevalence-binary.genomes.[parquet|tsv.gz] - Presence/absence of feature relative to genomes
feature_prevalence-ratio.genome_clusters.[parquet|tsv.gz] - Ratio of genomes within a genome cluster with feature detected
feature_prevalence.genome_clusters.[parquet|tsv.gz] - The count of uniques that correspond to the features relative to the genome clusters
feature_prevalence.genomes.[parquet|tsv.gz] - The count of uniques that correspond to the features relative to the genomes

Gene abundances - The abundance of individual genes within genome

gene_abundances.genomes.number_of_reads.[parquet|tsv.gz] - Number of reads aligned to a gene within a genome
gene_abundances.genomes.tpm.[parquet|tsv.gz] - TPM normalized abundance of reads aligned to a gene within a genome

Pathway abundances - Pathway abundances for a genome and genome-cluster

pathway_abundances.genome_clusters.coverage.[parquet|tsv.gz] - Pathway coverage (i.e., pathway completion ratio) relative to genome clusters
pathway_abundances.genome_clusters.number_of_reads.[parquet|tsv.gz] - Pathway abundances as the number of reads aligned relative to genome clusters
pathway_abundances.genome_clusters.tpm.[parquet|tsv.gz] - TPM normalized pathway abundances as the number of reads aligned relative to genome clusters
pathway_abundances.genomes.coverage.[parquet|tsv.gz] - Pathway coverage (i.e., pathway completion ratio) relative to genomes
pathway_abundances.genomes.number_of_reads.[parquet|tsv.gz] - Pathway abundances as the number of reads aligned relative to genomes
pathway_abundances.genomes.tpm.[parquet|tsv.gz] - TPM normalized pathway abundances as the number of reads aligned relative to genomes

Merged

Taxonomy profiles

Sequence abundances can be used to determine the proportion of reads that were detected in database.

taxonomic_abundance.genome_clusters.nc - Genome-level taxonomic and sequence relative abundance profiles for all samples
taxonomic_abundance.genomes.nc - Genome-level taxonomic and sequence relative abundance profiles for all samples.

Functional profiles

Feature

feature.genome_clusters.nc - Feature abundances (number of reads, tpm) and prevalences (binary, total, ratio) of genome clusters for all samples
feature.genomes.nc - Feature abundances (number of reads, tpm) and prevalences (binary, total, ratio) of genomes for all samples

Pathway

pathway.genome_clusters.nc - Pathway abundances (number of reads, tpm) and coverages of genome clusters for all samples
pathway.genomes.nc - Pathway abundances (number of reads, tpm) and coverages of genomes for all samples

Pathway Databases

Currently, the only pathway database supported for pathway coverage calculations is the KEGG module database using KEGG orthologs as features. This database can be pre-built using KEGG Pathway Profiler or built with leviathan index if KEGG orthologs are used as features.

To maintain generalizability for custom feature sets (e.g., enzymes, reactions), the pathway database is not required but if it is not used when building leviathan index then the leviathan profile-pathway skips the pathway abundance and coverage calculations.

If custom databases are built, then the following nested Python dictionary structure needs to be followed:

# General Example
{
    id_pathway:{
        "name":Name of pathway,
        "definition":KEGG module definition,
        "classes":KEGG module classes,
        "graph":NetworkX MultiDiGraph,
        "ko_to_nodes": Dictionary of KEGG ortholog to nodes in graph,
        "optional_kos": Set of optional KEGG orthologs
    },
    }

# Specific Example
{
    'M00001': {
        'name': 'Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate',
        'definition': (
            '(K00844,K12407,K00845,K25026,K00886,K08074,K00918) '
            '(K01810,K06859,K13810,K15916) '
            '(K00850,K16370,K21071,K00918) '
            '(K01623,K01624,K11645,K16305,K16306) '
            'K01803 ((K00134,K00150) K00927,K11389) '
            '(K01834,K15633,K15634,K15635) '
            '(K01689,K27394) '
            '(K00873,K12406)'
        ),
        'classes': 'Pathway modules; Carbohydrate metabolism; Central carbohydrate metabolism',
        'graph': <networkx.classes.multidigraph.MultiDiGraph object at 0x132d2a9e0>,
        'ko_to_nodes': {
            'K00844': [[0, 2]],
            'K12407': [[0, 2]],
            'K00845': [[0, 2]],
            'K25026': [[0, 2]],
            'K00886': [[0, 2]],
            'K08074': [[0, 2]],
            'K00918': [[0, 2], [3, 4]],
            'K01810': [[2, 3]],
            'K06859': [[2, 3]],
            'K13810': [[2, 3]],
            'K15916': [[2, 3]],
            'K00850': [[3, 4]],
            'K16370': [[3, 4]],
            'K21071': [[3, 4]],
            'K01623': [[4, 5]],
            'K01624': [[4, 5]],
            'K11645': [[4, 5]],
            'K16305': [[4, 5]],
            'K16306': [[4, 5]],
            'K01803': [[5, 6]],
            'K00134': [[6, 8]],
            'K00150': [[6, 8]],
            'K00927': [[8, 7]],
            'K11389': [[6, 7]],
            'K01834': [[7, 9]],
            'K15633': [[7, 9]],
            'K15634': [[7, 9]],
            'K15635': [[7, 9]],
            'K01689': [[9, 10]],
            'K27394': [[9, 10]],
            'K00873': [[10, 1]],
            'K12406': [[10, 1]]
        },
        'optional_kos': set()
    },
    'M00002': {
        'name': 'Glycolysis, core module involving three-carbon compounds',
        'definition': (
            'K01803 ((K00134,K00150) K00927,K11389) '
            '(K01834,K15633,K15634,K15635) '
            '(K01689,K27394) '
            '(K00873,K12406)'
        ),
        'classes': 'Pathway modules; Carbohydrate metabolism; Central carbohydrate metabolism',
        'graph': <networkx.classes.multidigraph.MultiDiGraph object at 0x10d51b160>,
        'ko_to_nodes': {
            'K01803': [[0, 2]],
            'K00134': [[2, 4]],
            'K00150': [[2, 4]],
            'K00927': [[4, 3]],
            'K11389': [[2, 3]],
            'K01834': [[3, 5]],
            'K15633': [[3, 5]],
            'K15634': [[3, 5]],
            'K15635': [[3, 5]],
            'K01689': [[5, 6]],
            'K27394': [[5, 6]],
            'K00873': [[6, 1]],
            'K12406': [[6, 1]]
        },
        'optional_kos': set()
    },
    ...
}

For documentation for pathway theory or how MultiDiGraph objects are generated, please refer to the source repository for KEGG Pathway Completeness Tool as KEGG Pathway Profiler is a reimplementation for production.

Contact:

jol.espinoz@gmail.com

Disclaimer:

This software was developed at NewAtlantis Labs which is now acquired by Ocean BioMetrics.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
bin		bin
images		images
leviathan		leviathan
schematic		schematic
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
WALKTHROUGH.md		WALKTHROUGH.md
requirements.txt		requirements.txt
requirements_cli.txt		requirements_cli.txt
setup.py		setup.py

License

jolespin/leviathan

Folders and files

Latest commit

History

Repository files navigation

Leviathan

License Notice

Install

Modules

Citation

Usage:

End-to-End Walkthrough

Benchmarking

Benchmarking against 10, 100, 1000, and 10000 genomes

Benchmarking against CAMI-I and CAMI-II using 16 threads

Computational Performance

Accuracy Performance

Modules

leviathan-preprocess

leviathan-index

leviathan-info

leviathan-profile-taxonomy

leviathan-profile-pathway

leviathan-merge

Utility Scripts

Output Description

Sample Specific

Taxonomy profiles

Taxonomic abundances - Relative abundance of a genome/genome-cluster within a sample

Functional profiles

Feature abundances - The (normalized) abundance of a feature relative to a genome/genome-cluster

Feature prevalence - The number of genome/genome-clusters where a feature is detected

Gene abundances - The abundance of individual genes within genome

Pathway abundances - Pathway abundances for a genome and genome-cluster

Merged

Taxonomy profiles

Functional profiles

Feature

Pathway

Pathway Databases

Contact:

Disclaimer:

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

`leviathan-preprocess`

`leviathan-index`

`leviathan-info`

`leviathan-profile-taxonomy`

`leviathan-profile-pathway`

`leviathan-merge`

Packages