sTarML downloads and prepares the RefSeq genome and GTF annotation for a given NCBI Taxonomy ID, builds start-codon–centered windows (covering 200 nt upstream and 100 nt downstream of the ATG), and predicts sRNA targets using bundled k-mer vectorizers/scalers and a trained model.
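To make "k-mer vectorizer" concrete, here is a minimal sketch of fixed-order k-mer counting. It is an illustration only: the function name `kmer_counts`, the choice of `k`, and the handling of `N` are assumptions, and the bundled vectorizers and scalers may work differently.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3):
    """Count overlapping k-mers and return them in a fixed A/C/G/T order.

    Hypothetical helper for illustration; not part of the starml API.
    k-mers containing N (or any other symbol) are ignored here.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One slot per possible k-mer, so every sequence maps to the same length.
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]


vector = kmer_counts("ACGTT", k=2)
print(len(vector), sum(vector))
```

A fixed feature order matters because a scaler and model trained on one ordering can only be applied to vectors built the same way.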
# Install the package
conda install -c vkotsira starml
conda env create -f environment.yml
conda activate starml

Comma-separated sequences
starml --taxid 511145 --srna "ACGTT,GGGAT" --outdir results --cores 4

FASTA of sRNAs
starml --taxid 511145 --srna-fasta my_sRNAs.fa --outdir results --cores 8

Specific RefSeq accession
starml --taxid 511145 --refseq GCF_025643435.1 --srna "ACGTT" --outdir results

Run starml --help for all options.
1. Initialize & logging → logs/run.log
2. Download genome + GTF (RefSeq) via ncbi-datasets-cli
3. Filter GTF to protein-coding mRNAs
4. Create start-codon–window BED (200 nt upstream, 100 nt downstream of ATG) and (if bedtools) extract FASTA
5. Normalize sRNAs (A/C/G/T/N; U→T) and save FASTA
6. Vectorize/scale, run bundled model → predictions.tsv
7. Finish
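The start-codon window step can be sketched as follows. This is a rough illustration, not the package's code: the function name `startcodon_window` is hypothetical, and it assumes that the 100 nt downstream stretch starts at the first base of the start codon, that GTF coordinates are 1-based inclusive, that BED coordinates are 0-based half-open, and that windows are clipped at sequence ends rather than wrapped around a circular chromosome.

```python
def startcodon_window(cds_start, cds_end, strand, seq_len,
                      upstream=200, downstream=100):
    """Return a BED-style (start, end) interval around the start codon.

    cds_start and cds_end are 1-based inclusive GTF coordinates.
    Hypothetical helper for illustration; not part of the starml API.
    """
    if strand == "+":
        # The start codon begins at cds_start; upstream is at lower coordinates.
        codon = cds_start - 1          # 0-based index of the first codon base
        start = codon - upstream
        end = codon + downstream
    elif strand == "-":
        # The start codon begins at cds_end; upstream is at higher coordinates.
        start = cds_end - downstream
        end = cds_end + upstream
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    # Clip to the sequence (assumption: no wrap-around for circular genomes).
    return max(0, start), min(seq_len, end)


print(startcodon_window(1001, 1500, "+", 5000))
print(startcodon_window(1001, 2000, "-", 5000))
```

Away from sequence ends, both strands yield a 300 nt interval; genes close to an end get a shorter window under the clipping assumption.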
- startcodon_windows.bed (and startcodon_windows.fasta if bedtools present)
- srna/srnas.fasta
- predictions.tsv
- logs/run.log
- --srna and --srna-fasta are mutually exclusive.
- If it stops after Step 5, you likely passed a filename to --srna; use --srna-fasta file.fa.
- “No assemblies found” → check --taxid or specify a valid --refseq GCF_....
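To see why a filename passed to --srna cannot be read as a sequence, here is a small sketch of the normalization rule (A/C/G/T/N, with U converted to T). The function name `normalize_srnas` and the choice to raise an error on invalid input are assumptions for illustration; the tool's own behavior on bad input may differ.

```python
import re

# Allowed alphabet after normalization, per the A/C/G/T/N rule.
VALID_SEQ = re.compile(r"[ACGTN]+")


def normalize_srnas(arg):
    """Split a comma-separated --srna style value and normalize each entry.

    Hypothetical helper for illustration; not part of the starml API.
    """
    seqs = []
    for raw in arg.split(","):
        seq = raw.strip().upper().replace("U", "T")
        if not seq:
            continue  # skip empty entries such as a trailing comma
        if not VALID_SEQ.fullmatch(seq):
            raise ValueError(
                f"{raw.strip()!r} is not a nucleotide sequence; "
                "if it is a file, pass it with --srna-fasta instead"
            )
        seqs.append(seq)
    return seqs


print(normalize_srnas("acguu, GGGAT"))
```

A value such as my_sRNAs.fa contains characters outside the allowed alphabet, so a check like this can flag it before any prediction is attempted.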
A paper is forthcoming. Until then, please cite the software:
Vasiliki Kotsira et al. sTarML: A Sequence-Based Machine Learning Framework for Predicting Bacterial sRNA-mRNA Interactions. Version 0.1.0.
@software{starml_software,
author = {Vasiliki Kotsira},
title = {sTarML: A Sequence-Based Machine Learning Framework for Predicting Bacterial sRNA-mRNA Interactions},
year = {2025},
version = {0.1.0}
}

We’ll update this section (and CITATION.cff) with the final paper citation and DOI when available.
- Code: MIT (see LICENSE)
- Model files (starml/models/*): CC BY 4.0 (see LICENSE-models)