
Commit 88ee1de

ameynert and claude committed
feat: add rewrite_fasta tool
Port rewrite_fasta.py from human-diversity-reference/scripts as a defopt-compatible toolkit tool. Filters a FASTA file to keep only canonical chromosomes (chr1-22, X, Y, MT).

Fixes a potential UnboundLocalError by initialising keep=False before the loop and removes the unused header_line variable from the original.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
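To illustrate the UnboundLocalError mentioned in the commit message: if `keep` is first assigned inside the loop, an input whose first line is not a `>` header reaches the `elif keep` test before any assignment has happened. The following is a minimal sketch, assuming the original script had roughly this shape (the original source is not shown in this commit, and the function names `filter_lines_unsafe` and `filter_lines_safe` are hypothetical):

```python
def filter_lines_unsafe(lines, contigs_to_keep):
    """Hypothetical pre-fix shape: `keep` is first assigned inside the loop."""
    kept = []
    for line in lines:
        if line.startswith(">"):
            keep = line.split()[0][1:] in contigs_to_keep
            if keep:
                kept.append(line)
        elif keep:  # UnboundLocalError if no header line has been seen yet
            kept.append(line)
    return kept


def filter_lines_safe(lines, contigs_to_keep):
    """Same loop with `keep` initialised up front, as the commit describes."""
    keep = False  # lines before the first header are simply dropped
    kept = []
    for line in lines:
        if line.startswith(">"):
            keep = line.split()[0][1:] in contigs_to_keep
            if keep:
                kept.append(line)
        elif keep:
            kept.append(line)
    return kept
```

With the initialisation in place, any sequence lines that precede the first header are skipped rather than raising.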
1 parent 5245c8e commit 88ee1de

1 file changed (+30, -0 lines changed)
@@ -0,0 +1,30 @@
+"""Tool to filter a FASTA file to canonical chromosomes only."""
+
+from pathlib import Path
+
+import tqdm
+
+
+def rewrite_fasta(*, fasta_path: Path, output_path: Path) -> None:
+    """
+    Rewrite a FASTA file keeping only canonical chromosomes (chr1-22, X, Y, MT).
+
+    Filters out alt contigs, decoy sequences, and any other non-canonical sequences.
+    Reads the input file line by line so it can handle arbitrarily large FASTA files.
+
+    Args:
+        fasta_path: Path to the input FASTA file.
+        output_path: Path to write the filtered FASTA file.
+    """
+    contigs_to_keep = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrMT"}
+
+    keep = False
+    with open(fasta_path) as f, open(output_path, "w") as out:
+        for line in tqdm.tqdm(f):
+            if line[0] == ">":
+                contig = line.split()[0][1:]
+                keep = contig in contigs_to_keep
+                if keep:
+                    out.write(line)
+            elif keep:
+                out.write(line)
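One way to check the behaviour of the added function is to run the same filtering logic on a small temporary FASTA file. The sketch below is a condensed restatement of the loop without the tqdm progress bar, so it runs with the standard library only. The name `rewrite_fasta_sketch`, the `demo` helper, and the sample contig names and sequences are made up for illustration and are not part of the commit.

```python
import tempfile
from pathlib import Path


def rewrite_fasta_sketch(fasta_path: Path, output_path: Path) -> None:
    """Condensed restatement of the committed loop, minus tqdm (illustrative)."""
    contigs_to_keep = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrMT"}
    keep = False
    with open(fasta_path) as f, open(output_path, "w") as out:
        for line in f:
            if line.startswith(">"):
                # Contig name is the first whitespace-delimited token, minus ">".
                keep = line.split()[0][1:] in contigs_to_keep
            if keep:
                out.write(line)


def demo() -> str:
    """Filter a made-up FASTA and return the filtered text."""
    sample = (
        ">chr1 made-up description\nACGT\nTTGA\n"
        ">chr1_KI270706v1_random\nGGGG\n"
        ">chrM\nCCCC\n"
        ">chrX\nAAAA\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.fa"
        dst = Path(tmp) / "out.fa"
        src.write_text(sample)
        rewrite_fasta_sketch(src, dst)
        return dst.read_text()


if __name__ == "__main__":
    print(demo(), end="")
```

Matching is exact on the first whitespace-delimited token of the header, so with this set a header such as `>chrM` is dropped while `>chrMT` would be kept, and the description after the contig name is preserved in the output.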
