Hello,
I have a few questions regarding your statement:
"Please see the template data at /example/sample_data/pre. If you are trying to pre-train DNABERT with your own data, please process your data into the same format as it."
Could you clarify the exact steps you followed to create this pre-training data format? Specifically:
Did you concatenate multi-line FASTA sequences into a single continuous sequence per chromosome, then remove the FASTA headers (sequence IDs)?
After that, did you split each chromosome sequence into 512 bp blocks, with each block written as a separate line in the file? (e.g., block 1 of chromosome 1 = line 1, block 2 of chromosome 1 = line 2, … block 1 of chromosome 4 = line 10, block 2 of chromosome 4 = line 11, etc.)
Was each 512 bp block then k-merized (e.g., with k=6) before being used for training?
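To make the three questions above concrete, here is a minimal sketch of the pipeline I have in mind. It is only my guess at the procedure, not something confirmed by the repository: the function names (`read_fasta`, `split_blocks`, `kmerize`, `fasta_to_pretrain_lines`) are hypothetical, and I am assuming overlapping k-mers with stride 1, separated by spaces, with trailing blocks shorter than k dropped.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Join multi-line FASTA records into one continuous sequence per header."""
    sequences: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def split_blocks(seq: str, block_size: int = 512) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping blocks."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def kmerize(block: str, k: int = 6) -> str:
    """Assumed format: overlapping k-mers (stride 1) joined by single spaces."""
    return " ".join(block[i:i + k] for i in range(len(block) - k + 1))


def fasta_to_pretrain_lines(lines: Iterable[str],
                            block_size: int = 512,
                            k: int = 6) -> List[str]:
    """One output line per block; headers are discarded along the way."""
    out: List[str] = []
    for seq in read_fasta(lines).values():
        for block in split_blocks(seq, block_size):
            if len(block) >= k:  # assumption: skip blocks too short for one k-mer
                out.append(kmerize(block, k))
    return out
```

With this sketch, the blocks of chromosome 1 come first in the output, followed by the blocks of the next chromosome, and so on, which is the line ordering I described above. Please correct any step that differs from what you actually did.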
Additionally, I noticed your example data does not contain k-mers with “N”. Were those filtered out, or were they simply not present in the input sequences?
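For reference, this is how I would check for and filter such k-mers myself. It is a sketch under my own assumptions (space-separated k-mers per line, hypothetical helper names `has_n` and `drop_kmers_with_n`); I do not know whether your pipeline does anything like this, which is why I am asking.

```python
def has_n(kmer_line: str) -> bool:
    """Return True if any k-mer on the line contains an ambiguous base 'N'."""
    return any("N" in kmer.upper() for kmer in kmer_line.split())


def drop_kmers_with_n(kmer_line: str) -> str:
    """Remove every k-mer that contains 'N'; keep the rest in order."""
    return " ".join(kmer for kmer in kmer_line.split() if "N" not in kmer.upper())
```

An alternative would be to discard the whole 512 bp block whenever `has_n` is true, rather than removing individual k-mers; I would like to know which of the two, if either, matches the example data.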
Thanks