Clarification on DNABERT Pre-training Data Format #129

@vkaz39

Hello,

I have a few questions regarding your statement:

"Please see the template data at /example/sample_data/pre. If you are trying to pre-train DNABERT with your own data, please process your data into the same format as it."

Could you clarify the exact steps you followed to create this pre-training data format? Specifically:

Did you concatenate multi-line FASTA sequences into a single continuous sequence per chromosome, then remove the FASTA headers (sequence IDs)?

After that, did you split each chromosome sequence into 512 bp blocks, with each block written as a separate line in the file? (e.g., block 1 of chromosome 1 = line 1, block 2 of chromosome 1 = line 2, … block 1 of chromosome 4 = line 10, block 2 of chromosome 4 = line 11, etc.)

Was each 512 bp block then k-merized (e.g., with k=6) before being used for training?
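To make the three steps above concrete, here is a minimal Python sketch of the pipeline as I currently understand it. This is my guess at the procedure, not the authors' confirmed method: the function names (`read_fasta`, `split_blocks`, `kmerize`, `fasta_to_pretrain_lines`) are hypothetical, and the non-overlapping 512 bp blocks, the stride-1 k-mers, and the dropping of blocks shorter than k are all assumptions on my part.

```python
from typing import Iterable, Iterator, List


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one continuous uppercase sequence per FASTA record, headers dropped."""
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks)
                chunks = []
        else:
            chunks.append(line.upper())
    if chunks:
        yield "".join(chunks)


def split_blocks(seq: str, block_size: int = 512) -> List[str]:
    """Cut a sequence into non-overlapping blocks; the last block may be shorter."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def kmerize(block: str, k: int = 6) -> str:
    """Return the overlapping k-mers of a block (stride 1), joined by spaces."""
    return " ".join(block[i:i + k] for i in range(len(block) - k + 1))


def fasta_to_pretrain_lines(lines: Iterable[str],
                            block_size: int = 512,
                            k: int = 6) -> List[str]:
    """One output line per block, in record order; blocks shorter than k are skipped."""
    out: List[str] = []
    for seq in read_fasta(lines):
        for block in split_blocks(seq, block_size):
            if len(block) >= k:
                out.append(kmerize(block, k))
    return out
```

With a real file this would be called as `fasta_to_pretrain_lines(open(path))`, where `path` is a placeholder for the genome FASTA, and each returned string would be written as one line of the training file. If the actual procedure differs (for example overlapping or variable-length blocks), that is exactly what I would like to learn.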

Additionally, I noticed your example data does not contain k-mers with “N”. Were those filtered out, or were they simply not present in the input sequences?
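For reference, these are the two ways I can imagine "N" being handled if it was filtered. Both are my own guesses written as a small Python sketch, with hypothetical helper names (`drop_n_kmers`, `split_on_n`); I did not find either in the repository.

```python
import re
from typing import List


def drop_n_kmers(kmer_line: str) -> str:
    """Option A: remove every k-mer token that contains 'N' from a k-merized line."""
    return " ".join(tok for tok in kmer_line.split() if "N" not in tok.upper())


def split_on_n(seq: str, min_len: int = 6) -> List[str]:
    """Option B: cut the raw sequence at runs of 'N' before k-merizing,
    keeping only fragments with at least min_len bases."""
    return [frag for frag in re.split(r"N+", seq.upper()) if len(frag) >= min_len]
```

Option A leaves gaps inside a line, so neighbouring k-mers no longer overlap there, while Option B produces shorter but contiguous fragments. Knowing which of these (if either) was used would help me reproduce the format.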

Thanks
