Clarification on DNABERT Pre-training Data Format #129

Hello,

I have a few questions regarding your statement:

> Please see the template data at /example/sample_data/pre. If you are trying to pre-train DNABERT with your own data, please process your data into the same format as it.


Could you clarify the exact steps you followed to create this pre-training data format? Specifically:

1. Did you concatenate multi-line FASTA sequences into a single continuous sequence per chromosome, then remove the FASTA headers (sequence IDs)?

2. After that, did you split each chromosome sequence into 512 bp blocks, with each block written as a separate line in the file? (e.g., block 1 of chromosome 1 = line 1, block 2 of chromosome 1 = line 2, … block 1 of chromosome 4 = line 10, block 2 of chromosome 4 = line 11, etc.)

3. Was each 512 bp block then k-merized (e.g., with k=6) before being used for training?

Additionally, I noticed your example data does not contain k-mers with “N”. Were those filtered out, or were they simply not present in the input sequences?
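To make the questions concrete, here is a minimal sketch of the pipeline I am currently assuming. It is my guess, not something taken from your code. The function names, the non-overlapping 512 bp blocks, the space-separated overlapping k-mers on each line, and the `drop_n` switch are all my own assumptions.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Join multi-line FASTA records into one uppercase sequence per header."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else f"record_{len(chunks)}"
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line.upper())
    return {n: "".join(parts) for n, parts in chunks.items()}


def split_blocks(seq: str, block_size: int = 512) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping blocks (assumed)."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def kmerize(block: str, k: int = 6) -> str:
    """Turn a block into overlapping k-mers (stride 1), separated by spaces."""
    return " ".join(block[i:i + k] for i in range(len(block) - k + 1))


def to_pretrain_lines(records: Dict[str, str], block_size: int = 512,
                      k: int = 6, drop_n: bool = True) -> List[str]:
    """One output line per block, without headers; this layout is assumed."""
    out: List[str] = []
    for seq in records.values():
        for block in split_blocks(seq, block_size):
            if len(block) < k:
                continue  # trailing block too short to form a single k-mer
            if drop_n and "N" in block:
                continue  # hypothetical filter: skip blocks containing N
            out.append(kmerize(block, k))
    return out
```

With a real genome I would call `read_fasta(open(path))` and write the result of `to_pretrain_lines(...)` to a text file, one line per block. If your actual procedure differs at any step (overlapping blocks, variable block lengths, different handling of "N"), that is exactly what I would like to know.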

Thanks
