Nice work!
I wonder what the format of the raw data (wiki and bc) is. Is it one sentence per line, with an empty line between different articles?
It would be great if you could share the two raw files mentioned in ./preprocess/pretrain/process.sh.