@chris-ha458 has made some great improvements to BFF in the https://github.com/allenai/dolma repo. We should back-port those changes here, especially the ones that affect correctness, such as the choice of hash functions.
Chris' PRs are here:
They won't apply 1:1 because the code in the Dolma repo has changed, but the important changes should at least carry over.