According to this paper (https://dl.acm.org/doi/10.1145/3061665, with its PDF here), the `nrm2` algorithm used in the OpenBLAS reference, and therefore the one in SoftBLAS, is around 7x slower than it needs to be.
@sigilante suggests writing alternative implementations, `*nrm2_B.c`, that use the faster algorithm.