The intermediate similarities file can become extremely large when no minimum similarity threshold is set. Rather than writing the whole file in one stage and then reducing it in the next, the two stages could be combined into one. This would cut disk usage considerably, at the cost of a somewhat more complicated process.
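As a rough illustration of the combined approach, the sketch below computes each pairwise similarity and reduces it immediately, so no intermediate file is ever written. Everything in it is an assumption made for illustration and not taken from the actual pipeline: the function names `cosine` and `top_k_neighbours`, the use of cosine similarity, the in-memory dict-of-dicts input, and the idea that the reduction step keeps the `k` most similar neighbours per item.

```python
import heapq
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def top_k_neighbours(vectors, k=2):
    """Hypothetical single-pass variant: score each pair and reduce at once.

    Only the k best neighbours per item are held at any time, so storage
    grows with (items * k) instead of with the number of pairs.
    """
    heaps = defaultdict(list)  # item -> min-heap of (similarity, neighbour)
    for x, y in combinations(sorted(vectors), 2):
        sim = cosine(vectors[x], vectors[y])
        # The similarity is symmetric, so offer it to both items.
        for item, other in ((x, y), (y, x)):
            entry = (sim, other)
            if len(heaps[item]) < k:
                heapq.heappush(heaps[item], entry)
            elif entry > heaps[item][0]:
                # Better than the weakest kept neighbour: swap it in.
                heapq.heapreplace(heaps[item], entry)
    # Return each item's neighbours ordered from most to least similar.
    return {item: sorted(heap, reverse=True) for item, heap in heaps.items()}
```

The trade-off mentioned above shows up here: the scoring loop and the reduction logic are now interleaved, so neither can be rerun or inspected on its own, but the full set of pairwise similarities never has to be stored.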