After some profiling, I found that step4 & step5 are actually the most time-consuming. Could you shed some light on how to parallelize these parts?