I haven't tried this myself yet, but I wanted to check whether you'd tried parallelising some of the code to speed it up on multi-processor machines.
I'm already using command.run_in_memory = true, which gives a big improvement in performance: a 60% saving in the time taken to compile decc_model.