Calling any caching or interpolating function that uses numpy.linalg.solve() while render_engine is set to MulticoreEngine() can cause a severe drop in performance if numpy is built with OpenBLAS and the OPENBLAS_NUM_THREADS environment variable is not set to 1. This happens because solve() is itself multithreaded: when it is called, we get N parallel processes, each of which starts N parallel threads, so about N*N threads end up competing for the CPU.
The solution is to set OPENBLAS_NUM_THREADS = 1 or OMP_NUM_THREADS = 1 before running a simulation. I think this is worth mentioning in the documentation, because a user who is not familiar with the code might not figure it out.
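For illustration, here is a minimal sketch of applying the workaround from inside a script rather than from the shell. The helper name limit_blas_threads is hypothetical and not part of the project. The sketch assumes the variables are set before numpy is first imported, since OpenBLAS generally reads them when the library is loaded.

```python
import os


def limit_blas_threads(n_threads=1):
    """Hypothetical helper: cap the BLAS/OpenMP thread count for this process."""
    value = str(n_threads)
    # Assign directly instead of using setdefault, so that a larger value
    # inherited from the environment is overridden.
    os.environ["OPENBLAS_NUM_THREADS"] = value
    os.environ["OMP_NUM_THREADS"] = value


# Call this at the very top of the demo script, before numpy is imported.
limit_blas_threads(1)

# import numpy  # numpy and the simulation modules go only after this point
```

Exporting the same variables in the shell before launching the demo should have the same effect and avoids depending on import order.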
Affected demos: beam_into_slab.py, beam.py, plasma-and-beam.py