You cannot make an efficient fluid mechanics simulation on a 4000x4000x4000 grid if you set up a separate process for each individual gridcell. More efficient to just store your numbers in 3d arrays.
If you store the data in arrays, you can use matrix multiplication libraries such as Intel's MKL or OpenBLAS, which are written to be exceptionally optimized for use on multiple cores. I cannot emphasize enough how much time and effort has been put into these libraries to multiply matrices as fast as can possibly be done.
If you use processes such as in the Erlang VM, they're doing calculations, sure, but they're also sending messages back and forth, and they're acting as supervisors, and they're being shuffled around by the VM. There's a lot going on. And that extra stuff that's going on takes away from the time you could be multiplying stuff. And even then, there's been no optimization done for this sort of calculation. There are a lot of tricks you can do. Heck, the better matrix multiplication libraries have individual optimizations for CPUs.