I've spent some time recently with both dask and distributed. Continuum Analytics has a real gem in Matthew Rocklin! I've found the libraries very intuitive.
1/3 of the time is spent in the p_reduce step, and another 1/3 in elemwise. I'm not exactly sure what those do, but I'm guessing they correspond to the reduce-map-reduce steps of evaluating the standard deviation and then dividing the elements by it. The mean has to be calculated twice in the z-score formula. It sounds like the client-worker communication mechanism might be adding extra latency.
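For context, here's a minimal sketch of what I assume that normalized-temperature computation looks like as a dask array expression (the array and chunk sizes are made up); the mean shows up once directly and once inside the standard deviation, so the graph ends up as two reductions followed by an element-wise pass over every chunk:

    import dask.array as da

    # Stand-in for the temperature data: sizes/chunks are hypothetical.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))

    # z-score: reduce (mean) -> reduce (std, which re-computes the mean)
    # -> elemwise (subtract and divide on every chunk).
    z = (x - x.mean()) / x.std()

    z_vals = z.compute()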
I wonder if this would work if the dask arrays are not equal in length, for example if the files were time series of unequal duration.
Also, are there any plans for dask to support distributed numpy functions that require kernel computation at array boundaries, for example scipy.signal.lfilter? I believe it would require ghosting or further inter-dask-array communication that is not yet present.
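To illustrate the kind of thing I mean, here's a minimal sketch using dask's map_overlap, which exchanges a halo between neighbouring chunks. It assumes a short FIR kernel, so only a small overlap is needed; an IIR filter wouldn't decompose this cleanly:

    import numpy as np
    import dask.array as da
    from scipy.signal import lfilter

    # Hypothetical short FIR kernel: a 5-point moving average.
    b = np.ones(5) / 5.0
    a = np.array([1.0])

    x = da.random.random(1_000_000, chunks=100_000)

    # map_overlap shares a depth-sized halo between neighbouring chunks
    # (the "ghosting" mentioned above), applies the function per chunk,
    # then trims the halo off again.
    y = x.map_overlap(
        lambda block: lfilter(b, a, block),
        depth=len(b) - 1,
        boundary=0,
        dtype=x.dtype,
    )

    result = y.compute()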
There's no JVM overhead like there is for Spark computations. The dask array methods call into NumPy through its C API, so the heavy lifting is implemented in C and runs natively on the machine.
I think you might have misunderstood my comment; I was referring to the bullet point under "What didn't work":
> Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.
It was rejected as being unpythonic [1], even though the base functionality to save a particular data frame is already present.
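A minimal sketch of that existing per-DataFrame functionality (the file name is just an example); it's the whole-workspace equivalent that's missing:

    import pandas as pd

    df = pd.DataFrame({"temp": [21.3, 22.1, 19.8]})

    df.to_pickle("df.pkl")               # save one DataFrame to disk
    restored = pd.read_pickle("df.pkl")  # load it back later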
Could what dask is doing be adapted to the simpler use case of saving a workspace snapshot?
[1] https://github.com/pydata/pandas/issues/12381#issuecomment-1...