I've spent some time recently with both dask and distributed. Continuum Analytics has a real gem in Matthew Rocklin! I've found the libraries very intuitive.
1/3 of the time is spent in the p_reduce step, and another 1/3 in elemwise. I'm not exactly sure what those do, but I'm guessing they correspond to the reduce-map-reduce steps of evaluating the standard deviation and then dividing the elements by it. The mean has to be calculated twice in the z-score formula. It sounds like the client-worker communication mechanism might be adding extra latency.
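For context, here's a minimal sketch of what I assume that normalized-temperature computation looks like as a dask array expression (the array and chunk sizes are made up); the mean shows up once directly and once inside the standard deviation, so the graph ends up as two reductions followed by an element-wise pass over every chunk:

    import dask.array as da

    # Stand-in for the temperature data: sizes/chunks are hypothetical.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))

    # z-score: reduce (mean) -> reduce (std, which re-computes the mean)
    # -> elemwise (subtract and divide on every chunk).
    z = (x - x.mean()) / x.std()

    z_vals = z.compute()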
I wonder if this would work if the dask arrays are not equal in length, for example if the files were time series of unequal duration.
Also, are there any plans for dask to support distributed numpy functions that require kernel computation at array boundaries, for example scipy.signal.lfilter? I believe it would require ghosting or further inter-dask-array communication that is not yet present.
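To illustrate the kind of thing I mean, here's a minimal sketch using dask's map_overlap, which exchanges a halo between neighbouring chunks. It assumes a short FIR kernel, so only a small overlap is needed; an IIR filter wouldn't decompose this cleanly:

    import numpy as np
    import dask.array as da
    from scipy.signal import lfilter

    # Hypothetical short FIR kernel: a 5-point moving average.
    b = np.ones(5) / 5.0
    a = np.array([1.0])

    x = da.random.random(1_000_000, chunks=100_000)

    # map_overlap shares a depth-sized halo between neighbouring chunks
    # (the "ghosting" mentioned above), applies the function per chunk,
    # then trims the halo off again.
    y = x.map_overlap(
        lambda block: lfilter(b, a, block),
        depth=len(b) - 1,
        boundary=0,
        dtype=x.dtype,
    )

    result = y.compute()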
There's no JVM overhead like there is for Spark computations. The dask array methods call into NumPy through its C API, so the heavy lifting is implemented in C and runs natively on the machine.
I think you might have misunderstood my comment; I was referring to the bullet point under "What didn't work":
> Reduction speed: The computation of normalized temperature, z, took a surprisingly long time. I’d like to look into what is holding up that computation.
It was rejected as being unpythonic [1], even though the base functionality to save a particular data frame is already present.
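A minimal sketch of that existing per-DataFrame functionality (the file name is just an example); it's the whole-workspace equivalent that's missing:

    import pandas as pd

    df = pd.DataFrame({"temp": [21.3, 22.1, 19.8]})

    df.to_pickle("df.pkl")               # save one DataFrame to disk
    restored = pd.read_pickle("df.pkl")  # load it back later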
Could what dask is doing be adapted to the simpler use case of saving a workspace snapshot?
[1] https://github.com/pydata/pandas/issues/12381#issuecomment-1...