MMLSpark is Microsoft’s open source initiative for advancing distributed computing in Apache Spark. MMLSpark provides deep-learning, intelligent microservices, model deployment, model interpretability, image-processing, and many other tools to help you build scalable, production-quality machine learning applications. MMLSpark is usable from Python, Scala, Java, and R and can be used on any Spark cluster. For more information and setup instructions, see our GitHub page:
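For flavor, the PySpark usage looks roughly like the sketch below. This is a hedged example, not authoritative setup: it assumes the `TrainClassifier` helper and existing `train`/`test` Spark DataFrames with a "label" column; see the GitHub page for the real quickstart.

```python
# Hedged sketch of MMLSpark from PySpark; `train` and `test` are assumed to be
# existing Spark DataFrames with a "label" column.
from pyspark.ml.classification import LogisticRegression
import mmlspark

model = mmlspark.TrainClassifier(model=LogisticRegression(), labelCol="label").fit(train)
predictions = model.transform(test)
```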
Having evaluated Spark extensively for my company’s ML use cases, I came away deeply disappointed.
One thing that bugs me in particular is that it essentially presumes all workflows involve huge, distributed datasets. But most model development work, especially for projects that will eventually be trained for production on huge, distributed datasets, begins its life cycle as a small-data prototype with an extremely low-overhead development cycle. I’ve never found a way to organize Spark so that it can handle both tiny WIP prototyping and production ML workloads. Inevitably you end up with two separate tool stacks and lots of kludgy processes for migrating a successful prototype over to a Spark implementation, because the repeated overhead of Spark in the small prototype phase is far too costly and limiting.
Other gripes include the lack of “bring your own container” options for controlling the surrounding runtime details of different models on a project-by-project basis, and, for Python at least, the lack of any way to entirely bypass py4j and eliminate all JVM overhead, especially when working in cases that rely heavily on custom Python extension modules and extension types.
On top of that, for use cases where you have to write your own likelihood functions for models that don’t already exist in the Spark libraries, and potentially need to get down to the level of tuning the actual optimizer you’ll use to train, Spark is extremely opaque and full of incomprehensible errors and debugging issues. Ideally you’d like to write those routines once and be able to use them across any development environment (whether executing via Spark or not); instead, the friction creates “vendor lock-in” incentives where you don’t even try things at all unless they come out of the box in Spark.
All this is to say, I wish that initiatives pursuing the very generous goal of open-sourcing ML tools for the community would treat portability across different runtime environment / cluster computing / backend storage models as a first-class requirement.
Hey, thanks for your interesting point of view on Spark. I do a lot of my small data prototyping in Spark in single machine mode and it has worked for me thus far :).
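For concreteness, here is a minimal sketch of what single-machine (local) mode looks like with PySpark, assuming only that pyspark is installed:

```python
# Local-mode PySpark session for small-data prototyping; no cluster needed,
# everything runs in-process on the local machine.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")    # use all local cores
         .appName("prototype")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
```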
For the likelihood functions comment, I would totally agree. Autograd libraries are easier to build custom likelihood models in, which is why we created CNTK on Spark, and Databricks created TensorFlow on Spark. These give you the flexibility of modern deep learning stacks with the elasticity of Spark.
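As a hedged illustration of the “custom likelihood in an autograd library” idea, here is a sketch in plain single-machine TensorFlow (not CNTK on Spark or TensorFlow on Spark); the Poisson-regression model and synthetic data are made up for the example:

```python
import tensorflow as tf

# Synthetic Poisson-regression data: y ~ Poisson(exp(X @ w_true)).
X = tf.random.normal([1000, 3])
w_true = tf.constant([[0.5], [-1.0], [0.8]])
y = tf.random.poisson([1], tf.exp(X @ w_true))[0]

w = tf.Variable(tf.zeros([3, 1]))

def neg_log_likelihood():
    rate = tf.exp(X @ w)
    # Poisson NLL up to an additive constant: rate - y * log(rate)
    return tf.reduce_mean(rate - y * tf.math.log(rate))

# Plain gradient descent; autograd supplies the gradient of the custom NLL.
for _ in range(500):
    with tf.GradientTape() as tape:
        loss = neg_log_likelihood()
    grad = tape.gradient(loss, w)
    w.assign_sub(0.05 * grad)

print(w.numpy())  # should end up near w_true
```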
In the end, Spark is a single tool in a collection of tools and might not be right for your project, but it's been good for a lot of our work here at MSFT :)!
On the point about autograd tools: unfortunately they’re not always helpful for models that aren’t amenable to that type of framework, like gradient-free methods, custom Bayesian inference models, or modified versions of some traditional models (like bias-corrected logistic regression).
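To make the gradient-free case concrete, here is a hedged sketch of fitting a hand-written likelihood with Nelder-Mead in SciPy, with no autograd framework (or Spark) involved; the logistic-regression data is synthetic:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data from a logistic model with coefficients (0.3, 1.2).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(0.3 + 1.2 * x)))).astype(float)

def neg_log_lik(params):
    # Hand-written Bernoulli negative log-likelihood.
    b0, b1 = params
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient-free optimization: no derivatives of the likelihood required.
result = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
print(result.x)
```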
Here again: if you tie machine learning to a big system like Spark, which is typically a huge IT cost at a lot of companies, and if a company commits to an underlying data model suitable for Spark, it ends up orienting everything around Spark (someone with the scale of Microsoft might not suffer this problem like everybody else does). All together, it usually renders Spark such a limiting choice that it's totally impractical to standardize on it and give up all the other types of models, data storage, and cluster computing techniques you might need on a project-by-project basis.
I definitely would agree that you should pick the best tool for the job and not limit yourself to one ecosystem if it's too difficult.
One way to make Spark a bit easier to work with is through Kubernetes, or a tool like Databricks that provides it as a service. Kubernetes, in particular, gives you a really nice amount of flexibility and composability when designing systems. One thing we created to fill the gap of having to integrate System X with Spark was HTTP on Spark. This makes it easy to integrate Spark with other tools in a microservice architecture. When you couple this with containers, you can do a lot very quickly.
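A minimal sketch of the microservice pattern being described: calling an external, containerized model service from Spark. Note this uses a plain Python UDF plus `requests` rather than the actual HTTP on Spark API, and the service URL and JSON contract are hypothetical.

```python
# Sketch only: scoring rows against a hypothetical HTTP model service from
# Spark using a plain Python UDF (not the HTTP on Spark transformer).
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("some text",), ("more text",)], ["text"])

def score(text):
    # Assumed endpoint and response shape of a containerized model service.
    resp = requests.post("http://model-service:8080/score", json={"text": text})
    return float(resp.json()["score"])

df.withColumn("score", udf(score, DoubleType())("text")).show()
```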
For data types, I would look into the different Spark connectors; these days there's one for almost every database, streaming service, and cloud store under the sun.
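For instance, reading through the built-in JDBC data source looks roughly like this; the host, database, table, and credentials are placeholders, and an existing `spark` session is assumed:

```python
# Hedged example of a Spark connector read via the built-in JDBC source;
# connection details are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics")
      .option("dbtable", "public.events")
      .option("user", "analyst")
      .option("password", "...")
      .option("driver", "org.postgresql.Driver")
      .load())
df.printSchema()
```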
That being said, Spark is a large piece of software that uses many different programming concepts, which can be daunting. Our goal is to listen to feedback like this and try to make the Spark ecosystem a bit easier to use for everyone.
This is actually the part I most disagree with. The overhead of connectors to Spark, particularly anything that goes through py4j, is far too limiting except in cases when the data workload is so large that it effectively amortizes the overhead. For small-scale prototypes it's a disaster, and then there are separate concerns about data type marshaling through the JVM when you may have a Python-only data model.
When I evaluated it, I also found Databricks had extremely limited support for runtime environments defined by arbitrary containers. You have to select cluster nodes according to their prescribed images and choices.
Say you need TensorFlow or CUDA compiled with a weird set of optimization flags, or you need other special provisions in the runtime environment. In fact, variations of the runtime environment may even be part of some reproducible experiments, so you need to execute across a variety of parameters that govern how the runtime is built.
Anything that can’t support this type of stuff out of the box is just not worth it. Anybody can hook a notebook environment up to analyze data from some data warehouse or distributed file system.
The hard part is always how to make that setup configurable and parameterizable across the needs of different projects, especially arbitrary runtime environments.
Spark is intended to work on big datasets. Its machine learning capability is very limited, and its primary strength is processing huge amounts of data. I think it's unfair to blame it for 'failing' on small datasets.
I agree that Spark may be a fine choice for ETL and generic pipeline tasks.
But lots of companies will choose it as a data warehouse computation layer and then enforce a policy of standardizing everything around it, including tasks like machine learning for which Spark is poorly suited.
Worse, companies like Databricks will encourage this standardization and act like yes-man consultants, promising that Spark ML offerings can solve all the problems. You quickly end up with a brittle monster of a data warehouse system oriented around what’s convenient for Spark (which can’t effectively solve those problems), where everything is deeply inconvenient to pipe to non-Spark systems, and nobody is sympathetic to any budgetary needs for other systems, since the budget was all spent on Spark.
I am happy to see support for Spark from Microsoft and other large organizations. Intel also added support to run Spark on GPUs, but it requires the code to be modified. Does Microsoft require us to modify our existing Spark code?
We have added support for integrating anything that communicates through HTTP into Spark, so if you put it behind a service we can let you use it in Spark. The Mobius library also does something similar:
https://github.com/Azure/mmlspark
Thank you!