Doing some design for an upcoming project and taking a survey.
I'll go first. Model training happened nightly on a Spark cluster, which output a PMML-encoded SVM model. The model was instantiated on a cluster of compute servers running Openscoring. A thin Node web service wrapper used the Openscoring cluster to serve realtime client prediction requests. Dataset sizes were in the hundreds of millions of examples, with hundreds of features. The setup handled thousands of requests per second, no problem.
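To make the serving path concrete: Openscoring exposes deployed PMML models over a small REST API, so the web tier just forwards a feature map as JSON. A minimal sketch, going from memory of the Openscoring API (the model ID, host, and feature names here are made up):

```python
# Minimal sketch of a thin service calling an Openscoring server.
# Assumes a PMML model has already been deployed under the ID
# "churn-svm" (deployment is a PUT of the PMML file to the same URL).
import requests

MODEL_URL = "http://scoring-host:8080/openscoring/model/churn-svm"

def predict(features):
    # Openscoring evaluates a model via POST with a JSON "arguments"
    # map keyed by the PMML field names; outputs come back in "result".
    resp = requests.post(MODEL_URL, json={"arguments": features})
    resp.raise_for_status()
    return resp.json()["result"]

print(predict({"feature_1": 0.37, "feature_2": 12.0}))
```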
Separating the training technology from the execution technology was nice, but the PMML format limits you to the kinds of models that both your trainer and your executor support. What are people doing who use the same tech for both? For something like TensorFlow, I assume you have to save the model as binary at the end of the training step and then ship it off to the prediction cluster to be instantiated again for execution?
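Something like this TF 1.x save/restore round trip is what I'm imagining (the graph and paths here are just placeholders):

```python
# Sketch of the save-then-restore round trip: train in one process,
# serialize to disk, reload in a separate serving process.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4], name="x")
w = tf.Variable(tf.zeros([4, 1]), name="w")
y = tf.matmul(x, w, name="y")

saver = tf.train.Saver()

# Training side: write the variables out as a binary checkpoint.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "/tmp/model/svm")

# Serving side: rebuild the graph and restore the checkpoint.
with tf.Session() as sess:
    saver.restore(sess, "/tmp/model/svm")
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]}))
```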
For deep learning oriented projects, we train on EC2 GPU instances, generally p2.xlarge instances these days to get the Nvidia K80 GPUs. We can spin up many of these in parallel if we are doing model architecture searches or hyperparameter exploration.
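To illustrate the fan-out: each trial in a hyperparameter exploration is an independent config that can go to its own GPU instance. A rough sketch (the search space is made up, and the actual job launch is elided):

```python
# Random hyperparameter search of the kind you might fan out across
# parallel GPU instances; each sampled config would be shipped to its
# own worker.
import random

SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.3, 0.5],
}

def sample_config():
    return {name: random.choice(values)
            for name, values in SEARCH_SPACE.items()}

for trial in range(8):
    print("trial %d -> %s" % (trial, sample_config()))
```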
We have an in-house data turking setup where we can efficiently roll new UIs to get ground-truth data for given problems, generally yielding thousands to tens of thousands of real examples per problem. We also use data augmentation where possible, synthetically generating millions of example data points and combining them with the real turked data for fine tuning. Note that we never look at or train with real user data, unless explicitly donated, so data efficiency is important to us.
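As a toy illustration of that augmentation step (the specific transforms here are illustrative, not what we actually use), a small set of real examples can be expanded synthetically:

```python
# Expand a small set of real (turked) image examples with synthetic
# variants, keeping the real examples in the mix.
import numpy as np

def augment(image, rng):
    # Random horizontal flip.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Small random brightness shift, clipped to the valid range.
    return np.clip(image + rng.uniform(-0.1, 0.1), 0.0, 1.0)

def build_training_set(real_images, copies_per_image, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = [augment(img, rng)
                 for img in real_images
                 for _ in range(copies_per_image)]
    return list(real_images) + synthetic
```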
We've standardized on TensorFlow these days, currently doing inference on CPUs on Dropbox's compute infrastructure. We have a jail setup based on LXC and Provost that allows us to safely execute these trained models in a contained way, responding to user requests as they come in. We use standard distributed systems plumbing for RPCs, queues, etc. to respond to user requests and route them to trained models. We version our trained models, and we have an in-house experiments framework that we use to deploy a new model and test it against subsets of user traffic as we ramp it up.
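For flavor, that ramp-up routing reduces to stable bucketing of users; a minimal sketch (the version names and hashing scheme are illustrative, not our actual experiments framework):

```python
# Route a fixed percentage of users to a candidate model version,
# using a stable hash so each user always sees the same version.
import hashlib

def bucket(user_id, buckets=100):
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def pick_model(user_id, ramp_percent):
    # Users in the first ramp_percent buckets get the candidate;
    # everyone else stays on the production version.
    return "model-v2" if bucket(user_id) < ramp_percent else "model-v1"

print(pick_model("user-42", ramp_percent=5))
```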
Most of our day-to-day work is in Python, with occasional use of C++; other parts of Dropbox sometimes use Go and Rust, though we haven't had need for that on the ML team. Note that Dropbox is one of the largest users of Python in the world (Guido van Rossum actually works here).
BTW, the Machine Learning team at Dropbox is hiring. Come join us! Details: https://www.dropbox.com/jobs/listing/533100