Scoring:
Essentially, raw logs go to Kafka. Java processes then read the raw logs and aggregate per-user state in real time. State is stored in Redis. When a user's state is updated, it is sent into another Kafka topic, the "features" topic. Features are simultaneously saved to HDFS as future training data and consumed by a Storm cluster for prediction. Storm runs pickled scikit-learn models in bolts that read in the features and output scores. Scores are sent into a "score" Kafka topic, and downstream systems read the scores from there.
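Roughly, the scoring step boils down to something like this. This is a stripped-down stand-in using a plain kafka-python loop rather than an actual Storm bolt; the topic names, broker address, and JSON message format here are illustrative assumptions, not the real schema.

```python
# Stand-in sketch for the scoring bolt, assuming kafka-python and a model
# pickled with scikit-learn. Topic names, the broker address, and the JSON
# message layout are illustrative assumptions.
import json
import pickle

from kafka import KafkaConsumer, KafkaProducer

# In production the pickled model is pulled from HDFS when the topology
# starts; a local file keeps the sketch simple.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

consumer = KafkaConsumer("features", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for msg in consumer:
    record = json.loads(msg.value)            # e.g. {"user_id": ..., "features": [...]}
    score = model.predict_proba([record["features"]])[0][1]
    producer.send("score", json.dumps(
        {"user_id": record["user_id"], "score": float(score)}).encode("utf-8"))
```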
Time from receiving a log to producing a score is ~x seconds.
Training:
Training data is stored in HDFS by the real-time system, so our training and production data is identical. We have IPython notebooks that pull in the training data and build a model plus a scikit-learn feature transformation pipeline. This is pickled and saved to HDFS for versioning. When the Storm cluster starts a topology, it loads the appropriate model from HDFS for classifying data.
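The notebook's training step looks roughly like the sketch below. The estimator choice, column names, and local paths are placeholders; in practice the training data comes from HDFS and the pickled pipeline goes back there.

```python
# Sketch of the notebook training step: build a scikit-learn feature
# transformation + model pipeline, fit it, and pickle the result.
# Column names, the estimator, and local file paths are placeholders.
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("training_data.csv")        # pulled from HDFS in practice
X, y = train.drop(columns=["label"]), train["label"]

pipeline = Pipeline([
    ("scale", StandardScaler()),                # feature transformation step
    ("clf", RandomForestClassifier(n_jobs=-1)), # ensemble model, trains in parallel
])
pipeline.fit(X, y)

with open("model-v1.pkl", "wb") as f:           # versioned copy is saved to HDFS
    pickle.dump(pipeline, f)
```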
> How big a machine do you need to run notebooks effectively?
- Not too big. I've run tests of model performance vs. how much of the data we sample, and for the models we use we can downsample quite a lot (i.e. train on only a fraction of the full training data). See the sketch after this list.
- If we did need a bigger machine, we're using ensemble models that parallelize training nicely.
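The sampling check mentioned in the first point looks roughly like this, reusing the `X`, `y`, and `pipeline` names from the training sketch above; the fractions and the AUC metric are arbitrary choices for illustration.

```python
# Rough version of the performance-vs-sampling test: fit on growing
# fractions of the training data and watch a holdout metric.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(len(X_train) * frac)
    pipeline.fit(X_train[:n], y_train[:n])
    auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
    print("fraction=%.2f  holdout AUC=%.3f" % (frac, auc))
```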
> Why a notebook... why not a Python script?
I was originally going to use a Python script, but I found it useful to have the notebook output performance charts and metrics inline. It's easier to keep them in the notebook than to output a bunch of image artifacts that have to be added to version control. This way I can pull open the notebook and scroll through to check all my visual metrics.
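For example, the notebook renders checks like an ROC curve inline; this sketch assumes the fitted `pipeline` and the holdout split from the sampling sketch above.

```python
# Example of an inline visual metric in the notebook: an ROC curve for the
# fitted pipeline on the holdout split from the earlier sketch.
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

probs = pipeline.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label="AUC = %.3f" % auc(fpr, tpr))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()   # rendered inline in the notebook, so nothing lands in version control
```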
I'm not opposed to ditching the notebook for training entirely, but for now it works just fine.
So you build a notebook and play around with it... and then run this notebook in an automated way? So you can open the notebook anytime and work with it?
I really love this quick, iterative way of working (at least in the early days). Could you talk about your production training setup? I'm just concerned about performance, etc. Is it OK to train manually each time (by opening the notebook, etc.)?
So far our models remain fairly stable on a ~weeks timescale. If we needed to retrain daily or similar, I would invest time in something other than a notebook. But right now it's easier to have the training steps documented in the notebook that does the actual training than to build a separate system and document it.
Not claiming it's a good way of doing this, but just how it is right now.