
We wrote a blog post detailing much of how we built our streaming ML system at Distil (https://resources.distilnetworks.com/all-blog-posts/building...). We classify web traffic as human or bot in real time.

Scoring: Essentially, raw logs go to Kafka. Java processes then read the raw logs and aggregate per-user state in real time. State is stored in Redis. When a user's state is updated it is sent to another Kafka topic, the "features" topic. Features are simultaneously saved to HDFS as future training data and consumed by a Storm cluster for prediction. Storm runs pickled scikit-learn models in bolts that read in the features and output scores. Scores are sent to a "score" Kafka topic, which downstream systems can read from.
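
If it helps, the scoring step is conceptually something like the sketch below. This is a standalone approximation, not our actual Storm bolt; the Kafka client (kafka-python), message schema, and model path are assumptions, and only the topic names come from the description above.

    # Standalone sketch of the scoring step; in production this logic lives in a Storm bolt.
    # Topic names "features" and "score" match the description above; everything else
    # (client library, message schema, model path) is a placeholder.
    import json
    import pickle

    from kafka import KafkaConsumer, KafkaProducer

    with open("model.pkl", "rb") as f:          # pickled scikit-learn pipeline
        model = pickle.load(f)

    consumer = KafkaConsumer(
        "features",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda m: json.dumps(m).encode("utf-8"),
    )

    for msg in consumer:
        features = msg.value                    # e.g. {"user_id": ..., "f": [numeric features]}
        bot_prob = model.predict_proba([features["f"]])[0][1]
        producer.send("score", {"user_id": features["user_id"], "score": float(bot_prob)})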

Time from receiving a log to producing a score is ~x seconds.

Training:

Training data is stored in HDFS from the real-time system, so our training and production data are identical. We have IPython notebooks that pull in the training data and build a model and a scikit-learn feature-transformation pipeline. This is pickled and saved to HDFS for versioning. When the Storm cluster starts a topology, it loads the appropriate model from HDFS for classifying data.
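
A rough sketch of what a training notebook cell does, under some assumptions: the feature file, columns, model choice, and HDFS path below are hypothetical, and only the "pickle the pipeline and save it to HDFS" flow comes from the description above.

    # Hypothetical training cell: fit a scikit-learn pipeline, pickle it,
    # and push it to HDFS so Storm topologies can load it on startup.
    import pickle
    import subprocess

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("training_features.csv")   # same features the scoring path sees
    X, y = df.drop(columns=["is_bot"]), df["is_bot"]

    # Keeping the feature transforms and the model in one Pipeline means the
    # exact transforms used at training time get pickled alongside the model.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(n_estimators=200)),
    ])
    pipeline.fit(X, y)

    with open("model-v1.pkl", "wb") as f:
        pickle.dump(pipeline, f)

    # Copy the pickled pipeline into HDFS for versioning (path is made up).
    subprocess.run(["hdfs", "dfs", "-put", "-f", "model-v1.pkl", "/models/model-v1.pkl"], check=True)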




I never thought that IPython notebooks could be used to build production models.

Can you talk about this? How big a machine do you need to run notebooks effectively? Why a notebook... why not a Python script, etc.?


> How big a machine do you need to run notebooks effectively ?

- Not too big. I've run tests of performance vs. sampling of the data, and for the models we use we can sample a significant amount (i.e. use a fraction of the full training data).
- If we did need a bigger machine, we're using ensemble models that parallel-train nicely.
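
For illustration only (the file, columns, and numbers are made up), both points in scikit-learn look roughly like this: train on a sample of the data, and let the ensemble fit its trees in parallel.

    # Hypothetical illustration: subsample the training data, and train the
    # ensemble's trees in parallel across all cores with n_jobs=-1.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("training_features.csv")     # placeholder training set
    sample = df.sample(frac=0.1, random_state=0)  # a fraction of the full data
    X, y = sample.drop(columns=["is_bot"]), sample["is_bot"]

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
    clf.fit(X, y)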

> Why a notebook...Why not a Python script

I was originally going to use a Python script, but I found it useful to have the notebook output inline performance charts and metrics. It's easier to keep them in the notebook than to output a bunch of image artifacts that have to be checked into version control. This way I can open the notebook and scroll through to check all my visual metrics.
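
As a concrete example of the kind of cell I mean (continuing the hypothetical pipeline above; the held-out split and the choice of ROC/AUC are just for illustration), the chart renders inline in the notebook instead of becoming an image file:

    # Notebook cell: render a performance chart inline rather than writing
    # image artifacts. Reuses the hypothetical pipeline, X, and y from above.
    %matplotlib inline
    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    pipeline.fit(X_train, y_train)

    fpr, tpr, _ = roc_curve(y_test, pipeline.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, label="AUC = %.3f" % auc(fpr, tpr))
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    plt.legend()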

I'm not opposed to ditching the notebook for training entirely, but for now it works just fine.


So you build a notebook and play around with it... and then run the notebook in an automated way? So you can open the notebook anytime and work with it?

I really love this quick, iterative way of working (at least in the early days). Could you talk about your production training setup? I'm just concerned about performance, etc. Is it OK to train manually each time (by opening the notebook, etc.)?


So far our models remain fairly stable on a ~weeks timeframe. If we needed to train daily or similar, I would invest time in something other than a notebook. But right now it's easier to have the training steps documented in the notebook that does the actual training than to build a separate system and document it.

Not claiming it's a good way of doing this, but just how it is right now.


Adding on to this, how do you guys version control your notebooks?


We still just check them in directly. Not great, but it works. You can at least have GitHub render the notebook at any point.



