When using sklearn, I've seen a lot of folks just pickle the model and use that ...

angusb · on March 8, 2017

Yep, we made our own. I hadn't heard of PMML before - quite cool! What we've made is a bit more readable for our use case though, IMO. Looks like this:

    {
        "intercept": 1.0,
    
        "features": {
            "feature_1": {
                "coefficient": 1.0,
                "range": [0.1, 10.0],
                "mean_feature_score": 1.0,
                "imputation_value": 1.0
            },
            "feature_2": {
                ....
            }
        }
    }

sandGorgon · on March 9, 2017

Is this open source? We were looking for something like this.

angusb · on March 9, 2017

Sadly not. I'd be totally up for open sourcing if there's clear demand. If you can find it, send me an email at angus@{company_I_work_at}.com

Note that it's very tied to our use case right now: it's only compatible with Logistic Regression, it currently assumes fixed hyperparameters (will change this in future though), and it assumes a production pipeline of min-max scaling, imputation, then classification.
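
For anyone curious what scoring against a config of that shape might look like, here is a minimal sketch and not the actual implementation. It assumes that `range` holds the min and max used for scaling, that `imputation_value` is already on the scaled 0..1 axis, and that out-of-range inputs get clipped; none of that is confirmed in the thread. The function name `predict_proba` and all numbers are made up for illustration, and `mean_feature_score` is left unused because its meaning isn't stated.

```python
import json
import math

# Hypothetical config in the shape shown earlier in the thread; all numbers are made up.
CONFIG_JSON = """
{
    "intercept": -1.0,
    "features": {
        "feature_1": {
            "coefficient": 2.0,
            "range": [0.0, 10.0],
            "mean_feature_score": 0.5,
            "imputation_value": 0.5
        }
    }
}
"""


def predict_proba(config, row):
    """Score one row: min-max scale, impute missing values, apply the logistic function."""
    z = config["intercept"]
    for name, spec in config["features"].items():
        lo, hi = spec["range"]
        raw = row.get(name)
        if raw is None:
            # Assumption: imputation_value is already on the scaled 0..1 axis.
            scaled = spec["imputation_value"]
        else:
            # Min-max scaling; clipping to [0, 1] is an assumption, not from the thread.
            scaled = min(max((raw - lo) / (hi - lo), 0.0), 1.0)
        z += spec["coefficient"] * scaled
    return 1.0 / (1.0 + math.exp(-z))


config = json.loads(CONFIG_JSON)
print(predict_proba(config, {"feature_1": 5.0}))  # scaled to 0.5, so z = 0
print(predict_proba(config, {}))                  # feature missing, imputed as 0.5
```

Because every number the model needs lives in the config, the scoring code itself has no fitted state, which is presumably part of why a plain JSON file is enough for this use case.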

cf · on March 8, 2017

PMML is fairly verbose and limited to a particular set of models. It's often easier to pickle the models and then keep tagged versions. I think a human readable format could be created, but since most models are just a pile of numbers it's unclear what is gained.
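
A rough sketch of the pickle-and-tag approach, using only the standard library. The helper names `save_tagged` and `load_tagged`, the `model-<tag>.pkl` naming scheme, and the plain dict standing in for a fitted model are all hypothetical; a fitted sklearn estimator could be passed to `pickle.dump` in the same way.

```python
import pickle
import tempfile
from pathlib import Path


def save_tagged(model, directory, tag):
    """Pickle `model` to <directory>/model-<tag>.pkl and return the path."""
    path = Path(directory) / f"model-{tag}.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def load_tagged(directory, tag):
    """Load the model that was saved under `tag`."""
    with (Path(directory) / f"model-{tag}.pkl").open("rb") as f:
        return pickle.load(f)


# Stand-in for a fitted model; any picklable object works here.
model = {"intercept": 1.0, "coefficients": {"feature_1": 1.0}}

with tempfile.TemporaryDirectory() as tmp:
    save_tagged(model, tmp, "v1")
    restored = load_tagged(tmp, "v1")

print(restored == model)
```

Pickle files are opaque and can execute code when loaded, so only unpickle files from a source you trust.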

angusb · on March 8, 2017

For Logistic Regression we find a human-readable config makes a lot of sense. It's pretty intuitive if there aren't too many features - if the model starts behaving weirdly, we can sometimes track it down to a change in a single feature using this (especially when viewing recent git diffs).
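
To illustrate the kind of check a git diff gives you for free, here is a small sketch that compares two versions of a config in the shape shown earlier and lists every field that changed. The function name `diff_features` and the example values are hypothetical.

```python
def diff_features(old, new):
    """Return (feature, field, old_value, new_value) for every field that differs."""
    changes = []
    if old.get("intercept") != new.get("intercept"):
        changes.append(("<intercept>", "intercept", old.get("intercept"), new.get("intercept")))
    old_f, new_f = old.get("features", {}), new.get("features", {})
    for name in sorted(set(old_f) | set(new_f)):
        # A feature missing from one side is treated as having no fields there.
        before, after = old_f.get(name, {}), new_f.get(name, {})
        for field in sorted(set(before) | set(after)):
            if before.get(field) != after.get(field):
                changes.append((name, field, before.get(field), after.get(field)))
    return changes


# Two made-up config versions that differ in a single coefficient.
old_config = {
    "intercept": 1.0,
    "features": {"feature_1": {"coefficient": 1.0, "range": [0.1, 10.0]}},
}
new_config = {
    "intercept": 1.0,
    "features": {"feature_1": {"coefficient": 1.5, "range": [0.1, 10.0]}},
}

changes = diff_features(old_config, new_config)
for feature, field, before, after in changes:
    print(f"{feature}.{field}: {before} -> {after}")
```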

cf · on March 8, 2017

Sure. I tend to keep my postprocessing of a model under version control. In particular, what features were most helpful for predictions.

angusb · on March 9, 2017

Can't really talk about features on here :(