
Having only hundreds of training examples and fitting hundreds of models seems like a good recipe for overfitting. Have you noticed any issues with overfitting? Or maybe there is something that I am missing.



Oh, that's definitely something we worry about a lot! One thing to note is that each record contains a lot of data, so it's not as though there are just 100 data points to go on. A 24-hour record has ~20 million data points; they're just highly correlated with each other.

Cross-validation helps to some extent. If you're fitting enough models, though, you can still trick yourself into thinking that some models are not overfitting when they just happen to do well on your validation set by chance. But one of the things I look for is a smooth transition from underfitting to overfitting as the model capacity increases.
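
A minimal sketch of that capacity sweep, assuming scikit-learn and a toy i.i.d. dataset; the model family (random forests), the swept hyperparameter (max_depth), and the data are stand-ins for illustration, not the commenter's actual pipeline:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Toy data: 300 i.i.d. rows. Real records would be long, highly
    # correlated signals, so the splits would need to respect that.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))
    y = X[:, 0] + 0.1 * rng.normal(size=300)

    # Sweep model capacity (tree depth here) and compare train vs.
    # validation scores, looking for a smooth transition from
    # underfitting to overfitting.
    for depth in [1, 2, 4, 8, 16, 32]:
        model = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=0)
        val = cross_val_score(model, X, y, cv=5).mean()
        train = model.fit(X, y).score(X, y)
        print(f"depth={depth:2d}  train R^2={train:.3f}  val R^2={val:.3f}")

The thing to watch is how the gap between train and validation scores opens up as capacity grows: a smooth widening is expected, while erratic jumps suggest the validation estimate itself can't be trusted.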

Then the other thing we do is probe the models so that we can understand what they're doing as much as we can. For example, I like to generate synthetic data and look at the transfer function of the model as I vary a particular parameter in my synthetic dataset.
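
A rough illustration of that kind of probe, under made-up assumptions: the noisy-sinusoid generator, the swept amplitude parameter, and the Ridge stand-in for the trained model are all invented for the example; the point is just sweeping one generator knob and reading off the model's response:

    import numpy as np
    from sklearn.linear_model import Ridge

    def make_record(amplitude, n=2000, noise=0.05, rng=None):
        # Hypothetical generator: a noisy sinusoid whose amplitude we control.
        if rng is None:
            rng = np.random.default_rng(0)
        t = np.linspace(0, 1, n)
        return amplitude * np.sin(2 * np.pi * 5 * t) + noise * rng.normal(size=n)

    # Stand-in for the already-trained model under investigation.
    rng = np.random.default_rng(1)
    amps = rng.uniform(0, 2, size=200)
    model = Ridge().fit(np.stack([make_record(a, rng=rng) for a in amps]), amps)

    # Sweep one parameter of the synthetic data and look at the model's
    # output: this traces its transfer function with respect to amplitude.
    # Kinks, saturation, or non-monotonic jumps show where it misbehaves.
    for a in np.linspace(0.0, 2.0, 11):
        response = model.predict(make_record(a)[None, :])[0]
        print(f"amplitude={a:.2f}  model output={response:.3f}")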



