I’ve observed the same phenomenon when fine-tuning LLMs. It struck me as pretty strange, but as far as I could tell other people were seeing the same thing and mostly not commenting on it. The conclusion I’d draw is that you’re not going to benefit much from adding more data when your model behaves like this.
Overconfidence bugs me because if you want to turn predictions into decisions and actions you have to be calibrated. I’ve found that some of these models that look like they’re overfitting on loss are actually still improving on AUC (which matters more to me than accuracy), and I can put a calibrator after the model to get the results I want.
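To make the "calibrator after the model" idea concrete, here’s a minimal sketch using Platt scaling with scikit-learn. The data here is synthetic and deliberately overconfident just to illustrate the effect; variable names are mine, not from any particular setup. Because the calibration map is strictly monotone, AUC (ranking) is untouched while log loss improves:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, log_loss

    # Synthetic stand-in for a held-out split: true labels plus overconfident scores.
    rng = np.random.default_rng(0)
    y_val = rng.integers(0, 2, size=2000)
    logit = 4.0 * (y_val - 0.5) + rng.normal(0.0, 1.5, size=2000)
    raw_p = 1.0 / (1.0 + np.exp(-3.0 * logit))   # model pushes probabilities toward 0/1

    # Platt scaling: fit a 1-D logistic regression on the raw scores.
    # Strictly monotone map, so ranking (AUC) is preserved; calibration improves.
    platt = LogisticRegression()
    platt.fit(raw_p.reshape(-1, 1), y_val)
    cal_p = platt.predict_proba(raw_p.reshape(-1, 1))[:, 1]

    print("AUC  raw %.3f -> calibrated %.3f" % (roc_auc_score(y_val, raw_p), roc_auc_score(y_val, cal_p)))
    print("loss raw %.3f -> calibrated %.3f" % (log_loss(y_val, raw_p), log_loss(y_val, cal_p)))

Isotonic regression works as a drop-in alternative if you have enough calibration data; it’s more flexible but introduces ties in the scores.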
(Still, for my current problem, which has noisy labels, I find that embedding + classical ML performs as well as fine-tuning, takes a fraction of the time, and shows a clearer benefit from training on more examples than FT does. If I were going to do more model engineering on this problem I would probably resort to “stacking”.)
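For reference, the embedding + classical ML route can be as simple as the sketch below, assuming sentence-transformers for the embeddings and scikit-learn for the head; the model name and the toy texts/labels are placeholders, not from my actual dataset:

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    # Placeholder data; in practice these are your labelled (noisy) examples.
    texts  = ["great product, works as advertised", "arrived broken, total waste",
              "does exactly what I needed", "stopped working after two days"]
    labels = [1, 0, 1, 0]

    # Encode once (downloads the model on first use); iterating on the classifier is then cheap.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(texts, normalize_embeddings=True)

    # A regularised linear head on frozen embeddings is cheap to retrain as more data arrives.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    print(clf.predict_proba(X)[:, 1])

If I did go the stacking route, something like scikit-learn’s StackingClassifier would be one way to combine this kind of embedding-based model with other base learners under a simple meta-model, though that’s speculation about a setup I haven’t built yet.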