
That sounds interesting, but I'm a layman and don't know anything about it. Can you provide a link?

I was able to find https://arxiv.org/abs/2408.10914, but I don't have the context to know whether it's the paper you're talking about.



I think GP was probably referring to "Scaling Data-Constrained Language Models" (2305.16264) from NeurIPS 2023, which looked at how to optimally scale LLMs when training data is limited. There is a short section on mixing code (Python) into the training data and the effect this has on performance on e.g. natural language tasks. One of their findings was that training data can be up to 50% code without actually degrading performance, and on some benchmarks (bAbI and WebNLG) it even improves performance, probably because those tasks emphasize what they call "long-range state tracking capabilities".

For reference: In the Llama 3 technical report (2407.21783), they mention that they ended up using 17% code tokens in their training data.
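To make the mixing ratio concrete, here is a minimal sketch (my own illustration, not code from either paper; the corpora and the helper name are made up) of sampling a training mix with a fixed code fraction:

  import random

  # Toy stand-ins for a natural-language corpus and a code corpus.
  text_docs = ["Some prose document ...", "Another prose document ..."]
  code_docs = ["def add(a, b):\n    return a + b", "print('hello world')"]

  def sample_training_docs(n_docs, code_fraction, seed=0):
      # Draw a document mix where roughly `code_fraction` of the documents are code.
      # code_fraction=0.5 corresponds to the ~50% upper bound reported in the paper
      # above; 0.17 roughly mirrors Llama 3's 17% code tokens (their figure is per
      # token, not per document, so this is a simplification).
      rng = random.Random(seed)
      return [
          rng.choice(code_docs if rng.random() < code_fraction else text_docs)
          for _ in range(n_docs)
      ]

  batch = sample_training_docs(n_docs=8, code_fraction=0.17)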


Is the network only trained on the source code, or does it have access to the results of running the code, too?


Also, GPT-3.5 was another extreme if I remember correctly: it was first trained only on code and then trained on other text. I can't seem to find the source, though.


There was an interview with Zuckerberg about how they initially split the training: the Llama chat models were trained purely on normal text and Code Llama on code. They later realized that if they combined the training sets, they got a model that was better at both tasks than either specialized model was.



