Wait for the first large-scale LLM using source-aware training:

https://github.com/mukhal/intrinsic-source-citation

This is not something that can be added by LoRA fine-tuning after the pretraining step.

What we need is a human-curated benchmark for different types of source-aware training, to allow competition, plus an extra column in the most popular leaderboards that counts toward the Average column, to incentivize AI companies to train in a source-aware way. Of course, this would instantly invalidate the black-box veil that LLM companies love to hide behind so as not to credit original authors and content creators; they prefer regulators to believe such a thing cannot be done.

In the meantime, such regulators are not thinking creatively and are clearly just looking for ways to tax AI companies, hiding behind copyright complications as an excuse to tax the flow of money wherever they smell it.

Source-aware training also has the potential to decentralize search!