
It would be an interesting place to add additional units to an already trained model, either to continue training for better performance or for fine-tuning.

For tuning, keep the original model parameters fixed and only let training adjust the parameters into and out of the new "tuning" cache units.
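
A minimal PyTorch sketch of what I mean, assuming the "units" are learnable key/value vectors prepended to each attention layer (prefix-tuning style); the TuningCacheUnits class and its wiring into the base model are my own hypothetical names, not from the article:

  import torch
  import torch.nn as nn

  class TuningCacheUnits(nn.Module):
      # Hypothetical add-on: n_units learnable key/value vectors per
      # layer, prepended to the attention cache of a frozen base model.
      def __init__(self, n_layers, n_units, d_model):
          super().__init__()
          self.keys   = nn.Parameter(0.02 * torch.randn(n_layers, n_units, d_model))
          self.values = nn.Parameter(0.02 * torch.randn(n_layers, n_units, d_model))

  def trainable_params(base_model, units):
      # Freeze every original parameter; only the new units get gradients.
      for p in base_model.parameters():
          p.requires_grad_(False)
      return [units.keys, units.values]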

This would allow different tuning-unit sets to be swapped in, or even used together: foul-language-avoidance units + specific-terminology units + be-concise units, etc.

Mix and match tuned unit sets, like super prompts.
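
Composing saved unit sets could be as simple as concatenating along the unit axis. A sketch, assuming each set was saved as a dict of "keys"/"values" tensors shaped (n_layers, n_units, d_model); the file names are made up:

  def compose_unit_sets(paths):
      sets = [torch.load(p) for p in paths]
      keys = torch.cat([s["keys"] for s in sets], dim=1)
      values = torch.cat([s["values"] for s in sets], dim=1)
      return keys, values

  # e.g. swap in polite + domain-terminology + be-concise together
  keys, values = compose_unit_sets(["no_foul.pt", "med_terms.pt", "concise.pt"])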

--

If the number of new parameters is low enough, higher-order optimization (which costs more memory per parameter) might become feasible, allowing very fast and effective tuning.
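
For example, with only a few thousand tuning parameters, a quasi-Newton method like L-BFGS (which keeps a memory-hungry curvature history, so it only scales to small parameter counts) becomes practical. A sketch with torch.optim.LBFGS; compute_loss is a hypothetical stand-in for the frozen-model forward pass:

  from torch.optim import LBFGS

  opt = LBFGS([units.keys, units.values], history_size=50, max_iter=20)

  for batch in loader:
      def closure():
          # LBFGS re-evaluates the loss several times per step.
          opt.zero_grad()
          loss = compute_loss(base_model, units, batch)
          loss.backward()
          return loss
      opt.step(closure)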

--

And maybe grow the sequence length and the number of units during training: start with a few units for short sequences, then increase the training sequence length, add more units, continue training, and so on.

Perhaps some kind of performance or gradient analysis could govern cache expansion, so an arbitrary schedule is not required.
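
One way to do that without a fixed schedule: expand whenever the loss plateaus. A sketch where train_step and model.grow_cache are hypothetical, and the plateau test is just one possible trigger (a gradient-norm test would slot in the same way):

  def plateaued(losses, window=200, tol=1e-3):
      # True when mean loss improved by less than tol over the last window.
      if len(losses) < 2 * window:
          return False
      recent = sum(losses[-window:]) / window
      earlier = sum(losses[-2 * window:-window]) / window
      return earlier - recent < tol

  seq_len, losses = 128, []
  for batch in loader:
      losses.append(train_step(model, batch[:, :seq_len]))
      if plateaued(losses):
          seq_len *= 2          # longer training sequences
          model.grow_cache(4)   # and a few fresh cache units
          losses.clear()        # restart the plateau test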


