Adding attention cache memory is an extremely interesting solution to this problem.
If anyone is curious, there was another paper [0] that came out a few days ago that made a related observation in Vision Transformers. Transformer models appear to pick tokens to store global information in - they need tokens to "think". You can eke out some performance improvements (and cleaner attention visualizations) by giving the model dedicated tokens for this purpose.
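To make the idea concrete, here is a minimal sketch of register tokens in PyTorch - my own naming and structure, not the paper's code. Learnable tokens are prepended to the patch sequence so the model has dedicated slots for global scratch state, then dropped at the output:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch: prepend learnable "register" tokens to the patch sequence
    so the model has dedicated slots for global information."""
    def __init__(self, encoder: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.encoder = encoder  # any stack of transformer blocks (assumed)
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([regs, patch_tokens], dim=1)  # registers go first
        x = self.encoder(x)
        return x[:, self.registers.shape[1]:]  # drop registers at the output
```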
The cache would also be an interesting place to add units to an already trained model, either to continue training for better performance, or for fine-tuning.
For tuning, keep the original model parameters frozen, and only let the model adjust the parameters to and from the new "tuning" cache units.
This would allow different sets of tuning units to be swapped in, or even used together: foul-language-avoidance units + specific-terminology units + be-concise units, etc.
Mix and match tuned unit sets, like super prompts.
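This is close in spirit to soft prompt/prefix-tuning, where the simplest version trains only the unit embeddings themselves. A hedged sketch of what swap-in unit sets could look like (file names and shapes are hypothetical):

```python
import torch
import torch.nn as nn

def enable_tuning_units(model: nn.Module, units: nn.Parameter) -> None:
    """Freeze the base model; train only the new cache units."""
    for p in model.parameters():
        p.requires_grad = False
    units.requires_grad = True

# Hypothetical mix-and-match: separately trained unit sets are just
# tensors, so they can be concatenated like composable "super prompts".
polite = torch.load("units_no_foul_language.pt")  # shape (1, k1, dim)
jargon = torch.load("units_terminology.pt")       # shape (1, k2, dim)
concise = torch.load("units_be_concise.pt")       # shape (1, k3, dim)
combined = nn.Parameter(torch.cat([polite, jargon, concise], dim=1))
```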
--
If the number of new parameters is low enough, higher-order optimization (which needs more memory per parameter) might become feasible, making for very fast and effective tuning.
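For instance, with only a few thousand trainable unit parameters, even a memory-hungry quasi-Newton optimizer like L-BFGS becomes affordable. A sketch, assuming the `units` parameter from above; `compute_loss`, `model`, and `batch` are placeholders:

```python
import torch

# Assumes `units` is the only trainable parameter set (see sketch above).
opt = torch.optim.LBFGS([units], lr=0.1, history_size=50)

def closure():
    opt.zero_grad()
    loss = compute_loss(model, batch)  # hypothetical task loss
    loss.backward()
    return loss

for _ in range(25):  # a handful of quasi-Newton steps
    opt.step(closure)
```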
--
And maybe grow the sequence length and the number of units during training: a few units for short sequences, then increase the training sequence length, add more units, continue training, and so on.
Perhaps some kind of performance or gradient analysis could govern cache expansion, so an arbitrary schedule is not required.
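One hedged way to implement that trigger: expand the cache only when the gradient signal through the existing units plateaus. The heuristic and thresholds here are arbitrary placeholders, not an established method:

```python
import torch
import torch.nn as nn

def maybe_grow_cache(units: nn.Parameter, grad_norms: list[float],
                     patience: int = 3, tol: float = 1e-3) -> nn.Parameter:
    """Append one fresh unit when recent gradient norms have flattened out."""
    recent = grad_norms[-patience:]
    if len(recent) == patience and max(recent) - min(recent) < tol:
        new = torch.zeros(1, 1, units.shape[-1])
        nn.init.trunc_normal_(new, std=0.02)
        return nn.Parameter(torch.cat([units.data, new], dim=1))
    return units
```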
[0] https://arxiv.org/pdf/2309.16588.pdf