You can pass context in as another input, as in encoder->decoder models with cross-attention. People are also working on data storage and retrieval with transformers (you'd query out to external storage and load in different attention vectors).
So you can view that as a memory bank with paging. You do, however, still need to teach the network the task you want it to solve, so you'd have to train it on the math part.
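Roughly, the idea looks something like this (a minimal PyTorch sketch, not anyone's actual system: the class name and shapes are made up for illustration). Decoder-side states act as queries, and whatever "page" of retrieved vectors you load in acts as the keys/values of a cross-attention layer:

```python
import torch
import torch.nn as nn

class CrossAttentionReader(nn.Module):
    """Decoder states attend over an external 'memory' of vectors via cross-attention."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states, memory):
        # decoder_states: (batch, tgt_len, d_model) -- queries
        # memory:         (batch, mem_len, d_model) -- retrieved context vectors (keys/values)
        out, _ = self.cross_attn(query=decoder_states, key=memory, value=memory)
        return out

reader = CrossAttentionReader()
queries = torch.randn(2, 10, 256)   # decoder-side hidden states
page = torch.randn(2, 128, 256)     # one "page" of retrieved memory, swapped per query
context = reader(queries, page)     # (2, 10, 256)
```

Swapping which tensor you feed in as `memory` between steps is the "paging" part; the retrieval system that decides what to load is a separate problem.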