Didn't some of the DEC Alpha and IBM Power designs use a "cached register file," for lack of a better term, allowing them to support more architectural state than a wide, multi-ported register file could fit? Would a small double- or quad-pumped AVX-512 implementation, with a dual-ported register file plus caching/queuing of recently produced results at the functional units that produced them, save die area and still allow useful throughput?