> How wasteful is it to have a bimodal decoder that assumes 8 4-byte instructions or 8 2-byte instructions, with bail-out circuitry to dispatch fewer instructions and advance to the next instruction-size boundary in the case of a stream of mixed widths?
I suspect mixed-width streams are extremely common (at least going by the compiled RISC-V code I've seen). You could potentially alter compilers to avoid them, but the compressed instruction set is very limited: it naturally mixes compressed and uncompressed instructions, and forcing otherwise will likely produce poorer code overall.
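If you want to check how common mixing actually is, a rough way is to walk a raw dump of a binary's .text section and look at the low two bits of each 16-bit parcel (anything other than 0b11 means a compressed instruction). A minimal sketch, assuming a flat `text.bin` dump (the filename is just a placeholder) and only standard 16/32-bit encodings:

```python
# Rough sketch: how often does compiled RISC-V code switch between 16-bit and
# 32-bit instructions? Assumes `text.bin` is a raw dump of a .text section
# (e.g. via objcopy -O binary -j .text) containing only standard-length
# (16/32-bit) instructions.
import struct

def instruction_sizes(blob):
    pc = 0
    while pc + 2 <= len(blob):
        (parcel,) = struct.unpack_from("<H", blob, pc)
        # RISC-V C extension: low two bits != 0b11 => 16-bit instruction.
        size = 2 if (parcel & 0b11) != 0b11 else 4
        yield size
        pc += size

sizes = list(instruction_sizes(open("text.bin", "rb").read()))
switches = sum(a != b for a, b in zip(sizes, sizes[1:]))
print(f"{len(sizes)} instructions, "
      f"{100 * switches / max(1, len(sizes) - 1):.1f}% width switches")
```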
I'd also note that branch prediction accuracy is a key part of modern processor performance; predictors are maybe 95-99% accurate most of the time. Your scheme looks to have mispredict penalties similar to a branch mispredict (you have to throw work away and start over, though at least you don't have to refetch), so it will likely need a similar level of accuracy to work well. On top of that, performance on any mixed stream will be very poor, because you've explicitly not built a decoder for mixed streams: you either end up dripping instructions through one at a time, or you allow limited decoding on mixed streams (maybe 2 instructions per cycle).
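As a back-of-envelope illustration, here's a toy throughput model. The 8-wide group, the 1-cycle re-decode penalty, and the 2-per-cycle slow path for mixed groups are all assumed numbers for the sake of the arithmetic, not measurements of any real design:

```python
# Toy model: average decode throughput for an 8-instruction fetch group when
# some fraction of groups are mixed-width and fall back to a slow path.
# Group size, penalty, and the 2-per-cycle slow path are illustrative only.
def decoded_per_cycle(mixed_fraction, accuracy=0.95, group=8,
                      redecode_penalty=1, mixed_rate=2):
    uniform_cycles = accuracy * 1 + (1 - accuracy) * (1 + redecode_penalty)
    mixed_cycles = group / mixed_rate          # drip through 2 per cycle
    avg = (1 - mixed_fraction) * uniform_cycles + mixed_fraction * mixed_cycles
    return group / avg

for frac in (0.0, 0.25, 0.5):
    print(f"{frac:.0%} mixed groups -> "
          f"~{decoded_per_cycle(frac):.1f} instructions/cycle")
```

Even with a decent predictor, the mixed-group fraction dominates the result, which is the real problem here.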
Edit: Perhaps the mispredict penalty isn't so bad compared to a mispredicted branch, since detecting the mispredict is trivial. In fact, having just realised this, you may be better off without the predictor at all: just look at the bytes and decide which mode to use.
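"Look at the bytes" really is cheap in RISC-V, because the low two bits of each 16-bit parcel tell you whether it starts a compressed or a full-width instruction. A rough software sketch of that decision for a fetch group (just the idea, not the hardware):

```python
# Sketch: classify a fetch group as all-compressed, all-uncompressed, or mixed
# by walking the RISC-V length bits (low two bits != 0b11 => 16-bit). A mixed
# group would trigger whatever bail-out / partial-dispatch path the design has.
import struct

def classify_group(group: bytes):
    sizes, pc = [], 0
    while pc + 2 <= len(group):
        (parcel,) = struct.unpack_from("<H", group, pc)
        size = 2 if (parcel & 0b11) != 0b11 else 4
        if pc + size > len(group):
            break  # instruction straddles the group boundary
        sizes.append(size)
        pc += size
    if sizes and all(s == 2 for s in sizes):
        return "all-compressed", len(sizes)    # take the 2-byte path
    if sizes and all(s == 4 for s in sizes):
        return "all-uncompressed", len(sizes)  # take the 4-byte path
    return "mixed", len(sizes)                 # bail out / dispatch fewer
```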
Though depending on the decoders, you may end up needing two entirely separate sets of logic for the compressed and uncompressed instruction decoders anyway, in which case you just decode your stream both ways and then decide which set of decodings to keep. You still have the issue of any mixed-width stream killing your performance, though, which I suspect would be a major problem with this design (so much so that it isn't useful).