Funnily enough, the Amiga and ST blitters did a similar thing a few years back. Notably, they were not much faster than the CPU (the 68030 was faster), but their main advantage was doing the bit shifting and the byte transfers in parallel with the CPU, which gave them an edge.
Huh, so DMA controllers do a lot more than I thought they did. A question though: don't we need a "jump if zero" or similar in order to be Turing complete? I see a loop instruction but nothing that could be considered an "if"... https://developer.arm.com/documentation/ddi0424/d/instructio...
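One common answer to the "no if" question is that a copy engine can fake a conditional by patching its own descriptors: copy the data byte into the source-address field of a later transfer, so that transfer reads from a lookup table indexed by the value. Here is a minimal sketch of that idea in Python. Everything in it is made up for illustration (the 5-byte descriptor layout, the addresses, and the names `run_chain`, `put_desc` and `select`); it is not the PL080 register format and not necessarily how any real project does it.

```python
# Toy "DMA engine": the only operation is copying n bytes from src to dst.
# Descriptors live in the same memory they copy, so a copy can patch a
# later descriptor. Layout (hypothetical): src_lo, src_hi, dst_lo, dst_hi, len.

X, OUT = 0x000, 0x001   # input byte and result byte
TABLE = 0x100           # 256-entry table: TABLE[x] = 0 if x == 0 else 1
VALUES = 0x200          # VALUES[0] = a ("x is zero"), VALUES[1] = b (otherwise)
DESC = 0x300            # three descriptors, 5 bytes each
DESC_SIZE = 5


def put_desc(mem, at, src, dst, n):
    """Store one descriptor at address `at` (little-endian 16-bit addresses)."""
    mem[at:at + DESC_SIZE] = bytes([src & 0xFF, src >> 8, dst & 0xFF, dst >> 8, n])


def run_chain(mem, base, count):
    """Run `count` descriptors in order. The sequence is fixed: there is no
    data-dependent branch here, only copies. Each descriptor is read at the
    moment it runs, so earlier copies can have modified it."""
    for i in range(count):
        d = base + DESC_SIZE * i
        src = mem[d] | (mem[d + 1] << 8)
        dst = mem[d + 2] | (mem[d + 3] << 8)
        n = mem[d + 4]
        mem[dst:dst + n] = mem[src:src + n]


def select(x, a, b):
    """Return a if x == 0 else b, using only the copy chain above."""
    mem = bytearray(0x400)
    mem[X] = x
    mem[VALUES], mem[VALUES + 1] = a, b
    for i in range(1, 256):
        mem[TABLE + i] = 1          # TABLE[0] stays 0
    d0, d1, d2 = DESC, DESC + DESC_SIZE, DESC + 2 * DESC_SIZE
    put_desc(mem, d0, X, d1, 1)       # patch d1.src_lo with x
    put_desc(mem, d1, TABLE, d2, 1)   # patch d2.src_lo with TABLE[x]
    put_desc(mem, d2, VALUES, OUT, 1) # copy the selected value to OUT
    run_chain(mem, DESC, 3)
    return mem[OUT]
```

Calling `select(0, 11, 22)` picks the first value and any nonzero `x` picks the second. Combine a data-dependent select like this with the loop instruction and the ability to patch the next descriptor's address, and you have the ingredients for a conditional jump, at least in this toy model.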
but it turns out to be really useful to allow remote devices to run limited
code without interrupting the host. Distributed reduction is the easiest application to think of.
Ah, you want a Propeller [1]. Basically 64 really smart digital + A/D I/O pins driven by eight 32-bit I/O processors. Enough oomph to build an entire system out of. Quirky, loads of fun if you're into cycle-counting. Not cheap, though--$18 [edit: $13 Digikey] qty 1. :(
https://github.com/jowinter/dmacu