MIPS is a fun architecture (other than the delay slots that plagued early RISC ISAs), and implementing a subset of it on an FPGA is still a pretty common undergraduate university course project. I was kind of amazed just how simple it is to get a basic CPU working, though even the 1988 version was quite a lot more sophisticated than the class project version (multiple cache levels, an MMU, probably a much better branch predictor, etc.).
It makes the hardware implementation more complicated. The delay slot was perfect for the original 5-stage pipeline design. Once you try to push this to superscalar or out-of-order execution (issuing more than one instruction per cycle), the delay slot just doesn't make any sense.
That's my understanding as well. Software-wise, I, for one, have not had issues with reading or writing code with branch delay slots -- automatic nops, at worst. I guess it all depends on how early in one's development one was introduced to the concept of delay slots.
There was one nifty thing that fell out from having delay slots - you could write a threading library without having to burn a register in the ABI. When you changed context to a different thread, you'd load in all the registers for the new thread except for one, which held the jump address to the new thread's IP. Then, in that jump's delay slot, you'd load in the thread's value for that register and, presto, zero overhead threading!
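A rough sketch of what the tail of that context switch might look like in MIPS assembly (register choices and frame offsets here are made up for illustration, not from any real threading library):

```
# Assumes $a0 points at the saved context of the thread being resumed.
# PC_OFF / T0_OFF etc. are hypothetical offsets into that save area.

        lw    $t0, PC_OFF($a0)   # fetch the new thread's saved PC into a scratch reg
        lw    $s0, S0_OFF($a0)   # restore every other register...
        lw    $s1, S1_OFF($a0)
        # ... (remaining registers restored the same way)
        jr    $t0                # jump to the new thread's saved PC
        lw    $t0, T0_OFF($a0)   # delay slot: executes AFTER jr is issued but
                                 # BEFORE the target runs, restoring $t0 itself
```

Because the delay-slot instruction executes before control actually transfers, `$t0` gets its real value back even though it was just used to hold the jump target, so no register needs to be permanently reserved by the ABI for context switching.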
in addition to the complexities they add to every layer of the stack that ajross and alain94040 brought up, they're not all that useful in practice. i seem to recall that they'd rarely be even 50% utilized, with the majority of delay slots holding nops