It makes the hardware implementation more complicated. The delay slot was a perfect fit for the original 5-stage pipeline design. Once you try to push it to superscalar or out-of-order execution (issuing more than one instruction per cycle, and not necessarily in program order), the delay slot just doesn't make any sense.
That's my understanding as well. Software-wise, I, for one, have not had issues with reading or writing code with branch delay slots -- automatic nops, at worst. I guess it all depends on how early in one's development one was introduced to the concept of delay slots.
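For anyone who hasn't run into them, here's roughly what that looks like in practice (MIPS-style syntax; the registers and labels are made up purely for illustration):

    # The instruction immediately after a branch (the delay slot) executes
    # whether or not the branch is taken; when nothing useful fits there,
    # the assembler's default "reorder" mode drops in a nop for you --
    # the "automatic nops" mentioned above.
            beq   $a0, $zero, is_zero
            addu  $t0, $t1, $t2       # delay slot: runs on both paths
            jr    $ra                 # not taken: return with $t0 = $t1 + $t2
            nop                       # delay slot with nothing useful in it
    is_zero:
            jr    $ra                 # taken: return with $t0 = 0
            move  $t0, $zero          # delay slot doing real work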
There was one nifty thing that fell out from having delay slots - you could write a threading library without having to burn a register on the ABI. When you changed context to a different thread, you'd load in all the registers for the new thread except for one, which held the address to jump to (the new thread's saved IP). Then, in that jump's delay slot, you'd load in the thread's saved value for that register and, presto, zero-overhead threading! A sketch of the idea is below.
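From memory, the tail of the context switch looked roughly like this (register choices and frame offsets are invented for the sketch, not from any particular library):

    # $sp has already been switched to the new thread's stack, and the
    # thread's saved registers live in a frame at known offsets from it.
            lw    $t9, 64($sp)        # fetch the thread's saved resume address
            lw    $ra, 60($sp)        # restore the rest of the register file
            lw    $s0, 0($sp)
            lw    $s1, 4($sp)
            jr    $t9                 # jump to the new thread's resume point
            lw    $t9, 68($sp)        # delay slot: restore $t9's own saved value

The jr has already read $t9 by the time the delay-slot load runs, so overwriting it there doesn't change the jump target. The one wrinkle, if I remember the original MIPS I rules right, is the load delay slot: the first instruction at the resume point shouldn't consume $t9 immediately.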
in addition to the complexities they add to every layer of the stack that ajross and alain94040 brought up, they're not all that useful in practice. i seem to recall that they'd rarely be over 50% utilized, and the majority of delay slots just held nops.