Practicality. Generally your stdlib is part of your cross-compiler install, so now you'd need two versions of your stdlib--the small version and the speedy version. Now you aren't using -lc (which is usually built in to the linker) for your boot code but you are for your regular firmware, so your makefile is more complex. And you are running a modified standard library, so it's a pain to upgrade. Blah blah blah, the list goes on.
Of course you could do all that, but it's so much easier to just override the 5 functions you actually use in your code. Because of the way linkers and standard libraries work, this just falls out: the linker only pulls an object out of the library to satisfy a symbol that's still undefined, so if you define your own version of a library function, the library's copy never gets brought in--your local version wins. That keeps the standard library clean of patches and keeps the small, tight code near the project that actually needs it.
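A minimal sketch of what such an override might look like (the symbol name is the real one, but the byte-at-a-time loop is just the obvious size-over-speed choice, not anything canonical):

```c
#include <stddef.h>

/* Size-optimized replacement for the libc memcpy. Because this object
 * file already defines the symbol, the linker never needs to pull the
 * (larger, speed-tuned) memcpy member out of libc to resolve it. */
void *memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;

    return dest;
}
```

Link it into the boot image like any other project file; the rest of the firmware keeps linking against the stock library untouched.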
Compilers memorized the code for memset (and memcpy and ...) long ago and substitute optimized algorithms. As part of faking their benchmark stats for marketing purposes. So no worries, this code should result in kick-butt optimized assembler. Except for the obvious bug.
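For what it's worth, that pattern recognition is real and it cuts both ways: GCC, for instance, may recognize a plain copy loop and turn it back into a call to memcpy--which is exactly what you don't want inside your own memcpy. That's why replacement routines are usually built with something like -fno-builtin (or -fno-tree-loop-distribute-patterns for the loop-rewriting pass specifically). A hedged illustration--the function name is made up, and the exact behavior depends on compiler, version, and flags:

```c
/* copy.c -- a naive copy loop that an optimizing compiler may recognize
 * and rewrite as a call to (or inline expansion of) the library memcpy.
 * Building with -fno-builtin or -fno-tree-loop-distribute-patterns
 * discourages GCC from making that kind of substitution. */
#include <stddef.h>

void copy_bytes(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* To see what actually came out:
 *   gcc -O2 -S copy.c               -- the loop may become "call memcpy"
 *   gcc -O2 -fno-builtin -S copy.c  -- the loop is more likely to stay
 *                                      as written                       */
```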
Optimized for what? Speed? While I can picture that happening, it has not been my experience. Verifying the memory-squeezing optimizations I'm talking about means dumping the assembly output, and I've never seen my stupid memcpy() routine suddenly turn into optimized-for-speed code. I'd be pretty upset if that were to happen...
Why can't you have a '#ifdef PREFER_SIZE_OVER_SPEED' or something similar?
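You can, and if memory serves that's roughly what newlib does: several of its string routines are guarded by a PREFER_SIZE_OVER_SPEED macro. A rough sketch of the idea applied to the overridden memcpy above--the word-copy fast path is illustrative (and glosses over strict-aliasing pedantry), not newlib's actual code:

```c
#include <stddef.h>
#include <stdint.h>

void *memcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

#ifdef PREFER_SIZE_OVER_SPEED
    /* Smallest possible code: plain byte loop. */
    while (n--)
        *d++ = *s++;
#else
    /* More code, but much faster on bulk copies: move a word at a time
     * while both pointers are word-aligned, then mop up the tail. */
    if ((((uintptr_t)d | (uintptr_t)s) % sizeof(uintptr_t)) == 0) {
        while (n >= sizeof(uintptr_t)) {
            *(uintptr_t *)d = *(const uintptr_t *)s;
            d += sizeof(uintptr_t);
            s += sizeof(uintptr_t);
            n -= sizeof(uintptr_t);
        }
    }
    while (n--)
        *d++ = *s++;
#endif

    return dest;
}
```

Then the boot code gets built with -DPREFER_SIZE_OVER_SPEED and the regular firmware without it--same source file, two behaviors.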