It's pretty common for runtime libraries to optimize low-level routines like memcpy, math, etc. with multiple different paths chosen on the basis of CPU capability bits. It's not the whole code that's twice the size; it's small functions which are implemented 2 or 3 or 5 times depending on what features are available.
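Roughly what that looks like in source, using GCC/clang's target_clones attribute (the function and target list here are just illustrative): the compiler emits one clone per listed target plus a resolver that picks the right one when the program loads.

    #include <stddef.h>

    /* compiler generates an AVX2 clone, an SSE4.2 clone, a baseline
       clone, and a resolver that selects among them at load time */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    void scale(float *dst, const float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }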
Perhaps, but size doesn't really affect runtime performance that much, especially when most of the codepaths are never executed -- the unused paths never get pulled into the processor cache, so there's no cache churn.
I don't really know anything about this compiler, so I'm certainly speculating. My assumption is that one writes some function foo() and the compiler prepends a dispatcher which forks (code paths, not processes) to one of N optimized but functionally equivalent codepaths based on the actual processor the code runs on.
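Something like this hand-rolled sketch, if my guess is right -- foo_avx2/foo_generic are made-up stand-ins for the N variants, and __builtin_cpu_supports is the GCC/clang builtin (x86 only) that reads the capability bits:

    #include <stdio.h>

    static void foo_avx2(void)    { puts("AVX2 path"); }
    static void foo_generic(void) { puts("generic path"); }

    /* the dispatcher "prepended" in front of foo() */
    void foo(void)
    {
        if (__builtin_cpu_supports("avx2"))
            foo_avx2();
        else
            foo_generic();
    }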
I suspect it patches a jumptable at initialisation time based on CPU type, and all the code used by one type of CPU is bunched close together. The unused code probably isn't even paged into physical RAM.
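A rough sketch of that patch-once-at-init idea, with a function pointer set from a constructor (names made up; glibc does something similar with GNU ifuncs):

    #include <stddef.h>

    static void copy_generic(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--) *d++ = *s++;
    }

    static void copy_avx2(void *dst, const void *src, size_t n)
    {
        /* a real variant would use 256-bit loads/stores */
        copy_generic(dst, src, n);
    }

    /* a one-entry "jumptable", patched once at startup */
    static void (*copy_impl)(void *, const void *, size_t) = copy_generic;

    __attribute__((constructor))
    static void pick_copy(void)
    {
        if (__builtin_cpu_supports("avx2"))
            copy_impl = copy_avx2;
    }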