I think the SSE2 implementation could be tightened up by using the same register for both the dividend and the quotient, shifting each quotient bit into the dividend register as a dividend bit is shifted out. This was common practice in software division routines for CPUs.
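To illustrate the register-sharing trick, here is a minimal scalar sketch in C of a shift-and-subtract (restoring) unsigned divide. It is not the SSE2 code under discussion. The function name `udiv32_shared` and the variable `nq` are made up for this example, and the 64-bit remainder is just a convenience to avoid handling the carry out of a 32-bit shift. A nonzero divisor is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit unsigned restoring division.
 * The single variable `nq` starts out holding the dividend and ends up
 * holding the quotient. On every step the top dividend bit is shifted
 * out into the remainder, and the freed bottom bit receives the next
 * quotient bit. Assumes divisor != 0 (not checked here). */
static uint32_t udiv32_shared(uint32_t dividend, uint32_t divisor,
                              uint32_t *remainder)
{
    uint32_t nq = dividend; /* dividend bits on top, quotient bits below */
    uint64_t rem = 0;       /* 64-bit so (rem << 1) cannot overflow */

    for (int i = 0; i < 32; i++) {
        /* Shift the combined (rem : nq) pair left by one bit. */
        rem = (rem << 1) | (nq >> 31);
        nq <<= 1;

        /* Trial subtract: if the divisor fits, record a 1 quotient bit
         * in the bit position that the shift just vacated. */
        if (rem >= divisor) {
            rem -= divisor;
            nq |= 1u;
        }
    }

    *remainder = (uint32_t)rem;
    return nq; /* all dividend bits are gone; only quotient bits remain */
}
```

After 32 iterations every dividend bit has left `nq`, so no separate quotient register is needed. Presumably a vectorized version could apply the same idea per lane, saving one register, but I have not tried that against the SSE2 code.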