With today's CPUs, with branch prediction and caches in particular, loop unrolling is not always a win.
In fact I bet the performance would be worse, especially because you have to worry about setting regions whose size is not a multiple of the unroll factor. (I guess you could use Duff's device.)
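For what it's worth, Duff's device handles exactly that remainder problem: the switch jumps into the middle of the unrolled loop so the leftover `count % 8` bytes are stored on the first, partial pass, with no separate cleanup loop. A minimal sketch (the function name `fill_duff` is mine, not from any real libc):

```c
#include <stddef.h>

/* Duff's device applied to a byte fill: jump into the unrolled loop
   to cover count % 8 bytes first, then run full 8-byte passes. */
static void fill_duff(unsigned char *p, unsigned char value, size_t count)
{
    if (count == 0)
        return;
    size_t passes = (count + 7) / 8;    /* total loop passes, incl. partial */
    switch (count % 8) {
    case 0: do { *p++ = value;
    case 7:      *p++ = value;
    case 6:      *p++ = value;
    case 5:      *p++ = value;
    case 4:      *p++ = value;
    case 3:      *p++ = value;
    case 2:      *p++ = value;
    case 1:      *p++ = value;
            } while (--passes > 0);
    }
}
```

Whether this beats a plain loop on a modern CPU is exactly what's in question here; compilers also tend to dislike the interleaved switch/loop structure.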
I actually tested it while ignoring that, and the unrolled version took 34.551 seconds vs. 34.239 for the regular version (but those numbers are meaningless, since the variation between runs is greater than the difference between versions).
I would imagine writing a native-register-sized chunk per iteration would be a win, with a little cleanup for the remaining bytes. On a 32-bit architecture you'd typically be doing a quarter of the iterations.
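Something like the following sketch, assuming the destination is already word-aligned (the name `fill_words` is hypothetical, and a strictly conforming implementation would need to be more careful about aliasing than this cast is):

```c
#include <stddef.h>
#include <stdint.h>

/* Fill with 32-bit stores, then a byte loop for the 0..3 leftover bytes.
   Assumes dst is 4-byte aligned; see the alignment discussion for the
   misaligned-head case. */
static void fill_words(void *dst, unsigned char value, size_t count)
{
    uint32_t pattern = value * 0x01010101u;  /* replicate the byte 4x */
    uint32_t *wp = dst;
    while (count >= 4) {                     /* one store covers 4 bytes */
        *wp++ = pattern;
        count -= 4;
    }
    unsigned char *bp = (unsigned char *)wp; /* cleanup: remaining bytes */
    while (count--)
        *bp++ = value;
}
```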
Far more involved than that: there are alignment issues. It's several machine cycles faster to do an aligned store of a large scalar, e.g. a 32-bit word.
So a responsible implementation would do a head check for alignment, store 1, 2 or 3 bytes to reach a word boundary, then do aligned stores of 32-bit values (or 64-bit, or whatever native size is appropriate), then a tail check for the leftover bytes. Or, better yet, switch (p & 3) and handle each case with unrolled code.
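The head/body/tail scheme could be sketched like this (the name `fill_aligned` is mine, and real libc implementations add wider stores, unrolling, and the switch-on-alignment trick on top of this skeleton):

```c
#include <stddef.h>
#include <stdint.h>

/* Head: byte stores up to a 4-byte boundary.
   Body: aligned 32-bit stores for the bulk.
   Tail: byte stores for the 0..3 leftover bytes. */
static void fill_aligned(void *dst, unsigned char value, size_t count)
{
    unsigned char *bp = dst;
    while (count > 0 && ((uintptr_t)bp & 3) != 0) {  /* head: <= 3 bytes */
        *bp++ = value;
        count--;
    }
    uint32_t pattern = value * 0x01010101u;
    uint32_t *wp = (uint32_t *)bp;
    while (count >= 4) {                             /* body: aligned words */
        *wp++ = pattern;
        count -= 4;
    }
    bp = (unsigned char *)wp;
    while (count--)                                  /* tail: <= 3 bytes */
        *bp++ = value;
}
```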
Many modern platforms contain some logic (a DMA engine, say) that can be (ab)used to do memset() directly in hardware without processor involvement (i.e. in parallel with other code). But this tends to require so much setup and I/O overhead that it simply isn't worthwhile.
Integrating memset/bzero logic into the memory array itself would increase its price drastically (though in some special cases it is done, generally when the memory array already contains some other expensive non-memory logic).