I find this unlikely. Do have some real world evidence for this? Microbenchmarks are not at all useful for this stuff in my experience. For speed: In the uncontended case, which is what we mostly care about because if you're contended it's game over for speed, they're both a single atomic CAS, so that's no difference.
In terms of size, the pointer is bigger than you'd want - it's no pthread_mutex or the analogous Windows data structure where it's a multi cache line disaster of a data structure - but it's clearly worse than a futex or WaitOnAddress solution.
A few years ago I compared them, it was not a microbenchmark, but a real application. There were a few million(almost entirely uncontended) exclusive locks being taken on startup, SRWLock was consistently faster, though the difference was not large.
In terms of size, the pointer is bigger than you'd want - it's no pthread_mutex or the analogous Windows data structure where it's a multi cache line disaster of a data structure - but it's clearly worse than a futex or WaitOnAddress solution.