I would guess that the better cache locality of a much smaller lookup table could let it beat the larger one in real-world code, even with the extra shifts and lookups, because other code is competing for the same cache lines.
If you go down that rabbit hole, however, you might want to check whether skipping the shifts and lookups entirely for 7-bit ASCII is faster still, if (as is often the case) you expect those characters to be the vast majority.
You could test whether every byte in a SIMD register (256 or 512 bits, so 32 or 64 bytes of UTF-8) is in the 7-bit range and, if so, upper- or lowercase the whole register with a few simple logical operations in a few cycles.
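To make the idea concrete, here is a rough sketch of that ASCII fast path in portable C. It is only an illustration under some assumptions: it uses a 64-bit word (8 bytes) as a stand-in for a real SIMD register so it compiles without intrinsics, it only does upper casing, and the names upper8_ascii and upper_ascii_prefix are made up for this example. A real version would presumably do the same test and the same logical operations on 32 or 64 bytes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Uppercase 8 packed bytes at once. Precondition: every byte is < 0x80,
   so the per-byte additions below can never carry into a neighbour. */
static uint64_t upper8_ascii(uint64_t w)
{
    const uint64_t ones = 0x0101010101010101ULL;
    /* High bit of each byte is set iff that byte is >= 'a'. */
    uint64_t ge_a = w + ones * (0x80 - 'a');
    /* High bit of each byte is set iff that byte is > 'z'. */
    uint64_t gt_z = w + ones * (0x80 - 'z' - 1);
    /* 0x80 in exactly those bytes that hold 'a'..'z'. */
    uint64_t is_lower = ge_a & ~gt_z & (ones * 0x80);
    /* Turn 0x80 into 0x20 and flip that bit: 'a'..'z' becomes 'A'..'Z'. */
    return w ^ (is_lower >> 2);
}

/* Uppercase the leading 8-byte blocks of s in place for as long as each
   block is pure 7-bit ASCII. Returns how many bytes were handled; the
   caller would hand the remainder to its table-based path. */
size_t upper_ascii_prefix(unsigned char *s, size_t len)
{
    const uint64_t high_bits = 0x8080808080808080ULL;
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (len - i >= 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);      /* unaligned-safe load */
        if (w & high_bits)         /* some byte is outside the 7-bit range */
            break;
        w = upper8_ascii(w);
        memcpy(s + i, &w, 8);
        i += 8;
    }
    return i;
}
```

The all-ASCII test is a single AND against a mask of high bits, which is the part that should map directly onto a wider register. Whether this actually beats the small-table approach depends on how ASCII-heavy the input really is, so it needs measuring on realistic data.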