Representing the 1080p video in any one frame takes a certain number of pixels. By the next frame, your head has moved enough that you're seeing a different subset of that total. At sufficiently high framerates (and motion capture rates), this actually does a pretty good job of approximating the full resolution of the imagery, especially when it happens 2 or 3 times per source video frame, as can be the case with NTSC/PAL content.
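To make the idea concrete, here's a rough toy sketch rather than how any real headset or player works. It treats one row of the video as 1920 source pixels and a display that can only sample 640 of them per refresh; the 640-pixel figure, the whole-pixel head offsets, and the 60 Hz / 75 Hz refresh rates are all assumptions picked for illustration, and the function names are hypothetical. Each refresh samples a different subset, and the union over a few refreshes covers the whole row.

```python
def sampled_pixels(source_width, display_width, offset):
    """Source-pixel indices one display row hits at a given head offset.

    Toy model: the display samples every (source_width / display_width)-th
    source pixel, shifted by `offset` source pixels (assumed head motion).
    """
    step = source_width / display_width
    return {int(i * step + offset) % source_width for i in range(display_width)}


def coverage_over_frames(source_width, display_width, offsets):
    """Fraction of source pixels seen at least once across several refreshes."""
    seen = set()
    for offset in offsets:
        seen |= sampled_pixels(source_width, display_width, offset)
    return len(seen) / source_width


def refreshes_per_source_frame(display_hz, video_fps):
    """How many display refreshes fall within one source video frame."""
    return display_hz / video_fps


# One refresh only sees a third of the row in this toy setup.
single = coverage_over_frames(1920, 640, [0])

# Three refreshes with slightly different head offsets see all of it.
triple = coverage_over_frames(1920, 640, [0, 1, 2])

# Assumed refresh rates: 60 Hz against ~30 fps (NTSC-style) content,
# and 75 Hz against 25 fps (PAL-style) content.
ntsc_like = refreshes_per_source_frame(60, 30)
pal_like = refreshes_per_source_frame(75, 25)
```

Real head motion isn't a tidy one-pixel shift per refresh, so actual coverage would be patchier than this, but the more refreshes you get per source frame, the more chances there are to fill in the gaps.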