AFAIK the killer is video encoding latency, which comes from a tradeoff between encoding efficiency (bandwidth use) and the number of frames that have to be buffered inside the encoder: the more frames the encoder can look at before emitting output, the better it compresses, but every buffered frame adds delay. This happens both in the videoconferencing software and in the webcam itself (AFAIK webcams don't send an uncompressed stream to the computer).
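
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch. It assumes the only delay is the time spent waiting for buffered frames to arrive (buffered frames divided by frame rate) and ignores encode time, network, and decode. The function name and the frame counts are made up for illustration, not measured from any real encoder or webcam.

```python
def buffering_latency_ms(buffered_frames: int, fps: float) -> float:
    """Delay added by holding `buffered_frames` frames before output.

    Assumption: latency is purely the capture time of the buffered
    frames; encode, network, and decode time are not modeled.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return buffered_frames / fps * 1000.0


if __name__ == "__main__":
    # Illustrative buffer depths only, not real encoder settings.
    for frames in (0, 1, 3, 8):
        delay = buffering_latency_ms(frames, fps=30)
        print(f"{frames} buffered frames at 30 fps -> {delay:.1f} ms")
```

If both the webcam and the conferencing software buffer frames, those delays stack on top of each other, on each side of the call.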