"human intentions are not a generalisation of visual information" is a bit confusing category-wise. Question would be to what extent you can predict someone's next action, like running out to retrieve a ball, given just what a human driver can sense.
Clearly that's possible to some extent, and in principle a system receiving the same inputs should be able to reach human-level performance on the task, but it seems very challenging under those constraints.
Also, for clarity, note that these limitations don't require that the model be trained only on driver-view data. It may be, for instance, that the needed reasoning capability is better learned through text pretraining; a rough sketch of what that kind of fusion might look like is below.
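To make the text-pretraining point concrete, here is a minimal, purely illustrative PyTorch sketch. Every module name, dimension, and the candidate-intention framing is my own assumption, not something from the discussion: it just shows driver-view features being fused with features from a text-pretrained encoder before scoring a small set of candidate pedestrian intentions.

```python
import torch
import torch.nn as nn


class IntentionPredictor(nn.Module):
    """Hypothetical fusion model: driver-view frame features plus features
    from a text-pretrained encoder, jointly scoring a small set of candidate
    intentions (e.g. "steps into road", "waits at kerb", "retrieves ball")."""

    def __init__(self, vision_dim=512, text_dim=768, hidden_dim=256, num_intentions=4):
        super().__init__()
        # Stand-in for a real visual backbone (e.g. a frozen CNN/ViT over camera frames).
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        # Stand-in for embeddings from a text-pretrained model, meant to carry
        # the reasoning prior that pure driver-view data may not supply.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_intentions),
        )

    def forward(self, frame_features, scene_text_features):
        v = self.vision_proj(frame_features)        # (batch, hidden_dim)
        t = self.text_proj(scene_text_features)     # (batch, hidden_dim)
        fused = torch.cat([v, t], dim=-1)           # (batch, 2 * hidden_dim)
        return self.classifier(fused)               # logits over candidate intentions


# Toy usage with random tensors standing in for real encoder outputs.
model = IntentionPredictor()
frames = torch.randn(2, 512)
text = torch.randn(2, 768)
print(model(frames, text).shape)  # torch.Size([2, 4])
```

The sketch only captures the shape of the argument: the visual stream supplies what the driver can sense, while the text-pretrained stream is where the prior over plausible intentions would have to come from.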
They do have some in-distribution generalisation capabilities, but human intentions are not a generalisation of visual information.