"human intentions are not a generalisation of visual information" is a bit confusing category-wise. Question would be to what extent you can predict someone's next action, like running out to retrieve a ball, given just what a human driver can sense.
Clearly that's possible to some extent, and in principle a system receiving the same inputs should be able to reach human-level performance on the task, but it seems very challenging under those constraints.
Also, for clarity, note that these limitations don't require that the model be trained only on driver-view data. It may be, for instance, that the needed reasoning capability is better learned through text pretraining; a rough sketch of what that kind of fusion might look like is below.
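To make the text-pretraining point concrete, here is a minimal, purely illustrative PyTorch sketch. Every module name, dimension, and the candidate-intention framing is my own assumption, not something from the discussion: it just shows driver-view features being fused with features from a text-pretrained encoder before scoring a small set of candidate pedestrian intentions.

```python
import torch
import torch.nn as nn


class IntentionPredictor(nn.Module):
    """Hypothetical fusion model: driver-view frame features plus features
    from a text-pretrained encoder, jointly scoring a small set of candidate
    intentions (e.g. "steps into road", "waits at kerb", "retrieves ball")."""

    def __init__(self, vision_dim=512, text_dim=768, hidden_dim=256, num_intentions=4):
        super().__init__()
        # Stand-in for a real visual backbone (e.g. a frozen CNN/ViT over camera frames).
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        # Stand-in for embeddings from a text-pretrained model, meant to carry
        # the reasoning prior that pure driver-view data may not supply.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_intentions),
        )

    def forward(self, frame_features, scene_text_features):
        v = self.vision_proj(frame_features)        # (batch, hidden_dim)
        t = self.text_proj(scene_text_features)     # (batch, hidden_dim)
        fused = torch.cat([v, t], dim=-1)           # (batch, 2 * hidden_dim)
        return self.classifier(fused)               # logits over candidate intentions


# Toy usage with random tensors standing in for real encoder outputs.
model = IntentionPredictor()
frames = torch.randn(2, 512)
text = torch.randn(2, 768)
print(model(frames, text).shape)  # torch.Size([2, 4])
```

The sketch only captures the shape of the argument: the visual stream supplies what the driver can sense, while the text-pretrained stream is where the prior over plausible intentions would have to come from.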
They do have some in-distribution generalisation capabilities, but human intentions are not a generalisation of visual information.