This doesn't sound like much of a barrier to me. If you're a human training the LiDAR system, couldn't you just consult the image or video to help label whatever the LiDAR is seeing?
A good supervised learning process requires teaching humans to label consistently.
Imagine trying to write down precise instructions to train hundreds or thousands of humans to label many different types of objects using a tool like the above. Now hire, train, and manage those humans.
Compare that to having humans draw rectangles around cars in 2D color images.
Also note that such tools need to be built and improved.
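To make the asymmetry concrete, the image case asks the labeler for nothing more than a rectangle. A hypothetical annotation record (field names are illustrative, not from any particular tool) is about this simple:

```python
# Hypothetical, minimal schema for a 2D box label -- the kind of thing a
# crowd worker produces by dragging a rectangle over a camera frame.
label_2d = {
    "image_id": "frame_000123.jpg",
    "category": "car",
    "bbox_xywh": [412, 310, 96, 54],   # pixel x, y, width, height
}
```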
Is it possible to transfer learning from vision to LiDAR? Maybe, if it's possible to map visual images to LiDAR images and vice versa (by running a car with both cameras and LiDAR and learning their associations).
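As a sketch of what that association could look like in practice, assuming a car that logs synchronized camera frames and LiDAR sweeps with known calibration: project the LiDAR points into the image and let an existing human-drawn 2D box select the 3D points it covers. The matrices and the helper name below are placeholders, not from any real stack.

```python
import numpy as np

def transfer_box_label(points_lidar, bbox_xyxy, K, T_cam_from_lidar):
    """Mark LiDAR points that fall inside a human-drawn 2D image box.

    points_lidar:      (N, 3) xyz in the LiDAR frame
    bbox_xyxy:         (x_min, y_min, x_max, y_max) in pixels
    K:                 (3, 3) camera intrinsics
    T_cam_from_lidar:  (4, 4) rigid transform, LiDAR frame -> camera frame
    Returns a boolean mask over the N points.
    """
    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Only points in front of the camera can land in the image.
    in_front = pts_cam[:, 2] > 0

    # Pinhole projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    x_min, y_min, x_max, y_max = bbox_xyxy
    in_box = (
        (uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
        (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max)
    )
    return in_front & in_box
```

Points picked out this way give you weak 3D labels for free from existing 2D annotations, though they're noisy at object boundaries and occlusions, which is part of why this doesn't fully replace labeling the point cloud directly.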
Probably every company does this for itself. Tesla tries to do FSD with vision only, while Waymo, I guess, does what you suggest. However, perceiving with LiDAR is not the same as perceiving with vision alone. Maybe at some point they'll share their code as OSS, but probably not, or only very late.
Isn't the typical training data used in self-driving basically things like object labeling/segmentation and motion prediction? I'm not sure why that would be significantly different for visual vs depth-map data.
This is one argument for vision alone. It’s easier for humans to teach the deep neural network what to do if they both see and label the same thing.
It’s harder to build labeling systems that work on representations that humans don’t understand like point clouds or noisy depth maps.
That isn’t to say that other sensors, including radar, GPS, LiDAR, other parts of the spectrum, etc., don’t help.
But you have to develop more complex labeling methods or move away from supervised learning.
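To see what "more complex labeling methods" means, compare the 2D rectangle earlier in the thread with what a labeler typically has to specify for an object in a point cloud: a cuboid with seven degrees of freedom. The record below is purely illustrative.

```python
# A LiDAR label is typically a 7-DOF cuboid placed in a sparse point cloud:
# center (x, y, z), size (l, w, h), and heading (yaw). Getting these right
# means rotating and zooming a 3D view, which is slower for a human and
# needs a much richer annotation tool than a rectangle on an image.
label_lidar = {
    "category": "car",
    "center_xyz": [14.2, -3.1, 0.8],   # metres in the sensor frame
    "size_lwh":   [4.5, 1.9, 1.6],     # metres
    "yaw":        1.57,                # radians
}
```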