What I really want to know is what kind of robot arm motion is produced when the network is given a cat image to classify. More specifically, what insights has it learned in one control domain that it then applies to another?
I imagine that the simulated 3D environment and the actual control of the robot arm must share some degree of neural interconnection.
You could also train for this kind of interconnectedness by designing tasks that are explicitly multi-modal. For example, you could:
- Stack boxes collaboratively by controlling your own arm and communicating with another agent helping you.
- First produce a plan in text that another agent has to use to predict how you're going to control the arm. You'd be rewarded both for stacking correctly and for being predictable given the stated plan (see the sketch after this list).
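
To make the second idea concrete, here is a minimal sketch of what that composite reward could look like, assuming a separate predictor agent that guesses the arm trajectory from the stated plan. All names here (`predictability_bonus`, `composite_reward`, the trajectory shapes) are hypothetical placeholders for illustration, not any existing API.

```python
import numpy as np

def predictability_bonus(predicted_traj: np.ndarray, actual_traj: np.ndarray) -> float:
    """Bonus that decays as the executed trajectory diverges from the predicted one."""
    # Mean per-timestep distance between predicted and actual joint targets
    # (both assumed to have shape [timesteps, joints]).
    error = np.mean(np.linalg.norm(predicted_traj - actual_traj, axis=-1))
    return float(np.exp(-error))  # 1.0 when perfectly predictable, -> 0 as error grows

def composite_reward(stacked_correctly: bool,
                     predicted_traj: np.ndarray,
                     actual_traj: np.ndarray,
                     task_weight: float = 1.0,
                     plan_weight: float = 0.5) -> float:
    """Combine the stacking reward with the 'be predictable from your stated plan' bonus."""
    task_reward = 1.0 if stacked_correctly else 0.0
    return task_weight * task_reward + plan_weight * predictability_bonus(predicted_traj, actual_traj)

# Example rollout: the predictor was close to the executed trajectory, but not exact.
actual = np.random.randn(50, 7)                      # 50 timesteps of 7-DoF joint targets
predicted = actual + 0.05 * np.random.randn(50, 7)   # predictor's guess from the text plan
print(composite_reward(stacked_correctly=True,
                       predicted_traj=predicted,
                       actual_traj=actual))
```

The point of the second term is that the agent can only earn it by making its text plan genuinely informative about its motor behavior, which is one way to force the language and control modalities to share structure.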