You could also train for this kind of interconnectedness by designing tasks that...

You could also train for this kind of interconnectedness by designing tasks that are explicitly multi-modal. For example, you could:

- Stack boxes collaboratively by controlling your own arm and communicating with another agent helping you.

- First produce a plan in text that another agent has to use to predict how you're going to control the arm. You'd get rewarded for both stacking correctly and being predictable based on the stated plan.