What do you think would be a good benchmark for such hybrid models? Here they created a toy dataset of simple geometric shapes with simple relationships. This is fine to start with, but we need to come up with a more realistic and useful scenario. Even MNIST, for image classification, is both realistic and useful. I wonder what the equivalent of MNIST or ImageNet would be for models that implement reasoning and common sense.