I think a lot of it has to do with the fact that a 2D map is going to have roads and valleys, plus big green spaces on the mountainsides and peaks. The neural net identifies that villages and curvy roads represent valley floors and interpolates where the mountain slopes are.
I think it's using the contour lines on the map, not the villages and roads. The paper mentions the training data is contour maps and relief maps of the same area.