I think a machine learning algorithm wouldn't care about that: with a large enough training dataset, it would learn to account for it and could accurately predict energy output from the image alone.
Regardless of how big the dataset is, the image recognition algorithm is bound to be confused by the large differences in colour that exposure and sensitivity settings produce. It will likely key on the overall gray-to-blue gradient and estimate output from that, and at the gray end alone, the camera settings make a very, very big difference. You can't really tell the algorithm to ignore these settings and judge only the level of 'cloudiness.'
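To make the exposure point concrete, here's a toy sketch. It assumes exposure acts as a simple linear gain with clipping at 255 (real cameras apply nonlinear tone curves, so this is an oversimplification), and it uses channel spread (max minus min) as a crude, hypothetical stand-in for "how blue vs. gray" a pixel looks. The sky pixel value is made up for illustration.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]


def apply_exposure(pixel: Pixel, gain: float) -> Pixel:
    """Toy exposure model (an assumption): scale each 8-bit channel by a gain, clip at 255."""
    return tuple(min(255, int(c * gain)) for c in pixel)


def colour_spread(pixel: Pixel) -> int:
    """Crude 'blue vs. gray' measure: max channel minus min channel (0 means pure gray)."""
    return max(pixel) - min(pixel)


clear_sky = (90, 140, 220)  # hypothetical clear-sky pixel at "normal" exposure

for gain in (0.5, 1.0, 2.0):
    px = apply_exposure(clear_sky, gain)
    print(f"gain={gain}: pixel={px}, spread={colour_spread(px)}")
# gain=0.5: pixel=(45, 70, 110), spread=65
# gain=1.0: pixel=(90, 140, 220), spread=130
# gain=2.0: pixel=(180, 255, 255), spread=75
```

The same sky gives a spread of 65, 130 or 75 depending only on the gain, so a model keyed on this kind of colour feature effectively sees three different skies.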
Another issue with this dataset is that the overlay's text content, font, and colour change over time. The algorithm might overfit and learn, e.g., that a yellow font means higher output, simply because output happened to be higher during the period when the yellow font was used. You could strip away the text, but then you're introducing potential errors into the dataset yourself.
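One way to strip the overlay without painting in invented pixels is to exclude a fixed rectangle from any colour features. This is only a sketch: it assumes the overlay always sits in the same box (the coordinates here are hypothetical), which the changing text and font may not guarantee, and the helper names are made up for illustration.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Optional[Pixel]]]


def mask_overlay(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Copy the image, setting pixels inside box = (x0, y0, x1, y1) to None.

    x1 and y1 are exclusive. The fixed box is an assumption: it only works
    if the overlay never moves or grows.
    """
    x0, y0, x1, y1 = box
    return [
        [None if (x0 <= x < x1 and y0 <= y < y1) else px
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def mean_spread(image: Image) -> float:
    """Average channel spread (max - min) over the pixels that were not masked."""
    spreads = [max(px) - min(px) for row in image for px in row if px is not None]
    return sum(spreads) / len(spreads) if spreads else float("nan")


# Toy 2x3 frame: one yellow overlay pixel in the top-left corner, the rest clear sky.
SKY = (90, 140, 220)
TEXT = (255, 220, 0)
frame: Image = [
    [TEXT, SKY, SKY],
    [SKY, SKY, SKY],
]

print(mean_spread(frame))                              # pulled up by the yellow text pixel
print(mean_spread(mask_overlay(frame, (0, 0, 1, 1))))  # 130.0
```

Excluding pixels rather than filling them in avoids inventing sky data, but whatever sky sat behind the box is simply lost, and a box that doesn't match the current overlay leaves stray glyphs behind or hides extra sky.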