I feel like YouTube is going to be a major source of language data in the future. The last statistic I saw was 500 hours of video uploaded every minute. If only 10% of those videos have original speech in them and those average 40 words a minute, that’s about 63 billion words, or roughly 300 GB of transcribed speech per year at around 5 bytes per word.
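For anyone who wants to check the back-of-envelope numbers, here is a rough sketch of the arithmetic. The upload rate, speech share, and words per minute are the figures from the estimate above; the 5 bytes per word is an assumption added here to turn a word count into a size, and the function name is made up for illustration.

```python
# Back-of-envelope estimate of how much transcribed speech is uploaded per year.
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600, ignoring leap years


def yearly_speech_estimate(
    hours_uploaded_per_minute=500,  # upload rate quoted in the post
    speech_fraction=0.10,           # share of videos with original speech
    words_per_video_minute=40,      # average speaking rate used in the post
    bytes_per_word=5,               # ASSUMPTION: rough plain-text size of one word
):
    """Return (words_per_year, gigabytes_per_year) for the given inputs."""
    # Minutes of video uploaded over a whole year.
    video_minutes = hours_uploaded_per_minute * 60 * MINUTES_PER_YEAR
    # Only a fraction of that footage contains original speech.
    words = video_minutes * speech_fraction * words_per_video_minute
    gigabytes = words * bytes_per_word / 1e9
    return words, gigabytes


if __name__ == "__main__":
    words, gigabytes = yearly_speech_estimate()
    print(f"{words / 1e9:.1f} billion words, about {gigabytes:.0f} GB per year")
```

With these inputs it comes out at around 63 billion words and a little over 300 GB per year. The size scales linearly with every parameter, so a different guess for bytes per word or speech share shifts the result proportionally.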
YouTube would be a great source for spoken language. But only a tiny portion of YouTube has subtitles, and automatic transcription doesn't yet feel good enough that you would want to train something else on its output. That day will surely come, though.