From the article, it looks like what they observed is chat data being used to cross-reference image hashes against text. Extending that to leverage representations learned by ML models (supervised or self-supervised) should be quite straightforward, though.
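To make the difference concrete, here is a rough sketch of the two approaches side by side. None of this comes from the article: `HashIndex`, `EmbeddingIndex`, and `toy_embed` are hypothetical names, and `toy_embed` is only a crude stand-in for a real learned encoder (it just counts bytes into buckets). The point is only that swapping an exact-hash lookup for a nearest-neighbour lookup over vectors is a small change.

```python
import hashlib
import math


def exact_hash(image_bytes):
    """Exact-match key: changes completely if a single byte changes."""
    return hashlib.sha256(image_bytes).hexdigest()


def toy_embed(image_bytes, dims=8):
    """Placeholder for a learned encoder (assumption, not a real model).

    Folds byte values into `dims` buckets so that near-identical inputs
    get near-identical vectors.
    """
    vec = [0.0] * dims
    for b in image_bytes:
        vec[b % dims] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class HashIndex:
    """Maps an exact image hash to the text seen alongside it."""

    def __init__(self):
        self._texts = {}

    def add(self, image_bytes, text):
        self._texts.setdefault(exact_hash(image_bytes), []).append(text)

    def lookup(self, image_bytes):
        return self._texts.get(exact_hash(image_bytes), [])


class EmbeddingIndex:
    """Same idea, but matches on vector similarity instead of equality."""

    def __init__(self, embed=toy_embed):
        self._embed = embed
        self._items = []  # list of (vector, text) pairs

    def add(self, image_bytes, text):
        self._items.append((self._embed(image_bytes), text))

    def lookup(self, image_bytes, threshold=0.95):
        query = self._embed(image_bytes)
        # Brute-force scan; a real system would use an ANN index instead.
        return [t for vec, t in self._items if cosine(vec, query) >= threshold]
```

With this sketch, an image that differs from an indexed one by a single appended byte misses in `HashIndex` but still matches in `EmbeddingIndex`, which is the kind of robustness a learned representation would be expected to add. The `threshold` value is arbitrary here and would need tuning against a real encoder.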