I'm no AI expert, but I'm not uninformed. Can you explain what "informed" means in this context? I'm aware of the use of synthetic data for training as part of a curated training effort, with human-in-the-loop (HITL) checking and other controls.
What we're talking about here is a world where 1) the corpus is polluted with an unknown quantity of unlabeled AI-generated content, and 2) reputational signals (link counts, social media shares/likes) may amplify AI-generated content and encourage the intentional generation of more content like it.
At that point, can the incorrect info in the training set really be controlled for using noise reduction or other controls?