I'm no AI expert, but I'm not uninformed. Can you explain what "informed" means in this context? I'm aware of the use of synthetic data for training as part of a curated training effort, with human-in-the-loop (HITL) checking and other controls.
What we're talking about here is a world where 1) the corpus is polluted with an unknown quantity of unlabeled AI-generated content, and 2) reputational signals (link counts, social media shares/likes) may amplify AI-generated content and encourage the intentional generation of more content like it.
At that point, can the incorrect info in the training set really be controlled for using noise reduction or other controls?