
I think the paper itself addresses that:

Page 9: There is no deterministic path from model memorization to outputs of infringing works. While we’ve used probabilistic extraction as proof of memorization, to actually extract a given piece of 50 tokens of copied text often takes hundreds or thousands of prompts. Using the adversarial extraction method of Hayes et al. [54], we’ve proven that it can be done, and therefore that there is memorization in the model [16, 27]. But this is where, even though extraction is evidence of memorization, it may become important that they are not identical processes (Section 2). Memorization is a property of the model itself; extraction comes into play when someone uses the model [27]. This paper makes claims about the former, not the latter. Nevertheless, it’s worth mentioning that it’s unlikely anyone in the real world would actually use the model in practice with this extraction method to deliberately produce infringing outputs, because doing so would require huge numbers of generations to get non-trivial amounts of text in practice.
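
To make the "hundreds or thousands of prompts" point concrete, here's roughly what a repeated-sampling estimate of the per-prompt extraction probability looks like. The model name, prompt, and target string are placeholders, not the paper's actual setup, and the real Hayes et al. method is more involved than plain resampling:

    # Rough sketch: estimate how often a specific memorized span comes back
    # when sampling repeatedly from the same prompt. Model, prompt, and
    # target are placeholders, not the paper's setup.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-3.1-8B"           # hypothetical model choice
    PROMPT = "Call me Ishmael. Some years ago"  # prefix preceding the span
    TARGET = "never mind how long precisely"    # stands in for a ~50-token copied span

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

    def extraction_rate(n_samples=1000):
        """Fraction of sampled continuations that contain the target span."""
        inputs = tok(PROMPT, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        hits = 0
        for _ in range(n_samples):
            out = model.generate(**inputs, do_sample=True, temperature=1.0,
                                 max_new_tokens=60)
            text = tok.decode(out[0][prompt_len:], skip_special_tokens=True)
            hits += TARGET in text
        return hits / n_samples

    print(extraction_rate())  # e.g. a rate of 0.003 would mean hundreds of prompts per hit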



Yes, perhaps deliberate extraction is impractical, but I wonder about accidental cases. One group of researchers is a drop in the bucket compared to the total number of prompts happening every day. I would like to see a broad statistical sampling of responses matched against training data to demonstrate the true rate of occurrence. Which raises the question: what is the acceptable rate?
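
Something like this is what I mean by a broad sampling study; the generate/tokenize callables and the corpus token streams are placeholders for whatever infrastructure a model provider actually has:

    # Sketch: take a large batch of real prompts, generate responses, and
    # check each response for long verbatim overlaps with the training corpus.
    K = 50  # minimum run of overlapping tokens to count as a reproduction

    def build_ngram_index(corpus_token_streams, k=K):
        """Hash every k-token window of the training corpus into a set
        (assumes the index fits in memory; a real system would shard it)."""
        index = set()
        for tokens in corpus_token_streams:
            for i in range(len(tokens) - k + 1):
                index.add(hash(tuple(tokens[i:i + k])))
        return index

    def reproduction_rate(prompts, generate, tokenize, index, k=K):
        """Fraction of responses containing any k-token window seen in training."""
        flagged = 0
        for prompt in prompts:
            tokens = tokenize(generate(prompt))
            windows = (hash(tuple(tokens[i:i + k]))
                       for i in range(len(tokens) - k + 1))
            if any(w in index for w in windows):
                flagged += 1
        return flagged / len(prompts)

Exact 50-token matching is conservative, too: it misses near-verbatim or lightly paraphrased reproductions, so whatever rate this reports would only be a lower bound.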



