
Sounds interesting, but it also sounds like something that could very well be circumvented with a technique similar to speculative decoding: you use the watermarked model the way you'd use the fast LLM in speculative decoding, and you check whether the other model agrees with it or not. But instead of correcting the token every time the two models disagree, as you would in speculative decoding, you only need to change it often enough to throw off the watermark detection function (maybe changing every other mismatched token, or maybe one in five, would be enough to push the signal-to-noise ratio below the detection threshold).

You wouldn't even need access to an unwatermarked model; the "correcting model" could itself be watermarked, as long as the two don't share the same watermarking function.
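Roughly what I have in mind, as a back-of-the-envelope sketch (greedy decoding, Hugging Face-style APIs, and the model names are obviously placeholders, not anything real):

    # Hypothetical sketch of the "scrubbing" idea: generate with the watermarked
    # model, but when a second "correcting" model disagrees, occasionally take its
    # token instead, diluting the watermark's token-level statistics.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: both models share a tokenizer; model names are placeholders.
    tokenizer   = AutoTokenizer.from_pretrained("some/watermarked-model")
    watermarked = AutoModelForCausalLM.from_pretrained("some/watermarked-model")
    corrector   = AutoModelForCausalLM.from_pretrained("some/other-model")

    def scrub(prompt, max_new_tokens=200, swap_every=2):
        """Greedy-decode with the watermarked model, substituting the corrector's
        pick on a fraction of mismatched tokens (every other one by default)."""
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        mismatches = 0
        for _ in range(max_new_tokens):
            with torch.no_grad():
                wm_logits = watermarked(ids).logits[0, -1]
                co_logits = corrector(ids).logits[0, -1]
            wm_tok = int(wm_logits.argmax())
            co_tok = int(co_logits.argmax())
            tok = wm_tok
            if wm_tok != co_tok:
                mismatches += 1
                # Swap only every other mismatch (or 1-in-5, etc.) -- just enough
                # to drag the detector's score below its threshold.
                if mismatches % swap_every == 0:
                    tok = co_tok
            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
            if tok == tokenizer.eos_token_id:
                break
        return tokenizer.decode(ids[0], skip_special_tokens=True)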

Or am I misunderstanding something?



No, you've got it right. Watermarks like this are trivial to defeat, so they're really only effective against lazy users such as cheating college students and job applicants.



