etc. etc., then simulating a two-edge protocol is still pretty easy, though more time-consuming than simulating a one-edge protocol. If the record looks like
Trial 1: PASS (v1 red -- v2 green, v3 green -- v4 blue)
Trial 2: PASS (v1 blue -- v2 red, v7 blue -- v9 red)
Trial 3: PASS (v1 green -- v2 red, v7 green -- v4 red)
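then after Trial 3 you can tell that the prover is cheating: v2 and v4 get different colors in Trial 1 but the same color in Trial 3, which can't happen if each trial just permutes the colors of one fixed coloring. The fake trials seem to achieve your properties (1) and (3). I guess they violate (2)? What is the "distribution" of a simulated transcript?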
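
To make the three-trial record above concrete, here is a rough Python sketch of the check a verifier could run on it. It assumes that every trial reveals a color-permuted view of one fixed coloring, so whether two vertices share a color should never change between trials. The function name and the dict layout are made up for illustration, and this is only a pairwise check, not a full consistency test.

```python
from itertools import combinations

def first_inconsistent_trial(trials):
    """Return the 1-based index of the first trial that contradicts an
    earlier one on whether two vertices share a color, or None."""
    same = {}  # (u, v) with u < v -> True if they were seen with equal colors
    for i, reveal in enumerate(trials, start=1):
        for u, v in combinations(sorted(reveal), 2):
            rel = reveal[u] == reveal[v]
            # setdefault stores rel on first sight, else returns the old value
            if same.setdefault((u, v), rel) != rel:
                return i
    return None

# The record from the comment above, one dict per trial.
trials = [
    {"v1": "red", "v2": "green", "v3": "green", "v4": "blue"},
    {"v1": "blue", "v2": "red", "v7": "blue", "v9": "red"},
    {"v1": "green", "v2": "red", "v7": "green", "v4": "red"},
]

print(first_inconsistent_trial(trials))  # prints 3 (v2 vs v4)
```

With a one-edge protocol each trial only ever shows two vertices with different colors, so this kind of cross-trial comparison has much less to work with.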