So I assume the video is the ground truth, then the AI has access to the DOM and the video, and generates a selector based on the video during the test run (each time) in order to avoid flakiness due to DOM/class/attribute changes?
Right now the generated script is the ground truth, but we’ve been working on augmenting it with images and videos to fall back on. We think defaulting to code is good because it is faster, cheaper, and easier to reason about in the 95%+ of runs where it works. Plain old Selenium will get you pretty far, especially if creating scripts is much easier.
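The "code first, vision fallback" approach described above could be sketched roughly like this. This is only an illustration, not their actual implementation; `find_with_fallback` and both locator callables are hypothetical stand-ins:

```python
# Hedged sketch: run the fast scripted locator first, and only invoke the
# slower (image/video-based) locator when the script fails. All names here
# are hypothetical, invented for illustration.

def find_with_fallback(find_scripted, find_visual):
    """Return (element, source): 'script' when the generated selector
    still works, 'vision' when we had to fall back."""
    try:
        return find_scripted(), "script"
    except Exception:
        # Selector broke, e.g. due to a DOM/class/attribute change.
        return find_visual(), "vision"

# Stubbed example: the scripted selector is stale after a DOM change,
# so the vision-based lookup resolves the element instead.
def broken_selector():
    raise LookupError("stale selector: .btn-primary no longer exists")

def vision_lookup():
    return {"tag": "button", "text": "Submit"}

element, source = find_with_fallback(broken_selector, vision_lookup)
```

In the common case the scripted path returns immediately, which is what keeps the default fast and cheap; the fallback only pays the vision cost on the rare failing run.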