Compare parts of the screenshot and see if they changed. I didn't want to use the DOM at all. My hypothesis was that multimodal AI agents will get cheaper over time (Gemini Flash is crazy cheap) and that people would start putting attacks in the DOM to confuse AI.
Additionally, the existing tools I used struggled to interact with sites like Reddit. So I set out to skip the DOM and focus on a generalized approach.
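To make the "compare parts of the screenshot" idea concrete, here is a rough sketch of one way it could work. It is not the actual implementation: the function names, the tile size, and the representation of a screenshot as rows of RGB tuples are all assumptions for illustration. It splits two same-sized screenshots into tiles and reports which tiles differ.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), assumed representation
Image = List[List[Pixel]]      # rows of pixels, assumed representation


def _tile_differs(before: Image, after: Image, top: int, left: int,
                  tile: int, tolerance: int) -> bool:
    """Return True if any pixel in the tile differs by more than tolerance."""
    for y in range(top, min(top + tile, len(before))):
        for x in range(left, min(left + tile, len(before[0]))):
            if any(abs(a - b) > tolerance
                   for a, b in zip(before[y][x], after[y][x])):
                return True
    return False


def changed_tiles(before: Image, after: Image, tile: int = 2,
                  tolerance: int = 0) -> List[Tuple[int, int]]:
    """Hypothetical helper: list (tile_row, tile_col) of tiles that changed."""
    if len(before) != len(after) or any(
            len(a) != len(b) for a, b in zip(before, after)):
        raise ValueError("screenshots must have the same dimensions")
    height = len(before)
    width = len(before[0]) if height else 0
    changed = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            if _tile_differs(before, after, top, left, tile, tolerance):
                changed.append((top // tile, left // tile))
    return changed


# Tiny 4x4 example: only the bottom-right pixel changes.
white, red = (255, 255, 255), (255, 0, 0)
before = [[white] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[3][3] = red
print(changed_tiles(before, after, tile=2))  # [(1, 1)]
```

In practice you would load real screenshots with an image library and pick a larger tile size; the `tolerance` parameter is there because rendering noise such as anti-aliasing can make an exact pixel match too strict.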
I tried to go cheaper by using UI-TARS, an open-source model by ByteDance, to run tests locally without needing Anthropic, but it wasn't reliable enough.
That short test link is interesting. I didn't know they existed. Wow, the field is moving fast.