It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that includes operating system accessibility API data and DOM information (along with bitmaps) where available.
It's super cool to see something like this already exists! I wonder if one day something adjacent will become a standard part of major desktop OSs, like a dedicated "AI API" to allow models to connect to the OS, browse the windows and available actions, issue commands, etc. and remove the bitmap parsing altogether as this appears to do.
It's really more of a reflection of where we are in the timeline of computing, with humans having been the main users of apps and web sites up until now. Obviously we've had screen scraping and terminal-emulation access to legacy apps for a while, and this is a continuation of that.
There have been, and continue to be, computer-centric ways to communicate with applications, though: Windows COM/OLE, WinRT, Linux D-Bus, and so on. Still, emulating human interaction does provide a fairly universal capability.
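For example, a few lines of Python (a minimal sketch, assuming the dbus-python bindings and the standard org.freedesktop.Notifications service most Linux desktops expose) can drive another application over D-Bus with typed, structured arguments instead of pixels:

```python
import dbus

# Connect to the per-user session bus and look up the desktop
# notification service by its well-known bus name and object path.
bus = dbus.SessionBus()
proxy = bus.get_object("org.freedesktop.Notifications",
                       "/org/freedesktop/Notifications")
notifications = dbus.Interface(proxy, "org.freedesktop.Notifications")

# Call a method with typed arguments -- no screenshots, no OCR,
# no simulated clicks.
notifications.Notify(
    "demo-app",                                   # application name
    0,                                            # replaces_id (0 = new)
    "",                                           # icon
    "Hello from D-Bus",                           # summary
    "Structured IPC instead of bitmap parsing.",  # body
    [],                                           # actions
    {},                                           # hints
    5000,                                         # timeout in ms
)
```

The same idea applies on Windows via COM/OLE automation, e.g. scripting Office applications through their object models.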
If the goal is to emulate human behavior, I'd say there's a case for building against the same interface the human uses, rather than relying on separate APIs that may or may not reflect the same information the user sees.
You can blame normies for this. They love their ridiculous point and click (and tap) interfaces.
Fortunately, with function calling (and, more recently, guaranteed structured outputs), we've had programmatic application interoperability with LLMs for a while now.
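For anyone who hasn't used it, here's a minimal sketch assuming an OpenAI-style chat completions API (the create_calendar_event tool and its schema are made up for illustration):

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe an application action as a typed tool; the model returns
# structured JSON arguments matching the schema, instead of pixel
# coordinates to click on.
tools = [{
    "type": "function",
    "function": {
        "name": "create_calendar_event",   # hypothetical tool
        "description": "Add an event to the user's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Book a 30 minute sync tomorrow at 10am"}],
    tools=tools,
)

# Assumes the model chose to call the tool; the arguments arrive as a
# JSON string conforming to the schema above.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```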
Don't get mad at a company for developing for the masses - that's what they are expected to do.
Your comment takes about 630 bits; the screenshot of your comment on my computer takes 2.1 MB, roughly 27,000 times the size. Either that's compute overhead the LLM has to pay before it can get at the meaning of the text, or, if it's an end-to-end feedforward architecture, capacity that can't be spent on thinking about it.
This is simple for us because neurons in the retina pre-process the visual stream so that less than 0.8% of it is sent to the visual cortex, and because we have evolved to extract meaning from what we see very quickly and efficiently. It's a prime example of Moravec's paradox.
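A quick back-of-envelope version of the size comparison above (the 2.1 MB figure is the one reported in the parent comment; the sample text is just a stand-in):

```python
# Compare the information the model actually needs (the UTF-8 text)
# with the bitmap it gets handed instead.
comment = ("It's quite sad that application interoperability requires "
           "parsing bitmaps instead of exchanging structured information.")
text_bits = len(comment.encode("utf-8")) * 8

screenshot_bits = 2.1e6 * 8        # ~2.1 MB screenshot, as reported above

print(f"text:       {text_bits:>12,} bits")
print(f"screenshot: {int(screenshot_bits):>12,} bits")
print(f"ratio:      {screenshot_bits / text_bits:,.0f}x")
```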