It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that includes operating system accessibility API data and DOM information (along with bitmaps) where available.
It's super cool to see something like this already exists! I wonder if one day something adjacent will become a standard part of major desktop OSs, like a dedicated "AI API" to allow models to connect to the OS, browse the windows and available actions, issue commands, etc. and remove the bitmap parsing altogether as this appears to do.
It's really more of a reflection of where we are in the timeline of computing, with humans having been the main users of apps and web sites up until now. Obviously we've had screen scraping and terminal-emulation access to legacy apps for a while, and this is a continuation of that.
There have been, and continue to be, computer-centric ways to communicate with applications, though: Windows COM/OLE, WinRT, Linux D-Bus, and so on. Still, emulating human interaction does provide a fairly universal capability.
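For example, a few lines of Python (a minimal sketch, assuming the dbus-python bindings and the standard org.freedesktop.Notifications service most Linux desktops expose) can drive another application over D-Bus with typed, structured arguments instead of pixels:

```python
import dbus

# Connect to the per-user session bus and look up the desktop
# notification service by its well-known bus name and object path.
bus = dbus.SessionBus()
proxy = bus.get_object("org.freedesktop.Notifications",
                       "/org/freedesktop/Notifications")
notifications = dbus.Interface(proxy, "org.freedesktop.Notifications")

# Call a method with typed arguments -- no screenshots, no OCR,
# no simulated clicks.
notifications.Notify(
    "demo-app",                                   # application name
    0,                                            # replaces_id (0 = new)
    "",                                           # icon
    "Hello from D-Bus",                           # summary
    "Structured IPC instead of bitmap parsing.",  # body
    [],                                           # actions
    {},                                           # hints
    5000,                                         # timeout in ms
)
```

The same idea applies on Windows via COM/OLE automation, e.g. scripting Office applications through their object models.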
If the goal is to emulate human behavior, I'd say there's a case for building against the same interface the human uses, rather than relying on separate APIs that may or may not reflect the same information the user sees.
You can blame normies for this. They love their ridiculous point and click (and tap) interfaces.
Fortunately, with function calling (and, more recently, guaranteed structured outputs), we've had programmatic application interoperability with LLMs for a while now.
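For anyone who hasn't used it, here's a minimal sketch assuming an OpenAI-style chat completions API (the create_calendar_event tool and its schema are made up for illustration):

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe an application action as a typed tool; the model returns
# structured JSON arguments matching the schema, instead of pixel
# coordinates to click on.
tools = [{
    "type": "function",
    "function": {
        "name": "create_calendar_event",   # hypothetical tool
        "description": "Add an event to the user's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Book a 30 minute sync tomorrow at 10am"}],
    tools=tools,
)

# Assumes the model chose to call the tool; the arguments arrive as a
# JSON string conforming to the schema above.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```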
Don't get mad at a company for developing for the masses - that's what they are expected to do.
Your comment takes about 630 bits; the screenshot of your comment on my computer takes 2.1 MB, roughly 27,000 times the size. Either that's compute overhead the LLM has to pay before it can get at the meaning of the text, or, if it's an end-to-end feedforward architecture, capacity that can't be spent on thinking about it.
This is simple for us because neurons in the retina pre-process the visual stream so that less than 0.8% of it is sent to the visual cortex, and because we have evolved to extract meaning from what we see very quickly and efficiently. It's a prime example of Moravec's paradox.
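A quick back-of-envelope version of the size comparison above (the 2.1 MB figure is the one reported in the parent comment; the sample text is just a stand-in):

```python
# Compare the information the model actually needs (the UTF-8 text)
# with the bitmap it gets handed instead.
comment = ("It's quite sad that application interoperability requires "
           "parsing bitmaps instead of exchanging structured information.")
text_bits = len(comment.encode("utf-8")) * 8

screenshot_bits = 2.1e6 * 8        # ~2.1 MB screenshot, as reported above

print(f"text:       {text_bits:>12,} bits")
print(f"screenshot: {int(screenshot_bits):>12,} bits")
print(f"ratio:      {screenshot_bits / text_bits:,.0f}x")
```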