That time is just for the very first prompt; it's basically the startup time for the model. Once it's loaded, it's much faster at responding to your queries, depending on your hardware of course.
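For what it's worth, you can control how long Ollama keeps the model resident so that only the first prompt pays the load cost. A minimal sketch against the local REST API, assuming a default Ollama install and a pulled model tagged `llama3` (both are placeholders for whatever you run):

```python
# Keep an Ollama model resident so only the first prompt pays the load cost.
# Assumes the default server at localhost:11434 and a pulled "llama3" model.
import requests

def generate(prompt: str) -> str:
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3",
        "prompt": prompt,
        "stream": False,
        "keep_alive": "30m",  # keep the weights in memory for 30 minutes after each call
    })
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("warm-up"))        # slow: the model loads here
print(generate("second prompt"))  # fast: the model is already in memory
```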
The models feel pretty snappy when interacting with them directly via Ollama; I'm not sure about the TPS.
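If you want an actual number rather than a feel, Ollama returns token counts and timings (in nanoseconds) with each response, so TPS falls out directly. A rough sketch, again assuming a local server and a placeholder model name:

```python
# Rough tokens-per-second measurement from the timing fields Ollama returns.
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3",
    "prompt": "Explain mutexes in one paragraph.",
    "stream": False,
}).json()

# Durations are reported in nanoseconds.
gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"generation: {gen_tps:.1f} tok/s")
if resp.get("prompt_eval_duration"):
    prompt_tps = resp.get("prompt_eval_count", 0) / (resp["prompt_eval_duration"] / 1e9)
    print(f"prompt eval: {prompt_tps:.1f} tok/s")
```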
However, I've also run into two things: 1) most models don't support tools, and it's sometimes hard to find a version of a model that uses tools correctly; 2) even with good TPS, the experience feels slow because agents usually do chain-of-thought and run multiple chained prompts. This is true even with Cursor using their own models/APIs.
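On point 1, one way to check a model variant before wiring it into an agent is to send it a tool schema and see whether it emits a structured call at all. A sketch using Ollama's chat endpoint with an OpenAI-style function definition; the model tag and the `get_weather` tool are just placeholders:

```python
# Probe whether a model variant emits structured tool calls via Ollama's /api/chat.
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, only here to see if the model calls it
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1",  # assumption: a tag that advertises tool support
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    "stream": False,
}).json()

calls = resp.get("message", {}).get("tool_calls")
if calls:
    print("structured tool call:", calls[0]["function"])
else:
    print("no tool call, plain text instead:", resp["message"]["content"][:80])
```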
People have all sorts of hardware, so TPS is meaningless without the full spec, and the GPU is not the only thing: CPU, RAM speed, memory channels, PCIe speed, inference software, partial CPU offload, RPC, even the OS, all of these add up. If someone tells you TPS for a given model, it means nothing unless you understand their entire setup.
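In that spirit, if you do share a number, attach the setup to it. A throwaway sketch of the kind of record that makes a TPS figure comparable; the fields and values are illustrative, and the parts Python can't detect are left to fill in by hand:

```python
# Bundle a measured TPS figure with the setup it came from; a bare number means little.
import json, os, platform

report = {
    "model": "llama3:8b-instruct-q4_K_M",  # placeholder: whatever tag you benchmarked
    "gen_tps": 42.3,                       # hypothetical measured value
    "inference_software": "ollama 0.x",    # placeholder name and version
    "os": platform.platform(),
    "cpu": platform.processor() or platform.machine(),
    "logical_cores": os.cpu_count(),
    # Factors Python can't see but that still matter, per the comment above:
    "gpu": "?", "ram_speed": "?", "memory_channels": "?",
    "pcie": "?", "cpu_offload": "?", "rpc": "?",
}
print(json.dumps(report, indent=2))
```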
I’ve been playing around with Zed; it supports local and cloud models, it’s really fast, and the UX is nice. It lacks some of the deeper features of VS Code/Cursor, but it’s very capable.
I’ve been using Cursor and I’m kind of disappointed; I get better results just going back and forth between the editor and ChatGPT.
I tried localforge and aider, but they are kinda slow with local models.